In the first post in this series, we introduced the idea that multi-tenant learning to rank requires a recipe: a shared pipeline that takes each customer’s data and produces a model tuned to their search behaviour. That post was about the philosophy. This one is about the engineering: the design choices we made, the offline experiments that tested them, and the discoveries that gave us the confidence to move toward production. The challenge is that a production search stack with dozens of ranking signals, customer-configured rules, and a broad base of enterprise customers imposes constraints that most introductions to LTR never mention.

Use What the System Already Computes

When we sat down to design LTR at Coveo, the natural starting point was features. What signals should the model learn from? In the LTR literature, this is usually where the feature engineering begins: you compute query-document similarity scores, extract content features, build user profiles, and so on.

We took a different approach. Coveo’s search stack already computes a rich set of ranking signals for every query. Our Index produces text relevance scores — TF-IDF, phrase matching, title similarity. Other ML models contribute their own signals: collaborative filtering for popularity, session-based personalisation, and semantic similarity from vector search. On top of these, customers configure their own ranking rules: boost expressions, featured results, and custom ranking functions. All of these signals are combined via a hand-tuned linear weighted sum to produce a final ranking score.

Rather than building a parallel feature computation system for LTR, we realised we could use the signals the search stack already computes. Every one of these ranking scores is logged to our data warehouse as a “ranking log” — a record of exactly which signals contributed to each document’s position for each query. The same signals are available at serving time, because they’re computed by the Index for every request regardless of whether LTR is present.
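To make this concrete, here is a sketch of what one ranking-log record might contain. The field names and values are illustrative, not Coveo's actual schema. Because the same record shape is produced at training and serving time, a single function can build feature vectors for both paths.

```python
# Illustrative ranking-log record: which signals contributed to one
# document's score for one query. Field names are hypothetical.
ranking_log_entry = {
    "query_id": "q-20240501-0042",
    "document_id": "sku-88731",
    "position": 3,
    "signals": {
        "tfidf": 12.4,                # text relevance from the Index
        "title_similarity": 0.81,
        "phrase_match": 1.0,
        "popularity": 0.37,           # collaborative filtering
        "semantic_similarity": 0.64,  # vector search
    },
    "final_score": 1542.0,            # the hand-tuned weighted sum
}

def to_feature_vector(entry, feature_order):
    """Turn a logged record into a model-ready feature vector.

    The same signals are logged at training time and computed at
    serving time, so one function serves both paths.
    """
    return [entry["signals"].get(name, 0.0) for name in feature_order]

features = to_feature_vector(ranking_log_entry,
                             ["tfidf", "title_similarity", "popularity"])
```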

This led us to three design principles that guided the initial architecture:

Simplicity. Ranking logs are generated by the Index and logged into our data warehouse. Using them and only them makes acquiring the same features at training and serving time straightforward. No separate feature pipeline to build. No additional data sources to maintain. No risk of feature computation diverging between training and production.

Decoupling. By relying on other systems to generate features, those systems can be improved in isolation with positive impacts to LTR for free. When the team improving semantic search ships a better model, LTR benefits without any changes to the ranking pipeline. When collaborative filtering gets more accurate, the signal that flows into LTR improves automatically. Each system focuses on what it does well.

Generalisation. By not embedding problem-specific modelling into LTR, we enable it to function across different verticals where the important features may vary significantly. An industrial parts distributor and a fashion retailer may rely on very different signals — text matching might dominate for the former, while real-time personalisation matters more for the latter. A model that consumes whatever signals are configured, rather than hardcoding its own feature set, is a model that can generalise across the customer base.

Coveo ranking signals

This approach also has a notable implication: LTR replaces a hand-tuned linear weighted sum with a learned non-linear model — specifically, LambdaMART trained with a LambdaRank NDCG objective — that consumes the same signals. The model learns which signals matter and how they interact, rather than relying on weights set by hand and rarely revisited. In a system with dozens of ranking signals, the combinatorial space is vast. A learned non-linear combination can capture interactions that a linear sum cannot, and in practice it tends to outperform hand-tuned weights as the number of signals grows.
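To illustrate the LambdaRank objective behind that choice, the toy pure-Python sketch below computes per-document gradients for a single query, weighting each misordered pair by the NDCG change a swap would cause. Production training uses gradient-boosted trees (LambdaMART, e.g. via a library such as LightGBM); everything here is simplified for exposition.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum((2**rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def lambda_gradients(scores, labels, sigma=1.0):
    """Toy LambdaRank gradients for a single query.

    For each pair where doc i is more relevant than doc j, the
    gradient is scaled by |delta NDCG| -- the NDCG change if the
    pair were swapped -- so mistakes near the top matter most.
    """
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranked_labels = [labels[k] for k in order]
    ideal = dcg(sorted(labels, reverse=True)) or 1.0
    lambdas = [0.0] * len(scores)
    for a in range(len(order)):
        for b in range(len(order)):
            i, j = order[a], order[b]
            if labels[i] <= labels[j]:
                continue  # only pairs where i should outrank j
            swapped = ranked_labels[:]
            swapped[a], swapped[b] = swapped[b], swapped[a]
            delta_ndcg = abs(dcg(ranked_labels) - dcg(swapped)) / ideal
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            lambdas[i] += sigma * rho * delta_ndcg  # push i up
            lambdas[j] -= sigma * rho * delta_ndcg  # push j down
    return lambdas

# A relevant doc currently scored below an irrelevant one gets pushed up.
grads = lambda_gradients(scores=[0.2, 0.9], labels=[1, 0])
```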

Where LTR Fits in the Search Stack

So: the features are settled. Where does the model actually go? In a production search system, placement matters. It affects latency, what information the model has access to, how failures are handled, and which team owns which parts of the system.

At Coveo, the majority of our Machine Learning systems sit upstream of the Index, providing their predictions either as instructions for the Index to execute or as specific products to retrieve and rank. However, this pattern didn’t fit LTR, which requires the Index to have already retrieved a set of products for the model to reorder.

We considered placing LTR after the Index as a post-processing step, or using our Search API as an orchestrator with multiple round-trips. Both added latency and forced the model to manage UX complexity — pagination, folding, enrichment — that belongs in the other layers. Co-locating the model inside the Index was attractive for latency but too complex for initial delivery.

The approach we chose places LTR within the Index’s pipeline. The Index over-fetches candidates, sends them with their ranking features to the ML platform for reranking, receives scores back, and then continues its normal processing. Critically, the Index remains the orchestrator throughout — LTR is one component that plugs into the query execution flow, not a separate system that takes over. The reasoning was practical:

How LTR is embedded within the Index’s query execution pipeline.

Reduced latency. By keeping LTR inside the Index’s processing flow, we avoid extra network hops carrying full document payloads. The Index sends only candidate IDs and their ranking features — a compact payload — to the ML platform and receives scores back.

The Index stays in control. Folding, pagination, and result enrichment remain the Index’s responsibility and happen after reranking, in the correct order. Post-processing is complex — child documents get folded under parents, result pages get truncated to the requested size, and additional metadata is attached for the frontend. By keeping LTR as a step within the Index’s pipeline rather than an external post-processor, the model ranks the full candidate set while the Index handles everything else. Query execution concerns stay where they belong.

Flexible candidate windows. The Index can send a large window of candidates to LTR, have them reranked, and then paginate. This window governs a performance-versus-quality trade-off. A larger window gives the model more room to surface relevant documents from deeper in the initial ranking. A smaller window reduces latency. The optimal size is something we tuned empirically.

Graceful degradation. Because the Index orchestrates the flow, it also controls what happens when things go wrong. If the ML platform is slow or unavailable, the Index falls back to its own ranking — a decision it can make locally, without coordinating with other systems. No reranking is better than a timeout that breaks the search experience.
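Putting these pieces together, a simplified sketch of the Index-side orchestration might look like the following. The window size, timeout value, and `ml_client.score` API are hypothetical placeholders, not the production interface.

```python
# Hypothetical constants -- the real values were tuned empirically.
RERANK_WINDOW = 100       # candidates sent to the model
RERANK_TIMEOUT_S = 0.05   # beyond this, fall back to the Index's ranking

def execute_query(candidates, page_size, ml_client):
    """candidates: list of (doc_id, features, index_score) tuples,
    pre-sorted by the Index's own ranking."""
    window = candidates[:RERANK_WINDOW]
    try:
        # Compact payload: only IDs and features cross the wire.
        scores = ml_client.score(
            [(doc_id, feats) for doc_id, feats, _ in window],
            timeout=RERANK_TIMEOUT_S,
        )
        reranked = [doc for _, doc in
                    sorted(zip(scores, window), key=lambda p: -p[0])]
        ordered = reranked + candidates[RERANK_WINDOW:]
    except TimeoutError:
        # Graceful degradation: no reranking beats a broken experience.
        ordered = candidates
    # Folding, pagination, and enrichment remain the Index's job,
    # applied after reranking (only pagination is sketched here).
    return ordered[:page_size]
```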

Separating Customer Intent from Learned Ranking

In our previous post we described why separating business rules from learned ranking matters. Here’s how we implemented it.

Enterprise customers configure their search to express business intent — pinning products to positions, boosting in-stock items, applying custom ranking functions. These aren’t noise; they’re deliberate business decisions, made by expert users. They are also implemented by the Index, upstream of LTR. Rather than treating these as another signal and allowing LTR to override them, we instead restrict the model to organic signals: text relevance, popularity, personalisation, semantic similarity, etc. The excluded rule-based scores are then reapplied after predictions, so pinned products and boost rules take effect on top of the learned ranking. The model works with each customer’s merchandising strategy, not against it.

Ranking signals from the Index
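In code, the separation might look like the sketch below. The signal names, the additive rescoring, and the treatment of pinned results are illustrative assumptions, not the exact production logic.

```python
# Hypothetical split between organic signals and customer rules.
ORGANIC = {"tfidf", "title_similarity", "popularity",
           "semantic_similarity"}

def organic_features(signals):
    """The model only ever sees organic signals."""
    return {k: v for k, v in signals.items() if k in ORGANIC}

def final_score(model_score, signals, pinned=False):
    """Rule-based scores are reapplied after prediction, so boosts and
    custom ranking functions act on top of the learned ranking."""
    rule_score = sum(v for k, v in signals.items() if k not in ORGANIC)
    score = model_score + rule_score
    if pinned:
        score = float("inf")  # featured results stay on top
    return score
```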

Testing the Design

With the architecture defined, we needed to know whether it actually worked. Before building any production infrastructure, we ran a series of offline experiments across a number of customers selected to span different verticals and data volumes, training our new model on their ranking logs and evaluating against held-out data.

The First Experiments

The initial results were encouraging. Using the composite ranking log features — the aggregated scores the Index already logs — the model produced significant improvements in NDCG@5 (bootstrap confidence intervals excluded zero at the 95% level) for around 75% of customers. The largest uplift, from a major retailer with millions of monthly searches, was approximately 20%; others ranged from 3% to 5%. One customer showed no clear improvement — their search configuration and user behaviour didn’t yet fit the assumptions baked into our pipeline, a reminder that generalisation has limits even in a recipe-based approach. In all cases the most important features were text relevance scores and the composite ranking expression and ranking function totals — exactly the signals the design was built to consume.

These results validated the core hypothesis: a model trained on existing ranking signals, with no additional feature engineering, could meaningfully improve on a hand-tuned linear combination.
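For readers who want to reproduce this style of evaluation, here is a minimal sketch of NDCG@5 plus a percentile-bootstrap confidence interval over per-query deltas (model minus baseline). Our actual evaluation tooling differs in the details.

```python
import math
import random

def ndcg_at_k(relevances, k=5):
    """relevances: graded labels in ranked order for one query."""
    dcg = sum((2**r - 1) / math.log2(i + 2)
              for i, r in enumerate(relevances[:k]))
    ideal = sum((2**r - 1) / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over per-query NDCG deltas.

    If the interval excludes zero, the improvement is significant
    at the 1 - alpha level.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(deltas) for _ in deltas) / len(deltas)
        for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```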

The Features Hiding Inside the Features

One observation we made during these initial experiments was that the ranking logs contained two large composite scores — an artefact of how they were implemented in the Index. Each bundled multiple distinct signals (collaborative filtering, facet-based ranking, popularity, and others) into a single number. Was the model losing information?

It was. When we decomposed the composites into their individual components — giving the model access to each signal separately — performance improved across all customers. The customer that had previously shown no uplift jumped to a 9% improvement in NDCG@5. The model had been trying to learn from pre-aggregated scores where distinct signals were mixed together, conflating their intents and utility. Once it could see each signal individually, it could learn which ones mattered and how they interacted.

This is a feature engineering insight that’s specific to building LTR on top of an existing search stack. When your features come from another system, you inherit that system’s abstractions. Those abstractions may be useful for their original purpose — computing a weighted sum — but unhelpful for a model that needs to learn non-linear combinations. Decomposition was straightforward once we knew to look for it, but the default was to consume the features as they were logged, and the default was wrong.
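A sketch of that decomposition, with hypothetical column names: the pre-aggregated total is dropped and replaced by the individual components it bundled.

```python
def decompose(row):
    """Replace a pre-aggregated composite score with its components,
    so the model can learn per-signal weights and interactions."""
    features = dict(row)
    components = features.pop("ranking_expression_components")
    features.pop("ranking_expression_total", None)  # drop the composite
    features.update(components)
    return features

# Illustrative logged row: one composite bundling several signals.
row = {
    "tfidf": 12.4,
    "ranking_expression_total": 85.0,
    "ranking_expression_components": {
        "collaborative_filtering": 40.0,
        "facet_ranking": 25.0,
        "popularity": 20.0,
    },
}
```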

Rules Held Up

We also validated the business rules separation design. Models trained only on organic signals — excluding customer-configured ranking expressions, featured results, and custom ranking functions — still significantly outperformed the baseline (although not by quite as much as the unrestricted ones). When we rescored model predictions to reapply the excluded rules, customers with heavy rule usage saw a slightly smaller NDCG@5 uplift, while others saw no change. In all cases, the rescored model still substantially improved on the existing ranking. The separation of concerns wasn’t just architecturally clean — it was empirically sound.

What Comes Next

The offline experiments told a consistent story. A model trained on the search stack’s own ranking signals — decomposed, cleaned of customer rules, and evaluated honestly — could substantially improve ranking quality across a diverse set of enterprise customers. The design principles we started with — simplicity, decoupling, generalisation — were holding up.

But offline metrics, however promising, are not production results. The gap between offline evaluation and live performance is where most ranking projects either prove themselves or stall. We had also catalogued a set of train-serve skew risks during the design phase — for example, features computed during indexing can differ subtly from the same features computed at query time, and ranking logs capture the training-time version, not the serving-time version. We knew these risks would need empirical investigation once the model met real traffic.
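One way to get ahead of such skew, sketched here under the assumption that a sample of logged queries can be replayed against the live serving path, is a simple feature-parity check that flags any signal drifting beyond a tolerance. The function name and tolerance are illustrative.

```python
def feature_parity(logged, served, rel_tol=0.01):
    """Compare logged (training-time) features with the same features
    recomputed at query time; return the names that drift beyond a
    relative tolerance, or that are missing at serving time."""
    drifted = []
    for name, train_val in logged.items():
        serve_val = served.get(name)
        if serve_val is None:
            drifted.append(name)  # feature missing at serving time
            continue
        denom = max(abs(train_val), abs(serve_val), 1e-9)
        if abs(train_val - serve_val) / denom > rel_tol:
            drifted.append(name)
    return drifted
```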

In the next post, we will describe how we ensured our design met real users as efficiently as possible so that we could get ahead of the remaining risks and begin to deliver what really matters: user impact.

Dig Deeper
Explore our AI Guide for Search and Product Discovery.