Welcome to the first in a series of posts about learning to rank (LTR) at Coveo. We’ll be writing about the journey: what we built, what worked, what didn’t, and what we’re still figuring out. If you’re an ML engineer, search practitioner, or applied scientist who’s ever shipped a ranking model and watched it do something unexpected in production, this series is for you.

Before we begin, let’s establish some context. Coveo is a multi-tenant Software as a Service (SaaS) platform. We provide search, listings, recommendations, personalization, and agentic capabilities to enterprise clients. In this series we’ll be focusing on commerce specifically: retailers, manufacturers, distributors, direct-to-consumer, and business-to-business. Each customer (tenant) gets their own bespoke search experience, but the infrastructure and the models are based on shared principles. When we train a ranking model, it needs to just work: a new customer onboards, their search traffic flows through the pipeline, and the results need to be good without anyone hand-tuning anything.

The catch is that our customers are not all the same. They vary wildly. A footwear retailer and a motor parts distributor have different catalogues, different users, and fundamentally different ideas about what a good search result looks like. One customer generates millions of search events a month with rich behavioural signals. Another has a secondary storefront where users arrive, search, and leave without generating much signal at all. The pipeline that serves them both has to handle that variance: reliably, automatically, and without falling over when someone’s data looks nothing like what you expected.

Daniel Tunkelang captures something important in his essay “Learn to Rank = Learn to be Humble”: even for a single search pipeline, ranking is harder than it looks. Models that shine in offline evaluation can disappoint in production. Feature engineering is more fragile than it appears. Evaluation metrics can mislead you at exactly the moment you need them most. He’s right, and if that’s true for one pipeline, imagine doing it across dozens of customers simultaneously, each with different data, different behaviour, and different definitions of success.

Model building at Coveo has to be done in a way that respects this diversity. We don’t build a model for each customer from scratch; we build a recipe: a shared paradigm that takes each customer’s data as input and produces a model tuned to their search behaviour. The quality of the recipe is measured by how well it generalizes, not by one-off results on a single customer.

This is the story of how we built that recipe, where it broke, and what we learned along the way.

What Makes Multi-tenant LTR Different

The dimensions of variance across our customer base are worth spelling out, because each one creates engineering challenges that don’t appear in single-tenant settings.

Behaviour variance. What constitutes a “good search” differs radically across verticals. A user searching for industrial fasteners wants exact part-number matches. A user searching for running shoes wants to browse and discover. The same ranking signals (text relevance, click history, semantic similarity) carry different weight in each context. A model that excels at one may be mediocre at the other.

Configuration variance. Enterprise customers configure their search. They pin products to specific positions, apply custom boost rules, and define their own product taxonomies. These configurations express business intent (merchandising decisions, promotional campaigns, contractual obligations), and the model needs to respect them rather than override them.

The constraint all of this imposes: you cannot hand-tune per customer at scale. When you have dozens of customers across multiple verticals and regions, per-customer model tuning doesn’t just fail to scale; it creates maintenance debt that compounds with every new customer. Your pipeline has to be robust and flexible by default.

The Recipe in Practice

So what does the recipe actually look like? The ingredients are each customer’s data: their search logs, their product catalog, their user behaviour patterns. These vary enormously in volume, quality, and the signals they carry. The recipe is the pipeline that processes them: data extraction, relevance grading, temporal train-test splitting, model training, offline evaluation, automated quality gates, and deployment. The same code runs for every customer.

The recipe for multi-tenant learning to rank
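To make one of those stages concrete, here is a minimal sketch of a temporal train-test split (the function name and signature are ours for illustration, not Coveo’s pipeline code): the most recent traffic is held out for evaluation, so the offline test mimics production by training on the past and predicting the future.

```python
# Illustrative sketch of a temporal train-test split: hold out the most
# recent search events for evaluation instead of splitting randomly, so
# the offline test mirrors production (train on the past, score the future).
def temporal_split(events, holdout_fraction=0.2):
    """events: list of (timestamp, payload) pairs. Returns (train, test)."""
    ordered = sorted(events, key=lambda e: e[0])
    cut = int(len(ordered) * (1 - holdout_fraction))
    return ordered[:cut], ordered[cut:]
```

A random split would leak future behaviour into training; the temporal cut avoids that.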

The output is a trained model, specific to that customer’s data but produced by the same process. A model for a footwear retailer and a model for an industrial distributor are siblings. Different in what they’ve learned, identical in how they were made.

When we improve a feature, change a hyper-parameter, or modify the grading strategy, the question is always: does this make the recipe better for the population, or just for one customer? And when the recipe fails for a specific customer, the diagnostic question follows the same logic: is the recipe wrong, or are the ingredients unusual? Both happen. A recipe that can’t handle a customer with sparse data is a recipe that needs improving. But a customer whose search configuration routes half their traffic through non-standard query pipelines might need setup work, not a model change.

Keeping this boundary clear (what’s shared versus what varies) is what makes multi-tenant LTR tractable.

Multi-tenant Engineering Requirements

The recipe framing sounds tidy. In practice, it imposes specific engineering requirements that single-tenant teams rarely encounter.

Automated quality gates. A human can’t review every model for every customer. Instead we evaluate each trained model against a baseline on ranking quality metrics, complete with uncertainty estimates. The system makes one of three decisions: approve (the model is confidently better), reject (it’s confidently worse), or advisory (the confidence interval spans zero and the result is inconclusive). This isn’t a nice-to-have; it’s infrastructure. Without it, a recipe that silently degrades for one customer would ship unnoticed.
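The decision logic reduces to a three-way comparison of a confidence interval against zero. A minimal sketch, not our production code: `ci_low` and `ci_high` bound the estimated metric delta of the candidate model over the baseline.

```python
# Hedged sketch of the three-way gate: approve only when the whole
# confidence interval for the metric delta sits above zero, reject when
# it sits entirely below, and defer to a human otherwise.
def quality_gate(ci_low: float, ci_high: float) -> str:
    if ci_low > 0:
        return "approve"   # confidently better than the baseline
    if ci_high < 0:
        return "reject"    # confidently worse
    return "advisory"      # interval spans zero: inconclusive
```

For example, `quality_gate(0.01, 0.05)` approves, while `quality_gate(-0.02, 0.03)` only advises.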

Robust defaults that can flex per customer. Your defaults need to work well enough across the board that the recipe produces reasonable results for any customer out of the box. But “robust” doesn’t mean “identical everywhere.” Different customers may genuinely benefit from different configurations. The discipline is ensuring that when you tune for one customer, you’re not overfitting to their idiosyncrasies at the expense of the recipe’s generality. The question is always: does this change improve the recipe, or just this one dish?

Customer-level evaluation. Aggregated metrics can hide per-customer failures. When we ran a mass test across nine customers (spanning three regions, multiple verticals, and widely varying data volumes) the headline result was positive: seven experiments showed improved relevance. But the headline isn’t the whole story. The two negative cases each had specific explanations that required investigation. One was a client configuration issue. Another reflected a measurement subtlety we hadn’t anticipated. If we’d only looked at the aggregate, we’d have missed the signal.

Separation of customer intent. Enterprise customers configure their search to express business decisions. If a retailer pins a product to the top position for a particular query (because of a supplier contract, a margin target, or a seasonal campaign), the model shouldn’t override that. We separate customer-configured ranking rules before ML inference and reapply them afterward. The model learns from organic signals (text relevance, click behaviour, collaborative filtering) while the customer’s business rules are preserved intact. These are different concerns and should be composed, not entangled.
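A minimal sketch of that composition (the function and data shapes are hypothetical, not Coveo’s actual API): pinned products are set aside before inference, the model ranks only the organic candidates, and the pins are reinserted at their configured positions afterward.

```python
# Hypothetical sketch: business rules (pins) are separated from ML ranking
# and composed back in afterward, so the model never fights them.
def rank_with_pins(candidates, pins, score):
    """candidates: product ids; pins: product id -> fixed 0-based position;
    score: callable returning the model's relevance score for a product."""
    organic = [p for p in candidates if p not in pins]
    ranked = sorted(organic, key=score, reverse=True)  # model ranks organic only
    for product, pos in sorted(pins.items(), key=lambda kv: kv[1]):
        ranked.insert(min(pos, len(ranked)), product)  # reapply business rules
    return ranked
```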

Knowing when the recipe breaks. Multi-tenant evaluation surfaces measurement subtleties that don’t arise in single-tenant settings. One speciality retailer appeared to show degraded click-rank metrics: users were clicking on lower-ranked results, which looks bad. But deeper investigation revealed that LTR had increased overall engagement: users were clicking more, across all positions. The increased engagement inflated average click rank even though users were more satisfied. The same metric meant different things for different customers. Per-customer evaluation with careful metric interpretation isn’t optional; it’s the only way to know whether the recipe is working.
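The effect is easy to reproduce with toy numbers (the figures below are invented for illustration, not the retailer’s data): adding clicks at lower positions raises the average click rank even as total engagement goes up.

```python
# Toy illustration of the click-rank paradox: more clicks at every
# position can still raise the *average* click rank. Numbers invented.
def avg_click_rank(clicks_by_position):
    """clicks_by_position: 1-based position -> click count."""
    total = sum(clicks_by_position.values())
    return sum(pos * n for pos, n in clicks_by_position.items()) / total

before = {1: 100, 2: 20}           # 120 clicks, concentrated at the top
after = {1: 110, 2: 40, 3: 30}     # 180 clicks: more at every position
```

Engagement rose by 50%, yet `avg_click_rank(after)` is higher (worse) than `avg_click_rank(before)`; read naively, the metric reports a regression.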

The Recipe as Foundation

The recipe approach (shared process, customer-specific data, automated quality control, population-level evaluation) is what makes multi-tenant LTR tractable. It produces good models for every customer, reliably and automatically, and it gives you the diagnostic tools to know when something needs more attention. Everything that follows in this series builds on that foundation.

What Comes Next

Most published LTR work describes single-tenant systems: one company, one model, one evaluation. The multi-tenant SaaS problem brings its own challenges, challenges that rarely get discussed. This series is our attempt to change that.

In the posts that follow, we’ll cover how we designed the architecture to decouple retrieval from ranking, how that clean design met production reality and evolved, the operational journey of getting LTR into production at scale, and the tension between optimizing for clicks and optimizing for revenue.

Dig Deeper
Explore our AI Guide for Search and Product Discovery.