Modern ML systems are often described in terms of models, embeddings, feature stores and training pipelines. In practice, however, the scalability and reliability of an ML platform are often determined much earlier in the process: at the moment raw interaction data enters the system.

At Coveo, nearly every ranking, personalization and commerce optimization capability depends on processing enormous volumes of behavioral signals generated across websites, applications and backend systems. Every search, click, recommendation interaction and purchase contributes data that may later influence machine learning outcomes.

At first glance, this may sound straightforward: clients send JSON payloads, those payloads are stored, and models train on the resulting data. In reality, however, operating a large-scale ML platform across hundreds of integrations quickly becomes much more complicated. Payload formats evolve over time, integrations are difficult to change, and many important business concepts cannot be represented directly as single events.

To manage this complexity, Coveo deliberately separates three concepts that are often conflated:

  • raw payloads received from integrations,
  • normalized events representing business semantics,
  • and higher-level business state derived from collections of events.

That distinction may sound subtle, but it is foundational to how our ML and analytics systems operate at scale. As a simple analogy, consider how a bank may process an online purchase through a payment provider:

  1. Receive a JSON payload from the payment provider:
    {"txid": "fe24ae8b12c", "amt": 49.55}
  2. Interpret this payload as a debit event on your account and store it:
    transaction: -49.55
  3. Process this event and update the bank’s business data model:
    account total: 529.76

The first two steps separate the external payload from the internal event representation. A bank may support many different payment providers or multiple protocol versions, all producing different payload structures. The exact syntax of the incoming JSON is ultimately irrelevant; what matters is preserving the meaning of the transaction in a unified format that downstream systems can understand.

The final step computes current business state from a collection of events. Banks do not simply apply a transaction to a running total and then discard the transaction record itself. Instead, all transactions are retained for auditability, resiliency and historical analysis, while separate processing systems continuously derive the current account state.
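
The three banking steps can be sketched in a few lines of Python. This is only an illustration of the analogy, not a description of any real banking system: the Transaction type, the function names and the opening balance of 579.31 are all assumptions chosen so that the numbers match the example above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """Internal event: a signed change to an account (hypothetical type)."""
    txid: str
    amount: Decimal  # negative for debits


def normalize_payment(payload: dict) -> Transaction:
    """Step 2: interpret a provider payload as a debit event."""
    # Going through str() avoids binary floating point noise in the amount.
    return Transaction(txid=payload["txid"], amount=-Decimal(str(payload["amt"])))


def balance(opening: Decimal, ledger: list[Transaction]) -> Decimal:
    """Step 3: derive the current state from the retained events."""
    return opening + sum((t.amount for t in ledger), Decimal("0"))


# Step 1: the raw payload as received from the payment provider.
payload = {"txid": "fe24ae8b12c", "amt": 49.55}

ledger = [normalize_payment(payload)]           # the event is kept, not discarded
current = balance(Decimal("579.31"), ledger)    # assumed opening balance
print(current)
```

Because the ledger is retained, the balance can be recomputed or audited at any time; the stored total is only a convenient derived view.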

Coveo’s event processing architecture follows a very similar pattern.

Payload Normalization

There are a number of practical and architectural reasons for creating a clear separation between the payload received from an integration and the event it represents.

Separation of concerns

It would be extremely difficult for ML data scientists or downstream analytics systems to keep track of the precise details of multiple integration protocols. A model may simply need the revenue associated with a purchase event, without caring whether the value originates from a field named tr in one protocol or transaction.revenue in another.

This abstraction is conceptually similar to interfaces in software engineering. The normalized Event definition acts as the stable interface consumed internally, while the payload-specific transformation layer implements the details required to map each external protocol into that interface.

Without this separation, every downstream system would become tightly coupled to every upstream integration format.
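
A minimal sketch of this interface idea follows. The field names tr and transaction.revenue come from the example above; everything else (the PurchaseEvent type, the protocol labels "legacy" and "modern", and the function names) is hypothetical and is not Coveo's actual schema or code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class PurchaseEvent:
    """Normalized event: the stable interface consumed downstream."""
    revenue: float


def from_legacy(payload: dict[str, Any]) -> PurchaseEvent:
    # Assumed older protocol: revenue lives in a flat field named "tr".
    return PurchaseEvent(revenue=float(payload["tr"]))


def from_modern(payload: dict[str, Any]) -> PurchaseEvent:
    # Assumed newer protocol: revenue is nested under transaction.revenue.
    return PurchaseEvent(revenue=float(payload["transaction"]["revenue"]))


# One normalizer per protocol; downstream code never sees this table.
NORMALIZERS: dict[str, Callable[[dict[str, Any]], PurchaseEvent]] = {
    "legacy": from_legacy,
    "modern": from_modern,
}


def normalize(protocol: str, payload: dict[str, Any]) -> PurchaseEvent:
    """Map a protocol-specific payload to the shared event definition."""
    return NORMALIZERS[protocol](payload)


a = normalize("legacy", {"tr": "49.55"})
b = normalize("modern", {"transaction": {"revenue": 49.55}})
print(a == b)
```

Supporting a new protocol then means adding one entry to the table, while every consumer of PurchaseEvent remains untouched.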

Integration inertia

The ability to abstract away from exact payload syntax is critical because integrations are difficult to change once deployed.

Customers are understandably hesitant to modify working Coveo integrations running on production websites. The operational risk is significant: website changes may interfere with day-to-day business operations, ownership is often fragmented across teams, and the benefits of integration updates are not always immediately visible to stakeholders. As a result, once an integration is deployed, changing it becomes an uphill battle.

By isolating protocol-specific concerns within the normalization layer, internal systems can continue to evolve independently while maintaining backward compatibility with integrations that may remain unchanged for years.

Changing event sources

As the product evolves, the systems responsible for generating events may also change.

In older integration protocols, for example, search event payloads needed to be sent explicitly from the client browser after search results were received from the Coveo Search API. In more modern Coveo deployment architectures, the Search API itself may generate and log the corresponding payload automatically.

This transition was transparent to downstream systems because both implementations ultimately produced the same normalized search event, even though the payload origins and transport mechanisms differed significantly.

One payload, multiple events

The mapping between payloads and events is not always one-to-one.

Some legacy endpoints accept arrays of objects within a single payload and expand them into multiple independent events during normalization. Similarly, a commerce purchase payload may contain multiple line items which are more conveniently processed as separate downstream events for attribution or analytics purposes.

The normalization layer is responsible not only for translating payload formats, but also for reshaping incoming data into forms better suited for downstream processing.
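
As an illustration of this one-to-many mapping, the sketch below expands a single purchase payload into one event per line item. The payload shape (order_id, items, sku, quantity, price) and the LineItemPurchased type are assumptions made for the example, not an actual Coveo protocol.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass(frozen=True)
class LineItemPurchased:
    """One normalized event per purchased line item (hypothetical type)."""
    order_id: str
    sku: str
    quantity: int
    unit_price: float


def expand_purchase(payload: dict[str, Any]) -> Iterator[LineItemPurchased]:
    """Yield one event for each line item in a single purchase payload."""
    for item in payload.get("items", []):
        yield LineItemPurchased(
            order_id=payload["order_id"],  # shared key ties the events together
            sku=item["sku"],
            quantity=int(item.get("quantity", 1)),
            unit_price=float(item["price"]),
        )


payload = {
    "order_id": "o-1",
    "items": [
        {"sku": "shoe-42", "quantity": 1, "price": 89.0},
        {"sku": "sock-7", "quantity": 3, "price": 4.5},
    ],
}
events = list(expand_purchase(payload))
print(len(events))
```

Carrying the order identifier on every event lets downstream consumers regroup the line items when they need the whole order again.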

Events Versus State

Returning to the banking analogy, it should be obvious why banks do not discard transaction records after updating an account balance. Customers need to understand how the current balance was produced, and the bank must be able to audit or reconstruct account history if necessary.

At the same time, it would also be impractical for a bank to recompute your account balance from the full transaction history every time you open your banking application.

The same distinction exists within Coveo: events capture the history of interactions, while state represents higher-level business concepts derived from those events.

Some states are implicit

Certain business concepts are never explicitly represented as events.

For example, the concept of a visit is central to many commerce analytics metrics such as revenue per visit. Yet there is no explicit “start visit” or “end visit” event emitted by integrations. Instead, visits are inferred from patterns of related events occurring over time.

Reconstructing visits dynamically from billions of raw events every time a metric is requested would be computationally prohibitive. Persisting derived state makes these analytics practical.
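
One common way to infer visits is to group each client's events by inactivity gap. The sketch below assumes a 30-minute timeout and events reduced to (client_id, timestamp) pairs; both are illustrative choices, and the article does not state the rule Coveo actually uses.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

VISIT_GAP = timedelta(minutes=30)  # assumed inactivity timeout

Event = tuple[str, datetime]  # (client_id, timestamp)


def infer_visits(events: list[Event]) -> list[list[Event]]:
    """Split each client's events into visits separated by long gaps."""
    visits: list[list[Event]] = []
    # groupby needs its input sorted by the grouping key (client, then time).
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for _, client_events in groupby(ordered, key=itemgetter(0)):
        current: list[Event] = []
        for event in client_events:
            # A gap longer than the timeout closes the current visit.
            if current and event[1] - current[-1][1] > VISIT_GAP:
                visits.append(current)
                current = []
            current.append(event)
        visits.append(current)
    return visits


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    ("a", t0),
    ("b", t0 + timedelta(minutes=5)),
    ("a", t0 + timedelta(minutes=10)),
    ("a", t0 + timedelta(hours=2)),
]
visits = infer_visits(sample)
print([len(v) for v in visits])
```

Running this grouping once and persisting the result is what spares every metric query from re-scanning the raw events.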

Some states are complex

Other forms of state require sophisticated processing logic spanning many events across extended time windows.

Commerce attribution is a good example. In order to evaluate ML effectiveness, Coveo must relate purchased items back to upstream search, recommendation or personalization interactions. Determining which interaction influenced which purchase is often non-trivial and may require analyzing many events spanning hours or even days.

If every team implemented attribution logic independently, the result would be duplicated compute costs, inconsistent business definitions and inevitable discrepancies between systems.

Centralized state computation ensures these concepts are implemented once and consistently reused across the organization.
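
To make the idea concrete, here is a deliberately simplified last-touch attribution rule: a purchase is credited to the most recent click on the same item by the same visitor within a fixed window. The rule, the one-day window and the Click and Purchase types are assumptions for illustration only; real attribution logic, including Coveo's, is considerably more involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ATTRIBUTION_WINDOW = timedelta(days=1)  # assumed lookback window


@dataclass(frozen=True)
class Click:
    visitor: str
    sku: str
    source: str  # e.g. "search" or "recommendation"
    at: datetime


@dataclass(frozen=True)
class Purchase:
    visitor: str
    sku: str
    at: datetime


def attribute(purchase: Purchase, clicks: list[Click]) -> Optional[Click]:
    """Return the latest eligible click before the purchase, if any."""
    candidates = [
        c for c in clicks
        if c.visitor == purchase.visitor
        and c.sku == purchase.sku
        and timedelta(0) <= purchase.at - c.at <= ATTRIBUTION_WINDOW
    ]
    return max(candidates, key=lambda c: c.at, default=None)


t0 = datetime(2024, 1, 1, 12, 0)
clicks = [
    Click("v1", "shoe-42", "search", t0),
    Click("v1", "shoe-42", "recommendation", t0 + timedelta(hours=3)),
    Click("v1", "sock-7", "search", t0 - timedelta(days=3)),
]
credited = attribute(Purchase("v1", "shoe-42", t0 + timedelta(hours=4)), clicks)
print(credited.source if credited else None)
```

Whatever rule is chosen, defining it in one shared function is what keeps every report and training pipeline crediting the same interaction.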

The product has only one state

Customers expect a single coherent view of the system.

If different product interfaces display different values for what should be the same metric, trust in the platform quickly deteriorates. Even small implementation differences between teams can produce inconsistent outcomes once edge cases begin to accumulate.

Maintaining a unified business state ensures that analytics, reporting and ML training pipelines all operate against the same definitions and derived outcomes.

State needs to be auditable

Just as a bank must be able to explain how an account balance was produced, Coveo must be able to explain how efficiency metrics, attribution outcomes or ML effectiveness measurements were computed.

This requires retaining both the original event history and the derived state generated from it. Events provide traceability and auditability, while state provides efficient access to business-level concepts.

Coveo’s Data Processing Pipeline

Coveo uses this same three-step pattern — capture, normalize and process — throughout most of its event processing infrastructure.

Incoming payloads are first captured and stored in a unified data lake. Stateless normalization processes then transform those payloads into standardized event records. A second layer of processing consumes large collections of events and derives higher-level business models and aggregates from them.

Different internal consumers naturally gravitate toward different layers of the system:

  • Teams interested in interaction-level analysis (“How long between a search and a click?”, “How many search events occurred yesterday?”) typically work directly with event data.
  • Teams interested in outcomes and aggregates (“Did this visit result in a purchase?”, “Which ranking model performed better?”, “How much product revenue was generated in North America?”) generally work with the higher-level business models.

Today, the resulting data foundation powers the majority of product analytics and reporting across Coveo, ensuring that customers and internal teams share a consistent view of system behavior.

At a meta-level, the same infrastructure is also reused internally to monitor Coveo’s own ML stack. Model executions, recommendations and downstream outcomes are themselves logged as events, allowing Coveo to evaluate individual model performance, tune algorithms automatically and identify anomalous customer deployments proactively.

In Summary

It is well understood that improving data quality improves ML outcomes. At scale, however, data quality is not only about correctness; it is also about abstraction, consistency and evolvability.

Coveo’s platform must simultaneously support decade-old integrations that are difficult to change while continuing to evolve rapidly internally. The abstraction layers between payloads, events and business state are what make this possible.

Without it, downstream ML training pipelines, analytics systems and reporting infrastructure would quickly collapse under the weight of integration-specific complexity and duplicated business logic.

At the same time, centralized state computation and a shared source of truth ensure that complex metrics are computed consistently and efficiently across the platform. Without that consistency, customers could not reliably trust the analytics and ML outcomes presented to them.

At the time of writing, Coveo processes close to a billion incoming payloads per day, and that volume continues to grow rapidly. Operating efficiently at that scale requires carefully tuned processing infrastructure maintained by a dedicated Data Platform team under strict operational SLOs.