This post is for integration architects, SAP developers, and commerce engineers building search solutions with distributed product data.
Modern commerce search experiences depend on accurate, enriched, and timely product data. In complex enterprise landscapes, this data rarely lives in a single system.
When a customer searches for “hydraulic pump” on your commerce site, they expect to see current prices, accurate availability, and complete specifications. But if you’re running an enterprise B2B platform, that data doesn’t live in one place. Product master data sits in your PIM (Product Information Management system), pricing comes from SAP, inventory updates through ERP, and somehow it all needs to land in your search index—fresh, enriched, and consistent.
We built an integration on SAP BTP (Business Technology Platform) Integration Suite that solves this problem for large product catalogs. It pulls data from a PIM, enriches it with real-time SAP pricing, and pushes fully formed documents into Coveo’s search index. The system handles catalogs with hundreds of thousands of products, supports both full rebuilds and incremental updates, and stays resilient when downstream APIs misbehave.
This post walks through how we designed it, why we made certain trade-offs, and what we learned running it in production.
The Challenge: Orchestrating Product Data Across PIM, SAP, and Search at Scale
The problem sounds straightforward: get product data into a search index. But in practice, you’re orchestrating multiple systems with different data models, API constraints, and update cadences.
A typical PIM stores product relationships, attributes, and digital assets. SAP holds pricing that changes daily. Coveo needs documents optimized for search and filtering. Each system has rate limits. Any of them might time out or throttle you during peak load. And you need this pipeline to run reliably, whether you’re indexing 5,000 products or 500,000.
We needed an architecture that could:
- Scale to large catalogs without hitting memory or timeout limits
- Decouple data discovery from enrichment so one slow API doesn’t block everything
- Handle partial failures gracefully and retry just what failed
- Keep the search index consistent (no half-indexed catalogs)
- Support both full rebuilds and fast incremental updates

How the Integration Works
We built this as a single CPI (Cloud Platform Integration) iFlow with three distinct phases. Each phase solves a specific part of the orchestration problem.
Phase 1: Stream Setup and Product Discovery
The integration starts when an HTTPS endpoint is triggered. This could be a scheduled job, a webhook from the PIM, or a manual trigger during deployment. The caller can pass parameters like the last successful run timestamp (for delta updates) or batch size (to tune throughput).

The first thing we do is open a Coveo Push API stream. This gives us a streamId and a pre-signed uploadUri that we’ll use for all subsequent uploads.
Coveo streams are important here because they let us push large datasets in chunks without worrying about individual request sizes, and they guarantee atomic updates—either the whole index update succeeds or none of it does. No partial rebuilds.
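The stream lifecycle can be sketched as a pair of calls — one to open, one to close. This is an illustrative Python sketch (the real iFlow does this through a CPI HTTP adapter); the endpoint paths follow Coveo's Push API conventions, and `org`, `src`, and the API key are placeholders:

```python
import json
from urllib import request

PUSH_API = "https://api.cloud.coveo.com/push/v1/organizations/{org}/sources/{src}"

def open_stream(org, src, api_key, http=None):
    """Open a Push API stream; returns (streamId, uploadUri)."""
    http = http or _http_post
    body = http(PUSH_API.format(org=org, src=src) + "/stream/open", api_key)
    return body["streamId"], body["uploadUri"]

def close_stream(org, src, api_key, stream_id, http=None):
    """Close the stream so Coveo finalizes the index update atomically."""
    http = http or _http_post
    http(PUSH_API.format(org=org, src=src) + f"/stream/{stream_id}/close", api_key)

def _http_post(url, api_key):
    # Thin default HTTP wrapper; injectable so the logic is testable offline.
    req = request.Request(url, method="POST",
                          headers={"Authorization": f"Bearer {api_key}"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")
```

Everything pushed between `open_stream` and `close_stream` becomes visible at once — that is what guarantees no half-indexed catalog.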
With the stream open, we query the PIM for product entity IDs. We deliberately don’t fetch full product data yet—just the IDs.
For a catalog with 300,000 products, pulling full payloads upfront would overwhelm the API and consume gigabytes of memory in CPI. Instead, we get a lightweight list of IDs (maybe 50KB for thousands of products) and process them in batches later.
The query filters by entity type (Product), checks that products are linked to at least one ProductItem, and optionally filters by LastModified date if this is an incremental run. This keeps the initial request fast and sets us up to scale.
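A discovery query along these lines (field and operator names are hypothetical — the exact shape depends on your PIM's query API) returns only IDs, with the `LastModified` criterion included just on incremental runs:

```json
{
  "entityTypeId": "Product",
  "linkExists": { "linkTypeId": "ProductProductItem" },
  "dataCriteria": [
    { "fieldTypeId": "LastModified", "operator": "GreaterThan",
      "value": "2024-05-01T02:00:00Z" }
  ],
  "fields": []
}
```

The empty `fields` array is the point: no attribute payloads yet, just a lightweight ID list.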
Phase 2: Batch Processing and Enrichment
Now we have thousands of product IDs. We split them into manageable batches.
Why batch at all? Through testing, we found that smaller batches created too much API overhead, while larger batches risked timeouts and made retries expensive. We settled on a batch size that gives good throughput without hitting CPI’s 60-second reply timeout limit. Each batch becomes an independent unit of work, and if one fails, we retry just that batch.
For each batch, we call the PIM’s entities:fetchdata endpoint to retrieve full product details: attributes, linked items, specifications, category hierarchies, and digital assets like images.
A Groovy script transforms and normalizes this data—flattening multilingual fields, building category paths, formatting field names to match Coveo’s schema. At this point, we have search-ready product documents, but they’re missing prices.
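The transformation step looks roughly like this. The actual implementation is a Groovy script inside CPI; this Python sketch shows the same normalization, with an illustrative input shape standing in for the real PIM payload (only the Coveo-side field names like `ec_name` and `ec_category` follow Coveo's commerce schema):

```python
def to_coveo_doc(entity, locale="en"):
    """Flatten a multilingual PIM entity into a search-ready document."""
    fields = entity["fields"]
    return {
        "documentId": f"product://{entity['id']}",
        # Pick one locale out of the multilingual name field.
        "ec_name": fields["name"].get(locale, ""),
        # Category hierarchy becomes a path string for faceted filtering.
        "ec_category": "|".join(fields.get("categoryPath", [])),
        "ec_images": [a["url"] for a in entity.get("assets", [])],
    }
```

At this point the document is search-ready except for `ec_price`, which the next step fills in.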
Enriching with pricing data: We extract SAP Material IDs from the batch and call SAP’s OData (Open Data Protocol) pricing service. This is a separate API call per batch, using the current pricing date and sales context.
We’ve built in retry logic with exponential backoff because the pricing service occasionally throttles us during peak hours.
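The backoff logic is conceptually this (in production it lives in a Groovy script, tuned so that no single blocking call approaches CPI's 60-second reply timeout):

```python
import time

def call_with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a throttled call with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the batch-level handler decide
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the logic unit-testable without real waits — the same trick the message-property-based state tracking enables in CPI.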
The pricing response gets merged back into the product payload, giving us fully enriched documents ready for indexing. As an alternative, we also implemented an initial upload of pricing data from an SFTP server using CSV files.
Phase 3: Pushing to Coveo and Closing the Stream
Each enriched batch is pushed to Coveo using the uploadUri from Phase 1. The payload is JSON—an array of addOrUpdate operations, one per product, with fields like documentId, ec_name, ec_price, and dozens of filterable attributes.
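A trimmed-down batch payload looks like this (values are illustrative; the top-level `addOrUpdate` array is the structure Coveo's stream upload expects):

```json
{
  "addOrUpdate": [
    {
      "documentId": "product://PUMP-10045",
      "ec_name": "Hydraulic Pump 10045",
      "ec_price": 1299.0,
      "ec_category": "Hydraulics|Pumps",
      "ec_in_stock": true
    }
  ]
}
```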
We send this as application/octet-stream with server-side encryption enabled. If Coveo needs the data in smaller chunks (which happens with very large batches), it responds with a new chunk URI. We handle this transparently by requesting the new URI and resubmitting.
The integration tracks upload progress and respects Coveo’s Retry-After headers when it pushes back.
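The upload loop that handles both behaviors — throttling and the new-chunk-URI redirect — can be sketched like this. The status codes and the `newUri` header name here are assumptions chosen to illustrate the control flow, not Coveo's exact contract; `put` is an injected thin HTTP wrapper:

```python
def upload_batch(payload, upload_uri, put, sleep):
    """Upload one batch, honoring Retry-After and new-chunk-URI responses."""
    uri = upload_uri
    while True:
        status, headers = put(uri, payload)
        if status == 429:                            # throttled: wait as told
            sleep(int(headers.get("Retry-After", 1)))
        elif status == 413 and "newUri" in headers:  # batch too large
            uri = headers["newUri"]                  # resubmit to the new URI
        else:
            return status
```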

Once all batches are uploaded, we close the stream. This tells Coveo to finalize indexing. Until the stream closes, none of the updates are visible in search results. This keeps the index consistent—users never see a half-updated catalog.
Beyond the Basic Flow: Multiple Indexing Strategies
That three-phase flow handles a full catalog rebuild. But running it every time something changes would be wasteful. Product master data changes occasionally, but prices update daily and inventory changes constantly. We don’t need to reprocess 300,000 products just to update 50 prices.
So we split the indexing work into four specialized CPI artifacts, each optimized for a different update pattern.
Full (Initial) Product Indexing
This artifact rebuilds the entire catalog from scratch—opening a new stream, querying all products, batching through enrichment, and closing the stream. We run this nightly or weekly, and manually during deployment or schema changes. It’s the foundation.
Incremental Product Indexing
This artifact focuses on speed and freshness. Instead of querying all products, it queries only products modified since the last successful run (using the LastModified timestamp).
It processes just those changed products and pushes updates via addOrUpdate. This runs every 5-15 minutes, keeping product content nearly real-time without the overhead of full rebuilds.
Pricing Update Artifact
This handles pricing changes independently. When SAP publishes new prices (or we detect pricing changes), this artifact fetches just the updated prices and performs partial updates on existing Coveo documents.
Only pricing fields are touched, there’s no need to reprocess product master data, categories, or images. This keeps pricing current without expensive reindexing.
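Building a pricing-only update is mostly a payload question. This sketch shows the idea — send the price field alone, keyed by document ID, and never the whole document; the operation shape is illustrative, so check Coveo's partial item update API for the exact contract:

```python
def pricing_update_ops(prices):
    """Turn a {material_id: price} map into partial-update operations
    that touch only the ec_price field of existing documents."""
    return [
        {"documentId": f"product://{material_id}",
         "fields": {"ec_price": price}}
        for material_id, price in prices.items()
    ]
```

Fifty changed prices become fifty tiny operations — no master data, categories, or images ever leave the PIM.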
Inventory Update Artifact
This works similarly for stock and availability. Inventory changes more frequently than anything else, sometimes minute-by-minute.
This artifact listens for inventory events, updates stock levels and lead times, and uses Coveo’s partial update API to minimize payload size. Search results now reflect real-time availability, which dramatically improves user trust and conversion.
Each artifact runs on its own schedule, scales independently, and fails independently. If pricing updates break, product indexing keeps running. This separation of concerns makes the system maintainable and lets us optimize each update pattern without compromising the others.
Key Takeaways from Production
Building this integration surfaced challenges we didn’t anticipate. Some were technical limitations, others were tooling gaps. Here’s what we figured out.
The SAP BTP learning curve was real
If your team hasn’t worked with Integration Suite before, expect to spend time understanding iFlows, local integration processes, message properties, and adapters. The SAP BTP cockpit has its own logic for subaccounts, security configuration, and transport management. We front-loaded this learning with dedicated environment setup time and established clear naming conventions early. That upfront investment saved us from constant confusion later.
The 60-second reply timeout forced design changes
CPI cannot wait more than 60 seconds for a synchronous response. This became a problem when fetching large batches from the PIM or when SAP’s pricing service was under load. We couldn’t just wait—we had to redesign. We broke processing into smaller batches, implemented explicit retry logic in Groovy scripts instead of relying on long blocking calls, and used message properties to track state across retries.
When Coveo returned Retry-After headers, we built custom wait-and-retry logic that respected the timeout constraint. The system is now asynchronous by necessity, which actually made it more resilient.
External parameters eliminated configuration headaches
Hardcoding API keys, batch sizes, and feature flags directly in the iFlow is unmaintainable. We externalized everything—endpoints, throttling values, debugging toggles—so the same artifact could deploy across dev, QA, and production without modification.
We can also tune batch sizes per environment based on observed performance. This pattern is essential for any real-world CPI project.
Debugging required intentional instrumentation
CPI’s native debugging tools are limited. We built structured logging into every Groovy script, attached intermediate payloads to message logs (when a debug flag is enabled), and logged retry attempts with response codes.
When something breaks in production, we can pull the message log, see exactly which batch failed, and replay just that batch with full visibility. Without this instrumentation, we’d be flying blind.
Why This Pattern Works
This architecture works because it acknowledges reality. Product data is distributed. APIs have limits. Systems fail. Catalogs are huge. Instead of pretending these constraints don’t exist, we designed around them.
We decouple discovery from enrichment so one slow API doesn’t block everything. We batch aggressively to stay within memory and timeout limits. We use streams to guarantee index consistency. We split update strategies by change frequency so we’re not constantly reprocessing static data. And we instrument heavily so we can see what’s happening and fix it fast.
The result is a system that scales to hundreds of thousands of products, keeps search results fresh, and stays resilient when things go wrong. If you’re building commerce search on SAP BTP or orchestrating data across PIM, ERP, and search platforms, this pattern gives you a proven foundation to start from.

