You’ve built your custom RAG pipeline. It works in dev. Then you try to:
| Task | Obstacle |
| --- | --- |
| Connect it to SharePoint | Permissions are a nightmare |
| Add Salesforce | Different API, different auth |
| Scale to 10M documents | Retrieval slows to a crawl |
| Handle users who shouldn’t see certain content | Now you’re rebuilding access control |
Building RAG infrastructure from scratch means solving problems that have nothing to do with your actual application.
At Coveo, we’ve taken 15+ years of building enterprise search infrastructure and turned it into a platform we call Retrieval Augmented Generation (RAG)-as-a-Service. As agentic architectures mature, the retrieval layer becomes essential, so we’ve also put our cloud-native offering behind a Coveo-hosted MCP server, designed to bring more precision, security, and scalability to agentic projects.
It’s permission-aware, API-first, production-ready retrieval infrastructure. Built for enterprise engineers who want to extend the capabilities of their copilots, agents, assistants, and AI-powered search.
Not familiar with RAG? Start here instead.
Under the Hood
RAG-as-a-Service connects your content sources to your LLM, enriching each question with the right context before generation:

The Coveo platform handles:
- Incremental indexing with permission sync
- Lexical (keyword) search
- Semantic embedding and vector search
- Machine learning and analytics
- Passage-level retrieval and ranking
- Runtime permission enforcement
Why Engineering Teams Choose RAG-as-a-Service
Your LLM is only as good as its context.
Whether you’re building a workplace assistant for employees or a customer-facing chatbot, hallucinations aren’t acceptable. Coveo unifies content access in one place so your LLMs always work from current, permission-aware information.
Governance isn’t optional.
You’re working with enterprise data. That means respecting access controls, data sensitivity, and compliance requirements. RAG-as-a-Service enforces document-level permissions and works with both structured and unstructured content sources, without requiring data migration.
Reduce time-to-production.
Skip the work of stitching together vector databases, connectors, access control layers, and retrieval logic. RAG-as-a-Service gives you search, passage retrieval, and answer delivery—modular components ready to plug into your app.
The Core APIs
Think of RAG-as-a-Service as infrastructure for building AI experiences. Each API handles a different part of the retrieval and generation workflow:
Passage Retrieval API (PR API)
PR API returns specific text passages from your indexed documents, ranked by relevance. Send a query, get back the most relevant snippets with permission filtering already applied. Built on top of the Search API infrastructure, so your existing query pipelines work here too.

The following example shows the payload to retrieve passages from a Coveo index:
{
  "query": "What are the benefits of using solar energy?",
  "filter": "@source==\"acme\"",
  "additionalFields": [
    "clickableuri"
  ],
  "maxPassages": 5,
  "searchHub": "Main",
  "localization": {
    "locale": "en-CA",
    "timezone": "America/Montreal"
  },
  "context": {
    "userAgeRange": "25-35",
    "userRoles": [
      "PremiumCustomer",
      "ProductReviewer"
    ]
  }
}
The following example shows a response payload from the Passage Retrieval API:
{
  "items": [
    {
      "text": "Solar energy has several benefits including reducing electricity bills, providing a renewable energy source, and lowering carbon footprint.",
      "relevanceScore": 0.95,
      "document": {
        "title": "The Benefits of Solar Energy",
        "primaryid": "GAYEG6LHNB2DQ4LLNBKVEUSHKUXDCMZWGEZS4ZDFMZQXK3DU",
        "clickableuri": "https://example.com/search/document-solar-energy"
      }
    },
    {
      // another item
    }
  ],
  "responseId": "c0857557-5579-4f5e-8958-9befd7d1d4a8"
}
Integrates with Amazon Bedrock, Microsoft Copilot, or your own orchestration layer.
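To see it end to end, here’s a minimal Python sketch that sends the request above. The endpoint path and header are assumptions based on Coveo’s REST conventions, so check your organization’s API reference for the exact values:

```python
import requests

# Placeholders: substitute your organization's URL and a search-scoped API key.
ORG_URL = "https://<YOUR_ORG_ID>.org.coveo.com"
API_KEY = "<SEARCH_API_KEY>"

payload = {
    "query": "What are the benefits of using solar energy?",
    "filter": '@source=="acme"',
    "maxPassages": 5,
    "searchHub": "Main",
}

response = requests.post(
    f"{ORG_URL}/rest/search/v3/passages/retrieve",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()

# Each item carries the passage text, a relevance score, and metadata
# about its source document, as in the response shown above.
for item in response.json()["items"]:
    print(f'{item["relevanceScore"]:.2f}  {item["text"][:80]}')
```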
Answer API (In Open Beta)
The Answer API generates answers from your data using Relevance Generative Answering. It handles the LLM call and citation generation for you, or you can use it alongside your own LLM workflow.

The following example shows a request payload for the Answer API:
POST https://<YOUR_ORG_REST_URL>/answer/v1/configs/<CONFIG_ID>/generate
{
  "q": "Which gloves are better for autumn?",
  "searchHub": "sports",
  "pipeline": "Sports goods pipeline",
  // ...
}
The Answer API returns streaming server-sent events with the generated answer text and citations. The stream sends header info, then answer text chunks, then citations, then an end-of-stream event.
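Here’s a minimal sketch of consuming that stream in Python, assuming standard `data:` SSE framing; the event fields are illustrative placeholders, so consult the Answer API reference for the exact schema:

```python
import json
import requests

ANSWER_URL = "https://<YOUR_ORG_REST_URL>/answer/v1/configs/<CONFIG_ID>/generate"

with requests.post(
    ANSWER_URL,
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"q": "Which gloves are better for autumn?", "searchHub": "sports"},
    stream=True,  # keep the connection open for server-sent events
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: {...}" lines separated by blank lines.
        if not line or not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):])
        # The stream delivers header info first, then answer text chunks,
        # then citations, then an end-of-stream event.
        print(event)
```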
Search API
The Search API provides standard search with ML-powered relevance ranking. It returns full document results instead of passages.
The following example shows a search request to the Search API:
payload = {
    "q": query,  # the end user's search string
    "searchHub": "sports",
    "pipeline": "Sports goods pipeline"
}
The following example shows a response payload from the Search API:
{
  ...
  "duration": 35,
  "groupByResults": [
    ...
  ],
  "indexDuration": 11,
  "requestDuration": 33,
  "results": [
    {
      "clickUri": "https://example.com/bookstore/books/authors/arthur-conan-doyle/adventures-of-sherlock-holmes",
      "excerpt": "The Adventures of Sherlock Holmes, a collection of 12 Sherlock Holmes tales ... written by Sir Arthur Conan Doyle and published in 1892 ...",
      "excerptHighlights": [
        {
          "length": 10,
          "offset": 4
        }
      ],
      "percentScore": 75.0698,
      "printableUri": "https://example.com/bookstore/books/authors/arthur-conan-doyle/adventures-of-sherlock-holmes",
      "printableUriHighlights": [
        {
          "length": 10,
          "offset": 63
        }
      ],
      "raw": {
        "date": 1532631456000,
        "author": "Arthur Conan Doyle",
        "documenttype": "Book",
        "filename": "adventures-of-sherlock-holmes.html",
        "filetype": "html",
        "indexeddate": 1532631456000,
        "language": [
          "English"
        ],
        "permanentid": "ecc3fac22085f2712c8cd2144f9d195593710963dc2202b5256f8a4f5f6",
        "size": 50683,
        "source": "Books",
        "sourcetype": "Push",
        "title": "The Adventures of Sherlock Holmes",
        ...
      },
      "score": 4904,
      "title": "The Adventures of Sherlock Holmes",
      "titleHighlights": [
        {
          "length": 10,
          "offset": 4
        }
      ],
      "uniqueId": "42.19751$https://example.com/bookstore/books/authors/arthur-conan-doyle/adventures-of-sherlock-holmes",
      "uri": "https://example.com/bookstore/books/authors/arthur-conan-doyle/adventures-of-sherlock-holmes"
    },
    ...
  ],
  "searchUid": "7beff9c1-98f3-401c-ac16-10b90a8b810f",
  "totalCount": 60,
  "totalCountFiltered": 60,
  ...
}
Good for exploration, fallback scenarios, or when users want to browse full results rather than get direct answers.
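To make the request above concrete, here’s a small Python sketch that posts it to the documented `/rest/search/v2` endpoint (the org URL and API key are placeholders):

```python
import requests

ORG_URL = "https://<YOUR_ORG_ID>.org.coveo.com"
API_KEY = "<SEARCH_API_KEY>"

def search(query: str) -> dict:
    """Send a query to the Search API and return the parsed JSON response."""
    resp = requests.post(
        f"{ORG_URL}/rest/search/v2",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "q": query,
            "searchHub": "sports",
            "pipeline": "Sports goods pipeline",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

results = search("sherlock holmes")
print(results["totalCount"], "results in", results["duration"], "ms")
```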
Fetch API
Full-document retrieval for use cases where you need complete context rather than passages. Currently in beta. Get more information here.
Each API is powered by Coveo’s platform: enterprise AI models, hundreds of connectors to SaaS platforms, advanced indexing, permission enforcement, and flexible deployment options.
Real Implementation Examples
Build Your Own Copilot
Use the Passage Retrieval API to feed your LLM with secure, filtered content from your internal sources. The API respects document-level permissions at query time, so users only see passages from documents they have access to.
How it works:
- User asks a question in your copilot interface
- Your app sends the query to the Passage Retrieval API with the user’s authentication context
- PR API returns relevant passages (max 20) with relevance scores
- You construct a prompt with these passages as context
- Send to your LLM of choice (OpenAI, Anthropic, your own model)
- Return the grounded response to the user
The Passage Retrieval API handles indexing, permission filtering, semantic ranking, and passage extraction. You control the prompt engineering and response generation. Works with Bedrock, Copilot, or custom orchestration.
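Here’s a condensed sketch of that flow in Python. The PR API endpoint path is an assumption, and OpenAI’s client stands in for whatever model provider you use:

```python
import requests
from openai import OpenAI

ORG_URL = "https://<YOUR_ORG_ID>.org.coveo.com"

def retrieve_passages(question: str, user_token: str) -> list[dict]:
    # Forward the user's own token so the PR API enforces that user's
    # document-level permissions at query time.
    resp = requests.post(
        f"{ORG_URL}/rest/search/v3/passages/retrieve",  # assumed endpoint path
        headers={"Authorization": f"Bearer {user_token}"},
        json={"query": question, "maxPassages": 5, "searchHub": "Main"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

def answer_question(question: str, user_token: str) -> str:
    passages = retrieve_passages(question, user_token)

    # Prompt construction stays under your control: the retrieved
    # passages become the model's only context.
    context = "\n\n".join(p["text"] for p in passages)
    prompt = (
        "Answer using only the context below. If the context is "
        f"insufficient, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

    # Any LLM works here; OpenAI's client is just one example.
    completion = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```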
Power a Chatbot That Actually Answers
Use Answer API for a complete question-answering flow. The Answer API uses an external LLM to generate responses, so you don’t need to manage LLM infrastructure.
How it works:
- User asks a question
- Your app calls the Answer API with the query
- Answer API retrieves relevant passages, generates an answer, and includes citations
- Returns a streaming response with answer text and source documents
- Your UI displays the answer with clickable citations
The Answer API handles indexing, permission filtering, semantic ranking, passage extraction, prompt engineering, and response generation. Works with Bedrock, Copilot, or custom orchestration.
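As a sketch, the streamed events can be folded into a single object your UI renders; the event and field names below are illustrative placeholders, not the exact Answer API schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChatAnswer:
    text: str = ""
    citations: list[dict] = field(default_factory=list)

def fold_stream(events) -> ChatAnswer:
    """Collapse a stream of parsed Answer API events into one renderable
    answer. The payloadType values checked here are placeholders."""
    answer = ChatAnswer()
    for event in events:
        kind = event.get("payloadType", "").lower()
        if "text" in kind:
            answer.text += event.get("textDelta", "")
        elif "citation" in kind:
            answer.citations.extend(event.get("citations", []))
    return answer
```

Your UI then renders the accumulated text as it streams and attaches the citations as clickable source links.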
Add Contextual Search to Any App
Use the Search API for standard search functionality across your content. Unlike the PR API, which returns passages, the Search API returns complete document results with titles, URIs, and excerpts.
How it works:
- User enters a search query
- Your app sends the query to the Search API
- API returns ranked results with highlighted excerpts
- Display results as a list with titles, snippets, and links
Good for support portals, intranets, or product documentation where users want to explore results rather than get a single answer. The Search API also supports faceting and grouping for filtering large result sets.
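For example, here’s a minimal rendering pass over the response shown earlier, using its documented title, excerpt, and clickUri fields (the render_results helper is hypothetical):

```python
def render_results(response: dict) -> str:
    """Format Search API results as a plain-text list; a real UI would
    emit HTML with clickable links instead."""
    lines = []
    for result in response["results"]:
        lines.append(f'{result["title"]} ({result["percentScore"]:.0f}%)')
        lines.append(f'  {result["excerpt"]}')
        lines.append(f'  {result["clickUri"]}')
    return "\n".join(lines)
```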
Future-Proof with MCP Server
Use the Coveo MCP Server to equip your agent with the search APIs it needs. Through the Model Context Protocol, all Coveo retrieval APIs become standardized tools accessible to any MCP-compatible client, such as Claude Desktop, ChatGPT, or custom agents.
How it works:
- Enable and configure the Coveo MCP Server in the Coveo platform
- Connect it to your MCP-compatible client (Claude Desktop, custom agent, etc.) by using API credentials
- Your client gets access to all four APIs as tools: search, passage retrieval, answer generation, and fetch
Why this approach:
- Future-proof: as you add new use cases, you already have access to all enterprise data without rewriting integrations
- Works with any MCP-compatible client, giving you flexibility to switch tools
- You still control the prompts that guide when each tool gets used
The MCP Server acts as a bridge between standardized AI tooling and Coveo’s retrieval infrastructure. It’s available as an open-source project you can run locally or deploy to your infrastructure.
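As an illustration, an MCP-compatible client such as Claude Desktop registers a server through a JSON config along these lines. The command, package name, and environment variable names here are hypothetical, so follow the Coveo MCP Server project’s README for the real ones:

```json
{
  "mcpServers": {
    "coveo": {
      "command": "uvx",
      "args": ["coveo-mcp-server"],
      "env": {
        "COVEO_API_KEY": "<API_KEY>",
        "COVEO_ORGANIZATION_ID": "<ORG_ID>"
      }
    }
  }
}
```

Once registered, the client discovers the search, passage retrieval, answer generation, and fetch tools automatically and decides when to call each one based on your prompts.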

What We Don’t Do
RAG-as-a-Service is infrastructure, not a framework. We don’t lock you in:
- We don’t force you into a specific LLM provider—use Salesforce Agentforce, AWS Bedrock, Microsoft Azure, or your own models
- We don’t require you to migrate your data—connect to your existing sources in place
- We don’t lock you into a monolithic platform—use our APIs where they make sense
Use what works for your stack. Bring your own LLM, use your existing orchestration tools, and integrate at whatever layer makes sense for your application.
| When Building Your Own RAG Makes Sense | When RAG-as-a-Service Makes Sense |
| --- | --- |
| You need extreme customization of retrieval algorithms beyond standard configuration | You need to ship in weeks, not quarters |
| Your data patterns are highly unusual and don’t fit standard indexing approaches | Managing permissions across 50+ data sources isn’t your team’s core focus |
| You have dedicated ML engineers focused on search infrastructure | Your team wants to focus on the application layer, not infrastructure |
| You want full control over every component in the stack | You need enterprise-grade relevance and security working out of the box |
| | You want deployment flexibility (fully managed SaaS or API-only integration) |
We handle the infrastructure complexity while preserving your flexibility to build, letting your team focus on features that solve actual problems for your users.
Built to Scale With You
RAG-as-a-Service isn’t just a product. It’s the retrieval layer behind Coveo’s agentic AI strategy. The same infrastructure powers:
- Coveo’s Relevance Generative Answering (CRGA)
- Agent orchestration frameworks using Bedrock and Microsoft Copilot
- Enterprise search built by engineering teams at Fortune 500 companies
Start small and scale up, or go from development to enterprise-wide deployment without rearchitecting.
Ready to Build?
Get cloud-native retrieval infrastructure that reduces hallucinations and respects permissions. The same system trusted by engineering teams building production AI applications.

