I’ve seen countless teams rush to build AI agents, only to hit a wall when their proof-of-concept can’t handle real-world complexity. The problem isn’t the technology—it’s the mindset. Too many developers treat AI agent development like they’re just wiring up an API: plug in a language model, connect it to some data, and hope for magic.
But here’s what I’ve learned from building production-grade AI systems: enterprise agents aren’t monolithic. They’re orchestrated systems where multiple functional and reasoning steps work in concert. You can’t cherry-pick one impressive component and call it done. Success comes from understanding how all the pieces fit together—and being deliberate about each one.
The question-answering agent we’ll walk through today is a perfect example. It goes well beyond the basic chatbot everyone’s already created. What makes it interesting is the architecture: treating each step as a distinct node where you can call different AI models optimized for specific tasks. This isn’t just theoretical elegance—it’s practical flexibility. Instead of forcing a single model to be a jack-of-all-trades, you get to choose the best tool for each job.
This is the foundation for something that can actually scale to enterprise needs. Let me show you how it works.
Building AI Agents: A Step by Step Framework


Question Answering Agent Pipeline Components
| Component | Purpose | Key Actions |
| Security | Prevent malicious or sensitive data from being processed | Never trust user input; implement guardrails and PII checks; use local AI model for screening |
| Query Analysis | Identify intent behind the user input | Distinguish greetings vs. genuine questions; optimize response strategy accordingly |
| File and Image Analysis | Expand utility beyond text inputs | Extract text from screenshots and images; enrich query context for better responses |
| Sentiment Analysis | Adapt responses based on user emotion | Detect frustration or stress; adjust language and tone dynamically |
| Query Optimization | Improve query effectiveness for retrieval systems | Use conversation history and context; enhance queries via AI |
| Complexity Analysis | Route queries based on their difficulty level | Identify simple vs. complex queries; apply different processing methods |
| Question Decomposition | Break down complex queries into manageable parts | Split into sub-questions; gather info from multiple topics |
| Query Expansion | Increase coverage of search results | Use synonyms and domain terms; apply thesaurus rules |
| Response Generation | Produce a coherent, relevant answer | Use a powerful model with large context window; synthesize from multiple sources |
| Response Evaluation | Ensure output meets quality standards | Check completeness and accuracy; assess clarity |
| Clarification Check | Fill in gaps when needed | Use faceted options to clarify; avoid incomplete responses |
| Hallucination Check | Determine acceptability of creative responses | Allow creative output for greetings and hypothetical tasks; block misleading fabrications |
Security: Never Trust User Input
The foundation of any robust AI agent is security. The principle of never trusting user input should guide your system design, as malicious queries must be stopped at the first step. From prompt injections to data leakage attempts, every input carries potential risk.
Your first line of defense is a strong security layer built on configurable guardrails. Most agentic platforms provide these systems to filter unwanted content. They are essential for redacting personally identifiable information (PII), blocking forbidden topics, and detecting malicious inputs before they are processed.
For advanced protection, you can supplement these guardrails with a specialized AI model, which can be hosted on-premises for maximum data privacy. This model can analyze inputs for nuanced threats that might bypass rule-based filters. This modular approach allows you to layer different security mechanisms, creating a defense-in-depth strategy.

One effective approach is using a separate AI model that runs on your premises specifically for security screening. This ensures that potentially sensitive information never leaves your infrastructure. You maintain complete control over what gets processed and what gets blocked before it reaches your main system.
Stay open-minded about your architecture; each step in your pipeline can utilize different models optimized for their specific security requirements.
Relevant reading: Safeguarding Your Enterprise: 5 Pillars of GenAI Security
Query Analysis: Understanding Intent
Before diving into retrieval and response generation, your system must first determine what type of interaction it’s handling. Is the user asking a genuine question that requires searching your knowledge base, or are they simply making small talk with a greeting like “hello”?
This categorization step is crucial because it determines your entire response strategy. If someone offers a simple greeting, you don’t want to trigger complex retrieval processes searching through enterprise documents. Conversely, when someone asks a detailed question, you need to recognize that intent and route it to your retrieval system.
Being able to categorize and analyze queries saves computational resources and ensures users receive responses appropriate to their intent.
File and Image Analysis: Beyond Text Interactions
Don’t limit your AI agent to text-only interactions. Users often need to share screenshots, error messages, documents, or other visual information when seeking help. Integrating file and image analysis capabilities significantly expands your agent’s utility.
Use specialized AI models to analyze uploaded images, extract text from screenshots, and understand visual context. This analysis should feed back into your query optimization and processing pipeline, enriching the context available for generating comprehensive responses.
While I’ve placed this component toward the end of our discussion, file and image analysis could be integrated earlier in your pipeline – wherever visual context adds the most value to your specific use case.
Sentiment Analysis: Adaptive Communication
Modern AI agents can go beyond simple question-answering by adapting their communication style based on user sentiment. This adds a valuable dimension to user interactions, allowing your agent to respond appropriately to the user’s emotional state.
During conversations, analyze user input for emotional indicators. A neutral tone can receive standard responses, but when you detect frustration or irritation, adjust your response strategy accordingly. Just as human agents adapt their communication style in phone or email interactions, your AI agent can employ calming language or more supportive phrasing when users seem stressed.
This adaptive approach creates more natural, empathetic interactions that can significantly improve user satisfaction and reduce the need for escalation to human support.
Query Optimization: Enhancing User Input
Raw user queries are rarely perfect for retrieval systems. Query optimization takes advantage of all available signals in your interface and system to enhance the original query for better results.
Consider leveraging user context and conversation history from prior interactions. You want to blend all these signals together to generate the most effective query possible from what the user has provided.

Rather than trusting the user query as-is, analyze it and optimize it using an AI model. This approach can dramatically improve the quality and relevance of your retrieval results.
Complexity Analysis: Routing Different Query Types
Not all questions require the same level of processing. Analyzing question complexity early in your pipeline allows you to route different types of queries to appropriate handling strategies.
Simple questions can typically be answered with a single retrieval operation. Straightforward factual queries that can be satisfied with one or two relevant passages from your knowledge base.
A complex task or question, however, requires a different approach. These are questions that need more than one round of retrieval to be answered properly. They might require extensive passage analysis or complete documents from multiple sources. Sometimes what appears to be a simple request like “give me all instances of X, Y, and Z” actually requires complex retrieval across numerous documents.
The complexity isn’t necessarily in the question structure itself, but rather in the retrieval and analysis required to provide a complete, accurate answer. This classification step is one of the most important in your entire pipeline, so don’t cut corners here.
Question Decomposition: Breaking Down Complex Queries
Once you’ve identified a complex question, the next step involves decomposing it into smaller, complementary sub-questions. This approach broadens your retrieval scope and allows you to gather comprehensive information across multiple topics before synthesizing everything into a complete answer.
Effective question decomposition transforms a single complex query into multiple targeted questions that your retrieval system can handle more effectively. This strategy ensures you capture all relevant aspects of the user’s information need rather than missing important details due to overly narrow retrieval.
Query Expansion: Maximizing Retrieval Coverage
Query expansion builds on your processed questions by enhancing them further. Consider using thesaurus rules from existing query pipelines if you have them available. Other strategies include synonym expansion and domain-specific terminology variations.
The goal is to enhance your sub-questions to retrieve more relevant data and increase the effectiveness of your retrieval system. Think of this as casting a wider net while maintaining relevance; you want to ensure you don’t miss important information due to terminology mismatches between your users and your knowledge base.
Advanced Retrieval Logics
Once your pipeline has optimized, expanded, and decomposed the user query, it’s time to decide how to retrieve and assemble the information that will be used to generate the answer. Not all retrieval strategies are equal; each serves a distinct purpose depending on your data granularity, response precision, and system performance goals.
Below are four advanced retrieval logics you can integrate into your custom agent, ranging from lightweight retrieval for fast answers to deep retrieval for high-context synthesis.
1. Document-Level Question-Answering
This is the most direct retrieval strategy. The agent performs a semantic search across entire documents and returns the top-ranked results. It then prompts the language model using the full text snippets or summaries of these documents.
When to use:
- When users ask broad questions that require context across full documents.
When you want faster responses with minimal passage fragmentation.
Trade-offs:
- Fast and simple to implement, but sometimes includes irrelevant sections.
- Works best when documents are concise or well-structured.
This approach is ideal for high-level summaries, report lookups, or questions like “What does the security policy cover?”.
2. Passage-Level Question-Answering
Instead of working at the document level, this method retrieves smaller, semantically coherent text segments (“passages”) from across your knowledge base using the Coveo Passage Retrieval API. Each passage is treated as an atomic unit of meaning, allowing the LLM to ground its reasoning on the most relevant and contextually rich text.
When to use:
- When precision matters more than breadth.
- When your content includes lengthy documents where only small sections are relevant for a specific query.
Trade-offs:
- Requires fine-tuned retrieval and relevance scoring.
Produces highly accurate but sometimes narrower context.
This strategy excels in use cases like technical troubleshooting or product documentation where each passage may contain the direct answer for direct questions such as “What is the warranty period for product X?” or “How do I reset my password?”.
3. Direct Question Answering (Answer API)
The Answer API takes retrieval one step further. Instead of just fetching documents or passages, Coveo’s Answer API generates a pre-synthesized answer grounded in retrieved sources—complete with citations. Your custom agent can then use this generated response as high-quality context for its own reasoning or as a direct user response.
When to use:
- When you need a concise, pre-grounded answer without manual synthesis.
- When latency is more critical than custom LLM orchestration.
Trade-offs:
- You trade some flexibility for speed and out-of-the-box reliability.
- Ideal for production-ready question-answering experiences that require factual grounding.
Coveo Answer API effectively acts as a self-contained RAG layer, combining retrieval and generation in one optimized call.
4. Deep Question-Answering
For complex or multi-part questions, use a multi-stage retrieval pipeline. Here, the agent first performs a broad retrieval to collect the top results, then fetches full document bodies from those hits for deeper inspection. Each document is re-analyzed to extract key content, which is then combined and synthesized by the LLM.
When to use:
- For multi-hop reasoning or exploratory queries like “Compare the onboarding processes across departments and summarize differences.”, “Compare and contrast the features of our top three competitors” or “Compile all project updates from the last month related to the ‘Phoenix’ initiative.”
- When each answer component might reside in different documents.
Trade-offs:
- Higher computational cost and latency.
- Delivers the richest, most comprehensive responses for complex enterprise questions.
This deep retrieval loop mimics how a human researcher would operate: finding relevant sources, reading them fully, and synthesizing insights across multiple perspectives.
Choosing the Right Retrieval Logic
Each logic represents a distinct balance of precision, recall, and processing depth. Your agent can dynamically select the right strategy based on query complexity or user intent:
| Query Type | Recommended Logic |
| Simple factual question | Passage-Level Question-Answering |
| Broad topic exploration | Document-Level Question-Answering |
| Concise summary with grounding | Direct Question-Answering |
| Complex analytical or multi-source query | Deep Question-Answering |
By orchestrating these strategies together, your AI agent evolves from a simple chatbot into a context-aware reasoning system, capable of adapting its retrieval depth to each user’s needs—combining speed, accuracy, and enterprise-grade reliability.
Response Generation: The Main Event
This is the component everyone anticipates: generating the actual response using all the retrieved passages and processed queries you’ve gathered through your pipeline.
Response generation often requires the most powerful model in your system. You need both large context windows and sophisticated reasoning capabilities to synthesize information from multiple sources while maintaining coherence across potentially lengthy responses.
Consider using your strongest available model for this critical step, as response generation quality often determines overall user satisfaction with your system. Remember that different steps in your pipeline can use different models optimized for their specific tasks – response generation may warrant your premium model choice while other components can use more efficient alternatives.
Response Evaluation: Quality Control
Generating a response isn’t the end of your process – you need to evaluate whether that response meets your quality standards. Ask yourself: Is the answer complete? Does it accurately address the user’s question? Is it missing critical information that should prompt further clarification?
Response evaluation acts as a quality control checkpoint, identifying responses that may need improvement, additional clarification, or further processing before reaching the user.
Clarification Check: Handling Information Gaps
When your response evaluation identifies gaps or ambiguities, the clarification check determines whether you need to return to the user for additional information. Rather than providing incomplete or potentially misleading answers, this step ensures users receive the most accurate and complete responses possible.
Consider leveraging faceted search techniques to make clarification easier for users. Instead of asking open-ended clarification questions, provide specific options or categories that users can select from. This approach speeds up the clarification process and reduces friction in the user experience.
Hallucination Check: Managing AI Creativity
Hallucination in AI agents isn’t inherently negative; the key is understanding when it’s appropriate and when it’s problematic. Your hallucination check should analyze AI-generated content to ensure it meets your specific use case requirements.
Consider scenarios where hallucination is acceptable: when a user greets your system with “hello,” you want it to respond conversationally even though that greeting isn’t found in your enterprise data. The response comes from the foundation model, and while it’s not grounded in your specific data, it serves the user’s need for natural interaction.
Similarly, if someone asks your agent to draft a meeting agenda covering specific topics, the resulting document doesn’t exist in your knowledge base. It’s generated based on retrieved information, which is exactly what you want.
The distinction lies between helpful creative responses that enhance user experience and potentially harmful fabrications that could mislead users. As your AI ecosystem matures, you might develop specialized agents for different functions: perhaps a customer service agent that communicates with a specialized retrieval agent, each optimized for their specific roles.
Relevant reading: AI Hallucinations: When No Answer Is the Best Answer
Building the API: Practical Implementation
Now that we’ve covered all the essential components, it’s time to build a practical implementation. Your question-answering agent API should expose necessary parameters while maintaining simplicity for client applications.
Create a basic endpoint using a framework like FastAPI that receives a query and several other parameters. Pay special attention to the chat history parameter – this is where you’ll manage the memory of your chat sessions.

Memory is absolutely essential for conversational AI. Consider this scenario: a user opens your chatbot and asks a question about a bug they encountered. Your system replies with an answer. The user isn’t completely satisfied and wants to know more, so they type “please tell me more.”
Can you use “please tell me more” alone to perform retrieval? Absolutely not. You need to use an AI model to optimize that query based on the previous conversation, including previous questions and answers, to understand what the user is really asking for.
This is why exposing chat history in your API is crucial. Short-term memory through conversation history is essential for any chatbot system. Long-term memory (persistent storage) represents another capability level that we won’t cover in detail here, but short-term conversational memory is non-negotiable.

API Response Design: Transparency and User Experience
Here’s an example of how the API could respond. Each of the steps that we’re going through is going to be sent back. In my example, I use Swagger UI to show all of the returned steps.


Using a stream technology allows you to return processing steps as they complete rather than waiting for the entire process to finish. This enables real-time progress indication and keeps users engaged during processing.
Consider implementing visual feedback showing your chatbot’s chain of thought. Users can track progress through security checks, query analysis, retrieval, and response generation. This transparency serves multiple purposes: it helps users understand how answers were derived, assists with troubleshooting when responses aren’t optimal, and builds trust by showing the reasoning process.
Even if processing takes 10 seconds, providing step-by-step feedback prevents users from feeling like they’re waiting too long. Animation and loading indicators significantly improve the perceived responsiveness of your system.
Building the API: A Concrete Code Example
While the screenshots of the Swagger UI are helpful for visualizing the API, nothing speaks to a developer more clearly than code. Let’s look at a simplified example of how to define our agent’s endpoint, retrieve context, and generate a response using Python with the FastAPI framework.
1. Defining the Request Model
First, it’s crucial to define a clear data structure for the requests your API will receive. This ensures that clients send the correct information and helps prevent errors. Using a library like Pantic, we can define the AgentRequest model. This model includes the user’s query, conversation history, and other important parameters that control the agent’s behavior.
This snippet is from models.py.
from pydantic import BaseModel, field_validator
from typing import List, Dict
class AgentRequest(BaseModel):
"""
Defines the structure for a request to the AI agent.
"""
query: str
chat_history: List[Dict] = []
language: str = "en"
number_of_passages: int = 5
temperature: float = 0.1
user_context: dict = {}
# ... other parameters from your model 2. Creating the API Endpoint
Next, we define the main API endpoint. The code below, from main.py, sets up a /agent endpoint that receives the AgentRequest and returns a StreamingResponse.
This streaming capability is key to the user experience you described. Instead of making the user wait for the entire chain of thought to complete, we can send back updates as each step of the process finishes, from security checks to final response generation.
import uvicorn
import json
import asyncio
from fastapi import FastAPI, Depends
from fastapi.responses import StreamingResponse
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from models import AgentRequest # Import the model from the previous step
# Initialize the FastAPI application
app = FastAPI(title="Custom Coveo Agent API")
security = HTTPBearer()
@app.post("/agent")
async def agent_endpoint(
req: AgentRequest,
credentials: HTTPAuthorizationCredentials = Depends(security),
):
"""
Agent endpoint that provides progressive feedback via a streaming response.
"""
async def generate_stream():
# --- This is where the magic happens ---
# 1. Emit an event for "input_received"
yield f"data: {json.dumps({'type': 'step', 'step': 'input_received'})}\\n\\n"
await asyncio.sleep(0.5) # Simulate work
# 2. Perform security checks and emit event
yield f"data: {json.dumps({'type': 'step', 'step': 'security_check'})}\\n\\n"
await asyncio.sleep(0.5)
# 3. Optimize the query and emit event
yield f"data: {json.dumps({'type': 'step', 'step': 'query_optimization'})}\\n\\n"
await asyncio.sleep(0.5)
# 4. Retrieve passages and emit event
yield f"data: {json.dumps({'type': 'step', 'step': 'passage_retrieval'})}\\n\\n"
await asyncio.sleep(0.5)
# 5. Generate the final response and emit a "complete" event
final_response = "Here is the final, synthesized answer to your question."
yield f"data: {json.dumps({'type': 'complete', 'response': final_response})}\\n\\n"
return StreamingResponse(generate_stream(), media_type="text/event-stream")
# To run this application:
# uvicorn main:app --reload
3. Retrieving Passages from Coveo
At the heart of our agent’s retrieval capabilities is the call to the Coveo API. The agent needs to take the optimized query and fetch relevant passages that will be used to generate the final answer. The following function, found in retriever.py, handles this process.
This function constructs and sends an HTTP POST request to the Coveo Passage Retrieval API endpoint specified in your .env file.
import os
import requests
from utils import SearchContext, Passage
# Load the Coveo API URL from environment variables
COVEO_BASE_URL = os.getenv("COVEO_BASE_URL")
def retrieve_passages(search_context: SearchContext, number_of_passages: int = 5):
"""
Retrieves passages from the Coveo Passage Retrieval API.
"""
if not search_context.bearer_token or not search_context.organization_id:
raise ValueError("Bearer token and Organization ID are required")
headers = {
'Authorization': f'Bearer {search_context.bearer_token}',
'Content-Type': 'application/json',
}
payload = {
"query": search_context.q,
"maxPassages": number_of_passages,
"context": search_context.context or {},
}
try:
response = requests.post(COVEO_BASE_URL, headers=headers, json=payload)
response.raise_for_status()
data = response.json()
items = data.get('items', [])
passages = []
for idx, item in enumerate(items):
p = Passage(
text=item.get('text', ''),
relevance_score=item.get('relevanceScore', 0.0),
document_title=item.get('document', {}).get('title', ''),
clickable_uri=item.get('document', {}).get('clickableuri', ''),
position=idx
)
passages.append(p)
return passages
except requests.HTTPError as http_err:
print(f"HTTP error occurred: {http_err}")
return []
except Exception as e:
print(f"An error occurred: {e}")
return []
4. Generating an Answer with the LLM and Retrieved Passages
Once we have the relevant passages from Coveo, we can perform the core action of a RAG (Retrieval-Augmented Generation) system: using that context to generate a high-quality, factual answer. The generate_passage_based_answer function from llm_utils.py is responsible for this.
It formats the retrieved passages into a detailed system message, which instructs the LLM to use the provided text as the source of truth for its response. This grounds the model’s answer in your enterprise data and enables source attribution.
from typing import List
from utils import Passage
def generate_passage_based_answer(
client, # The LLM client (e.g., AzureOpenAI)
passages: List[Passage],
chat_history: List[dict],
temperature: float = 0.0
) -> str:
"""
Generates an answer using retrieved passages as context for the LLM.
"""
system_message = (
"You are a helpful assistant. Use the retrieved passages to answer the user query. "
"Indicate which passages you reference by including their position numbers [X] in your answer."
)
if passages:
passages_info = "\\n\\n".join(
f"[{p.position}] Passage {i+1}:\\n{p.text}" for i, p in enumerate(passages)
)
llm_formatted_system_message = (
f"{system_message}\\n\\n"
"Retrieved passages:\\n\\n```\\n\\n"
f"{passages_info}\\n"
"```"
)
else:
llm_formatted_system_message = f"{system_message}\\n\\nRetrieved passages:\\n\\n```No passages were retrieved.```"
messages = chat_history + [{"role": "system", "content": llm_formatted_system_message}]
try:
response = client.chat.completions.create(
model="your-llm-model-name",
messages=messages,
temperature=temperature
)
final_response = response.choices[0].message.content.strip()
# Mark which passages were actually used in the response
if passages:
for p in passages:
if f"[{p.position}]" in final_response:
p.selected = True
return final_response
except Exception as e:
print(f"Error generating response from passages: {e}")
return "I apologize, but I'm having trouble generating a response."
Performance Trade-Offs: Balancing Accuracy, Speed, and Cost
While designing your RAG pipeline, it’s essential to consider the performance trade-offs that come with different architectural choices. For example, routing every step to your most powerful model may yield the best accuracy, but it can also increase latency and operational costs.
A more balanced approach is to use smaller, faster models for tasks like intent classification or query expansion, and reserve large models for response generation or complex synthesis. Similarly, decisions about where to host inference—on-premises for privacy or in the cloud for elasticity—impact scalability and cost efficiency.
Finally, latency isn’t just a backend issue; even a two-second delay can affect user trust, so techniques like progress indicators and response streaming become critical.
By weighing these trade-offs upfront, you can design an agent that balances accuracy, speed, and cost while still delivering an enterprise-grade experience.
Practical Applications
To make these components more tangible, consider how they apply across common enterprise scenarios.
In customer service, an AI agent can handle tier-one questions by analyzing intent, retrieving troubleshooting steps, and escalating only when sentiment analysis detects frustration.
For knowledge management, employees can query internal documentation with complex requests like “summarize all onboarding processes for remote hires,” where query decomposition and expansion ensure full coverage across policies and guides.
In technical troubleshooting, engineers can upload screenshots or error logs, allowing the agent to combine file and image analysis with retrieval pipelines to pinpoint relevant solutions.
These examples show how modular AI agent design directly translates into real-world value, reducing costs, improving efficiency, and enhancing user satisfaction.
Conclusion
Building an enterprise-grade AI agent requires orchestrating multiple specialized components, each serving a critical role in delivering accurate, secure, and user-friendly responses. From initial security checks that protect your system to final sentiment analysis that personalizes interactions, every component contributes to the overall user experience.
The key insight is that an effective AI agent isn’t built around a single powerful model, but through thoughtful integration of specialized components, each potentially using different AI models optimized for a specific task. Security, query analysis, complexity routing, retrieval optimization, response generation, and quality control must all work together seamlessly.
Transparency and user experience are just as important as technical capability. Users appreciate understanding how their questions are processed, and streaming responses with progress indicators keep them engaged throughout the interaction.
Whether you’re building a customer service agent, internal knowledge assistant, or specialized technical support system, these foundational components provide the blueprint for creating AI agents that truly serve business needs. Start with these essentials, then expand and customize based on your specific requirements and use cases.
The future of agentic AI lies not in individual model improvements alone, but in sophisticated orchestration of specialized components working together – much like a well-coordinated team where each member contributes their unique expertise to achieve superior collective results.

