Over the past 12 months, the role of AI in commerce has shifted dramatically. Large Language Models are no longer used only to generate content or answer questions. Increasingly, they are used to act: to browse, evaluate, plan, compare, retrieve, decide, and execute multi-step tasks through APIs and tools. This shift has sparked a new wave of agentic commerce research papers, exploring how AI agents are reshaping shopping behavior and enterprise product discovery.
This transition – from generative and conversational systems to agentic ones – is creating a new research landscape around shopping and product discovery. And it coincides with a moment in which platforms like Coveo, named a Leader in the IDC MarketScape for Generative AI in Product Discovery, are investing heavily in retrieval-first, controlled, and safe GenAI that complements (rather than replaces) traditional search and recommendation architecture.
Coveo has been building toward this shift for years: hybrid lexical/vector retrieval, governed RAG, robust analytics, merchandising control, and even retail media capabilities for customers who want monetization frameworks. Many of the themes in the emerging research align with this direction.
Below are ten of the most significant agentic commerce research papers from 2025, introduced in accessible terms and connected to the real challenges retailers face today. Each offers a glimpse into how agentic commerce is reshaping product discovery.
1. What Is Your AI Agent Buying?
Allouah et al., 2025
This paper investigates what happens when the buyer on a marketplace is an AI agent instead of a human. Using a mock marketplace where agents shop through APIs, the authors show that different LLMs make very different purchasing decisions, often concentrating on a small set of products. Interestingly, agents show strong position bias and inconsistent reactions to sponsored tags, raising questions about competition and ranking fairness.
Retailers beginning to invest in retail media will immediately recognize the relevance. The findings underline why transparency, governance, and ranking controls matter – areas where platforms like Coveo already provide configurable pipelines that balance monetization and relevance.
2. Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?
Sun et al., 2025
As agentic AI systems like Amazon Rufus take on full shopping tasks, evaluating them becomes increasingly complex. Much recent work has explored the Agent-as-a-Judge paradigm, where LLMs assess static outputs such as code or factual responses. That approach is useful, but it is limited to single-turn, objective scenarios.
This paper goes a step further by asking whether LLMs can act as digital twins of real shoppers in multi-turn conversations. Forty participants completed shopping tasks with Rufus, and the authors recreated each as an LLM-based agent. The results offer the first quantitative evidence that agents can approximate human shoppers: not perfectly, but with interaction patterns that trend in similar directions.

The practical implication is clear: digital twins may become a scalable way to test conversational and agentic shopping experiences without relying solely on live traffic. At Coveo, we already use LLM-based evaluators in controlled contexts—for example, generating or judging SEO-optimized landing pages—and the prospect of extending this evaluation approach to conversational product discovery is both relevant and promising.
3. Cite Before You Speak: Grounded Conversational LLM Agents
Zeng et al., Amazon, 2025
A core requirement for any conversational shopping agent is simple: its product statements must be accurate and grounded. Yet two persistent challenges remain. First, LLMs can hallucinate or generate unsupported claims, undermining trust. Second, when agents present information without showing where it came from, customers have no way to verify it.
This paper introduces a practical solution: a citation experience that attaches evidence – such as product attributes, specifications, or review snippets – directly to the agent’s response. The authors demonstrate large gains in grounding quality through a dedicated evaluation suite, and real A/B tests show meaningful impact, with 3–10% improvements in engagement.
The broader takeaway is clear: grounding isn’t an optional feature in commerce – it’s a prerequisite for trust. The industry is steadily moving toward retrieval-first architectures with explicit evidence attribution. Platforms like Coveo already follow this pattern today, ensuring that generative answers remain anchored in verified content and can provide citations when needed, giving enterprises the reliability layer these agents require.
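To make the citation idea concrete, here is a minimal Python sketch of how an answer could carry its evidence with it. It is not the paper's implementation: the `Evidence` and `Claim` types, the `source_id` naming scheme, and the sample product data are all assumptions made for illustration. The sketch only checks that every citation points at known evidence; judging whether the evidence truly supports the claim is a separate, harder problem.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_id: str  # hypothetical identifier, e.g. "spec:battery"
    text: str       # snippet that could be shown to the shopper


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]  # ids of the evidence backing this claim


def split_by_grounding(claims, evidence):
    """Separate claims whose citations all resolve from those that do not."""
    known = {e.source_id for e in evidence}
    grounded, ungrounded = [], []
    for claim in claims:
        ok = bool(claim.evidence_ids) and all(i in known for i in claim.evidence_ids)
        (grounded if ok else ungrounded).append(claim)
    return grounded, ungrounded


def render(claims):
    """Render claims with bracketed citation markers."""
    return " ".join(f"{c.text} [{', '.join(c.evidence_ids)}]" for c in claims)


# Illustrative data only (not taken from the paper).
evidence = [
    Evidence("spec:battery", "Battery life: up to 30 hours"),
    Evidence("review:1042", "Very comfortable for long flights"),
]
claims = [
    Claim("The battery lasts up to 30 hours.", ("spec:battery",)),
    Claim("Reviewers find it comfortable on long flights.", ("review:1042",)),
    Claim("It is waterproof.", ()),  # no evidence, so it is withheld
]

grounded, ungrounded = split_by_grounding(claims, evidence)
answer = render(grounded)
print(answer)
```

In this sketch the unsupported "waterproof" claim is dropped before rendering, which is one simple way to enforce a "cite before you speak" policy.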
4. A Shopping Agent for Addressing Subjective Product Needs
Dammu, Alonso, Poblete, 2025 (WSDM)
This collaboration between Amazon scientists and academic researchers focuses on a well-known challenge: traditional search systems struggle when the shopper’s need is subjective – choosing a gift, interpreting stylistic preferences, or navigating vague intent. As the authors note, these queries often contain subjective cues or incomplete information that do not exist in the product catalog, leading to poor retrieval and higher user effort.

To address this, the paper introduces an agent designed for subjective product needs (SPN). It explores a broad range of products, mines reviews, extracts opinions, and generates gifting or style ideas by reasoning across ambiguity. The system fills in missing details and supports shoppers through exploratory, non-goal-directed tasks where catalog-driven search typically fails.
For most retailers, the challenge is similar: subjective queries are difficult to map to structured attributes. The direction of this research aligns with how modern product discovery is evolving—using query and content understanding, semantic search, and retrieval-augmented generative answering to interpret and deliver on intent even when a shopper’s needs aren’t fully specified. These techniques are becoming essential in real implementations, including systems like Coveo that rely on grounded answers and enriched content to support more complex, subjective shopping journeys.
5. You Say Search, I Say Recs: Agentic Query Understanding at Spotify
Palumbo et al., 2025
This paper examines how an LLM can function as an intelligent router that decides whether a query should trigger search, recommendations, or a hybrid workflow. Spotify applies this approach particularly to exploratory intents such as “new releases for me,” where a keyword search alone is insufficient and a recommendation-style response is more appropriate.
The idea maps directly onto a broader industry shift: search and recommendations are no longer separate systems. Increasingly, the right response depends on interpreting the user’s intent, not just the surface form of the query. This is precisely the principle behind Coveo’s Intent Box concept, where different types of queries (navigational, informational, exploratory, support-oriented) are routed to different downstream engines – whether that’s a precise product list, a generative explanation, a category suggestion, or a troubleshooting answer, as illustrated in our Intent Box ebook.
Coveo’s platform already supports this kind of hybrid orchestration, combining query understanding with the ability to surface the right mix of products, content, and generative guidance. Spotify’s work essentially validates this direction from a research standpoint: the future of discovery depends on interpreting intent first, then choosing the right mechanism to satisfy it.
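The routing pattern described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spotify's or Coveo's implementation: the `Route` names, the handler functions, and especially `toy_classifier` (a keyword heuristic standing in for the LLM call) are hypothetical.

```python
from __future__ import annotations
from enum import Enum
from typing import Callable


class Route(Enum):
    SEARCH = "search"
    RECS = "recommendations"
    HYBRID = "hybrid"


def toy_classifier(query: str) -> Route:
    """Stand-in for the LLM router: a keyword heuristic, not a real model."""
    q = query.lower()
    if "for me" in q or q.startswith("recommend"):
        return Route.RECS      # personal, exploratory intent
    if "like" in q.split():
        return Route.HYBRID    # a concrete item type plus a personal reference
    return Route.SEARCH        # default: plain keyword lookup


def route_query(
    query: str,
    classify: Callable[[str], Route],
    handlers: dict[Route, Callable[[str], str]],
) -> tuple[Route, str]:
    """Classify the intent first, then dispatch to the matching engine."""
    route = classify(query)
    return route, handlers[route](query)


# Hypothetical downstream engines, reduced to strings for the demo.
handlers = {
    Route.SEARCH: lambda q: f"keyword search for '{q}'",
    Route.RECS: lambda q: f"personalized recommendations for '{q}'",
    Route.HYBRID: lambda q: f"search results re-ranked by taste for '{q}'",
}

for q in ("new releases for me", "blue ceramic mug", "jackets like my last order"):
    route, response = route_query(q, toy_classifier, handlers)
    print(f"{q!r} -> {route.value}: {response}")
```

Passing the classifier in as a function keeps the dispatch logic unchanged when the heuristic is swapped for a real model call.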
6. PAARS: Persona-Aligned Agentic Retail Shoppers
Mansour et al., 2025
This paper also focuses on synthetic shoppers, confirming a growing trend in ecommerce: using LLM agents to test experiences, run experiments, or perform market research without relying on large human samples. The problem is that raw LLMs come with biases – brand preferences, rating inflation, or poor representation of certain groups – so their behavior often drifts from that of real customers.

PAARS proposes a way to fix this. The framework automatically mines personas from historical shopping data, equips agents with retail-specific tools, and evaluates how closely a population of agents matches a population of real shoppers. Importantly, the authors measure alignment at the group level, not individual-by-individual, which is far more realistic for simulating market behavior. The results show that persona-driven agents behave more like humans—though the gap isn’t fully closed.
The biggest promise of this work is speed: synthetic shoppers could enable offline A/B testing, rapid UX evaluation, and agent-based surveys long before involving real users. But the paper also makes clear that synthetic behavior still diverges from actual customer signals, which is why many retailers continue to rely on real-time behavioral learning in production. It’s an exciting direction, but one that requires careful alignment to avoid misleading conclusions.
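Group-level alignment is easier to picture with a small worked example. The Python sketch below compares the distribution of purchase categories across a population of human shoppers with that of a population of agents, using total variation distance. The choice of metric is an assumption made for illustration (the summary above does not specify which measure the authors use), and the purchase data is invented.

```python
from collections import Counter


def share(choices):
    """Turn a list of chosen categories into a category -> share mapping."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical shares, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented toy populations: one purchased category per shopper.
human_purchases = ["shoes"] * 5 + ["bags"] * 3 + ["hats"] * 2
agent_purchases = ["shoes"] * 7 + ["bags"] * 2 + ["hats"] * 1

tvd = total_variation(share(human_purchases), share(agent_purchases))
print(f"population-level gap: {tvd:.2f}")  # about 0.20 for this toy data
```

No individual agent is matched to an individual human here; only the two populations are compared, which mirrors the group-level framing described above.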
7. Recommender AI Agent
Huang et al., 2025
This paper looks at a familiar tension in ecommerce: traditional recommender models are great at ranking items using behavioral data, but they can’t explain themselves or carry a conversation. LLMs, meanwhile, excel at reasoning, instruction following, and natural dialogue, but lack detailed knowledge of product catalogs and the subtle behavioral patterns learned by recommenders. Fine-tuning LLMs for every domain isn’t practical.
InteRecAgent offers a simple solution: let each system do what it’s best at. The LLM acts as the conversational brain – interpreting tasks, asking clarifying questions, and guiding multi-step shopping flows – while the recommender engine handles the precise retrieval and ranking. The result is a hybrid agent that’s interactive, explainable, and still grounded in domain-specific signals.
This model reflects how many retailers are already thinking about AI: not LLMs alone, but LLMs orchestrating the proven, high-performance retrieval and ranking systems underneath. For platforms like Coveo, which blend recommendations, semantic understanding, and explainability, this is an especially relevant direction—and one that aligns closely with how real-world product discovery is evolving.
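The division of labor described above can be sketched as a conversational planner that delegates ranking to a tool. Everything in this Python example is hypothetical: `plan` is a keyword stub standing in for the LLM, `recommender_tool` sorts by a stored score rather than calling a trained model, and the catalog is invented.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    category: str
    score: float  # stand-in for a behavioral ranking score


CATALOG = [
    Item("Trail Runner", "shoes", 0.9),
    Item("City Sneaker", "shoes", 0.7),
    Item("Court Classic", "shoes", 0.5),
    Item("Canvas Tote", "bags", 0.8),
]


def recommender_tool(category: str, top_k: int = 2) -> list[Item]:
    """Stand-in for the recommender engine: rank items by stored score."""
    pool = [item for item in CATALOG if item.category == category]
    return sorted(pool, key=lambda item: item.score, reverse=True)[:top_k]


def plan(user_message: str) -> tuple[str, str | None]:
    """Stand-in for the LLM: either call the tool or ask a clarifying question."""
    words = user_message.lower().split()
    for category in sorted({item.category for item in CATALOG}):
        if category in words:
            return "call_recommender", category
    return "ask_clarification", None


def respond(user_message: str) -> str:
    action, category = plan(user_message)
    if action == "ask_clarification":
        return "Which category are you shopping for?"
    picks = recommender_tool(category)
    return f"Top picks in {category}: " + ", ".join(item.name for item in picks)


print(respond("I need shoes for hiking"))
print(respond("something nice"))
```

The planner never ranks anything itself; it only decides what to do next, which is the property that keeps the ranking grounded in the recommender's signals.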
8. ARAG: Agentic Retrieval-Augmented Generation for Personalized Recommendation
This paper from Walmart’s lab takes a fresh look at how to make RAG more useful for recommendations. Traditional RAG pulls in external context but usually relies on simple, static retrieval rules—and that isn’t enough when user preferences are subtle, evolving, or spread across long histories and short sessions.

This paper proposes something more ambitious: a multi-agent RAG pipeline where several specialized LLM agents collaborate. A User Understanding Agent distills long-term and session-level preferences, another agent checks whether candidate items actually align with inferred intent, and additional agents summarize findings before producing ranked recommendations. Instead of one model trying to do everything, ARAG creates a small “team” where each agent handles a specific part of the reasoning process.
The resulting workflow – retrieve, interpret, validate, then rank – looks a lot like the structured, retrieval-first GenAI architectures that are starting to take hold in real-world commerce. It reinforces a key principle for agentic commerce: the most reliable shopping agents will be orchestrators, not monoliths, using LLM reasoning on top of grounded retrieval and proven ranking systems, rather than relying on free-form generation alone.
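The retrieve, interpret, validate, then rank workflow can be illustrated with plain functions standing in for each agent. This Python sketch is not ARAG itself: in the paper the stages are LLM agents, whereas here they are simple tag-matching stubs, and the weights, threshold, and catalog are assumptions chosen for the example.

```python
def understand_user(long_term, session):
    """Stand-in for the User Understanding Agent.

    Merges long-term and session tags, weighting session signals higher
    (the 1.0 and 2.0 weights are arbitrary assumptions).
    """
    prefs = {}
    for tag in long_term:
        prefs[tag] = prefs.get(tag, 0.0) + 1.0
    for tag in session:
        prefs[tag] = prefs.get(tag, 0.0) + 2.0
    return prefs


def retrieve(catalog, prefs):
    """Candidate retrieval: any item sharing at least one tag with the user."""
    return [item for item in catalog if prefs.keys() & item["tags"]]


def alignment(item, prefs):
    """Score how well an item matches the inferred preferences."""
    return sum(prefs.get(tag, 0.0) for tag in item["tags"])


def validate(candidates, prefs, min_score=2.0):
    """Stand-in for the alignment check: drop weakly matching candidates."""
    return [item for item in candidates if alignment(item, prefs) >= min_score]


def summarize(prefs):
    """Stand-in for the summary step: name the strongest preference signals."""
    top = sorted(prefs, key=lambda tag: (-prefs[tag], tag))[:2]
    return "User currently leans toward: " + ", ".join(top)


def rank(candidates, prefs):
    """Final ranking by alignment score, highest first."""
    ordered = sorted(candidates, key=lambda item: alignment(item, prefs), reverse=True)
    return [item["name"] for item in ordered]


# Invented toy catalog and user signals.
catalog = [
    {"name": "Trail shoes", "tags": {"outdoor", "running"}},
    {"name": "Yoga mat", "tags": {"fitness"}},
    {"name": "Rain jacket", "tags": {"outdoor"}},
    {"name": "Desk lamp", "tags": {"home"}},
]
prefs = understand_user(long_term=["outdoor", "fitness"], session=["running", "outdoor"])
candidates = retrieve(catalog, prefs)
validated = validate(candidates, prefs)
summary = summarize(prefs)
ranked = rank(validated, prefs)
print(summary)
print(ranked)
```

Each stage has one job and a narrow interface, so any of them can be replaced by a model call without touching the others.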
9. OptAgent: Multi-Agent Optimization for Query Rewriting
Handa et al., Etsy, 2025
This paper from Etsy is particularly interesting because it tackles one of the hardest problems in ecommerce: query rewriting. When shoppers type vague queries—like “nice mug,” “dress for dinner,” or “tools for camping”—there isn’t a single “correct” rewrite. It’s about capturing what the shopper probably meant, and that’s a judgment call.
OptAgent takes a clever approach: instead of asking one model to decide if a rewrite is good, it asks several AI agents to behave like shoppers and judge the rewrite together. Each agent tries the rewritten query, looks at the products it returns, and decides whether they feel relevant or not. Their combined opinions guide an iterative improvement process that refines the rewrite step by step.

The results are impressive. Across hundreds of real queries, this multi-agent method produced noticeably better rewrites—especially for long-tail or messy queries where there’s little historical data to rely on.
For agentic commerce, the implication is straightforward: important decisions shouldn’t depend on a single model’s instinct. Whether it’s rewriting queries, interpreting intent, or deciding what to show next, using multiple signals—multiple “voices”—leads to safer and more reliable outcomes. OptAgent is an early example of how agentic systems can bring more robustness to the upstream steps that power product discovery.
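A toy version of the "panel of shoppers" loop is sketched below in Python. It is an illustration only, not OptAgent's method: the judges are keyword stubs rather than LLM agents, `propose` edits the query with simple drops and additions rather than a model-generated rewrite, the search is naive substring matching, and the refinement is plain hill climbing. The product titles and modifier list are invented.

```python
from statistics import mean

TITLES = [
    "ceramic coffee mug",
    "handmade ceramic mug gift",
    "plastic travel mug",
    "mug rack",
    "ceramic vase",
]
MODIFIERS = ["ceramic", "handmade"]  # hypothetical vocabulary for rewrites


def search(query):
    """Toy retrieval: titles containing every query term as a substring."""
    terms = query.lower().split()
    return [title for title in TITLES if all(term in title for term in terms)]


def make_judge(wanted):
    """Build a simulated shopper who wants results mentioning `wanted`."""
    def judge(results):
        if not results:
            return 0.0
        return sum(wanted in title for title in results) / len(results)
    return judge


JUDGES = [make_judge("ceramic"), make_judge("handmade"), make_judge("mug")]


def panel_score(query):
    """Average the judges' opinions of the results a query returns."""
    results = search(query)
    return mean([judge(results) for judge in JUDGES])


def propose(query):
    """Candidate rewrites: drop one term, or prepend one unused modifier."""
    terms = query.split()
    drops = [" ".join(terms[:i] + terms[i + 1:]) for i in range(len(terms))]
    adds = [f"{m} {query}" for m in MODIFIERS if m not in terms]
    return [candidate for candidate in drops + adds if candidate]


def refine(query, max_rounds=5):
    """Keep the best-scoring rewrite each round; stop when nothing improves."""
    best, best_score = query, panel_score(query)
    for _ in range(max_rounds):
        candidate = max(propose(best), key=panel_score)
        score = panel_score(candidate)
        if score <= best_score:
            break
        best, best_score = candidate, score
    return best, best_score


best, best_score = refine("nice mug")
print(best, round(best_score, 2))
```

Because these judges only reward keyword matches, the loop tends to favor narrow queries; a realistic panel would judge relevance more holistically, which is where LLM-based shoppers come in.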
10. The Future Is Agentic: Multi-Agent Recommender Systems
Maragheh & Deldjoo, 2025
This paper makes a simple but powerful argument: the next generation of recommendation and discovery systems won’t rely on one big model doing everything. Instead, they’ll look more like a team of AI specialists working together. One agent plans the task, another retrieves products, another checks for mistakes or policy issues, and another explains the results.

It’s a shift from “a model that ranks items” to a coordinated set of agents that can actually think, check, and act.
The authors also point out the pitfalls—agents can hallucinate, disagree with each other, or drift if you don’t have proper oversight. And this is where the relevance to agentic commerce becomes clear. A shopping agent that can understand intent, retrieve the right products, consider constraints, and justify its choices needs exactly this kind of structure. But it also needs the guardrails to stay grounded and predictable.
Put simply: this paper outlines the architecture that real agentic shopping assistants will likely follow – modular, supervised, retrieval-driven – and the challenges retailers must solve to make them safe and effective.
Closing Thoughts
Together, these agentic commerce research papers outline the technical foundations, risks, and opportunities that will define the next generation of enterprise product discovery.
Across these ten papers, a clear pattern emerges: the future of product discovery is becoming increasingly agentic, but not in a way that replaces traditional systems. Instead, these works show how agents can enhance long-standing tasks – query understanding, recommendations, evaluation, or ranking – by adding new layers of interpretation, collaboration, and judgment.
We see “digital twins” and synthetic shoppers emerging as scalable evaluators. We see multi-agent approaches improving recommendations and retrieval. We see LLMs stepping into roles like critic, planner, explainer, and router.
The research points to a future where commerce systems are not just smarter – they are more adaptive, more contextual, and more collaborative, blending the strengths of AI agents with the reliability of the retrieval and ranking engines that have powered ecommerce for decades.

