Thanks to Emile Robitaille for his help with discovering the bug and special thanks to Azfar Khoja, Armand Briere and Henrik Lumini for their help in correcting it.
Coveo is currently evaluating and testing AWS Bedrock models for integration into its Coveo Relevance Generative Answering (CRGA) solution. CRGA delivers secure, enterprise-grade generative answers that are grounded exclusively in the customer knowledge base content, by retrieving the most relevant passages the end-user has access to from the index and providing only that controlled context to the LLM to generate the response. The objective is for the LLM to generate more predictable, repeatable answers grounded in the context retrieved by Coveo Search. This determinism and grounding are key to ensuring customers can rely on Coveo and CRGA.
Given the rapid pace of new LLM releases, Coveo is constantly evaluating the best options for its product. In this evaluation cycle, AWS Bedrock models were investigated, specifically Amazon Nova models. In these experiments, Bedrock is invoked through an internal Coveo library that wraps LangChain AWS under the hood. Because predictability and repeatability are priorities, the model temperature is always set to 0. The temperature parameter controls the randomness of token sampling, where lower values make the outputs more deterministic, while higher values are more likely to explore different tokens during generation. With the temperature set to 0, highly deterministic outputs are expected via greedy token decoding (always selecting the token with the highest log probability) rather than sampling. This aligns with the goal of generating more predictable, repeatable answers grounded in context. It is important to note that setting the temperature to 0 does not guarantee determinism, since some minor variability is expected regardless.
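To make the temperature parameter concrete, here is a toy sketch (pure Python, not the actual Bedrock or Nova sampler) of how temperature rescales logits before sampling, and how temperature near 0 degenerates into greedy argmax decoding:

```python
import math
import random

def sample_token(logits, temperature, rng=random.Random(0)):
    """Sample a token index from logits, with temperature scaling.

    A temperature of ~0 is treated as greedy decoding (argmax), which is
    what inference engines typically do instead of dividing by zero.
    """
    if temperature <= 1e-9:
        # Greedy decoding: always pick the highest-logit token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Inverse-CDF sampling over the softmax distribution
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.5]
greedy = {sample_token(logits, 0) for _ in range(10)}   # always the same token
hot = {sample_token(logits, 1.0) for _ in range(100)}   # several different tokens
print(greedy)
print(len(hot) > 1)
```

Dividing the logits by a small temperature sharpens the distribution toward the argmax; a high temperature flattens it, which is why higher temperatures explore more tokens.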
Well, on paper, it was supposed to be more deterministic. When the model underperformed on the test dataset, the generated answers were examined more closely. Across multiple inferences, the LLM’s output would change drastically for the same query and context. This variability was abnormal at a temperature of 0 and severely degraded the quality of the LLM’s answers, raising concerns about its suitability for CRGA.
Initially, the LLM itself was suspected, given the known variability in how LLMs are served. This hypothesis was set aside because the inference engine is provider-owned, and isolating batching effects to validate the hypothesis was not straightforward.
Instead, a simple test compared OpenAI’s GPT-4.1 and Amazon’s Nova Lite by prompting each, ten times in a row, to generate a “random” number at a temperature of 0. The expectation was that even when asked for a random number, a temperature of 0 would force the model to emit the same token (the exact same number) across all ten runs. GPT-4.1 showed exactly this behaviour, but Amazon Nova Lite was a completely different story: it generated a different number on each inference, confirming that the model wasn’t behaving deterministically.
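That sanity check boils down to a tiny, provider-agnostic harness. In the sketch below, `generate` is a placeholder for whatever call produces a completion (for example a LangChain `invoke` at temperature 0); the stand-in used here is only illustrative:

```python
def count_unique_outputs(generate, runs=10):
    """Call `generate()` `runs` times and count distinct outputs.

    A model behaving deterministically (temperature 0, greedy decoding)
    should yield exactly 1 unique output.
    """
    return len({generate() for _ in range(runs)})

# With a real model this would look something like:
#   count_unique_outputs(lambda: llm.invoke(
#       "Generate a random number", temperature=0).content)
# A deterministic stand-in illustrates the expected result:
print(count_unique_outputs(lambda: "42"))  # 1
```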


This suggested unexpected behaviour related to temperature. To confirm it, an experiment queried the LLM for a random number 100 times at each setting, varying the temperature from 0 to 1 in increments of 0.05. The results are plotted in the figure below. A significant anomaly appears at 0, where the number of unique values generated more closely matches a temperature of roughly 0.6–0.7.
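The shape of that experiment can be sketched offline with a toy sampler over fixed logits: sweep the temperature, run many inferences per setting, and count the unique outputs. The sampler and numbers here are illustrative stand-ins, not the actual model:

```python
import math
import random

def toy_generate(temperature, rng):
    """Stand-in for one LLM inference: sample one token from fixed logits."""
    logits = [2.0, 1.5, 1.0, 0.5, 0.0]
    if temperature <= 1e-9:
        # Greedy decoding at temperature ~0
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def sweep(temps, runs=100, seed=0):
    """Count unique outputs over `runs` inferences at each temperature."""
    rng = random.Random(seed)
    return {t: len({toy_generate(t, rng) for _ in range(runs)}) for t in temps}

temps = [round(0.05 * i, 2) for i in range(21)]  # 0.0 to 1.0 in 0.05 steps
results = sweep(temps)
print(results[0.0])  # 1: temperature 0 should be fully deterministic
```

A healthy model should produce a curve that starts at 1 unique value at temperature 0 and rises with temperature; Nova Lite's curve at 0 instead looked like the middle of that ramp.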

With the plot confirming that there was a problem at temperature 0, the LLM was queried again, but instead of setting the temperature to 0, an extremely low value (1e-10) was used to effectively replicate temperature 0.

The model became more deterministic when the temperature was set to an extremely small value, but not when set to 0. This pointed toward a classic “0 is falsy” bug.
How the Bug Was Found
Once Nova Lite became more deterministic only when an “almost-zero” temperature was used rather than 0, the working theory shifted from “model bug” to “parameter isn’t being sent as 0.” The problem was reduced to the smallest possible reproduction in LangChain AWS (no Coveo code, no internal toolkit), using ChatBedrockConverse and calling stream(..., temperature=0):
from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    model="amazon.nova-lite-v1:0",
    region_name="us-east-1",
)

messages = [("human", "Hello")]

# temperature=0 is a valid value, but it gets dropped from the request payload
for chunk in llm.stream(messages, temperature=0):
    print(chunk.text, end="", flush=True)

Request payload sent to Bedrock, missing the correct inference config:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "Hello"
        }
      ]
    }
  ],
  "system": [],
  "inferenceConfig": {}
}
The inference config that was expected:

{
  "inferenceConfig": {
    "temperature": 0
  }
}

Inspection of the outgoing HTTP request payload showed the smoking gun: it contained an empty inferenceConfig, meaning the temperature was omitted entirely. This instantly explained the behaviour observed earlier: a temperature of 0 was assumed to be enforced, but the value never made it into the Bedrock request, so Bedrock fell back to its default temperature of 0.7, which explains the non-determinism.
Tracing this back to the code in LangChain AWS that builds inferenceConfig revealed a classic Python “0 is falsy” pattern:
"temperature": temperature or self.temperature,
Because 0 is falsy in Python, the expression falls back to self.temperature, which defaults to None, and the field gets dropped from the payload. The Bedrock Converse API then uses the LLM’s default temperature of 0.7. An issue and a pull request were created in the LangChain AWS repository. The fix was straightforward: replace the boolean “or” with an explicit None check so that 0 is treated as a valid value. It was merged on January 14, 2026 (Pull Request)!
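The difference between the buggy and fixed expressions can be shown in isolation. This is a distilled sketch of the pattern, not the actual LangChain AWS code, with hypothetical helper names:

```python
def build_config_buggy(temperature=None, instance_default=None):
    # The "0 is falsy" pattern: 0 or None evaluates to None
    cfg = {"temperature": temperature or instance_default}
    # Fields that resolved to None are dropped from the payload
    return {k: v for k, v in cfg.items() if v is not None}

def build_config_fixed(temperature=None, instance_default=None):
    # Explicit None check: 0 is treated as a valid value
    cfg = {
        "temperature": temperature if temperature is not None else instance_default
    }
    return {k: v for k, v in cfg.items() if v is not None}

print(build_config_buggy(temperature=0))  # {} — 0 is falsy, field dropped
print(build_config_fixed(temperature=0))  # {'temperature': 0}
```

The same trap applies to any parameter whose valid range includes a falsy value, such as top_p=0 or an empty stop-sequence list.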
Learnings
While this caused significant pain during the assessment of Amazon Nova models, several lessons emerged.
Determinism is a product requirement, not a nice-to-have. For CRGA, where the value proposition depends on the quality, repeatability, and groundedness of the answer, non-determinism isn’t a small problem; it becomes a customer-visible quality regression that harms customer trust in Coveo.
Don’t blindly trust the configuration. Significant time was lost assuming that temperature=0 meant the temperature was actually set to 0 in the Bedrock request. Inspecting the actual Bedrock payload from the start would have saved a lot of time.
Determinism sanity tests are invaluable. A simple “random number” prompt run a handful of times makes it easy to check whether a model is deterministic, and would have flagged this behaviour before it showed up on a customer dataset.
This blog post is a reminder that “deterministic” isn’t something to assume from a config knob; it’s something to verify end-to-end in the real request payload, especially when SDK layers sit between the application and the provider. LLMs are inherently probabilistic, and noise can be reduced with temperature, but if guaranteed determinism is strictly required, a more complex solution is needed, such as caching LLM responses.
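When strict repeatability is required, one pragmatic option mentioned above is caching responses keyed on the full request. A minimal in-memory sketch (illustrative only, not CRGA's implementation; `CachedLLM` and `fake_generate` are hypothetical names):

```python
class CachedLLM:
    """Wrap a generate function and memoize responses by (prompt, params)."""

    def __init__(self, generate):
        self._generate = generate
        self._cache = {}

    def invoke(self, prompt, **params):
        # Params are sorted so the cache key is order-independent
        key = (prompt, tuple(sorted(params.items())))
        if key not in self._cache:
            self._cache[key] = self._generate(prompt, **params)
        return self._cache[key]

calls = []
def fake_generate(prompt, **params):
    calls.append(prompt)  # track how often the underlying model is hit
    return f"answer to {prompt}"

llm = CachedLLM(fake_generate)
first = llm.invoke("What is Coveo?", temperature=0)
second = llm.invoke("What is Coveo?", temperature=0)
print(first == second, len(calls))  # True 1: second call never reaches the model
```

A production version would also need eviction, persistence, and cache invalidation when the underlying index content changes, but the guarantee is the same: identical requests return byte-identical answers.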
Finally, it was definitely a surprise to find such an obvious bug impacting a core feature of LLM interaction that hadn’t been discovered or tested in the LangChain AWS repo. If you’re using Bedrock through LangChain AWS, make sure you’re on a version that includes the fix, or avoid temperature=0 and use an explicit near-zero value until you can upgrade.

