Guide🧠: How to reduce AI hallucinations with Evals
Product managers need to ensure their AI-powered products deliver real value, perform reliably, and maintain user trust.
Imagine this:
an e-commerce platform launches an AI-powered search bar designed to understand shopper intent and surface highly relevant products.
But users soon notice the AI hallucinating—returning confident but incorrect product matches like listing “wireless headphones” when the query was clearly “wireless mouse.”
Or worse, suggesting out-of-stock or non-existent items with convincing product descriptions. These hallucinations frustrate customers, damage trust, and lead to lost sales.
AI hallucinations threaten user trust and product reliability, making it crucial for Product Managers and product teams to proactively detect and reduce these errors.
One of the most effective ways to do this is through AI evaluations, or AI evals.
In today’s newsletter, we give a basic guide on how Product Managers can address this:
🔹 What Are AI Evaluations (AI Evals)?
🔹 Why AI Evals Are Required
🔹 When to Use AI Evals
🔹 Key Metrics AI Evals Improve
🔹 Setting Up AI Evals Internally + Workflow
🔹 Case Study: Tackling AI Hallucinations in E-Commerce Search
🔹 External Help for AI Evals
🔹 Takeaways for Product Managers and Leaders
🔍What Are AI Evaluations (AI Evals)?
AI evals are structured, systematic processes used to assess how well AI models perform—measuring accuracy, relevance, safety, coherence, and other key output qualities.
Unlike traditional software testing, AI evals focus on probabilistic, generative outputs, using curated datasets, automated scoring, and human reviews to yield actionable insights.
They form the essential quality control mechanism for modern AI systems.
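To make this concrete, here is a minimal sketch of an eval loop in Python. Everything in it is illustrative: `run_search_ai` stands in for whatever model or API your product actually calls, the SKUs are invented, and the scorer is a simple exact-match check rather than a production-grade metric.

```python
# Minimal eval loop: run curated test cases through the AI system and
# score each output against a verified expectation.

def run_search_ai(query: str) -> list[str]:
    """Stand-in for your real AI search system; assumed to return the
    SKUs of the products it surfaces for the query."""
    raise NotImplementedError("Replace with your model or API call.")

# Curated dataset: real user queries paired with verified ground truth.
# SKUs here are invented for illustration.
test_cases = [
    {"query": "wireless mouse", "expected_skus": ["SKU-1042"]},
    {"query": "organic green tea bags", "expected_skus": ["SKU-3310", "SKU-3311"]},
]

def evaluate(cases: list[dict]) -> float:
    """Fraction of cases where the AI's results exactly match the ground truth."""
    passed = sum(
        set(run_search_ai(c["query"])) == set(c["expected_skus"]) for c in cases
    )
    return passed / len(cases)
```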
⚠️Why AI Evals Are Required
AI models inherently produce probabilistic outputs that may include hallucinations, biases, or unsafe content. Without systematic evaluation, these issues degrade user experience, brand trust, and regulatory compliance. AI evals help:
Detect hallucinations before end users see them.
Measure impact of AI updates or prompt changes.
Align AI behavior with business goals and ethical standards.
Drive data-based decisions for continuous performance improvement.
Build internal and external confidence in AI reliability.
⏰When to Use AI Evals
Deploy AI evals when you observe hallucinations or unsafe outputs, update your models, require objective AI quality metrics, or need to meet business and compliance requirements.
📊Key Metrics AI Evals Improve
AI evaluations typically drive improvements across several key dimensions (a sample per-output scoring record follows the list):
Accuracy: How correct AI outputs are compared to verified facts.
Relevance: How well responses address user questions or intent.
Coherence: Logical consistency and clarity within AI-generated text.
Safety: Absence of harmful, biased, or inappropriate content.
Fluency: Naturalness and readability of AI language.
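One way to operationalize these dimensions is to record a score per dimension for every evaluated output, so each metric can be tracked and aggregated independently. A hypothetical record might look like this:

```python
# Hypothetical per-output eval record: one score (0.0-1.0) per quality
# dimension, so each metric can be tracked and trended independently.
eval_record = {
    "query": "wireless mouse",
    "output": "Acme Pro Wireless Mouse, in stock",
    "scores": {
        "accuracy": 1.0,   # matches verified catalog data
        "relevance": 1.0,  # addresses the user's intent
        "coherence": 1.0,  # internally consistent text
        "safety": 1.0,     # no harmful or fabricated content
        "fluency": 0.9,    # natural, readable language
    },
}
```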
🏗️Setting Up AI Evals Internally: Stakeholders and Workflow
Key Stakeholders:
Product Managers: Define eval goals, prioritize metrics, and align with business needs.
Data Scientists / ML Engineers: Design eval datasets, implement automated eval pipelines, analyze results.
Domain Experts: Provide human reviews for nuanced metrics like relevance, safety, and factual accuracy.
Quality Assurance (QA) Teams: Integrate evals into CI/CD pipelines and support continuous testing.
Legal & Compliance: Review eval frameworks for regulatory, ethical, and safety considerations.
UX/Customer Support: Provide user feedback and flag hallucination cases from real-world usage.
Executives / Stakeholders: Use eval reports for strategic decisions and risk management.
🛒Workflow:
So, imagine you are a Product Manager facing a real-world problem: the AI-powered search in your e-commerce platform hallucinates, confidently suggesting wrong or invented products.
How would a PM systematically tackle and reduce these hallucinations? Here's how you would apply a tried-and-true AI evaluation workflow to fix the issue, step by step:
Define
Set clear evaluation objectives, success criteria, and failure modes; a sketch of how to capture these in an eval spec follows this list. Specifically, you clarify:
Why you are evaluating: to reduce hallucinations that mislead shoppers.
What success looks like: high factual accuracy (correct product matches), relevance (results aligned to user intent), and safety (no inappropriate or false content).
Which failures are unacceptable: hallucinated product matches, incorrect inventory data, fabricated product details.
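A lightweight way to pin these decisions down is a written eval spec the whole team can review. The sketch below is one possible shape, not a standard; the threshold values are assumptions you would set with your stakeholders.

```python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """Illustrative evaluation spec for the e-commerce search eval."""
    objective: str = "Reduce hallucinated product matches in AI search"
    # Success criteria: minimum acceptable scores per metric (assumed values).
    thresholds: dict = field(default_factory=lambda: {
        "accuracy": 0.95,   # correct product matches
        "relevance": 0.90,  # results aligned to user intent
        "safety": 0.99,     # no false or inappropriate content
    })
    # Failure modes that should block a release outright.
    unacceptable_failures: tuple = (
        "hallucinated_product_match",
        "incorrect_inventory_data",
        "fabricated_product_details",
    )
```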
Collect
Gather representative user search queries coupled with verified ground-truth data from the product catalog and inventory system.
The Product Manager can collect a wide range of actual search phrases that shoppers enter in the search bar, such as “wireless mouse,” “ergonomic office chair,” “leather wallet for men,” or “organic green tea bags.”
For each of these real user queries, the PM obtains the verified correct results from the product catalog and inventory system—meaning the exact products currently in stock and correctly matching those queries based on attributes like product name, category, specs, and availability.
This combined dataset of queries and verified ground-truth product matches becomes the test set used to evaluate the AI search bar's outputs. It ensures that the evaluation realistically reflects how users search and what accurate results look like.
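In practice, this test set is often stored as simple structured records, one JSON object per query, so the eval pipeline can replay it on every model change. A sketch with hypothetical catalog fields:

```python
import json

# Each record pairs a real user query with verified ground truth pulled
# from the product catalog and inventory system (field names are illustrative).
test_set = [
    {
        "query": "wireless mouse",
        "expected_products": ["SKU-1042", "SKU-1187"],  # verified matches
        "in_stock": True,
    },
    {
        "query": "ergonomic office chair",
        "expected_products": ["SKU-2201"],
        "in_stock": True,
    },
]

# Persist as JSONL so the eval pipeline can replay it on every model change.
with open("search_eval_set.jsonl", "w") as f:
    for record in test_set:
        f.write(json.dumps(record) + "\n")
```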
Score
Automate scoring with evaluation APIs that measure similarity, factuality, and toxicity where applicable.
For example, a similarity metric checks if the AI’s recommended products match the ground truth. You complement this with human reviews to capture nuanced judgment on relevance and safety in complex queries.
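As a simplified stand-in for a hosted evaluation API, the sketch below scores each response by set overlap between returned and ground-truth SKUs; real pipelines often add embedding similarity or an LLM judge for the nuanced cases mentioned above.

```python
def score_response(returned_skus: list[str], expected_skus: list[str]) -> float:
    """Jaccard similarity between returned and expected product sets:
    1.0 is a perfect match, 0.0 means no overlap (a likely hallucination)."""
    returned, expected = set(returned_skus), set(expected_skus)
    if not returned and not expected:
        return 1.0
    return len(returned & expected) / len(returned | expected)

# One hallucinated SKU drags the score down:
print(score_response(["SKU-1042", "SKU-9999"], ["SKU-1042", "SKU-1187"]))  # ~0.33
```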
Analyze
Review evaluation results systematically to identify critical failure patterns. For instance, you might find hallucinations disproportionately happen on searches involving technical attributes or rare products.
Prioritize which failures most impact user experience and business outcomes—hallucinated product matches on high-value search terms may rank highest.
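Surfacing these patterns can be as simple as grouping scored results by query attributes and ranking failure rates. A plain-Python sketch, where the `category` tag and the failure threshold are assumptions:

```python
from collections import defaultdict

# Scored eval results; `category` is a hypothetical tag added during collection.
results = [
    {"query": "wireless mouse", "category": "electronics", "score": 1.0},
    {"query": "4K 144Hz HDMI 2.1 cable", "category": "technical_specs", "score": 0.2},
    {"query": "rare vinyl pressing", "category": "rare_items", "score": 0.0},
]

FAIL_THRESHOLD = 0.5  # assumed cutoff for counting a case as failed

failures = defaultdict(lambda: [0, 0])  # category -> [failed, total]
for r in results:
    failures[r["category"]][1] += 1
    if r["score"] < FAIL_THRESHOLD:
        failures[r["category"]][0] += 1

# Rank categories by failure rate to prioritize fixes.
for cat, (failed, total) in sorted(failures.items(),
                                   key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{cat}: {failed}/{total} failed ({failed / total:.0%})")
```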
Fix
Based on insights, apply targeted fixes such as prompt engineering to refine AI instructions, integrating retrieval-augmented generation (RAG) to ground AI responses in live product data, or upgrading to a newer, more capable model.
Collaborate with engineers and data teams to deploy these improvements iteratively.
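To illustrate the RAG option: instead of letting the model answer from memory, you retrieve live catalog entries first and instruct it to answer only from them. `search_catalog` and `llm_complete` below are hypothetical placeholders for your retrieval layer and model API:

```python
def search_catalog(query: str) -> list[dict]:
    """Hypothetical retrieval layer: returns live, in-stock catalog entries."""
    raise NotImplementedError("Replace with your catalog/search backend.")

def llm_complete(prompt: str) -> str:
    """Hypothetical model call: replace with your LLM provider's API."""
    raise NotImplementedError("Replace with your model API call.")

def grounded_search_answer(query: str) -> str:
    # Retrieve verified product data first, then constrain the model to it.
    products = search_catalog(query)
    context = "\n".join(
        f"- {p['name']} (SKU {p['sku']}, in stock: {p['in_stock']})"
        for p in products
    )
    prompt = (
        "Answer the shopper's query using ONLY the products listed below.\n"
        "If none match, say no matching product was found. Do not invent items.\n\n"
        f"Products:\n{context}\n\nQuery: {query}"
    )
    return llm_complete(prompt)
```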
Monitor
Set up continuous monitoring with automated regression tests running on new AI versions and product catalog updates.
Make sure hallucination rates don’t creep back up, and performance remains stable or improves over time. Alerts inform your team early if risks arise.
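In CI, this can be as simple as a regression test that fails the build when the hallucination rate exceeds an agreed budget. A pytest-style sketch, assuming the `run_search_ai` and `score_response` sketches from earlier steps are importable, and that the 2% budget is whatever your team actually agrees on:

```python
# test_search_eval.py -- illustrative CI regression gate (pytest style).
# Assumes run_search_ai and score_response from the earlier sketches are importable.
import json

HALLUCINATION_RATE_LIMIT = 0.02  # assumed budget; agree on this with stakeholders

def test_hallucination_rate_within_budget():
    with open("search_eval_set.jsonl") as f:
        cases = [json.loads(line) for line in f]

    hallucinated = 0
    for case in cases:
        returned_skus = run_search_ai(case["query"])  # system under test
        score = score_response(returned_skus, case["expected_products"])
        if score == 0.0:  # no overlap with ground truth: likely hallucination
            hallucinated += 1

    rate = hallucinated / len(cases)
    assert rate <= HALLUCINATION_RATE_LIMIT, f"Hallucination rate {rate:.1%} over budget"
```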
Report
Regularly share evaluation findings, improvement progress, and ongoing risks with leadership and stakeholders.
Transparency builds confidence, aligns priorities, and secures investment in further AI quality initiatives.
🤝External Help for AI Evals: Companies and Specializations
When internal resources or expertise are limited, external AI evaluation providers can accelerate and deepen AI quality assurance. Notable companies and their specialties include:
Patronus AI: Automated LLM scoring, unbiased failure detection, test generation.
Harvey AI: Legal domain expert reviews, citation verification, compliance-focused evals.
Galileo AI: Comprehensive evaluation platforms with multi-stakeholder reporting and monitoring.
Deeper Insights: Custom AI evaluations across NLP, computer vision, and domain-specific tasks.
OpenAI Evals: APIs for building custom evaluation pipelines and continuous quality tracking.
BCG GAMMA: Consulting and governance services integrating AI evaluations into enterprise frameworks.
🚀So, what skillsets can you develop if you start adopting AI evals in your approach as a PM?
AI evals are the cornerstone that enables product managers to ensure their AI-powered products deliver real value, perform reliably, and maintain user trust.
Unlike traditional software testing, AI evals measure how well AI systems behave in complex, unpredictable real-world conditions and impact user experience and business outcomes.
Mastering AI evals allows product managers to steer their AI products confidently, identify issues early, and prioritize improvements that truly matter.
Below are the key skillsets developed through using AI evals:
Clear Objective Setting: Defining success metrics that go beyond simple accuracy, focusing on real user impact and meaningful business outcomes.
Comprehensive Evaluation Techniques: Applying holistic testing methods including offline datasets, live testing, human feedback, and continuous monitoring to ensure AI robustness and reliability.
AI and Data Literacy: Building fluency in AI/ML concepts, data pipelines, model behaviors, and quality metrics to collaborate effectively with technical teams.
Analytical and Experimentation Mindset: Using data-driven approaches to analyze AI performance, prioritize features, run experiments, and iterate based on evaluation insights.
Continuous Learning and Adaptation: Establishing processes to regularly re-assess AI models post-launch, addressing shifting data and user behaviors for sustained performance.
Strategic Leadership: Leveraging AI evaluation results to build stakeholder trust, mitigate risks, and lead teams toward impactful product improvements.
Mastering AI Evals is a foundational skill for any product manager working today.