The GenAI Chatbot Performance Test
In November last year, I wrote about how the KPIs you track as a product builder have been evolving as products incorporate predictive AI and Generative AI elements.
One of the most widely adopted GenAI-powered products has been the chatbot. Generative AI has democratized chatbot development, enabling companies to deploy conversational agents faster and cheaper than ever.
Startups and Fortune 500 firms alike now use AI assistants for everything from customer support to sales.
But here’s the catch: ease of deployment has outpaced accountability.
Many companies fail to track whether their chatbots are boosting efficiency—or silently alienating users and damaging their brand.
Let's look at the case study of DPD, a global logistics giant that screwed up its brand image with an untested chatbot.
I will be speaking at the Product Academy SUMMIT 2025 in Zürich on 26 June 2025. If you want to join the summit, this is your calling!
Offer for paid subscribers:
If you’re a paid subscriber of Productify, you are eligible for a free guest ticket (worth $220) to accompany me to the conference above. To claim it, send an email to productifylabs@gmail.com with your LinkedIn ID and 2-3 lines on why you want to be part of this summit. Upgrade your subscription below:
The case of delivery giant DPD’s chatbot
In January 2024, global delivery giant DPD learned this lesson the hard way. Their AI chatbot, designed to handle parcel tracking, spiraled into chaos when a frustrated customer manipulated it into:
Writing a poem cursing the company’s customer service
Calling itself “a useless chatbot”
Encouraging users to avoid DPD entirely
The interaction went viral, racking up millions of views and media coverage from The Guardian to TechCrunch.
The root cause? Zero safeguards against prompt hacking and no real-time monitoring to flag toxic outputs.
DPD temporarily disabled the bot, but not before suffering reputational harm and operational disruption.
Contrast this with Photobucket, a photo-sharing platform that implemented rigorous KPI tracking with its GenAI chatbot:
Monitored response accuracy (94% of common queries resolved)
Tracked conversion rates (15% increase in premium sign-ups via bot nudges)
Audited hallucination rates monthly to minimize misinformation
By pairing real-time analytics with weekly A/B tests, Photobucket’s chatbot became a revenue driver rather than a liability. Users praised its “human-like helpfulness” in post-chat surveys.
So, how do you go about tracking chatbot performance?
To avoid the trap of an untested chatbot whose performance isn't tracked over time before a full rollout, look at chatbot performance at three levels:
1. Technical Performance
This level tells you how accurate, fast, and reliable the chatbot is, and gives you the signal you need to fine-tune the underlying model against the metrics below (a short scoring sketch follows the list):
Response Accuracy: Measures correctness of answers against a ground truth. Collect data by comparing bot responses to a "golden dataset" of verified answers using tools like Promptfoo.
Hallucination Rate: Tracks the percentage of fabricated or irrelevant responses. Use LLM-as-a-judge frameworks (e.g., DeepEval) or human validation to assess this.
Response Time: Measures average latency to generate replies. Log timestamps from API calls using tools like AWS CloudWatch, aiming for a target of less than 2 seconds.
Intent Recognition Accuracy: Evaluates the ability to classify user requests correctly. Test against labeled datasets using tools like Rasa or Dialogflow.
Data collection tools include automated testing platforms like Botium, LLM evaluators, and server logs.
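To make the first three metrics concrete, here is a minimal sketch of scoring a bot against a golden dataset while timing each call. The `ask_chatbot` stub, the sample questions, and the similarity threshold are placeholders you would replace with your own bot client and evaluation criteria.

```python
import time
from difflib import SequenceMatcher
from statistics import median

def ask_chatbot(question: str) -> str:
    """Placeholder for your bot client; replace with a real API call."""
    return "You can track your parcel using the tracking number in your confirmation email."

# A tiny "golden dataset" of questions paired with verified answers.
GOLDEN_DATASET = [
    {"question": "Where is my parcel?",
     "answer": "You can track your parcel using the tracking number in your confirmation email."},
    {"question": "How do I change my delivery address?",
     "answer": "Delivery addresses can be changed from the tracking page before the parcel is dispatched."},
]

SIMILARITY_THRESHOLD = 0.8  # assumption: how close a reply must be to count as correct

def evaluate(golden: list[dict]) -> dict:
    correct, latencies = 0, []
    for item in golden:
        start = time.perf_counter()
        reply = ask_chatbot(item["question"])
        latencies.append(time.perf_counter() - start)           # feeds the Response Time metric
        similarity = SequenceMatcher(None, reply.lower(), item["answer"].lower()).ratio()
        correct += similarity >= SIMILARITY_THRESHOLD            # feeds Response Accuracy
    latencies.sort()
    return {
        "response_accuracy": correct / len(golden),
        "median_latency_s": median(latencies),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

print(evaluate(GOLDEN_DATASET))
```

String similarity is deliberately crude here; tools like Promptfoo or DeepEval provide more robust graders for the same kind of loop, which is also how hallucination rate is typically scored.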
2. User Experience
This level helps you understand how users behave when interacting with the chatbot. It is the single most important set of metrics if you're looking to reduce dependency on human agents; a small worked example follows the metric list below.
CSAT Score: Measures user satisfaction through post-chat surveys. Collect data by embedding 1–5 star ratings or thumbs up/down buttons after interactions.
Conversation Completion Rate: Tracks the percentage of chats where users achieved their goal. Gather this data by monitoring predefined success markers (e.g., order placed, FAQ resolved) in conversation logs.
Fallback Rate: Measures the percentage of queries the bot couldn't handle. Collect by counting escalations to human agents using CRM/ticketing systems like Zendesk.
Bounce Rate: Shows the percentage of users who exit without engaging. Analyze this through session duration and inactivity via analytics tools such as Google Analytics.
Tools for data collection include post-chat surveys, session replay tools like Hotjar, and CRM systems.
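As a rough illustration of how these four numbers can fall out of the same conversation log, here is a small sketch; the session fields (rating, goal_reached, escalated, messages_sent) are hypothetical and should be mapped to whatever your bot actually records.

```python
# A minimal sketch, assuming each closed chat session is logged with these
# hypothetical fields; map them to your own logging schema.
sessions = [
    {"rating": 5,    "goal_reached": True,  "escalated": False, "messages_sent": 6},
    {"rating": 2,    "goal_reached": False, "escalated": True,  "messages_sent": 3},
    {"rating": None, "goal_reached": False, "escalated": False, "messages_sent": 0},  # user bounced
]

total = len(sessions)
ratings = [s["rating"] for s in sessions if s["rating"] is not None]

csat = sum(ratings) / len(ratings)                                    # average post-chat rating (1-5)
completion_rate = sum(s["goal_reached"] for s in sessions) / total    # chats that hit a success marker
fallback_rate = sum(s["escalated"] for s in sessions) / total         # chats handed off to a human agent
bounce_rate = sum(s["messages_sent"] == 0 for s in sessions) / total  # users who never engaged

print(f"CSAT {csat:.1f}/5 | completion {completion_rate:.0%} | "
      f"fallback {fallback_rate:.0%} | bounce {bounce_rate:.0%}")
```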
3. Business Impact
Ultimately, deploying a chatbot has to make business sense. And if you're in a leadership position, it helps to track the metrics below (a back-of-the-envelope example follows the list):
Deflection Rate: Measures the percentage of queries resolved without human help. Compare chatbot-handled vs. human-handled tickets in CRM reports to collect this data.
Cost Per Conversation: Calculates operational cost savings compared to human agents. Work with finance and ops teams to determine infrastructure costs and labor savings.
Conversion Rate: Tracks the percentage of users completing desired actions (e.g., purchases). Integrate with analytics platforms like Mixpanel to gather this information.
Compliance Rate: Measures adherence to privacy and security policies. Audit logs for data leaks and use moderation APIs like OpenAI Moderation to assess compliance.
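To show how these metrics roll up into a business case, here is some back-of-the-envelope arithmetic; every figure in it is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope sketch; all figures below are made up for illustration.
bot_resolved_tickets = 7_000     # resolved end-to-end by the chatbot this month
human_resolved_tickets = 3_000   # escalated to or handled directly by agents
bot_monthly_cost = 12_000.00     # infra + LLM usage + maintenance (from finance/ops)
cost_per_human_ticket = 6.00     # blended agent cost per resolved ticket

total_tickets = bot_resolved_tickets + human_resolved_tickets
deflection_rate = bot_resolved_tickets / total_tickets
cost_per_bot_conversation = bot_monthly_cost / bot_resolved_tickets
net_monthly_savings = bot_resolved_tickets * cost_per_human_ticket - bot_monthly_cost

bot_users, converted_users = 20_000, 900   # e.g. premium sign-ups nudged by the bot
conversion_rate = converted_users / bot_users

print(f"Deflection {deflection_rate:.0%} | cost/conversation ${cost_per_bot_conversation:.2f} | "
      f"net monthly savings ${net_monthly_savings:,.0f} | conversion {conversion_rate:.1%}")
```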
Let me know if you have other suggestions, or tactics you've deployed at your own company to test chatbot performance.
Leave a comment below.