AI Evals For PM
User: Refund my order 11229090
AI: Happy to help! Let me start the refund process for order 11229090.
Was there anything wrong with the order?
User: You have the worst customer support. I can't believe I have to talk
to an AI!
AI: Let me pass you to a human agent who you can talk to. What's your
preferred number to call?
Step-2 : Synthetic Test Data
✅ Synthetic data is AI-generated data that mimics real-world questions
or situations. It helps you test your AI in scenarios you haven’t seen yet or
that users haven’t asked about. This can include simulated complaints,
new topics, or rare questions.
✅ You can generate synthetic data by asking another LLM to produce a
batch of hypothetical user requests, for example requests sent to the
customer support team of an e-commerce company.
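A minimal sketch of how that request to another LLM might be set up. Everything named here is an assumption for illustration: `build_synthetic_prompt`, `generate_synthetic_requests`, and the scenario labels are hypothetical, and `llm` stands for whatever function you use to send a prompt to your model and get text back. A stub is used in its place so the example runs without any API.

```python
from typing import Callable, List


def build_synthetic_prompt(domain: str, n: int, scenarios: List[str]) -> str:
    """Build a prompt asking an LLM for n hypothetical user requests."""
    scenario_list = "\n".join(f"- {s}" for s in scenarios)
    return (
        f"Generate {n} hypothetical user requests for the customer support "
        f"chat of {domain}.\n"
        f"Cover these scenario types:\n{scenario_list}\n"
        "Return one request per line, with no numbering."
    )


def generate_synthetic_requests(
    llm: Callable[[str], str], domain: str, n: int, scenarios: List[str]
) -> List[str]:
    """Call the (hypothetical) llm function and clean up its answer."""
    raw = llm(build_synthetic_prompt(domain, n, scenarios))
    seen = set()
    requests = []
    for line in raw.splitlines():
        text = line.strip()
        # Skip blank lines and case-insensitive duplicates.
        if text and text.lower() not in seen:
            seen.add(text.lower())
            requests.append(text)
    return requests[:n]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text."""
    return (
        "Where is my package?\n"
        "\n"
        "I want a refund for a damaged item\n"
        "Where is my package?\n"
    )


requests = generate_synthetic_requests(
    fake_llm,
    domain="an e-commerce company",
    n=5,
    scenarios=["complaints", "new topics", "rare questions"],
)
print(requests)
```

Swapping `fake_llm` for a real model call is the only change needed to produce actual test data; the de-duplication step is there because models often repeat themselves in long lists.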
Step-3 : Grade Outputs
✅ To evaluate your AI, humans need to grade the outputs it generates
for your synthetic test data.
✅ Create a table of AI outputs vs. the dimensions that matter to your
product, such as accuracy and tone. Use a binary Success/Failure rating
for dimensions like accuracy, and a scale of 1 to 5 for dimensions like
tone and empathy.
✅ These grades form your “source of truth,” guiding your AI toward
better performance and revealing which areas still need work.
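The grading table above could be captured and summarized roughly like this. The rows are made-up illustrative grades, not real results, and the field names (`output_id`, `accuracy`, `tone`, `empathy`) and the `summarize` helper are assumptions chosen for the sketch.

```python
from statistics import mean

# Illustrative human grades: accuracy is binary (True = Success),
# tone and empathy use a 1-to-5 scale.
grades = [
    {"output_id": 1, "accuracy": True, "tone": 4, "empathy": 5},
    {"output_id": 2, "accuracy": False, "tone": 2, "empathy": 3},
    {"output_id": 3, "accuracy": True, "tone": 5, "empathy": 4},
    {"output_id": 4, "accuracy": True, "tone": 3, "empathy": 2},
]


def summarize(rows, binary_dims, scale_dims):
    """Return the success rate per binary dimension and the mean per scale dimension."""
    summary = {}
    for dim in binary_dims:
        summary[dim] = sum(1 for row in rows if row[dim]) / len(rows)
    for dim in scale_dims:
        summary[dim] = mean(row[dim] for row in rows)
    return summary


summary = summarize(grades, binary_dims=["accuracy"], scale_dims=["tone", "empathy"])
print(summary)  # {'accuracy': 0.75, 'tone': 3.5, 'empathy': 3.5}
```

A per-dimension summary like this is what shows which areas still need work: here the hypothetical model is accurate three times out of four but only middling on tone and empathy.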
Step-4 : Build Auto-Raters
✅ Human evaluation is not scalable. To scale, build an AI-based
autorater that uses your golden sets to grade new AI outputs.
✅ Compare its evaluations with human-graded data. Aim for high
agreement: more than 95% alignment with human grades.
✅ Continue spot-checking new data, as nuanced cases still require human
oversight.
✅ With autoraters, you’ll speed up testing and make AI improvements
faster and more reliable.
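The comparison between autorater and human grades in Step 4 can be sketched as a simple agreement rate. The labels below are invented for illustration, and `agreement_rate`, `disagreements`, and `TARGET` are hypothetical names; only the 95% threshold comes from the text above.

```python
TARGET = 0.95  # alignment threshold named in the text (>95%)


def agreement_rate(human, auto):
    """Fraction of items where the autorater grade equals the human grade."""
    if len(human) != len(auto):
        raise ValueError("human and auto must grade the same items")
    if not human:
        raise ValueError("need at least one graded item")
    matches = sum(h == a for h, a in zip(human, auto))
    return matches / len(human)


def disagreements(human, auto):
    """Indexes of items to send back for human spot-checking."""
    return [i for i, (h, a) in enumerate(zip(human, auto)) if h != a]


# Illustrative grades for the same five outputs.
human_grades = ["pass", "pass", "fail", "pass", "fail"]
auto_grades = ["pass", "pass", "fail", "fail", "fail"]

rate = agreement_rate(human_grades, auto_grades)
to_review = disagreements(human_grades, auto_grades)
meets_target = rate > TARGET

print(rate, to_review, meets_target)  # 0.8 [3] False
```

With real data you would want far more than five items before trusting the number, and the disagreement list gives a natural queue for the human spot-checks.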
✅ AI evals aren’t a one-time thing. As your AI improves and your product
evolves, so should your evaluations. Keep refining your golden sets,
generating fresh synthetic data, and grading outputs.
✅ This continuous loop is the key to building accurate, reliable and
trustworthy AI features that delight your users.
💬 Let’s Talk!
How are you evaluating AI in your products? What’s worked for you, and
what hasn’t?