AI Evals For PM

AI evaluations, or 'evals', are structured assessments designed to measure the performance of AI systems against product goals, focusing on accuracy, reliability, tone, and quality. They are essential for managing the unpredictability of AI outputs and involve creating golden sets, generating synthetic test data, grading outputs, and building auto-raters for scalability. Continuous refinement of these evaluations is crucial for developing accurate and trustworthy AI features.


AI Evals: The Essential Skill for PMs

What are AI evals?

AI evaluations, or "evals," are structured assessments that measure how well your AI system performs against your product goals: not just accuracy, but also the reliability, tone, and quality of its outputs.
Why Do We Need AI Evals?
✅ AI outputs are unpredictable: hallucinations, biases, and performance drift
✅ Traditional unit tests don't work for GenAI's non-deterministic outputs
✅ AI evals help us deliver consistent, user-friendly, high-quality experiences
✅ They're a new QA mindset for PMs building AI-powered products
How to set up an AI eval system?
1. Create Golden Sets — curated datasets with known perfect outputs
2. Use AI to Generate Synthetic Test Data — realistic scenarios to stress-test the AI
3. Grade Outputs to Build a Source of Truth — humans grade the synthetic data across various dimensions (e.g., on a scale of 1 to 5)
4. Build Auto-Raters — since human grading is not scalable, automate scoring of outputs against your success criteria
Step 1: Create Golden Sets
✅ Golden sets are your perfect answers. They show exactly how your AI should respond. Include both typical user questions and tricky scenarios that might trip up your AI.
✅ Example: a customer support bot for an e-commerce company (a structured version of this set is sketched in the code below)

Input: Refund my order 11229090
Golden Output: Happy to help! Let me start the refund process for order 11229090. Was there anything wrong with the order?

Input: You have the worst customer support, I can't believe I have to talk to an AI!
Golden Output: Let me pass you to a human agent who you can talk to. What's your preferred number to call?
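A golden set is easiest to maintain when it lives in a simple, structured file that both human graders and auto-raters can read. Here is a minimal sketch in Python using the two examples above; the field names (input, golden_output, tags) and the JSONL format are just one reasonable convention, not a standard schema.

```python
# golden_set.py - a minimal, illustrative golden set for an e-commerce support bot.
# Field names (input, golden_output, tags) are placeholders, not a standard schema.

import json

GOLDEN_SET = [
    {
        "id": "refund-001",
        "input": "Refund my order 11229090",
        "golden_output": (
            "Happy to help! Let me start the refund process for order 11229090. "
            "Was there anything wrong with the order?"
        ),
        "tags": ["refund", "typical"],
    },
    {
        "id": "escalation-001",
        "input": "You have the worst customer support, I can't believe I have to talk to an AI!",
        "golden_output": (
            "Let me pass you to a human agent who you can talk to. "
            "What's your preferred number to call?"
        ),
        "tags": ["frustrated-user", "escalation", "tricky"],
    },
]

def save_golden_set(path: str = "golden_set.jsonl") -> None:
    """Write one JSON object per line so the set is easy to diff and extend."""
    with open(path, "w", encoding="utf-8") as f:
        for example in GOLDEN_SET:
            f.write(json.dumps(example) + "\n")

if __name__ == "__main__":
    save_golden_set()
```

Keeping the set in version control this way makes it easy to review new "tricky" cases as they are added.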
Step 2: Synthetic Test Data
✅ Synthetic data is AI-generated data that mimics real-world questions or situations. It helps you test your AI on scenarios you haven't seen yet or that users haven't asked about. This can include simulated complaints, new topics, or rare questions.
✅ You can generate synthetic data by asking another LLM to produce a batch of hypothetical user requests for the customer support bot of an e-commerce company (a sketch follows below).
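As a concrete illustration, here is a hedged sketch of that idea in Python. It assumes the OpenAI Python SDK and an API key in the environment; the model name, prompt wording, and batch size are placeholders you would adapt to your own product and LLM provider.

```python
# synthetic_data.py - sketch of generating synthetic test inputs with an LLM.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt wording are illustrative, not recommendations.

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are helping test a customer support bot for an e-commerce company. "
    "Generate 10 realistic user messages, one per line, covering refunds, "
    "delivery delays, damaged items, and at least two angry or unusual requests."
)

def generate_synthetic_inputs() -> list[str]:
    """Ask another LLM for hypothetical user requests to stress-test the bot."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = response.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    for request in generate_synthetic_inputs():
        print(request)
```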
Step 3: Grade Outputs
✅ To evaluate your AI, humans need to grade the outputs generated from your synthetic test data.
✅ Create a table of AI outputs vs. the dimensions that matter to your product, such as accuracy and tone. Rate binary dimensions like accuracy as Success/Failure, and use a scale of 1 to 5 for dimensions like tone and empathy (see the sketch below).
✅ These grades form your "source of truth," guiding your AI toward better performance and revealing which areas still need work.
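To make the "source of truth" concrete, here is a small Python sketch of how graded rows might be recorded and summarized. The dimensions mirror the scheme above (accuracy as pass/fail, tone and empathy on a 1-to-5 scale); the field names and example rows are illustrative only.

```python
# grading.py - sketch of a human-grading record and a simple summary.
# Accuracy is pass/fail; tone and empathy use a 1-to-5 scale, as described above.
# Field names and example rows are illustrative placeholders.

from statistics import mean

# Each row: one AI output graded by a human reviewer.
grades = [
    {"id": "refund-001",     "accuracy": True,  "tone": 5, "empathy": 4},
    {"id": "escalation-001", "accuracy": True,  "tone": 4, "empathy": 5},
    {"id": "delivery-003",   "accuracy": False, "tone": 3, "empathy": 3},
]

def summarize(rows: list[dict]) -> dict:
    """Turn raw grades into the aggregate metrics you track over time."""
    return {
        "accuracy_pass_rate": sum(r["accuracy"] for r in rows) / len(rows),
        "avg_tone": mean(r["tone"] for r in rows),
        "avg_empathy": mean(r["empathy"] for r in rows),
    }

if __name__ == "__main__":
    print(summarize(grades))  # e.g. {'accuracy_pass_rate': 0.67, 'avg_tone': 4.0, ...}
```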
Step 4: Build Auto-Raters
✅ Human evaluation is not scalable. To scale, build an AI-based auto-rater that uses your golden sets to grade new AI outputs (see the sketch below).
✅ Compare its evaluations with human-graded data. Aim for high accuracy: more than 95% alignment with human grades.
✅ Keep spot-checking new data, as nuanced cases still require human oversight.
✅ With auto-raters, you'll speed up testing and make AI improvements faster and more reliable.
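One common way to build such an auto-rater is the LLM-as-judge pattern: a second model compares each new output against the golden output and returns a verdict, and you periodically measure how often it agrees with your human grades. The sketch below reuses the OpenAI SDK assumption from the earlier example; the judge prompt, model name, and PASS/FAIL scheme are illustrative choices, not the only way to do it.

```python
# autorater.py - sketch of an LLM-as-judge auto-rater plus an alignment check.
# Assumes the OpenAI Python SDK; the judge prompt and model name are
# placeholders you would tune for your own product.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a customer support bot.
User message: {user_input}
Golden (ideal) answer: {golden_output}
Bot answer: {candidate_output}
Reply with exactly one word: PASS if the bot answer matches the intent and
quality of the golden answer, otherwise FAIL."""

def auto_rate(user_input: str, golden_output: str, candidate_output: str) -> bool:
    """Return True if the judge model rates the candidate output as PASS."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                user_input=user_input,
                golden_output=golden_output,
                candidate_output=candidate_output,
            ),
        }],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")

def alignment(auto_grades: list[bool], human_grades: list[bool]) -> float:
    """Share of cases where the auto-rater agrees with the human grade
    (the target mentioned above is more than 95%)."""
    matches = sum(a == h for a, h in zip(auto_grades, human_grades))
    return matches / len(human_grades)
```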
✅ AI evals aren't a one-time thing. As your AI improves and your product evolves, so should your evaluations. Keep refining your golden sets, generating fresh synthetic data, and grading outputs.
✅ This continuous loop is the key to building accurate, reliable, and trustworthy AI features that delight your users.

💬 Let's Talk!
How are you evaluating AI in your products? What's worked for you, and what hasn't?
