Generative AI has radically transformed product development, enabling tools that write code, design visuals, and interact with customers. But with that creative power comes unpredictability: the same input prompt can produce wildly different outputs, from brilliance to nonsense.
GenAI’s unpredictability is part of what makes it so powerful. But for product managers (PMs) tasked with delivering reliable, trustworthy products and user experiences, it can feel like navigating a busy city blindfolded. The challenge is clear: how do we measure quality when the same system behaves differently from one run to the next?
Enter AI Evaluations, or “Evals”: the compass that helps PMs find clarity and confidence in the chaos.
Evals – The PM’s North Star
In traditional software, testing is straightforward: you write a test, check whether the output matches the expected result, and you’re done. But GenAI outputs fluctuate with context, phrasing, and sheer randomness, so a simple pass/fail check isn’t enough. This is where traditional QA methods fall short and Evals come in. They dig deeper, assessing tone, relevance, and safety alongside overall quality. That’s why leading companies like OpenAI and Anthropic have made Eval skills mandatory for PMs.
Evals help PMs see more than whether the system works; they reveal how it works and how well. They help answer critical questions: Is the AI accurate? Is it helpful, safe, and consistent?
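To make the contrast concrete, here is a minimal sketch in Python: a traditional exact-match test next to a rubric-based Eval. The `call_model` and `judge_model` helpers are hypothetical stand-ins for whatever LLM client your stack uses, and the rubric dimensions simply mirror the questions above.

```python
# Minimal sketch: traditional exact-match test vs. a rubric-based Eval.
import json

def call_model(prompt: str) -> str:
    """Placeholder for the product's GenAI call (hypothetical)."""
    raise NotImplementedError

def judge_model(prompt: str) -> str:
    """Placeholder for a (usually stronger) LLM used as a judge (hypothetical)."""
    raise NotImplementedError

# Traditional QA: one input, one expected output, pass/fail.
def test_exact_match():
    assert call_model("What is 2 + 2?") == "4"

# Eval: score the response on several dimensions instead of exact equality.
RUBRIC = ["accuracy", "helpfulness", "safety", "consistency"]

def eval_response(user_prompt: str) -> dict:
    response = call_model(user_prompt)
    judge_prompt = (
        "Score the RESPONSE to the PROMPT from 1-5 on each dimension: "
        f"{', '.join(RUBRIC)}. Reply as JSON.\n"
        f"PROMPT: {user_prompt}\nRESPONSE: {response}"
    )
    return json.loads(judge_model(judge_prompt))
```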
Evaluation-Driven Development
To manage the dynamic nature of AI, PMs are stepping into the role of AI strategists. They are turning to Evaluation-Driven Development, a test-and-learn approach tailored for AI systems. In this approach, PMs define clear success metrics, test against them each cycle, and tweak based on what they learn.
This scientific rigor enables PMs to spot issues early, iterate faster, and deliver more value to users and stakeholders. The cycle looks like this:
- Observe: Capture how the AI performs in real-world scenarios.
- Annotate: Mark what works (success cases) and what doesn’t (failure cases).
- Hypothesize: Identify possible failure points.
- Experiment: Adjust prompts, tools, or logic to improve outcomes.
- Measure: Quantify what changed and by how much.
- Iterate: Refine continuously.
This ongoing cycle enables PMs to shape their product to match the AI’s evolving complexity without getting lost in the noise.
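As one illustration, a single pass through that loop might look like the sketch below, which reuses a rubric-scoring helper like the one shown earlier. The annotated cases, prompt variants, and the `score_fn` signature are assumptions for illustration, not a prescribed setup.

```python
# A minimal sketch of one Evaluation-Driven Development cycle (illustrative data).
from statistics import mean

# Observe + Annotate: real-world inputs, labeled by the team.
annotated_cases = [
    {"input": "Explain my bill in plain English.", "label": "failure"},
    {"input": "Summarize this contract clause.", "label": "success"},
]

# Hypothesize: a revised system prompt should fix the failure cases.
prompt_variants = {
    "baseline": "You are a helpful assistant.",
    "candidate": "You are a helpful assistant. Answer in plain, simple language.",
}

# Experiment + Measure: score each variant on the same cases.
def run_cycle(score_fn) -> dict:
    """`score_fn(system_prompt, user_input)` returns a 1-5 quality score."""
    results = {}
    for name, system_prompt in prompt_variants.items():
        scores = [score_fn(system_prompt, case["input"]) for case in annotated_cases]
        results[name] = mean(scores)
    return results  # Iterate: ship the winner, then form the next hypothesis.
```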
Building Evals: The Four-Pillar Blueprint
Creating an effective Eval system doesn’t have to be complicated; it just needs the right foundation. Below is a breakdown of the four pillars that make Evals work:
- Golden Examples (Goldens): Goldens are hand-picked, high-quality, ideal input-output pairs that define success. These cover typical use cases, edge scenarios, and failures. Goldens take time to get right, but they set the benchmark for success.
- Synthetic Data: Manually writing thousands of tests isn’t practical. Instead, use LLMs to automatically generate large, varied datasets, expanding on the Goldens with diverse phrasings, edge cases, and stress scenarios. It’s fast, scalable, and cost-effective.
- Human Judging: For things AI still struggles to measure, like tone, safety, or helpfulness, bring in expert human reviewers. With clear rubrics, their evaluations create a “ground-truth” dataset pinpointing strengths and gaps.
- Automated Evaluation: Train AI models (auto-raters) on Goldens and human-judged data to assess outputs at scale. Tools like Ragas can help score both generation quality and retrieval relevance.
Together, these pillars replace guesswork with insights, revealing how your AI is really performing.
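As a rough illustration of how the pillars might fit together in code, consider the sketch below. The `generate_model` and `rater_model` functions, the rubric fields, and the sample data are all hypothetical placeholders; libraries such as Ragas offer pre-built metrics in a similar spirit, but this sketch does not depend on any specific tool’s API.

```python
# A minimal sketch of the four pillars wired together (all names hypothetical).
import json
from dataclasses import dataclass, field

def generate_model(prompt: str) -> str:
    """Placeholder for the LLM used to synthesize test data."""
    raise NotImplementedError

def rater_model(prompt: str) -> str:
    """Placeholder for the judge LLM used as an auto-rater."""
    raise NotImplementedError

# Pillar 1 -- Goldens: hand-picked inputs paired with ideal outputs.
@dataclass
class Golden:
    input: str
    ideal_output: str
    tags: list = field(default_factory=list)  # e.g. ["typical"], ["edge_case"]

goldens = [Golden("Reset my password", "Step-by-step reset instructions...", ["typical"])]

# Pillar 2 -- Synthetic data: expand each Golden into many varied test inputs.
def synthesize_variants(golden: Golden, n: int = 10) -> list:
    prompt = (f"Rewrite this request {n} different ways, varying phrasing and "
              f"edge conditions, one per line:\n{golden.input}")
    return generate_model(prompt).splitlines()

# Pillar 3 -- Human judging: expert scores against a shared rubric (ground truth).
human_labels = [
    {"input": "Reset my password", "output": "...", "helpfulness": 5, "safety": 5},
]

# Pillar 4 -- Automated evaluation: an auto-rater scores at scale,
# spot-checked against the human labels.
def auto_rate(inp: str, output: str) -> dict:
    prompt = ("Score 1-5 for helpfulness and safety. Reply as JSON.\n"
              f"INPUT: {inp}\nOUTPUT: {output}")
    return json.loads(rater_model(prompt))

def agreement_with_humans() -> float:
    """Fraction of human-labeled rows where the auto-rater is within 1 point."""
    hits = sum(
        abs(auto_rate(r["input"], r["output"])["helpfulness"] - r["helpfulness"]) <= 1
        for r in human_labels
    )
    return hits / len(human_labels)
```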
Real-World Example: Bedtime Story Generator
Imagine launching an AI to generate bedtime stories for children. As a PM, you might evaluate the system on:
- Narrative coherence
- Age-appropriateness
- Creativity
- Alignment of illustrations
- Emotional engagement
Using a combination of Goldens, synthetic data, auto-raters, and human grades, you’d refine it iteratively: not just testing, but creating a feedback loop that teaches your AI what “good” looks like, as sketched below.
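For instance, those criteria could be encoded as a weighted rubric for an auto-rater. The sketch below is illustrative only: `story_judge` is a hypothetical judge-LLM call, and the weights are made-up starting points a PM would calibrate against human reviews.

```python
# A minimal sketch of a weighted eval rubric for the bedtime-story example.
import json

CRITERIA = {
    "narrative_coherence": 0.25,
    "age_appropriateness": 0.30,   # weighted highest for a children's product
    "creativity": 0.20,
    "illustration_alignment": 0.10,
    "emotional_engagement": 0.15,
}

def story_judge(prompt: str) -> str:
    """Placeholder for the judge LLM used as an auto-rater (hypothetical)."""
    raise NotImplementedError

def score_story(request: str, story: str) -> float:
    judge_prompt = (
        "Score this children's bedtime story 1-5 on each criterion. "
        f"Reply as JSON with keys {list(CRITERIA)}.\n"
        f"REQUEST: {request}\nSTORY: {story}"
    )
    scores = json.loads(story_judge(judge_prompt))
    # Weighted average; stories below a threshold would go to human review.
    return sum(CRITERIA[k] * scores[k] for k in CRITERIA)
```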
Evaluation Is Product Strategy
Evals aren’t just a quality check; they’re a core part of building a great product. For PMs, they offer a competitive edge by helping to:
- Catch silent failures before users notice;
- Speed up iteration through clear, structured feedback;
- Translate fuzzy AI behavior into measurable business outcomes.
In a world where AI is evolving rapidly and unpredictably, Evals give PMs the clarity and confidence to lead.
TL;DR for PMs
Concept | Why It Matters
--- | ---
AI Evals | Measure both the accuracy and quality of AI responses.
Evaluation-Driven Development | Drive continuous product improvement through tight feedback loops.
Goldens + Synthetic Data | Cover real and edge cases at scale, efficiently.
Human + AI Judgment | Combine expert insight with automation for balanced, reliable reviews.
Strategic Moat | Build trust, move faster, and deliver reliable products.
The Reward: Trust and Edge
Shipping GenAI features without evaluations is like launching a rocket without sensors. You might land where you want, but the risk is enormous. Evals give PMs the ability to define, measure, and continually refine success.
In a space full of uncertainty, Evals are your compass.

Vivek Sunkara is a Technology Product Manager at Citi, transforming Risks & Controls data into actionable insights that drive strategic growth. A BCS Member, IEEE Senior Member, IETE Fellow, and ACM professional member, he is an ‘AI-first’ product leader focused on building products and emotionally resonant user experiences.