BLOG@CACM
Architecture and Hardware

The Product Manager’s Compass in the Wild West of Generative AI

AI evaluations help product managers find clarity and confidence in the chaos of GenAI outputs.


Generative AI has radically transformed product development, enabling tools that write code and design visuals, and systems that interact with customers. But with this creative power comes unpredictability: the same input prompt can produce wildly different outputs, sometimes brilliance, sometimes nonsense.

GenAI’s unpredictability is part of what makes it so powerful. But for product managers (PMs) tasked with delivering reliable, trustworthy products and user experiences, it can feel like navigating a busy city blindfolded. The challenge is clear: how do we measure quality when the same system behaves differently from one run to the next?

Enter AI Evaluations, or “Evals,” the compass that helps PMs find clarity and confidence in the chaos.

Evals – The PM’s North Star

In traditional software, testing is straightforward: you write a test, check whether the output matches the expected result, and you’re done. But GenAI outputs fluctuate with context, mood, or sheer randomness, so a simple pass/fail check isn’t enough. This is where traditional QA methods fall short and Evals come in. They dig deeper, assessing tone, relevance, and safety alongside overall quality. That’s why leading companies like OpenAI and Anthropic have made Eval skills mandatory for PMs.

Evals help PMs see more than whether the system works; they reveal how it works and how well. They help answer critical questions: Is the AI accurate? Is it helpful, safe, and consistent?
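
To make the contrast concrete, here is a minimal, hypothetical sketch in Python. The deterministic test on top mirrors traditional QA; the Eval-style check below scores a free-form response against several criteria. The score_criterion helper is an assumption, a stand-in for whatever rater (human grader, rubric-driven LLM judge, or trained auto-rater) a team actually uses.

# Traditional QA: a deterministic pass/fail check against one expected answer.
def test_tax_calculation():
    assert round(50_000 * 0.20) == 10_000  # same input, same output, every time

# Eval-style check: score a free-form GenAI response on several criteria
# instead of matching a single expected string.
CRITERIA = ["accuracy", "helpfulness", "safety", "consistency"]

def score_criterion(prompt: str, response: str, criterion: str) -> float:
    """Placeholder rater (assumption): in practice a human grader, a
    rubric-driven LLM judge, or a trained auto-rater would produce this."""
    return 1.0 if response.strip() else 0.0

def evaluate_response(prompt: str, response: str) -> dict[str, float]:
    """Return a 0-to-1 score per criterion rather than a single pass/fail."""
    return {c: score_criterion(prompt, response, c) for c in CRITERIA}

test_tax_calculation()
print(evaluate_response("Summarize the refund policy.", "Refunds are issued within 14 days."))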

Evaluation-Driven Development

To manage the dynamic nature of AI, PMs are stepping into the role of AI strategists. They are turning to Evaluation-Driven Development, a test-and-learn approach tailored for AI systems. In this approach, PMs define clear success metrics, test against them each cycle, and tweak based on what they learn.

This scientific rigor enables PMs to spot issues early, iterate faster on improvements, and deliver more value to users and stakeholders. The cycle looks like this:

  • Observe: Capture how the AI performs in real-world scenarios.
  • Annotate: Mark what works (success cases) and what doesn’t (failure cases).
  • Hypothesize: Identify possible failure points.
  • Experiment: Adjust prompts, tools, or logic to improve outcomes.
  • Measure: Quantify what changed and by how much.
  • Iterate: Refine continuously.

This ongoing cycle enables PMs to shape their product to match the AI’s evolving complexity without getting lost in the noise.
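
As a rough illustration, the cycle above can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: collect_production_traces and annotate stand in for whatever logging and labeling tooling a team already has, and the stubbed bodies exist only so the sketch runs.

# A minimal, hypothetical sketch of an evaluation-driven development cycle.
def collect_production_traces(n: int) -> list[dict]:
    """Observe: capture real prompts and AI responses (stubbed here)."""
    return [{"prompt": f"user request {i}", "response": "a model reply"} for i in range(n)]

def annotate(trace: dict) -> dict:
    """Annotate: label each trace as a success or failure case (stubbed)."""
    trace["success"] = len(trace["response"]) > 0
    return trace

def run_eval_cycle(rounds: int = 3) -> None:
    for cycle in range(rounds):
        traces = [annotate(t) for t in collect_production_traces(100)]  # Observe + Annotate
        failures = [t for t in traces if not t["success"]]              # raw material for Hypothesize
        # Experiment: adjust prompts, tools, or logic based on those failures, then re-run.
        pass_rate = 1 - len(failures) / len(traces)                     # Measure
        print(f"cycle {cycle}: pass rate {pass_rate:.0%}")              # Iterate on what you learn

run_eval_cycle()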

Building Evals: The Four-Pillar Blueprint

Creating an efficient Eval system doesn’t have to be complicated. It just needs the right foundation. Below is a breakdown of the four pillars that make Evals work:

  1. Golden Examples (Goldens): Goldens are hand-picked, high-quality, ideal input-output pairs that define success. These cover typical use cases, edge scenarios, and failures. Goldens take time to get right, but they set the benchmark for success.
  2. Synthetic Data: Manually writing thousands of tests isn’t practical. Instead, use LLMs to automatically generate large, varied datasets, expanding on the Goldens with diverse phrasings, edge cases, and stress scenarios. It’s fast, scalable, and cost-effective (see the sketch after this list).
  3. Human Judging: For things AI still struggles to measure, like tone, safety, or helpfulness, bring in expert human reviewers. With clear rubrics, their evaluations create a “ground-truth” dataset pinpointing strengths and gaps.
  4. Automated Evaluation: Train AI models (auto-raters) on Goldens and human-judged data to assess outputs at scale. Tools like Ragas can help score both generation quality and retrieval relevance (a rough sketch of pairing human judgments with an auto-rater appears below).
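
As a rough sketch of the first two pillars, assume a handful of hand-written Goldens and a hypothetical expand_golden helper standing in for the LLM call that would generate synthetic variants in practice:

# Pillar 1: Goldens -- hand-picked input/ideal-output pairs that define success.
GOLDENS = [
    {"input": "Tell a calm bedtime story about a shy dragon.",
     "ideal": "A gentle, age-appropriate story with a clear beginning, middle, and end."},
    {"input": "Tell a story with a scary monster.",  # edge case
     "ideal": "A story that softens the monster into something friendly, never frightening."},
]

# Pillar 2: Synthetic data -- expand each Golden into many varied test inputs.
# In practice an LLM would generate these rephrasings and stress cases;
# this template is only a placeholder so the sketch runs.
def expand_golden(golden: dict, n: int = 5) -> list[dict]:
    return [{"input": f"{golden['input']} (variation {i})", "ideal": golden["ideal"]}
            for i in range(n)]

test_set = [case for g in GOLDENS for case in expand_golden(g)]
print(f"{len(GOLDENS)} Goldens expanded into {len(test_set)} synthetic test cases")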

Together, these pillars replace guesswork with insights, revealing how your AI is really performing.
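
And a rough sketch of pillars three and four, assuming a small set of human-judged scores used as ground truth and a hypothetical score_with_judge function standing in for a rubric-driven LLM judge or a library metric such as those Ragas provides:

# Pillars 3 and 4: human judgments become the ground truth an auto-rater is checked against.
RUBRIC = {
    "tone": "Is the response calm, warm, and suitable for the audience?",
    "safety": "Is the content free of harmful or frightening elements?",
    "helpfulness": "Does the response actually answer the request?",
}

def score_with_judge(response: str, rubric_question: str) -> float:
    """Placeholder auto-rater (assumption): a real one would prompt a judge
    model with the rubric question and parse a 1-to-5 score from its reply."""
    return 5.0 if response.strip() else 1.0

def auto_rate(response: str) -> dict[str, float]:
    return {name: score_with_judge(response, q) for name, q in RUBRIC.items()}

human_scores = {"tone": 4.5, "safety": 5.0, "helpfulness": 4.0}  # from expert reviewers
auto_scores = auto_rate("Once upon a time, a shy dragon learned to share his glow.")
agreement = {k: abs(human_scores[k] - auto_scores[k]) <= 1.0 for k in RUBRIC}
print(auto_scores)
print("auto-rater agrees with humans:", agreement)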

Real-World Example: Bedtime Story Generator

Imagine launching an AI to generate bedtime stories for children. As a PM, you might evaluate the system on:

  • Narrative coherence
  • Age-appropriateness
  • Creativity
  • Alignment of illustrations
  • Emotional engagement

Using a combination of Goldens, synthetic data, auto-raters, and human grades, you’d refine the product iteratively: not just testing it, but building a feedback loop that teaches your AI what “good” looks like and keeps improving it.
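
One hypothetical way to turn those criteria into numbers each iteration, with rate_story standing in for whatever mix of auto-raters and human grades actually produces a per-criterion score:

# Hypothetical per-criterion scoring for the bedtime story generator.
import statistics

CRITERIA = ["narrative_coherence", "age_appropriateness", "creativity",
            "illustration_alignment", "emotional_engagement"]

def rate_story(story: str, criterion: str) -> float:
    """Placeholder rater returning a 0-to-1 score; stubbed so the sketch runs."""
    return 0.9 if story else 0.0

def score_release_candidate(stories: list[str], threshold: float = 0.8) -> dict:
    """Average each criterion across the test set and gate the release on a threshold."""
    report = {c: statistics.mean(rate_story(s, c) for s in stories) for c in CRITERIA}
    report["ship"] = all(score >= threshold for score in report.values())
    return report

print(score_release_candidate(["Once upon a time, a sleepy fox counted the stars to sleep."]))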

Evaluation Is Product Strategy

Evals aren’t just a quality check; they’re a core part of building a great product. For PMs, they offer a competitive edge by helping to:

  • Catch silent failures before users notice;
  • Speed up iteration through clear, structured feedback;
  • Translate fuzzy AI behavior into measurable business outcomes.

In a world where AI is evolving rapidly and unpredictably, Evals give PMs the clarity and confidence to lead.

TL;DR for PMs

Concept                       | Why It Matters
AI Evals                      | Measure both the accuracy and quality of AI responses.
Evaluation-Driven Development | Drive continuous product improvement through tight feedback loops.
Goldens + Synthetic Data      | Cover real and edge cases at scale, efficiently.
Human + AI Judgment           | Combine expert insight with automation for balanced, reliable reviews.
Strategic Moat                | Build trust, move faster, and deliver reliable products.


The Reward: Trust and Edge

Shipping GenAI features without evaluations is like launching a rocket without sensors. You might land where you want, but the risk is enormous. Evals give PMs the ability to define, measure, and continually refine success.

In a space full of uncertainty, Evals are your compass.

Vivek Sunkara

Vivek Sunkara is a Technology Product Manager at Citi, transforming Risks & Controls data into actionable insights that drive strategic growth. A BCS Member, IEEE Senior Member, IETE Fellow, and ACM professional member, he is an ‘AI-first’ product leader focused on building products and emotionally resonant user experiences.
