Aakash Gupta: How to Build AI Evals in 2026 (Step-by-Step, No Hype)
Episode Details
EPISODE INFO
- Released: January 15, 2026
- Duration: 1h 7m
- Channel: Aakash Gupta
- Watch on YouTube
EPISODE DESCRIPTION
Hamel Husain and Shreya Shankar are back with the definitive guide to AI evals: a step-by-step walkthrough using real production data from Nurture Boss, covering error analysis, LLM judges, and the mistakes 90% of teams make.

Full Writeup: https://www.news.aakashg.com/p/hamel-shreya-podcast-2
Transcript: https://www.aakashg.com/how-to-master-ai-evals-a-step-by-step-guide-with-hamel-husain-shreya-shankar/

----

Timestamps:
0:00 - Intro
2:09 - Why Every AI Product Needs Evals
3:11 - Real Example: Nurture Boss Case Study
5:26 - Starting with Observability
11:24 - Ad Start
13:05 - Ad End: Analyzing Traces
24:55 - Error Analysis Introduction
27:00 - Axial Coding Explained
30:53 - Ad Start
32:40 - Ad End: Counting Issues
42:26 - Building Your LLM Judge
48:02 - Measuring the Judge
56:38 - PM vs AI Engineer Roles
1:01:29 - Common Mistakes to Avoid
1:06:31 - Outro

----

🏆 Thanks to our sponsors:
1. The AI Evals Course for PMs & Engineers: You get $800 off with this link: https://maven.com/parlance-labs/evals?promoCode=ag-product-growth
2. Vanta: Automate compliance. Get $1,000 with my link: https://www.vanta.com/lp/demo-1k?utm_campaign=1k_offer&utm_source=product-growth&utm_medium=podcast
3. Jira Product Discovery: Plan with purpose, ship with confidence - https://www.atlassian.com/software/jira/product-discovery
4. Land PM Job: 12-week experience to master getting a PM job - https://www.landpmjob.com/
5. Pendo: the #1 Software Experience Management Platform - http://www.pendo.com/aakash

----

Key Takeaways:
1. AI evals are the #1 most important new skill for PMs in 2025 - Even Claude Code teams do evals upstream. For custom applications, systematic evaluation is non-negotiable. Dogfooding alone isn't enough at scale.
2. Error analysis is the secret weapon most teams skip - Looking at 100 traces teaches you more than any generic metric. Hamel: "If you try to use helpfulness scores, the LLM won't catch the real product issues."
3. Use observability tools but don't depend on them completely - Braintrust, LangSmith, and Arize all work. But Shreya and Hamel teach students to vibe code their own trace viewers. Sometimes CSV files are enough to start.
4. Never use raw agreement as your eval metric - It's a trap. A judge that always says "pass" can have 90% accuracy if failures are rare. Use TPR (true positive rate) and TNR (true negative rate) instead.
5. Open coding, then axial coding, reveals patterns - Write free-form notes on 100 traces without jumping to root cause analysis. Then categorize into 5-6 actionable themes. Use LLMs to help, but refine manually.
6. Product managers must do the error analysis themselves - Don't outsource it to developers; engineers lack the domain context. Hamel: "It's almost a tragedy to separate the prompt from the product manager because it's English."
7. Real traces reveal what demos hide - ChatGPT said the assistant was correct but missed: wrong bathroom configuration, markdown in SMS, double-booked tours, ignored handoff requests.
8. Binary scores beat 1-5 scales for LLM judges - They're easier to validate for alignment, business decisions are binary anyway, and LLMs struggle with nuanced numerical scoring.
9. Code-based evals for formatting, LLM judges for subjective calls - Markdown in text messages? Write a simple assertion. Human handoff quality? You need an LLM judge with a proper rubric.
10. Start collecting traces even before launch - Dogfood your own app. Recruit friends as beta testers. Generate synthetic inputs only as a last resort. Error analysis works best with real user behavior.

----

👨💻 Where to find Hamel Husain:
Website: https://hamel.dev
Twitter/X: https://x.com/HamelHusain
Course: https://evals.info

👨💻 Where to find Shreya Shankar:
Website: https://www.shreya-shankar.com
Twitter/X: https://x.com/sh_reya
Course: https://evals.info

👨💻 Where to find Aakash:
Twitter: https://www.x.com/aakashg0
LinkedIn: https://www.linkedin.com/in/aagupta/
Newsletter: https://www.news.aakashg.com

#aievals #aipm #productmanagement

----

🧠 About Product Growth: The world's largest podcast focused solely on product + growth, with over 200K listeners.

🔔 Subscribe and turn on notifications to get more videos like this.
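Takeaway 4 above (never rely on raw agreement) can be made concrete with a short sketch. This is illustrative only, not from the episode: the labels are hypothetical, with 1 = pass and 0 = fail, comparing human labels against a lazy judge that always outputs "pass".

```python
# Sketch: why raw agreement/accuracy misleads when failures are rare,
# and how TPR/TNR expose a judge that always says "pass".
# Labels are hypothetical: 1 = pass, 0 = fail.

def tpr_tnr(human, judge):
    """True positive rate (recall on passes) and true negative rate (recall on fails)."""
    tp = sum(1 for h, j in zip(human, judge) if h == 1 and j == 1)
    tn = sum(1 for h, j in zip(human, judge) if h == 0 and j == 0)
    pos = sum(1 for h in human if h == 1)
    neg = len(human) - pos
    return tp / pos, tn / neg

# 90 real passes and 10 real failures; the judge passes everything.
human = [1] * 90 + [0] * 10
judge = [1] * 100

accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
tpr, tnr = tpr_tnr(human, judge)
print(accuracy)   # 0.9 -- looks great on paper
print(tpr, tnr)   # 1.0 0.0 -- the judge never catches a single failure
```

The 90% accuracy figure from the takeaway falls out directly: the judge agrees with humans on every pass and disagrees on every failure, and the TNR of 0.0 is what reveals it.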
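Takeaway 9's "markdown in text messages? Write a simple assertion" can be sketched as a code-based eval. The patterns below are illustrative assumptions, not an exhaustive check and not from the episode:

```python
import re

# Sketch of a code-based formatting eval: flag markdown syntax in an
# outbound SMS, no LLM judge required. Patterns are illustrative only.
MARKDOWN_PATTERNS = [
    r"\*\*.+?\*\*",        # bold text
    r"(?m)^#{1,6}\s",      # headings at line start
    r"\[.+?\]\(.+?\)",     # inline links
]

def contains_markdown(sms: str) -> bool:
    """Return True if the SMS body contains markdown-style formatting."""
    return any(re.search(p, sms) for p in MARKDOWN_PATTERNS)

# A failing trace and a clean one (hypothetical examples):
bad = "**Tour confirmed!** See [details](https://example.com)"
good = "Your tour is confirmed for 3pm Tuesday."
print(contains_markdown(bad))   # True
print(contains_markdown(good))  # False
```

Checks like this are cheap, deterministic, and run on every trace, which is why the episode reserves LLM judges for genuinely subjective calls like handoff quality.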
SPEAKERS
Aakash Gupta
Host: AI product creator and host of the Aakash Gupta podcast/channel, focused on practical AI engineering and product workflows.
Hamel Husain
Guest: Machine learning engineer and educator known for hands-on guidance on building and validating LLM evaluation systems.
Shreya Shankar
Guest: ML/AI researcher and practitioner focused on reliable evaluation, monitoring, and real-world deployment of ML/LLM systems.
EPISODE SUMMARY
In this episode, Aakash Gupta is joined by Hamel Husain and Shreya Shankar. "How to Build AI Evals in 2026 (Step-by-Step, No Hype)" walks through a step-by-step evals workflow: traces, error analysis, and LLM judges. The speakers argue that most real AI products need evals, and that "no evals" claims usually rely on upstream testing or informal dogfooding rather than rigorous measurement.