Skip to content
Y CombinatorY Combinator

How Meta Prompting and Rubrics Make LLM Agents Reliable

Through rubric-based evals and explicitly layered meta prompting; Parahelp's agent prompt shows how role, task, and output-format layers drive LLM calls.

Garry TanhostJared FriedmanhostDiana HuhostHarj Taggarhost
May 30, 202531mWatch on YouTube ↗

CHAPTERS

  1. 0:00 – 0:58

    Why metaprompting feels like coding (and people management)

    Garry frames modern prompt engineering as an early-days frontier akin to coding in the 1990s, where tools are immature but rapidly evolving. He also introduces a key mental model: prompting is fundamentally about communicating clearly—similar to managing a human teammate.

    • Metaprompting is emerging as a widely used, high-leverage technique
    • Prompting today resembles early software development: powerful but under-tooled
    • Good prompts hinge on communicating decision context, not just instructions
    • LLMs can be treated like agents that need management-style guidance
  2. 0:58 – 1:48

    Inside ParaHelp: a real production prompt for AI customer support

    Jared introduces ParaHelp, an AI customer support company serving top AI products, and explains why seeing their prompt is unusually valuable. The group emphasizes that prompts for vertical agents are often considered proprietary, making this example a rare look at real-world practice.

    • ParaHelp powers customer support for companies like Perplexity and Replit
    • Production agent prompts are often ‘crown jewel’ IP and hard to obtain
    • Prompt visibility helps founders understand what ‘good’ looks like in practice
    • Vertical AI agents rely on prompts as operational control surfaces
  3. 1:48 – 4:15

    Anatomy of a strong long prompt: role, task, plan, and strict output format

    Diana breaks down what makes the ParaHelp prompt effective: it’s long, structured, and explicit about role, responsibilities, and step-by-step planning. She highlights formatting choices (Markdown headings, XML-like tags) that improve instruction-following and integration with tools.

    • Start by defining the model’s role and responsibilities clearly
    • Specify the task precisely (e.g., approve/reject tool calls)
    • Provide a step-by-step plan and ‘important things to keep in mind’
    • Constrain outputs with a strict schema for agent/tool interoperability
    • XML-like structure can improve compliance due to training priors
  4. 4:15 – 6:02

    System vs developer vs user prompts: scaling customization without ‘becoming a consultancy’

    The hosts discuss the practical challenge of supporting many customers with slightly different workflows and tone. Diana outlines an emerging architecture: keep the general ‘API of the company’ in the system prompt, place customer-specific context in developer prompts, and reserve user prompts for direct end-user interactions.

    • Customer-specific behavior often lives outside the core system prompt
    • System prompt: company-wide invariant rules and operating principles
    • Developer prompt: customer context, policies, workflows, and preferences
    • User prompt: end-user instructions when applicable
    • Prompt “forking/merging” across customers is becoming a key scaling problem
  5. 6:02 – 6:58

    Automating worked examples: pulling the best cases from customer data

    Harj points out that worked examples are crucial for quality, but manually curating them doesn’t scale. He describes a future where agents automatically extract high-signal examples from customer datasets and inject them into the right stage of the prompting pipeline.

    • Worked examples strongly improve output quality
    • Manual example curation becomes a bottleneck as customers scale
    • Opportunity: tools/agents that select examples automatically from real data
    • Pipeline design matters: where examples live affects performance and maintainability
  6. 6:58 – 7:57

    Metaprompting and prompt folding: prompts that improve themselves

    Garry describes ‘prompt folding,’ where one prompt generates improved versions of itself or specialized variants based on prior queries. The key workflow shift: instead of hand-editing, developers feed failures and additional examples back into an LLM and ask it to revise the prompt.

    • Prompt folding: dynamically generating specialized prompts from a base prompt
    • Metaprompting loop: feed failures/examples, ask the model to revise the prompt
    • Classifier-style prompts can generate task-specific instructions on demand
    • This approach reduces manual prompt rewriting and accelerates iteration
  7. 7:57 – 9:05

    When prose fails: teaching hard tasks with expert examples (LLM ‘TDD’)

    Diana explains that for complex domains (like deep code bug finding), it can be easier to provide difficult expert-labeled examples than to describe rules in text. She analogizes this to unit tests and test-driven development: examples become the spec that teaches the model what success looks like.

    • Some tasks are too complex for clean natural-language specifications
    • Expert examples can teach patterns (e.g., N+1 queries) better than prose
    • Examples help steer reasoning when exact parameters are hard to articulate
    • Analogy: ‘LLM TDD’—examples function like unit tests for behavior
  8. 9:05 – 12:09

    Escape hatches and debug channels: preventing hallucinations in production

    The group discusses a common failure mode: if forced into a fixed output format, models may fabricate answers instead of admitting uncertainty. They recommend explicit escape hatches (ask clarifying questions) and structured ‘debug info’ fields that let the model report confusion, creating a feedback-driven to-do list for developers.

    • Fixed output schemas can incentivize confident hallucinations
    • Add an explicit ‘escape hatch’ instruction: stop and ask when underspecified
    • Alternative pattern: include a ‘debug info’/complaint field in the response format
    • Production logs of debug info become a prioritized list of prompt fixes
  9. 12:09 – 12:44

    Iterating on long prompts: notes-to-edits workflows and model-assisted refactors

    As prompts become large working documents, Harj recommends keeping a running change log of issues and feeding both prompt + notes into a strong model to propose integrated edits. This supports continuous prompt refactoring rather than ad hoc patching.

    • Treat prompts as living documents that require ongoing maintenance
    • Keep a running list of observed failures and desired improvements
    • Use a strong model to propose cohesive edits rather than scattered patches
    • This workflow reduces cognitive load as prompt length and complexity grow
  10. 12:44 – 14:18

    Debugging with reasoning traces: using Gemini as a REPL for prompts

    Diana and Jared emphasize the value of reasoning/thinking traces for diagnosing prompt failures, especially when combined with long context windows. They describe using Gemini interactively like a REPL—running example-by-example, inspecting traces, and rapidly adjusting steering.

    • Reasoning traces provide critical signal for why prompts fail
    • Gemini’s long context enables iterative, example-driven prompt debugging
    • Newer API access to traces allows integration into developer workflows
    • Simple UI workflows (drag-and-drop JSON) can outperform heavy tooling for iteration
  11. 14:18 – 17:25

    Evals are the real moat: why prompts aren’t the ‘crown jewels’

    Jared argues that evaluation sets—not prompts—are the core proprietary asset, because evals explain why a prompt is written a certain way and how to improve it. Garry expands: high-quality evals require deep field immersion with real operators to capture true success criteria and edge cases.

    • Evals determine whether prompts work and guide systematic improvement
    • Prompts can be shared; evals encode the hard-won understanding of the domain
    • Great evals come from observing real workflows and incentives in the field
    • The moat is domain-specific understanding translated into measurable tests
  12. 17:25 – 23:18

    Every founder as a forward-deployed engineer (FDE): the Palantir playbook

    Garry explains how Palantir institutionalized sending engineers—not salespeople—on-site to learn workflows and ship software fast, enabling massive contracts. The hosts connect this to modern AI: founders must personally do deep customer immersion and return with working demos that reflect what they heard.

    • FDE model: engineers embed with users to understand and build rapidly
    • Palantir won big deals by showing working software quickly, not slideware
    • Founders today must be product + design + ethnography + engineering in one
    • Speed of iteration and empathy with operators beats incumbent sales processes
  13. 23:18 – 26:06

    Vertical AI agents closing huge deals with FDE + AI iteration speed

    Diana and Harj share examples of vertical agent companies winning six- and seven-figure enterprise contracts by rapidly tailoring demos and performance for each customer. They highlight that LLM-driven iteration makes it possible for small founder teams to achieve what previously required larger engineering orgs.

    • AI enables faster turnaround from customer meeting to tailored demo
    • Differentiation often comes from achieving the ‘last 5–10%’ of correctness
    • Examples: voice and customer support agents winning major enterprise deals
    • On-site tuning and tight feedback loops help maintain and expand deployments
  14. 26:06 – 30:02

    Model personalities and rubric lessons: rigidity vs flexibility across LLMs

    The panel discusses how different models behave like different ‘personalities’: some are more steerable and human-like, others require more explicit control. They illustrate this with rubric-based scoring, where one model adheres strictly to criteria while another flexibly handles exceptions—mirroring how humans use guidelines.

    • Claude described as more human/steerable; other models may need heavier steering
    • Rubrics are essential for consistent numeric scoring, but never capture all edge cases
    • Different LLMs apply the same rubric differently (rigid vs exception-aware)
    • Choose models based on the task: compliance vs judgment under ambiguity
  15. 30:02 – 31:26

    Prompting as kaizen: continuous improvement by the people doing the work

    Garry closes by unifying the themes: prompting is both engineering and communication, and the best improvements come from tight loops where builders observe failures and refine the process. He likens metaprompting to kaizen—continuous improvement driven by practitioners at the point of work.

    • Prompting is managing clarity: expectations, evaluation, and decision criteria
    • Kaizen parallels: the best process improvements come from practitioners
    • Metaprompting enables rapid, iterative refinement based on real feedback
    • The ‘brave new world’ is continuous improvement of agent behavior in production

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.