Lenny's Podcast

Edwin Chen: Why optimizing for benchmarks creates AI slop

How Surge bootstrapped past $1B revenue with fewer than 100 people; Chen argues benchmark gaming pushes AI toward dopamine, emojis, and slop, not truth.

Lenny Rachitsky (host), Edwin Chen (guest)

Dec 7, 2025 · 1h 10m

WHAT IT’S REALLY ABOUT

Bootstrapped AI Data Giant Surge Reimagines Responsible Path To AGI

Founder Edwin Chen explains how Surge AI became a $1B-revenue, sub‑100‑person, fully bootstrapped company by obsessing over ultra‑high‑quality training data for frontier models like ChatGPT, Claude, and Gemini.
He argues that most labs misunderstand data quality, over‑optimize for noisy benchmarks and engagement, and risk steering AI toward dopamine and slop instead of truth and real societal progress.
Chen outlines the evolution of post‑training—from SFT and RLHF to rubrics, verifiers, and rich RL environments—and describes how taste, values, and objective functions at each lab will increasingly differentiate models.
He also makes a broader case against the default Silicon Valley VC playbook, advocating for small elite teams, deep focus, principled research, and building the one company only you are uniquely qualified to build.

IDEAS WORTH REMEMBERING

5 ideas

Extreme focus and tiny teams can massively outperform bloated organizations.

Surge surpassed $1B in revenue with under 100 people by deliberately avoiding the typical Silicon Valley game—no VC, minimal PR—and instead relying on a small, elite, deeply aligned team shipping a 10x better product.

True data quality is nuanced, subjective, and labor‑intensive to measure.

Good training data is more than ‘correct format and instructions followed’; Surge uses thousands of behavioral and performance signals to separate merely acceptable work from “best of the best” contributors for complex tasks like Nobel‑level poetry, advanced coding, and scientific reasoning.

Benchmarks and public leaderboards are distorting AI progress.

Many benchmarks are noisy or wrong, easy to game, and correlate poorly with real‑world capability; optimizing for them often rewards flashy, emoji‑laden, hallucination‑prone outputs that win votes but degrade truthfulness and reliability.

Objective functions and values will shape how different AIs behave.

Labs choose what to optimize—engagement, benchmarks, productivity, safety, taste—and those choices drive data selection, post‑training, and ultimately model personality, such as whether an assistant endlessly polishes an email or tells you, “It’s good enough, move on.”

Reinforcement learning in rich environments is the next big frontier.

Beyond SFT and RLHF, Surge is building simulated ‘worlds’ (e.g., full startup stacks, financial workflows) where models must act over long trajectories, use tools, and affect real state, receiving rewards for end‑to‑end task success rather than single‑step answers.
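To make the distinction between end‑to‑end rewards and single‑step answers concrete, here is a minimal toy sketch. It is not Surge's implementation: the ToyEnv class, its "lookup" and "pay" tools, and the invoice task are all hypothetical names invented for illustration. The only point it shows is that the reward is computed from the final state after a whole trajectory of tool calls, not from any individual step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulated workflow with persistent state."""
    invoices: dict = field(default_factory=lambda: {"A": "unpaid", "B": "unpaid"})
    log: list = field(default_factory=list)

    def step(self, tool: str, arg: str) -> str:
        """Apply one tool call; unknown tools or invoices leave state untouched."""
        self.log.append((tool, arg))
        if tool == "pay" and arg in self.invoices:
            self.invoices[arg] = "paid"  # the action changes environment state
            return "ok"
        if tool == "lookup" and arg in self.invoices:
            return self.invoices[arg]
        return "error"


def end_to_end_reward(env: ToyEnv) -> float:
    """Return 1.0 only if the final state satisfies the whole task, else 0.0."""
    return 1.0 if all(status == "paid" for status in env.invoices.values()) else 0.0


def run_episode(actions) -> float:
    """Run a full trajectory of (tool, arg) calls and score only the outcome."""
    env = ToyEnv()
    for tool, arg in actions:
        env.step(tool, arg)
    return end_to_end_reward(env)


full = run_episode([("lookup", "A"), ("pay", "A"), ("pay", "B")])
partial = run_episode([("lookup", "A"), ("pay", "A")])
print(full, partial)
```

In this toy setup, a trajectory whose individual steps all look reasonable still earns nothing if it stops before the task is actually finished, which is the property the summary attributes to rich RL environments.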

WORDS WORTH SAVING

5 quotes

We essentially teach AI models what's good and what's bad.

— Edwin Chen

People think you can just throw bodies at a problem and get good data. That's completely wrong.

— Edwin Chen

I'm worried that instead of building AI that will actually advance us as a species, we are optimizing for AI slop instead.

— Edwin Chen

You are your objective function.

— Edwin Chen

I would rather be Terence Tao than Warren Buffett.

— Edwin Chen

Surge AI’s unconventional growth: bootstrapped, hyper‑lean, research‑driven path to $1B revenue
Defining and operationalizing high‑quality data for LLM training
Problems with current benchmarks, leaderboards, and engagement‑driven objectives
Evolution of post‑training: SFT, RLHF, rubrics/verifiers, and RL environments
Objective functions, values, and model “personality” as key differentiators
Critique of Silicon Valley’s VC/“pivot and blitzscale” culture
Philosophical lens: training AI as raising humanity’s “children”

High-quality AI-generated summary created from a speaker-labeled transcript.
