OpenAI’s head of platform engineering on the next 12-24 months of AI | Sherwin Wu

Lenny's Podcast · Feb 12, 2026 · 1h 19m

Sherwin Wu (guest), Lenny Rachitsky (host)

- Codex usage at OpenAI (AI-authored code, PR review)
- Engineers as agent managers (“sorcerers” metaphor)
- Stress/failure modes when agents stall; context as bottleneck
- Scaling code review and CI via AI automation
- Management changes: leverage, top performers, larger spans
- One-person billion-dollar startup and second/third-order effects
- Enterprise AI ROI: top-down vs bottom-up adoption
- “Models eat scaffolding”: shifting tool/architecture bets
- Build for where models are going (capability trajectory)
- API/platform stack: Responses API, Agents SDK, evals, UI kits
- Near-term model direction: longer tasks + audio breakthroughs
- Business process automation outside tech as major opportunity

In this episode of Lenny's Podcast, Lenny Rachitsky talks with Sherwin Wu, OpenAI’s head of platform engineering, about how AI agents are reshaping engineering, management, startups, and enterprise automation over the next 12-24 months.

AI agents are rapidly reshaping engineering, management, startups, and enterprise automation

Wu reports that Codex is deeply embedded at OpenAI: ~95% of engineers use it daily and 100% of PRs are reviewed by it, with heavy users opening far more PRs.

He argues the engineer role is shifting from writing code to “managing fleets of agents,” requiring new skills in prompting, context management, and preventing agent drift—like “sorcery” with real consequences.

For companies, he warns that many AI deployments likely have negative ROI because they rely on top-down mandates without bottom-up champions; he recommends “tiger teams” of internal power users to drive adoption and best practices.

He predicts major near-term shifts: longer-horizon agents (multi-hour tasks) and big gains in audio/multimodal models, enabling business process automation and potentially a boom of micro-startups supporting “one-person billion-dollar” outcomes.

Key Takeaways

AI is already the default authoring and review layer at OpenAI.

Wu says nearly all code is AI-generated first, ~95% of engineers use Codex daily, and 100% of PRs are reviewed by Codex—making human effort more about steering and verification than typing.

High performers compound faster with AI, widening productivity gaps.

Codex-heavy engineers open ~70% more PRs, and Wu expects the gap to grow as power users learn better workflows and trust models more.

The new core engineering skill is agent management, not syntax.

Engineers run many parallel threads, supervise multiple agent tasks, and must prevent “Sorcerer’s Apprentice” failure modes where agents go off-rails without sufficient guidance.

When agents fail, missing context—not “model stupidity”—is often the root cause.

An internal experiment with a 100% Codex-written codebase shows teams must encode tribal knowledge into repos (docs, comments, structure, …).
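The tribal knowledge Wu describes is often captured in a repo-level instructions file that coding agents read before acting. A minimal sketch of what that might look like; the filename, structure, and every detail below are illustrative, not from the episode:

```markdown
# AGENTS.md — context for coding agents (illustrative example)

## Architecture
- `api/` exposes the public REST surface; `core/` holds business logic.
- Never import `api/` from `core/` — dependencies flow one way.

## Conventions
- Every new endpoint needs an integration test in `tests/integration/`.
- Run `make lint test` before opening a PR.

## Tribal knowledge
- The billing worker is intentionally single-threaded; do not parallelize it.
- Feature flags live in `config/flags.yaml`, not in code.
```

Encoding constraints like these directly in the repo gives an agent the context a senior engineer would otherwise supply in person.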

AI can shrink the pain of code review and CI if you automate the boring parts first.

Wu notes Codex review can cut review time dramatically (e.g., …).

Many enterprise AI rollouts fail because they’re mandate-driven, not adoption-driven.

Top-down “AI-first” directives without bottom-up evangelists lead to confused users and poor ROI; Wu recommends a dedicated internal tiger team to experiment, document, and teach workflows.

In AI product building, customer requests can lock you into dead-end scaffolding.

Because models improve quickly, customers may ask for optimizations of today’s workaround (vector stores, agent frameworks, skills files), but “the models will eat your scaffolding for breakfast,” making those bets obsolete.

Build products for where models will be, not where they are today.

Wu’s heuristic: target capabilities that are ~80% there; as models improve, the product “clicks” and becomes dramatically better without rebuilding from scratch.

A ‘one-person billion-dollar startup’ implies a broader boom in micro-SaaS and bespoke tools.

Wu argues the second/third-order effect is many small startups selling narrowly tailored software that lets ultra-lean companies outsource functions like support and ops—potentially a golden age for B2B SaaS.

Two near-term capability leaps will reshape products: longer-running agents and better audio.

He expects coherent multi-hour tasks to become common within 12–18 months, changing how you design feedback/guardrails, and believes audio is underrated given how much business happens via calls and speech.

Notable Quotes

Ninety-five percent of engineers use Codex. One hundred percent of our PRs are reviewed by Codex.

Sherwin Wu

Engineers are becoming tech leads. They're managing fleets and fleets of agents.

Sherwin Wu

It literally feels like we're wizards casting all these spells.

Sherwin Wu

This team doesn't have that escape hatch.

Sherwin Wu

The models will eat your scaffolding for breakfast.

Sherwin Wu

Questions Answered in This Episode

On the “95% use Codex” stat: what counts as “use” (drafting, refactoring, tests, review), and how do you measure it reliably?

What are the most common failure patterns you see when engineers run 10–20 parallel agent threads, and what guardrails reduce “Sorcerer’s Apprentice” outcomes?

From the 100% Codex-written codebase experiment: what specific repo artifacts (e.g., agents.md, patterns docs, code comments) produced the biggest improvement in agent success?

You said context is usually the issue when an agent stalls—what’s your step-by-step debugging playbook to diagnose missing/incorrect context?

How do you decide which PRs are safe for “Codex-only” review versus requiring a human reviewer, and what risk signals trigger escalation?

Transcript Preview

Sherwin Wu

Ninety-five percent of engineers use Codex. One hundred percent of our PRs are reviewed by Codex.

Lenny Rachitsky

For engineers, I don't know what job has changed more in the past couple years.

Sherwin Wu

Engineers are becoming tech leads. They're managing fleets and fleets of agents. It literally feels like we're wizards casting all these spells, and these spells are kinda like going out and doing things for you.

Lenny Rachitsky

What do you think people aren't pricing in yet?

Sherwin Wu

The second or third order effects of the one-person billion-dollar startup. To enable a one-person billion-dollar startup, there might be a hundred other small startups building bespoke software. So I think we might actually enter into a golden age of B2B SaaS.

Lenny Rachitsky

I've been hearing more and more there's this stress people feel when their agents aren't working.

Sherwin Wu

There's a team that's actually doing an experiment right now within OpenAI, where they are maintaining a one hundred percent Codex-written code base. They run into the exact problems that you're describing, and so usually you're like, "All right, I'll roll up my sleeves and figure it out." This team doesn't have that escape hatch.

Lenny Rachitsky

You've shared that listening to customers is not always the right strategy in AI.

Sherwin Wu

The field and the models themselves are just changing so, so quickly. They tend to, like, disrupt themselves. The models will eat your scaffolding for breakfast.

Lenny Rachitsky

What's your advice to folks that are like, "Okay, I don't wanna miss the boat?"

Sherwin Wu

Make sure you're building for where the models are going and not where they are today. There's a quote from Kevin Weil, our VP of Science here, and he likes saying: "This is the worst the models will ever be."

Lenny Rachitsky

[upbeat music] Today, my guest is Sherwin Wu, head of engineering for OpenAI's API and developer platform. Considering that essentially every AI startup integrates with OpenAI's APIs, Sherwin has an incredibly unique and broad view into what is going on and where things are heading. Let's get into it after a short word from our wonderful sponsors.

Today's episode is brought to you by DX, the developer intelligence platform designed by leading researchers. To thrive in the AI era, organizations need to adapt quickly, but many organization leaders struggle to answer pressing questions like: Which tools are working? How are they being used? What's actually driving value? DX provides the data and insights that leaders need to navigate this shift. With DX, companies like Dropbox, Booking.com, Adyen, and Intercom get a deep understanding of how AI is providing value to their developers and what impact AI is having on engineering productivity. To learn more, visit DX's website at getdx.com/lenny. That's getdx.com/lenny.

Applications break in all kinds of ways: crashes, slowdowns, regressions, and the stuff that you only see once real users show up. Sentry catches it all. See what happened, where, and why, down to the commit that introduced the error, the developer who shipped it, and the exact line of code all in one connected view. I've definitely tried the five tabs and Slack thread approach to debugging. This is better. Sentry shows you how the request moved, what ran, what slowed down, and what users saw. Seer, Sentry's AI debugging agent, takes it from there. It uses all of that Sentry context to tell you the root cause, suggest a fix, and even opens a PR for you. It also reviews your PRs and flags any breaking changes with fixes ready to go. Try Sentry and Seer for free at sentry.io/lenny, and use code Lenny for one hundred dollars in Sentry credits. That's S-E-N-T-R-Y.io/lenny.

[upbeat music] Sherwin, thank you so much for being here, and welcome to the podcast.
