How Spotify runs agents across 20M+ lines of code, with Niklas Gustavsson

At Spotify, anyone can describe an idea and have Claude build a working prototype in their real apps in an hour or two. VP of Engineering Niklas Gustavsson walked us through it.

Claude Cowork: anthropic.com/product/claude-cowork

Claude Code: anthropic.com/product/claude-code

Niklas Gustavsson (guest)

Jun 29, 2026, 26 min

WHAT IT’S REALLY ABOUT

Spotify scales agentic coding and verification across massive monorepos

Spotify built agent-driven automation because its codebase grew far faster than its engineering headcount, making manual migrations and maintenance unsustainably slow.
The internal platform Honk evolved from deterministic, script-based “fleet management” to Claude Agent SDK-powered agents that can execute tasks, run CI, and safely automate PRs at scale.
Verification and test automation are positioned as the critical enabling layer for closed-loop agent development, allowing auto-merges without requiring every owning team to review each change.
Spotify reports large measurable productivity gains (e.g., higher PR frequency and a majority of PRs AI-authored) while aiming to keep quality metrics neutral through ongoing reliability investments.
Beyond engineering, Spotify is expanding agent-enabled prototyping so designers, PMs, and executives can build and share working app prototypes quickly via an internal prototype “app store.”
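
The two productivity figures mentioned above (change in PR frequency and the share of AI-authored PRs) are simple ratios. The sketch below shows one way they might be computed; the `PR` record, its `ai_authored` flag, and the sample numbers are hypothetical and do not reflect how Spotify actually attributes or measures its PRs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PR:
    """Hypothetical PR record; the attribution flag is an assumption."""
    author: str
    ai_authored: bool


def ai_authored_share(prs: Sequence[PR]) -> float:
    """Fraction of PRs flagged as AI-authored (0.0 for an empty list)."""
    if not prs:
        return 0.0
    return sum(pr.ai_authored for pr in prs) / len(prs)


def relative_change(before: float, after: float) -> float:
    """Relative change in a rate, e.g. PRs per engineer per week."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


# Illustrative values only, not Spotify data.
sample = [
    PR("alice", True),
    PR("bob", True),
    PR("carol", False),
    PR("dave", True),
]
share = ai_authored_share(sample)        # 3 of 4 PRs flagged
uplift = relative_change(4.0, 7.0)       # 4 -> 7 PRs per week
```

The hard part in practice is the attribution itself, that is, deciding when a PR counts as AI-authored, which this sketch takes as given.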

IDEAS WORTH REMEMBERING

5 ideas

Agents became viable once Spotify stopped treating changes as one-shot prompts.

Early attempts failed when they simply fed code to a model and requested a full transformation; success improved with decomposition, iterative workflows, and (initially) judging/verification to raise PR success rates.

Model quality reduced the need for an explicit “judge,” but not for verification.

Honk previously used a judge to raise PR success rates from roughly 20–30% to about 80%; as models improved, Spotify removed the judge, yet it still relies heavily on CI and tests as the core correctness gate.
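
The general shape of a verification loop like this can be sketched in a few lines. This is a minimal illustration, not Honk's implementation: `propose` stands in for an agent call, `verify` stands in for a CI or test run, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    passed: bool
    log: str = ""


def run_with_verification(
    propose: Callable[[str, str], str],
    verify: Callable[[str], CheckResult],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Request a change, verify it, and retry with the failure log as feedback.

    Returns the first patch that passes verification, or None if every
    attempt fails.
    """
    feedback = ""
    for _ in range(max_attempts):
        patch = propose(task, feedback)   # hypothetical agent call
        result = verify(patch)            # hypothetical CI/test run
        if result.passed:
            return patch
        feedback = result.log             # feed the failure into the next attempt
    return None


# Stubs standing in for a real agent and a real CI system.
def fake_agent(task: str, feedback: str) -> str:
    return "fixed patch" if feedback else "broken patch"


def fake_ci(patch: str) -> CheckResult:
    ok = patch == "fixed patch"
    return CheckResult(passed=ok, log="" if ok else "test_migration failed")


accepted = run_with_verification(fake_agent, fake_ci, "migrate config format")
```

With these stubs the first attempt fails, the failure log is passed back, and the second attempt is accepted. The gate is the test run itself rather than a second model's opinion, which is the distinction the summary draws between a judge and verification.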

Test automation is the price of safely auto-merging agent-generated changes.

Spotify’s shift from team-reviewed changes to auto-merged automated PRs forced stronger component-level tests, because owning teams are no longer guaranteed to be in the loop for every change.
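
One way to picture this operating model is as a merge policy in which tests, not reviewers, decide whether an automated change lands. The sketch below is an assumption-laden illustration: the `PullRequest` fields and the three outcomes are hypothetical and are not taken from Spotify's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    automated: bool            # opened by an agent or fleet-wide automation
    ci_passed: bool            # the CI pipeline is green
    component_tests_ran: bool  # the touched component's own tests executed


def merge_decision(pr: PullRequest) -> str:
    """Return 'block', 'auto-merge', or 'needs-review' for a pull request."""
    if not pr.ci_passed:
        return "block"
    # Automated changes merge without the owning team only when the
    # component's own tests exercised the change.
    if pr.automated and pr.component_tests_ran:
        return "auto-merge"
    return "needs-review"


decision = merge_decision(
    PullRequest(automated=True, ci_passed=True, component_tests_ran=True)
)
```

Under this policy, a component without meaningful tests falls back to human review, which is why broad auto-merging depends on test coverage first.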

Standardization improves agent performance as much as it improves human productivity.

Consistent frameworks, patterns, and tooling reduce ambiguity, letting agents “learn by example” from nearby code in large monorepos instead of encountering many divergent ways to do the same thing.

Large monorepos can be a strength for agents when retrieval and code reuse work well.

Despite concerns about indexing and size, Niklas reports Claude performs well in Spotify’s 20M+ LOC monorepo, often leveraging existing internal patterns to implement new changes more reliably.

WORDS WORTH SAVING

5 quotes

I found myself not using an IDE anymore. And like the, the way that I was working had completely changed. It changed that I had not seen in the 30 years that I've been doing this type of work.

— Niklas Gustavsson

Claude works amazingly well in those repositories and, um, I think one of the things we found is how good Claude is looking at other code, uh, in the repository to get, I guess, inspiration for the problem you're trying to solve.

— Niklas Gustavsson

We make something like 4,500 production deployments every day.

— Niklas Gustavsson

We're seeing a 75%-plus improvement in PR frequency, for example, uh, that we can directly attribute to AI tooling, and I think by now 73-ish percent of PRs are directly attributed to being AI authored.

— Niklas Gustavsson

Those types of things are, were unimaginable a year ago, and now we're doing them every day.

— Niklas Gustavsson

From biology to software engineering; “AGI moments” with LLMs

Monorepos vs polyrepos at Spotify (20M+ LOC backend monorepo)

Fleet management and automated maintenance migrations

Honk evolution: scripts → LLM decomposition/judging → Claude agents

Agent architecture: Kubernetes, tool access, user-added tools, CI loops

Verification loops, test automation, and auto-merge operating model

ROI measurement: PR frequency, AI-authored PR attribution, cost/token tracking

High-velocity delivery (4,500 deployments/day) and reliability trade-offs

Organization-wide prototyping and internal prototype app store

High quality AI-generated summary created from speaker-labeled transcript.
