How Spotify runs agents across 20M+ lines of code, with Niklas Gustavsson

At Spotify, anyone can describe an idea and have Claude build a working prototype in their real apps in an hour or two. VP of Engineering Niklas Gustavsson walked us through it.

Claude Cowork: anthropic.com/product/claude-cowork
Claude Code: anthropic.com/product/claude-code

Niklas Gustavsson (guest)

Jun 29, 2026 · 26 min

CHAPTERS

0:10 – 0:57
From “no one will use an IDE” skepticism to a new default workflow
Niklas recalls initially doubting predictions that IDE-centric work would quickly disappear—then experiencing that shift himself within months. They frame the conversation around how dramatically day-to-day software development has changed with agents.
- Personal disbelief about abandoning the IDE, followed by rapid adoption
- Work style changed faster than anything in decades of engineering
- Host relates similar internal experience with only a short head start
0:57 – 1:26
Biology to software: entering programming through early “big data” genomics
Niklas explains his formal background in molecular biology and how genome sequencing data pushed him to improve his programming skills. What began as a sabbatical pivot turned into a ~30-year software career.
- PhD work in genomics created a need for programming
- Early exposure to data-intensive workflows
- A temporary switch became a long-term career transition
1:26 – 3:04
Early “AGI moments”: automating code changes before Claude, then the Opus breakthrough
They discuss Spotify’s early experiments using LLMs to automate code migrations and why it was hard at first. Niklas notes a major personal leap when models became capable enough to handle real problems without heavy prompt engineering.
- Initial attempts at LLM-driven code changes were a struggle
- Pre-Claude / early GPT era showed directional promise
- Opus 4.5 shifted from “smart autocomplete” to reliable problem solver
- Reduced need for manual IDE edits and prompt engineering
3:04 – 4:18
Niklas’s day-to-day agent setup: tmux, many sessions, and worktrees
Niklas describes a practical, terminal-centric workflow with multiple concurrent agent sessions. He runs several Claude instances in parallel across worktrees, switching between monorepos and polyrepos as needed.
- Uses tmux with multiple tabs/panes for agents and diffs
- Typically 5–10 active terminal tabs
- Matrix of Claude sessions + terminals aligned to worktrees
- Temporary sessions for smaller polyrepos when required
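
The one-worktree-per-session layout above can be sketched as a small command builder. This is only an illustration, not Niklas's actual setup: the repository path, the task names, and the `session_commands` helper are assumptions. It relies on two real commands, `git worktree add -b <branch> <path>` and `tmux new-window -n <name> -c <dir>`, and only builds the command lines rather than running them.

```python
from pathlib import Path


def session_commands(repo: Path, tasks: list[str]) -> list[list[str]]:
    """Build (but do not run) one git worktree and one tmux window per task."""
    commands: list[list[str]] = []
    for task in tasks:
        # Hypothetical naming scheme: a sibling directory per task.
        worktree = repo.parent / f"{repo.name}-{task}"
        # A new branch checked out in its own directory, so parallel
        # agent sessions never share a working tree.
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", task, str(worktree)]
        )
        # A tmux window named after the task, started inside that worktree.
        commands.append(["tmux", "new-window", "-n", task, "-c", str(worktree)])
    return commands


if __name__ == "__main__":
    # Hypothetical repo path and task names.
    for cmd in session_commands(Path("/src/backend"), ["fix-login", "bump-deps"]):
        print(" ".join(cmd))
```

Keeping each session in its own worktree means diffs from different agents stay separate, which is what makes running several at once manageable.
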
4:18 – 5:27
Agents in a 20M+ LOC monorepo: surprising effectiveness at scale
They discuss whether monorepos or polyrepos are a better fit for agentic development, given indexing and scale concerns. Niklas reports Claude performs exceptionally well even in Spotify’s massive backend monorepo, especially by learning patterns from nearby code.
- Initial worry about monorepo indexing and scale
- Backend monorepo exceeds 20 million lines of code
- Claude excels by referencing similar code in-repo for “inspiration”
- Monorepo consistency can amplify agent effectiveness
5:27 – 7:12
Why Spotify built “fleet management”: maintenance automation at company scale
Niklas explains the organizational pressure that drove automation: the codebase grew far faster than engineering headcount, while product ideas kept expanding. Spotify shifted from asking teams to manually execute migrations to orchestrating codebase-wide mutations across thousands of repos.
- Codebase growth outpaced engineers (reported ~7x)
- Maintenance work (Java upgrades, library/API migrations) slowed shipping
- Old process: broadcast instructions; hundreds of teams repeat manual work
- New approach: automate mutations across thousands of repos
- Result: millions of automated PRs merged over time
7:12 – 8:43
Hitting the ceiling of deterministic refactors—and turning to LLMs
They dive into why static/AST-based transformations became unwieldy: code’s API surface and edge cases explode in complexity. This limitation motivated experimenting with early LLMs, which initially failed due to model capability and naïve one-shot prompting, then improved with decomposition and evaluation techniques.
- Deterministic scripts struggle with real-world code variance
- AST transformations balloon into thousands of lines to handle edge cases
- Early LLM approach: one-shot refactor attempts didn’t work well
- Improvements came from better models plus task decomposition
- Early use of judges/evaluators to raise success rates
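
To make the edge-case problem concrete, here is a minimal sketch of a deterministic AST rewrite using Python's standard `ast` module. It is not Spotify's tooling (the talk mentions Java migrations); the function names `fetch_user` and `get_user` are hypothetical. The transformer renames direct calls but misses a call made through an alias, which is the kind of variance that forces such scripts to keep growing.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite direct calls to `old_name(...)` into `new_name(...)`."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func = ast.copy_location(
                ast.Name(id=self.new_name, ctx=ast.Load()), node.func
            )
        return node


def migrate(source: str) -> str:
    """Apply the hypothetical fetch_user -> get_user migration to source text."""
    tree = RenameCall("fetch_user", "get_user").visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))  # needs Python 3.9+


SOURCE = """
profile = fetch_user(user_id)
alias = fetch_user
other = alias(user_id)
"""

if __name__ == "__main__":
    print(migrate(SOURCE))
```

The first line is migrated, but `alias = fetch_user` and the call through `alias` are left untouched. Covering aliases, imports under other names, and reflection each needs more special-case code.
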
8:43 – 9:45
Honk evolves from migration bot to ubiquitous internal agent platform
Niklas describes how internal experimentation consolidated into Honk, iterated many times before formal “V2.” What began as an orchestrator for automated code changes expanded as engineers used it for broader tasks, including Slack-triggered workflows.
- Honk originated from many internal prototypes and iterations
- Started as code-change automation + scheduling across repos
- Engineers expanded usage beyond migrations (e.g., Slack-triggered tasks)
- “V2” is effectively a much later iteration (V8-ish)
9:45 – 10:01
Honk architecture today: Agent SDK on Kubernetes, extensible tools, and CI verification
They outline Honk’s core components: an Agent SDK runtime in Kubernetes with tool access and a strong verification loop via CI builds. Niklas notes the platform shifted from allow-listed tools to user-authored tools, enabling broader internal integration including macOS builds for iOS work.
- Runs the Agent SDK inside Kubernetes pods
- Tooling model evolved from allow-listed tools to user-authored tools
- Key capability: run verification/CI builds in the loop
- Supports Linux and macOS CI; macOS critical for iOS pipelines
- Can integrate deeper UI validation (e.g., simulator-based testing)
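
The "verification in the loop" idea can be sketched as a propose, verify, retry cycle. This is a generic illustration under stated assumptions, not Honk's implementation or the Agent SDK's API: `propose` and `verify` are hypothetical callables standing in for the agent and the CI build, and the stubs at the bottom exist only to exercise the loop.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BuildResult:
    passed: bool
    log: str


def run_with_verification(
    propose: Callable[[str, Optional[str]], str],
    verify: Callable[[str], BuildResult],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for a change, verify it, and feed failures back until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        patch = propose(task, feedback)  # hypothetical agent call
        result = verify(patch)           # hypothetical CI/build call
        if result.passed:
            return patch
        feedback = result.log            # the next attempt sees the failure log
    return None  # out of attempts; leave the task for a human


# Stubs for illustration only: the first patch fails, the retry passes.
def fake_propose(task: str, feedback: Optional[str]) -> str:
    return "patch-v2" if feedback else "patch-v1"


def fake_verify(patch: str) -> BuildResult:
    return BuildResult(passed=(patch == "patch-v2"), log="test_login failed")


if __name__ == "__main__":
    print(run_with_verification(fake_propose, fake_verify, "upgrade library"))
```

The build log is the only signal passed back to the agent here, which is why the quality of the tests behind `verify` bounds how far such a loop can be trusted.
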
10:01 – 12:01
From “judge agents” to stronger models: raising success rates and simplifying the loop
Niklas explains Honk previously relied on a separate judge to validate outputs, dramatically boosting PR success rates early on. As models and agent robustness improved, Spotify removed the judge step, simplifying the system while maintaining high performance.
- Early judge increased success from ~20–30% to ~80%
- Judge was essential during early-model era
- Model capability improvements reduced the need for a dedicated judge
- Agent + verification loop became sufficient over time
12:01 – 14:45
Verification as a forcing function: better tests enable auto-merge and autonomy
They discuss why verification is the central bottleneck for closed-loop agentic development. Spotify’s move toward automated PRs required raising expectations around test automation, since teams would no longer manually review every change before merge.
- Verification is critical for autonomous, multi-step agent workflows
- Automation changed ownership expectations: teams may not see every PR
- Improved test automation made auto-merging safe and scalable
- Quality can stay neutral while speed improves, if infra investments are made
14:45 – 16:38
Reliability at extreme deployment velocity: 4,500 production deploys/day
Niklas connects agent-driven speedups to reliability needs at Spotify’s operational scale. With thousands of daily deployments, failures are inevitable without strong quality and reliability practices; Spotify continually invests to keep pace.
- Spotify performs ~4,500 production deployments per day
- More speed requires more reliability investment, not less
- Continuous optimization to reduce idea-to-production time
- Fast feedback loops improve product iteration and validation
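
A quick back-of-the-envelope calculation shows why failures are inevitable at this volume. Only the 4,500 deploys/day figure comes from the talk; the per-deploy failure rates below are hypothetical values chosen for illustration, not Spotify's numbers.

```python
DEPLOYS_PER_DAY = 4_500  # figure quoted in the talk


def expected_failures(deploys: int, failure_rate: float) -> float:
    """Expected number of failed deploys per day at a given per-deploy failure rate."""
    return deploys * failure_rate


if __name__ == "__main__":
    # Hypothetical failure rates: 1%, 0.1%, 0.01% of deploys.
    for rate in (0.01, 0.001, 0.0001):
        failures = expected_failures(DEPLOYS_PER_DAY, rate)
        print(f"failure rate {rate:.2%}: about {failures:.2f} failed deploys/day")
```

Even if only one deploy in a thousand went wrong, that would still mean several bad deploys every day, so reliability work has to scale with deployment speed.
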
16:38 – 19:20
Measuring ROI: PR attribution, cost/benefit, and linking work to user value
They cover how Spotify quantifies impact: large jumps in PR frequency and a high fraction of AI-authored PRs provide clear signals. The harder part is building attribution from PRs to work items, experiments, and ultimately user/revenue outcomes—while accounting for compute/token costs.
- ~75%+ improvement in PR frequency attributed to AI tooling
- ~73% of PRs reported as AI-authored/attributed
- Goal: connect PRs/deployments to work items and A/B test outcomes
- Increasing need to measure cost (tokens/time) vs productive output
- ROI discussion shifts from obvious gains to precise accounting
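
The three kinds of metric mentioned above (AI-authored share, change in PR frequency, and token cost per PR) reduce to simple ratios. The sketch below is an assumption about how they might be computed, not Spotify's pipeline: the `PullRequest` record, its `author_kind` label, and all sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    author_kind: str  # "ai" or "human"; a hypothetical attribution label
    tokens: int       # tokens spent producing the PR (0 for human-only work)


def ai_share(prs: list[PullRequest]) -> float:
    """Fraction of PRs attributed to AI."""
    return sum(pr.author_kind == "ai" for pr in prs) / len(prs)


def frequency_change(before_per_week: float, after_per_week: float) -> float:
    """Relative change in PR frequency; 0.5 means +50%."""
    return (after_per_week - before_per_week) / before_per_week


def tokens_per_ai_pr(prs: list[PullRequest]) -> float:
    """Average token cost of an AI-attributed PR."""
    ai_prs = [pr for pr in prs if pr.author_kind == "ai"]
    return sum(pr.tokens for pr in ai_prs) / len(ai_prs)


# Toy data, not real measurements.
SAMPLE = [
    PullRequest("ai", 1_000),
    PullRequest("ai", 2_000),
    PullRequest("ai", 3_000),
    PullRequest("human", 0),
]

if __name__ == "__main__":
    print(ai_share(SAMPLE), frequency_change(10, 15), tokens_per_ai_pr(SAMPLE))
```

These ratios are the easy part. The harder attribution problem described in the talk, linking each PR to a work item and then to an experiment outcome, needs identifiers that join those systems.
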
19:20 – 26:10
Advice for leaders and engineers: standardize, embrace new roles, and prototype faster
Niklas advises leaders to keep investing in foundational practices—testing, verification, and standardization—because consistency makes agents more effective. He encourages engineers to focus on problem-solving outcomes, using agents to enter unfamiliar codebases faster and to unlock rapid prototyping across the whole company, including non-engineers.
- Foundational investments (tests/verification) remain essential in the agent era
- Standardization improves agent comprehension and consistency
- Engineers can solve broader problems and contribute across codebases faster
- Shift from implementation time to higher-level thinking and exploration
- Company-wide prototyping unlock: internal app store; contributions from everyone up to execs
