Coding is no longer the constraint: Scaling devex to teams and agents at Spotify

At Spotify, 96% of engineers now code with AI and PR frequency is up 60% — so the constraint has moved from writing code to orchestrating it. Niklas Gustavsson, Chief Architect & VP of Engineering, shares how Spotify built Honk, a background coding agent running on the Agent SDK, plugged it into their Fleetshift migration platform and Backstage software catalog, and learned that the same standardization that makes teams effective makes agents effective too. Walk away with Spotify's bets on developer experience for agents — and why firmer guardrails are accelerators, not constraints.

May 20, 2026 · 27m

EVERY SPOKEN WORD

0:00 – 0:16
Intro
Speaker
  [onstage rock music] [audience applauding]
0:16 – 2:18
Spotify’s AI transition: engineering scale and rapid tool adoption
Speaker
  Hey, everyone. Yeah, so I'm Niklas. Was very surprised to see my face on screen [chuckles] earlier because I had completely forgotten that Boris was gonna mention Spotify as part of the keynote. So I'm here to give you a bit of a rundown on how we're approaching the AI transition at Spotify. So let me start with a little bit of an introduction to Spotify. Uh, anyone in here who's a Spotify user? Oh, lots of hands. Good. Um, so we're a fairly sizable engineering org at this point, close to three thousand engineers. We've spent many years trying to optimize our developer experience and how we build products. We try to make sure that it's as easy as possible to deploy and sh-ship changes to our users. One way to illustrate that is that we do around four and a half thousand deployments every day to our production environment. We run on a mix of repositories. I'll come back to this later. Some are very large monorepos. Our back end is in a forty million lines of code monorepo. Uh, and then we have lots and lots of smaller, uh, polyrepos, thousands of them. The AI transition for us has been a journey of very rapid adoption curves. We roll out tools internally all the time to make our developers more productive. But we have never seen the rate of adoption that we've seen rolling out, uh, AI coding tools. And you can see in particular how Claude Code, uh, orange in this diagram, completely exploded. It's a little bit hard to see due to the holi-holiday break, but it, it really happened around the Opus 4.5 release in November last year. And since then, growth and usage of, uh, Claude in particular, but
2:18 – 3:49
Measuring impact: AI usage rates and productivity signals (PR frequency)
Speaker
  AI tools in general, has gone [chuckles] completely bananas. And today, more than ninety-nine percent of our engineers use AI coding tools every week. And we do a recurring engineering survey to all our engineers, and in the latest one who just came, just came in last week, ninety-four percent of our engineers reports that using AI tooling has helped them become more productive, uh, and that's with a record high self-assessed productivity. We can also look at productivity in other ways. One way is to look at PR frequency as a proxy for, for how fast and how much we're able to ship. We're seeing today an increase of seventy-six percent in PR frequency. As I was working on these slides over the last two weeks, I had to change this number because it keeps growing all the time. And by now, by far most of the PRs that we ship are authored by an AI agent together with the developer. One thing you can see in this curve, if you look, you can see it's actually hard to see here, but it, it, the, the number of PRs has been very slowly growing over a longer period of time. But you can see that jump again happening around the Opus 4.5 release. That was when this took off for, took off for us. So this of course then means
3:49 – 5:21
The hidden problem: codebase growth outpacing engineers and the maintenance tax
Speaker
  that we also have an explosive growth in our code. Uh, luckily, that's something that we came prepared for. We've seen this for a long time also prior to AI. In fact, a few years ago, uh, we noticed that our code base, our production code base was growing seven times faster than the number of engineers. So that meant that engineers would spend more and more of their time maintaining our existing code base compared to being able to build new features and value for our users. So we realized that we needed to fix this, so we started, uh, an effort to automate as much of that maintenance as possible. A lot of that maintenance comes down to pretty dull things that we just need to do. You know, migrate from this version to this version, deprecate this API, fix this security vulnerability, those types of things. But that took a lot of time for our developers, and the way we typically did those migrations back then was to send out some migration path to hundreds of teams saying, "Hey, you need to upgrade from this Java version to this Java version, uh, for your components." The teams would go ahead and do that, and this would typ-typically take us months to complete one of those upgrades across many thousands of components. That was not fun for anyone. Uh, in, in that same engineering survey back then, migrations
5:21 – 6:22
Fleet management pre-AI: Fleetshift and automated maintenance PRs at massive scale
Speaker
  was the top thing that users, or sorry, de- our developers were frustrated about. So we imagined, instead of doing this like component per component and fairly manually, can we imagine a way where we do this as a way to mutate our entire fleet of components? Figure out a way to do that. And we, and we built this out, built out the infrastructure for this, something we call fleet management, and the underlying system that we use is called Fleetshift. And today, up, up until today, we've now merged two and a half million of those automated maintenance PRs, work that our developers did not have to do. The vast majority of those, the green part of this graph, have been auto-merged. So there's no human in the loop. It's automation creating the, the PR to begin with, automation validates that PR is safe to merge, and then go ahead and merge it without any developer needing to care about that change. This happens every day. We ship thousands of these every
6:22 – 7:54
Why deterministic scripts broke down: complexity, corner cases, and Hyrum’s Law
Speaker
  day. So that was all pre-AI. Uh, and one thing that we noted pretty quickly was that this works really well for simple changes. That might be changes to configuration, it might be bumping some dependency in your build file, those types of things. Works great. But once you get into a little bit more complex changes like replacing API calls, those types of things, the scripts that we used to run these shifts across our fleet became incredibly complicated. Code, as it turns out, has a very, very wide API surface. There are many, many ways to achieve the same thing if it's just calling method. And when you write that script and you run that across millions of lines of code and thousands of components, you are going to find every corner case, and you need to deal with that in your migration script. There's even a word for... a term for this. It's called Hyrum's Law, coming from an engineer at Google that discovered this many years before we then ran into it. So pretty early on as LLMs came about, we figured that, hey, instead of writing these deterministic scripts to do these code modifications, can we use an LLM for this? So very early we started iterating on trying to do this prior to Claude and, and similar tools. And
7:54 – 9:55
Introducing Honk: LLM-powered fleet code changes and verification in CI
Speaker
  we noticed it was challenging initially. The models were just too stupid. The way we were trying to do it was just too stupid. But over time, on many iterations, we started figuring out the patterns for it, and the models got better. Out of this came a tool that we now called Honk. Uh, Boris mentioned this this morning. Has a silly name and, [chuckles] and a silly icon, but it's a very useful tool, as it turns out. And Honk is really the result of all of these iterations of us trying, trying different ways of solving this problem of like automating these still relatively simple code changes, but again, applied over many, many variants of code. It started out very differently, but today Honk is, has Claude under the hood using the Agent SDK, and it wraps up the Agent SDK ins-inside our own harness inside a Kubernetes pod, so we can schedule many of these running in our cloud environment, and we give it access to a set of trusted tools. Um, the chart here just says the verification tools, but there's actually more tools that it has available to it. And for verification, it's able to run builds of the code, it running in our CI environment. So one thing that is important to us is that we can run our builds across multiple operating systems, for example, because our clients runs on many different operating systems. So Honk has available tools that it can use to verify that its changes are correct. And again, we run many of these every day. And then we integrate this into that fleet management tooling that I mentioned before. So we use Fleetshift, our tool that I showed the graph before, to schedule and orchestrate these changes across our thousands of repositories, and Honk sits in the middle
9:55 – 10:56
Operational workflow: tracking shift status and compressing migrations from months to days
Speaker
  doing the actual code changes. And it might look something like this. In this case, this is a fairly small migration targeting thirty-nine repositories. But for a team that owns this, they can go in and see what's the status of this particular shift today, how many PRs has been created, how many has been merged, how many failed in CI, so I need to take a look at them, those types of things. And as Boris mentioned this morning, we're seeing pretty significant, uh, time savings from this. What used to be what I described before, hundreds of teams doing migrations for their components, taking weeks and weeks or months, now can be done by a single engineer in a few days. The latest Java migration that we did, we run our backend mostly on Java on the JVM. The latest Java migration we did took three days using these tools. And we're making this now available, so we have a commercial offering
10:56 – 11:26
From internal platform to product: Fleetshift/Honk via Backstage commercial offering
Speaker
  for other companies through our Backstage developer portal, and we're making this available as a product in that packaging. So if this is something that is relevant for your company, you can take a look there. But as it turns out, developers are very resourceful and innovative. So pretty quickly they-- folks figured out that, hey, hmm, this Honk thing that we run for all of these migrations, how about I figure out how I can call that over Slack and have it do things for me that way? So
11:26 – 11:56
Developers repurpose Honk: Slack-driven agent requests and PR generation
Speaker
  similar to how you might invoke Claude or other tools over Slack, you can do that with Honk at Spotify as well. And it's become a very common way that people will have a Slack conversation for something, then just @mention Honk. Honk goes off and work on that and comes back with a PR. So we're seeing more and more of these patterns evolve around Honk. And in fact, yesterday we released Honk V2. The V2 versioning is a little bit off
11:56 – 14:00
Honk V2 and Chirp: orchestration, shared sessions, and ‘multiplayer’ agent collaboration
Speaker
  because I think it's actually like the eighth version of Honk, so I don't know what we did with the versioning, but it doesn't matter too much. Uh, so yeah, so this week we have Hack Week at Spotify and we released the alpha of Honk V2. Which is a pretty significant additional features for Honk, and it really now builds towards this world where developers are using it more interactively. So we've integrated it with our agent orchestration tool that we call Chirp. This is similar to, uh, what you can do with Claude Agents or with Agent Dek or similar tools, but it's a little bit more features and it's integrated into our infrastructure. This is the way that you can run many, many agent sessions at the same time and coordinate those, those types of things. And Honk is built into that, so you can use Chirp to schedule Honk jobs, uh, for example. You can also collaborate with other developers on shared sessions. So instead of it's being you in front of your agent, you're now sharing that agent session with more people, and you can collaborate on that and give feedback and ideas and whatnot on that. So basically imagine, uh, Google Docs or something similar, but for Claude. And that then also, uh, groups up into larger efforts. So imagine you're working on a completely new feature or product, you're working with a team on that, you can have a project that you're sharing, and in that you can have many sessions with Honk that where you collaborate over, um, working towards whatever that goal is. This is also available on any device and whatnot, so users can, users can use them from wherever they are. There are lots, lots more features that we're rolling out going forward. We're very excited about Honk V2, and in particular, I'm gonna say personally, I'm very excited about these like multiplayer features of imagining how agents actually collaborates in with multiple developers and teams. All
14:00 – 16:32
Optimizing the codebase for agents: standardization as a force multiplier
Speaker
  right. Let me switch gears a little bit. So I want to, to also talk about how we try to optimize our code base to make agents as effective as possible in our code. So we've had for many, many years, more than-- I've been at Spotify for fifteen years, and this happened prior to I arrived, so I don't actually know exactly how old it is. But we've had this belief on the fewer technologies that we use, the faster we will be able to go. And this basically comes down to a few different aspects. One, if we have a set of technologies that we know really, really well, we're really like deep experts on them, we will be able to build better things on top of those. We can also eliminate a lot of small decisions for teams. Instead of having to pick the technology for everything you're building, there's a ready set, the ready set of, um, technologies available to you that basically hopefully solves your problem. It also means that it's much easier to collaborate. If you're working with some other team on their components, and their components lu-look roughly the same as yours, it's gonna be easier for you to contribute to those. And similarly, if we need to move components around or m- developers move to a different team, things look roughly the same over there. So if you look at our-- If you look at a typical backend service at Spotify, they will all look very similar. Same technology stack, roughly the same design patterns, and so on. And we think this makes sense for agents as well. So for many years now, we've been driving towards a more and more standardized stack. Less unnecessary variance, at least. We want some level of variance. We want to be-- We want to experiment with new technologies, evaluate new things that could be good for us, but we don't want to do that, uh, willy-nilly. We want to be intentional about it. And we, we see that this leads to more effective teams at Spotify. And we believe that it also leads to more effective agents. 
Simply, if Claude has a lot of co- other code to look at and that code looks roughly consistent, Claude will do a better job. That's what we're seeing. And we actually have code bases that are, that are l- more fragmented, and we can actually see Claude perform
16:32 – 18:33
Backstage as the system of record for humans and agents: catalog, ownership, and actions
Speaker
  worse in those code bases. And the starting point for this is, I mentioned Backstage before. Backstage is our developer portal. It used to be that it provided a single pane of glass for us developers. Prior to Backstage within Spotify, we had, I think, roughly like a hundred different tools that you would-- you as a developer would go to. There was one tool to check your deployments, one to look at CI, one to look at AB tests and whatnot. And it was very, very confusing. All of those tools were kind of shit as well, like they weren't particularly good. So we, we thought there was an opportunity to consolidate this and provide a better experience for our developers. And it really started with this notion of a catalog of all our software. I mentioned before that we have thousands of components in production, and Backstage came about just as a way to know who owns a b- one of those components. Let's say we have an incident and I need to be able to page someone on the owner of that team. I-- Before Backstage, I couldn't even figure out who that owner was. So it started as a way to just having a catalog for that. Over the, over the years, it's then grown into having lots and lots of tools around those components as well. So today, for as a human developer, everything I do, uh, when I need to take an action on some of our source co-- uh, some of our software components, I'm gonna do that in Backstage. And as it turns out, that's equally useful for agents. So we expose all of these as MCPs or command line tools for our agents, and Claude can go look up who's an owner for something, and it can go ping that team on Slack if it needs to ask questions about it, for example. This has turned out to be incredibly useful for us, and in particular, as we've scaled up, it allows us to keep track of everything we have going on. It is
18:33 – 21:06
Driving consistency: tech radar, ‘golden state,’ Sound Check, and automated enforcement
Speaker
  also a way for us to drive our standardization. So I mentioned this before, we have strong recommendations for which technologies to use for a particular problem, and we describe this in a few different ways. We have a technology radar, as many companies do, that just like lists all the technologies that are available and, and what state they're in. Like this one we recommend using, this one we don't recommend using, and so on. We also have what we call golden state. So this is essentially for a particular type of component. If you're this type of backend service or you're this type of, uh, iOS view, these are the technologies and practices that we recommend that you use. And we have a way or a UI in Backstage that we call Sound Check that where you as a team can go in and self-assess this. This is an example of such a view. You can see here some component, and it has, um, a requirement to define a valid owner. That was what I was talking about before. This allows us to then make our code base much, much more consistent and has been something that we've been driving over several years. It's been, um, very, very powerful and set us up, set us up well for where we are now with AI. And we then also combine that with static analysis and linting. So these things are then implemented in our code bases as checks, so that when Claude works in our code base, it will get immediate feedback on if it's using the right set of technologies and the right set of design patterns. So if Claude comes up with something that a way to, I don't know, call gRPC in a way that, um, we know is, is not optimal for our infrastructure, Claude will get feedback from our lint system to, to correct that. And we think this is super useful both for our developers and for our agents. And we see this all the time as when I, uh, work with Claude in our code base, I will see Claude run into these lint checks all the time and correct itself. 
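The lint-feedback loop described here can be illustrated with a minimal sketch. It assumes a hypothetical policy table mapping discouraged modules to recommended ones; the module names (`legacy_rpc`, `grpc_client`, and so on) and the message format are invented for illustration and are not Spotify's actual rules or tooling. The point is only that each finding names the location and the preferred alternative, so an agent reading the output can correct itself.

```python
import ast

# Hypothetical policy: discouraged module -> recommended replacement.
# These names are illustrative assumptions, not Spotify's actual rules.
DISCOURAGED = {
    "legacy_rpc": "grpc_client",
    "old_config": "std_config",
}


def lint_imports(source: str, filename: str = "<agent-change>") -> list[str]:
    """Return one actionable message per discouraged import found in source."""
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            root = module.split(".")[0]  # compare on the top-level package
            if root in DISCOURAGED:
                messages.append(
                    f"{filename}:{node.lineno}: '{root}' is not in the golden "
                    f"state; use '{DISCOURAGED[root]}' instead"
                )
    return messages


if __name__ == "__main__":
    change = "import legacy_rpc\nfrom old_config.loader import load\nimport json\n"
    for message in lint_imports(change):
        print(message)
```

Running a check like this in the same place developers already see lint output means humans and agents get identical feedback, which matches the "useful both for our developers and for our agents" point in the talk.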
It's an awesome way to, to drive this type of standardization. All right, I'll try to sum this up. So first, hopefully this came through, but
21:06 – 23:40
What changes (and what doesn’t) with agents: verification, metrics, and human judgment
Speaker
  the need for strong engineering practices has not gone away with agents. It remains as important as it was before. Boris mentioned verification this morning. We fully agree with that. The ability to have your code being well tested and having your agents being able to invoke those tests, either Claude running locally or Honk with the verification tools that I showed before, that is the way to make your agents be much more autonomous and come up with better solutions in your, in your code. Similarly, what I just talked about in terms of making sure that your code base is, um, consistent and, and it's well defined what developers and agents are supposed to do turns out to make agents work much, much better, at least in our case. We're also, um, very careful about trying to measure everything, measure every aspect of our developer experience. So we instrument all our infrastructure, we, uh, instrument all our PRs and so on, and we can collect that and measure how we're doing. So some of the numbers that I've been showing here today comes from that instrumentation, and we have tons and tons of metrics that we're tracking. We believe that human judgment matters just as much as it did before or even more now that we're able to move faster. We need to figure out where to apply that human judgment, though. So I mentioned the, um, increase in PR frequency. The flip side of that is that we now have seventy-six percent more PRs to review. Developers, one of our most frequent feedbacks at the moment is there's just too many freaking PR, PRs to review. So we need to figure out where we apply humans to review those PRs where it matters the most. So that won't be all PRs. We're already auto-approving some PRs that we think are safe enough to merge without human review, and then we try to focus the human review where it really matters. And I think this will be recurring. 
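One way to picture "auto-approve what is safe, focus humans where it matters" is a small routing function. This is a sketch under stated assumptions, not how Spotify decides: the `PullRequest` fields, the low-risk file suffixes, and the line threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # All fields are hypothetical signals, not Spotify's actual schema.
    ci_passed: bool
    lines_changed: int
    paths: list[str]


# Assumed examples of low-risk files (configuration and build files).
LOW_RISK_SUFFIXES = (".yaml", ".yml", ".toml", "pom.xml")
MAX_AUTO_APPROVE_LINES = 20  # arbitrary threshold for illustration


def route(pr: PullRequest) -> str:
    """Return 'auto-approve' or 'human-review' for a pull request."""
    if not pr.ci_passed:
        return "human-review"  # failing verification always needs a person
    if pr.lines_changed > MAX_AUTO_APPROVE_LINES:
        return "human-review"  # large changes are never auto-approved here
    if pr.paths and all(path.endswith(LOW_RISK_SUFFIXES) for path in pr.paths):
        return "auto-approve"  # small, verified, config-only change
    return "human-review"


if __name__ == "__main__":
    print(route(PullRequest(ci_passed=True, lines_changed=3, paths=["service/pom.xml"])))
    print(route(PullRequest(ci_passed=True, lines_changed=3, paths=["src/Main.java"])))
```

The design choice worth noting is that the function defaults to human review: a PR is auto-approved only when every condition holds, so any signal the rules do not anticipate falls back to a person.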
We'll figure out over time where we need the human judgment to be applied, uh, and that's gonna be both, I think, prior to invoking
23:40 – 27:36
Coding is no longer the constraint: prototyping in production code and shifting bottlenecks
Speaker
  the agent and post invoking the agent. And lastly, as we're moving faster, um, we're seeing that coding is much less of a bottleneck now. It used to be that if you looked at our-- the way that we build our products, our, our product development life cycle, we were mostly waiting on developers building out features, implementing them. And that might have been early in the phase where we need to validate something, or it might be building that out for production. Both of those cases, that was one of the main bottlenecks that we had as a company. And that is now starting to loosen up. I wouldn't, I won't say that it's completely eliminated, but it's, it's starting to be reduced. So for example, for that early validation, Spotify is a company that has too many ideas, way too many ideas about what we could do to our users than we've ever been able to build, than we had a capacity to build. And having that many ideas about what we can do means that we need to validate which of those ideas make sense, and one way we can do that is to prototype. Prototype used to be a fairly expensive thing for us to do. You had to convince a bunch of developers to build something for you so you can then show that to other people. One thing that Claude and Agents allows us to do is to allow anyone to prototype in our actual production code base. So now at Spotify, you can open up Claude in our client monorepo, and through a set of skills and some infrastructure that we've built, you can prompt Claude to build out any feature that you want to try out and imagine. Claude will build that for you. You will get a, an app back that you can install and test on your device and share with other people within Spotify to actually get a sense of what it feels to use that idea you had. And this has brought prototyping for something that could take days or weeks to literally taking minutes now. 
So anyone, including, as it turns out, one of our CEOs, are now building these prototypes for the ideas they have. So that's for prototyping, and then the same is true for, like, building things out in production. But what we're seeing is that this is moving the constraints around. So what-- where coding used to be the bottleneck, we're now seeing more and more of that, those constraints and bottlenecks turning into other aspects of how we build products, and in particular, where we have human decision-making in the loop. So again, things like deciding what we're gonna ship to our users or which ideas we want to explore. Those things used to be-- we didn't have to make that many of those decisions because, again, we were constrained on how fast we could build things. But as that constraint lifts, we need to figure out better and more effective ways of, uh, making those decisions. And we're seeing this now, and we're trying to shift around how we, um, plan the work we do and how we decide on or how we make those decisions at the moment. It is still very much an ongoing learning at the moment and a set of experiments that we're running. But I think in six months or so, I think we'll have a very, very different way of building products compared to what it had looked like previously. That was it. If, again, if you want to try out Fleetshift and Honk, that's, uh, where you can take a look at that. And thanks for having me. [upbeat music]

Episode duration: 27:36

Transcript of episode zFslvuvYifQ