[on-hold music] Please welcome to the stage the VP of Engineering at Datadog, Sesh Nalla. [audience applauding] I know it is 4:00 PM. How's the energy right now? How are you liking the conference? Great?
[cheering]
All right. Let's see some blood flowing. Quick show of hands: have you heard of machine tools before, what's on the slide? Have you used any? I was expecting zero hands anyway, so okay. That's the talk anyway. These are tools like jigs, fixtures, gauges, and mills that you see in manufacturing to produce precise and repeatable machine parts, the kind that you assemble into larger machines like engines, aircraft, nuclear reactors, the lunar landing modules that we saw this morning, right? Machine tools were a breakthrough during industrialization that enabled scale through interchangeability, standardization, and precision. That was the inspiration for my talk and what we built at Datadog, and I will share with you how all this fits into building software with Claude Code.

What you're seeing on the slide there is my view of the last 18 months. I think there is a different view of this graph at multiple sessions today. This is non-scientific, so please don't consider it my eval. It's purely personal, but it's about ambition with the models. For most of 2025, the models were useful to me, but within very narrow boundaries: local changes, small functions, tests, glue code, throwaway prototypes. I learned a lot through them. And then around late 2025, I think you all have seen this and noticed it, the slope started to change, like this exponential. I started trusting Claude with larger and more ambiguous system-scope work.

Before I talk about machine tools, I want to go through the lineage of some ideas that led me here. This is 2024. It's a write-up. We were building a distributed queuing system called Courier from scratch. All of this was before agents, all by hand. Can you believe it? Software was still built by hand then. [chuckles] The hard part for any distributed system, human-built or agent-built, is not just building the parts, but making the interactions between them observable, testable, and verifiable. So we were rigorous with formal modeling and simulation; you see various techniques in this post. All of this is classical systems work, where you identify the parts where mistakes would be expensive or hard to reverse, and you raise the rigor for those parts so they don't slip into production.

The next idea was around September 2025. We called it Bits Evolve. It's a closed-loop evolutionary optimization harness inspired by AlphaEvolve from DeepMind. The idea is that you let parts of code improve themselves inside a narrow, controlled harness. I think we have seen some announcements today at the keynote about dreaming and various loops. So this was us trying in September 2025. It's an ensemble, or a council, of models, big and small, and they generate variants of whatever code you designate you want them to improve. In our case, it's hot functions, blocks of code. And then you have the cascade that you see on the right-hand side of the screen, where benchmarks, tests, and production observability decide what survives. It's like natural selection. So this was the first glimpse for me that parts of software could maybe be cultivated like living organisms, like plants or microbes, growing through variation, feedback, and adaptation.
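(A minimal sketch of that selection loop, in Python. The actual Bits Evolve harness isn't shown in the talk, so the function names, the random fitness stand-ins, and the loop shape here are all illustrative assumptions.)

```python
import random

# Illustrative sketch of a Bits Evolve-style loop: an ensemble proposes
# variants of a designated hot function, and a verification cascade of
# benchmarks, tests, and production observability decides what survives.
# Every name here is hypothetical; the stand-ins only simulate the shape.

def generate_variants(candidate: str, population: int) -> list[str]:
    """Stand-in for the model ensemble rewriting a hot function."""
    return [f"{candidate}/variant-{i}" for i in range(population)]

def cascade(variant: str) -> float | None:
    """Stand-in for the cascade: return a fitness score, or None if the
    variant fails any gate (benchmarks, tests, production signals)."""
    if random.random() < 0.5:   # simulated gate failure
        return None
    return random.random()      # simulated measured improvement

def evolve(seed: str, generations: int = 10, population: int = 8) -> str:
    best, best_score = seed, 0.0
    for _ in range(generations):
        for variant in generate_variants(best, population):
            score = cascade(variant)
            if score is not None and score > best_score:
                best, best_score = variant, score  # natural selection
    return best

print(evolve("hot_function_v0"))
```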
However, the insight was that this kind of evolution is only as good as the environment it adapts within. If you have bad benchmarks, they produce bad evolutions. With weak observability, your optimizations are shallow.

So then, during this leap, this exponential that started, we were building whole distributed systems with Claude Code. I mentioned the Opus 4.5 inflection point where I raised my ambition. It wasn't sudden. I didn't wake up one day and start trusting Claude with whole systems, or even my databases; we have a lot of them. It happened progressively. I tried harder tasks, then larger subsystems, failed a lot, learned a lot through a lot of experiments, and then started seeing successes around this period.

So we decided to be more ambitious: okay, could we build something as big as Kafka? Show of hands if you have heard of Kafka. Okay, more hands now. Kafka is a streaming service. So we attempted it: can we build this from scratch? It's repeating the same playbook we had for Courier, which is a queuing system, pretty close, right? Same rigor. You will see the cascade on the right-hand side of this slide. But Claude Code was doing most of the construction, with one human building it. In a few days, to our disbelief, we had a fully functional, Kafka-compatible system working, and we called it Helix. The source code, methodology, and details are all linked in this post; feel free to check them out later. But taking Helix to production now requires building mileage, and that has been challenging for us.

So the next natural move for us, I spoke about Bits Evolve earlier, was: could Bits Evolve evolve parts of Helix the same way it evolved hot functions? The ambition was big: from small functions to, can the whole component evolve? Can it provide the mileage we needed so we could get to production? The answer was not quite. The surface area was too large. Even with the verification cascade that I showed you in the previous slide, quite rigorous, there were too many places where the human had to interpret and correct. It's too multi-turn, too interactive. So we were like, "Okay, can we look for a narrower surface? Can we dial back the ambition a little bit?" This post was about that. We chose our metrics aggregation server. We're Datadog, so we have lots of metrics. Could we improve the materialization logic live, not offline like what we did with Bits Evolve before? Could we optimize it per customer, with a proof-carrying path around the change, so a human doesn't have to review every candidate being generated?

So if you look at the flow across these four projects, if you observe the pattern, each project exposed the bottleneck for the next. There were many talks and comments today about the bottleneck being moved. With Courier, the bottleneck was construction, with humans building systems through careful design, modeling, and verification. It took us one year to build Courier; compare that with the three days it took to build Kafka twelve months later. With Bits Evolve, the bottleneck moved to the feedback loop: the model produces variation, but the harness decides what survives. And with Helix, the Kafka system that we built, the bottleneck moved again, where agents could build large parts of the system, so we have seen it.
But then the human has to coordinate to ship the work to production through tools and mechanisms built for humans. That's the Amdahl's law that Dario was talking about earlier today. So that's the jump we are making, if you look at this slide at the center: from mechanization to industrialization. Mechanization means agents are doing more of the work now. And industrialization, if we were to borrow the metaphor, and that's why I introduced machine tools, means work becomes repeatable, verifiable, controllable, and scalable.

So, the idea of a machine tool. I know it sounds cool, but why do we need them now for agent-built software? Aren't these just for landing lunar modules? Because the complexity and the ambiguity are growing, right? Each time we are trying to increase our ambition, starting from targeted changes on existing systems. In the last four months, about 90% of Datadog used AI coding tools for production code. That's roughly 3,000 engineers. And Claude Code drove at least two-thirds of that. And most of the work, as I described with Helix, was still single-human-driven: one engineer steering one or more agent sessions. And the work was moving across this map: more complex to generate on one axis, and more ambiguous to verify on the other axis. Here are a few concrete examples. I'm not going to enumerate all of these, but feel free to take a picture if useful. And if any of these resonate with you, I'm happy to chat more afterward. But the main point I want to underline here is that these are generating personalized flows in our software development life cycle, because one human can do a lot more than what they were used to before.

So for engineers in the software delivery life cycle I was talking about, the word flow used to mean a direct relationship between intent and code. You understood the problem, you wrote the code, you tested it, you reviewed it, you shipped it, you operated it, and you repeated it over and over again. But with agents, that abstraction level is changing rapidly. I don't know, I haven't seen code in a while. I've made my peace with it because I've been a manager for a while, but many people are still trying to go through their seven stages of grief over not seeing code on a day-to-day basis, right? You're no longer writing the code. You're shaping the work. You're deciding what the agent should see. We saw the keynotes today, like outcomes, right? What tools it should have, what success means, how failure should be detected. All of this is powerful. It's like everyone's been promoted three levels up the management chain, which they didn't sign up for. Engineers, right? Because it's a huge leap, and it's also disorienting. It's like you're pushed against gravity very fast. It can feel sickening, right? We haven't acclimated to this altitude of working, specifically engineers who love looking at code.

So before this jump in model capabilities, the human team was the factory. Tools were designed around human attention and our judgment, and the operational memory of what's actually happening in production lived in this human organization, our collective brains, our minds.
Connecting back to Courier, like I said, only twelve months ago this was the world we were in, and only four months ago it started to change, and that's the inflection point with Claude Code and Opus 4.5. We are starting to see one lead human coordinate multiple interactive sessions. I've seen many screenshots of parallel sessions cranking out stuff. It's disorienting for me to personally watch: three, four, five agents working on different parts. I heard Jared today saying he's doing ten things at a time; the stuff he was showing was only ten percent of where his time was being spent. So these tools were still human-shaped. The agents were two orders of magnitude faster, and this tool chain isn't built for their speed. So what happened? The human became the bridge between agent execution and the human-shaped systems. And now all this operational knowledge, like when you wake up at night and something's broken, it's just in that person's head, and probably in some markdown files that agents worked on in between them.

That is further amplified with Claude-managed agents, right? Agents are doing a lot more background work, right? It's compute. And they start taking judgment-bearing roles, meaning they're not just following your instructions anymore; they're making their own decisions, and they're running longer, for hours, sometimes overnight, sometimes for days. I know the longest task someone benchmarked was, like, twenty, twenty-eight hours. So they construct their own tools in these sessions. They write their own code. And the mismatch is still there: each agent invents its own tools, its own glue, its own conventions, and that system becomes really hard to share and operate. You can see the blur between what the agent sessions produced as intermediate tools and what your product is doing, like your code. You get a lot of output fast. A lot of it is useful, but some of it can look like false progress, and most of the tool construction only makes sense in that local session, and you start to see that blur.

So this is where I felt we need something more structural. If agents are going to build and operate large parts of our systems, of our databases, which are mission-critical, they need the equivalent of this machine tool concept that I am trying to introduce. So Temper is a machine tool. The idea is, instead of the agent inventing disconnected tools for every local need, it produces precise specifications of the intent and problem domain. It is a machine tool in the same sense as a jig or a CNC machine. Have you seen some of those, the computer-controlled machines where you give them specifications of what your screw threading needs to be, what this needs to be? And it's extremely repeatable. You can run them and you can build aircraft and things like that with them, right? So in this case, the agent does not improvise the final mechanism each time. It produces a precise description and iterates with Temper, or a Temper-like mechanism, to make something work first, and then later turns that into something repeatable, checkable, and reusable, so you could actually build a software factory around your code base.

That's the concept of a dark factory. Simon Willison of simonwillison.net, a pretty amazing blog, I think he's one of the most influential AI voices right now in teaching how to work with agents and build software, has been popularizing this phrase, dark factory.
I think it's a pretty good encapsulation of a software process where the agents keep working without the humans on the virtual factory floor. You can turn off the lights. The human role now becomes designing the factory and the constraints and the outcomes and the verification loop, so this thing can run for hours and days and weeks producing what you want it to produce. Something like Temper can fill that role of a machine tool to build such factories.

Let's look at a dark factory concretely, with Helix as a target. Like I shared, Helix is a Kafka-like streaming service, and Kafka is probably one of the five most expensive services we are running at Datadog. We have been shadowing our production workloads with Helix. In some cases, we actually believe it can be materially cheaper than our current production solution. Can you believe it? It took, like, a week to build, we started shadowing it, and we saw opportunities for it to be two to five times cheaper. But getting from this promising state to production still takes work on the mileage the system needs to earn, so that it can run and multiple people can operate it, not just the person who built it. So we created a bunch of synthetic workloads that model our production shapes, and we constructed a factory.

A software factory for Helix uses Temper in three distinct ways. First, as an agent control plane for Claude-managed agents, where the sessions, roles, work queues, and operational life cycle are more precisely managed. Second, as a way for agents to build their own tools with small Temper apps, bridging the SDLC tooling like Git, CI, and deployment. And third, as a Helix control API, the interface and life cycle surface around the Helix data plane to exercise this workload.

That was a surprise to me. The surprise was it started to feel more general than agent infrastructure. A lot of software, if you squint closely, is just control logic around databases, APIs around state, policies around mutation, life cycle transitions, integrations with external systems. So Temper could be universal, in the sense that it can be applied to any software that has the shape I described.

Before I go deeper into how Temper works, you might be wondering at this point: why is this different from asking Claude Code to build a CRUD app in TypeScript or Python? Claude can do that very well. We have seen lots of PRs and lots of code flowing. However, in normal CRUD apps, the control logic is spread across routes, database constraints, service code, background jobs, and documentation. It may all have good tests and coverage, but the operational model, which is generally state-machine-shaped, is mostly implicit in the code base. So the idea is that Temper makes that state machine explicit. This isn't particularly novel. We have had this with runtimes like Erlang/OTP for decades, with actor runtimes, and more recently with workflow engines and durable execution runtimes like Temporal. They have all popularized this kind of precise runtime so you can run invincible applications.

So let's look at some Temper internals to understand. On the build path, Temper asks for the operational domain model I'm describing, what you want, in the form of a blueprint: what are the states? Which transitions are legal? Who can request them? What effects are allowed? What invariants must hold? What happens if a tool call fails, right?
So the agent will often iterate and arrive at this blueprint over multiple turns. Think of it like its own plan mode: instead of arriving at a markdown file of some description, it arrives at this precise declarative artifact. And then you have a compiler equivalent; Temper verifies it, and it can hot-deploy into a runtime. There are other runtimes that do this, like Erlang's BEAM, if you have heard of it. Anybody heard of Erlang BEAM? There you go. The run path feels the same as any other cloud-shaped API. You wouldn't notice the difference.

It's important to note that the agent is not generating arbitrary application code directly. We've raised the abstraction. It's generating a structured description that compiles into a runtime shape. The compilation step is outside the LLM, the same way you write Rust code and give it to the Rust compiler, right? So Temper turns this blueprint into formal state transitions. This is very common in functional programming and actor runtimes, if you've used them. And this is the most important technical detail: formal here doesn't mean every possible property is proved. It isn't theoretical. It means the basic shape of the application is represented as a precise transition system. And when you have that, it is a much better reasoning surface for both humans and agents than arbitrary code.

Pardon my completely brutalist design of this slide with no syntax highlighting, but that's an agent-written spec. For a Helix rollout, that spec looks like this: states, actions, and triggers. I actually got an idea this morning when I saw the announcement on Claude Managed Triggers. Maybe this could be just a pass-through: you could declare a trigger here and then have Claude run them. But the idea is you define your states, actions, and triggers, or the agent does it, Claude does it. And in this case, in the deployment description, start rolling is only valid when planned; the entity moves to rolling, and that triggers a side effect, like going and patching a Kubernetes StatefulSet, which is not idempotent, and then a callback comes back to mark progress or fail.

So that artifact is declarative. What Temper does is take that spec and generate this transition table, a concept where the critical control logic is data-like. It is not spaghetti imperatively encoded in code. It's data-like, interchangeable and checkable, and it is not hidden in any improvised chain of service methods. This is easier for agents to work with, and they can change it dynamically with safety. That's the promise, right? I have seen people writing Erlang scripts and hot-deploying into BEAM. I think that's a pretty creative way of letting agents work with runtimes that have been extremely hardened over decades of high-assurance systems. So in this case, if a rollout needs a new state or a rollback path, the agent can make a targeted spec change and hot-reload it. The iteration speed is even faster; you don't have to go through CI and then deploy and everything. So when you're leaving agents overnight, I think they can make some pretty good progress without compromising on safety, PR reviews, code reviews, things like that.
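(To make that concrete, here's a minimal Python sketch of a declarative blueprint compiled into a data-like transition table. Temper's actual spec format isn't shown in the talk beyond "states, actions, and triggers," so the field names and the patch_statefulset effect label are assumptions based on the rollout example described above.)

```python
# Hypothetical sketch of a Temper-style rollout blueprint compiled into a
# transition table. All names are illustrative; only the shape (declarative
# spec -> checkable data -> guarded transitions) follows the talk.

BLUEPRINT = {
    "states": ["planned", "rolling", "completed", "failed"],
    "actions": {
        # action: (valid source state, target state, side effect)
        "start_rolling": ("planned", "rolling", "patch_statefulset"),
        "mark_progress": ("rolling", "rolling", None),
        "mark_complete": ("rolling", "completed", None),
        "mark_failed":   ("rolling", "failed", None),
    },
}

def compile_blueprint(bp: dict) -> dict:
    """'Compiler' step outside the LLM: turn the declarative spec into a
    transition table keyed by (state, action)."""
    table = {}
    for action, (src, dst, effect) in bp["actions"].items():
        assert src in bp["states"] and dst in bp["states"]  # basic shape check
        table[(src, action)] = (dst, effect)
    return table

TABLE = compile_blueprint(BLUEPRINT)

def transition(state: str, action: str) -> tuple[str, str | None]:
    """Guarded transition: illegal (state, action) pairs are rejected."""
    if (state, action) not in TABLE:
        raise ValueError(f"illegal transition: {action} from {state}")
    return TABLE[(state, action)]

# start_rolling is only valid when the rollout is 'planned'
print(transition("planned", "start_rolling"))  # ('rolling', 'patch_statefulset')
```

Because the table is plain data, adding a rollback path is just appending one entry and hot-reloading, which is the property the talk leans on.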
And we have policy gates. Because the transition table is data-like, Temper evaluates the state transitions and the policy decisions together, such as: who can mutate a deployment rollout? Which actions can an operator agent take? Like, for the team of agents working on the dark factory, you can say these operator agents can only do so much, or these actions are forbidden for a builder agent. And you can specify independently, as a human: can a completed rollout be rolled again? Can a failed tool result continue on its rollout path? Et cetera.

And then we have the side effects, an effect system, also very popular in typed runtimes like TypeScript. Effects are deliberately small, typed operations here in Temper. Keeping them small prevents the state machine from becoming a backdoor for arbitrary application behavior. But if you need arbitrary application behavior, you can package it as Wasm modules. Familiar with Wasm? Anyone? Okay, a few hands. So this is where arbitrary code generated by an LLM can live. It's a much narrower place. What that means is it makes troubleshooting easier.

And the last building block for Temper is the verifier. Verification is basically the bottleneck right now for pretty much everything; we have seen so many announcements and discussions around it. In Temper, the verifier is a gate before the transition table is loaded into the runtime. That's what allows it to say, "This is safe. You can load this live." And it has multiple levels, like a Swiss cheese pattern. Not all levels need to find everything exhaustively. Claude is generally very good at making judgment calls: do I need to run all levels or just some? It seems to use them the way it uses compiler output. Level one checks the algebra of the automaton. Level two model-checks the reachable state graph: can any path reach a bad state? Level three runs schedules and injects faults: do invariants survive timing and failure conditions, et cetera? And level four uses property tests, which I highly encourage; we have to start learning this mechanism of randomized testing with action sequences. All of this isn't exhaustive on day one. Every discovered gap compounds the verifier. I also heard Boris mentioning this compounding effect: you find tests, and then you keep fixing the gaps. A missing condition revealed in production or simulation can get added to the model or the test suite, and it compounds.
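(Two of those verifier levels are easy to sketch over a transition table like the one above. This is a minimal Python illustration, not Temper's implementation: the table, the bad-state criterion, and the action sampling are all assumed for the example.)

```python
from collections import deque
import random

# Sketch of verifier levels two and four over a small transition table.
# (state, action) -> next state; names are illustrative assumptions.
STATES = ["planned", "rolling", "completed", "failed"]
TABLE = {
    ("planned", "start_rolling"): "rolling",
    ("rolling", "mark_complete"): "completed",
    ("rolling", "mark_failed"): "failed",
}

def reachable_states(start: str) -> set[str]:
    """Level two: breadth-first walk of the state graph, enumerating every
    state any action sequence can reach, so bad states can be ruled out."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for (src, _action), dst in TABLE.items():
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

def property_test(start: str, invariant, runs: int = 1000) -> None:
    """Level four: fire randomized legal action sequences and assert the
    invariant after every step."""
    for _ in range(runs):
        state = start
        for _ in range(20):
            legal = [a for (src, a) in TABLE if src == state]
            if not legal:
                break
            state = TABLE[(state, random.choice(legal))]
            assert invariant(state), f"invariant violated in {state}"

# 'completed' has no outgoing actions, so a completed rollout can never
# be rolled again; both checks confirm it.
assert reachable_states("completed") == {"completed"}
property_test("planned", lambda s: s in STATES)
```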
So where is all this going? I don't know; I can't predict much about where this could go. But the idea is, if each Temper app or bundle artifact is concise, very few lines, it fits in your head. I spend a lot of time running mission-critical infrastructure for Datadog, woken up at night, like, thousands of times over the past three or four years. You won't be able to keep your complex mission-critical logic in your head when you want to operate it, right? For a complex business domain, like banking or financial systems, you should still be able to encode your business logic this way, as something a human can read. If the generated artifact is thousands of lines of tangled code, we are back to where we started, and I don't know entirely yet that an LLM can just completely review and then sign off on it.

If it's just a few small artifacts with explicit roles, where both humans and agents can modify the system without disturbing everything around it, I still feel like this is just good systems engineering anyway, and high-assurance software, like aviation and financial systems, has been built this way for decades. However, the cost of such rigor with humans was too high for general software, until now. Agents are changing that calculus.

Going back to my manufacturing metaphor, the win was not that one artisan could build a brilliant machine. The win was that machines built with machine tools had parts that were composable, inspectable, and replaceable, so that we could build larger machines. And my claim is that for agent-built software, we need this kind of rigor to scale from here on, if we really want to build databases and put them in production.

I land where I started. If agents can build software autonomously inside factories with such discipline and rigor, maybe we don't need to stop at dark factories. I know there's the word "dark" in there; it kind of sounds sad. Whole software built this way can feel like an organism that we can grow, cultivate, and evolve through feedback, selection, and adaptation. It looks like, I don't know, agriculture, or a directed evolution where you choose the direction in which your software needs to go. You can choose Kafka to be a queue because more customers are using it as a FIFO queue, versus, "No, we need more buffers. We don't need all these brokers." Or maybe just dreaming, right? We have dreams now in Claude.

That's all I have. Thank you so much for listening. [audience applauding] [upbeat music]