Claude
Building the best agentic analytics harness: Powered by Claude, built with Claude Code
25 min read · 4,770 words
- Chris Merrick
[on hold music] [applause] Hey, thanks everybody. Um, great to be here today. We're-- Omni's a AI analytics platform, and today I'm gonna talk a little bit about, you know, how we build with Claude and, you know, what we've built with Claude and, and how Claude powers that. Uh, so start just with how Claude has enabled us. You know, I'm the CTO. I run the engineering team at Omni. Uh, we have a team of about twenty-five engineers. Uh, this is a slide of our, uh, commits to the main branch of our repository over time. Uh, I think it kinda speaks for itself. Um, one of the things hidden in here, kind of a very small piece of that line, is my own commits. I think, uh, as CTO of a growing company with hundreds of customers, I sort of assumed that at some point I'd have to stop writing code. Uh, and I thank Claude very much for making me, uh, still able to do some software engineering from time to time. Uh, so that's been a really, you know, a, a fantastic sort of unexpected benefit of, of this rollout. And, you know, to speak a little bit about how this went, earlier in twenty-twenty-five we said to the team, "I don't know when and I don't know how, but I know our jobs are changing, so let's just start experimenting, start using these tools, figure out what works." And we did a bunch of it with some, you know, sort of fits and starts. Uh, and when Claude Code with the Opus model released, that was when some of our senior engineers said, "Wait a minute. Now this is real. This is actually helping consistently." And yeah, it's kinda been off to the races since then. Uh, and starting around, you know, I, I sort of felt [chuckles] like everybody went away for the holidays and then came back in January and had sort of skilled up and figured it out and was ready to start hitting the ground running with Claude Code. Uh, and you can see the slope of the line from there. So that velocity is a big part of our culture at Omni. We have a core value called ship it. 
Uh, we also have a, a core value around transparency. Uh, so you can actually see, uh, if you go on our website, omni.co, there's a top-line, uh, navigation page to our demos. Every Friday at Omni we have an all-hands meeting. It's the most important meeting of the week. We do about ten minutes of announcements, shout-outs to each other, and then we do fifty minutes or increasingly more than fifty minutes of demos. Uh, we record all this, and our CEO's favorite job is Saturday morning he wakes up, he cuts those demos, posts them on YouTube, and shares them with the world to see. So if you're curious about what we're up to, uh, you can go to omni.co. This has been a really cool way to show our customers, our prospects, the community around us sort of how we're thinking and how we're building. And that speed and velocity has been a huge benefit for us as a company. So let's talk a little bit about what we have built with Claude. Uh, so Omni's AI analytics, you know, we let you talk to your data, right? So how does that actually work? Uh, so user comes in and asks a question. In Omni, we're using Claude to actually translate that question into a semantic query. I'll explain more what that means later. Uh, we have a semantic layer, and that is, think of that as sort of a translation layer that sits on top of your data warehouse, database, or maybe multiple of those, uh, to actually, you know, provide some additional benefit about how to use the data, enforce it, sort of give it a map for how to actually translate that data or, excuse me, that query into SQL that then runs against the, the warehouse. We'll go into more of this later, but that's just a high-level view of kinda how the system works. And this is important because Claude is incredible at answering questions, but you need to tell it more about your business if you want it to answer questions about your business, right? It can tell you incredibly deep insights about how businesses work generally. 
But if you want it to know about how your business works, you need to tell it not only, you know, how the business works, the terminology you use, and then of course if you're asking a, a data question, you need to tell it how your data looks and how that works. And so that's where we come in. That's the problem we're solving. And it's subtly difficult, right? Like even last quarter means drastically different things at different companies. Even in our, our own company, you know, in the product and engineering organization, last quarter refers to the calendar year. In our sales team, it refers to our fiscal quarter. So all of that needs to actually get coded into the context and awareness and even the, the sort of data layer and definitions of the data so that it can be used appropriately to ultimately get you the right answer to your question. So this semantic layer, what's it doing? So it's, like I said, it's a translation layer that sits on top of the database. Uh, it's doing a few things. One, it's curating the data, right? It's very easy to sort of come up with a toy demonstration of how a, an LLM or frankly a human can get correct answers on top of, you know, ten datasets in a database. Real company data warehouses and databases are not like that. They have tens of thousands, hundreds of thousands of datasets, sometimes more. Uh, and all of them, you know, have... There, there's a hundred revenue tables, right? There's a hundred opportunity tables. It's very unclear how to actually stitch those things together and use them in the correct way. So this is one of the benefits that our semantic layer provides, is it allows you to sort of, is, def-define how to use these things together and also curate it. Say, "Hey, listen, this is the one that matters. Ignore those other ten." With that, it, it's a way to encode the context. 
And I think one of the things that we've learned is that context is great, but context localized to what the actual definition that the context refers to makes it all the, all the better, right? And so I think about this, we're at Code with Claude. If you use Claude Code, which I assume most of you do, you have your CLAUDE.md files, right? And sort of the more you can do to sort of localize that, that context next to the parts of the code that it applies to, the better results you're gonna get. That's what our semantic layer does as well, is it helps you provide that context next to the field definition that it applies to rather than in a separate file over somewhere else. Uh, and then finally, permissions, right? It's a permissions layer. Make sure that people see the data they're supposed to see and don't see the data they're not supposed to see. And inside of our application, this feedback loop is an important part of how this stays current and accurate because guess what? In a real organization, this stuff changes constantly. And so our application provides a feedback loop where, you know, the next question that gets asked of the data can then be fed back into the definitions, into the context for a continuous learning loop. So this is our agent. Name's Blobby. Uh, if you look at Blobby, you could probably see this is, you know, a, a mature, professional, refined data analyst. Uh, and that is what Blobby is today. But Blobby hasn't always been that refined. Uh, started building Blobby, let's say, about eighteen months ago, and we've learned a lot along the way. Uh, you know, Blobby's grown up quite a bit in the past eighteen months. So what I wanna talk about is a little bit of, you know, how we developed the different phases and also what we learned, uh, along the way that kinda helped increase the quality and capabilities of Blobby. So just to ground this conversation, I just wanna show you a really quick demo video of what Blobby does in real life. 
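Editor's sketch: the semantic-layer ideas described above (context kept next to the field it applies to, plus a permissions filter) might look roughly like the following in Python. This is not Omni's actual model format. `Field`, `Topic`, `prompt_context`, the role names, and the sample fields are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """One field definition; all attribute names here are hypothetical."""
    name: str
    sql: str
    description: str
    context: str = ""  # guidance stored next to the definition it applies to
    allowed_roles: frozenset = frozenset({"sales", "finance"})


@dataclass
class Topic:
    name: str
    fields: list


def prompt_context(topic, role):
    """Render the fields a role may see, each with its own local context."""
    lines = [f"Topic: {topic.name}"]
    for f in topic.fields:
        if role not in f.allowed_roles:
            continue  # permissions: hidden fields never reach the model
        line = f"- {f.name}: {f.description}"
        if f.context:
            line += f" (context: {f.context})"
        lines.append(line)
    return "\n".join(lines)


opportunities = Topic(
    name="opportunities",
    fields=[
        Field("amount", "opp.amount", "Deal size in USD"),
        Field(
            "close_quarter",
            "opp.close_quarter",
            "Quarter in which the deal closed",
            context="For sales questions, 'last quarter' means the last fiscal quarter",
        ),
        Field(
            "rep_commission",
            "opp.commission",
            "Commission paid to the rep",
            allowed_roles=frozenset({"finance"}),
        ),
    ],
)

print(prompt_context(opportunities, "sales"))
```

Calling `prompt_context(opportunities, "sales")` leaves out `rep_commission` entirely, and the note about fiscal quarters travels with the `close_quarter` field rather than living in a separate document.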
So if we can cut to the video quickly. So this is our- Right. And so what we're gonna see here is, like I said, ask questions of your data. Blobby's sitting there dutifully waiting to answer your question. Uh, and as we do this, you'll start to see some of the phases that we go through to actually break down the question and then answer it. So right, the first thing Blobby's doing is saying, "All right, you're asking about PRs." Blobby's very smart because Claude's very smart. It knows that PRs refer to GitHub pull requests. Let me go find that in your semantic model and figure out what data you're actually referring to. Uh, then it goes and looks up the values of the dataset 'cause we said we only wanted it from a particular repository, so it has to apply a filter. It needs to know what that filter value is. And guess what? When I type out questions to an LLM, I make typos all the time, so it needs to do a little bit of fuzzy matching to make sure that it's actually finding the right thing. And then, right, it goes through, it generates a query, runs that query against the data warehouse, gets the results, provides a nice visualization, and then does a nice little summary at the end to tell you what you're seeing. Cool. So just grounding you a little bit in what the actual experience is like. Uh, so what did we learn along the way? So the very first version of Blobby was basically single question, single answer. Uh, and we quickly realized that we needed to give a lot more metadata about, you know, how to use the data and how the data is typically used. So, you know, we always had these label and description fields in our, in our definitions, uh, but we needed to add some additional context. So we, we added this AI context concept, which is sort of specifically for an LLM. Hey, how should you use this? 
You know, when you're asked a question about this, you might wanna use this reference or this field just to kinda help the data team and the, the administrators of this to actually steer in the right direction and ensure that you get a, a quality answer. Uh, and then sample queries. Very self-explanatory. Really helpful to kinda ground it in, hey, this is a typical use case. This is what you would, uh, this is the query you would run to answer, uh, a question that like X. Uh, and then finally, values. This one was kinda subtle. What we realized is, again, back to that example of, you know, I'm asking a question about a certain repository, or in this example, you know, asking a question about, like, a region, uh, it's really helpful to give the LLM just, like, a taste of what the values of that field are. Because right, you can see, you know, region, all values are EMEA, NAM, APAC, right? So it can, it can infer the next ten values because it sees, all right, these are abbreviations of regions of the world, right? Um, but it's useful for it to know, like, okay, this is an abbreviation, so if somebody asks for United States, I can just put US in there. So this did a nice job of helping to improve the quality of the question and answers that we were getting. But at this point, Blobby was still really not an agent. So that was the next big leap here, was adding an, a, an agentic loop around this. Um, this was a, you know, a, a big engineering effort. We built our own agentic harness. Uh, and, you know, [chuckles] it included this concept of tasks, like all good agents have. Uh, included a lot of other stuff too. You know, I think one of the biggest things we learned was that, uh, the, the agentic loop was really great at recovering from errors. So, like, one of the earliest, like, massive, uh, quality increases we made was to just, A, tell Blobby how to recover from errors and give, give it some budget to go do that. 
Uh, and B, then go invest in providing great error messages that were descriptive about what was happening and how you might fix it. Uh, and that alone allowed the, the quality score to increase dramatically. We saw our evals, like, a lot of our more difficult evals got a lot better once we did that. At this time though, because we were sort of in this mode of question and answer, we were using the Haiku model. And the Haiku model is great, but once you get into these more elaborate agentic conversations, it's just not designed for those, right? And so, uh, we switched over to Sonnet, um, and we're showing a graph here of token consumption. Uh, and the reason is twofold. One, these are longer conversations, they're more complex, they consume more tokens. That was by design. Two, this was a really big unlock. So we all of a sudden started getting our customers saying, "Wow," like, "I just asked this question that, you know, either I never would've been able to answer myself, or even if I did, it would've taken me hours, and it just nailed it in two minutes." Uh, and so the usage of Blobby started dramatically taking off at this point. So at this point, our CEO, who is our loudest and most critical user, like all good CEOs are, was telling us, "Listen, guys, I know this thing's really good, but it screwed up this question. Go fix it." And we all said, "You know, Colin, LLMs are a little unpredictable. Like, you're just gonna have to accept that it's not always going to be perfect." He said, "Not good enough. Go fix it." Said, "Okay, fine." And where this led us was-- I, I think one of the big, uh, efforts that we undertook at this point was to say, "Okay, let's really invest in understanding the traces and being able to see the traces of these, you know, quote-unquote, bad sessions." And this led us to a series of major surgeries that we refer to as the blobotomies. 
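Editor's sketch: the error-recovery idea from earlier in this segment (give the agent a retry budget, and make error messages say what went wrong and how to fix it) can be illustrated with a small Python loop. This is an assumption-laden toy, not Omni's harness. `KNOWN_REPOSITORIES`, `run_query`, `answer_with_recovery`, and `revise_from_hint` are hypothetical, and `revise_from_hint` stands in for the model reading the error message. The fuzzy match uses the standard library's `difflib.get_close_matches`.

```python
import difflib

KNOWN_REPOSITORIES = ["omni", "docs", "website"]  # hypothetical filter values


class QueryError(Exception):
    """Raised with a message that explains the problem and suggests a fix."""


def run_query(repository):
    """Pretend to run a query filtered on one repository."""
    if repository not in KNOWN_REPOSITORIES:
        close = difflib.get_close_matches(repository, KNOWN_REPOSITORIES, n=1)
        hint = f" Did you mean '{close[0]}'?" if close else ""
        raise QueryError(f"Unknown repository '{repository}'.{hint}")
    return f"merged PRs for {repository}"


def revise_from_hint(attempt, message):
    """Stand-in for the model: adopt the suggestion in the error, if any."""
    marker = "Did you mean '"
    if marker in message:
        return message.split(marker, 1)[1].split("'", 1)[0]
    return attempt


def answer_with_recovery(first_attempt, revise, budget=3):
    """Try up to `budget` times, feeding each error back into the next try."""
    attempt = first_attempt
    errors = []
    for _ in range(budget):
        try:
            return run_query(attempt), errors
        except QueryError as exc:
            errors.append(str(exc))
            attempt = revise(attempt, str(exc))
    raise QueryError(f"Gave up after {budget} attempts: {errors}")


result, errors = answer_with_recovery("omnni", revise_from_hint)
print(result)
print(errors)
```

The typo `omnni` fails once, the error message carries the suggested value, and the second attempt succeeds within the budget. A vague message such as "query failed" would give the loop nothing to act on.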
Uh, and the blobotomies really were traced back to what we were seeing in these traces 'cause when you, when you look at the traces, you kinda get to see the, the inner workings of how the agent is sort of talking to itself and reacting and responding in these loops. And that really clarified, you know, why some of these, you know, seemingly kind of just, you know, bad random sessions, you could actually start to see where they're, they were rooted in real problems. And so an example of this was the original design of our agent might've been a little too clever, uh, and we had sort of an outer agent that was responsible for, you know, producing the task list. It understood where all-- like, all the data available to it, but it was not in charge of query generation. It had a sub-agent that was in charge of query generation. And felt like a reasonable design. It was also handy 'cause we could use that query generation sub-agent in a few other contexts. Uh, but what we found once we started digging into these traces was that the sub-agent, its job was to generate one query based on whatever it was asked. And the outer agent didn't know what was actually able to be answered in a single query. So it would say, "Hey, sub, sub-agent, go answer me a question about, you know, GitHub pull requests and support data and, uh, summarize these things." And then the, the sub-agent would say, "I, I can't answer that in a single query. I can only... You know, I can-- I would have to run multiple queries." And so it sort of-- the light bulb went off when we started seeing this, that like we have to be really careful about how we separate the information and the knowledge of the, you know, outer agent and inner agent. And what we ended up doing was what our engineer Joel referred to as consolidating the brain, right? You, you wanna be careful not to have a split brain between any sort of sub-agent system and outer agent system. 
And so we just pulled these tools up into the, the outer agent harness. Uh, and this got rid of a lot of this kind of like seemingly unpredictable, surprising behavior. So this was a really big learning, dramatically improved a, a lot of our kinda more complicated evals. So the next phase was we started saying, "Okay, that's great, but guess what? When I actually go use Claude to generate SQL, it can answer some really impressively hard questions that candidly sometimes Blobby doesn't really do very well on." So we started thinking about this, and there's a really interesting backstory here. The short version of it is when Omni was first built, we had actually built a full SQL parsing engine into it, and we ended up discarding it because it wasn't reliable enough. People would throw random SQL at it, and it just couldn't handle every possible permutation. Uh, and so that'd been sitting on the shelf for years. Uh, but we sort of got thinking. We said, "Listen, if, if Claude can generate this really powerful expressive SQL and we can parse this SQL as long as it sort of fits into a general form that we understand, maybe there's an opportunity here." And we also kinda said, "I think it's probably a safe bet to assume that the good people at Anthropic are investing heavily in making Claude really good at SQL. So that seems like a good pet-- uh, a good bet to put our chips on." So, uh, our engineer Steven sorta dusted off that old parsing code and really just fundamentally changed the interface of how we were enabling or how we were exposing sq-- or query generation to Blobby. Initially, it was this sort of JSONified form of a query that was like highly structured, and then we switched it to this sort-- this SQL parsing mode where we said, "Listen, you can now produce SQL, and we can parse through, parse through it." And we were able to sorta give it some of the guidelines that prevented us, our parser from falling over. 
Uh, and this enabled Blobby to now, you know, take a lot of questions that it might have taken three or four attempts to actually answer, or sometimes it would have to chain together three or four queries in sort of awkward ways, uh, and actually write it in a one-shot query. And I, I think, you know, one of the things I noticed is- It seems like Claude really likes to write SQL with CTEs, common table expressions, for any of the SQL nerds in the room. Uh, and our parser was really good at, at parsing those actually. So this turned out to be a really nice sort of marriage where we found the efficiency of the system went way up because we were, we were able to, A, we didn't have to teach Blobby about this proprietary JSON form of a query that we had invented. We were just telling it to write SQL, which it, it already knew about. Uh, and B, it was able to just produce much more efficient queries, so it didn't have to do two or three ti- uh, two or three shots on it. So this is where we are today. We have our agentic system. We have this kinda outer loop that's in charge of s- uh, checkpointing our, our executions to make sure that we can recover from any failures. And then we have this inner loop where we have a bunch of tools available. That set is growing dramatically. Uh, you know, in addition to the examples I just talked about, we have tools for generating dashboards, we have tools for generating vi- uh, visualizations, we have some validation tools. I'm gonna hopefully do a live demo at the end if the gods are with me, and we'll show some of those. Uh, and, uh, the, the surface area of those tools is constantly growing. We also have tools that enable Blobby to s- to actually do the data modeling and improve that semantic layer. Uh, we also have an eval system. We have an internal eval system. We're also building an eval system for our customers because it's really important, and one of the key benefits that we provide is predictability and quality. 
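Editor's sketch: the one-shot, CTE-style SQL described earlier in this segment can be illustrated with a single query that would otherwise need two chained queries (one to count merged pull requests per author, one to compute each author's share). Python's built-in `sqlite3` module stands in for the data warehouse here. The `pull_requests` table, its columns, and the sample rows are hypothetical, and this does not reproduce Omni's SQL parser.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pull_requests (repository TEXT, author TEXT, is_merged INTEGER);
    INSERT INTO pull_requests VALUES
      ('omni', 'ana', 1), ('omni', 'ana', 1), ('omni', 'ben', 1),
      ('omni', 'ben', 0), ('docs', 'ana', 1);
    """
)

# One statement, two common table expressions: filter first, then aggregate,
# then compute each author's share of the filtered total.
ONE_SHOT_SQL = """
WITH merged_in_repo AS (
    SELECT author
    FROM pull_requests
    WHERE repository = 'omni' AND is_merged = 1
),
per_author AS (
    SELECT author, COUNT(*) AS merged_count
    FROM merged_in_repo
    GROUP BY author
)
SELECT author,
       merged_count,
       100.0 * merged_count / (SELECT COUNT(*) FROM merged_in_repo) AS pct
FROM per_author
ORDER BY merged_count DESC, author
"""

rows = conn.execute(ONE_SHOT_SQL).fetchall()
for author, merged_count, pct in rows:
    print(f"{author}: {merged_count} merged PRs ({pct:.1f}%)")
```

Each CTE names one intermediate step, so the whole question is answered in a single round trip instead of several dependent queries.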
The CEO asks a question, needs to get the right answer, needs to get the same answer every time. Right. Evals. I actually, I love evals. I think I love evals for a different reason than most people love evals. Like I said, I- my favorite thing about the evals is just having that raw trace data. It's really like the e- the observability part of evals to me was enlightening. So it's, I, maybe this is a personality trait of like being a brute force type person, but I really like just being able to go and say like, "Okay, this was bad. Why?" And look through that data. Uh, and then obviously sort of capturing that into a judge is a nice efficiency gain as well. And this is a big one. So I talked at the beginning about how we build with Claude. I, I... When you're building an agentic system, when you're building any system, it's really important for your engineers to have an understanding of what the users of that system actually care about, right? Like, it's really hard for somebody to build a system that they couldn't themselves picture using or can't relate to the users of. Uh, so I, I actually think that, you know, beyond just the massive productivity gains we got with Claude Code, being users of Claude Code helped us understand what a good harness looks like, right? And then we can take some of those lessons and bake them into our harness, right? And so, it's like, "Hey, should we go, you know, should we go build, uh, a new way for us to go explore the, the semantic model?" Well, let's see what Claude Code does, 'cause guess what? Semantic model is not that different from a code base, right? So maybe we should sort of tap into some of the ways that the Claude Code harness does this. Uh, and I, I feel like that actually helped our engineers really relate to the problem deeply, uh, and see some really, you know, see kind of what the, the latest and greatest techniques were for, for solving it. All right. 
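Editor's sketch: the eval ideas above (keep the raw trace of every run so a bad session can be inspected, then capture the check in a judge, and require the same answer every time) can be shown in a few lines of Python. This is a toy under stated assumptions, not Omni's eval system. `Trace`, `run_agent`, `judge`, and `evaluate` are hypothetical names, and `run_agent` is a hard-coded stand-in for a real agent.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Raw record of one run: what was asked, each step taken, the answer."""
    question: str
    steps: list = field(default_factory=list)
    answer: object = None


def run_agent(question, trace):
    """Hypothetical stand-in for the agent; a real one would call the model."""
    trace.steps.append("look up topic: pull_requests")
    trace.steps.append("generate and run query")
    trace.answer = 42
    return trace.answer


def judge(traces, expected):
    """Check correctness and run-to-run consistency (answers must be hashable)."""
    answers = [t.answer for t in traces]
    return {
        "correct": all(a == expected for a in answers),
        "consistent": len(set(answers)) == 1,
    }


def evaluate(question, expected, runs=3):
    """Run the same question several times, keeping every trace for review."""
    traces = []
    for _ in range(runs):
        trace = Trace(question)
        run_agent(question, trace)
        traces.append(trace)
    return judge(traces, expected), traces


verdict, traces = evaluate("How many PRs were merged last week?", expected=42)
print(verdict)
for step in traces[0].steps:
    print("  ", step)
```

The verdict gives the pass or fail signal, while the stored `steps` are what someone would read when asking why a particular run went wrong.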
I'm gonna switch to live demo mode, and if all goes well, I'll show you just a quick, a quick glimpse at how Blobby actually works live, uh, if all goes well. Uh, so first I'm gonna create a dashboard. So, uh, create a dashboard of engineering activity in the Omni repository. So creating a dashboard creates a lot of queries. Uh, it also does a lot of thinking about sort of how to lay out the dashboard. It can sometimes take a little while. So while that's running, we'll, we'll take a look at sort of the starting point here. Um, but right, it's gonna create a bunch of, uh... It's, it's got a plan. It's gonna go through and look at the different relevant topics. An Omni topic is like a domain of data. So it's, uh, think of it as a big wide data set that combines all the other sub-data sets together. And it's found the relevant topics, and it's go- gonna go start actually building this dashboard. While it's doing that, uh, I will switch over to just like that, that demo that I showed earlier. Same exact type of query. You know, I think one of the other things that we're tapping into here is our philosophy from a product perspective is AI to build, UI to sort of validate and troubleshoot and refine. And so that's baked into this chat experience in a really deep way. So this is pretty much the same exact session. And let's say I wanna actually go sort of touch this data, understand it in a more deep way. You know, instead of having to kinda squint at a SQL block, I can actually open this in a workbook. And i- in a- in Omni, a workbook is just a way to actually generate queries and manipulate your data, right? So I can see, okay, so these are the, uh, you know, this is the, the GitHub pull request data set. It's correctly filtered for the Omni repository. You know, it's correctly looking at the main branch, um, and it's looking for all of the merged, uh, the merged pull requests. Um, and so, you know, I can go manipulate this chart if I want to. 
Um, and then additionally, you know, I can start actually looking at other aspects of this data. So let's say I actually just want to look at, uh, a specific user. Or you know what? We'll do it a little simpler here. I can just look at a different repository, right? And just very s- very quickly, like let's look at our docs repository instead. So just becomes a, a much easier way if you wanna then go sort of riff on what the, what Blobby has done for you, you can do that. Get back to our, our dashboard here. Uh, and so if you look... Let's see how we're doing. Oh, great. Okay, so it's built a dashboard. Uh, let's see. It's given us a little summary. Engineering activity, key metrics. Cool, okay, let's go check this out. So I'm gonna preview this. Pulls open the dashboard in a split pane. Um, and nice, it's done a very nice job here. So past three months, uh, top PR authors. Oh, man, it's gonna go to the, the heads of those engineers. Um, PR volume over time. You can see some of this trend. Actually, if we look at further back, let's look back over the past 12 months, and we should actually see some of that trend that I showed in the original... Oh yeah, there you go. Uh, in the original slide deck about or s- the original slide about, uh, our activity. Um, oh, not surprisingly, AI is a very hot topic at Omni today. Um, oh, it looks like some of the workflow data [chuckles] didn't come through quite correctly here. That happens. Um, so yeah, and then, you know, as I was saying, right, I can go through and actually just troubleshoot this live. So maybe I just wanna go understand, okay, why is this chart blank? There we go. And I'll be honest, I don't know at a glance why it's blank. [laughs] I actually don't think we have that data populating very well. Um, so anyway, we'll go back to the slides for now. Uh, thank you for indulging me in the live demo. Um, to wrap up, uh, Omni AI analytics platform powered by Claude. 
Uh, we have specifically designed our harness to be optimized for Claude and the Claude family of models. Have some great customers. Fortunate to be surrounded by phenomenal engineers a- and other teammates. We're based in San Francisco. Uh, if anybody wants a Blobby sticker after the talk, I'd love to chat with you and give you one.
- Speaker
[clapping] [upbeat music]
Episode duration: 26:45
Transcript of episode K4-flzsPraE