How I AI

How Mozilla Uses Claude Mythos to find Firefox bugs before hackers do

Brian Grinstead is a distinguished engineer at Mozilla, where he’s worked on Firefox and the web platform since 2013 (he joined to help launch Firefox DevTools). Recently he and his team pointed an agentic bug-finding pipeline at Firefox, a codebase with tens of thousands of files and tens of millions of lines of code, and shipped a record month of security fixes. The viral chart everyone saw gave the credit to Anthropic’s new Mythos model. Brian’s take is that the harness and pipeline did just as much of the work, and he walks through exactly how it runs and how anyone can build a starter version.

*What you’ll learn:*
1. How to build a basic bug-finding harness by running Claude Code or Codex with one prompt and the -p flag, no SDK required
2. Why pointing an agent at a whole codebase fails, and how an LLM judge can score and rank files before you spend any compute
3. How a verifier subagent kills false positives by catching the agent when it cheats
4. The goal-loop pattern: give an agent a tightly scoped problem, a clear pass/fail signal, and let it retry far past the point a human would quit
5. Why teams that already invested in fuzzing, CI, and dev tooling are so far ahead
6. How to weigh model versus harness, and why Brian splits the credit close to 50-50
7. How a non-engineer can reuse the same score, verify, and fix loop for design quality, conversion rate, or tech debt
8. Why AI-generated patches still can’t ship on their own, and where humans stay in the loop

*Brought to you by:*
• WorkOS: Make your app enterprise-ready today
• Metaview: The agentic recruiting platform for winning teams

*In this episode, we cover:*
(00:00) Introduction to Brian Grinstead
(02:43) The viral chart: Firefox Security Bug Fixes by Month
(05:32) How the custom harness works
(10:22) Goal loops and guardrails
(14:45) How they built it
(16:55) Real bugs, including a 15-year-old one
(23:00) Open-sourcing it
(26:26) Why humans still review every fix
(32:30) Live demo and prioritizing files
(40:18) Mobilizing the team and recap
(42:33) Lightning round

*Tools referenced:*
• Claude Code: https://claude.ai/code
• Claude Agent SDK: https://code.claude.com/docs/en/agent-sdk/overview
• Codex: https://openai.com/index/openai-codex/
• OpenAI Agent SDK: https://developers.openai.com/api/docs/guides/agents
• VS Code: https://code.visualstudio.com/
• Docker: https://www.docker.com/
• Firefox: https://www.mozilla.org/firefox/
• Address Sanitizer: https://github.com/google/sanitizers
• RLBox: https://rlbox.dev/

*Other references:*
• Mozilla Bug Bounty Program: https://www.mozilla.org/security/bug-bounty/
• Mozilla GitHub: https://github.com/mozilla

*Where to find Brian Grinstead:*
• LinkedIn: https://www.linkedin.com/in/bgrins/
• GitHub: https://github.com/bgrins

*Where to find Claire Vo:*
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

_Production and marketing by https://penname.co/._
_For inquiries about sponsoring the podcast, email jordan@penname.co._

Brian Grinstead (guest) · Claire Vo (host)
Jun 22, 2026 · 48m

EVERY SPOKEN WORD

  1. 0:00-2:43

    Introduction to Brian Grinstead

    1. BG

      Firefox has tens of thousands of source code files and tens of millions of lines of code. It's not possible to say, "One shot, go find all the potential bugs in this project." It's way too much context for the model.

    2. CV

      I think people really underappreciate the relentless tedium that an agent will go through.

    3. BG

      Anybody who's done this kind of what I call archeology, it's really hard to do, and this is something that the coding agents are great at. I asked Claude Code, "Go figure out semantically when this bug was introduced." I was, like, watching it do Git commands I didn't even know existed.

    4. CV

      And the ability to take an agent and give it a very constrained problem and surface area and say, "Exhaust every attempt at this," is really powerful. Again, not because human intelligence couldn't identify similar issues, but actually our, like, cognitive energy [laughs] declines over time in a way that agents don't.

    5. BG

      Our goal is not to have a bunch of bugs that are hard to find. Our goal is to have zero bugs. And so I think that these tools, as us and other defenders are starting to apply them, actually get us closer to that world. [upbeat music]

    6. CV

      Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today, I have Brian Grinstead, distinguished engineer at Mozilla Firefox, who's gonna take us behind the scenes with their experience with Anthropic's new, but not yet fully released, model Mythos, and how they solved almost 500 security bugs by rolling their own harness, which isn't as complicated as you think. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch. These tools only work well when they have deep access to company systems. Your copilot needs to see your entire codebase. Your chatbot needs to search across internal docs. And for enterprise buyers, that raises serious security concerns. That's why these apps face intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch, it's a massive lift. That's where WorkOS comes in. WorkOS gives you drop-in APIs for enterprise features so your app can become enterprise-ready and scale upmarket faster. Think of it like Stripe for enterprise features. OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at WorkOS.com. Start building today.

  2. 2:43-5:32

    The viral chart: Firefox Security Bug Fixes by Month

    1. CV

      Brian, welcome to How I AI. I'm really excited about this episode because I think you have one of the most impactful stories in AI engineering right now. So tell me first, let's take a step back. For people, what is the scope of the product you're working on, and how challenging is what you pulled off?

    2. BG

      Yeah. Thank, thanks for having me. So, uh, I, I work on Firefox, which is a production web browser. It's very large and very complex. Um, we have to render the entire web, whether [laughs] the site was built in 2000 or just pushed up yesterday. And so there's all sorts of performance, security, and, and complexity, um, that we deal with. And so we recently, in the last few months, have been using new agentic, uh, scanning techniques to find a, a deluge of security bugs inside of the, the product, as have many teams as we've started to, uh, g-get better harnesses, models, and, and techniques for actually getting those bugs fixed.

    3. CV

      Yeah, and what, what caused me to reach out and ask you to be on the podcast was this chart. Everybody saw this maybe on X or on the timeline where the Firefox security bug fixes by month just spiked in multiples in, in April. And the headline of a lot of posts around this was Mythos, the definitely real model [laughs] that other people have access to, just not, not, not us normies, has unlocked a incredible amount of discovery and fixes on edge case or complex security bugs. But I think the story behind the story is actually a little bit different. Can you tell us about kind of how you got here, and how much of this was model, and how much of this was work you did internally?

    4. BG

      Yeah, that's right. So I think like a lot of open source projects through 2025, we had been dealing with almost, like, unwanted AI bug reports. And you know the shape. You've seen it if you see... You get a doc, and you can just tell it's from an AI. You know, it, it looks very nice and professional, but you would get halfway through, and the engineer is looking at this and saying, "That's wrong," right? And so there's a, a sort of, uh, asymmetric cost on project maintainers to receive a thing. It's cheap to just paste in some C++ code into, into a chatbot and get back something, uh, that's wrong, but there was no way to actually verify whether that was true. And that all changed, like, I would say in February of, of 2026, and I think a big part of that story is the harnesses themselves just getting better. Definitely improving and upgrading the model helps in a couple ways. Like, it, it has better hypotheses about where to start with the bug. It does a better job of making test cases. But I think the harness itself and plugging it into a pipeline of getting the bugs fixed is actually a little bit the story behind the story.

  3. 5:32-10:22

    How the custom harness works

    1. CV

      Yep. So for folks that have heard this word over and over again, I know the engineers that are watching probably know what you're talking about, but just define for us or show us what you mean by harness and how you built something custom to your product, your team, that unlocked your ability to get actual throughput on security bug fixes, not just a bunch of unactionable kind of slop reports or things you couldn't validate.

    2. BG

      The harness is a way to give an LLM tools to achieve some goal. And so if you think back to chatbots before they had any custom tools or anything like this, it's almost a brain in a jar. You know, you're just chatting with it and it's chatting back. That has almost... There's no harness around that. These days, the chatbots are sort of blurring in a little bit with the coding CLIs in terms of what they can do, but this is giving access to tools like running bash scripts, opening a browser, measuring whether you were able to create a security issue, and so on. And so this is the actual, uh, mechanism that we use, uh, and have customized to find issues in Firefox. Uh, this is largely like, you could almost imagine just Claude Code is a harness, right? And so you build custom prompting and some custom orchestration around it to plug in with your particular systems. But, um, it's, it's actually a, a reasonably simple, uh, wrapper around it. You just need to give it access to the right tools for the job.

    3. CV

      So show us what, what tools you decided were necessary-

    4. BG

      Yeah

    5. CV

      ... in your har- harness, and walk us through where you felt there was real leverage in building something custom versus using off-the-shelf Claude Code or off-the-shelf Codex.

    6. BG

      So he- here's an example that's sort of a, a flowchart, a little bit of how our custom harness works. And, and I'll say up front, like, often when I see these flowcharts, you see them in academic papers too, it makes it look really complicated. There's all of these boxes and arrows and, and I think really it's, it's simpler than it looks, I would say. Firefox has tens of thousands of source code files and tens of millions of lines of code. And so it, it's not possible to say, "Oh, one shot, go find all the potential bugs in this project." It, it's way too much context for the model. And so we have to do some initial sort of scoring to indicate which files do we want to actually point this thing at. We can talk more about that later, but it, it eventually kind of comes up with some prioritized list of which files to target, or even functions in certain cases. And so that goes into this, what we... sort of a main agentic loop, but you can almost think about this as like a, a Claude Code session or a Codex session, where you have some custom prompt that says, "Here's a checkout of Firefox. Here's some tools to kind of look around the code base. Here's your target file." Within that file, we, we kind of lie and we say, "We know there's a security bug in this file. You have to go find it," basically. And it will just start working its way from the code to reason about, how do I get into this code from a webpage? So i- some evil webpage, how could it actually call this line of code? And it's interesting to watch it kind of think and move its way around, but ultimately it will come up with HTML test cases, basically. And we plug this in to our existing tooling infrastructure that we've had for, for decades to do fuzzing, for example, where you can pass a test case and get back a report as to whether there's a potential memory safety issue. 
That will then, um, get feedback from that tool whether it succeeded or not, and if it didn't, it goes back and starts again, and it can start many, many times and run for a very long time. Sometimes it will end up and say, "Nope, couldn't find anything." Um, other times it will say that it found something, and we ask for it to come out in a very structured format so that we can pass it on to the next phase, which is verification. And so we've already kind of verified that there is a crash because we got the n- signal out of our fuzzing build. But sometimes the agents do just wonky things. For example, it might set a pref that was only ever meant for testing and no user ever sets. Or I've even seen cases where the agent changes the code to introduce a vulnerability so that it can exploit it and achieve its goal. And so we have another agent that's kind of looking at it and saying like, "Does this look right?" Um, that usually approves it. It, sometimes it does reject it and it kind of sends it back to do more work. But by the time that this happens, we have almost no false positives on the system, which is fixing that kind of slop problem that we talked about, um, at the start, and it, it's very well prepared to go into the rest of our bug pipeline. A- as we had continued to work, we added a patching agent, which is meant to kind of generate a plausible fix, verify that that fix has, um, resolved the security issue, and all of that, it just gets written into a pretty simple cloud orchestration system that writes it out to a storage bucket for consuming later in the rest

  4. 10:22-14:45

    Goal loops and guardrails

    1. BG

      of our pipeline.
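
The initial scoring step described in this answer can be sketched as a ranking pass that runs before any agent session is started. This is a minimal sketch and not Mozilla's implementation: `score_file` is a hypothetical stand-in for an LLM judge call, replaced here by a keyword heuristic so the example runs offline, and the `RISK_HINTS` weights and file names are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical risk hints and weights; a real judge would be an LLM call with a rubric.
RISK_HINTS = {"memcpy": 3, "reinterpret_cast": 3, "delete ": 2}


@dataclass(frozen=True)
class ScoredFile:
    path: str
    score: int


def score_file(source: str) -> int:
    """Stand-in for an LLM judge: sum the weights of risky patterns found."""
    return sum(weight * source.count(hint) for hint, weight in RISK_HINTS.items())


def prioritize(files: dict[str, str], budget: int) -> list[ScoredFile]:
    """Return the `budget` highest-scoring files, highest score first."""
    scored = [ScoredFile(path, score_file(src)) for path, src in files.items()]
    scored.sort(key=lambda f: (-f.score, f.path))  # path breaks ties deterministically
    return scored[:budget]


if __name__ == "__main__":
    # Invented file contents, only to exercise the ranking.
    checkout = {
        "dom/a.cpp": "memcpy(dst, src, n); delete node;",
        "layout/b.cpp": "int x = 1;",
        "gfx/c.cpp": "auto* p = reinterpret_cast<Foo*>(raw);",
    }
    for f in prioritize(checkout, budget=2):
        print(f.path, f.score)
```

The point of the `budget` parameter is that compute is only spent on the top of the list; swapping the heuristic for a model call does not change the shape of the ranking step.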

    2. CV

      And so I just wanna take a step back 'cause I very recently did an episode on, um, slash goal, these sort of, like, goal and outcome loops that you can put, um, harnesses, and I was... I did an example using Codex, but they're available other place, you know, Ralph Loops, all this sort of thing. And I think people really underappreciate the relentless tedium that an a- an agent will go through, right? And the ability to take an agent and give it a very constrained problem and surface area and say, "Exhaust every attempt at this," is really powerful. Again, not because human intelligence couldn't identify similar issues, but actually our, like, cognitive energy [chuckles] declines over time in a way that, that agents don't, right? If you asked a human to say like, "Try 150 different things against this, um, and look at it every single time and make an evaluation," it would be very, both very time inefficient and exhausting. And so I just love these ideas of these, like, relentless loops on agents. The other thing that I wanna call out, and again, I did this, this recent episode on Goal, is putting a guardrail, whether that's a verifier subagent or, um, a constraint on these goal-style loops, is really important because of exactly what you said. I gave this example of, let's say you're trying to reduce P95 latency on a page. Well, you could remove every latency-introducing feature from that page. You could actually, like, take it away and the agent would be like, "Look, I made, I made the goal. It's, it's much faster now," but you don't have that guardrails of like nothing from a product perspective can change. You can't introduce new code. And so I was curious, as you went through this verifier subagent loop, did you then feed that back in into the prompting of the analyzer agent in to guardrail against- ... common patterns that you saw?

    3. BG

      Yeah, and 100%. Like, w- I think that you, you need to give it some grounding to say, "Don't go off the rails in this particular way," because it absolutely will. And I think we have, I would say from analyzing the logs, both sort of manually as I would, as our team would see them go by, but also using LLMs to analyze the logs, say what are some, like, common patterns and problems that we're seeing come out of this thing? Um, and then we would tweak the prompts on the analyzer agent sort of after the fact, um, to improve that. And that would come both from our own analysis of the agent trace, but also feedback from the engineering teams. And they would say, "Oh, this was a terrible bug and I didn't like this," in a really tight loop with the team that's working on the product to make sure that all the threat model and the way that we were giving this back to the team, uh, was useful.
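
The two cheating patterns mentioned earlier in the conversation (setting a testing-only pref, and editing the code under test) also lend themselves to cheap deterministic checks that can run alongside a verifier agent. The sketch below is an assumption about how such a guardrail could look, not a description of Mozilla's verifier, which is described as another agent. The `Finding` structure, the allowlist contents, and the pref name `example.testing.only` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical allowlist: prefs a real user could plausibly have changed.
USER_REACHABLE_PREFS = frozenset({"javascript.enabled"})


@dataclass
class Finding:
    testcase_html: str
    prefs_set: dict[str, str] = field(default_factory=dict)
    modified_source_files: list[str] = field(default_factory=list)


def verify(finding: Finding) -> list[str]:
    """Return the reasons to reject a finding; an empty list means it passes."""
    reasons = []
    if finding.modified_source_files:
        changed = ", ".join(sorted(finding.modified_source_files))
        reasons.append(f"agent edited the code under test: {changed}")
    for pref in sorted(finding.prefs_set):
        if pref not in USER_REACHABLE_PREFS:
            reasons.append(f"relies on non-default pref: {pref}")
    return reasons


if __name__ == "__main__":
    suspicious = Finding(
        testcase_html="<html></html>",
        prefs_set={"example.testing.only": "true"},  # hypothetical pref name
        modified_source_files=["dom/a.cpp"],
    )
    for reason in verify(suspicious):
        print("REJECT:", reason)
```

Rejection reasons are returned as text so they can be fed back into the analyzer prompt, which matches the feedback loop described in this answer.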

    4. CV

      Yeah, and then I wanna call out for folks because again, while there's a bunch of boxes on the screen, it's actually not that complicated. It's a, a analyzer loop. It goes through a verifi- verifier subagent. It has access to, it looks like eight, eight, you know, probably like a dozen tools, right? Key tools that are really important. File search, how to build the package, um, bug tools. And then it goes into a very classic, like, bug fix pipeline, um, which is, you know, generate the fix and then put it through a verification pipeline and, and ship it. So this actually isn't terribly complicated, and this is probably how your human system would work in an ideal world. You've just been able to encode it in, in this harness.

    5. BG

      Yeah. And, and we're, we're kind of fortunate because we've been running, um, you know, a, a browser for a long time. We have a bug bounty system. We're used to, uh, in, uh, receiving external reports. We have an internal fuzzing team who's always finding bugs, either from manual inspection or other automated tools. And so that's another thing that has really helped, is if you have the existing pipeline in place and you're letting the agent plug in almost as if it was a person doing it, you're not inventing many things at once.

    6. CV

      Yeah.

    7. BG

      And it can focus on the one thing that you've told it to do, and to do relentlessly.

    8. CV

      Yeah. And I, um, I tell this to folks all the time, is like the revenge of the DevX team, which is teams that have already invested in developer tooling, in automations, are just so much further ahead because all those tools can be leveraged at much higher velocity by these agents. And so, um, I'm going company to company and telling people like, "Please, please, if you haven't already, now is the time to invest in developer tooling," 'cause what's good for the agents is very good

  5. 14:45-16:55

    How they built it

    1. CV

      for humans as well and, and vice versa. I'm curious, is this loop... Again, like, is it model agnostic? Did you all use a specific SDK? Did you like, kind of like artisanally craft the whole thing? How did you actually build this? People are always curious.

    2. BG

      It's a very open space and there's a lot of options, so it's a good question, and I think it's also moving very quickly. So the initial version used the Claude Agent SDK, which is, um, essentially it's a wrapper of the Claude Code CLI, where it runs it in a special mode where it's streaming out JSON, but it gives you nice programmatic hooks for like a Python or TypeScript project. We have been exploring the best option for adding Codex support, and you can do that in a couple different ways. One is you could have Codex CLI, you could have the OpenAI Agent SDK, or you can move to like a third-party harness that's meant to be model agnostic. And like, like my intuition on all of this based on some initial testing is that it, the vendor-provided harnesses as the underlying, uh, infrastructure is probably the best way to go. And, and I, I... Like, they're probably doing post-training and other things using those harnesses to make their models work best in them. But you also want to make sure that you're running against a variety of, of mode- models, harness techniques, and prompting, because as defenders, you need to be sort of scanning the landscape for any one attacker who might be trying to do something weird or with a different model, and it's gonna actually just find a different bug.
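
For a starter version without any SDK, the show notes mention running the CLI with one prompt and the `-p` flag. A minimal sketch of that idea follows, assuming the `claude` CLI is installed and on the PATH, and assuming its JSON output carries the final answer in a `result` field. The prompt wording in `make_prompt` is paraphrased from the episode, and the function names are illustrative.

```python
from __future__ import annotations

import json
import subprocess


def make_prompt(target_file: str) -> str:
    """Prompt in the spirit of the episode: assert a bug exists and scope it to one file."""
    return (
        f"Here is a checkout of the project. Your target file is {target_file}. "
        "We know there is a security bug in this file. Find it and produce an "
        "HTML test case that triggers it."
    )


def build_command(prompt: str) -> list[str]:
    """Argument list for a one-shot, non-interactive run that prints JSON."""
    return ["claude", "-p", prompt, "--output-format", "json"]


def parse_result(stdout: str) -> str:
    """Pull the final answer out of the JSON envelope (field name is an assumption)."""
    return json.loads(stdout).get("result", "")


def run_agent(target_file: str, checkout_dir: str, timeout_s: int = 3600) -> str:
    """Run the CLI inside the checkout and return the agent's final answer."""
    proc = subprocess.run(
        build_command(make_prompt(target_file)),
        cwd=checkout_dir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return parse_result(proc.stdout)
```

Keeping command construction and output parsing in separate functions makes it easier to add a second backend later, which is the multi-model setup this answer argues for.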

    3. CV

      Yep. Yep. Great. So that's, that's helpful. So yes, again, repeating for people, this was, this was my intuition of what you were gonna say, which is the Claude Agent SDK or the OpenAI Agents SDK. Um, there are some third-party frameworks or harnesses. You can use Py is one I hear people are loving a bunch right now. Um, but I, I think your intuition matches my intuition, which is because these, um, model provider harnesses are so tuned to their particular models, you actually have to run both, especially in a security environment, because that is exactly what your attackers are gonna be running. And they do, they do have... They spike on strengths, both from a model perspective and harness perspective on different things, and will very likely identify and fix different things. So, okay, you, you've sold me.

  6. 16:55-23:00

    Real bugs, including a 15-year-old one

    1. CV

      We should, we should all build our, our custom harness. You know, not just for security issues. Again, this is like a very particular use case, but there are all sorts of use cases where a custom harness would be very effective inside, um, in particular large codebases. Let's talk about how one of these actually runs.

    2. BG

      Ultimately, the, the sort of infrastructure to plug all this stuff together is very shared across many needs from like triage, uh, bug detection, bug fixing. Y- you're sort of, um, standing this thing up in its own environment. You're giving it some goal, and it needs to be some grounded goal. It is going and running, and then it's giving you some artifacts to plug in further down your pipeline, whether that's an issue tracker or a pull request and so on. And so it, it's interesting how much overlap I think there is with this and some of... I, I've seen, you know, um, projects that are designed more for bug fixing that look very similar to this. So this is your standard kind of vibe-coded dashboard here that shows basically a bunch of runs. And, and note this is, uh, mostly fabricated data. This isn't the real, you know, Firefox runs. But what we have done here is I've set sort of... We, we send them off in batches, and so set sort of sets of files that are related or we wanna do some evaluations and so on. And so what I have done here is we're actually taking, um, 10 real bugs, and these are bugs that we, we opened earlier than we normally would have, um, security bugs that had recently shipped. And we did that, and like, for exactly the thing that we're doing now, which is to sort of help, um, help make people aware of how this works, how can you apply it as defenders, and sort of, um, help understand that this is, this is real. So what we've done is we've taken those exact bugs and sort of pulled the, the actual traces for them to dig into here on the show. So a couple that I think were pretty interesting. W-we start with this legend element, so this is an HTML element that you can use for organizing forms.

    3. CV

      And I have to go back really quickly to the blog post, because this one caught my eye. Um, is this the one that was, like, 20... 15 or 20 years old?

    4. BG

      Yes.

    5. CV

      15 years old.

    6. BG

      I think that was a... Yeah, that was an XSLT bug, and we found, um-

    7. CV

      I see the number two, though

    8. BG

      A, a number-

    9. CV

      A 15-year-old bug.

    10. BG

      Yes. Exactly. So if you notice our bug IDs for these new bugs are 2,025,977, so there's many, many bugs in Bugzilla. If you find a bug, um, that is in the six digits, it's a very old bug. And so it, it was kind of funny. For th- I... For this exact XSLT one, I wanted to say, like, "When was this bug introduced?" And anybody who's done this kind of what I call archeology is, it's really hard to do, and this is something that the coding agents are great at. So I would say, "When was this bug introduced?" Well, the file got renamed three years ago-

    11. CV

      Yeah

    12. BG

      ... and so you can't just do a Git diff and then actually this blob moved to that file. It's very annoying work. And I asked Claude Code, "Go figure out, like, semantically when this bug was introduced." And it... I was, like, watching it do, um, Git co- Git commands I didn't even know existed to go and kind of taking notes as it was doing it, what it was doing. So really interesting. That's how we had gotten that 20-year-old number. Um, and so a lot of these have been around a long time. We have a bug bounty program that would pay people to find these bugs, and sort of they're, they're very hard to discover, and that's what part of what makes this so notable. And so if we look at, like, this legend one, um, the tool that uses to do browser evaluator, it tried 14 times. And so kind of logistically what this looks like is it says, "Okay, um, I'm looking at this element, but huh, web..." 'Cause Web IDL is, like, a description. "I need to go find the C++ implementation." And it just works through like you would see Claude Code or the Codex do. And it would come up with some theory. It would look at, it would look at some function and say, "Huh, I think you, you've told me that there's a bug in this." Similar to what you said, come up with 100 variations of it. "Maybe it's this problem or that problem." And it will try it, and it will keep trying it, and it tries 14 times, or 13 times, and it fails. And then finally the 14th time it hits, and it found it. And the great thing is not only does it come up with this sort of analysis, which I would love to go spend, you know, a couple hours on each of these and do a deep dive on how exactly this works and why it matters, but this is, like, the shape of the reports that I was complaining about us getting in 2025. The thing that makes this different is that we have this. And so this is, like, a really kind of complicated H-HTML page. This is what, what browsers have to deal with, people making pages like this. 
And they're, like, creating, uh, the element. They're setting what's called an expando property on the DOM node, which is like an attribute but not an attribute. It removes the element. It does some cycle collection, blah, blah, blah, and at the end it creates a heap use-after-free, which is... This is exactly the sort of shape of a bug report that we send on to our engineering team.
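
The retry behaviour described here (thirteen failed attempts, then a hit on the fourteenth) is the goal-loop pattern: propose a candidate, ask a pass/fail oracle, and repeat with feedback until success or budget exhaustion. A minimal sketch follows, with toy stand-ins for the agent and the fuzzing build. `propose` and `oracle` are hypothetical callables, and the toy oracle simply "crashes" on the fourteenth variant.

```python
from __future__ import annotations

from typing import Callable, Optional


def goal_loop(
    propose: Callable[[int, list[str]], str],
    oracle: Callable[[str], bool],
    max_attempts: int,
) -> tuple[Optional[str], int]:
    """Keep proposing test cases until the oracle passes or attempts run out.

    Returns (winning test case or None, number of attempts used).
    """
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = propose(attempt, feedback)
        if oracle(candidate):
            return candidate, attempt
        # Failed attempts are recorded so the next proposal can take them into account.
        feedback.append(f"attempt {attempt} did not crash")
    return None, max_attempts


def toy_propose(attempt: int, feedback: list[str]) -> str:
    """Stand-in for an agent session producing an HTML test case."""
    return f"<p>variant {attempt}</p>"


def toy_oracle(testcase: str) -> bool:
    """Stand-in for the fuzzing build: only variant 14 triggers the crash."""
    return "variant 14<" in testcase


if __name__ == "__main__":
    testcase, attempts = goal_loop(toy_propose, toy_oracle, max_attempts=50)
    print(f"found on attempt {attempts}: {testcase}")
```

The loop itself is trivial; the value comes from the oracle being unambiguous and from the agent being willing to keep going long after a person would stop.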

    13. CV

      I wanna go back to the very beginning when we were reflecting on the complexity of this product, and I was just thinking, we've decided to rewatch Silicon Valley at my house, so I'm watching these. And, uh, Gilfoyle had an HTML5 shirt on, and I was just reflecting on what you said. Like, you have to render the web, whether it was written 25 years ago or yesterday by an agent. And so the, the breadth of vulnerabilities that can be introduced in an HTML page, in JavaScript, is, it seems almost insurmountable. And so I wanna reflect for people who are maybe not watching, is you have this, this agent, this harness that not only can come up with hypotheses on what could create a vulnerability in a file or a subset of a file, and you not only get a document of, "This is where I think the vulnerability comes from," but you actually get a rep- uh, a, a test HTML file that then replicates that bug in production where you can prove it actually creates, creates the issue. Is that... Have I tied that all together correctly?

    14. BG

      That, that is exactly the thing, and that is the thing that makes this approach different

  7. 23:00-26:26

    Open-sourcing it

    1. BG

      from previous attempts.

    2. CV

      Yeah. And so I just want, you know, engineering leaders, engineers out there, senior engineers out there to just think about this process, which is, you know, uh, kind of incepting your agent to believe that there is something wrong with your code, whether, whether it's a security bug or a functional bug, right? Like, I know something's broken here. I know this is suboptimal from per- from a performance perspective maybe is another example of this. Being able to run a loop of hypotheses on it and then actually create a test or re- or, um, recreation artifact is a really f- powerful loop I don't want people to miss as they zoom out into, into the agent.

    3. BG

      Yeah. I just am- I could add one thing on that. For your projects, I think that one difference is... So we, we've actually open-sourced the, the sort of tooling that we use for Firefox, um, I think just yesterday for, for some of this, so for security researchers who wanted to test it. For our case, we have, what we call a very crystal clear w- task verification signal. And so we have this fuzzing build, um, that uses AddressSanitizer, and it's like you win or you lose. You pass the file, and we can tell you AD. Often if you, if you have a web app, a, a distributed system of some kind, it may not be so crystal clear. And so you need to think really hard for your project about your threat model, and then how would you like to verify whether it's true or not. Like, could be, like, a test case or it could be... I, I think that's actually something to, that as you're thinking about applying this to your own project, that is a really important aspect of it.
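
One way to picture the "you win or you lose" signal: an AddressSanitizer build prints a report whose first line names the error class, so a harness can turn the build's log into a verdict with a single pattern match. This is a sketch under that assumption, not Mozilla's tooling, and the sample log lines are made up for illustration.

```python
from __future__ import annotations

import re
from typing import Optional

# ASan reports start with a line of the form
# "==<pid>==ERROR: AddressSanitizer: <error-class> ..."
ASAN_ERROR = re.compile(r"==\d+==ERROR: AddressSanitizer: ([\w-]+)")


def asan_verdict(log: str) -> Optional[str]:
    """Return the ASan error class found in the log, or None if the run was clean."""
    match = ASAN_ERROR.search(log)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up sample output for illustration only.
    crashing = (
        "==4242==ERROR: AddressSanitizer: heap-use-after-free on address "
        "0x602000000010 at pc 0x000000400b1c\nREAD of size 4"
    )
    clean = "test finished, no errors"
    print(asan_verdict(crashing))
    print(asan_verdict(clean))
```

For a web app or distributed system the equivalent function is usually harder to write, which is the point made in this answer: deciding what counts as a pass or a fail is most of the design work.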

    4. CV

      And what I think is most people just, uh, we haven't gotten in the... We haven't built the muscle memory of how to articulate success cases, how to articulate, uh, failure cases so crisply, and it's, you're really-

    5. BG

      Yeah

    6. CV

      ... benefiting from this, this previous art here, which is you've done the work up front, and so you have that already. And I, you know, just got off the call with somebody in, um, this is not engineering, in design. And I said, "This is the moment where you're actually gonna have to write down what good design is and how you might, um, quantitatively evalu- evaluate that or qualitatively evaluate it." And so I think the skill of being able to crisply articulate, test, and measure outcomes, whether they are security outcomes, quality outcomes, um, softer outcomes, is becoming a hard skill people have to develop.

    7. BG

      Yeah, and, and I think to, to points you've, you've made previously, like it's actually just great for the whole project-

    8. CV

      Mm-hmm

    9. BG

      ... to have that defined.

    10. CV

      Yep, yep. This episode is brought to you by Metaview, because who says hiring has to be fair? Every founder, hiring manager, and recruiter I speak with feels the same pressure: hire the right people as fast as possible. But recruiting is brutally time-consuming, alignment is hard, and the competition for great talent keeps getting tougher. That's why teams like Riot Games, Brex, GitLab, and Replit, plus 5,000 other organizations use Metaview, the agentic recruiting platform giving high-performance teams an unfair advantage in hiring. It works by giving you a suite of AI agents that behave like recruiting coworkers, finding candidates based on your exact criteria, taking interview notes, reviewing every inbound application, gathering insights across your hiring process, and helping you identify the best candidates in your pipeline. Don't let your competitors out-hire you. Metaview customers close roles 30% faster. Get started with Metaview today and get your first 100 candidates sourced for free at metaview.ai/howiaai.

  8. 26:26 - 32:30

    Why humans still review every fix

    1. CV

      Okay. Love this. We'll share this MCP in the show notes. If you wanna, uh, help the project, please, please dive in. So, a- and I'm looking at this example. You know, you're seeing turns of nine turns, 10 turns, 14 turns, um, finding the result and getting that, that whole package. And then does that run through a, you know, sort of a bug fix pipeline?

    2. BG

      Right. So originally, it did not, and many of these were early finds, so they don't. But one of these is an interesting one to look at, which is this, um, nsXULContentSink. This is a pretty complex bug. We have an in-process sandbox technology called RLBox that's meant to help us kind of shrink-wrap third-party dependencies, so that if there's a vulnerability in that code, it can't leak out to Firefox. And this was a really complicated find, and it has tons of artifacts. But the interesting thing is the proposed fix is very simple. It just said, "Oh, you were asserting this. You should've been asserting that," in terms of kind of input validation. And so we did start to basically have this patching agent run on every fix, and the cool part is you're in the loop, so you can actually just apply the patch, build Firefox, and confirm that that same test doesn't crash anymore. And so that's great. But if we go look at the bug, this is basically a dump of that bucket that we were just looking at, all those files. It's received by the team, there's some discussion, and then we have sort of a, "Yep, this looks like a real issue. The fix looks good, but actually we should check in a few other places." So one of the things you'll see with agents, and I'm sure there are harness techniques, and the models will get better, is they get laser-focused on the task you've given them. And so if we go look at the actual bug fix that landed here, it's pretty much what they said. We're checking this, but also we're checking the same thing in like three other places. And so that's where the expert engineers in every single subcomponent come in, whether that's JavaScript, media, DOM, layout, graphics. 
We have people who are, uh, world-class browser engineers who are working on this stuff, and they will look at the fix and say, "Oh, that looks pretty good," or, "Oh, this is like completely wrong." And of course we use that feedback to try to improve the patching system, but I think we're pretty far off from having a kind of magic button that produces landable patches in the browser.
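
The "apply the patch, build, confirm the same test doesn't crash anymore" check described above can be sketched as a small gate. This is a minimal sketch, not Mozilla's pipeline: the function names are hypothetical, the reproducer commands are whatever you would use to run the crashing test against each build, and treating a non-zero exit code or a hang as a crash is an assumption.

```python
import subprocess


def crashes(repro_cmd: list[str], timeout_s: int = 120) -> bool:
    """Run a reproducer; treat a non-zero exit or a hang as a crash (an assumption)."""
    try:
        result = subprocess.run(repro_cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0


def patch_is_verified(repro_unpatched: list[str], repro_patched: list[str]) -> bool:
    """Accept a patch only if the test crashed before it and passes after it."""
    # If the original build does not crash, a clean patched run proves nothing.
    if not crashes(repro_unpatched):
        return False
    return not crashes(repro_patched)
```

Requiring the crash to reproduce on the unpatched build first guards against a reproducer that silently stopped exercising the bug. It still only checks the one reported site, which is the gap the reviewers in this story caught by hand.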

    3. CV

      Yeah. Fair warning, because I have used the, um, Codex security product, which I actually think is quite good. It helps you develop a threat model. It goes through and scans your code and comes up with issues and patches. The problem that I found was exactly this, and I do not have millions and millions of lines of code. I have hundreds of thousands of lines of code. It will get laser-focused on the specific patch. It'll say, "For this bug, this is the patch." But it doesn't do the next level of, like, go categorically find similar issues across the codebase and then come up with an architecturally clean global fix for this class of, in this instance, security bugs. And so I have found that that's a piece missing in the existing security tooling, and it does take an engineer who knows, to some extent, the codebase or the structure of the codebase to identify some of those. And so I do think this is the next step in some of these harnesses, which is, for any one fix, taking the loop and saying, "We've identified this issue. Go into similar parts of the codebase and identify if we have this issue systematically." Then zoom back out and articulate what the fix is overall, as opposed to the point fix, and then ship that and close all the related issues. That's a path that I've been doing manually, and now, seeing this, I'm just thinking about how to do that more systematically.
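
One crude way to start on that "go find it elsewhere" step is a plain textual sweep for the pattern a point fix touched. The sketch below is an illustration under assumptions: `find_variants`, the file suffixes, and the regex are all hypothetical, and a regex match is only a candidate that a person or an agent would still have to examine.

```python
import re
from pathlib import Path


def find_variants(root: Path, pattern: str,
                  suffixes: tuple[str, ...] = (".cpp", ".h")) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for every line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*")):
        # Only scan source files with the chosen suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Each hit could then be queued as its own tightly scoped task ("does this site have the same flaw as the one already fixed?") instead of asking for one sweeping audit.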

    4. BG

      Yeah. I think, as you said earlier, a lot of these are kind of converging in terms of the needs, whether it's detection or patching. And I obviously am expecting the harnesses and models to get better at this. I do think we're pretty far out from a project of web browser scale and complexity being able to be autonomously developed. And we actually have, uh, requirements for having, you know, people who write the code and review the code. But we're able to use these tools to help accelerate that quite a bit.

    5. CV

      Well, and I mean, if we just take this to the meta level, you know, part of having an open source project like this is it is very large to maintain. It requires the community to maintain something like this, and you wouldn't expect the complexity of that to change just by nature of you introducing agents. I think open source projects in particular will have to think about how we integrate agents into them and how that intersects with the community. And I do think they are the most complex and often longest-standing codebases that we have out there. And so it's interesting to hear you say, you know, "I don't think overnight we're just gonna turn this repo over to the agents, and we're all happy," either on the security side or on the product side, right?

    6. BG

      Yeah. And I think on the security side, open source supply chain is such an interesting and important topic around this too.

    7. CV

      Oh my God.

    8. BG

      I think you have to work with every project. You know, there are a lot of important projects. Firefox depends on many, many, um, pieces of core internet infrastructure in its supply chain. And every project has different needs and preferences and threat models and things that they care about, the way they want the bugs, where they work. And there's a sort of human connection and network problem involved there. We found many bugs in the supply chain, and we have personal connections with a lot of those projects. And so you're kind of working your way in, in a way that is, I think, less automatable than many people would hope. Um, but I think it's the reality of how this is gonna have to get deployed across the industry.

    9. CV

      Okay. This is, this is amazing. I think you and I could talk about

  9. 32:30 - 40:18

    Live demo and prioritizing files

    1. CV

      this all day, but let's show what this looks like at the individual engineer's, you know, desktop. How, how would you actually interface with this, um, as, as an engineer?

    2. BG

      We'll pull up VS Code here. There are a couple of aspects of the harness that I wanted to show off and make really concrete, to bring home the point that it's not too complicated. So as part of the demo here, I have a patch applied to a local build of Firefox in a Docker container that introduces a memory safety issue that's really obvious if you're a C++ developer. And with this patch, I wanted to show a couple of different approaches: where we started and where we got to. So inside of this, um, Docker environment, we have a simple script here with a prompt that says, "You're looking for a memory safety issue. Read the file and analyze it." And so we're not giving it access to any tools. We just say, "Look at the file and find the problem." You can run this with both Claude and Codex pretty easily. There are command line arguments like -p, which you've maybe seen with Claude. It's basically designed to be run by another program, not a human. So if we said, in this case, run with Claude, you're gonna see this kind of ugly JSON streaming out, but this is actually what's happening under the hood when you have an interactive Claude Code session. It's reading a bunch of code. It probably pretty quickly converged on the problem in this file, and it's gonna write essentially a markdown report. And so this is just the very basic primitive building block. Like, you could build this and run this yourself in an hour with Claude. You can also run it with Codex. There's a Codex exec command. Similarly, have it output JSON. You can have another program that's consuming that. This is sort of an alternative to using an agent SDK, so just another way to do it. And so that's a very simple kind of, how do you just run this thing and find a problem in a source code file? 
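
A minimal sketch of that building block, assuming the `claude` CLI is installed and authenticated. The `-p` flag is the one named above; the `--output-format stream-json` and `--verbose` flags are assumptions to check against `claude --help` for your version. The prompt wording, the function names, and the example file name are illustrative and are not Mozilla's actual script.

```python
import json
import subprocess

PROMPT = (
    "You are looking for a memory safety issue. "
    "Read the file {path} and analyze it. Write a short markdown report."
)


def build_command(path: str) -> list[str]:
    """Build a non-interactive Claude Code invocation for one source file."""
    # -p runs in print mode so another program can drive the session.
    # The output flags below are assumptions; verify them locally.
    return ["claude", "-p", PROMPT.format(path=path),
            "--output-format", "stream-json", "--verbose"]


def parse_events(stream_text: str) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank or non-JSON lines."""
    events = []
    for line in stream_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            events.append(obj)
    return events


def run_analysis(path: str) -> list[dict]:
    """Run the CLI once and return its events (needs the CLI and credentials)."""
    proc = subprocess.run(build_command(path), capture_output=True,
                          text=True, check=True)
    return parse_events(proc.stdout)
```

Calling `run_analysis("example.cpp")` would hand back the streamed events for a consuming program to filter. The same wrapper shape should carry over to the Codex exec command with its own JSON output option.
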
I think as we were running this on less trivial security issues, we found that we needed the actual harness that I was describing earlier. And so we'll have an example running that here, where it will basically do the same loop. It's using an agent SDK, and it has access to all of these tools. And so it's gonna go through and read. Uh, this'll take a while, so I'll switch over to a completed job here. But it'll basically read. It'll do some tests. It'll run this HTML page as a tester, and then it'll say, "Yep, looks good." Um, this is actually the verifier subagent, which has now returned some structured JSON saying, "Yep, I approved it. Here's why this is a problem. Here's exactly how you would exploit it. Here are the steps to reproduce, and here's the security impact." At that point, you have the results. You have the bucket. We could go and put this in our bug tracker system and give it to an engineer. It does spin off a separate agent now to actually go and fix the bug. And so it will go on and, um, you know, create it, build Firefox, and verify that the crash went away.
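
The verifier's structured JSON is easiest to trust if the consuming program validates it before anything is filed. The field names below (`approved`, `reason`, `steps_to_reproduce`, `security_impact`) are hypothetical, chosen to mirror the items listed above; the real schema is not shown in the episode.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reason: str
    steps_to_reproduce: list[str]
    security_impact: str


# Hypothetical schema: field name -> expected JSON type.
REQUIRED_FIELDS = {
    "approved": bool,
    "reason": str,
    "steps_to_reproduce": list,
    "security_impact": str,
}


def parse_verdict(raw: str) -> Verdict:
    """Parse and validate the verifier's JSON; raise ValueError if it is unusable."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verifier output must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"verifier output missing or mistyped field: {key}")
    # An approval with no reproduction steps is not actionable, so reject it.
    if data["approved"] and not data["steps_to_reproduce"]:
        raise ValueError("approved finding must include steps to reproduce")
    return Verdict(
        approved=data["approved"],
        reason=data["reason"],
        steps_to_reproduce=[str(step) for step in data["steps_to_reproduce"]],
        security_impact=data["security_impact"],
    )
```

Rejecting malformed or unreproducible approvals at this boundary keeps weak findings out of the bug tracker and gives the loop a clear signal to retry.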

    3. CV

      I, I think this is very straightforward. Again, I think what you're demystifying for folks is you can, you can build... I mean, V1 is literally just running Claude Code with prompt. Right?

    4. BG

      Yeah.

    5. CV

      It's not, it's not very fancy. And then, you know, V2 is running an agent SDK with a set of, like, very useful tools and a subagent that runs a verification loop at the end. You know, my question for you is with all these files, with all these lines of code, how are you prioritizing where you point the, the agent? Because as you said, you can't just go like, "Here's my code base. Find the security issues. We know they're there." How are you actually prioritizing where to look?

    6. BG

      Yeah. And that's one of the things we've run into with scale. Some of the prepackaged skills and workflows sort of assume that you can canvass the entire repository at once and find all the issues, and with Firefox you just can't. I would say with a small project, that's plausible and probably a simple way to start. So what we are doing is a really simple sort of LLM judge here, where we say, "You're a security expert. Here are the different kinds of files we're looking at: C++ files, IDL files, WebIDL files," with a little bit of detail about each. We've copied out some of the details that we have in our existing security bug classification program, and we say, basically, give me two scores. One score is how likely you think it is that there's a memory safety issue, and another is how easily you could access this from a webpage. Because we have a lot of code that is never running in the content process at all. It's doing operating system integration things. And so we can just run that. You know, it is very, very simple. You could come up with something yourself on your own project very easily. We just go and run that. That's gonna generate basically a scores report, and we have it write a markdown thing, and it'll say, "Okay, document.cpp, that is a huge file. It is directly accessible by web content. That is a very high score. You should definitely run that." We'll then plug it in with different signals, like how many times have we run this file before? Has it found duplicates? Was it able to find issues? But really, really simple heuristics, and this is an area where we're actively working to improve, but it's enough to get you kind of started.
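
The two judge scores plus the run-history signals can be combined into a ranking with very little code. The sketch below is a guess at the shape, not the formula Mozilla uses: the 0-10 scale, the multiplication of the two scores, and the decay for repeated runs without findings are all assumptions, and the file paths are made up.

```python
from dataclasses import dataclass


@dataclass
class FileScore:
    path: str
    memory_risk: float       # judge's 0-10 guess that a memory safety bug exists
    web_reachability: float  # judge's 0-10 guess that web content can reach the code
    prior_runs: int = 0      # how many times the harness already scanned this file
    prior_findings: int = 0  # how many of those runs produced a new issue


def priority(score: FileScore) -> float:
    """Higher means scan sooner. Unreachable code scores low even if it looks risky."""
    base = score.memory_risk * score.web_reachability
    # Decay files that were already scanned repeatedly without finding anything.
    dry_runs = max(score.prior_runs - score.prior_findings, 0)
    return base / (1 + dry_runs)


def rank(scores: list[FileScore]) -> list[FileScore]:
    """Return the files ordered from highest to lowest priority."""
    return sorted(scores, key=priority, reverse=True)
```

Multiplying rather than adding means OS-integration code that never runs in the content process drops to the bottom no matter how risky it looks, which matches the reasoning given for asking the judge for two separate scores.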

    7. CV

      I just think this is so, so clever, very smart. And even for folks that don't have the codebase at the scale of Firefox or maybe don't have the same, um, threat model vulner-vulnerability surface area as your product does, I do think this idea of taking... I hear all the time, like, "How do I attack tech debt in my monorepo? How do I prioritize these things? I can't just say, like, fix, fix tech debt." Um, I think the ability to go through your codebase and prioritize areas for an agent to triage and fix, whether that is security, whether that is performance. Honestly, when I was thinking this, I was thinking for product managers and designers, you could build a very similar heuristic scoring mechanism where you say, "Go take all my components, my front-end components in my web app, and my product analytics, and give me a prioritized list of components to improve from a user experience or conversion rate perspective."

    8. BG

      Mm-hmm.

    9. CV

      And then go apply best practices on design, on conversion ra- So, like, there's just so many ways you can take this, like, LLM scoring of a prioritization of your code and then apply a very specific level of fix to it, versus saying, like, "Go all over my codebase and-"

    10. BG

      Yeah.

    11. CV

      "... make it convert better." And so I want folks to really think about how to come up with a score to prioritize things, especially if you're working with a large monorepo, because there are so many ways that this just very specific tactic is, is useful for folks.

    12. BG

      Yeah. It took sort of longer than I would've liked to put this in place, where it's sort of like, "Well, I think these files might be good ones to do." And then I was like, "Oh, duh, we should have these things get scored." We've also seen, like, you could imagine doing this with commit scanning, right? So if you have a newer project with not much code, instead of scanning the existing file structure, you actually wanna look at individual commits and score those commits and then run them through a pipeline. Or we have active work on performance as well, where of course you have a performance benchmark. It gives you a score, and you tell the agent, "Your job is to go make that number go down." And it'll come up with all kinds of performance optimizations, and it's actually the same idea. It comes up with some proposals. You have a kind of verifier or judge that produces it, that gets into a pipeline that the engineering teams can look at and prioritize, and, um, it's a pattern that I'm seeing kind of repeated

  10. 40:18 - 42:33

    Mobilizing the team and recap

    1. BG

      across many domains.
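
The repeated pattern (propose, measure against a clear number, retry) fits in one function. This is a sketch under assumptions: `propose` stands in for an agent call and `measure` for a benchmark run, both hypothetical callables supplied by the caller, and "lower is better" matches the "make that number go down" framing above.

```python
from typing import Callable, Optional


def goal_loop(
    propose: Callable[[int, list[str]], str],
    measure: Callable[[str], float],
    baseline: float,
    max_attempts: int = 20,
) -> Optional[tuple[str, float]]:
    """Retry until a candidate beats the baseline; return it with its score, or None."""
    feedback: list[str] = []
    for attempt in range(max_attempts):
        # The proposer sees every earlier failure so it can try something different.
        candidate = propose(attempt, feedback)
        score = measure(candidate)
        if score < baseline:  # lower is better in this sketch
            return candidate, score
        feedback.append(f"attempt {attempt}: score {score} did not beat {baseline}")
    return None
```

The attempt cap is the guardrail: the loop can grind well past where a person would stop, but it still ends, and anything it returns has already passed the pass/fail signal before an engineer looks at it.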

    2. CV

      Yeah. And the other thing is, I think people, you know, kind of tell themselves that AI bug fixes, AI code is almost, like, limitless and free, and therefore you can, like, cover the Earth with AI code. But, one, budgets shall not allow. Two, there is actually a time cost to shipping, reviewing, um, verifying AI code. And so you cannot go completely prioritization-free, especially when you're looking at the kinds of fixes you need to verify, and they're taking 14 loops to even get to a yes/no. I do think this, like, pre-prioritization is a very clever, um, use, so you can allocate compute appropriately-

    3. BG

      Yes

    4. CV

      ... to the highest, highest impact, impact things.

    5. BG

      Yeah. It, and we have, like, um... Just to give a sense of the, you know, we showed the graph earlier. This was a sort of incident response level event within Mozilla, where we had a Slack channel with, you know, uh, almost 100 people. I think we had 100 engineers land fixes as part of this initiative. And so it would be, "Hey, we found 60 new bugs. Let's pull in, like, this team and that team and the other," and then there's, there's, um, I think a lot of work. It did require some reprioritizing. Everybody was very tired, uh, as we've sort of gone through this, but also really motivated and, and mobilized i- in particular because you were getting these very actionable reports.

    6. CV

      Amazing. Well, just to, for people that have made it this far, I just wanna repeat what we've gone through so far. You, you know, shipped almost 500 security fixes in one month. A lot of that may have been model. A bunch of that was harness. The harness is not that complicated. Anybody can replicate it. It's really a goal or, like, Ralph-style loop against a presumed problem, a verification sub-loop, a bunch of tools can be run directly by an engineer. You're reusing a lot of the SDKs provided by these model providers, and you're prioritizing the files you go after so that what you're looking at, um, is, is a high priority to fix. And then you are mobilizing a team around this new way to work, and humans are not out of the loop. Humans' lives are just a lot better. Um, even though the volume was very high, the quality was also higher, and the actionability was higher. Um, this is so generous of you all to share. I think so many engineering teams in particular are gonna get a lot out of this

  11. 42:33 - 48:26

    Lightning round

    1. CV

      work. Before we let you go, I'm gonna do a quick lightning round, and we'll get you back to all these AI bug reports. Um, my first question, inquiring minds have to know: is it model or is it harness? Was it Mythos? Or, you know, if you had to do a split, where do you think this huge unlock, this magic graph came from?

    2. BG

      Yeah, yeah, of course, of course it's both.

    3. CV

      Mm-hmm.

    4. BG

      I think the split percentage is a tough question. We have seen, um, examples of being able to point the harness at many models, even not the latest frontier ones, and being able to find bugs. And so just that makes me think, you know, [laughs] take a cheap answer and say sort of 50-50. Like I think there's so much to still innovate on the harness side and the pipeline side. Like there's this feeling of having sort of 30 ideas for every one thing you did, and then after that you had 30 more based on, uh, what you tried. And that is to me a signal that there's just a ton to do here still.

    5. CV

      Amazing. And then my second question, which is one I feel a lot of anxiety around, um, just given kind of our experience with, like, internet supply chain [laughs] over the last even three months. As somebody who is, um, helping steward one of the largest open source projects, are you a security doomer in the age of AI? Or how should we feel about this? Should we be scared or should we feel hopeful?

    6. BG

      Y- you know, I, I'm cautiously optimistic, which compared to a baseline is probably, um, um, much more optimistic than many people. I think that the reality is th- these are bugs that have existed for a very long time, and what was gated before was just on discovery. It's really hard to find these bugs. But our goal is not to have a bunch of bugs that are hard to find. Our goal is to have zero bugs.

    7. CV

      [laughs]

    8. BG

      And so I think that these tools, as us and other defenders are starting to apply them, actually get us closer to that world, and it is gonna be a, a bumpy road, I think, for, for some time as, um, these are getting adopted. And we, we, you know, um, different projects are gonna have different depths of bugs as well. So it definitely, uh, there, there is reason to be concerned and nervous, but I think also I would say I'm, I'm generally optimistic about how this could turn out.

    9. CV

      I, I love... You, you've given me confidence. You've made me s- well, less nervous. Also, you've given me more tools, so I feel like I am, um, empowered to go solve some of these problems myself with, with similar frameworks. The last question I have, ask everybody, um, when, when AI is not doing what you want, when it is, it is just... it's, it's not giving, um, what is your prompting technique, um, and, and how do you bend, bend AI to your will?

    10. BG

      I would say for, for like pure chatbots, I'm a very boring user and sort of I, I'm, you know, pasting docs in and saying, "Give me feedback," and then I'm manually porting that feedback back into the doc. I, I just like to have control over the process and, and use it as my own exploration and, and learning. I think there's a lot out of that from just a writing standpoint. For coding, I am like, it depends on what I'm doing, I've found, and I think this is somewhat subconscious, but like if I'm doing something creative, maybe I'm building like the dashboard I was showing, I'm much more positive and I said, you know, "Oh, this is so great," like, "Let's try three other ideas," you know? And then if I'm doing like a, a system administrative thing, you know, figure out why this VM died, and it's not doing what I want, I'm like, "Come on," like, "We can't do it." [laughs] I, I've also found like sometimes on the code, if it puts something really silly, I will just copy that block, paste it back in with the word really at the bottom, and it'll figure it out. [laughs]

    11. CV

      I have found myself lately, uh, honestly like staring in Codex and being like, "No, what you are doing is crazy."

    12. BG

      [laughs]

    13. CV

      Like, "Please, please stop. This is not good." I love this. Well, Brian, you, um, and the whole team have been so generous sharing this publicly, you know, showing us behind the scenes how you got this work done. Where can we find you and how can we be helpful?

    14. BG

      I'm not on much online. I'm on LinkedIn and, um, I have a website that I occasionally post to. I do a lot of work, uh, you know, in open source projects, and so I'm active on GitHub. I think people should use Firefox. I think a lot of people probably switched to Chrome when it came out, and honestly for good reason, like it was a better product at the time, and it took some time for us to catch up. But we're doing really great on fundamentals, you know, things that people care about, performance. We're talking about security today. We're doing a lot of new feature work, and in particular for this audience, um, I think we have sort of an independent, uh, mindset around AI. And so you're choosing whatever provider you want, you're not sort of combining your browser vendor with your, you know, AI model provider. You know, open source project, we have a great team behind it, so I'd just say give it a try and I think you'll be happy with it.

    15. CV

      And, and I will hype you up, and I'll give people who are listening to this podcast a reason why, which is I guarant- guarantee you 9 out of 10 people listening on this podcast, their browser is their number one or number two memory suck on their laptop right now because it is being used by humans, it is being used by agents. It's worth doing a little competitive shopping and seeing what your experience looks like since we're all so dependent on the browser for, um, not just our, our web work, but also our AI work. Brian, this has been so great. Thank you for joining How I AI.

    16. BG

      Thanks, Claire. [upbeat music]

    17. CV

      Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.

Episode duration: 48:28


Transcript of episode Idjt53tTv2U
