EVERY SPOKEN WORD
25 min read · 5,139 words
- 0:00 – 1:50
Introduction
- Claire Vo
[upbeat music] Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today, I'm gonna walk through my favorite feature in my most recent favorite AI product, Goals in Codex. If you've been wondering how all these people on the timeline are getting their AI to run, quote unquote, overnight, or handle very complex long-running tasks, I'm gonna show you Goals is the answer. We're gonna walk through what it is, how I might use it, and a technical use case, along with some non-technical examples of how Goals can help you even if you're not coding. Let's get to it. This episode is brought to you by Mercury. As an AI founder, I'm constantly tracking run rate, watching revenue growth, paying vendors, and making sure I'm getting paid on time. Mercury makes all of it feel effortless. The app is genuinely beautiful. It actually looks and works like modern software, which sounds obvious, but apparently isn't when it comes to banking. What I use it for the most, bill pay for my vendors is just clean and easy, and wires and transfers, getting paid from clients, moving money, Mercury makes it so simple. Everything you need is right there. No phone calls, no hunting through menus, no wondering if something went through. I think about how much I've optimized every other tool in my stack. Mercury is the one where I don't have to think about it at all. It just works. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column NA, members FDIC.
- 1:50 – 2:45
What is /goal and when should you use it?
- Claire Vo
Before I go into how to use Goal, I wanna talk about what Goal is and when it's appropriate and when it's not the right tool for the job. So I'm looking at this blog post by the OpenAI developers team. It's called Using Goals in Codex, and the first thing that they have in this blog post is this awesome diagram that talks about the difference between a prompt and a goal-based loop. In a prompt, you all are used to this, it's sort of the turn-based request that we're all used to. You ask the LLM, the model, the harness, to do something, it works, it returns to you its result, and then it waits for you to prompt it again. If you're like me, the number one thing that you're saying in your coding tool is, "Okay, what's next?" And then it tells you, and you say, "Great, do it." If you find yourself in that process, using /goal in Codex might be a tool that you wanna add to your
- 2:45 – 4:06
The difference between prompts and Goal-based loops
- Claire Vo
toolkit. So what's the difference between this turn-based one response wait and Goal? Well, with Goal, when you give Codex a goal, it actually has something that it can work towards, and it will continue to loop to the next step and verify until it can measure that it has met that goal. And so if you look at this, the goal is the overarching kind of description of the outcome that the model wants to get towards, and that will work, it will check its work, it will decide the next step, and it will continue that three-step process until it can gather evidence that it has met the goal. Once it gathers evidence that it has met the goal, it will mark the goal as complete, and then it will tell you it is done. Now, if you've been watching kind of people online talk about how they get these long-running autonomous tasks out of Codex, Claude Code, et cetera, you're really talking about people who are using some framework of this goal. There's also a version of this called a Ralph loop that people were talking about, but functionally, the, the framework is the same. It's saying, "Keep going until X behavior or Y outcome is validated. Otherwise, I want you to re-prompt yourself and re-prompt yourself until you're
- 4:06 – 5:05
Claire’s first five-hour 45-minute autonomous coding task
- Claire Vo
there." And what's really fascinating about Goals, I've been using, you know, AI coding agents for many years now, and until Codex and Goal, I was not able to get these multi-hour long-running autonomous tasks. Now, I don't have the most complex coding tasks in the world. I'm not building an operating system. I'm not doing complex mathematics. And so part of that was my problems were pretty well constrained, but I did have things that I thought a long-running harness could really help me with. But until /goal was part of the Codex tool, I really just wasn't able to get my AI to self-manage enough to do that autonomously over time. But the first time I used Goal, I was actually able to get a coding task running for about five hours and 45 minutes, which is longer than I've ever had anything run before. Now, quick
- 5:05 – 6:06
How to manage a Goal lifecycle: view, pause, resume, and clear
- Claire Vo
introduction on how to use Goal. There are four sort of ways to manage the life cycle of Goal. The one that I use is /goal, and then I walk away. So if you write, write /goal and then prompt it with your goal, it will start working. You can use /goal to see what the current goal is again. Uh, you can pause the goal, you can resume the goal, and then you can remove the goal. So, you know, you don't have to let your AI run for six, 12, 24 hours, whatever. If it gets off the wrong track, you can absolutely manage the life cycle. But it's a really useful tool, and I love that they give the example here because this is 100% what I spend most of my time going. They say, "You really wanna use Goals when you would otherwise find yourself saying the same thing after turn," like, "Keep going," "Try the next thing," "Run it again," "Now run the test," "Continue until it's actually done." So if you're micromanaging your AI and having to tap it on the shoulder and say, "Can you pretty please go to the next step?" Goal
- 6:06 – 7:34
How to write strong goals: outcomes vs. outputs
- Claire Vo
is for you. Now, how do you prompt and design a real Goal? This is where product managers tune in, um, engineers that write success criteria tune in. These are where those skills on setting really measurable, well-defined Goals come into play. Because when you prompt something, you're really just saying, "Do this task," right? Like, rewrite this code, redesign this page, et cetera. When you're talking about a Goal, you want to talk about what the outcome is if that task was successful. And the technical example that they give here in this blog post is reducing P95 checkout lay- latency. So if you know that a specific page is loading kind of slow and you wanna reduce that below a threshold, and you know that can be measured because you can just load the checkout page over and over and over again, and then you create a guardrail on it, like keeping the correctness suite green, that is a really great Goal. It's measurable, it's testable, it has a guardrail on it, and there's a executable surface area that you know an LLM can be successful for. Writing Goals is its own skill set, but OpenAI has given a really great outline to what makes a strong Goal. And again, product managers, let's pay attention. If you've written an OKR, uh, developers, if you've argued that an OKR was not well-written, this is where those skills come
- 7:34 – 8:57
The six components of effective Goals
- Claire Vo
into play. The strongest Goals, I mean, for anything, but in particular for Codex, kind of have six things as part of it. It has an outcome, what should be true when the work is done. So once we're done, what is the outcome we're trying to deliver? Verification, how can you test it? Do you have a test suite? Do you need to pull up the browser? Is there a number that you're trying to go to or a measure? Constraints, what can't regress while Codex works. For example, on our P95 checkout latency, you could delete the page, the latency goes away, but that's not what you want. So you want constraints, you want the features to stay the same, you want particular technologies to stay the same. The boundaries, so what tools and files and things it's allowed to use in pursuit of this Goal. The iteration policy, how it should decide what to try next, kind of what would you try next. And then when it should stop and say, "Sorry, I just can't continue. I don't have a good next idea." And they give this great pattern here, which is /goal, you know, my end state verified by specific evidence. I need you to preserve these constraints. Please use these tools. Between iterations, decide the next step by doing X, Y, and Z, and if you're blocked or no valid paths remain, this is what you should do next. You should tell me, you should report, you should ask me for
- 8:57 – 9:36
Example: Reducing P95 checkout latency with /goal
- Claire Vo
help. And so they give an example of how to make this P95 checkout latency Goal a lot better. And it's basically by saying, "Bring it below a threshold," which was already in the original prompt, "But you're gonna verify it by the checkout benchmark. You're gonna keep the correctness suite green. You're gonna use only the checkout system. Between iterations, you're gonna tell me what changed, what the benchmark showed, and the next experiment to try. And if you can't come up with something else, stop and give me the evidence, the blocker, and what you need from me." This is a really great Goal, and this is a technical Goal, but you can also do this with non-technical projects, and I'm gonna show you a little
- 9:36 – 13:18
Demo: Using /goal to eliminate Sentry errors in ChatPRD
- Claire Vo
bit of how that works. So again, a Goal is a new way to prompt a LLM, in this instance Codex, to work autonomously in a loop of work, verify, check, until it hits a Goal. Goals written are a lot different than prompts. Prompts are an instruction of what to do. Goals is a description of what a good outcome is and how to get to that outcome. And then I've seen Codex be able to run these Goals for a very long time. So I'm gonna give a couple examples of how to use Goals and what I think they're most useful for, and some successes I've had with Goals. And I'm gonna kind of show you behind the scenes. I have ChatPRD, and in ChatPRD, we have a tool call in our main AI writing loop, and it edits specific parts of a PRD. And it's this diff-based editor. It's very complicated, and it looks for operation ranges inside a document, and then tries to edit those operation ranges. And we were getting tons of errors, you can see here, tons of errors on applying specific edits because it couldn't find the right operation range. I'm just gonna, again, you know, tune out if this is boring to you. But because the documents we created were complex, they had tables in them, they had bullet points in them, they had bold, they had quotes, they have images, actually precisely getting a range of nodes from the AI was really, really hard, and we were just seeing a bunch of these errors over and over again. And we would, like, find one example of why an error showed up in a very specific document, fix that, but then another one popped up. So it's like that cartoon where you, like, you plug your finger over here and another spout goes off, and it was driving us crazy. You can see here. And then you can see basically the end of April, the beginning of May, they went away. Why did they go away? Well, we used Goal to knock this out. 
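The work, verify, check loop recapped at the top of this section can be sketched in a few lines of Python. This is a minimal illustration of the idea, not how Codex implements /goal: the function name `run_goal_loop`, its three callbacks, and the `max_turns` cap are all assumptions made for the example, and the "failing document features" are toy data loosely modeled on the tables, quotes, and images mentioned above.

```python
from typing import Callable, List, Optional, Tuple


def run_goal_loop(
    do_step: Callable[[str], None],
    goal_met: Callable[[], bool],
    choose_next_step: Callable[[], Optional[str]],
    max_turns: int = 50,  # hypothetical safety cap, not a Codex setting
) -> Tuple[str, List[str]]:
    """Repeat verify -> decide -> work until evidence shows the goal is met."""
    history: List[str] = []
    for _ in range(max_turns):
        if goal_met():                 # verify: evidence-based finish line
            return "complete", history
        step = choose_next_step()      # decide what to try next
        if step is None:               # no valid path remains: stop and report
            return "blocked", history
        do_step(step)                  # work
        history.append(step)
    return ("complete" if goal_met() else "out_of_turns"), history


# Toy stand-in for a backlog of failing cases (hypothetical data).
failing = {"tables", "quotes", "images"}

status, history = run_goal_loop(
    do_step=lambda name: failing.discard(name),
    goal_met=lambda: not failing,
    choose_next_step=lambda: sorted(failing)[0] if failing else None,
)
print(status, history)
```

The point of the shape is that the human supplies the finish line (`goal_met`) and the stop rule once, instead of typing "keep going" after every turn.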
So the Goal that I used to solve this particular problem is I gave Codex access to Sentry, I gave Codex access to these edit requests, and I said, "/goal Codex, go through every example in Sentry, every trace in Sentry of an invalid operation on the edit tool. Categorize that issue and fix it. Then replay all of the Sentry events that would've shared that same issue until you have fixed every issue and every historical example of an edit invalid operation."... is solved, and it went to town. So what it would do is it would pluck one example, it would see what the root cause was, it would implement a fix for that root cause. It would then run through all the other examples to see how many of those it burned down. It would have some remaining. It would pluck the next one. It would do the fix. It would run through all the remaining examples, burn it down, burn it, burn it, burn it down, burn it down, burn it down, and then look what we have. We have literally [laughs] zero errors left. Now, this took several hours, [swallows] and what was really nice is at the end of it, I didn't get, like, these Band-Aid fixes all over our edit code. What I got was a systematic fix that integrated every example into a more intelligent framework for how edits should be applied, and ultimately, we've had z-zero edit errors from the time that we used Goal here. And so I think this is a really great example, but
- 13:18 – 17:28
Demo: Burning down Vercel API errors
- Claire Vo
let's do it live, 'cause this is how I AI. I'm gonna give another example of how I might use this, again, for some of the more technical folks. So, um, these are the Vercel errors. Um, it looks scarier than it is. We d- have a lot of retries around this, but here are the errors that happen behind the scenes that we have to recover from in our main chat, [swallows] and from the last, last two weeks. And I wanna do the same thing with these errors. I wanna say, "Codex, find these errors, classify them, ship a fix, validate against the existing data until basically there are none of these errors left." So I'm gonna pull up Codex. Um, I'm gonna use GPT... This is not, like, a complicated, deep-thinking problem, so I'm gonna use GPT 5.5 Medium, and I'm gonna say, "Goal: Eliminate errors on the API chat V2 endpoint that are showing up in the Vercel logs by going through each category of error, identifying root cause, determining if this is a user-facing error. If it is, determine root cause and open a branch plus PR for fix. If it is not, reduce this error to a warning. Once all logs can be handled from the last two weeks, report to me all PRs to review and issues that could not be fixed, or what you need from me." This is terrible prompt. This is fine. This is obviously a better Goal prompt than I usually write. And say, "Success state is we have no user-facing errors and no back-end errors that should be warnings." Okay. I'm pressing Enter. Um, it's compressed my skill descriptions, but that's fine. Now, Codex has hooked up with my Vercel plugin, so it has access and can actually go access these logs. So it's making this plan, and I just wanna pause and tell you kind of how Goal works with a plan. So once it has a goal, it makes... I've seen these, like, three to five-step plans. So it's gonna inventory the current repo. It's gonna pull the last two weeks of Vercel errors and group by category. 
It's gonna classify them as user-facing errors, and it's gonna implement and validate fixes or downgrade warnings by category, and then it's gonna publish the PRs and report to me. Again, this is very precisely, it's measurable. It actually has a list of errors it's going to burn down. It's observable. It definitely can eliminate those errors, so it can ship a fix. It can eliminate it or it can run the same code, and it can show that the error wouldn't be hit. And then it has a success criteria and an ending state to me, which is I want a list of PRs and any blockers or things that I need to review. And so it's gonna go ahead and go through and try to find the right logs. It's gonna continue to work on this. Now, we are in a mini episode today. It's one minute into this Goal. I suspect that this is gonna take two to three hours to get through. I've run something very similar on this. It's taken about two or three hours to get through, so I will have to put in the show notes or a follow-up whether or not this was super successful, but it's just an example to you. I love this idea of just, like, Sentry zero, error zero, where you can point Goal at any kind of, like, lingering errors that have really haunted your team and developers out there. You know that these exist, and you can actually say, "Just go get rid of these." And with Goal, it really is possible, and I've seen very high quality success on using Goal
- 17:28 – 21:24
Non-technical use case: Cleaning 3,900 emails with /goal
- Claire Vo
to burn down errors. So that is a technical example of how to use Goal, but I wanna make this more applicable to people who aren't developers, 'cause I honestly think Goal for non-coding use cases is even more exciting. Today's episode is brought to you by Mercury, the banking solution I use for ChatPRD. I build AI tools. I talk about AI every day. So when people ask what I use to run my business, Mercury is a genuinely easy answer, because an AI founder who still deals with clunky, outdated banking is kind of a walking contradiction. Mercury is how I track run rate and revenue growth, pay my vendors through bill pay, and get paid by clients. Wires and transfers that used to feel like a whole thing, sending money, accepting payments, knowing it arrived, Mercury just makes it simple. The whole platform is clean, fast, and modern in a way that most banking honestly isn't. I've banked with them for years. It's one of those tools where I don't think about switching because it's never given me a reason to. Visit mercury.com to apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column NA, members FDIC. For this next example, I want to give you my favorite use case of /goal. It has blown my mind, and if you leave this episode with nothing else, I hope you go do this, which is use the goal to clean up all your unread emails. So Codex has access to my Gmail plugin. That means it has MCP access. It means it can go through and read my email. I had yesterday truly 3,900 emails, something like this. I'm gonna see if I can find the resume the save chat. So I'm gonna type in goal and see what my goal was that I did yesterday. It is the much worse written prompt, "Categorize all bulk promotion spam emails, unsubscribe from unnecessary emails, and clean up your inbox. Ask for help while needing judgment." 
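To make the shape of that goal concrete, here is a small Python sketch of a triage pass that labels bulk mail, collects unsubscribe links, and routes everything else to a "needs judgment" pile. It is only an illustration under stated assumptions: Codex makes these calls with a model reading each message through the Gmail plugin, not with header rules like these, and the label names, function names, and sample messages are all hypothetical. The one real-world detail it leans on is the standard `List-Unsubscribe` mail header.

```python
import re
from collections import defaultdict
from email import message_from_string
from email.message import Message
from typing import Dict, List


def triage(msg: Message) -> str:
    """Return a hypothetical label for one message."""
    if msg.get("List-Unsubscribe"):      # header commonly set on mailing-list mail
        return "bulk"
    if msg.get("Precedence", "").strip().lower() in {"bulk", "list", "junk"}:
        return "bulk"
    return "needs-judgment"              # anything unclear goes back to the human


def unsubscribe_links(msg: Message) -> List[str]:
    """Pull the <...> targets out of the List-Unsubscribe header, if any."""
    return re.findall(r"<([^>]+)>", msg.get("List-Unsubscribe", ""))


def sort_inbox(raw_messages: List[str]) -> Dict[str, List[str]]:
    """Group message subjects by label."""
    labeled: Dict[str, List[str]] = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        labeled[triage(msg)].append(msg.get("Subject", "(no subject)"))
    return dict(labeled)


# Two made-up messages standing in for an inbox.
INBOX = [
    "From: deals@example.com\nSubject: 50% off\n"
    "List-Unsubscribe: <https://example.com/unsub>\n\nBuy now",
    "From: friend@example.com\nSubject: Lunch next week?\n\nAre you free?",
]

result = sort_inbox(INBOX)
links = unsubscribe_links(message_from_string(INBOX[0]))
print(result, links)
```

Defaulting to "needs-judgment" rather than guessing mirrors the last clause of the prompt: the automation handles the obvious cases and hands the ambiguous ones back.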
It ran for three hours and 52 minutes, and it ha-- it used about 6 million tokens, so it was not token cheap. I'm gonna just show you what it did, which is it just read, like literally read every email, categorized them, put nice labels on them so then I could go decide, including labels like needs judgment, clicked unsubscribe links for me, gave me a list of unsubscribe links that I could use, and at the end of the day, I went from about... Let's actually ask, how many emails did I start with uncategorized, and how many are now left to filter? So it's gonna go ahead and check its own work, and you're gonna hold me accountable to show that I did not make this up, and it's gonna show how many emails I started with and how many do I have left. I'm pretty sure it was about 4,000, and I think we got down to about sub 1,000 that needed to get done. Okay, it took a little prompting to remember what it did, but again, we started about 3,900 emails. Now I'm down to 68 that I need to look at, so that's my today project. So it categorized almost 4,000 emails for me, and it put it in lovely folders. Again, it unsubscribed for me. It gave me nice categories of emails that I needed to respond to. If you've been waiting on me for a couple weeks, you now got a response. And now I have a much cleaner email that I can run over time. So again, /goal, uh, my prompt was very simple, just categorize all my emails, unsubscribe, and clean up my inbox. It ran for four hours, and now I have a much cleaner inbox to work with. Okay,
- 21:24 – 24:41
Demo: Using /goal to clean up Linear project tasks
- Claire Vo
I'm gonna give one other example of a non-technical use case that I think is gonna be really useful for the product managers out there, which is I have let my Linear, my task management software, go completely off the rails. This is partly an OpenClaw problem, which is I gave my agents, my OpenClaws, YOLO access to Linear, and they created a bunch of tasks, not all which that they have done. And so I wanna clean up my Linear tasks and get them to only the ones that I need to complete. And I want this in particular for our podcast Linear, because we had aspirations of all the things we would do with every episode. We usually do about 70% of those, and I just wanna clean it up. So I'm gonna say, "/goal, clean up the How I AI podcast team issues in Linear. Anything from a previously released episode that is not marked as done should be marked as will, will not do. Our goal is to have open only future tasks this week and forward for episodes not old tasks we'll never get around to." So I'm gonna let that do that. It should have access to the Linear plugin. It's gonna go through, and again, I'm telling you, this is like hundreds and hundreds and hundreds of tasks. It's gonna go through and make this judgment call of, can I close this? Can I update the data? If you wanna have better task hygiene, where you wanna make sure everything is tar- tagged correctly, assigned correctly, this is a really good use case. And so it's found the Linear team. It's gonna work at the team level. It's gonna identify stale episode tasks. It's gonna go through, clean them up. Um, the task status we want is not won't do. It's called canceled. And it's just gonna process through and go ahead and do that. So I suspect that this one will go a little bit faster, but will probably take 30 minutes to an hour to go through really high-quality judgment, and at the end of it, I'm gonna have a much cleaner Linear workspace to work with. And again, it's saying, "A clear rule is emerging. 
Keep current week future episode work. Cancel non-done episode release work before Monday." It's gonna scope the bulk update. It's gonna validate that the outcome I wanted, which is a clean Linear, is done, and it will complete this over time. So those are my three examples of how to use Goal. One is a technical one. Again, it's continuing to run, so it's gotten through the first two steps here. Um, the technical goal of looking at all my error logs and basically classifying them, fixing them, burning them down with the goal of having no more errors ever. There is the second very practical goal of clean up my email inbox, and so I can actually read my email. That one took about four hours, I think, useful for everyone, and I did not have to have a very good prompt there. And then my third one for project management, make sure that my projects and my tasks and issues are clean, my backlog is clean, everything is labeled the way I want, and I only have to focus on the things that matter to me. These are three ways I think you can use Goals
- 24:41 – 26:10
When not to use /goal
- Claire Vo
in Codex. Before we end, I wanna take a step back and talk about when you shouldn't use Goals, and then what I think is next. So, Goals are not the right tool for every job, and I'm pulling up this blog post again because I think they say it better than me. Do not use Goal for something that is a very simple one-line edit. It is just too big of a tool for the job. Your goal wouldn't be like, "Make sure this line of code is removed." You really want an outcome, not an output almost, um, for it to be a good goal. Also, don't use a goal when the finish line is vague. So you can't do... I mean, maybe you can. If you're like, /goal make my customers happy, I think that is just a very vague goal. It's very hard to measure, and there's no reliable, definitive completion condition, and so that's not very good. The other example they give is, like, refactor this code. Not a good example of when to use /goal, and in fact, I'm doing a refactor this code initiative with Codex, but I'm not using a goal. They say, and I just wanna reiterate this for you, goals are strongest when it has three properties: a durable objective, an evidence-based finish line, and a path that may require several turns of investigation. So if you have an objective that c- stays steady over time, you know you wanna hit that objective, it can be evidence-based and you can measure it, and you think getting there is gonna require a couple turns, Goals are for you. So before
- 26:10 – 30:18
Why /goal changes everything
- Claire Vo
we wrap, a couple thoughts on /goal and why I'm just really excited about this framework of working with AI. One, as I said at the beginning, this has been the first time that I've been able to get these autonomous long-running tasks done. And so I really can set the LLM, the AI, up with a goal, step away, and have it work over many hours on a problem that would be very annoying to babysit. So one, I think my babysitting days are largely over with AI. Not completely over. I'm still babysitting a branch right now, but largely over with AI. I think the second thing is the impact that Goal has had on quality of life things in my code that have been very hard and annoying to chase down. Yes, I probably could've gone task by task and said, "Please fix issue A, then fix issue B, then fix issue C," and I could've set different coding tools off on those problems, but this idea of just saying, like, "Error zero, go through all our error logs and fix them until they exist no more," is incredibly powerful for, in particular, quality. So for engineering teams looking to burn down tech debt, fix flaky tests, look at really annoying, like, client-side errors that are maybe annoying to reproduce, I feel like /goal is really powerful. The third thing is I think that product managers are really gonna love Goal. Again, we've had it drilled into us, outcomes, not outputs. You shouldn't be defining the work, you should be defining what success looks like. I think as more and more teams start to use /goal as part of their coding workflow, product managers are gonna have to get a lot better at prompting these AIs with good goals. And we have some of those skills already, but I think the technical level of validation that's required by /goal requires you to up-level these hard skills in re- writing what a good goal actually looks like. 
And then finally, I'd say with /goal and these long-running tasks, and I felt this a little bit with OpenClaw, and I just see this becoming more and more true, working with AI just continues to feel more and more like working with a colleague, a human colleague, in that you assign a human colleague a task. You don't, like, sit there over their shoulder and tap and say, "Okay, next step. Okay, next step." What you really do is you give them a goal, they go away for the time required to hit that goal, and then they come back to you with the completed task and you give feedback. And so again, it's this form factor. Even though the AI is maybe faster than a human would be on some tasks, they may be slower than humans because they have the patience to go to the edge cases of things. But either way, they're using the time necessary for the task to get it done. And it really feels like I'm much more in manager mode than builder mode, and honestly, I'm not sure that I love that. When /goal came out, I found myself kind of, like, twiddling my thumbs and looking for the job that I could do in the coding work because so much of the job had now been handled itself. So in conclusion, I really suggest you try /goal. If not in Codex, try a similar loop in whatever your favorite AI tool is. Let it run, and let it solve bigger, more complex problems for you, and come back to you when it's time to review the work. This is How I AI. I'm so excited to see what you build, and I'm gonna get back to my logs and see if we've actually eliminated all these errors. Thanks for joining. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. 
You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
Episode duration: 30:20
Transcript of episode 2wLJl9A2CnA
