
Why I love GPT-5.5 for hard problems
Claire Vo (host)
In this episode of How I AI, host Claire Vo explains why she loves GPT‑5.5 for hard problems: autonomous coding, large migrations, and device hacking.
GPT-5.5 Pro excels at autonomous coding, migrations, and device hacking
GPT‑5.5 and GPT‑5.5 Pro feel meaningfully more capable and token-efficient on complex work, but their pricing makes them an “intelligence tax” that needs clear ROI.
In ChatGPT, the model can overthink relatively simple tasks (e.g., a kids’ subtraction app), highlighting a mismatch between extreme intelligence and typical consumer workflows.
In Codex, GPT‑5.5 Pro shines by autonomously executing large, multi-step engineering work like security issue remediation, technical debt cleanup, and complex data migrations.
A standout example is a long-running, near hands-off, six-hour autonomous testing and validation loop that reduced production errors dramatically and uncovered only one edge case across ~2M rows.
As a personal “high-tech eval,” GPT‑5.5 helped reverse-engineer a proprietary Bluetooth protocol to programmatically control a Divoom mini display, enabling terminal-driven notifications and custom output.
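The reverse-engineering workflow sketched above — capture Bluetooth traffic, then look for structure in the payloads — can be illustrated with a small script. This is a minimal sketch, not the actual Divoom protocol: the frames and the fixed-header assumption here are hypothetical.

```python
def common_prefix_len(payloads: list[bytes]) -> int:
    """Length of the byte prefix shared by every captured payload."""
    if not payloads:
        return 0
    n = 0
    for column in zip(*payloads):  # walk byte positions across all frames
        if len(set(column)) != 1:  # first position where frames disagree
            break
        n += 1
    return n

def guess_header(payloads: list[bytes]) -> bytes:
    """Treat the shared prefix as a candidate fixed message header."""
    return payloads[0][:common_prefix_len(payloads)]

# Hypothetical sniffed frames: same 3-byte header, varying body.
frames = [bytes([0x01, 0x00, 0x44]) + body
          for body in (b"\x10\x20", b"\x30\x40", b"\x50\x60")]
print(guess_header(frames).hex())  # -> 010044
```

In practice this kind of prefix clustering is a first pass over a packet capture; the varying tail bytes are then correlated with what was sent (image size, pixel data) to guess the encoding.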
Key Takeaways
GPT‑5.5 Pro’s best ROI is ambition, not just speed.
Claire argues the model lets her attempt projects she previously avoided because they were too complex or too time-consuming to reliably decompose and execute—especially with messy edge cases.
ChatGPT may be a poor form factor for “too-smart” models without hard problems.
Her subtraction-app test took ~17 minutes of “thinking,” producing a serviceable result but raising the question of whether most users benefit from that level of reasoning and latency.
Codex + GPT‑5.5 Pro performs well on backlog-style batch work.
Uploading a CSV of security findings and asking it to cluster themes, propose fixes, and implement changes worked well after human/code review—and helped lead to a clean pen test outcome.
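The CSV-of-findings workflow described here can be sketched as a pre-processing step. The episode doesn't show the actual export, so the column names below are assumptions, and the grouping stands in for the thematic clustering Claire asked Codex to do.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical export — the real security-scan CSV schema isn't shown
# in the episode, so these columns are illustrative only.
raw = """severity,category,file,description
low,input-validation,api/upload.py,Unvalidated file extension
low,input-validation,api/forms.py,Missing length check
low,headers,web/app.py,Missing X-Content-Type-Options header
"""

def cluster_findings(csv_text: str) -> dict[str, list[str]]:
    """Group findings by category so each theme can be fixed in one pass."""
    themes: dict[str, list[str]] = defaultdict(list)
    for row in csv.DictReader(StringIO(csv_text)):
        themes[row["category"]].append(f'{row["file"]}: {row["description"]}')
    return dict(themes)

for theme, items in cluster_findings(raw).items():
    print(f"{theme}: {len(items)} finding(s)")
```

Grouping first matters because it turns a flat list of tickets into a handful of coherent work items — which is roughly the "review, group, propose, implement" prompt structure described in the episode.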
Autonomous, long-running loops are where the model differentiates.
A ~6-hour run built a scalable CLI-based smoke test harness across providers, requiring almost no intervention, and found only one remaining edge case after validating large production-like data.
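The episode doesn't show the harness itself, but the shape of such a smoke-test loop — replay every stored thread through a set of checks and collect failures — can be sketched. All names and the check logic here are hypothetical; real adapters would call the provider SDKs rather than validate shape locally.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Thread:
    thread_id: str
    messages: list[dict]  # normalized {"role": ..., "content": ...} records

# Stand-in checks: a migration smoke test mostly needs to prove that every
# backfilled record conforms to the go-forward data model.
def check_roles(thread: Thread) -> list[str]:
    return [f"{thread.thread_id}: unexpected role {m['role']!r}"
            for m in thread.messages
            if m["role"] not in ("user", "assistant")]

def check_content(thread: Thread) -> list[str]:
    return [f"{thread.thread_id}: non-string content"
            for m in thread.messages
            if not isinstance(m["content"], str)]

def smoke_test(threads: list[Thread],
               checks: list[Callable[[Thread], list[str]]]) -> list[str]:
    """Run every check against every thread; return all failures."""
    return [err for t in threads for c in checks for err in c(t)]

threads = [
    Thread("t1", [{"role": "user", "content": "hi"}]),
    Thread("t2", [{"role": "tool", "content": "x"}]),  # legacy-format row
]
for failure in smoke_test(threads, [check_roles, check_content]):
    print(failure)
```

Exposing this loop behind a CLI, as described in the episode, is what lets any agent (or a six-hour autonomous run) re-validate the full dataset after each fix.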
Complex data migrations with unstructured AI-response history are now tractable.
She describes legacy response-format drift across providers and attachments/tools creating hard-to-sanitize records; GPT‑5. ...
AI coding can increase quality when used for systematic validation.
Rather than “vibe coding” lowering standards, Claire highlights error rates dropping in monitoring after the migration/testing work, suggesting strong models can improve reliability when paired with proper tests.
Use “impossible” reverse-engineering tasks as practical intelligence evals.
Her benchmark—decoding proprietary Bluetooth messaging to drive a mini display—was something GPT‑5. ...
Notable Quotes
“I’m gonna pay the intelligence tax.”
— Claire Vo
“I don't know what to do with all this intelligence if you don't have complex problems to solve.”
— Claire Vo
“This thing will think.”
— Claire Vo
“Truly, it just banged its head against the wall for six hours, and I did not have to… zero prompts, zero follow-ups, zero steering.”
— Claire Vo
“GPT 5.5 has hit my intelligence benchmark for can you hack into this Chinese digital screen with proprietary Bluetooth transport mechanisms and bitmap compression.”
— Claire Vo
Questions Answered in This Episode
For your subtraction-app experiment, what specific prompt or constraints would reduce the 17-minute “thinking” time without sacrificing correctness?
How did you structure the CSV security-issues prompt so Codex could cluster themes and implement fixes safely—did you provide coding standards, threat model assumptions, or test requirements?
In the 2M-row migration, what was the one edge case that slipped through, and what changed in your data model/test harness to catch it going forward?
What guardrails did you use to trust a six-hour autonomous run (sandbox permissions, command approvals, rate limits, rollback strategy, CI gates)?
Can you outline the architecture of the CLI smoke-test system (inputs, provider adapters, replay strategy, diffing/validation rules, reporting outputs)?
Transcript Preview
Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive here on a mission to help you build better with these new tools. Today, I have a very special episode for you where I'm gonna tell you everything I think about the new GPT 5.5 model, which I've been able to test for the past couple weeks. Spoiler alert, it is a powerhouse, and I've been able to do things with this model, especially around advanced coding, that I haven't been able to do before with any other model on the market. And I'm gonna show you how it breaks my personal high-tech eval hacking into this little computer. Let's get to it. [upbeat music] So before I tell you what I built with GPT 5.5, let me tell you a little bit about the model itself. So today, OpenAI is releasing GPT 5.5 and GPT 5.5 Pro into Codex and ChatGPT, not available in the API quite yet. And this model I've been testing for the past couple weeks, and I will tell you what OpenAI is saying is true. They're saying that it has a higher capacity for complex work and is more efficient, including being more token efficient, getting that work done. And so the whole idea with this model is it's smarter and it's more efficient, so you're gonna get more done. And that has really been my experience. Now, I'm glad it's more efficient because it is expensive. GPT 5.5 is five dollars per million input tokens and thirty dollars for output tokens. And GPT 5.5 Pro, which has powered all this work that I've been doing, is thirty for a million input tokens and a hundred and eighty dollars for output tokens. So this is a pricey one, but when I reflect on what I was able to achieve with this model in early testing, I'm gonna, I'm gonna pay, I'm gonna pay the intelligence tax because I think what I was able to achieve is really important. And this is one of the things that I think about a lot when I'm testing these new models or testing these new tools. You know, everything has an ROI, and there can be an ROI in terms of speed. 
So can I get the things done that I wanna get done faster? And that's certainly been an accelerant from an AI tooling perspective and something we've all experienced for the past couple years. But where GPT 5.5 really helps me is ambition. It has been able to do things that literally I have not been able to do before for a couple reasons. One, just intelligence higher has solved problems that other models and other harnesses other than Codex have really had a hard time with. The second thing I've experienced is because the efficiency is higher, I'm able to do more faster without losing context of what I'm working on because it's happening really quickly, or it's being more autonomous, so I don't have to babysit as much. So again, I'm getting more done. So I do believe that what OpenAI is telling us is true, but that's coming out of my own experience spending hours and hours and hours with this model, throwing problems at it that other models have really had a hard time with, including GPT 5.5. So let's talk about what I built. And folks, for the less technical here, one of the things I'm gonna say about the model, and I tested it a little bit in ChatGPT but not a lot, is that I don't know what to do with all this intelligence if you don't have complex problems to solve. So while I've tested it in ChatGPT in my personal account, which is what I got access to, I don't have complex high intelligence problems to solve in my personal account. And so it was really hard for me to think of where I would use 5.5 or 5.5 Pro in ChatGPT simply because the problems I'm solving there aren't that hard. But I did try to solve problems there, so let's just talk about quickly how I used 5.5 in ChatGPT and what it gave me. And it will just give you an indication of what I'm gonna show you a little bit later. 
But again, I think what the consumer or even the everyday enterprise business user is going to struggle with using ChatGPT with this model is how many problems do you have that require super intelligence? So again, I think this is gonna be a model that developers and software engineers really love, and I'm really excited to see what OpenAI does in terms of unleashing and boxing this intelligence in use cases that then the, quote unquote, "everyday person" can use. So that's a little bit of, of my lecture on how much we have an intelligence overhang, basically. So what did I ask, uh, ChatGPT, GPT 5.5 to do in ChatGPT? Really simple thing. I'm teaching my second grader two-digit and three-digit subtraction. He's actually in first grade, but, you know, San Francisco, I'm trying to push him ahead. And so one of the ways that I've been able to teach him is build these little apps that help him understand subtraction with two digits and three digits and learn some kind of, uh, tactics to do that well. And so I asked it to build an app for me to teach my second grader more advanced subtraction concepts. I haven't been super pleased with some of the Vibe coding tools or Claude Code on this. Nothing's really, uh, built this exactly how I wanted, so I wanted to give 5.5 a shot at it. And first out the gate, it's a thinker. So you can see here it thought for seventeen minutes, twenty-seven seconds about this. You are gonna have this experience with this model. This is gonna be a theme of this mini episode. This thing will think. And it planned a app for advanced subtraction, built the code, all this kind of stuff. Now, here's my question: Do we need seventeen minutes of hyper intelligence thinking to build this app? Probably not. If I wasn't testing for the purpose of this podcast, would I have waited eighteen minutes for this app? Probably not. So again, what are we gonna do with all this intelligence? 
Is this the right form factor for, you know, a non-technical software engineer to access it? Not a hundred percent sure. And it built me a app here. You can see it includes mini lessons, word problems, read aloud. It's fine. It's fine. It's fine. It has different modules in it. The design leaves something to be desired, but again, I'm not really going to the GPT models for front end. I really want them to solve my hardest technical problems. And so I would just say in ChatGPT, I'm unsure yet only because I'm not sure what the average ChatGPT user is really trying to achieve and how much intelligence is required, even on the coding side. And so I just wanted to start there by saying, if you're in ChatGPT, you're using 5.5, let me know your hard intelligence problems so I can test them. I think the, like, basic vibe code me a little simple app, it's fine. It's not great. It's not any more in particular impressive than other things on the market, but it does a reasonable job, and then just the sniff of 5.5 is it's gonna think a lot, and it's gonna give you this chain of thought reasoning here to let you know how it's thinking and managing its own, own process. Okay, so I'm gonna put away ChatGPT. It's fine. Let's talk about using 5.5 Pro in Codex. And you all, I love... I love her. I do. My initial reaction when I first started testing GPT 5.5 in Codex is I am cooking. And what I mean by that is I was kicking off tons of tasks in parallel because the feedback loop for fast, the efficiency you felt right away, I was knocking off very long-standing tasks with tons of subtasks underneath them, and I'll give an example of what those are. And I was able to bite off a tech debt technical problem in the ChatPRD codebase that I have wanted to take care of for truly months. It has been plaguing me, and GPT 5.5 blasted through it.
So I wanna show you a couple of those examples so you can understand what kind of tasks GPT 5.5 plus Codex is really good at and why I think its intelligence is higher, and the way it's configured to work autonomously and efficiently is really beneficial for the software engineer. So the first thing that I did, which I'm not gonna show you for what will become very obvious reasons, is we used OpenAI's Codex security product to run a threat assessment and security scan on the ChatPRD codebase. And it was pretty good. We're, we're pretty secure, but it did come up with some low priority or low severity issues that we needed to remediate. And instead of taking those one by one, what I did is I downloaded the CSV of those issues, uploaded it to Codex, and just said, "Can you please architecturally review these issues, group them if they're thematic, and then propose a change, and then make those changes?" And I will say it just did it. It did it very well. We did human review on that. We did code review on that, and we were just really happy with the quality of execution, but also the fact that I could give it a list of generally associated but not single project tasks, and it can execute on those well. And the real validation of the quality of that output came when we had, uh, very quickly after that, our annual penetration test, and our pen test came back super clean. And so I would just say if you have a list, a triage list of technical debt, if you have a triage list of security issues, even maybe front end debt, flaky tests, engineers, pay attention. You can throw that list at GPT 5.5, and it will get that list done. So that's use case one that I thought was really efficient and great. Use case two, and I'm so disappointed it cleared how hard it worked on this project. 
But I have, as I mentioned, this lingering tech debt in the ChatPRD codebase, which is we have millions of chats now for ChatPRD, and we were storing those chats in various legacy formats as the model providers, both OpenAI and Anthropic, have changed the shape of their model responses over time. And so TLDR for the folks that are less technical, every model in the world has changed a little bit about how they return data via API over the past three years. We have a bunch of debt and data debt around that, where we were storing legacy formats in our database. And these legacy formats, because they are AI calls, because they may or may not contain attachments, because they may or may not attach- contain tools, very hard to build a clean, cohesive backfill and sanitization of that data into our go forward data model. And I have just been slapping, like, fix after fix after fix and patch after patch after patch on this problem because every time we patch it, we find another edge case. So this is an example of a data migration problem with millions of rows, which might not sound big to many people but is pretty significant to, to us in terms of the complexity of the data inside of it with functionally unstructured, lightly structured data with tons of edge cases. And I just finally was like, you know, "GPT 5.5, take me away." Gave the model that problem, and it executed so well. It built functionally one shot a solution that covered, I'm not kidding, 98% of the edge cases that we had identified. So first of all, one shot building a complex migration by pointing things to docs and libraries, very, very good. Something that had really been hard for us to do because it was so complex and so unstructured before. The second thing, which I wanna show you on the screen now, is I needed GPT 5.5 and Codex to validate that work. And so I pulled a production-like set of examples into a test environment.
And I asked Codex, "Look, I need you to figure out a way to programmatically test every thread that's in local" then I pulled a, a local version of this, um, production-like data. Post it to Anthropic and OpenAI and any other provider that we're, we're using. I need you to make a scalable system for our team to do this programmatically, ideally through a CLI, so that any agent can test any thread for these data issues. And then I s- I've been saying this a lot to, uh, GPT 5.5, "I trust you." This is my, my prompt to GPT 5.5, "I trust you to make a call, figure out s- how to spawn a subagent to do this, test it, and identify any issues, repair them, and get this ready for production. Thank you," because I'm very polite. This thing worked for six hours. It was actually five hours and, like, 57 minutes. Truly, it just banged its head against the wall for six hours, and I did not have to... I-- zero prompts, zero follow-ups, zero steering. I think I had to approve one, um, script call or something for it to have access to run in its sandbox, but otherwise, it just went for six hours. I have not seen... Personally, everybody says, "Oh, I'm getting my agent to run overnight." I have not seen it until GPT 5.5 in a very constrained use case. And so this thing will do long-running autonomous tasks that require sort of a loop to understand if it's doing well and moving things forward. It ran for almost six hours, and then it implemented the smoke test. It tested all the example data, and after this, we literally, after two million rows, had one edge case that was not caught. And so just, like, think about that for, for a minute. You know, we had two million rows, one edge case, where before we were hitting edge case after edge case after edge case. Six hours of GPT 5.5, and then, you know what we saw? We saw our error rate just hit the floor in our Sentry monitoring. And so people say that AI coding is going to decrease quality 'cause people are vibe coding. 
That is just such an 18 months or 12 months ago narrative. I think quality is going to go up. This kind of problem I've truly avoided because the intelligence was not there to do it autonomously. My ability to, and our engineering team's ability to, like, break down the problem and spend the dedicated time to hitting every edge case in our synthetic data really hard and, you know, every time you, like, plug one hole, another one pops open. And just being able to hand this to GPT 5.5 and Codex has changed my life. So again, I am scared about how much this will cost me in, you know, production when those tokens... But, like, cheaper than me, cheaper than my engineering team, and it really did run six hours. And so I'm just, like, throw this thing at your quality issues, throw this thing at your bug backlog, throw this thing at a security assessment, and close the quality gaps or performance gaps or security gaps in your app. It does really, really, really well. So that's my prime use case. If I didn't share anything else, um, this would be enough. It bit off my largest piece of tech debt in my app, basically made my errors go to zero, and did it all six hours autonomously in a self-sustaining subagent loop. I love you, GPT 5.5. But there is a real eval, and I told you this in the intro. My real eval is this thing. This is a Divoom MiniToo retro PC-style Bluetooth speaker and tiny screen. And I have been... I am not kidding. I have been hacking on this thing since January, since late January or February. I think I ordered it around Valentine's Day. And my only goal is to be able to display funny stuff on this screen. Now, it comes with an out-of-the-box iPhone app. And so I can use this proprietary iPhone app to send images to this thing, but I don't want that. I live in the terminal. I wanna be able to do this programmatically, and this is, like, proprietary code loaded on this device. 
I was, like, very deep in Chinese language repositories and documentation from, like, Bluetooth hardware providers. I was in deep, y'all, and I threw... First I threw Claude Code at this, and I said, "Can you figure this out?" Claude Code could not figure it out, even with Opus. I threw GPT 5.4 at it. It could not figure it out. I cannot tell you how crazy I went with this, but I'm gonna try. So this is a little device. You think you would be able to plug it in and just say, "Dear Claude Code, tell me how this device works. Make no mistakes." No, that's not how it works. It connects to your computer or to your phone via Bluetooth, so it is interacting with this app on your phone through Bluetooth. And in the app, I can, like, draw something and click Send, and it will display here. So I know that over Bluetooth I can change the display of this app. But we could not figure out how to encode that message. What did I do? Well, this is a little peek. This has nothing to do with AI. This has, has a peek to how cuckoo bananas your friend Claire is. So what I did is I spent truly hours downloading a, a Bluetooth profiling profile on my phone for developer debugging. I then hooked it up to... Sorry, I'm crazy. Hooked it up to a packet sniffer so that when I was using the app here on my phone and it sent an image to this computer, it would log and sniff the packets and tell me what Bluetooth was sending to this, this little guy. I threw these logs and kind of all the information that I had at 5.5, and let me show you what happened. So I'm gonna get that repo up really quickly and show you my desperate prompting. I said, "This thing is connected by Bluetooth. Take what you know, and please just do anything to figure out how to display on this. You have so much information. You should know how to do it. I believe in you." And guess what? This effing thing did it. It did it. So I'm...
[laughs] My success, um, my success measure here, which is I was able to build a command line tool where I can run it in terminal, press Enter. Let's see. Did the benchmark hit? Hello. It... That's... Hello. This is months, months, months of trying to hack into this stupid thing. It was encoding and decoding bitmap files. It was crawling the web trying to find if there was some secret SDK. Codex, you did the thing. And even better than that, it is now hooked up so that any time I ask Codex to do a thing, it will alert me on this. So let's give it a little try live on the podcast, and then I will get you out of here. But I am telling you, this, hack into a proprietary device, that is my intelligence test now. All right. So let me share my screen really quickly, and let's just test if this thing works. So I have my terminal up, and I am going to go into Codex, and I'm gonna say something really simple. I'm gonna say, "What can you help me with?" Okay. And I built into my Codex config a notify hook that should do something on here when it's time to be notified. So, "What can you help me with, dear Codex?" It's gonna tell me, and let's see. It's done. Maybe I'm not paying attention to my computer. Let's see if it runs. It should make a noise. "Your move." Well, your move without the E, your mov. It made a little beepy boop. You all, this is changing [claps] my life. So again, I did three assessments of GPT 5.5. This is the one that impressed me most. I will share more about this on the blog. I might even do a little mini app on this particular workflow. I'll try to publish the code. But you all, this was my delight moment. I screamed. My children were blown away. They have seen me slave over this thing. I was sending them messages and saying, "Hey," and then, like, responding to their questions by just showing them the screen. I am obsessed. 
So GPT 5.5 has hit my intelligence benchmark for can you hack into this Chinese digital screen with proprietary Bluetooth transport mechanisms and bitmap compression. And guess what? 5.5 can. All right. So that is a wrap for our quick review of GPT 5.5. TLDR, I love this thing. It is super smart, it is super efficient, and it will work on its own against complex problems, basically as hard as you ask it. It has solved problems I have not been able to solve before. The only thing that I will leave you with it, it is, is that it has the, as I call it, baked potato personality that we've all come to know and love from Codex. Um, it is a dull, dull, dullard. But I learned over the testing of this, if you do /personality in Codex, you're able to change that to something a little friendlier. And while some of my fellow early testers said it had too much of a Gen Z personality, I said, "I like to stay young. Give me that Gen Z GPT 5.5. I'll take it any day over the paper bag baked potato personality that you get out of the box." Other than that, it's my favorite senior software engineer, staff software engineer. I'm gonna go blow through a bunch of technical work, and I really love this model. So I can't wait to hear what you think, and if you figure out a high intelligence test that works in ChatGPT, let me know. Otherwise, enjoy coding, and I can't wait to see what you build. Thanks, y'all. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time. [upbeat music]
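The notify hook demonstrated at the end of the transcript can be sketched as a small script. This assumes the Codex CLI's `notify` setting in `~/.codex/config.toml`, which runs an external command with a JSON payload when a turn completes; the payload fields and the "Your move" display string below are taken from the episode but the exact schema is an assumption, and the device call is left as a placeholder.

```python
#!/usr/bin/env python3
"""Sketch of a Codex notify hook.

Assumed wiring (illustrative): notify = ["python3", "/path/to/notify.py"]
in ~/.codex/config.toml, with Codex passing a JSON event as the final
command-line argument. The real hook in the episode pushes the message to
a Divoom display instead of printing it.
"""
import json
import sys

def summarize(payload: dict) -> str:
    # "Your move" is the short string shown on the device in the episode.
    if payload.get("type") == "agent-turn-complete":
        return "Your move"
    return payload.get("type", "codex event")

if __name__ == "__main__":
    event = json.loads(sys.argv[-1]) if len(sys.argv) > 1 else {}
    print(summarize(event))  # replace with a call to the device's CLI
```

Keeping the hook as a standalone script means any notifier — terminal bell, desktop notification, or a hacked Bluetooth screen — can be swapped in without touching the Codex config again.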