EVERY SPOKEN WORD
25 min read · 4,705 words- 0:00 – 1:38
What open-weight models are and why GLM 5.2 is worth testing
- CVClaire Vo
What if I told you you could get Opus-level reasoning at a fraction of the cost? That's what we're gonna see and test today when I take a look at GLM 5.2. This is our first of many reviews of open-weight and open source models to see if we should all be paying the tax to Anthropic and OpenAI, or if we can run these models locally and get the same results. Let's dive in. This episode is brought to you by Mercury, banking redesigned from the ground up, now with Command, so you can just say what you need and the work gets done. I've always liked products that reduce the distance between knowing what you wanna do and actually doing it. That's one reason I've been a Mercury customer for years. Whether it's sending money, managing cards, or checking in on the business, Mercury has always felt fast, simple, and built for founders. What I love about Mercury Command is that it takes that idea even further. Instead of navigating menus or figuring out where a feature lives, you just say what you need. Ask a question about your finances, create an invoice, categorize transactions, or send a payment, all from a conversation. Because Command is built directly into Mercury, every answer comes from your actual account data, and every action follows the permissions and controls you've already set up. It feels like where software is headed, less navigating and more simply telling a product what you want to accomplish. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column NA,
- 1:38 – 4:02
GLM 5.2 model overview
- CVClaire Vo
members FDIC. Okay, before I get GLM 5.2 running against some benchmarks on my own computer and my own projects, I wanna talk you through what exactly this model is. So GLM stands for General Language Model, and this is a model that's put out by the Beijing-based startup Z.AI. So yes, this is a Chinese-trained model. Now, this model is open-weight. You may have heard that term before, but not exactly known what it means, and all it means is that the trained model weights are publicly available for download, which allows you to run it on your own hardware. You can fine-tune that model on your own data, and you can kind of inspect how it works. Now, the licenses for open-weight models kind of depends, so it doesn't necessarily mean you can use it, quote-unquote, for free. But why this matters for GLM 5.2 is you can self-host this. So let's say you have one of those chunky Mac Studios at home where you wanna run some models locally, you could potentially run GLM 5.2 locally. You can fine-tune it, which means you can adapt it to your purpose. And then the thing that people really love about these open-weight models is they're cheap. You can run your own inference, and so it can be much more affordable than API cost from Anthropic or OpenAI. And two, you're not locked into a vendor. So if a specific vendor changes their API terms, you can actually change what provider you use, and there will be a lot of inference providers. I'm gonna show one that you can use to run GLM 5.2. Now, why should you be paying attention to GLM 5.2? Well, this is what I'm hearing from the breathless AI bros, and I guess, again, I am a breathless AI brother, is that GLM 5.2 is getting people sort of Opus-level intelligence or GPT five-point whatever intelligence for a fraction of the cost and the ability to be self-hosted. If this is true, this is a very big deal. As we've seen, we can't always rely on the big model providers for consistent model access, and these hyper-intelligent models from the frontier labs are getting very, very expensive. And so any models where open-weights models are catching up to the intelligence of OpenAI models, Anthropic models, especially for coding use cases, which can be quite expensive, is something to pay attention to.
- 4:02 – 6:02
Capabilities and benchmark results
- CVClaire Vo
Now, let's look at the benchmarks and capabilities of the model, and then I'm gonna dive right into actually using them. So a couple of things you should know about GLM 5.2 just to pay attention to is its context window is big. It has a million token context window, so that's sufficient. But one limitation is it only takes text in and only takes text out. So you can't put an image, you can't get out images. It is a text-to-text model. That is one constraint of this model. That being said, it has all the capabilities that you should expect from a modern-day model interface. It has reasoning or thinking mode, it can stream its outputs, it can call functions, it can do context caching to make things more efficient, and it can output structured output and use MCPs. So at the end of the day, this is a very capable model with the right ergonomics that we've gotten used to. Now, what do we see from the market benchmarks on this model? So as you see here on Frontier Suite, Post-Train Bench, and SWE-Marathon, it's inching up there to Opus and right above GPT 5.5 on a lot of these benchmarks. And if you look at SWE-Bench Pro, you can see it's about on par with GPT 5.5 and almost up to Claude Opus 4.8, certainly beating Gemini 3.1 Pro. So if you look at it against these models that we've all come to know and love, i- it's, it's in their arena. It's definitely worth testing out, and the external benchmarks say this is a model with enough intelligence to attack some of our hardest coding problems. So what I'm telling you is you can get this open-weight model where you can inspect how it's actually built. You can run it locally or at least more cost efficiently on your own inference or your selected provider's inference, and it's gonna code just as well as
- 6:02 – 8:37
How to set up GLM 5.2 in Cursor
- CVClaire Vo
Opus 48. Let's do it. So how do you actually get GLM 5 in your coding stack? Let's say you're completely new to all this open-weight model stuff, and you wanna figure out how to run these in Claude Code or in Cursor. I'm gonna show you Claude Code and Cursor. It basically applies to Codex as well. I'm just gonna give you those two examples 'cause I think that's gonna cover most of your use cases. So first, you need to choose where you're gonna get your model from, and I'm still using a hosted API version of GLM 5.2. My little laptop's not gonna run this thing locally, and so I've chosen to use OpenRouter, which is a, um, unified interface for getting access to a lot of different models, both commercial and open-weight and open source. So I signed up for an OpenRouter account, and then all you need to do after you sign up for an OpenRouter account, give them your credit card, set a limit if you want to, and just set up an API key. So I set up a local dev. Actually gave me another API key. So I set up an API key, and now I have access to this model via OpenRouter. And so I'm gonna show you two ways to set this up in Cursor and then in Claude Code so you can start using GLM 5.2. Okay, setting GLM 5.2 up in Cursor is super easy, though it took me truly about 30 minutes to figure out the nuance here. No one has documented it. We'll put it in the blog post in the show notes for you. But all you have to do is go into your Cursor settings and click the Models tab, and then you need to do two things. First, you need to put your API key from OpenRouter here in the OpenAI API key field and toggle that on. And then secondarily, you need to override OpenAI base URL with this very specific URL. So it's openrouter.ai/api/v1/cursor. I could not find anything for a really long time that told me it had to be /cursor, but it is /cursor, and you need to toggle that change on. The second thing you need to do is add z-ai/glm-5.2 to your models. So you simply click View all models, you add a custom model, you add that field in, and you will be able to access this model. So if you open up your chat in Cursor, down here in the bottom, z-ai/glm-5.2 is now available as a model running through OpenRouter. Now that's Cursor, and we're gonna come back to this. On
- 8:37 – 11:04
How to set up GLM 5.2 in Claude Code
- CVClaire Vo
Claude Code, there is luckily a little bit more instruction on how to do this. And so there is this page on OpenRouter, the docs page, that shows you how to connect Claude to OpenRouter, and then I'll show you how to connect your specific model. For Claude Code, it's pretty simple as well. You need to get your OpenRouter API key and that URL, which they have here, and you need to add to your shell profile. So for people who are not super technical, who have just Claude coded their way into terminal, your shell profile is the file that manages your settings in the terminal. It's gonna, like, instantiate a bunch of environment variables. It's gonna set a bunch of settings. And so you need to edit your shell profile. It's usually zshrc or bashrc, depending on what profile you use, and you need to add these lines right here to that file. You can also... This says open it in Nano. If you're not feeling fancy, you can just find this file in your finder or in your file directory and open it in whatever code editor of choice and add these lines, which include your OpenRouter API key, the base URL, which does not contain /cursor, it's just openrouter.ai/api, and your auth token here, and then you clear the default auth token for Anthropic. The second thing you need to do is edit your claude_settings.json, which is in .claude/settings.json. You can open up again in whatever code editor you want, and change your model to the GLM 5.2 string from OpenRouter. So here I've put it in. And so with those two things, any Claude Code session that I open up will have the OpenRouter API key. It will route all requests through that OpenRouter API key, and it will set the model to GLM 5.2. There's a very similar process that you'd follow for Codex, but the TLDR of setting up your Cursor Claude Code Codex with a new open-weights model is to find a provider, switch out your API key, and route all your model calls to that new model. So now I have Cursor, and now I have Claude Code running GLM 5.2. And just to prove it to you, I'm gonna pull up Cursor right here, and you can see GLM 5.2 through the API being used right now.
- 11:04 – 12:43
Live test 1: codebase exploration and architecture audit on ChatPRD
- CVClaire Vo
Okay, so I'm gonna run through a couple use cases of GLM 5.2, and spoiler alert, I have not done any of this, so I am not sure how it's going to work, and just give you my vibe check on whether or not this is a model that I would generally use in my day-to-day flow. So the first thing I'm gonna do is just see how good it is at exploring an existing code base and telling me a little bit about it. So I'm in the ChatPRD code base, and I'm just gonna say here, "This is the ChatPRD code base. Please explore it and tell me a little bit about its architecture and the most recent things we have been shipping on this code base." So this is gonna go through my code base, and we're just gonna explore how good it is at independently auditing, reviewing, and understanding the structure of a code base from zero. This is one of the most common tasks that you would do as a software engineer is really getting oriented, and it's a good reflection of its ability to run autonomously, its ability to use its context window effectively, and its general sense as a software engineer. It was actually pretty fast, and it came back with a pretty good overview. So it is a Next app. It's got the full stack. It's got a nice architecture here of what it looks like. It's talked about different integrations we have and what we've been shipping in the last six week, which is our Chat v2 stability, absolutely, and then some billing and Lenny promo stuff that we've been working on, as well as security and dependency hygiene. So this is actually pretty correct. It was very fast and very accurate. And so just out the gate, this does not seem like a dumb model. But this
- 12:43 – 16:37
Live test 2: generating an HTML architecture and roadmap page
- CVClaire Vo
is a pretty easy task. Let's make it something that we can visualize a little bit more and see if it does a good job communicating agent to human. And we all know that this is the year of HTML, so I'm gonna say, "Turn this into an HTML page that can communicate the overall architecture of the app and give a sense of the upcoming roadmap. You can use whatever components you want to make this look good and communicate to me, the end developer, the major pieces of the architecture and product strategy. Give me a page to pull up when it's ready to review." So again, this is gonna take a little bit of the combination of the reasoning and intelligence of the model and combine it with a preview of the design sense and communication sense and see what we get out of it. Okay, it's creating this HTML page for me. It's told me to approve the HTML. I'ma pull it over 'cause it does not look bad, you guys. Um, so this is the Chat PRD architecture and roadmap review. Right out the gate, it's like slop adjacent. We have blurple on here. That's that blue purple indigo color that we love, but it's not ugly. So let's take a look at the content. Well, at the high level, it does look correct. Wow, we've merged almost 3,500 PRs. We've done a lot of PRs. It's giving me a good overview of the core pieces. Oh, this is really great. This is the anatomy of a chat turn, so how the core piece of our application actually works. Some product pillars, which are our chat, our integrations, documents and collaboration, as well as billing, which also sounds correct. And then it's given me a list of recently shipped things, and then roadmap and direction. This is actually the piece where I'm most, uh, impressed. One, this looks real cute, and it got the Chat PRD pink, which not all models get. GPT wants to give me these, like, ugly green and navy colors. Claude wants to give me Claude orange all the time. But look at this, GLM gave me Chat PRD pink. I'm very happy. And so it's given me what we're working on in flight, and then let's see what it suggests should be up next for our roadmap, integrations, partnership and enterprise motion, cost and performance, and then knowledge and retrieval. Spoiler alert, these are actually the things that we're working on, and a couple conventions about writing our code, um, which is quite nice. So I don't know, guys. This is pretty good. This is the first time I've run an open-weights model inside Claude Code, and I have to say, I am quite happy about it. But let's take it to the next step. Let's design something real. This episode is brought to you by Mercury, banking designed to work the way modern software does, now with Command, a fundamentally new way to get things done with your bank. One thing I've become increasingly allergic to as a founder is dashboards. Not because dashboards are bad, but because every workflow starts with opening another tab, finding the right report, and figuring out what to do with the information. Mercury Command takes a different approach. Instead of navigating your banking software, you simply tell Mercury what you need. Want to understand your cash position? Ask. Need to create an invoice or categorize transactions? Just ask. Because it's built directly into Mercury, every answer comes from your actual financial data, and every action respects the permissions and approval workflows you've already set up. I've always appreciated Mercury because it feels like a product company built a banking experience. Command feels like the next evolution of that idea. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column NA, members
- 16:37 – 20:57
Live test 3: redesigning the How I AI landing page in Cursor
- CVClaire Vo
FDIC. So what I have up here is the Chat PRD website, but, but specifically our How I AI landing page and blog, where we put every single episode and a summary of those episodes up on our Chat PRD blog. So this is a pretty highly trafficked part of our site, and we redesigned this a couple times using AI, but I wanna do it again. And there's a specific piece of this page I don't really love, which is this header section. And so we're gonna have GLM 5.2 just make a pass at redesigning this header section, and let's see how it does. So I'm gonna say here, "Let's redesign the header hero section of the How I AI landing page, where all the How I AI blogs are, the part that says, 'AI workflows and stories from the experts through the Cursor credit claim.' I wanna redesign it so it is higher quality design, it is a better call to action to workflows, and it helps with anything we need on SEO. Design whatever you like. Looking forward to what you make." I don't know, you guys. It's really embarrassing to prompt my own [laughs] blog, but this is one I'm gonna give it. Now, I've told it to redesign this hero section. It's gonna run it through this new model, and we're gonna see if its ability to redesign even a small section of the page will give us AI slop or if it will give us something a little higher quality. And the reason I like to test on the Chat PRD marketing site is it has an existing design system, and there are specific things that we really like to see in Chat PRD design. So this will be a good test to see whether or not- It can match to an existing design system versus generating a completely novel design like we saw in the architecture overview. So I'm gonna let that run, and we're gonna see what it looks like when I come back. Okay, it says it has a plan and executed that plan. Let's look at it. You know, I don't hate it. I don't quite love it yet, but it's not, it's not bad. What do I like about it? Well, I like the fact that the AI workflows, as requested, are a much better call to action. It also has this nice hover effect on it. I do like that it put sort of some metadata here and some value propositions on who it's for, how frequently we drop, and how many episodes we have. And I do really like this little sidebar widget, um, that makes the listen to the show, the calls to action to YouTube, Spotify, and podcasts, Apple Podcasts, look a little bit more like a player. I'm not sure what this little square in the corner is. And then I do think this copy here is... It, it might be what it was before, but looks pretty good. I would just say I don't love fully all these colors in the sidebar player, so I'm gonna give it that feedback and say, "I really like this, except for the listen to the show sidebar. YouTube, Spotify, and Apple Podcasts are very bright buttons. They're super overwhelming, and they're very wide for the text that's in it. I think this component could look a lot higher quality and a lot better for our specific design system. Can you take another pass?" Let's see what it comes up with. But I will say for the speed and for certainly the cost, this does not make me unhappy, and I think we would all question how much intelligence do we need to put towards this specific problem? And as long as the model has good taste, I don't need to be fancy and use the most expensive one. So I would say just first glance, first pass, GLM 5.2 is pretty good at design stuff, and maybe we should all be switching over to it, especially if you're anchoring in something like a design.md or other design guideline or design system where the model can really anchor on it. I do like this a little bit better. It went with a sort of black, uh, call to action, a lot more subtle and a lot smaller, but there's some misbalance between the left and right. But again, it's pretty fast. In fact, it's almost as fast as Composer is, which is a model by Cursor that I use really frequently. And so I think this is pretty good, and I would definitely put GLM 5.2 in the rotation for design
- 20:57 – 22:35
Live test 4: 45-minute autonomous task, pulling Sentry errors and Vercel logs
- CVClaire Vo
work. Let's wrap with a much more complicated use case, though, which is a long-running autonomous use case. So part of how GLM 5.2 has been advertised is that it is a very agentic model that is capable of handling very long-running autonomous tasks and solving those over time, very similar to the claims about Opus and GPT 5.5 or whatever. And so I gave it, um, a, a common task that I like to give a lot of my long-running models, which is pulling issues and error logs and then making a plan for fixing those error logs and ultimately shipping those fixes themselves. And so before I started this podcast, I started with this specific prompt, which is, "Pull the last 72 hours of Sentry errors and Vercel error logs, and build a prioritized plan of bug fixes based on observed issues." And so this has been running the entire time I've been recording this podcast, probably about 30, 45 minutes, even though you all will get a much shorter cut of this episode. And you can see here it did the thing that most models do, which is it built a to-do plan, it ran tool calls and MCP calls, it read the output. It actually asked me to auth into Vercel, so that was great. It ran several Vercel calls, and now it is putting together a plan in HTML, I believe, for us to review and decide if these are the priorities for ChatPRD. So I will let it finish writing that plan, but again, this has been running for about 30, 45 minutes, and we will see if it came up with something of high
- 22:35 – 23:49
Where it struggled
- CVClaire Vo
quality. Okay, quick intermediary peek from the reasoning minds. It is really struggling to write TypeScript, so while it can do a long-running task, it is having some TypeScript errors. So we may be sitting here for a while waiting for it to write the plan as opposed to its intelligence on getting the plan. So hold tight, we will be right back, and I will give you my opinion whether all this waiting was worth it. Oh my God, guys, it really is having trouble writing JavaScript right now. [laughs] I can't... So, so okay, we got over the hurdle of it can write HTML. It's very good at writing HTML. The HTML and CSS is well-designed and looks good. I think it can query tools and look at data very well, but I don't think it can write React, which is 98% of what I do with these models. So if this is a failure state, we're gonna have some trouble. Oh, it compiled cleanly. I spoke too soon. I just had to complain to the model gods, and we are back. It's going to clean this up and hopefully show me its plan on how to fix all the errors
- 23:49 – 25:23
My verdict on the output
- CVClaire Vo
in ChatPRD. Okay, here we are. It pulled the last 72 hours from both sources and built a prioritized plan in a canvas I can open. Here is my canvas. Again, it does not look bad. I love that it's for an engineer, so they made it in dark mode. It says we have 20 Sentry errors, five Vercel log signals, 14 planned fixes, and then Gave me two P0s, not happy about that. This is something that was not coming through on, um, the signal-to-noise ratio on some of our Sentry issues, so we will fix that, and then we need to look at this. I think this is a legacy Google Drive integration, but we will take a look at that as well. It's giving me events by volume and then, um, runtime log signals and whether or not they are high severity or low severity, and then gave me this beautiful prioritized fix plan. Y'all, I was really disappointed by its speed in writing React, but this is exactly what I need. This is super helpful. It even looks like it's supposed to be part of Cursor, which I really love, and I can go through and start to rock through these fixes. It even gave me suggested sequencing. I'm not mad. So I spoke too soon being disappointed about the performance on this long-running task. I actually think it's pretty good, and I think I'm going to ship all these fixes, and apologies to my ChatPRD users if you ran into these errors. I think a few of them are
- 25:23 – 27:12
Cost breakdown
- CVClaire Vo
new. So what's my takeaway with GLM 5.2? It's good. I would use it for front-end work. I would use it for long-running back-end tasks, and the test that we were testing is how much did it cost me? And if you pull up OpenRouter and your usage on this API key, I spent $3.36 on about 6 million tokens. Cache rate was pretty good at 72%. I spent most of those tokens on that 45-minute long-running task in Cursor to find all my issues in Sentry and Vercel and just a few in Claude Code. But if you compare this to the cost of an Opus or a GPT 5.5, this is a steal. So I think I'm gonna keep it running in Cursor for a while. Think I'm gonna keep it running in Claude Code for a while. I'm gonna see if it can handle most of my tasks, and maybe I'll have to buy some more hardware and start running this stuff locally. This has been our first open-weights model review here at How I AI. I would love to hear what you think. If you want me to try more models, if you want me to show you how to run more open-weight models, or if you just want me to explain to you what is happening in this world of frontier coding models, whether commercial or otherwise, I'm here for you. In the meantime, get back to coding and thanks for joining How I AI. [upbeat music] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiai pod.com. See you next time. [upbeat music]
Episode duration: 27:13
Install uListen for AI-powered chat & search across the full episode — Get Full Transcript
Transcript of episode ZoBfQZ5utQk
