Skip to content
ClaudeClaude

How Metaview built self-improving prompts for application review

At Metaview, we help recruiters sift through thousands of resumes a day. Most evaluation systems set the criteria upfront and rebuild every time preferences change. We built one that learns from every decision recruiters make and evolves with them.

May 22, 202616mWatch on YouTube ↗

EVERY SPOKEN WORD

  1. SP

    [on-hold music]

  2. SP

    Please welcome to the stage Product Engineer at Metaview, Nick Mayhew.

  3. SP

    [audience cheering] [upbeat music]

  4. SP

    Good afternoon. It's a pleasure to see you all today. Today, I'm gonna be talking about how Metaview uses self-improving prompts in application review. For those of you who don't know Metaview, we build AI native recruiting software, and application review is the process of looking through candidate CVs and cover letters to help determine who you're gonna interview. Now, before I go into the, the why and the how of why we did this, it's worth understanding the situation that we're in and where we're at nowadays. Because since twenty twenty-three and the proliferation of large language models to everyday consumers, we have seen a explosion in applications to jobs. And the main reason for this is it's lo-- AI has lowered the barrier to entry in terms of how we apply and how candidates can apply to jobs. So there are some great stats here. We have one of our clients even last week who had two thousand seven hundred and forty applications for one job in twenty-four hours. Um, that mostly happens if your jobs are remote and junior, but these volumes are incredibly high. And the one stat that's not up here that I always love to talk about is the average answer to an application question like, "Why do you wanna work at Anthropic?" Or, "Why do you want to work at Metaview?" That has increased by roughly fifty percent in the last two years, and none of us have retaken GCSEs and learnt verbose writing. What has changed is that we have the ability, and we have LLMs that help us write these answers, and that has lowered the barrier to entry and massively increased the number of applications recruiters have. So what do you do? So recruiters need help, and like any good product manager, you first go interview the stakeholders to build a system that can help you go through some of the grunt work of these applications. And when you go to a hiring manager or a founder, you ask them what they want, and they'll say, "Oh, five years backend experience," all of these classic requirements that probably everyone in this room has read a hundred times over on a job application. And so you go build out a system that can help you evaluate for these things, but you come up across a problem almost immediately. And that is your hiring manager or your founder, they look at the first set of CVs, and they go, "Actually, I want startup experience." So you go back, you rewrite the whole thing, and you restart your evaluation. And then you get to your first interview, and they go, "Mm, no, no, no, no, no. This candidate hasn't built zero to one. That's a requirement." Now, that wasn't a requirement two days ago, but now it is. And you keep rewriting your evaluation systems. And the key point I wanna take away from here is that any user-based decision and anywhere where user judgment is at the forefront, your preferences are going to evolve. And so any system you built, your prompts must evolve with them. So as we can see here, as we're talking about, like, user preferences are always going to evolve. People change their minds, and if you want to keep them at the center of the decision-making, any one of your systems have to evolve with that too. So make sure your prompts reflect the, the fact that preferences will evolve. Don't try and add this on at the end. Make this a core part of your whole system and a foundation to your system. So let's talk a little bit about how we do that at Metaview. Oh, there we go. And so when a candidate enters, we first redact their information, remove, you know, a name, email, phone numbers, personally identifiable information, so we're evaluating based on experience, skills, qualifications. And we match that up with an ideal candidate profile. Now, this is the part of the prompt that is self-learning and self-improving over time. An ideal candidate profile, for those of you who don't know, is a bit like an ideal customer profile. It's what you're looking for in a role, who you're looking to hire, and what you're looking to fill. So we match the ICP and the redacted candidate to produce an evaluation of a candidate. And this is again another very core part and a key takeaway of this process. We're working in high-risk areas here where human judgment is at the forefront and is not just human in the loop, but human in the center. And what we mean by that is that your system acts as an apprentice. It-- Your system is trying to learn and help do some of the grunt work of passing through thousands and thousands of applications, but it is not your job to make a decision. Your job is to, as the LLM-based system, to help do some of that work and spot things like, has this person worked at companies that we're looking for? Have they got roughly the right experience in the same areas? Have they used the right technologies? And so that's where your judgment comes in, and then the user's judgment comes in to actually make the decision, right? They are deciding whether you progress or reject a candidate, so they stay at the forefront. Now, how do we then learn from that? Because the user is the one making all the decisions, we can have an agent that sits on top and observes their patterns, right? You can see any progression they make, any rejection they make, and you can start picking up patterns and improving your ideal candidate profile from that. So as I say, the ideal candidate profile, that's our aspect, our prompt that is self-improving. But let's go into the weeds of the ideal candidate agent, the ICP agent, to understand exactly how this works. So there are three main parts to this. One is the, the user messages, and that is all the user decisions that they make. So every time they progress a candidate, reject a candidate, any piece of feedback they give, any time they make a manual edit to an ideal candidate profile, anything about how they want to evaluate a candidate-That gets fed in as a user message. And we built that initially, and it was kinda good, um, and it produced, you know, a proposed ideal candidate profile. It produced a decent one. But one of the things we learned very early on is that any feedback given by a user is gonna be relative to what they've just seen. And as we all know, agents need the right context, and part of the context here is the candidates' profiles, those redacted candidates' profiles. So we have a specialized tool called Query Files. Um, we usually, you know, we tried Bash and, um, all this sort of standard grep, but unstructured data can be incredibly hard to just grep for in a file system. So we have a specialized tool that can go through candidate profiles and make sense of relative feedback. So if you said, as a recruiter, "This person doesn't have enough Python experience," we can then look at the past resume, the redacted resume, to understand what that actually means. What does it mean to have too little Python? What does it mean to be too junior? And then we can build that context, and that goes into the ICP manager agent, which has one function: keep that ideal candidate profile, that prompt up to date and look what they're looking at. And this is the core of the system. Now, actually, we've had some really interesting talks over the last couple of days that say that basically this whole thing should just be one agent. So I'm gonna go back to this slide and explain a little bit why this is a workflow and an agent on top. When you're working at volume, we, as we say, process thousands of applications in a day. You can't often afford to just chuck everything at an agent. And we'd love to. Um, it's always fun just to, like, as developers, make the maximal agent as possible. But there's a business side to this where you cannot spend dollars and dollars or tens of dollars on three thousand applications for one role. Because if you're Anthropic or you're Google or you're one of these massive companies, you're receiving hundreds of thousands, if not millions of applications a year. So you need a system that's efficient in its token usage, which is why we have this workflow underneath and this agent that sits on top to evaluate the progressions and the rejections. So what does an ideal candidate profile actually look like? Um, again, this is something that we should all be aware of by now. This is Markdown documents. We don't suggest, um, and we don't use any things like weightings or any if statements or flowcharts. Um, there's been a lot of critique in the past, rightfully so, in our opinion, of keyword matching on resumes to try and understand what a good resume is. It's not the way to evaluate a person or a candidate, and that's not how we suggest these systems work. Lean into what LLMs are good at, which is natural language. Allow them to reason in prose, not in flowcharts. And so you can see here, and I'll show you an example in our demo of what a ideal candidate profile looks like. It's just a text document. It's what you would write about who you're looking for, not, "Hey, thirty percent of my weighting should be on X keywords or Y keywords or this." Allow users to write in their natural language, and you will get a system that reflects their priorities much more because they don't work like us. They don't work in weightings and flowcharts and if statements. They're just used to describing in normal language. So lean into what they're good at. And we're at an Anthropic talk, so why am I talking here? Um, we use Anthropic models a lot for this. Um, and one of the reasons we use Claude models is we have an interesting dilemma when it comes to candidate review, which is benchmarks on software evaluation is cool and all, but what we care about is can you look through a resume and understand what is real and what is not? And one of the biggest problems with evaluating CVs is that there is, let's be honest, a lot of fluff in people's CVs. There are plenty of CVs out there where people will be claiming they've done a lot more than they do. And if you have a CIF, CIF, CIF thank it model from other frontier labs, you're gonna struggle there because it will take them at their word, and they'll be like, "Oh, great, you, you know, you created a large language model by yourself in your garage." Yeah, no. And it will just say you're great. So you need a model that can reason critically, um, and that's why we use Haiku and Sonnet. Haiku for our evaluations. Again, as we're talking about, we're running thousands of these a day. In Anthropic, actually, we have special input per token limits, um, that allow us to process so many of these applications a day. And then Sonnet, because you've got an unconstrained task here of trying to find patterns where latency matters less. So we use a little bit more intelligence there to find those patterns rather than a more constrained task of ideal candidate profile, resume, what's the evaluation? So let's dive in to see what this actually looks like on a screen. Here we have in front of you is, um, product engineer backend bias. This is all dummy data. Um, this is my job within Metaview. Um, so hopefully, you know, I know what I'm looking for here. And we can see we've got some candidates up here. Um, and you can see their ICP fit. So this is are they a good fit? How much do they meet the ideal candidate profile? We've got some candidates here that are a good fit. Emily, Nina are, are okay fits. And you can see what an ideal candidate profile actually looks like. You know, it's a bit of a role summary. Again, provide that right context for the agent to understand what it's evaluating for and then must-have, nice-to-haves, and red flags. Now, the reason we phrase it like this is because, um, a lot of recruiters think like this. Um, we're just trying to reflect what users do. There's no special sauce here. Use what they use in their system. And so if I, you know, actually look at this and go, "Well, wait a second. Nina looks like a great candidate here. Um, why is she just an okay fit?" Um, let's just say-- and we think, you know, Airbnb candidate's great experience, so we're gonna progress this candidate, and we're gonna provide some feedback. And I'm copying and pasting here, but I'm saying, you know, Airbnb has great engineering culture, which will happily hire talent from, um, strong engineering companies are great fits. So, you know, these shouldn't be just an okay fit for us. So I submit this feedback, and now I switch over and hope the LLM is gonna do what I tell it to do. As we can see here, this is, um, LangChain, if any of you know the company. This is, um, the-- where we deploy our agent, um, as of right now. Um, some debates having with, uh, our Anthropic representatives on Claude managed agents, but, uh, they'll convince me eventually. Um, but here you can see that, uh, Sonnet four six has been called, and let's have a look at the input hereDo, do, do, do, do. This will take a second. Here we go. So we can see the user messages here. We've got a bit of testing that was happening earlier today, but if I scroll down to the bottom, we'll see that Nina Park was progressed and the overall feedback so far, and some further information on what we're looking for. And I hopefully-- I may have to refresh 'cause LangGraph streaming is, is not perfect. We should... Is it gonna do, do, do, do, do, do what I want? This is, uh, the scary part of any agent live demoing, I must say, is that you are sitting here hoping it, uh, is gonna respond in the time window we say. So this now is running again, and we can see, yeah, user reason. So this is where our reason was now submitted. Um, we've got some overall feedback, some other information for the agent there, and then our agent will output what tasks it wants to do and which tools it wants to call. Um, a little bit of us just waiting on Sonnet here. So what I'm gonna do is come back to that in a second and look at another one where we already have a suggestion. So here's our sales associate role. Again, all dummy data. Um, based on some feedback we've been given, uh, we've updated the ICP and we've got a new, um, ICP. So this is, uh, a changed ICP based on feedback that the user has given. And we can see the sort of green and red diffs here, and this shows us that the agent has learnt from our decisions and that we can just confirm these or edit these ourselves, but let's just preview and confirm these and reevaluate these candidates. So I'm gonna come back to this one here. As we can see, Claude has got back to us, um, and it's saying, "Oh, now I've got explicit written feedback. What should I do?" It's given its reasoning, and it's decided to call the Upsert ICP tool, and that is gonna update its ICP. So if we come back to here, we're gonna see this suggestion and some changes here. So it's made quite a big change, but one of the big ones here is it wants strong back-- product engineering backgrounds. And so you can see how now as you do this at scale and you do this quickly, it will start to learn from those patterns. Um, this is a sort of contrived example 'cause one of the interesting things when you run this evaluation at scale is I cannot be updating this ICP based on one piece of feedback. You usually start spotting patterns every hundred, two hundred, or things like that. And so that allows you to really do this at scale and really refine an ideal candidate profile so that when you are reevaluating, you get a really good sense of which candidates are the right fit and which ones aren't. So I'm just gonna kick this off, and I'm not gonna make us all sit here and watch, uh, a reevaluation of candidates by Haiku, um, however much fun it is using LLMs. So I'm gonna come back to our slides here and talk a little bit about what I think the three main takeaways from this talk should be. And the first one is any evaluation system you build in which, like, user judgment is at the center, um, you really need to understand that the user preferences are going to evolve. It's so often that we see these systems be like, "Let's just write our requirements up front. Things aren't gonna change. It's fine." You've got to understand that if you're us-- working with users and you want your users not just to be in the loop, but be at the center of the decision-making, that their preferences are gonna evolve, and building that as, like, an ad hoc thing after is not gonna work. Build it as part of the foundation of how you work. And then the second one is use pros, not rules. Again, we've seen lots of lectures here today about just leaning into markdown language, allowing the agent to do what it does best. Do that. Lean into what, um, how users write, how the LLM writes. Don't move down into flow charts and if statements, however tempting it is from the history we all were used to in pre-LLM worlds. And the first-- last one, which I think is also one of the most important, is building the guardrails into the architecture. And whether we like it or not, evaluation systems are becoming more and more part of our daily workflows. LLMs are being deployed from anywhere from code reviews to financial crime, we heard about earlier today, um, to KYC. These systems are being set up and being used, and if you try to ad hoc add your, you know, guardrails on the ot-- on top of it and at the end, it's not gonna work. So build your system from the start to be the apprentice, to learn from the user, but never overrule the user, right? Make the user the, make the user the master and make the system the apprentice. Thank you very much, um, for listening to me today. Um, I believe, yeah, that's it. Thank you very much. [audience applauding] [upbeat music]

Episode duration: 16:45

Install uListen for AI-powered chat & search across the full episode — Get Full Transcript

Transcript of episode A3rmSUp6Dxg

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.