Agent Battle: Mine the most diamonds in 45 minutes

Head-to-head agent build-off on a shared game harness. Build an agent, submit runs, watch scores and a live game feed stream to the leaderboard on the big screen. Can you build the top-performing agent in the room?

May 23, 2026 · 8 min

EVERY SPOKEN WORD

5 min read · 1,321 words

0:00 – 0:20
Intro
1. SPSpeaker
  [on-hold music]
0:20 – 0:51
Agent Battle kickoff: Diamond mining with managed agents
1. SPSpeaker
  What's going on everyone? Let's get settled in. Uh, this is gonna be an agent battle, and we're gonna be tight on time, so we're gonna get right to it. My name is Ben. I'm accompanied here by Jeff. We're on the Applied AI team here at Anthropic, which means we help folks like yourselves squeeze the most juice out of the Claude ecosystem. Uh, and today we're gonna be mining for diamonds in Minecraft. So you might have played Minecraft growing up. Uh, I did in the 2000s, 2010s,
0:51 – 1:51
Three learning goals: deploy, configure, and iteratively improve agents
1. SPSpeaker
  but in the 2020s we have agents play Minecraft for us, and that's what we're gonna be doing today. So what are we actually gonna accomplish here today? There's really three things. Number one, we're gonna learn how to build and deploy a managed agent. This is something that we released a few weeks ago. Uh, it's hosted on our infrastructure, and we've set up a lot of the configuration for you, but your job is going to be to get the agent into peak diamond mining condition. Number two is we're gonna understand the impact of an agent configuration. That's the system prompts, the model string, the skills and MCPs that you're plugging into your agent to get it to elicit the behavior that you're actually looking for. And then number three, you heard Will talk about this in the last workshop session if you were here, we're gonna learn how to hill climb on evals. This is how we improve all of our agents internally. If you've deployed one into prod, you understand the process of measuring its impact, understanding its behavior, and then making changes to iteratively improve. And this is an agent battle, but there
1:51 – 3:07
Battle rules and scoring: timeboxing, submissions, and token efficiency
1. SPSpeaker
  are some rules to the battle. Uh, number one is we're gonna have a timer for about thirty-five minutes in which you're gonna be able to build and experiment with your agents. Uh, you can only submit-- Uh, well, we're only gonna accept one run per person, so you can submit as many as you'd like, but we're only taking your top run. Each run is five minutes of the agent mining for diamonds. You can kill the run at any point with just, like, a quick Control + C in the terminal. Um, we have an eval set that takes approximately one minute if you just wanna quickly iterate on the agents. And whoever has the most diamonds at the end of thirty-five minutes wins. We're gonna have a leaderboard up here going to show you where you're at. There's also gonna be a chat where your agents can talk to one another. Uh, and if anyone is in a tie at the end of the workshop, uh, we're going to be settling that tie based on token efficiency. So this is not just "mine the most diamonds," it is "get the best diamonds-to-tokens ratio." And that means you're gonna have to really hone your system prompts rather than just throwing in the heaviest weight model you can. Now I'm gonna hand it over to Jeff to talk through the logistics of how the harness actually works and what you guys are gonna be doing during the workshop.
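The scoring rules described above (only each person's top run counts, most diamonds wins, ties settled by token efficiency) can be sketched as a small ranking function. This is an illustrative sketch, not the workshop's actual leaderboard code: the `Run` type, its field names, and the choice to score a zero-token run as 0.0 efficiency are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # Hypothetical record of one submitted run; field names are assumptions.
    player: str
    diamonds: int
    tokens: int


def efficiency(run: Run) -> float:
    """Diamonds per token. A zero-token run scores 0.0 (an assumption
    made here to avoid dividing by zero)."""
    return run.diamonds / run.tokens if run.tokens > 0 else 0.0


def sort_key(run: Run) -> tuple:
    # Primary criterion: diamonds mined. Tie-breaker: diamonds-to-tokens ratio.
    return (run.diamonds, efficiency(run))


def leaderboard(runs: list) -> list:
    """Keep each player's best run, then rank best-first."""
    best = {}
    for run in runs:
        current = best.get(run.player)
        if current is None or sort_key(run) > sort_key(current):
            best[run.player] = run
    return sorted(best.values(), key=sort_key, reverse=True)


if __name__ == "__main__":
    runs = [
        Run("a", 19, 1000),
        Run("a", 10, 100),   # lower-scoring run from the same player is ignored
        Run("b", 19, 500),   # same diamonds as "a", fewer tokens, ranks higher
        Run("d", 20, 9000),  # most diamonds wins regardless of token count
    ]
    for place, run in enumerate(leaderboard(runs), start=1):
        print(place, run.player, run.diamonds, run.tokens)
```

Under this rule a heavier model only pays off if it raises the diamond count, since at equal diamonds the cheaper run ranks higher.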
3:07 – 3:37
Harness overview: Mineflayer bot + MCP tool-based control (no visuals)
1. SPSpeaker
  Hello. Uh, okay, so the harness, uh, we're actually shipping quite a bit for you to get started. Um, what you're gonna primarily be thinking about is, how can I optimize my experience in trying to mine as many diamonds as possible? We're giving you a couple different tools to accomplish this. The main structure is that you're going to be running through a Minecraft clone that connects to a Mineflayer bot. If you're not familiar with Mineflayer, uh, you're not going to be relying on visuals. There's a series of MCP tools that are shipped directly with that, such
3:37 – 4:07
Fair starting conditions: same seed and reset kit for everyone
1. SPSpeaker
  as mine block, jump, go near things, something along those lines. Don't have to think too much about this. The main levers that you're going to be focused on, uh, are about-- I think it's on the next slide, but, uh, basically along the lines of how do I optimize this run? Um, everyone's gonna start from the same seed, so there's no real optimization that needs to happen there. And every time you reset, you'll have the same start kit, the same seed to go with that. Um, okay, so the, the agent that you'll be operating out of, so there's a
4:07 – 4:38
Where to modify the agent: repo layout and key files/knobs
1. SPSpeaker
  my agent.py that's included in the repo. Uh, so if you go to /agentbattle, you'll have access to the repo, and that's where we-- you'll be running this. Uh, my agent.py is where everything should really be taking place. You can adjust the model, you can adjust the system prompt, which is currently empty, and then you have the opportunity to use a skill that's shipped directly from Anthropic, or you can replace with your own skill. Uh, and then we also are including an MCP server if you wish to adjust these things. So these are the things that-- or parameters that you have available to adjust.
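The adjustable parameters listed above (model, system prompt, skill, MCP server) can be pictured as a single configuration object. The sketch below is purely illustrative and is not the contents of the workshop file: `AgentConfig`, its field names, and the placeholder values are hypothetical, and no real SDK calls are shown. The only detail taken from the talk is that the system prompt starts empty.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    # All names and defaults are hypothetical placeholders.
    model: str = "MODEL_ID_PLACEHOLDER"           # the model string
    system_prompt: str = ""                       # empty at the start, per the talk
    skill: str = "shipped-skill"                  # shipped skill, or your own replacement
    mcp_servers: tuple = ("minecraft-tools",)     # MCP server(s) plugged into the agent


def changed_knobs(old: AgentConfig, new: AgentConfig) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    before, after = asdict(old), asdict(new)
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}


baseline = AgentConfig()
# Change one knob at a time so a score change can be attributed to it.
variant = replace(
    baseline,
    system_prompt="Dig down to diamond depth before exploring sideways.",
)

if __name__ == "__main__":
    print(changed_knobs(baseline, variant))
```

Recording which knob changed between two runs makes it easier to tell, from the eval scores, which adjustment actually helped.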
4:38 – 5:08
Execution plan: iterate quickly with evals and ask for help
1. SPSpeaker
  Uh, so try things out, run evals. Um, we'll also be circulating around the room as well, so if you run into anything or wanna, uh, discuss, then feel free to do so. Um, okay, yeah, this has largely already been covered, so I wanna make sure that we have as much time as possible to get started. Um, so we can actually move over to, uh, starting the countdown. Looks like several people have already joined. I'm gonna reset the time now. And you may begin. So,
5:08 – 5:39
Countdown start: competition begins and leaderboard activity starts
1. SPSpeaker
  uh, these will all go away. Let me-- Okay, great. So it's completely refreshed. Uh, and the time has begun now, so feel free to get started. And if you have questions, we'll be around. It looks like we already have a run with ten. We have several other participants who are getting started. Um, just to give, uh, slightly more context, so within the repo there's a, a
5:39 – 6:09
Troubleshooting connectivity: Claude Code skill and network workarounds
1. SPSpeaker
  Claude Code skill that you can use to help with setup. Um, if that is, uh-- if anyone's having an issue, then that should hopefully help with the process. A, a couple people announced that they were having issues with Cloudflare. I think it may be due to the fact that there's so many people trying to connect via the, uh, the conference Wi-Fi. Um, it's a bit small, but I did put a command up here that-- found
6:09 – 7:12
Final minutes drama: ties, suspicious token counts, and new high score
1. SPSpeaker
  that it worked for at least one person. Um, so feel free to give that a try if you're running into any issues. About five minutes remaining, we have a three-way tie, and for some reason, the first person is showing as zero tokens, which is highly suspicious. [laughs] So we may have a two-way tie, [laughs] unless you can ex- Ah, okay. Well, we can, uh, we can investigate. Seems like 19 might be the upper echelon of what's possible, at least so far. Have time for about one more run. We actually have somebody who's broken 19 with only one minute and 20 seconds to go. Wow.
7:12 – 8:44
Time’s up and next steps: winners called to identify techniques
1. SPSpeaker
  You'll have to reveal your technique. 20 seconds. 10, nine, eight, seven, six, five, four, three, two, one. Time's up. Uh, so thank you everybody for participating. Uh, looks like we have a clear winner now, and would love for everybody who was able to mine 19 diamonds, uh, to come find Ben and I. Uh, because our second place has zero tokens, we don't know who actually won second and third place yet. So, uh, we'll invite everybody who is, uh, one through five to come up.
2. SPSpeaker
  What's the win?
3. SPSpeaker
  We don't know what they'll win yet. [laughs]
4. SPSpeaker
  [laughs]
5. SPSpeaker
  [laughs] [audience applauding] The other way.
6. SPSpeaker
  You guys can smile in front of, uh, in front of your winnings.
7. SPSpeaker
  Oh, gosh.
8. SPSpeaker
  Nice job, everybody. [laughs] [outro music]