Before we ship a Claude model, these teams try to break it.

They don't just test the latest Claude models; they put them through the wringer. Working at the Frontier goes inside that process: what they build, what they push back on, and how their feedback shapes what ships.

May 28, 2026 · 3 min

CHAPTERS

0:01 – 0:20
Early access: Customers try to break Claude before it ships
The video opens by explaining that a small set of customers get pre-release Claude models to test them aggressively. Their feedback directly influences what ultimately ships.
- A select customer group tests models pre-launch
- Focus is on “breaking” and stress-testing capabilities
- Learnings from these tests shape the final release
0:20 – 0:36
The adrenaline of a new model drop: “A storm’s ahead”
Testers describe the arrival of a new model as an exciting, intense moment that demands immediate attention. The mood is urgent—everyone shifts priorities to explore what changed.
- New model access triggers all-hands energy
- Fast iteration and rapid exploration are expected
- Teams drop current work to evaluate new behavior
0:36 – 0:44
Recalibrating assumptions: “The grounding has changed”
As soon as a new model arrives, teams notice that baseline behavior and reliability can shift. They must re-learn what the model does well and where it regresses.
- Model updates can change fundamental behavior
- Teams reassess reliability, reasoning, and constraints
- Quick re-grounding is essential before building further
0:44 – 1:11
Building at the frontier: fun, learning, and responsibility
Participants reflect on the unique feeling of working on frontier AI: it’s exhilarating but comes with real responsibility. The goal is to keep innovating while improving security and developer usability.
- Frontier work feels “insanely fun” and learning-focused
- Seen as a generational opportunity in tech
- Responsibility to push capability while staying secure
- Empowering a broader class of builders and developers
1:11 – 1:19
First tests: automated evals as the baseline gate
The first practical step with a new model is to run automated evaluations continuously in the background. This provides fast signal on performance changes across key tasks.
- Automated evals start immediately
- Background testing creates a steady performance pulse
- Evals help detect improvements and regressions early
1:19 – 1:41
Ambitious real-world target: drafting an S-1 with agentic workflows
A complex legal workflow, drafting an S-1 (the registration statement a company files with the SEC before going public), is presented as a “pipe dream” benchmark that is becoming more realistic. Agentic capabilities help models find information, synthesize it, and edit documents across larger sections.
- S-1 drafting used as a high-complexity benchmark
- Agentic models can gather needed info autonomously
- Synthesis and editing enable larger document chunks
- Progress suggests expanding end-to-end automation potential
1:41 – 1:54
Breakthrough feel: swapping the model makes the agent “just work”
One tester describes a dramatic reliability jump from simply replacing the underlying model. The agent shifts from intermittently stuck to consistently fast and accurate across questions.
- Model swap alone can produce step-change improvements
- Agent moves from inconsistent to dependable behavior
- Speed and accuracy gains are immediately noticeable
1:54 – 2:01
Measured gains: success-rate dashboards and step-function improvements
The team quantifies progress with dashboards tracking testing-agent success rate, citing a significant increase. Metrics provide the confidence to believe the new model is meaningfully better, not just “feels” better.
- Dashboards track agent task success rates
- Reported improvement of roughly 20%
- Quantitative evals validate subjective impressions
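
As a rough illustration of the kind of number such a dashboard might surface, here is a minimal sketch that compares task success rates for an old and a new model. The `EvalRun` record, the task names, and the results are all made up for this example; the video does not describe how the team's dashboards are built, nor whether the cited ~20% is an absolute or a relative gain, so the sketch computes both.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Outcome of one agent task under one model (hypothetical record shape)."""
    task_id: str
    passed: bool


def success_rate(runs: list[EvalRun]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(r.passed for r in runs) / len(runs)


# Made-up results for illustration only.
old_model = [EvalRun(f"task-{i}", i < 6) for i in range(10)]  # 6 of 10 pass
new_model = [EvalRun(f"task-{i}", i < 8) for i in range(10)]  # 8 of 10 pass

old_rate = success_rate(old_model)
new_rate = success_rate(new_model)

# Absolute gain is the difference in success rate (percentage points when
# multiplied by 100); relative gain is that difference scaled by the old rate.
absolute_gain = new_rate - old_rate
relative_gain = absolute_gain / old_rate

print(f"old: {old_rate:.0%}  new: {new_rate:.0%}")
print(f"absolute gain: {absolute_gain * 100:.0f} points")
print(f"relative gain: {relative_gain:.0%}")
```

The two figures can differ substantially for the same underlying results, which is why it helps to state which one a dashboard reports.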
2:01 – 2:15
Reading the tea leaves: when previously-failing evals start passing
They describe failed tasks as leading indicators of future model capability. When evals that “never worked” begin working consistently, it signals a special leap in model performance.
- Failures highlight where next models will improve
- New capability shows up as formerly-impossible evals passing
- Consistency matters as much as first-time success
- Passing hard evals suggests a standout model
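
One way to make "never worked, now works consistently" concrete is to run each eval several times per model and flag only those that had zero passes on the old model and clear a consistency threshold on the new one. The sketch below assumes exactly that; the eval names, trial counts, results, and the 0.8 threshold are hypothetical and not taken from the video.

```python
def pass_rate(trials: list[bool]) -> float:
    """Fraction of repeated trials that passed; 0.0 if there were none."""
    return sum(trials) / len(trials) if trials else 0.0


def newly_reliable(
    old: dict[str, list[bool]],
    new: dict[str, list[bool]],
    threshold: float = 0.8,
) -> list[str]:
    """Evals that never passed on the old model and now pass consistently.

    An eval qualifies only if it was run on both models, had zero passes
    on the old one, and meets the pass-rate threshold on the new one.
    """
    return sorted(
        eval_id
        for eval_id, new_trials in new.items()
        if eval_id in old
        and not any(old[eval_id])
        and pass_rate(new_trials) >= threshold
    )


# Made-up trial results for illustration only (five trials per eval).
old_results = {
    "draft-section": [False] * 5,                         # never worked
    "cite-sources": [False, True, False, False, False],   # worked once
    "long-edit": [False] * 5,                             # never worked
}
new_results = {
    "draft-section": [True] * 5,                          # now consistent
    "cite-sources": [True] * 5,
    "long-edit": [True, False, False, False, False],      # one lucky pass
}

print(newly_reliable(old_results, new_results))
```

Here only `draft-section` is flagged: `cite-sources` had already passed once on the old model, and `long-edit` passed a single trial on the new model, which is first-time success without consistency.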
2:15 – 2:35
Customer–Anthropic collaboration: high-touch iteration and trust
Working with Anthropic feels unusually collaborative, with frequent back-and-forth and engineering proximity. This relationship is framed less as a vendor purchase and more as co-building, supported by trust in quality.
- Frequent communication with Anthropic teams
- Feels like shared problem-solving, not transactional buying
- Engineering collaboration accelerates iteration
- Expectation of high quality: “not slop”
2:35 – 3:13
What the frontier feels like: dazzling, compounding, and wave-riding
The closing reflection likens frontier building to something dazzling—bright with opportunity but sometimes overwhelming. Progress compounds as tools improve products and products create feedback loops, all while bigger waves of change approach.
- Frontier work feels “dazzling” and occasionally blinding
- Opportunity and excitement are intense
- Progress compounds through tool→product→customer loops
- Metaphor of riding waves: balance now, bigger waves coming