CHAPTERS
Base44’s hypergrowth story and talk roadmap (1 → 15 → 80)
Yoav frames the session: how Base44 scaled from a solo builder to ~80 engineers while maintaining speed, using Claude Code as leverage. The talk is structured into an intro plus two growth phases: early post-acquisition scaling, then the later jump from ~40 to ~80.
What Base44 is: a “vibe coding” platform built in public and grown to profitability
Yoav explains Base44’s origin and rapid market traction. The product was built fast, shared publicly, and reached profitability within months, setting the stage for acquisition and rapid scaling.
Post-acquisition reality: preserving startup velocity while expanding team size
After the acquisition by Wix, whose user base aligned with Base44’s, the core challenge became growing the team quickly without slowing down the product. Yoav outlines the operational bottlenecks that appear immediately when a single founder becomes the gatekeeper for everything.
Scaling onboarding with two simple Claude prompts (no docs to maintain)
Instead of writing and maintaining onboarding documentation, the team used Claude to generate a “real-time map” of the org and architecture. New engineers run lightweight prompts that derive current understanding directly from commits and code.
Amplifying founder-grade code review using Claude-distilled guidelines
Mor (the founder) wanted strict control over backend/agent quality, but manual review doesn’t scale. The team used Claude to extract recurring review patterns from existing PR comments and turn them into continuously refreshed review instructions.
Proof of speed: WhatsApp integration delivered in days, not weeks
A concrete example shows how onboarding + PR review automation increased throughput. A new engineer delivered a complex WhatsApp integration over a weekend, with minimal review feedback, highlighting the velocity impact of the approach.
Production feedback at scale: frustration-based metric instead of heavy eval suites
With more users, the team needed a scalable way to detect agent regressions without building a large evaluation harness. They relied on a simple behavioral signal, user frustration in chat, and used Claude to classify frustration levels during rollouts.
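A minimal sketch of how such a frustration signal could gate a rollout, assuming chat messages are available per cohort. The label names, the tolerance threshold, and the keyword-based `keyword_classifier` are all hypothetical stand-ins; in the talk the classification is done by Claude, and the actual scale and thresholds are not stated.

```python
from typing import Callable, Iterable

# Hypothetical label set; the talk does not specify the scale used.
FRUSTRATED_LABELS = {"high"}


def frustration_rate(messages: Iterable[str], classify: Callable[[str], str]) -> float:
    """Return the fraction of messages the classifier labels as frustrated."""
    labels = [classify(message) for message in messages]
    if not labels:
        return 0.0
    return sum(label in FRUSTRATED_LABELS for label in labels) / len(labels)


def rollout_regressed(
    baseline: Iterable[str],
    candidate: Iterable[str],
    classify: Callable[[str], str],
    tolerance: float = 0.05,  # assumed margin, not from the talk
) -> bool:
    """Flag the rollout if the candidate cohort is clearly more frustrated."""
    return frustration_rate(candidate, classify) > frustration_rate(baseline, classify) + tolerance


def keyword_classifier(message: str) -> str:
    """Toy stand-in for an LLM call that would return a frustration level."""
    lowered = message.lower()
    return "high" if any(cue in lowered for cue in ("broken", "still not")) else "low"


if __name__ == "__main__":
    baseline_chat = ["thanks, that works", "add a login page"]
    candidate_chat = ["this is still not working", "it is broken again", "ok"]
    print(rollout_regressed(baseline_chat, candidate_chat, keyword_classifier))
```

Passing the classifier in as a function keeps the comparison logic independent of whichever model or prompt produces the labels.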
Transition to phase two: doubling headcount overnight (40 → ~80)
Gabriel takes over to describe the next scaling inflection: rapid hiring, internal transfers, and merging another product team. This sudden growth created new process needs around experimentation, evals, and QA that couldn’t be handled informally anymore.
Experimentation at scale: PR-time guidance on ship vs rollout vs A/B test
With many new engineers, product decision-making around experiments needed to be standardized. The team built an automated workflow that advises developers at PR time whether to ship, roll out gradually, or run an A/B test, and which KPIs and duration to use.
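The shape of such a PR-time recommendation can be illustrated with a small rule-based sketch. This is not the team's workflow, which relies on Claude and guidelines distilled from past experiments; the `ChangeSummary` fields, the rules, the KPI names, and the durations below are all assumptions chosen only to show the three possible outcomes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeSummary:
    """Hypothetical facts about a pull request."""
    user_visible: bool
    touches_core_flow: bool


@dataclass
class Recommendation:
    action: str  # "ship", "gradual rollout", or "A/B test"
    kpis: List[str] = field(default_factory=list)
    duration_days: int = 0


def recommend(change: ChangeSummary) -> Recommendation:
    """Map a change summary to a release recommendation (illustrative rules only)."""
    if not change.user_visible:
        # Nothing for users to react to, so there is nothing to measure.
        return Recommendation("ship")
    if change.touches_core_flow:
        # Placeholder KPIs and duration; real values would come from team guidelines.
        return Recommendation("A/B test", ["activation rate", "retention"], 14)
    return Recommendation("gradual rollout", ["error rate"], 3)


if __name__ == "__main__":
    print(recommend(ChangeSummary(user_visible=True, touches_core_flow=True)))
```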
Encoding product “taste” from history: distilling guidelines from past experiments
Rather than create committees and endless meetings, they used their own historical decisions as training data. Claude Code, connected to PostHog and GitHub, analyzed ~100 prior experiments to draft guidelines the team could quickly refine.
Central experimentation hub by dogfooding Base44 (connect GitHub, PostHog, BigQuery)
To keep everyone aligned, they built a central dashboard inside Base44 that aggregates experiments and their business/operational impact. This made experimentation status and outcomes visible across the org and reinforced dogfooding as a core practice.
Evals v2: moving from output checks to app-correctness with a user simulator
At ~80 people, evals became worthwhile, but they still needed near-term ROI. They reframed success as whether the agent can iteratively fix an app (not perfect output on first try), then implemented CI/CD evals that build real apps and simulate user actions.
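The reframed success criterion (the agent may fail at first but must be able to fix the app) can be sketched as a loop. The function names `build_app`, `simulate_user`, and `request_fix`, the dictionary used as an "app", and the `max_rounds` limit are hypothetical; the talk does not describe the real interfaces.

```python
from typing import Callable, Dict, Optional

App = Dict[str, bool]  # toy representation of a generated app


def run_eval(
    build_app: Callable[[], App],
    simulate_user: Callable[[App], Optional[str]],
    request_fix: Callable[[App, str], App],
    max_rounds: int = 3,  # assumed cap on fix attempts
) -> bool:
    """Pass if the app works, possibly after a bounded number of agent fixes."""
    app = build_app()
    for _ in range(max_rounds):
        failure = simulate_user(app)  # None means every simulated action succeeded
        if failure is None:
            return True
        app = request_fix(app, failure)  # feed the failure back to the agent
    return simulate_user(app) is None


# Toy stand-ins so the sketch runs without a real agent or browser.
def build_broken_app() -> App:
    return {"button_works": False}


def click_button(app: App) -> Optional[str]:
    return None if app["button_works"] else "button does nothing"


def agent_fixes(app: App, failure: str) -> App:
    return {**app, "button_works": True}


def agent_fails(app: App, failure: str) -> App:
    return app


if __name__ == "__main__":
    print(run_eval(build_broken_app, click_button, agent_fixes))
    print(run_eval(build_broken_app, click_button, agent_fails))
```

An eval written this way scores the first run as a pass because one fix round is enough, and the second as a fail because the app is still broken after all rounds.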
Practical eval examples: smoke tests (Hello World) and complex long-context scenarios
Gabriel shows the structure of their eval suite, starting with simple smoke tests to catch regressions early. They also run more complex scenarios, including multi-step edits on existing apps and testing their compaction mechanism under long conversations.
Scaling QA without linear tester growth: Claude Code browser skills + test setup tooling
To reduce manual QA bottlenecks, they taught Claude Code to execute repeatable browser-based test flows and to set up complex test states efficiently. They packaged common flows as reusable “skills” and created CLI tooling to manipulate state via APIs/DB for targeted edge-case testing.
Shared operating principles and the next bottleneck: post-validation in production
Gabriel closes with the themes connecting all solutions: simplicity, codifying “taste” from past behavior, and dogfooding. He notes that as bottlenecks shift, the next frontier is automating post-validation: ensuring shipped changes actually move the intended metrics after release.
