At a glance
WHAT IT’S REALLY ABOUT
How Base44 scaled engineering with Claude Code and simplicity principles
- Base44 grew from a solo-founder build to a profitable product, then rapidly scaled post-acquisition while trying to preserve founder-level velocity and code quality.
- To avoid heavy process overhead early, the team used simple Claude Code prompts to generate real-time onboarding context (org/code maps) and to codify PR review “taste” from existing reviewer comments.
- Instead of building a full eval suite too early, Base44 used a production-derived “frustration” signal—LLM-classified user messages—to gate releases and compare agent versions via controlled rollouts.
- After doubling headcount to ~80, the team shifted to experimentation-at-scale by distilling A/B testing guidelines from past PostHog experiments and automating PR-level shipping recommendations and experiment setup.
- They built more formal evals and QA automation by simulating end-to-end user behavior, wrapping common flows as reusable “skills,” and integrating browser automation plus CLI-based test setup to avoid linear growth in manual QA.
IDEAS WORTH REMEMBERING
Use real-time codebase-derived onboarding instead of brittle docs.
New engineers ran two core prompts: summarize recent commits/ownership to build an org map, and generate a Mermaid diagram of a component to understand architecture. This avoided constantly updating onboarding documents as the system evolved.
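The talk describes prompts, not scripts, so the following is only an illustrative sketch of the kind of commit data an "org map" prompt would be summarizing. It assumes the input was produced by `git log --name-only --pretty=format:@%an` (author lines prefixed with `@`, followed by the files touched) and treats the top-level directory as the "component"; the function names and that grouping rule are hypothetical, not Base44's method.

```python
from collections import Counter, defaultdict


def build_ownership_map(log_text: str) -> dict[str, Counter]:
    """Count commits-per-author for each top-level directory.

    Assumed input format: lines starting with '@' name the commit author,
    and the non-empty lines after them are paths changed in that commit.
    """
    owners: dict[str, Counter] = defaultdict(Counter)
    author = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator between commits
        if line.startswith("@"):
            author = line[1:]
        elif author is not None:
            # Hypothetical rule: the first path segment is the component.
            component = line.split("/", 1)[0]
            owners[component][author] += 1
    return dict(owners)


def top_owner(owners: dict[str, Counter], component: str) -> str:
    """Return the author with the most touched files in a component."""
    return owners[component].most_common(1)[0][0]
```

A summary like this is regenerated from the repository each time, which is the property the talk emphasizes: there is no document to keep up to date.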
Amplify a founder’s code-review “taste” by distilling it from prior comments.
Rather than designing an elaborate review process, they collected the founder’s PR feedback, asked Claude to summarize the recurring principles, and periodically refreshed the guidance so reviews scaled beyond a single bottleneck.
Prefer a simple, high-signal production metric before investing in heavy eval infrastructure.
At ~15 engineers they skipped a full eval suite and instead classified user frustration in chats as a proxy for agent regressions. Releases were compared by rolling out to a small cohort and tracking frustration deltas across changes (prompt/model/infra).
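As a minimal sketch of the comparison described above, assuming each chat message has already been labeled frustrated or not by the LLM classifier: compute the frustration rate for the control cohort and for the small rollout cohort, then look at the difference. The `CohortStats` type, the `should_promote` helper, and the 0.01 tolerance are hypothetical; the talk does not state the actual thresholds used.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    messages: int    # total user messages observed in the cohort
    frustrated: int  # messages the classifier labeled as frustrated

    @property
    def rate(self) -> float:
        """Share of messages labeled frustrated (0.0 for an empty cohort)."""
        return self.frustrated / self.messages if self.messages else 0.0


def frustration_delta(control: CohortStats, candidate: CohortStats) -> float:
    """Positive values mean the candidate version frustrates users more."""
    return candidate.rate - control.rate


def should_promote(control: CohortStats, candidate: CohortStats,
                   max_increase: float = 0.01) -> bool:
    """Hypothetical gate: allow the release if frustration did not rise
    by more than max_increase (an assumed tolerance, not from the talk)."""
    return frustration_delta(control, candidate) <= max_increase
```

For example, a control cohort with 50 frustrated messages out of 1,000 (5%) and a candidate cohort with 16 out of 200 (8%) gives a delta of 3 percentage points, which this sketch would block. A real gate would also need to account for sample size before trusting a small cohort's delta.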
When headcount jumps, automate experimentation decisions at the PR boundary.
With many new hires, they needed consistent guidance on when to ship, gradually roll out, or run an A/B test. Claude Code + PostHog MCP analyzed prior experiments and generated initial guidelines, then a bot-style workflow recommended KPIs and durations per PR.
Build evals around the product’s true success criteria, not just single-turn output correctness.
For an app-building agent, small failures don’t necessarily mean the experience is bad if the agent can iteratively fix issues. Their evals simulate multi-turn interactions and measure latency, turns, cost/credits, and whether the final app works.
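One way to picture these evals is as a record per simulated session rather than per model response. The sketch below aggregates the four measurements named above; the `EvalRun` fields and the `summarize` function are illustrative assumptions, not the team's actual schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRun:
    turns: int         # user/agent exchanges until the session ended
    latency_s: float   # total wall-clock time for the session, in seconds
    credits: float     # cost of the session in credits
    app_works: bool    # did the final app pass the end-to-end check?


def summarize(runs: list[EvalRun]) -> dict[str, float]:
    """Aggregate multi-turn sessions; success is judged on the final app,
    so intermediate failures the agent later fixed do not count against it."""
    if not runs:
        raise ValueError("no eval runs to summarize")
    return {
        "success_rate": sum(r.app_works for r in runs) / len(runs),
        "mean_turns": mean([r.turns for r in runs]),
        "mean_latency_s": mean([r.latency_s for r in runs]),
        "mean_credits": mean([r.credits for r in runs]),
    }
```

Tracking turns and credits alongside success makes it visible when a change keeps the final app working but only by taking more iterations to get there.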
WORDS WORTH SAVING
And the key takeaway I want everyone to get come out of here, especially for those with small teams, is the fact that you need to keep everything very, very simple.
— Yoav
No, a simple prompt gives you, in real time, the entire map organization.
— Yoav
We assumed it's gonna be, gonna be a one to two weeks, uh, endeavor. And it was really, really awesome to see that-We gave that Thursday night. Sunday morning everything was ready.
— Yoav
So we figured out that our past actions, they could convey our guidelines in the best way possible.
— Gabriel
And the last thing is that the bottleneck will keep moving.
— Gabriel
High quality AI-generated summary created from speaker-labeled transcript.
