- SPSpeaker
[on-hold music] Hello, everyone. My name is Yoav. I lead product at Base44, and going to join me on stage later on is Gabriel, who leads our AI. And we're going to talk about how Base44 scaled from a solo founder engineer all the way up to eighty engineers, and how Claude Code helped us facilitate that growth while maintaining our velocity. We split this talk into a short intro and then two phases, going from one engineer to fifteen engineers, and then going from fifteen engineers to eighty engineers. So let's talk a little bit about the first phase, which is mostly an intro to Base44 and our solo founder. So Base44 is a vibe coding platform, but this is a new term. Uh, a year ago, it was more thinking, "I wanna build a platform that will let anyone build software." Non-technical user, technical user, let's build up the speed. He started the, the platform at the end of twenty-twenty-four, and by twenty-twenty-five, he already had a working product, started building in product in, in... Sorry, building in public on LinkedIn and Twitter, gained a lot, a lot of traction, and by April twenty-twenty-five, the product was already profitable. That's the moment I joined, because money was starting to flow in, and it was getting a lot of traction. And because this was a profitable product, an AI-focused user base, and, uh, a crazy founder, it started getting the focus of a lot of, uh, companies and acquisition opportunities, which leads us to the next phase, which is our post-acquisition. So Wix has a very similar user base to Base44, and so they saw Base44 as a big bet, and they wanted to maintain the velocity of Base44, but it expanded dramatically. So we basically went from a two-member team into a fifteen-engineer team, and we needed to scale, and we needed to scale as fast as possible. And we had four major challenges. One is onboarding doesn't scale. We can't have Mor onboard each engineer to the team. Code review doesn't scale. 
Mor was really, really cautious about what goes inside the backend of Base44, so he wanted to review each PR on his own. We can't have each engineer sit with our beta tester to understand whether the product is working as expected, so we need to find a way to automate that as well. And an interesting part about the fact that you have, like, very, um, immediate product market fit is there's a lot of product surface you need to cover, whether it's integration, whether it's the agentic flow, whether it's the visual editor. There's so many areas, and you need the engineer to ramp up really, really quickly. So let's jump in. How do we solve each one of the challenge? And the key takeaway I want everyone to get come out of here, especially for those with small teams, is the fact that you need to keep everything very, very simple. Okay? The meetings when we tr-try to tackle those challenges would start with, "Hey, let's build this process where we review everything and then build an onboarding doc, and we'll do, like, a nightly that, that, uh, update that." We're thinking, "Actually, no, let's keep it very simple." Every new engineer that comes into the company, we'll give him a task to basically use two prompts before he starts working on his task. One, go over all the commits and tell me what everyone is-- what everyone cares about. Um, so after we were, like, three, four engineers and people started, like, building their knowledge in each area, uh, like, the fifth and sixth engineer came, wrote this prompt, and they already get, like, this map of the organization. And you don't need to kinda, like, think about, how do I keep, like, these onboarding docs updated as new engineers come up? No, a simple prompt gives you, in real time, the entire map organization. The second thing is, before you dive into each area, is basically ask Claude, "Hey, can you give me a mermaid chart of how this component works?" 
And again, this works in real time because, because everything keeps evolving, you don't wanna kinda, like, try, "Hey, I need to keep this document up to date. I need to keep this document up to date." No. Claude keeps it for you. Very, very simple. One prompt gives you everything an engineer needs to know in order to start working inside of Base44. The second thing is, as I mentioned, Mor was very, very cautious about what code goes inside our agent and what code goes inside the backend of Base44. So we needed a way to amplify Mor's PR abilities. So after about one or two weeks, we already had a big pool of PR comments Mor added inside our, our repo. So again, instead of kinda, like, sitting down and thinking of brainstorming, "Okay, what's the instruction that we need?" Let's have Claude review the PRs, say what's the most important things and what's the most crucial things we need to keep in mind while engineers are writing their, uh, new code. And we put it in the instructions, run it every couple of days, and have a Mor PR reviewer inside of Base44 without having to build a sophisticated and complicated process. The cool thing about it is when we really started to see kinda, like, velocity picking up. Okay? So one of the, uh, PRs that we kinda, like, remember and we keep referring to is we wanted to do a WhatsApp integration inside Base44, uh, to kinda, like, communicate with the agent using WhatsApp. And w-we've handed this over to a new engineer. We assumed a new engineer working on this kind of, uh, feature, it re- it requires an integration. It requires working on the agentic flow. It requires a new Meta API. We assumed it's gonna be, gonna be a one to two weeks, uh, endeavor. And it was really, really awesome to see that we gave that Thursday night. Sunday morning everything was ready. He onboarded on Thursday with, uh, using those simple prompts. He sent it over to PR. 
The PR model review had kind of like two, three small comments, and we were ready to move on, uh, to production. Okay. So w- we've, we managed to resolve most of the issues. Now we have the issue of how do we make sure that what goes into production, especially our agent, works really, really good for our customers? Previously bec- when we were a tiny team, we would just sit with customers and hear like how they interacted with Base44, but now we need to find a way to scale. And like almost every naive AI company out there, we said, "Hey, let's build an eval suite. We'll make sure that everything that comes out, we'll run it through our evals, it will work perfectly, and we'll understand what's going on." And I don't know if you tried to build an eval, um, mechanism before, but usually a 15 people s- uh, team is not ready for it. It's a much bigger endeavor. So we sat down and we said, "Okay, we already have a tremendous amount of traffic in production. How do we use that traffic in order to understand whether the model is working for our customers or not?" We have conversion rate, which is nice, but we want to understand whether the agent itself, especially for paying customers, is working as expected. So we started looking at the conversations and a very simple pattern emerged in that if you look at the conversation, when everything's working well, the user doesn't say anything. It just goes to the next feature, to the next feature, to the next feature. But when things start to break, that's when users get really, really loud inside the chat and say, "Hey, why is this broken? I can't believe it's not working." It's really, really easy to see and manifest the fact that things are broken. So we said, "Okay, we have a very strong single signal when things aren't working. Why don't we use that and leverage that?" And ask Claude using a simple model, using a Haiku model, to classify each message on whether the frustration level of the user is high or low. 
Once we have that, then every single version of the agent that we want to sh- that we want to release, we basically put a small percentage of the customers on that, uh, version, and we can track the, uh, the frustration level. And this works whether we're changing the infrastructure, we're changing the prompt, or we're changing the model, and we can understand whether this works as well as expected after the change for our users. And the key takeaway again is just keeping everything super simple without building a sophisticated process around it. Uh, like we hear a lot about, like, let's build an agent for this and, and, uh, agent orchestration, but when you're a small team, you have very simple ways of getting almost the same amount of value while keeping processes really, really, really lean. But when you scale from 15 to 80, it becomes a little bit of a different challenge, and that's what Gabriel is going to walk you through. Thank you very much.
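The frustration metric described above can be sketched roughly as follows. This is a minimal illustration, not Base44's implementation: the talk says a small model labels each message as high or low frustration, so the keyword-based `classify_frustration` here is a hypothetical stand-in used only so the sketch runs without any external service. The function names, the version labels, and the `tolerance` threshold are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-in for the model call described in the talk. A real setup
# would ask a small LLM to label each message; keyword matching is used here
# only to keep the example self-contained.
FRUSTRATION_MARKERS = ("broken", "not working", "doesn't work", "still failing")


def classify_frustration(message: str) -> str:
    """Return "high" or "low" for a single user message."""
    text = message.lower()
    return "high" if any(marker in text for marker in FRUSTRATION_MARKERS) else "low"


def frustration_rate_by_version(
    messages: Iterable[Tuple[str, str]],
    classify: Callable[[str], str] = classify_frustration,
) -> Dict[str, float]:
    """Share of messages labelled "high" for each agent version.

    `messages` is an iterable of (version, message_text) pairs.
    """
    totals: Dict[str, int] = defaultdict(int)
    high: Dict[str, int] = defaultdict(int)
    for version, text in messages:
        totals[version] += 1
        if classify(text) == "high":
            high[version] += 1
    return {version: high[version] / totals[version] for version in totals}


def is_regression(
    rates: Dict[str, float], control: str, candidate: str, tolerance: float = 0.02
) -> bool:
    """True if the candidate's frustration rate exceeds control by more than `tolerance`."""
    return rates[candidate] - rates[control] > tolerance


# Made-up sample traffic: a small slice of users is on the candidate version.
sample = [
    ("control", "Add a login page"),
    ("control", "Now add a dashboard"),
    ("control", "Why is this broken?"),
    ("control", "Make the header blue"),
    ("candidate", "Add a login page"),
    ("candidate", "I can't believe it's not working"),
    ("candidate", "Still broken, please fix"),
    ("candidate", "Add a settings screen"),
]

rates = frustration_rate_by_version(sample)
print(rates)
print("regression:", is_regression(rates, "control", "candidate"))
```

The same comparison applies whether the change under test is a prompt, a model, or infrastructure, because the only input is the stream of user messages tagged with the version that served them.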
- SPSpeaker
[audience applauding] Hello, everyone. My name is Gabriel, and I lead the app builder agent for Base44. I had a lot of time to watch Yoav behind the scenes, so I get little bit nervous, so... [gulps] So Yoav just told you about the first two phases of our growth, and last couple of months we reached a new point of growth. Like, we started hiring more externally, we had more internal movers moving from Wix to Base44, and then we even merged a different product working on Vibe Coding, and in one single night we doubled our headcount from 40 people to almost 80. And that brought a new set of challenges that we had to solve. So we had many new challenges. I'd like to focus on the three most interesting ones. Like, the first one is how do we do experimentation at scale? Now, Yoav just shared how we did the, uh, the frustration metric and how we A/B tested intro- in production, but you can't expect any new hire to understand exactly which KPIs to test, how long do we wanna test things, whether you can just be brave and, and ship it, and, like, not everything needs an experiment, right? So we knew we wanted to shift left product management decisions in A/B testing. So we also, uh, needed better evals. Now, again, back to what Yoav just said, we ha- we, we were b- uh, before in a point where we knew that evals is not the best, uh, ROI for us, but now it became, uh, something we really need to focus on. And the last thing is, how do you do QA, uh, QA properly in a company that's very, uh, consumer-oriented without growing your, uh, testers in a linear way with the other headcount? Let's start with experimentations, okay? So we had... We started with a general shell of what we wanted to have. Like, we knew we wanted a process that runs when a p- pull request is ready. 
We knew that eventually we want like, uh, a bot commenting on GitHub saying like, for a developer, whether she could or not just ship it, if she needs an A/B test, how long should it run, which KPIs, uh, does the experiment, uh, need to monitor? And we also wanted it to post, uh, to open the experiment on PostHog. That was like, the shell was the easy part, but we also needed the guidelines, the actual logic of how do, how do we work? Like, how do we pr- how do we operate? We never sat and, and articulated that. We didn't have a guideline committee. We just, like, had really good product sense and intuitions. So we had one option, like get a multi-stakeholder committee and, like, enter a lot of meetings, but we really hate meetings. So we figured out that our past actions, they could convey our guidelines in the best way possible. So we thought like, "Wait, we can just take like the hundred last experiments we had on PostHog, the matching pull requests, and distill our guidelines from that." So we spun up Claude Code, hooked it up to the PostHog MCP. PostHog is an, uh, A/B testing and experimentation product, pretty great product, by the way. Uh, and, and had Claude, uh, um, suggest the first iteration of the guidelines and it wa- it did a great job. It wasn't perfect, very rough around the edges, but we had like a working document we can just iterate on, and a couple of hours later we had like something working. Like, uh, uh, each pull request opened has like a clear verdict whether you can just ship it, do a gradual, uh, rollout, or do an A/B test and for how long. Some features deserve seven days of, of testing for our scale, some need to have a full, a full month because you might, uh, you might affect, uh, uh, c- conversion rate and, uh, premium, uh, rates in very little, uh, percentages. And to wrap all of that up, we needed a central place that everyone could just see what's going on. 
So it was a great opportunity for us to dogfood our own product, Base44, connect it to BigQuery, our data warehouse, to PostHog, to GitHub, to everything and have a central place where you can-- everyone could see which experiments are running, uh, uh, how they're, how are they, um, uh, w- how they're moving the needle if something's causing more AI costs, if something's reducing like, uh, uh, rate of published apps, like all the things we cared about. And this for now kind of solved us the problem and allows us to open up a new paradigm in how we, uh, scale our experiments. Okay, so the next part is evals. Like, this could be like a easily a full one-hour talk and maybe next year depending team here will even do that. But, uh, our challenge was very short term, like we needed something to give us real value. We didn't want it to be like a three months project and we didn't-- we couldn't afford, uh, um, taking our top AI engineers. We need them to work on features and improve the product. So we asked ourselves what do we really need to be- build? Do we wanna just evaluate the output of the model or do we wanna check correctness of the apps that our users are building? And eventually we had to build a user simulator. Now for Base44, when a user, uh, types in like a request, they want like a, a s- an app and some small part of it won't work, that doesn't mean that the eval, uh, fails. That was the great epiphany moment for us. It means that our eval suite needs to pipe back the rejection and, and ask the, our, our agent to, to, to fix the, the, the, the missing parts and then we ended up looking at, uh, latency, how many turns things took, how much every, uh, uh, how much it cost to us, how many credits we took for our users and we got into like a, a, a working CI/CD pipeline where any change in our AI code spins up real, a real Base44 app, uh, uh, instance and we use Stagehand to simulate us- real user actions. 
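The eval loop described here, where a failed check is piped back to the agent as a follow-up request instead of failing the eval outright, might look roughly like this. It is a sketch under stated assumptions: `run_eval`, `EvalResult`, `FakeAgent`, and `check_hello_world` are hypothetical names, the agent is a toy stand-in rather than a real Base44 instance, and the browser-driven checks the talk attributes to Stagehand are replaced by a plain function so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    passed: bool
    turns: int  # how many agent turns the scenario needed
    feedback_sent: List[str] = field(default_factory=list)


def run_eval(
    request: str,
    agent: Callable[[str], Dict[str, str]],
    check: Callable[[Dict[str, str]], List[str]],
    max_turns: int = 3,
) -> EvalResult:
    """Simulate a user: send a request, inspect the app, and ask for fixes.

    `agent(prompt)` returns the current app state; `check(app)` returns a list
    of problems, where an empty list means the app looks right.
    """
    prompt = request
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        app = agent(prompt)
        problems = check(app)
        if not problems:
            return EvalResult(passed=True, turns=turn, feedback_sent=feedback)
        # A partial failure is not an eval failure: pipe it back as a user message.
        prompt = "Please fix: " + "; ".join(problems)
        feedback.append(prompt)
    return EvalResult(passed=False, turns=max_turns, feedback_sent=feedback)


class FakeAgent:
    """Toy agent that gets the text wrong at first and corrects it when asked."""

    def __init__(self) -> None:
        self.app: Dict[str, str] = {"text": ""}

    def __call__(self, prompt: str) -> Dict[str, str]:
        self.app["text"] = "Hello World" if prompt.startswith("Please fix") else "Hello"
        return self.app


def check_hello_world(app: Dict[str, str]) -> List[str]:
    if "Hello World" in app["text"]:
        return []
    return ["expected text 'Hello World' is not visible"]


result = run_eval("Build a simple Hello World app", FakeAgent(), check_hello_world)
print(result)
```

Recording `turns` alongside pass or fail mirrors the metrics mentioned in the talk; latency, cost, and credits could be tracked on the same result object in the same way.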
Like if like there's like a automated QA engineer spun up in a small box. That's how we look at it. And this is how the internal app we built to support that looks like. Again, a great opportunity for us to dogfood our own platform. You can see here the example of like the, the most canonical eval we have is like the Hello World, uh, app. Like, uh, it, it, it doesn't mean that the app is doing like that Base44 is, is performing the way we want. It's like a smoke test. It's just making sure we didn't break anything. So the way we'll do that, we'll ask Base44 to build us a simple Hello World app, assert that the right text is visible and there's like it looks good and it's very subject- subjective but we trust, uh, AI on failing if not. Then we ask for a very small change, uh, text change and then we ask for a small feature and as you can see most of them just pass and fun fact, these eval will pass on the smallest model you can think of which is really cool. And of course we have many more complex evals. For example, we have scenarios where we start with an existing app and do many changes. We have scenarios where we get to, to check our compaction mechanism which is very complex and requires a lot of, uh, user messages. So this is kinda brought us to a new paradigm in, in, in evals. It's not perfect yet. We're constantly working on it, but it was like the right time, the right moment, uh, to, to build such a system. And the third thing I wanna share is how did we, uh, streamlined, uh, QA. So we do believe in shifting left quality of course like all of us like unit testing, end-to-end test, it's obvious everyone, uh, working at Base44 needs to have complete ownership of what they build. But most of the times you're working on really deep features that have a lot of edge cases. 
For example, imagine testing a feature that only affects users at a specific sub- subscription tier when they reach a specific point of their credit limits like and imagine your feature has a lot of permutations that affect that. Like it will be very tedious for everyone to test it manually and so that's a classic case where we would hand off to a QA engineer but then we'd have longer feedback loops and you have to wait for someone to be available and, and, and, and that wasn't ideal. We knew that Claude Code, uh, could operate a browser, right? Playwright MCP, Browser Use, like there's a ton of tools out there. But it was missing critical pieces of how to do it well. For example, each time it had, it had to relearn the platform, the selectors, the flows. Um, each time, uh, it had to, uh, um, understand which events to look for in, in, in our database and in Mixpanel. So we started wrapping, uh, our, our common, uh, flows in skills. For example, we have one skill that taught, uh, Claude Code, uh, combined with the browser how to, uh, go over all the major, uh, uh, user flows that, uh, most of the features will touch. And of course, for new features, like Claude Code can just understand what the feature does and how to get along. So we don't need to cover 100% in the skill. You need... You just need to maximize the 80% so you have enough context. It's like a, a, a thin trade-off between like the right abstraction level and what do you tr- just trust Claude Code. The second challenge we had is like how do you do, um, proper setup testing, like for, for your test? Let's take the, let's take the, the example from before, like when you wanna test this very specific edge case. Now, you could just click and, and do it very manually, like, like, like a QA engi- uh, a manual person, a, a human person like could just do the clicks, but that would be very, very slow, right? 
So what, uh, a good engineer will do, a QA engineer will do is just go to the database and override the, the, the, the setup so that they can just test that flo- that case. We needed, uh, uh, Claude to be able to do the same thing. So we created CLI tools that abstract our APIs and databases, uh, specifically for the use case of setting up tests. And we, uh, built skills that taught, uh, Claude how to use that, those properly. And eventually, we combined all of these, uh, efforts and skills into one like meta skill of how to do proper QA. And we got into a flow where a PR, a pull request opens, the agent triggers, it creates a test plan. Also, great opportunity to dogfood our own product. Sends it to an Base44 app, uh, starts testing, and reports back. And this is like how it looks like for a single test. Like you get screenshots. You can know what, what it tests, test, what it didn't test. Like f- sometimes I will get, uh, cases where like I know it, it, it, it's... I'm stretching the boundary of what it can do, and then it will just like write like, "I couldn't test that," and like surface the, the missing capabilities. But that works for 80% of the time, and it allowed us to shift left deep and edge case quality assurance and move faster. Okay. So that's all for the challenges, and I'd love to share a little bit about the, the, the, the common thread around all of, all, all of our like challenges and solutions. Like just as Yoav said before, we really value simplicity. Like we really think about like... We, we, we try to think about like bold and, and, and simple. And sometimes like we, we'll take like... We, we'll, we'll work very hard not to, to, to build complex things when they're not, when it's the not the right time. Evals is a great example. We hold it off until it was the right moment to build it, and then we went all in. The second thing is that taste is a big word, right? Like recently. 
Like everyone's talking about taste and like it's the last moat of us humans against machines. So I'm, I, I believe in that too, but I do think that you can encode a big chunk of your team's, company's taste, uh, by looking at your past actions. Like just... And, and that kind of pipes back to, to the, the memory talk from the last session where like you can just look at what you actually did like in the last, uh, week or so and understand what your guidelines are, like for, uh, code reviews, for, for A/B testing, you name it. The third thing is like if you're lucky enough to work on a product that can, that you yourself can use, uh, that's also like a, a huge win. Like I think the, uh, the team at Anthropic constantly speaks about how magical it is to be working on Claude Code, Cowork, and all the products suite, and how, wh- how you get the feedback and insight loop going like in a magical way. So if you can do it, like sometimes you have to stretch a little bit if you're working like on like, I don't know, on an finance app. But find ways to do it. It will be of value. And the last thing is that the bottleneck will keep moving. Like for example, for now, our current challenge is like, first of all, how to continue and scale all of the processes I've just shown, but also how do we do post-validation correctly? Like once a, a, a, a pull request reaches, uh, production, how do you make sure it's moving the right needle? For example, is a bug really reducing, uh, support tickets? You don't want a human to keep it on his head. Like, uh, is a, a feature really being used by users? Is it... Of course you want it to raise, uh, uh, business metrics, but not everything will, will show that fast. So we, we want to automate that. And that's it for today. Uh, we really appreciate you coming, and I really hope you found at least one thing you can take back to your company, organization. Thank you. [upbeat music]
Episode duration: 23:58
Transcript of episode VueeyKcquoA
