Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence
Stanford Online


For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai Follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/ In this CS153 “Frontier Systems” session, Anjney Midha welcomes Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, for a discussion on the visual intelligence frontier and how frontier AI “factories” scale. Blattmann recounts his path from mechanical engineering to a Heidelberg PhD lab, developing latent diffusion to train image generators efficiently and enabling Stable Diffusion’s 2022 release. They contrast earlier unimodal content-creation models with today’s push toward unified multimodal systems spanning images, video, and audio, plus action prediction for computer use and robotics, emphasizing observation and interaction loops. Using Flux as a case study, they cover pre-training, mid-training, post-training, distillation for speed, customer feedback driving image editing and character consistency, and why open weights enable customization. They also discuss Self Flow for multimodal alignment, safety guardrails, EU compliance, data labeling strategies, diffusion vs autoregressive tradeoffs, and skepticism about explicit 3D representations. Guest Speaker: Andreas Blattmann is the co-founder of Black Forest Labs (BFL), the German generative AI startup behind the FLUX text-to-image foundation model, backed by Andreessen Horowitz and other major venture firms. Before founding BFL, he was a generative AI researcher at LMU Munich, NVIDIA, and Stability AI, where he made significant contributions to image and video generation. 
He is a co-inventor of Latent Diffusion, the generative modeling technique that produced the open-source text-to-image system Stable Diffusion (which he co-developed) and now powers cutting-edge models, including FLUX, Midjourney, and OpenAI's DALL-E 3, with applications extending into audio generation and medical imaging. His academic publications have amassed over 22,000 citations. He was named to Capital Magazin's Top 40 Under 40 in Germany in 2024. Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Anjney Midha (host), Andreas Blattmann (guest)
May 4, 2026 · 1h 1m

EVERY SPOKEN WORD

  1. 0:07 - 4:29

    Course framing: frontier AI flywheels and today’s “visual intelligence” factory visit

    1. AM

      Quick show of hands, how many people recognize the song that was playing? One of my favorite songs called "Bella Napoli." It has been added to the, uh, CS153 Spotify playlist. For anybody who has music requests for CS153 this quarter, also known as AI Coachella, we've got an open playlist. Please feel free to add songs there. That one was a request from me in honor of our speaker today-

    2. AB

      [chuckles]

    3. AM

      ... who I'm very lucky to call a close friend, and is the co-founder of Black Forest Labs, Andreas Blattmann. Thank you for joining us, Andy.

    4. AB

      Thanks, Ansh. Thank you, everyone. Thanks for having me.

    5. AM

      Andy is joining us from Germany in a little town called Freiburg, which I think a lot of you will be hearing about more and more as it becomes a hub, uh, for frontier research in Europe. If you remember in our first lecture, right, we talked about the anatomy of frontier AI progress. And we talked about three or four important touchpoints in this class you're gonna be hearing about over and over again. One is that there's a, a transition happening from the old systems, the old infra stack, to a new one, right? And you gotta be open to understanding what those rewrites are looking like, and, and our speakers are gonna tell you which parts of the stack they're helping to rewrite. We talked about the basic AI scaling recipe, right? We've got two sort of loops that are important to run. Once you do, you get some compute, you get some data, you build a model, and then you do inference, right? That gives you revenue to buy more compute and then context feedback. We've talked about the bottlenecks, right, on that, on getting those loops scaling, which is context, compute, capital, and culture. We talked about context and, and compute. We'll talk a little bit about all four today. And then the last was, well, for your projects, which is the, the part I'm sure many of you are anxious about, is how do you get one of those scaling flywheels going? Right? And we talked about there being sort of three steps in the journey. There's an incubation phase, where you kind of figure out which specific part of the frontier you wanna attack with a state-of-the-art system. Right? Then you land with a SOTA release, a state-of-the-art release, and then that allows you to expand to more and more capabilities on the frontier that you care about. And if you remember, we did sort of a, a field trip into one of the frontier factories, right, um, in, in our first lecture, which was Anthropic. We talked about code as one domain. 
And today, we have a chance to do a field trip into another frontier AI factory in Germany called Black Forest Labs. And we've got here one of the factory owners, Andy Blattmann, um, who's the co-founder of Black Forest Labs, also co-creator of Stable Diffusion. How many people here have heard of Stable Diffusion? All of you. Perfect. Great. So you've done some homework.

    6. AB

      [chuckles]

    7. AM

      And so today we're gonna talk about the frontier. You know, last, uh, on Tuesday, we talked about the, the audio and the speech frontier, right? What is a- audio intelligence like? What was it? Where is it going with Mati from ElevenLabs. And today we have Andy talking to us about the frontier of visual intelligence, which I think is one actually-- one of the most exciting frontiers, if not the most critical frontier to unlock more progress in if we really want to get, um, these models to work in mission-critical contexts in the real world. And so we're gonna spend some time talking about the anatomy of visual intelligence as, as Andy sees it as one of the pioneers of the field. And then we're gonna talk, go back in time a little bit and zoom into how we bootstrapped the FLUX flywheel together a couple years ago. FLUX is the name of the flagship model family from Black Forest Labs. And then we're gonna spend some time on the fun part, which is future frontiers. Where are things right now that, that where are un- where are the unsolved problems? Where are we right now where you guys can step in and start co-creating this journey, uh, in the space. So this was the frontier factory, right? We talked about this is sort of the basic template. Again, to be clear, this is a directional heuristic. Every team is different, every research project is different. But to kind of give you a grounding sense of repeating patterns about how, um, some of the best teams are manufacturing intelligence repeatedly, remember this was the pipeline. We had, um, pre-training, mid-training, post-training with agents in the real world. There, there's a version of this that, that Andy's gonna walk us through, but before we jump into that, why don't we just spend some time on, on you, Andy. Who are you and how'd you get here?

  2. 4:29 - 7:16

    Andreas Blattmann’s path: from mechanical engineering to generative vision research

    1. AB

      Yeah, cool. Thank you, Ansh. Uh, thanks again for having me, everyone. Um, yeah, I'm Andy. Um, started looking into AI, I think in 2019. Um, I, I was actually originally studying mechanical engineering. It's classic German education, uh, I think. You go to a school and then you figure out you're kind of somewhat technical and what are you doing if you don't know exactly what, what to do. Studying mechanical engineering in Germany, right? Um, and then it, uh, yeah, through, through a couple of, um, I think coincidences, I got into computer science, into coding, into already robotics back in the days. We talk more about robotics, uh, later. Um, and applied at a PhD in Heidelberg, uh, where I met my two co-founders, Robin and Patrick. Um, and that was a really, like small lab. Everyone back in the day was doing representation learning with visual models, uh, or like for, for the visual domain and computer vision in itself back, that was 2019, was kind of a, a niche topic in this niche topic back then of AI. It was really like people saw the potential already, but, but no, no one, no one had an idea of how-

    2. AM

      Right

    3. AB

      ... how that would, uh, explode then later, right? So it was really, uh, kind of a, yeah, niche topic we worked on, but we soon had a very good intuition about like how to train models to generate pixels, mainly images back then. Um, and we were competing on a research level as a very small lab with players that were much larger than us. Uh, and finally that already back in the day... was Google and OpenAI, their research teams, and it was not about building frontier systems. It was-

    4. AM

      Foundation models

    5. AB

      ... yeah, or, or even before that, uh, who wrote actually the nicest paper to show that something was, was happening.

    6. AM

      Right.

    7. AB

      So back in the days it, it was like-

    8. AM

      This was pre, uh-

    9. AB

      That, that was pre-Stable Diffusion. That, that-

    10. AM

      Right

    11. AB

      ... that was, that was-

    12. AM

      20-

    13. AB

      Really it was-

    14. AM

      ... 19

    15. AB

      ... for the ones who remember it, StyleGAN was kind of-

    16. AM

      StyleGAN, yep

    17. AB

      ... the images were most often generated with GANs because they had a kind of a good inductive biases for, for, for kind of this data domain. Um, and it was generating a 256 by 256 pixels image was a challenge. Like, not every algorithm could do that, and yeah, it was just a very different world. So, um, we competed with labs that were much larger than us, and we had, even back in the day, way, way, uh, less compute. So we had to come up with kind of more, um, efficient algorithms to solve that problem because images, and now speaking of videos, are so much higher dimensional than other representations, say text or something. Text is, uh, much lower dimensional.

    18. AM

      And, and to anchor folks on time, th- you were still-- This was when you were at the University of Heidelberg.

    19. AB

      Exactly.

    20. AM

      Right. Yeah.

  3. 7:16 - 8:55

    Latent diffusion: compressing pixels to win with less compute (and the road to Stable Diffusion)

    1. AB

      Exactly. Um, so, um, yeah, and then we, we spent, like, two years investigating how can we actually find representations for natural data, for images, for video, um, mainly, that are perceptually equivalent to the pixel space or to what matters to us humans in the pixel space, but much lower dimensional and much more efficient because we didn't have the compute to train a kind of generative model on the pixel space. And it's also super wasteful, and that was what gave rise to a, a series of papers on latent generative modeling. So you actually train a kind of a compression model, um, similar to a learned JPEG codec, you could imagine it, to find that ex- perceptually equivalent representation to the, uh, pixel space, and you train the generative model there. And that, um, helped us saving tons of compute, training our models much more efficiently, and with orders of magnitude less compute than our competitors put out, like, better, uh, li- like, models that were on par or even better than those competitors. And that was what-- That algorithm, latent diffusion, also gave rise to Stable Diffusion then. Um, so we proposed the algorithm, saw the potential, set out to search some compute, luckily found that in the open source community, um, and trained Stable Diffusion that was then released in 2022. Um, and pretty much surprised us as well, like, with all the hype it got. And actually it was, it was fun. It was here in the Bay Area it was hyped much more than in Germany. In Germany, still today, not a lot of people know about that model, funnily.
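
As a rough illustration of why training in a compressed space saves compute, the sketch below counts how many values a model has to handle per image in pixel space versus a latent space. The 512x512 RGB input, the 8x spatial downsampling, and the 4 latent channels are assumptions chosen to match the configuration commonly reported for Stable Diffusion v1; none of these numbers are given in the talk, and the helper names are hypothetical.

```python
# Counting values per image in pixel space vs. an assumed latent space.
# Assumptions (not from the talk): 512x512 RGB input, an autoencoder that
# downsamples each spatial side by 8x and outputs 4 latent channels.


def num_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4):
    """Shape of the latent produced by the assumed compression model."""
    return (height // downsample, width // downsample, latent_channels)


pixel_values = num_values(512, 512, 3)
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_values = num_values(lat_h, lat_w, lat_c)
compression = pixel_values / latent_values

print(f"pixel space : {pixel_values} values per image")
print(f"latent space: {latent_values} values per image ({lat_h}x{lat_w}x{lat_c})")
print(f"reduction   : {compression:.0f}x fewer values for the generative model")
```

Under these assumptions the generative model sees far fewer values per image, which is the efficiency argument made above. The count says nothing about perceptual quality, which depends on how well the compression model is trained.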

  4. 8:55 - 11:41

    The Stable Diffusion inflection point: when generative vision became mainstream-legible

    1. AM

      Yeah. It-- Uh, there wa- there was a moment I remember, DALL-E 2 was in preview, I think, and, and then you guys put out, uh, Stable Diffusion. And I remember on Reddit there wa- somebody had sketched out, uh, they'd taken one of their kids' like, uh, drawings. It was like a crayon drawing, and had turned it-- had run it through the image-to-image-

    2. AB

      Yeah

    3. AM

      ... transfer on SD, uh, I think it was for SD 1. Um, and it, and out, out had come this beautiful illustration. And I remember taking a screenshot of that 'cause I was just blown away, and I tweeted it. And I think it was like a Monday morning, we went into our exec meeting at Discord, and then I came out for lunch and the tweet had like 3,000 or 4,000 likes. And it was, it, like, uh, uh, for me, it was a moment where I realized that the technology of generative modeling at that point had crossed an inflection point where it suddenly became legible to people outside the machine learning community because it was so visual.

    4. AB

      Yeah.

    5. AM

      Right? I, I think it, it might be worth spending a couple mo- minutes here to just take people back in time because at, at this moment in time in the ML community, I would say there, there was a bit of a dogma that language modeling was the be-all and end-all of intelligence. You know, the general consensus at the time was that language is the interface to reasoning, to, to, for intelligence, which are the way humans reason about intelligence, the way we think is through language. And I would say that's a philosophical belief that has come and gone in its sort of the strength of its religious zeal. Um, but for those who were in the computer vision community, and I, I count myself as one of those because my last company, as we've talked about, was Ubiquity6. It was a 3D mapping and computer vision company. We were working on 3D reconstruction. It, it was clear that language is, was extraordinarily valuable, don't get me wrong, at, at, at reasoning about certain tasks and fields. But for those of us who were in the computer vision community, it, it, it felt incomplete because language is, is just one way we communicate, one way we reason about the world. For those of you who are visual thinkers or visual intelligence, who believe in multiple intelligences, right? You just learn better when you see visual representations of things. And so it was quite, um, cool to see Stable Diffusion coming out and make progress of a different kind legible to both the machine learning community as well as the, the broader developer community, the broader consumer community. And, and that's when I think we, we reached-- Uh, you know, we started working together 'cause we were trying to get some of these, um, Stable Diffusion-like capabilities onto Discord. But can you talk about it-- You, you know, you said two things that I think are quite helpful to overlay for the students here, which is the difference between natural and unnatural representations.

    6. AB

      Yeah.

    7. AM

      Could, could you speak about that for a sec?

  5. 11:41 - 14:52

    Natural vs unnatural representations: why video/audio matter for foundational intelligence

    1. AB

      If we think about ourselves, if everyone, uh, you, you look at me currently, hopefully, uh, and, and, um, the medium through which you are ex- perceiving this is clearly video and audio, right? You hear what I'm saying, and you're seeing me, um, gesturing here or, or talking to Ansh. Um, so th- these are, these are what we, what, what I call natural representations. If you think about the source of those representations... Eventually it's the sun. Or here we have some, some lights that try to resemble what the sun, su-sun does, but it's electromagnetic waves of a source that we humans cannot control. We can shape obviously the, the-- or we, we, we can, we can control the shape of this world and we can build buildings, but this, the electromagnetic spectrum that falls onto the Earth, we cannot control. Same for sound, like natural signals like, uh, uh, hearing a river flow or something, that's just... You-- Some might, some, some might call it noise or something. That's just natural and it's there. Whereas text is inherently human-made. You see this in so many different, um, occasions. If you just measure the, the information per sign that text transports, it's so much higher than the information per sign, per pixel in an image. And why is that? Because it's human-made. It was evolutionarily very important for us to communicate, uh, efficiently. Um, and there, there's-- I think that that's also at the heart of why we need to compress images and videos before we train a generative model on them, because there's so much redundancy in, in it. In text, you don't have this redundancy because it's human-made.

    2. AM

      Right.

    3. AB

      Throughout evolution, we reduced that redundancy and, um, and made it efficient. For learning, however, it's super important, at least in how I see it and how we see it at like Black Forest Labs, to consider two things. First, if you think about yourself as, as babies, how you learn, it's first observing things, hearing and seeing, and then interacting with things in the physical world, right? This is pretty much the first, I would say, three, four, five years. I don't know when I learned reading, but it's I think w-maybe with five or something. And just the level of intelligence a three-year-old has compared to the level of intelligence a, a language model has is very different, right? And I think that's what, what-- why we care so much about natural representations like audio and video because we th-we are like absolutely convinced that this will be the fundament of all the kind of higher intelligence that these systems will eventually have. And starting from language and trying to, to stack up a b-bit of additional, uh, kind of representations on top of that is, in my, um, kind of opinion, the, the, the wrong way. You should start with from first principles, how we humans do it, and that's clearly learning on natural representations by first observing, and second, we'll talk about that later, interacting. These are just from how we think about it, the main pillars of learning and also the main pillars of what we define as visual intelligence, actually.
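
The redundancy point made earlier in this segment (pixels carry less information per symbol than text) can be illustrated with a toy experiment. The sketch below is an assumption-laden stand-in, not a measurement from the talk: it uses a generic lossless compressor (`zlib`) as a crude proxy for redundancy, a synthetic grayscale gradient in place of a real photo, and one short English sentence in place of a text corpus. Real images are less redundant than this gradient, and learned perceptual compression works very differently from `zlib`.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher = more redundant)."""
    return len(data) / len(zlib.compress(data, 9))


# A short English sentence as the "text" sample (illustrative choice).
text = (
    "Text is inherently human-made. It evolved so that people could "
    "communicate efficiently, so each symbol carries a lot of information."
).encode("utf-8")

# A synthetic 256x256 grayscale image: every row is the same 0..255 gradient.
# This is a deliberately smooth toy signal, not a natural photograph.
width = height = 256
image = bytes(x for _ in range(height) for x in range(width))

text_ratio = compression_ratio(text)
image_ratio = compression_ratio(image)

print(f"text : {len(text)} bytes, compression ratio {text_ratio:.2f}")
print(f"image: {len(image)} bytes, compression ratio {image_ratio:.2f}")
```

In this toy setup the smooth image shrinks far more than the sentence, which matches the direction of the argument: dense pixel data leaves much more room for compression before a generative model is trained on it.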

  6. 14:52 - 19:16

    From unimodal content creation to unified multimodal models and physical AI

    1. AM

      So I, I, I think, um... So th-this is pretty important because two years ago, three years ago, I would say the consensus was that the way to do generative modeling was roughly this, right? Where you had this, these foundation models that were unimodal. They were text to image, text to video.

    2. AB

      Looking at Stable Diffusion as a text-

    3. AM

      Stable Diffusion was one.

    4. AB

      Text to image model.

    5. AM

      Yes.

    6. AB

      Exactly.

    7. AM

      Unimodal based on images. Yeah, but could you, could you just talk about what this, the, what the state of the art was then-

    8. AB

      Yeah

    9. AM

      ... versus now?

    10. AB

      Yeah. So yeah, Stable Diffusion, I think it's a perfect example of that. Um, it was a text to image model. You could, you could do super nice kind of artistic things that have not been possible before, but it was clearly made for content creation, right?

    11. AM

      Right.

    12. AB

      It was a, a unimodal model made for the purpose of content creation. You could, you could, yeah, make artistic style transfer. You could, you could do-- you could tr-train a LoRA and, and do maybe some, some kind of, um, character consistent marketing transformation or like, yeah, character consistency into the-- get, get character consistency into the model, then use it for marketing or something. But that's all content creation. We currently see that visual models starting to become way more than that. We don't train a single unimodal model a-anymore to just like fulfill the purpose of content creation. We're training a, a unified, a uni- a multimodal model for natural representations or natural data that then can give rise to s-so much more. It's about physical AI, it's about robotics, computer use we can do with these models. We had a couple of demos already like-- or there were recently a couple of demos, uh, that were super impressive. We can do world modeling and simulation and still content creation. Um, but combining different natural representations and not only training on one is the key ingredient because it will give the model a much more natural understanding. As one example, if I, if I just see two s-rigid bodies colliding, I al-always have a sound attached to it, right? There's a correlation between that sound happening and a certain action in physical, in the physical world happening. And being able to observe this correlation for a model is super important because it will help it un- that model understand much better what's actually going on. Whereas if I only train at, at one single modality, it's much harder to, to kind of understand what's going on. Or-

    13. AM

      Right

    14. AB

      ... just interacting w-with this bottle, I think it's super hard for a model to understand what's actually going on if it's not, if it's, it doesn't hear that sound. How would that be different for that kind of transparent body compared to, to someone, um, putting their hand through water or something, which is also transparent, right?

    15. AM

      Right.

    16. AB

      So, um, these correlations between different natural data representations are super important for a model to learn kind of at a higher, um, representation of intelligence as well.

    17. AM

      Now, this, this idea, you know, for those in the machine learning community, is not new. I mean, for a while there was an i-- there was a sense that the progression of technology would be we'd have sort of state-of-the-art systems that were capable at individual modalities.

    18. AB

      Mm-hmm.

    19. AM

      And then at some point, to make them smarter, we'd have to give them the ability to reason across different domains. You know, s- uh, transfer learning, so to speak, where you can reason about... the physics of, uh, of, of, of this bottle hitting that and, and the s-sound, the audio emerging. Um, but you can't start with everything on day one. And so could you just state-- Let, let's, let's talk about how we bootstrapped the flywheel, 'cause n-now today, fast-forward two years ago, you know, or four years after Stable Diffusion came out, you know, Flux is now, uh, used by millions of people around the world. You guys have hundreds of millions in revenue, blah, blah, blah. But for the purpose of the students, I think it's helpful to zoom back in time and say, okay, you guys had this clear thesis for eventually models will be good enough at reasoning about all kinds of-- all of visual intelligence, but you have to start somewhere.

    20. AB

      Yes.

    21. AM

      Especially when you have less resources than, than the largest companies in the world. You're a smaller team. You have less data. So can we spend a little bit of time talking about how did you concretize where to start?

    22. AB

      Yeah.

    23. AM

      And then how did you initialize that kind of momentum flywheel we talked about from at day one?

  7. 19:16 - 21:05

    Bootstrapping the flywheel: focus, product wedge, and building Flux.1

    1. AB

      Yeah, absolutely. Yeah, I think, uh, that's one of the most important things when starting a company. Focus matters a lot-

    2. AM

      Well, or any research project, right?

    3. AB

      Yeah, yeah, yeah.

    4. AM

      At the time, actually, SD was not even a company.

    5. AB

      Yeah, yeah, yeah.

    6. AM

      It was just a project.

    7. AB

      Absolutely. Absolutely. But l- I think I wanna-- As an example, I wanna, I wanna take how we started the company, because-

    8. AM

      Sure

    9. AB

      ... there we, we had this kind of huge experience in, uh, image generation, unimodal image generation. We've done Stable Diffusion, then we've worked for Stability AI, put out a couple of, uh, more models on that domain, and we pretty much had the recipe to kind of train a frontier model for that domain.

    10. AM

      Right.

    11. AB

      So when we started the, the, the company, we clearly-- or we looked at the field and we said, "There's clearly a need for a next generation of image models, because so far the models cannot, say, produce hands that are, that are actually having five fingers," right? That, that was back in the day a thing. So we attacked that specific field and said, "Okay, we wanna be building a model for, for-- specifically for image that is just 10X better than everything else." And that's what-- That-- Then we sat down together three months. We had all the recipes. We knew what to do. We scaled it, and what came out of that was Flux.1-

    12. AM

      Right

    13. AB

      ... that initially had kind of product market fit, you could say. We, uh, even before we took our API public, we had a couple of very large customers, uh, that, that kind of helped us close the feedback loop. Now talking about the feedback loop, because obviously on-once you can build the technology, but setting that technology out to solve real-world problems will give you the very important kind of data to actually learn from, first, what is an important problem to work on, and also how to make the model better for that specific problem, right?

    14. AM

      Yeah.

    15. AB

      By that you, you have the first kind of, uh, loop closure for the flywheel.

  8. 21:05 - 24:13

    BFL’s training pipeline in practice: pre-training → mid-training → post-training → real-world feedback

    1. AM

      I mean, l-let's, let's break that, um, that release down. Flux.1, I think this is the kind of pipeline we talked about, right? So could you just go through sort of the BFL version of this and explain what's going on at each step within the, the company of the BFL sort of pipeline?

    2. AB

      Yeah. So I mean, th-this, this is particularly for, um, for how we would define visual intelligence now, but I think I can also, for Flux.1 it was clearly we trained only on unimodal, like, like text, text and image, right?

    3. AM

      Right.

    4. AB

      Only on those representations. So the pre-training was just a large corpus of text and image. For the mid-training, we added, um, higher resolution, um, and like a couple of more capabilities into. And then we had this kind of post-training phase where we exposed the model to-- F-First we did a kind of offline post-training. Before you release an initial model, do some, uh, distillation to make the model more efficient. You, you align it with, uh, your intuition about what customers would care about.

    5. AM

      Right.

    6. AB

      And then you expose it to kind of the real world, but then you get this feedback. And for, for Flux.1, a very interesting, um, observation we made was, oh, wow, so many people are using our text-to-image models to actually train a LoRA and then do character consistency. Like they, they want, they, they wanna, they wanna have the ability to control the model with more than only text, because text is obviously nice and easy and low-key. Everyone understands it. Everyone can use it. But it's also very ambiguous if you like-- And, and again, the, the-- there's a kind of disconnect between this kind of artificial representation text and, and the natural representation image. So if I say an image of a blue bird, there are infinitely many images that, that give rise to this kind of, um, description, right? The bird could be sitting on a branch, the bird could be flying, and so on and so forth. And it's actually super hard to apply precise control to, um, image-- to, to the image you want to be generating. So th-that was, that was-- I think that's a perfect example of the benefit of the loop closure, because we-- what we learned was, okay, people want to actually do image editing.

    7. AM

      Right.

    8. AB

      So what did we do? We did a post train on-- partially based on the data we got, partially based on new stuff to create an image editing model, which was Flux.1 Kontext. That came out I think pretty exactly a year-- uh, b-bit, bit less than a year ago. Um, and that was the first image editing model where you could actually, in a scalable and fast way, get character consistency. So I-- now I could take a, an image of you, Ansh, and say, uh, and, and maybe of me, and combine us two sitting together not in a lecture hall, but, um, in a cafe having, having a chat. And that's, that just has massive potential for everything content creation, right? Marketing needs it to, to like get, get different product, different products into different contexts. And it just like supercharged or currently is supercharging like a lot of different applications around the creative world.

  9. 24:13 - 31:36

    Flux.1 Kontext: solving character-consistent editing and out-iterating better-resourced labs

    1. AM

      Yeah. So th-this may not be ob-obvious, but I wanna pause here because A- because, you know, Andy did this quite naturally, but... For those of you who, who were trying out AI image models, let's say 18 months ago, how many of you tried giving it a photo of yourself and then saying, you know, "Give this person a hat," and it came out actually looking like you? Yeah, no hands are going up. One, one hand's going up. It was a pretty basic capability gap. These image models just didn't have character consistency, right? You would just give it like a photo of yourself and say, you know, "Give him a mustache" or whatever, and out would come somebody looking not like me. And, um, that, you know, that, that-- for, for many people in the space, that was just a, I actually can't... I-if I had a dollar for every time, uh, people s- would, would say, you know, "That, that's just, that problem will not get solved. Like these models are so dumb." Like, "Look, AI's so dumb, it will never..." Like, "It, it'll never be able to surpass that capability threshold." Um, I, I, I would just sit there... And the, and these are very smart people, including, by the way, some of the speakers in this class, who over the years have realized they had to update their priors about the, the speed at which you can update these capabilities. But it was co-common consensus at the time that these image models are just not gonna get that good. You know, AI's dumb. It can't reason about the, the way that humans do, humans do about cre- like, you know, faces and specific characters. And I was just, it w- I-- it was shocking to me how very smart people would very confidently proclaim in the industry that that was just not gonna get solved. 
Meanwhile, [chuckles] you know, here we were in Freiburg, um, looking at the data where there were people using Flux.1, which was not very good at character consistency, but that, that context feedback of seeing how the prompts people were trying to j- to use with the model and the f- the feedback of them saying, "Actually, that, that was not good. Can you please try, you know, doing this better?" The, the, the multi-step sort of reasoning chain that we were getting from seeing people out in the wild using it, and it was an open weight model, which we'll talk about in a second, which is quite unique, gave us a very clear path actually to improving the capabilities. And then actually it was w- you know, one of our team members, Dustin, uh, in, in SF, who figured out that, you know, we should just make, uh, an update to this that's, that's called Kontext with a K, 'cause that's the German way to pronounce it, um, that is specifically an editing model. And I think b-between the insight, that insight, which w- was at an off-site in Spain. Where, where were we? We were in-

    2. AB

      I think it was in, um, in Italy.

    3. AM

      In Italy. We were in Italy. You know, I think DALL-E... No, GPT-1 image had just come out.

    4. AB

      Yeah.

    5. AM

      We were literally all together.

    6. AB

      Yeah.

    7. AM

      And there was a sense... you know, this is an important thing: as a new team or as a first-time researcher, it can be quite daunting when some lab that has way more resources than you launches something that looks way better. And your first intuition is to go, "Oh my God, we're [censored]." But you gotta remember that the mark of a good leader is to not panic, keep calm, look at the data, assess the landscape, and then come up with a plan step by step. And often you'll notice that if you're good at mapping the domain, and you wanna be an expert, somewhere in your intuition you have a gut feeling telling you there are actually some unsolved problems still left. And Dustin did a great job at that. The team rallied. I think within 24 hours we had redone the staffing on the team, and I think, what, 60 days later, Kontext came out?

    8. AB

      Yeah.

    9. AM

      Right? And revenue from Kontext doubled, I think, within six weeks. In fact, I think soon after is when, and this part is public now, Meta announced a partnership with BFL and said they were gonna be using Black Forest models. This tiny team out of Germany. I think the team was, you guys were like 25 people?

    10. AB

      Yeah.

    11. AM

      To drive image editing for all 2 billion users of Facebook, Meta, and so on. I mean, this is not normal, right? I was lucky enough by this time, I'd been an investor with you guys for about a year and a half, so we would go to these off-sites together, and I'd had a chance to see in real time how the systems problem often is not just a technical problem, 'cause actually all the data was available to us, the context was available to us. It's the human system of organizing the team and the research culture in a way where you're not panicking, but you're still assessing the frontier methodically and being very honest with yourself about how fast capabilities are moving and where you can uniquely contribute. That is the key to keeping that loop going over and over again. And I think that's why BFL is where they are today. They went from zero to several hundred million in revenue. The company is now worth more than $3 billion. But it can be easy to forget that that wasn't always the case, and there are these moments in your journey, especially in machine learning where things change so fast, when it's very tempting to just give up and say, "You know what? This problem is solved." It is remarkable to me how many teams in image generation no longer exist because they just gave up, and instead BFL stayed persistent and today is one of the only leaders left, I would say an independent leader that's pushing the visual frontier. In fact, I hope some of the projects here push the frontier too. But a learning for me has been that it's actually quite straightforward sometimes, technically, to keep advancing the frontier if you have the right leadership.
But a drift in your mission, a lack of conviction that it's worth attacking the problem you're committed to over and over again in the face of crazy challenges, is often just the difference between success and failure. How many people have seen that meme of the guy tunneling and giving up right before... Yeah, you guys know that meme. We'll put it in the class reading list. I'm dating myself with-

    12. AB

      [laughs]

    13. AM

      ... with boomer memes here. But I can't tell you how many times it's felt like that at BFL, and then one release later, right, the world has changed.

    14. AB

      And that's actually, I think, also a good segue into what's next. Now we're seeing these models being applied everywhere for content creation. But again, you need to look forward and think, "Okay, what's next?" And clearly we now see this insane potential, especially of these combined multimodal video/audio/image models, for the capabilities we just talked about, right? Physical AI, computer use-

    15. AM

      Right

    16. AB

      ... world modeling and simulation, and still also content creation. And you can actually build a single model that is capable of doing all of that together, and you will get compounding effects, as in this example with the correlation of noise and action in the physical space. You can also make models that are much smarter at generating regular images or video-

    17. AM

      Right

    18. AB

      ... footage for, say again, advertising or something.

  10. 31:36-43:50

    From pixels to actions: adding interaction, verification, and robotics-ready learning loops

    1. AM

      Well, could you talk a little bit about that? Actually, could you talk about how you take an image or a content creation pipeline and add to it what you just talked about, the ability to actually interact with the physical world and learn from the physical world? What does action prediction mean? How is that done?

    2. AB

      Yeah.

    3. AM

      And then maybe you can talk about Self Flow a little bit since we're gonna be assigning that as reading.

    4. AB

      Yeah, yeah, yeah, absolutely. So yeah, I think first you need to go from, we already talked about this, unimodal to multimodal.

    5. AM

      Right.

    6. AB

      And then you get this kind of... yeah, this is this one. If you go back to the slide here, I think this is a good one. So there's a large pre-training on, again, natural representations. These are the representations we humans learn from in the first years of our lives. And there you just combine everything together, and this combined pre-training gives you a very, very general model. So pre-training for us means images, video, and audio, combined with an architecture, or rather an algorithm, that we also published at the beginning of March, Self Flow, which allows the model to actually get compounding effects by observing... Again, I won't make this example again; you saw it a couple of times already. By observing correlations that exist between those modalities. That gives you a very, very general representation. What we add next in mid-training is additional context. We do new tasks such as conditioning: I can condition a model on an input image and an audio track, and I say I wanna hear Anj saying XYZ in that voice. The model does this. This is additional context, but importantly, for extending the scope beyond pure content creation, you also wanna condition the model on actions, and you wanna have the model predict actions. And then we can arrive at models like these computer use models, for instance, that are conditioned on a video or an image and predict the next move, as keystrokes or something, to achieve a certain task. Say I wanna open a new browser tab or something, right? So this is crucial to expand the scope of this very general representation that we get from pre-training. We wanna make use of the generality of the representation. So we add additional context and, importantly, actions.
And what we then do, this is very important... Or maybe, to zoom out a bit more and come back to the human learning example: pre-training, mid-training, this is all still observation. All the algorithms that we're currently training foundation models with in the early training stages are models observing examples. We calculate a loss from that. We backpropagate that through the network, but there's no interaction whatsoever.
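
As a rough illustration of the "observation only" stage described in this turn (an editorial sketch, not something shown in the talk): the snippet below learns to predict the next action from logged (observation, action) pairs without ever acting in an environment. The observation labels, action strings, and the count-based "model" are all hypothetical stand-ins; a real system would compute a loss and backpropagate through a neural network instead of counting.

```python
from collections import Counter, defaultdict


def fit_action_predictor(logged_examples):
    """Learn observation -> most frequent action purely from logged examples.

    This is a count-based stand-in for supervised training: the learner only
    observes data and never interacts with an environment.
    """
    counts = defaultdict(Counter)
    for observation, action in logged_examples:
        counts[observation][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


# Hypothetical logged computer-use data: what the screen showed, what was pressed.
logged_examples = [
    ("browser_focused", "ctrl+t"),
    ("browser_focused", "ctrl+t"),
    ("browser_focused", "ctrl+w"),
    ("desktop_visible", "click_browser_icon"),
]

predictor = fit_action_predictor(logged_examples)
print(predictor["browser_focused"])
```

Nothing in this loop lets the learner test its predictions against the world, which is the gap the speaker turns to next.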

    7. AM

      Right.

    8. AB

      So how do we actually get the model to really interact in the physical world? That's super important for learning higher forms of intelligence, as we are all convinced. So what do we do? We take this model that can, given a video, predict an action to do something, and hook it up in the real world on, say, a robot, right? And that allows this model to, through a robot, interact with the physical world and create data from that, and we can pipe that back into the model training, and that's when we close this feedback loop. So our post-training means interacting with the physical world.
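
The feedback loop described here can be sketched in a few lines (an editorial toy, not BFL's pipeline). A fixed policy proposes actions, a simulated one-joint "robot" with hard limits executes them, and the resulting transitions are stored for a later training step. The joint limits, the policy, and the `fit_reachable_range` stand-in for training are all assumptions made for illustration.

```python
JOINT_MIN, JOINT_MAX = -1.0, 1.0  # hypothetical physical limits, unknown to the policy


def robot_step(angle, action):
    """Execute an action; the physical world clamps the result to the joint limits."""
    return max(JOINT_MIN, min(JOINT_MAX, angle + action))


def collect(policy, steps, angle=0.0):
    """Let the policy act and record (state, action, next_state) transitions."""
    buffer = []
    for _ in range(steps):
        action = policy(angle)
        next_angle = robot_step(angle, action)
        buffer.append((angle, action, next_angle))
        angle = next_angle
    return buffer


def fit_reachable_range(buffer):
    """Stand-in for training: estimate which angles are physically reachable."""
    reached = [next_angle for _, _, next_angle in buffer]
    return min(reached), max(reached)


def naive_policy(angle):
    """Always pushes the joint further, ignoring any limits."""
    return 0.4


buffer = collect(naive_policy, steps=5)
low, high = fit_reachable_range(buffer)
print(low, high)
```

The policy never sees `JOINT_MAX`, yet the collected data never exceeds it: the environment itself supplies the boundary condition, and that data is what gets piped back into training.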

    9. AM

      Right. So this is important. If you guys remember, we talked about physical verification, right, as a key predictor of where frontier progress is gonna continue. Wherever you have context and performance that can be verified, progress can quite reliably be made, right? So software engineering is verifiable because you can write unit tests. Image generation, not very verifiable, right? Because beyond the basic tasks that Andy talked about, which is accuracy, right? Five fingers instead of six, character consistency, which is more a preference-

    10. AB

      But even that example, how would you measure that at scale without having a human telling you-

    11. AM

      No, exactly

    12. AB

      ... this thing actually had five fingers and not six?

    13. AM

      Well, I think you should talk about how that verification works, and then, in the new world where you have to verify physical tasks like robotics, what does that look like?

    14. AB

      Yeah, yeah. So I think it's fun because... Oh yeah, so verification for images is super tricky, especially for videos when it comes to physical things. But once you hook that up in the real world, there are just certain things that you can do and certain things that you can't do, because a robot arm cannot just do certain joint movements. So exposing it to the physical world naturally applies the boundary conditions that we would expect. So that's a very important step, and by that you have the perfect kind of environment-

    15. AM

      Right

    16. AB

      ... to directly, inherently model these kinds of restrictions.

    17. AM

      Whereas in the case of aesthetics or visual preference, how did you guys verify that? How did you get a model to be better when just doing content creation?

    18. AB

      Yeah.

    19. AM

      Uh, what does that, that-

    20. AB

      Well, that involves a massive, massive amount of human judgment and then feeding that signal back through the model again.

    21. AM

      Right.

    22. AB

      But that's often very tedious and also often very dependent on who you're asking.

    23. AM

      Yes.

    24. AB

      Like, if I ask you... You've looked at so many images by now, I would consider you an expert, because we always show that guy our models, but you're also enjoying it, I guess.

    25. AM

      Depends on how much spit you have had that day [laughs] and my energy.

    26. AB

      I know, but showing an image to someone who has no idea of, say, image generation versus to myself, and I've looked at so many images, gives you a very different signal. What I would rate as good or bad looks very different from what another person would, right? It depends on the crowd you're asking. So you can ask people-

    27. AM

      Right

    28. AB

      ... but that is very ambiguous in a way.

    29. AM

      So this is a key insight, I would say, because anytime the answer to the eval question of "how do you verify?" is "it depends on the audience" or "it depends on the person consuming the system," it should trigger a light bulb, at least it does for me, that the value you get from the system varies a lot by how much the model can be customized for a particular audience. And that is where open source comes in. Because the beauty of open models is that if you give away the weights, and they're good general weights, then you can tell Meta, "Hey, you're welcome to customize the preferences of this model as you see fit for your users." And you can tell a government that has different cultural preferences and biases, and that wants to deploy content creation for, let's say, internal teams in a completely different culture, "You can have the control over that last mile." And I think that's turned out to be a very critical part of the open ecosystem. I often get asked, "Anj, why did BFL open their models up and just give away all this research for free? Is it just that they wanna save the world?" Well, part of it is cultural. As you can tell, Andy came from the academic community, enjoyed and benefited from open publishing. But at the end of the day, you've got to turn these research products into businesses, right? And it turns out there's extraordinary value in producing state-of-the-art systems that are open and customizable when the consumer of the system, the person benefiting from the system, has very different preferences from other people who might be consuming the system. Does that make sense? Have I lost you guys? Can I get some nodding if that's making sense? Yes. Okay.
This will be a consistent theme in the class: anywhere you have consumers of a system, or customers, or the people benefiting from the system, wanting more and more personalization and customization of the system for themselves, that's where open models become extraordinarily valuable. And you can actually build, it turns out, a very large business very quickly doing that. So I actually think there's a bit of a false trade-off in the space about open versus closed. These are both just techniques or tactics for how to deliver value. Sometimes they get politicized philosophically, but from a very basic, first-principles commercial perspective, open makes a lot of sense in domains where the distribution of aesthetics, preferences, and so on is wide and heterogeneous, with a long tail, versus domains where preferences are actually quite narrow. If there's a pretty narrow distribution, then I think closed models are quite valuable. I think there's one last piece that we haven't covered, which is the state of the art today. Because as Andy said, the state of the art now is about how you get these systems to reason in a unified fashion across text, image, video, and so on, in a way that has transfer learning across these different modalities. A very hard problem. Very, very hard problem. But as is consistently the case with BFL, the team makes these research advancements and then gives away the technology. And so this was actually one example, I think, what, two months ago? No, a month ago.

    30. AB

      A month ago, yeah.

  11. 43:50-1:01:01

    Evals, safety, openness, and the next research bets (Self Flow, distillation, 3D vs video)

    1. AB

      Yeah. So the question was, when we close the feedback loop, how do we ensure to, um, compress this? How do we make sure that personal data is respected and that no harm is generated based on those models? So first, we have a lot of content filters on our API, obviously, because our belief is that these models are powerful tools for humans to create super nice and creative outputs, and also much more than only content creation, as we just saw. And we don't wanna have them misused, so we add a lot of content filters that make sure no harm is generated. On personal information, obviously, being based in the European Union, we comply with the EU AI Act, and there's also a law we follow under which, based on a request, so if you put an image of yourself on our API and you say, "Hey, look, I don't want you to store this kind of data," we have to delete it. So we have systems in place to make sure this happens. So the question is: we had a lot of partners, large companies that we worked with, like xAI, Meta, backed by NVIDIA, and the question was how do we evaluate with whom we work and with whom not? I think maybe as a general statement, we are working on building visual intelligence infrastructure for everyone, basically. So from an infrastructure perspective, you really want to make sure you put guardrails around your models so that people cannot misuse them. But then infrastructure is there for basically everyone, right? And that's the standpoint we're also taking. We care a lot about the safety of the models. That's important. And we do everything we can to prevent misuse.
But then I think it's also about us putting a technology out there and providing it to everyone. And it's always hard to take a certain standpoint on who you're working with and who you're not working with, because it gets very tricky to justify in the end-

    2. AM

      Let me try and translate what Andy's saying.

    3. AB

      [laughs]

    4. AM

      The company basically applies its guardrails to everybody. So no matter who you are and how big you are and how much money you've got, if you want us to remove our guardrails, sorry. Those guardrails apply to everybody equally, because being a standard and being infrastructure that people can rely on means you don't treat different people differently. And everyone can rely on getting the same quality of service as everybody else, regardless of who might have more money or be more politically influential, whatever it might be. And so that's the position BFL's taken as an infrastructure provider: it doesn't matter who you are. Now, sometimes you have custom needs, because of your scale, that are technical: "Hey, we need it to be deployed in this way. We need some latency requirements." But when it comes to guardrails, that applies to everybody. And so when some partners say, "We want you to remove those guardrails," you say, "Sorry. You can go elsewhere." And that has sometimes resulted in the company losing meaningful amounts of revenue. And that's okay, because in the long term, as we talked about in the first lecture, the way you get infrastructure to move stably is you have trusted standards and trusted institutions to enforce them. And sometimes you gotta enforce them yourselves. Would you say that's roughly correct?

    5. AB

      Thanks, sir. Yes.

    6. AM

      [laughs] We've had some spirited debates, I would say, at the company. And, you know, we've talked about culture as a bottleneck on progress. One of the secret sauces of BFL is a very united culture, where there's a lot of debate and dissent on what to do and not to do, but then when they commit, they all commit together. And, I mean, how many people have left the company in the entire lifetime of the company? Like two? Three?

    7. AB

      Um, one.

    8. AM

      One. They've had one person leave in the entire history of the company. Not common in the AI space, where sometimes you have co-founders leaving six months in. I'm sure you... This is the thing. This is my one issue with the Bay Area. The culture's forgotten that sometimes, to make progress on long-term ambitious goals, you gotta stay together as a unit. And that's a great question. I think it challenged the culture at several points, and I think it turned into a sort of moat in a sense.

    9. AB

      Yeah, absolutely. You debate, then you disagree, and then you commit.

    10. AM

      You commit.

    11. AB

      And onwards then. [chuckles]

    12. AM

      And there'll be more, I'm sure. Uh, next question, yes.

    13. AB

      The question is: how do we deal with the insane amount of data labeling that has to be done? And unlike text, images are just not that straightforward to label. I think two answers. First, when we train a model, we just saw these pre-training, mid-training, and post-training stages. We start with more data, and also noisier data, in pre-training, and then as you progress through training, you reduce the amount of data but you increase the quality. So in pre-training, it's enough to do labeling that you can automate and then really apply at massive scale. There are systems available to do this, some of them publicly, but obviously we also have some internal stuff that I cannot talk too much about now. But the more we approach later stages in training, the more we also involve, say, human signals and stuff like that, because in the latest stages of training you align this very broad and general representation your model learns with what actually matters most to everyone out there. You wanna make sure you have annotations that reflect exactly what you want, and that's where the gold standard is still human labeling. Where do we see the field going in terms of iterative denoising in general? Will it still be needed in the future? There are now other probabilistic approaches, such as drifting models, that allow us to do maybe a single step. And yeah, I'll answer that very generally. I think it's super interesting if you compare these flow matching and diffusion models with language models. Both are iterative models.
But flow matching or diffusion models are iterative in a dimension that is orthogonal to the data, this artificial time dimension that we apply, which goes from pure noise to the data you wanna be generating. Whereas language models are iterative in the direction of the data, right? You generate token by token, and that has very interesting implications for both the training and the inference properties these models have. For diffusion and flow matching type models, you're actually pretty data inefficient, because every training example gives rise to infinitely many potential losses: you can pick any point on the continuous trajectory from clean image to noise and say, "I wanna denoise from here to, say, the next step," right? And I can do this super often. So that tells us it's super data inefficient in a way, compared to language models, or, let me be more specific, autoregressive models, where we can train on all tokens in parallel. On the other side, at inference, the effects of these two properties are switched. With language models, you have to generate token by token, and there are some hacks, such as speculative decoding and stuff like that, that maybe can help you. But essentially you still have to... Like, you cannot just miss data. Whereas for diffusion or flow matching models, you can actually distill a model down, right? When we do post-training, we do distillation. We've written a bunch of papers on adversarial diffusion distillation, where you get the number of steps of flow matching models down from, say, 50 to 4 or 2.
And then it actually doesn't make a real difference anymore whether you do a drifting model and have this directly at one step, or you maybe take two steps; the pipeline is just more stable and mature when you distill a diffusion model down to two steps using adversarial diffusion distillation, right? So I think it's two sides of the same coin. But coming back to autoregressive models, getting these insane speedups just by distillation, using the iterative nature of the model, is not really possible there. So I think a very interesting research problem that I often think about is: how can we combine the data efficiency of autoregressive models with the inference properties that these diffusion and flow matching type models have? So for everyone who likes to do research, that's a super interesting problem to work on. Uh-
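
To make the training-side contrast in this answer concrete, here is a minimal editorial sketch (not code from the talk or from any BFL release). It assumes one common flow matching convention, a straight-line path with t = 0 at the clean sample and t = 1 at pure noise, with the velocity `noise - x0` as the regression target. The function names and the toy data are hypothetical.

```python
import random


def flow_matching_example(x0, rng):
    """Draw ONE training target from a clean sample x0 (a list of floats).

    A random time t picks a point on the path between data and noise; each
    call yields a single (t, x_t, velocity) triple, and infinitely many
    different triples could be drawn from the same x0.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1 - t) * a + t * n for a, n in zip(x0, noise)]  # point on the path
    velocity = [n - a for a, n in zip(x0, noise)]           # regression target
    return t, x_t, velocity


def autoregressive_examples(tokens):
    """All next-token targets from one sequence, available in a single pass."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
t, x_t, velocity = flow_matching_example(x0, rng)
pairs = autoregressive_examples(["a", "b", "c", "d"])
print(len(pairs))
```

One pass over a four-token sequence gives three supervised targets at once, while one pass over an image gives a single randomly chosen point on its noise trajectory, which is the sense in which the speaker calls diffusion-style training data inefficient.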

    14. AM

      Are you guys hiring?

    15. AB

      And you-- Yeah.

    16. AM

      Okay, sorry.

    17. AB

      Always. Always. And yeah. I could now spend the next half hour-

    18. AM

      I know

    19. AB

      ... talking about this with you.

    20. AM

      This part, latent adversarial distillation, is a part of the pipeline at BFL that I would say is very near and dear to the core of the company, for two reasons. One is because it actually makes these models extraordinarily efficient. And for those of you who have German friends, you know that efficiency-

    21. AB

      [chuckles]

    22. AM

      ... is top of mind, and I think that's a through line through everything BFL does. It's high quality, it's efficiency. But it also ended up being a key unlock for our business model. Because early on, a big question was: we have this philosophy that we wanna be open, we wanna produce open weights, but we've got to find a way to make it commercially sustainable. 'Cause there are a lot of projects that open models up, and then they just die, and that's not stable infrastructure that you can rely on either. And so one of the key differences between diffusion models and autoregressive models, as Andy's talking about, is that the model size is actually the same in a diffusion model. If you look at the first Flux family, we didn't release Flux as a single model. Flux.1 was actually packaged into three different models: Flux Schnell, which is German for fast, right? Flux Dev, and then Flux Pro. Flux Pro we put behind an API, whereas Flux Schnell was full Apache 2.0 open weights, and Flux Dev was open weights but with a commercial license, where you were welcome to look at the weights and use the model, but if you wanted to make revenue off of it, you had to pay. And the key distinction between these three was that they were actually the same size model, unlike, for example, language models, where if you have Claude Haiku, Sonnet, Opus, and so on, they're actually different sizes. So in autoregressive land, you train a big model, and then you distill down to smaller and smaller sizes. In Flux.1, which is a diffusion model family, it was the same size but fewer steps. So-

    23. AB

      And you can still do size distillation as well?

    24. AM

      You can still do size distillation, but I think they were all at this point in this pipeline, right?

    25. AB

      Yeah, yeah. For Flux.1 it was, yeah, yeah. Absolutely.

    26. AM

      And so we distilled it down to Schnell, which was basically a single-step model at that point. So it's-

    27. AB

      Four steps.

    28. AM

      Four steps, sorry. Four-step model, super fast, super lightweight, lower quality. Pro, more steps, super high quality, slower, right? 'Cause you're iterating over more diffusion steps. And so that turned out to be this very beautiful packaging of the core technology in a way that was also commercially sustainable, because the open source developer community was thrilled, 'cause now they had this really fast model for a lot of use cases that you can run locally, and all the enterprises who didn't wanna deal with customization had a high-quality model behind an API. And developers who wanted a mix of both got a pretty high-quality model that was also open weights and fast. And that trade-off is a hard one to make if you don't foresee the fact that you wanna close this loop of frontier research that we've talked about, repeatedly. I would say two years ago, the state of the art was: train a model, put the weights out there, let's see. But when you start thinking long term, you're not thinking in terms of a single model release, you're trying to think about it as a system. You know, we've talked about all the bottlenecks, and you want one iteration to help you unlock the bottleneck for the next run and the next run. And latent adversarial distillation turned out to be a pretty key unlock for that part of the bottleneck two years ago. Next question.
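
The "same size, fewer steps" point can be illustrated with a toy sampler (an editorial sketch under stated assumptions, not the Flux sampler). A fixed velocity function stands in for the network, and Euler integration from t = 0 to t = 1 stands in for sampling; the ODE dx/dt = x is chosen only because its exact answer is known. The cost is one function evaluation per step, so 4 steps are much cheaper than 50, but without any distillation the coarse schedule is also less accurate.

```python
import math


def euler_sample(velocity, x, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using `steps` evaluations."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * velocity(x, t)  # one "network call" per step
        t += dt
    return x


def toy_velocity(x, t):
    """Hypothetical stand-in for a trained model; exact solution at t=1 is x0 * e."""
    return x


errors = {
    steps: abs(euler_sample(toy_velocity, 1.0, steps) - math.e)
    for steps in (4, 50)
}
print(errors)
```

The same function is called in both runs; only the step count changes. Closing the accuracy gap at low step counts is what step distillation is trained to do, which this sketch does not attempt.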

    29. AB

      Spatial intelligence, yes, and whether it's more 3D, or how I see the 3D space where some companies are working versus our more video-based approach going forward in the future. I think I'll take a kind of opinionated-

    30. AM

      Yeah, that's fair

Episode duration: 1:01:13


Transcript of episode CBaLU0dDEY8
