Stanford Online

Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence

For more information about Stanford's online Artificial Intelligence programs, visit https://stanford.io/ai. To follow along with the course schedule and syllabus, visit https://cs153.stanford.edu/.

In this CS153 "Frontier Systems" session, Anjney Midha welcomes Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, for a discussion of the visual intelligence frontier and how frontier AI "factories" scale. Blattmann recounts his path from mechanical engineering to a Heidelberg PhD lab, where he developed latent diffusion to train image generators efficiently, enabling Stable Diffusion's 2022 release. They contrast earlier unimodal content-creation models with today's push toward unified multimodal systems spanning images, video, and audio, plus action prediction for computer use and robotics, emphasizing observation and interaction loops. Using FLUX as a case study, they cover pre-training, mid-training, post-training, distillation for speed, customer feedback driving image editing and character consistency, and why open weights enable customization. They also discuss Self Flow for multimodal alignment, safety guardrails, EU compliance, data labeling strategies, diffusion vs. autoregressive tradeoffs, and skepticism about explicit 3D representations.

Guest Speaker: Andreas Blattmann is the co-founder of Black Forest Labs (BFL), the German generative AI startup behind the FLUX text-to-image foundation model, backed by Andreessen Horowitz and other major venture firms. Before founding BFL, he was a generative AI researcher at LMU Munich, NVIDIA, and Stability AI, where he made significant contributions to image and video generation. He is a co-inventor of latent diffusion, the generative modeling technique that produced the open-source text-to-image system Stable Diffusion (which he co-developed) and now powers cutting-edge models including FLUX, Midjourney, and OpenAI's DALL-E 3, with applications extending into audio generation and medical imaging. His academic publications have amassed over 22,000 citations. He was named to Capital Magazin's Top 40 Under 40 in Germany in 2024.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Anjney Midha (host) · Andreas Blattmann (guest)

May 4, 2026 · 1h 1m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. AM

    Quick show of hands, how many people recognize the song that was playing? One of my favorite songs called "Bella Napoli." It has been added to the, uh, CS153 Spotify playlist. For anybody who has music requests for CS153 this quarter, also known as AI Coachella, we've got an open playlist. Please feel free to add songs there. That one was a request from me in honor of our speaker today-

  2. AB

    [chuckles]

  3. AM

    ... who I'm very lucky to call a close friend, and is the co-founder of Black Forest Labs, Andreas Blattmann. Thank you for joining us, Andy.

  4. AB

    Thanks, Ansh. Thank you, everyone. Thanks for having me.

  5. AM

    Andy is joining us from Germany in a little town called Freiburg, which I think a lot of you will be hearing about more and more as it becomes a hub, uh, for frontier research in Europe. If you remember in our first lecture, right, we talked about the anatomy of frontier AI progress. And we talked about three or four important touchpoints in this class you're gonna be hearing about over and over again. One is that there's a, a transition happening from the old systems, the old infra stack, to a new one, right? And you gotta be open to understanding what those rewrites are looking like, and, and our speakers are gonna tell you which parts of the stack they're helping to rewrite. We talked about the basic AI scaling recipe, right? We've got two sort of loops that are important to run. Once you do, you get some compute, you get some data, you build a model, and then you do inference, right? That gives you revenue to buy more compute and then context feedback. We've talked about the bottlenecks, right, on that, on getting those loops scaling, which is context, compute, capital, and culture. We talked about context and, and compute. We'll talk a little bit about all four today. And then the last was, well, for your projects, which is the, the part I'm sure many of you are anxious about, is how do you get one of those scaling flywheels going? Right? And we talked about there being sort of three steps in the journey. There's an incubation phase, where you kind of figure out which specific part of the frontier you wanna attack with a state-of-the-art system. Right? Then you land with a SOTA release, a state-of-the-art release, and then that allows you to expand to more and more capabilities on the frontier that you care about. And if you remember, we did sort of a, a field trip into one of the frontier factories, right, um, in, in our first lecture, which was Anthropic. We talked about code as one domain. 
And today, we have a chance to do a field trip into another frontier AI factory in Germany called Black Forest Labs. And we've got here one of the factory owners, Andy Blattmann, um, who's the co-founder of Black Forest Labs, also co-creator of Stable Diffusion. How many people here have heard of Stable Diffusion? All of you. Perfect. Great. So you've done some homework.

  6. AB

    [chuckles]

  7. AM

And so today we're gonna talk about the frontier. You know, last, uh, on Tuesday, we talked about the, the audio and the speech frontier, right? What is a- audio intelligence like? What was it? Where is it going with Matty from ElevenLabs. And today we have Andy talking to us about the frontier of visual intelligence, which I think is one actually-- one of the most exciting frontiers, if not the most critical frontier to unlock more progress in if we really want to get, um, these models to work in mission-critical contexts in the real world. And so we're gonna spend some time talking about the anatomy of visual intelligence as, as Andy sees it as one of the pioneers of the field. And then we're gonna talk, go back in time a little bit and zoom into how we bootstrapped the FLUX flywheel together a couple years ago. FLUX is the name of the flagship model family from Black Forest Labs. And then we're gonna spend some time on the fun part, which is future frontiers. Where are things right now that, that where are un- where are the unsolved problems? Where are we right now where you guys can step in and start co-creating this journey, uh, in the space. So this was the frontier factory, right? We talked about this is sort of the basic template. Again, to be clear, this is a directional heuristic. Every team is different, every research project is different. But to kind of give you a grounding sense of repeating patterns about how, um, some of the best teams are manufacturing intelligence repeatedly, remember this was the pipeline. We had, um, pre-training, mid-training, post-training with agents in the real world. There, there's a version of this that, that Andy's gonna walk us through, but before we jump into that, why don't we just spend some time on, on you, Andy. Who are you and how'd you get here?

  8. AB

Yeah, cool. Thank you, Ansh. Uh, thanks again for having me, everyone. Um, yeah, I'm Andy. Um, started looking into AI, I think in 2019. Um, I, I was actually originally studying mechanical engineering. It's classic German education, uh, I think. You go to a school and then you figure out you're kind of somewhat technical, and what are you doing if you don't know exactly what, what to do? Studying mechanical engineering in Germany, right? Um, and then, uh, yeah, through, through a couple of, um, I think coincidences, I got into computer science, into coding, into robotics already back in the days. We talk more about robotics, uh, later. Um, and applied for a PhD in Heidelberg, uh, where I met my two co-founders, Robin and Patrick. Um, and that was a really, like, small lab. Everyone back in the day was doing representation learning with visual models, uh, or like for, for the visual domain, and computer vision itself back then, that was 2019, was kind of a, a niche topic within AI. It was really like, people saw the potential already, but, but no, no one, no one had an idea of how-

  9. AM

    Right

  10. AB

    ... how that would, uh, explode then later, right? So it was really, uh, kind of a, yeah, niche topic we worked on, but we soon had a very good intuition about like how to train models to generate pixels, mainly images back then. Um, and we're competing on a research level as a very small lab with players that were much larger than us. Uh, and finally that already back in the day-... was Google and OpenAI, their research teams, and it was not about building frontier systems. It was-

  11. AM

    Foundation models

  12. AB

    ... yeah, or, or even before that, uh, who wrote actually the nicest paper to show that something was, was happening.

  13. AM

    Right.

  14. AB

    So back in the days it, it was like-

  15. AM

    This was pre, uh-

  16. AB

    That, that was pre-Stable Diffusion. That, that-

  17. AM

    Right

  18. AB

    ... that was, that was-

  19. AM

    20-

  20. AB

    Really it was-

  21. AM

    ... 19

  22. AB

    ... for the ones who remember it, StyleGAN was kind of-

  23. AM

    StyleGAN, yep

  24. AB

... the images were most often generated with GANs because they had kind of good inductive biases for, for, for kind of this data domain. Um, and generating a 256 by 256 pixel image was a challenge. Like, not every algorithm could do that, and yeah, it was just a very different world. So, um, we competed with labs that were much larger than us, and we had, even back in the day, way, way, uh, less compute. So we had to come up with kind of more, um, efficient algorithms to solve that problem, because images, and now speaking of videos, are so much higher dimensional than other representations, say text or something. Text is, uh, much lower dimensional.

  25. AM

    And, and to anchor folks on time, th- you were still-- This was when you were at the University of Heidelberg.

  26. AB

    Exactly.

  27. AM

    Right. Yeah.

  28. AB

Exactly. Um, so, um, yeah, and then we, we spent, like, two years investigating how we can actually find representations for natural data, for images, for video, um, mainly, that are perceptually equivalent to the pixel space, or to what matters to us humans in the pixel space, but much lower dimensional and much more efficient, because we didn't have the compute to train a kind of generative model in the pixel space. And it's also super wasteful, and that was what gave rise to a, a series of papers on latent generative modeling. So you actually train a kind of a compression model, um, similar to a learned JPEG codec, you could imagine it, to find that perceptually equivalent representation to the, uh, pixel space, and you train the generative model there. And that, um, helped us save tons of compute, train our models much more efficiently, and with orders of magnitude less compute than our competitors, put out, like, models that were on par with or even better than those competitors. And that was what-- That algorithm, latent diffusion, also gave rise to Stable Diffusion then. Um, so we proposed the algorithm, saw the potential, set out to search for some compute, luckily found that in the open source community, um, and trained Stable Diffusion, which was then released in 2022. Um, and it pretty much surprised us as well, like, with all the hype it got. And actually it was, it was fun. Here in the Bay Area it was hyped much more than in Germany. In Germany, still today, not a lot of people know about that model, funnily.
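[Editor's note: the latent-diffusion recipe Blattmann describes — compress pixels to a perceptually equivalent but much lower-dimensional latent, then run the diffusion process there — can be sketched as a toy illustration. This is not BFL's actual code: average pooling stands in for the learned encoder/decoder, and the noise schedule is made up for illustration.]

```python
import numpy as np

# Toy sketch of latent diffusion:
# 1) a "compression model" maps pixels to a much lower-dimensional latent,
# 2) the forward diffusion (and the generative model) operates in that latent space.

def encode(image, f=8):
    """Stand-in for a learned encoder: f-fold spatial downsampling via average pooling."""
    h, w, c = image.shape
    return image.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(latent, f=8):
    """Stand-in for a learned decoder: nearest-neighbour upsampling back to pixel space."""
    return latent.repeat(f, axis=0).repeat(f, axis=1)

def diffuse(latent, t, alphas_bar, rng):
    """Forward diffusion q(z_t | z_0), applied in latent space instead of pixel space."""
    noise = rng.standard_normal(latent.shape)
    return np.sqrt(alphas_bar[t]) * latent + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))          # a 256x256 RGB "image"
z = encode(image)                          # latent is 32x32x3
print(image.size / z.size)                 # 64x fewer dimensions to model
alphas_bar = np.linspace(1.0, 0.01, 1000)  # illustrative noise schedule
z_t = diffuse(z, t=500, alphas_bar=alphas_bar, rng=rng)
```

With an 8-fold downsampling factor per spatial axis, the diffusion model sees 64 times fewer dimensions than in pixel space, which is where the compute savings he mentions come from; in the real system the encoder/decoder is a trained autoencoder, not pooling.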

  29. AM

    Yeah. It-- Uh, there wa- there was a moment I remember, DALL-E 2 was in preview, I think, and, and then you guys put out, uh, Stable Diffusion. And I remember on Reddit there wa- somebody had sketched out, uh, they'd taken one of their kids' like, uh, drawings. It was like a crayon drawing, and had turned it-- had run it through the image-to-image-

  30. AB

    Yeah



Transcript of episode CBaLU0dDEY8
