
Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai. To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In this CS153 “Frontier Systems” session, Anjney Midha welcomes Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, for a discussion of the visual intelligence frontier and how frontier AI “factories” scale. Blattmann recounts his path from mechanical engineering to a Heidelberg PhD lab, where he developed latent diffusion to train image generators efficiently, enabling Stable Diffusion’s 2022 release. They contrast earlier unimodal content-creation models with today’s push toward unified multimodal systems spanning images, video, and audio, plus action prediction for computer use and robotics, emphasizing observation and interaction loops. Using Flux as a case study, they cover pre-training, mid-training, post-training, distillation for speed, customer feedback driving image editing and character consistency, and why open weights enable customization. They also discuss Self Flow for multimodal alignment, safety guardrails, EU compliance, data-labeling strategies, diffusion vs. autoregressive tradeoffs, and skepticism about explicit 3D representations.

Guest Speaker: Andreas Blattmann is the co-founder of Black Forest Labs (BFL), the German generative AI startup behind the FLUX text-to-image foundation model, backed by Andreessen Horowitz and other major venture firms. Before founding BFL, he was a generative AI researcher at LMU Munich, NVIDIA, and Stability AI, where he made significant contributions to image and video generation. He is a co-inventor of latent diffusion, the generative modeling technique that produced the open-source text-to-image system Stable Diffusion (which he co-developed) and now powers cutting-edge models including FLUX, Midjourney, and OpenAI's DALL-E 3, with applications extending into audio generation and medical imaging. His academic publications have amassed over 22,000 citations, and he was named to Capital Magazin's Top 40 Under 40 in Germany in 2024.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Anjney Midha (host) · Andreas Blattmann (guest)
May 3, 2026 · 1h 1m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Black Forest Labs' path from Stable Diffusion to multimodal visual intelligence

  1. Blattmann traces his journey from a small Heidelberg lab to co-creating latent diffusion (enabling Stable Diffusion) by compressing pixel-space generation into lower-dimensional latent representations to drastically reduce compute requirements.
  2. The talk argues that “natural representations” (video/audio) are fundamental for intelligence because they reflect how humans learn (observe first, then interact), whereas text is an efficient but human-made, high-information representation that can be an incomplete starting point.
  3. Black Forest Labs bootstrapped its product/research flywheel by focusing first on a 10× better text-to-image model (Flux.1), using real customer usage to identify capability gaps (e.g., precise control and character consistency) and then shipping an image-editing model (Flux.1 Kontext).
  4. The frontier direction shifts from unimodal content-creation models to unified multimodal models that learn cross-modal correlations and can be conditioned on, and predict, actions—enabling computer-use agents, robotics, world modeling/simulation, and stronger creative tools.
  5. Key operational themes include evaluation challenges for aesthetics (human judgment and preference variance), the commercial/technical rationale for open weights and customization, and infrastructure-like safety/guardrails that apply uniformly across partners and deployments.

IDEAS WORTH REMEMBERING

5 ideas

Compute efficiency can be a decisive competitive advantage in vision.

Blattmann emphasizes that pixel-space generation is wasteful; learning a perceptually equivalent latent space lets small teams train competitive models with far less compute, which was central to latent diffusion and Stable Diffusion’s feasibility.
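As a rough illustration of that efficiency argument, here is a minimal latent-diffusion training sketch in PyTorch: a frozen toy autoencoder maps 512×512×3 pixels into a 64×64×4 latent grid (about 48× fewer values), and the denoising model trains entirely in that space. The module shapes, the 8× downsampling factor, and the flow-matching-style objective are illustrative assumptions, not BFL's actual architecture.

    import torch
    import torch.nn as nn

    # Toy frozen autoencoder: 512x512x3 pixels -> 64x64x4 latents (8x downsampling).
    # These are illustrative stand-ins for a trained VAE, not a real model.
    encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)
    decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
    for p in list(encoder.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)  # the autoencoder is trained first, then frozen

    # Stand-in for the latent denoiser (a U-Net or diffusion transformer in practice).
    denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

    def train_step(images: torch.Tensor) -> float:
        """One denoising-objective step, run entirely in the compressed latent space."""
        with torch.no_grad():
            z = encoder(images)                  # compress pixels -> latents
        t = torch.rand(z.shape[0], 1, 1, 1)      # random noise level per sample
        noise = torch.randn_like(z)
        z_t = (1 - t) * z + t * noise            # interpolate clean latent and noise
        v_pred = denoiser(z_t)
        v_target = noise - z                     # flow-matching-style velocity target
        loss = ((v_pred - v_target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    batch = torch.randn(2, 3, 512, 512)          # stand-in for a batch of real images
    print(train_step(batch))
    # Sampling would iteratively denoise a random 64x64x4 latent with `denoiser`,
    # then call `decoder` once to map the result back to pixels.

The saving compounds: every denoising step, during both training and sampling, touches roughly 48× less data than a pixel-space model would at these toy shapes, which is the feasibility point Blattmann makes for small teams.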

Start with a narrow SOTA target to ignite the flywheel, then expand.

BFL began by targeting a concrete, market-relevant improvement (a “10× better” image model) and used early customers to close the feedback loop before broadening into editing, multimodality, and action prediction.

User behavior is often the best roadmap for the next capability.

Observing widespread LoRA training and attempts at consistent characters signaled that text prompts were too ambiguous; this directly informed Flux.1 Kontext, reframing “content creation” into controllable editing workflows.
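For context on what those users were doing: a LoRA fine-tune freezes a model's weights and trains only a small low-rank additive update, which is how people taught open-weight image models custom characters and styles from a handful of examples. Below is a generic minimal LoRA layer in PyTorch; the class name, rank, and scale are hypothetical illustration choices, not Flux's tooling.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer plus a trainable low-rank update:
        y = base(x) + scale * x @ A^T @ B^T, where A and B are small."""
        def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)  # the base model stays frozen
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no-op at start
            self.scale = scale

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(64, 64))
    print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])

Only A and B train, so a finished adapter is a small fraction of the full model; the wider point in the talk is that widespread LoRA use revealed that prompts alone were too ambiguous for consistent characters, motivating Flux.1 Kontext's editing workflow.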

Multimodal learning is not just feature-stacking; it provides grounding via correlations.

The talk highlights that sound-image-video correlations (e.g., collisions producing characteristic audio) can help models form higher-level understanding than training separate unimodal systems.

“Interaction” is the missing ingredient beyond observation-based training.

Pretraining/mid-training are framed as passive observation; post-training that closes the loop requires acting in the world (e.g., via robots) to generate grounded data and enforce physical boundary conditions.

WORDS WORTH SAVING

5 quotes

You should start from first principles, how we humans do it, and that's clearly learning on natural representations by first observing and, second (we'll talk about that later), interacting.

Andreas Blattmann

Text is inherently human-made. You see this in so many different occasions. If you just measure the information per sign that text transports, it's so much higher than the information per sign, per pixel, in an image. And why is that? Because it's human-made.

Andreas Blattmann

We don't train a single unimodal model anymore just to fulfill the purpose of content creation. We're training a unified, multimodal model for natural representations, or natural data, that can then give rise to so much more.

Andreas Blattmann

So when we started the company, we looked at the field and we said, "There's clearly a need for a next generation of image models, because so far the models cannot, say, produce hands that actually have five fingers," right?

Andreas Blattmann

The company basically applies its guardrails to everybody. So no matter who you are, how big you are, and how much money you've got, if you want us to remove our guardrails, sorry. Those guardrails apply to everybody equally, because being a standard and being infrastructure that people can rely on means you don't treat different people differently.

Anjney Midha

TOPICS

Latent diffusion and learned compression (latent generative modeling)
Natural vs human-made representations (video/audio vs text)
Bootstrapping a frontier-model flywheel with limited resources
Flux.1 product line and distillation (Schnell/Dev/Pro)
Character consistency and image editing (Flux.1 Kontext)
Multimodal pretraining + action conditioning for physical AI
Self Flow and multimodal representation alignment
Verification/evals: human preference vs physical constraints
Open weights as customization and business strategy
Guardrails, EU compliance, and infrastructure neutrality
Diffusion/flow matching vs autoregressive tradeoffs
3D explicit representations vs implicit video-based learning

High-quality AI-generated summary created from the speaker-labeled transcript.
