Stanford Online: Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence
At a glance
WHAT IT’S REALLY ABOUT
Black Forest Labs' path from Stable Diffusion to multimodal visual intelligence
- Blattmann traces his journey from a small Heidelberg lab to co-creating latent diffusion (enabling Stable Diffusion) by compressing pixel-space generation into lower-dimensional latent representations to drastically reduce compute requirements.
- The talk argues that “natural representations” (video/audio) are fundamental for intelligence because they reflect how humans learn (observe first, then interact), whereas text is an efficient but human-made, high-information representation that can be an incomplete starting point.
- Black Forest Labs bootstrapped its product/research flywheel by focusing first on a 10× better text-to-image model (Flux.1), using real customer usage to identify capability gaps (e.g., precise control and character consistency) and then shipping an image-editing model (Flux.1 Kontext).
- The frontier direction shifts from unimodal content-creation models to unified multimodal models that learn cross-modal correlations and can be conditioned on, and predict, actions—enabling computer-use agents, robotics, world modeling/simulation, and stronger creative tools.
- Key operational themes include evaluation challenges for aesthetics (human judgment and preference variance), the commercial/technical rationale for open weights and customization, and infrastructure-like safety/guardrails that apply uniformly across partners and deployments.
IDEAS WORTH REMEMBERING
Compute efficiency can be a decisive competitive advantage in vision.
Blattmann emphasizes that pixel-space generation is wasteful; learning a perceptually equivalent latent space lets small teams train competitive models with far less compute, which was central to latent diffusion and Stable Diffusion’s feasibility.
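The compute argument behind latent diffusion can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses illustrative numbers (Stable Diffusion v1 used a VAE with 8× spatial downsampling and 4 latent channels, but exact figures vary by model):

```python
# Rough arithmetic behind the latent-diffusion compute argument.
# Numbers are illustrative, not taken from the talk.
pixel_dims = 512 * 512 * 3   # RGB pixel space a pixel-space diffusion model operates in
latent_dims = 64 * 64 * 4    # compressed latent space after the autoencoder (8x downsampling, 4 channels)

compression = pixel_dims / latent_dims
print(f"dimensionality reduction: {compression:.0f}x")  # prints "dimensionality reduction: 48x"
```

Since each denoising step scales with the size of the representation being denoised (and attention scales quadratically in spatial tokens), a ~48× smaller space is what makes training feasible for small teams.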
Start with a narrow SOTA target to ignite the flywheel, then expand.
BFL began by targeting a concrete, market-relevant improvement (a “10× better” image model) and used early customers to close the feedback loop before broadening into editing, multimodality, and action prediction.
User behavior is often the best roadmap for the next capability.
Observing widespread LoRA training and attempts at consistent characters signaled that text prompts were too ambiguous; this directly informed Flux.1 Kontext, reframing “content creation” into controllable editing workflows.
Multimodal learning is not just feature-stacking; it provides grounding via correlations.
The talk highlights that sound-image-video correlations (e.g., collisions producing characteristic audio) can help models form higher-level understanding than training separate unimodal systems.
“Interaction” is the missing ingredient beyond observation-based training.
Pretraining/mid-training are framed as passive observation; post-training that closes the loop requires acting in the world (e.g., via robots) to generate grounded data and enforce physical boundary conditions.
WORDS WORTH SAVING
You should start from first principles, how we humans do it, and that's clearly learning on natural representations by first observing and, second, we'll talk about that later, interacting.
— Andreas Blattmann
Text is inherently human-made. You see this on so many different occasions. If you just measure the information per sign that text transports, it's so much higher than the information per pixel in an image. And why is that? Because it's human-made.
— Andreas Blattmann
We don't train a single unimodal model anymore just to fulfill the purpose of content creation. We're training a unified multimodal model for natural representations, or natural data, that then can give rise to so much more.
— Andreas Blattmann
So when we started the company, we looked at the field and we said, "There's clearly a need for a next generation of image models, because so far the models cannot, say, produce hands that actually have five fingers," right?
— Andreas Blattmann
The company basically applies its guardrails to everybody. So no matter who you are, how big you are, and how much money you've got, if you want us to remove our guardrails, sorry. Those guardrails apply to everybody equally, because being a standard and being infrastructure that people can rely on means you don't treat different people differently.
— Anjney Midha
High-quality AI-generated summary created from a speaker-labeled transcript.