Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence
Stanford Online

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai
To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In this CS153 “Frontier Systems” session, Anjney Midha welcomes Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, for a discussion on the visual intelligence frontier and how frontier AI “factories” scale. Blattmann recounts his path from mechanical engineering to a Heidelberg PhD lab, where he helped develop latent diffusion to train image generators efficiently, which enabled Stable Diffusion’s 2022 release. They contrast earlier unimodal content-creation models with today’s push toward unified multimodal systems spanning images, video, and audio, plus action prediction for computer use and robotics, emphasizing observation and interaction loops. Using Flux as a case study, they cover pre-training, mid-training, post-training, distillation for speed, customer feedback driving image editing and character consistency, and why open weights enable customization. They also discuss Self Flow for multimodal alignment, safety guardrails, EU compliance, data labeling strategies, diffusion vs autoregressive tradeoffs, and skepticism about explicit 3D representations.

Guest Speaker: Andreas Blattmann is the co-founder of Black Forest Labs (BFL), the German generative AI startup behind the FLUX text-to-image foundation model, backed by Andreessen Horowitz and other major venture firms. Before founding BFL, he was a generative AI researcher at LMU Munich, NVIDIA, and Stability AI, where he made significant contributions to image and video generation. He is a co-inventor of Latent Diffusion, the generative modeling technique that produced the open-source text-to-image system Stable Diffusion (which he co-developed) and now powers cutting-edge models, including FLUX, Midjourney, and OpenAI's DALL-E 3, with applications extending into audio generation and medical imaging. His academic publications have amassed over 22,000 citations. He was named to Capital Magazin's Top 40 Under 40 in Germany in 2024.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Anjney Midha (host), Andreas Blattmann (guest)
May 4, 2026 · 1h 1m

CHAPTERS

  1. 0:07 – 4:29

    Course framing: frontier AI flywheels and today’s “visual intelligence” factory visit

    Anjney sets context for CS153’s recurring framework: rewriting the AI stack, scaling loops, and key bottlenecks (context, compute, capital, culture). He positions the session as a “field trip” into Black Forest Labs (BFL) to understand how a frontier visual-model team iterates from SOTA releases into expanding capabilities.

    • Frontier progress as repeated flywheel loops: compute → data → model → inference → revenue/feedback
    • Bottlenecks: context, compute, capital, culture
    • Incubation → SOTA release → capability expansion as a common project/company arc
    • BFL introduced as a frontier “factory” for visual intelligence
    • Goal of the talk: anatomy of visual intelligence + BFL’s evolution + future frontiers
  2. 4:29 – 7:16

    Andreas Blattmann’s path: from mechanical engineering to generative vision research

    Andreas recounts entering AI via robotics/coding, then a PhD track in Heidelberg where he met his future co-founders. He describes the 2019-era computer vision landscape as niche, dominated by GAN-based image generation and limited by compute and resolution.

    • Transition from mechanical engineering to CS/robotics and then research
    • Heidelberg lab origins and meeting co-founders Robin and Patrick
    • 2019 vision research context: niche field, GANs (e.g., StyleGAN) as dominant approach
    • Compute scarcity as a forcing function for algorithmic efficiency
    • Early focus: how to train models to generate pixels effectively
  3. 7:16 – 8:55

    Latent diffusion: compressing pixels to win with less compute (and the road to Stable Diffusion)

    Andreas explains the core insight: images and videos are high-dimensional and redundant, so training directly in pixel space is wasteful. The researchers who later founded BFL developed latent generative modeling, which learns a perceptually equivalent compressed representation; this yielded orders-of-magnitude efficiency gains and led to latent diffusion and Stable Diffusion.

    • Images/video are much higher dimensional than text; compute costs dominate
    • Learned compression ("learned JPEG") to create perceptually equivalent latents
    • Train generative models in latent space for efficiency and scalability
    • Latent diffusion as the algorithmic breakthrough behind Stable Diffusion
    • Open-source compute/community support as a catalyst for training and release
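
To make the compression argument concrete, the sketch below counts how many numbers a generative model has to handle in pixel space versus in a compressed latent space. The 8× spatial downsampling factor and the 4 latent channels are illustrative assumptions (they match figures commonly cited for the first Stable Diffusion autoencoder, but the talk summary does not state them), and the function names are hypothetical.

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4) -> tuple:
    """Shape of the latent under the ASSUMED 8x downsampling and 4 channels."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)


def compression_ratio(height: int, width: int, channels: int = 3,
                      downsample: int = 8, latent_channels: int = 4) -> float:
    """How many times fewer values the latent holds than the RGB image."""
    pixel_values = num_values(height, width, channels)
    lat_h, lat_w, lat_c = latent_shape(height, width, downsample, latent_channels)
    return pixel_values / num_values(lat_h, lat_w, lat_c)


if __name__ == "__main__":
    h, w = 1024, 1024
    print("pixel values: ", num_values(h, w, 3))
    print("latent shape: ", latent_shape(h, w))
    print("values ratio: ", compression_ratio(h, w))
```

Under these assumed settings, every image the generative model sees has 48 times fewer values than its RGB original. The sketch counts tensor sizes only; it says nothing about training cost, which also depends on the model architecture.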
  4. 8:55 – 11:41

    The Stable Diffusion inflection point: when generative vision became mainstream-legible

    Anjney describes the moment Stable Diffusion made generative modeling understandable outside ML due to its visual immediacy. He contrasts prior “language-is-all” dogma with the computer-vision view that language alone is incomplete for intelligence and learning.

    • Stable Diffusion vs DALL·E-era perception shift; viral user examples (image-to-image)
    • Visual outputs make capability progress legible to broader audiences
    • Critique of “language as the be-all and end-all” framing of intelligence
    • Visual thinking and multiple intelligences as motivation for multimodal systems
    • Early industry collaboration interest (e.g., bringing capabilities to Discord)
  5. 11:41 – 14:52

    Natural vs unnatural representations: why video/audio matter for foundational intelligence

    Andreas differentiates natural signals (video/audio from physical processes) from human-invented symbolic compression (text). He argues that higher intelligence should be built from first principles of human learning: observation of natural modalities, then interaction with the world.

    • Natural representations: vision and sound rooted in physical reality
    • Text as human-made: higher information density per symbol; less redundancy
    • Motivation for compressing images/videos before generative modeling
    • Human developmental learning: observe (vision/audio) before reading/writing
    • Thesis: start with natural modalities, don’t bolt them onto language later
  6. 14:52 – 19:16

    From unimodal content creation to unified multimodal models and physical AI

    The discussion contrasts earlier unimodal text-to-image systems (optimized for content creation) with today’s push toward unified multimodal models. Andreas emphasizes cross-modal correlations (e.g., sound + collision) as a key learning signal that improves understanding and broadens capabilities beyond artistry.

    • Stable Diffusion as unimodal, content-creation-first system
    • Shift to multimodal: images + video + audio + actions
    • Cross-modal grounding: correlations improve semantic/physical understanding
    • Broader applications: robotics, computer use, world modeling/simulation
    • Multimodality can also improve traditional tasks like image/video generation quality
  7. 19:16 – 21:05

    Bootstrapping the flywheel: focus, product wedge, and building Flux.1

    Anjney pulls the conversation into “how to start” with limited resources: pick a concrete wedge, ship a SOTA model, and use real users to close the loop. Andreas explains BFL’s decision to target a next-gen image model (e.g., fixing hands/fingers) and rapidly scale known recipes into Flux.1 with early large customers.

    • Importance of focus for any research project/company
    • Choosing a clear wedge: “10× better image model” vs spreading across modalities
    • Flux.1 as a rapid execution of a known training recipe
    • Early customers before public API to accelerate feedback loops
    • Feedback loop defines what problems matter and how to improve the model
  8. 21:05 – 24:13

    BFL’s training pipeline in practice: pre-training → mid-training → post-training → real-world feedback

    Andreas walks through BFL’s version of the “frontier factory” pipeline using Flux.1. He highlights how user behavior revealed demand for control and editing, motivating a targeted post-training step that became Flux.1 Kontext.

    • Pre-training: large text+image corpus (for Flux.1)
    • Mid-training: higher resolution and capability expansion
    • Post-training: offline alignment + distillation for efficiency before release
    • Real-world exposure generates feedback on failures and unmet needs
    • Discovery: users want more control than text prompts (e.g., character consistency)
  9. 24:13 – 31:36

    Flux.1 Kontext: solving character-consistent editing and out-iterating better-resourced labs

    Anjney emphasizes how quickly capability ceilings can move with the right data, leadership, and iteration cadence. The team reorganized rapidly after competitive releases, shipped Kontext in ~60 days, and saw strong revenue and major partnerships—illustrating how culture and calm execution sustain frontier momentum.

    • Problem: text prompts are ambiguous; editing/identity consistency was missing
    • Kontext enables scalable character-consistent image editing
    • Leadership lesson: don’t panic after competitor releases—map the frontier and iterate
    • Team reallocation + fast execution cycle (days/weeks)
    • Culture as a moat: debate → disagree → commit; unusual retention and cohesion
  10. 31:36 – 43:50

    From pixels to actions: adding interaction, verification, and robotics-ready learning loops

    The conversation shifts to extending visual models into systems that predict and condition on actions, enabling computer-use and robotics. Andreas frames pre/mid training as “observation,” with post-training as “interaction” through embodied systems that generate new data and enforce physical constraints.

    • Multimodal pre-training: images + video + audio for general representations
    • Mid-training adds context: conditioning on inputs (image/audio) and on actions
    • Action prediction enables computer-use agents and robotics policies
    • Post-training closes the loop via real-world interaction (robots/environment)
    • Physical constraints provide natural boundaries and a route to more verifiable learning
  11. 43:50 – 1:01:13

    Evals, safety, openness, and the next research bets (Self Flow, distillation, 3D vs video)

    In Q&A, Andreas covers safety guardrails, EU compliance, and the difficulty of evaluating aesthetics without human judgment—supporting the case for customizable/open models. He then discusses data labeling strategy across training phases, diffusion vs autoregressive tradeoffs and distillation, and argues for implicit spatial understanding learned from video rather than explicit 3D priors.

    • Safety/guardrails: filtering, EU AI Act compliance, deletion on request
    • Partner policy: guardrails apply equally; infrastructure mindset over exceptions
    • Labeling approach: noisy/automatic at scale early; higher-quality + human signals later
    • Diffusion/flow vs autoregressive: data efficiency vs inference efficiency; distillation to few steps
    • Self Flow: multimodal alignment losses to improve semantic understanding in generative models
    • Spatial intelligence debate: implicit 3D from video/audio + interaction vs explicit 3D representations
