Stanford Online

Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence

For more information about Stanford's online Artificial Intelligence programs, visit https://stanford.io/ai. To follow along with the course schedule and syllabus, visit https://cs153.stanford.edu/.

In this CS153 "Frontier Systems" session, Anjney Midha welcomes Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, for a discussion of the visual intelligence frontier and how frontier AI "factories" scale. Blattmann recounts his path from mechanical engineering to a Heidelberg PhD lab, where developing latent diffusion made it efficient to train image generators and enabled Stable Diffusion's 2022 release. They contrast earlier unimodal content-creation models with today's push toward unified multimodal systems spanning images, video, and audio, plus action prediction for computer use and robotics, emphasizing observation and interaction loops. Using FLUX as a case study, they cover pre-training, mid-training, post-training, distillation for speed, customer feedback driving image editing and character consistency, and why open weights enable customization. They also discuss Self Flow for multimodal alignment, safety guardrails, EU compliance, data labeling strategies, diffusion vs. autoregressive tradeoffs, and skepticism about explicit 3D representations.

Guest Speaker: Andreas Blattmann is the co-founder of Black Forest Labs (BFL), the German generative AI startup behind the FLUX text-to-image foundation model, backed by Andreessen Horowitz and other major venture firms. Before founding BFL, he was a generative AI researcher at LMU Munich, NVIDIA, and Stability AI, where he made significant contributions to image and video generation. He is a co-inventor of latent diffusion, the generative modeling technique that produced the open-source text-to-image system Stable Diffusion (which he co-developed) and now powers cutting-edge models including FLUX, Midjourney, and OpenAI's DALL-E 3, with applications extending into audio generation and medical imaging. His academic publications have amassed over 22,000 citations, and he was named to Capital Magazin's Top 40 Under 40 in Germany in 2024.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Host: Anjney Midha · Guest: Andreas Blattmann
May 4, 2026 · 1h 1m · Watch on YouTube ↗

CHAPTERS

  1. Course framing: frontier AI “factory” flywheel and bottlenecks

    Anjney sets the context for the talk using CS153's recurring framework for frontier progress: scaling loops, bottlenecks, and how teams iteratively "manufacture intelligence." He previews the session's structure: visual intelligence basics, how Black Forest Labs (BFL) bootstrapped FLUX, and the open problems ahead.

  2. Andreas Blattmann’s path: from mechanical engineering to latent diffusion and Stable Diffusion

    Andreas shares his origin story: switching from mechanical engineering into CS/robotics, then a PhD lab in Heidelberg focused on representation learning for vision. He describes competing with larger labs by developing compute-efficient methods, culminating in latent diffusion and later Stable Diffusion’s 2022 release.

  3. Why visual generation became an inflection point for mainstream adoption

    Anjney recounts the moment generative images became legible to non-ML audiences, using early Stable Diffusion examples (e.g., kids’ drawings turned into polished art). He contrasts the then-dominant “language is intelligence” dogma with the CV community’s belief that visual understanding is foundational and incomplete without natural modalities.

  4. Natural vs human-made representations: why images/video need different treatment

    Andreas explains “natural representations” (video/audio) versus “unnatural” or human-made ones (text). Because natural signals contain heavy redundancy while text is evolutionarily compressed for efficient communication, vision systems benefit from compression/latent spaces and multimodal learning grounded in how humans learn early in life.

  5. From unimodal content creation to unified multimodal “visual intelligence”

    The discussion contrasts earlier text-to-image systems (primarily for creative content) with today’s push toward unified multimodal models (image/video/audio) enabling robotics, computer use, simulation, and better content creation. Andreas highlights how correlations across modalities (e.g., collisions and sound) improve semantic understanding.

  6. Bootstrapping BFL’s flywheel: choosing a narrow wedge (10× better images)

    Andreas describes how BFL started with focus: despite a broader multimodal thesis, they picked a specific near-term wedge—building a next-generation image model that was dramatically better than existing ones. They leveraged prior experience (Stable Diffusion era) and quickly shipped FLUX.1, then used early customer feedback to drive iteration.

  7. BFL’s training pipeline in practice: pre-training → mid-training → post-training

    They map the generic “frontier factory” pipeline to BFL’s implementation for FLUX.1. Pre-training uses large-scale text+image; mid-training adds capability and resolution; post-training includes distillation for efficiency and alignment based on expected user needs—then iteration based on real-world feedback.

  8. From prompting to editing: character consistency drives FLUX.1 Kontext

    User behavior showed that text prompts were too ambiguous and that people wanted precise control, including character consistency via LoRA workflows. BFL responded by building an editing-focused model (FLUX.1 Kontext), enabling reliable identity/character preservation and scalable image editing—unlocking major product value.

  9. Operational resilience as a moat: reacting to competitors without panicking

    Anjney highlights a key organizational lesson: frontier ML progress is as much a human systems problem as a technical one. When competitors ship impressive releases, BFL focuses on calm assessment, reallocating resources quickly, and iterating—turning uncertainty into a repeatable capability upgrade process.

  10. Beyond content creation: adding actions, interaction, and real-world feedback loops

    They extend the pipeline to multimodal “natural data” and action prediction, arguing that observation-only training is insufficient for higher intelligence. By conditioning models on actions and deploying them to interactive settings (e.g., computer use, robotics), systems can generate new data and learn under physical constraints—improving verification and robustness.

  11. Verification and evaluation: aesthetics vs physical constraints

    They contrast hard-to-verify aesthetic quality (subjective, audience-dependent) with the more verifiable nature of physical interaction (robots can or cannot perform actions). Image quality improvements often require large-scale human judgment; physical deployment provides inherent constraints and clearer success/failure signals.

  12. Why open weights matter: customization, preference diversity, and sustainable business models

    Anjney explains open vs closed as a pragmatic delivery tactic, not just ideology. When preferences vary widely across cultures, products, or platforms, open weights enable downstream customization—creating strong value for large partners—while still supporting a sustainable business via APIs and licensing.

  13. Self Flow and multimodal representation alignment: moving beyond “pixel generators”

    Andreas introduces Self Flow as a method to align generative model internals with semantic representations, extending prior single-modality alignment work to multimodal settings. The goal is to avoid models that only mimic pixels and instead build representations that capture meaning across modalities—critical for unified multimodal intelligence.

  14. Q&A: safety, privacy, partners, labeling, diffusion vs autoregressive, and 3D debate

    The closing Q&A covers operational safeguards (content filters, EU compliance, deletion requests), BFL’s stance on partner guardrails, and pragmatic labeling strategies (noisy-at-scale → high-quality human signals later). Andreas also compares diffusion/flow-matching and autoregressive models (data vs inference tradeoffs, distillation) and argues for implicit spatial understanding from video/audio over explicit 3D representations, with Anjney adding nuance from 3D-mapping experience.
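The latent diffusion idea referenced in chapters 2 and 4 is that the diffusion process runs on a compressed latent rather than on raw pixels, which is what made training image generators compute-efficient. A minimal numpy sketch of that idea (the averaging "encoder," the latent size, and the noise-schedule value are illustrative assumptions, not BFL's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Stand-in encoder: 4x spatial downsampling by block averaging.
    # A real system uses a learned autoencoder, but the effect is the same:
    # the diffusion model works on far fewer values than the pixel grid.
    h, w = x.shape
    return x.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

x = rng.random((64, 64))   # toy "image"
z = encode(x)              # latent: 16x fewer values, so diffusion is cheaper

# Forward noising step at an assumed point on the schedule:
alpha_bar = 0.3
eps = rng.normal(size=z.shape)
z_t = np.sqrt(alpha_bar) * z + np.sqrt(1 - alpha_bar) * eps

# A denoiser network would be trained to predict eps from (z_t, timestep);
# generation reverses the noising in latent space, then decodes to pixels.
print(x.size, z.size)  # 4096 256
```

The compute saving is simply the 16x reduction in the number of values the denoiser must process per step, which compounds over every training iteration and sampling step.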
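Chapter 8 mentions LoRA workflows for character consistency. The core LoRA mechanic is a frozen pretrained weight plus a trainable low-rank update, so only a small number of parameters need fine-tuning per character or style. A minimal numpy sketch (the dimensions, rank, and scaling here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2          # toy sizes; rank r << d_in in practice

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero at init
                                        # so the adapter starts as a no-op

def lora_forward(x, scale=1.0):
    # Effective weight is W + scale * (B @ A); only A and B are trained,
    # which is 2*r*d parameters instead of d*d.
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(1, d_in))
# At initialization the output matches the frozen base model exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because the adapter is additive and small, many character-specific adapters can be swapped in and out of one open-weight base model, which is part of why open weights matter for the customization use cases discussed in chapter 12.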
