Stanford CS153 Frontier Systems | Amit Jain from Luma AI on Unified Intelligence Systems

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In week three of CS153, the instructor hosts Amit Jain from Luma to discuss “Unified Intelligence Systems” as a follow-up to a prior lecture on visual intelligence. Jain recounts his Apple work on LiDAR for projects including Titan and Vision Pro, and how early exploration of generative models and differentiable 3D led to founding Luma with an initial focus on large-scale 3D capture. Luma then shifted to generative video in 2023 to leverage the scale of internet video data, releasing the Dream Machine model in March 2024 and rapidly reaching millions of users, while building preference-based feedback loops and human annotation pipelines. Jain explains Luma’s multimodal AI factory (pretraining, post-training, deployment, and reinforcement learning), its security constraints for studio clients, and a move toward unified transformer architectures that jointly reason across text, images, video, and audio to enable end-to-end creative and professional workflows.

Guest speaker: Amit Jain is the CEO and co-founder of Luma AI, a research lab developing multimodal foundation models aimed at "unified intelligence." Under his leadership, Luma has scaled from a 3D-capture pioneer into a leader in generative video, raising a $900M Series C following the success of its Dream Machine and Ray video-reasoning models. By 2026, he has steered the company into large-scale infrastructure projects including Project Halo, a 2-gigawatt AI supercluster, to build the next generation of "world models" capable of simulating physical reality. He founded Luma in 2022 after working at Apple, where he was a Systems and Machine Learning Engineer. At Apple, he led development of the Passthrough feature for Apple Vision Pro and was instrumental in integrating the first LiDAR sensors into the iPhone, foundational work for modern spatial computing. His background also includes physics and mathematical simulation.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

CS153 Instructor (host) · Amit Jain (guest)

May 6, 2026 · 57m

EVERY SPOKEN WORD

60 min read · 12,377 words

CS153 Instructor
Welcome, gang, to, uh, week three of CS153. We have today with us Amit Jain from Luma. Thank you for joining us, Amit. [laughs]
Amit Jain
Thanks for having me.
CS153 Instructor
Okay. [audience applauds and cheers] Uh, Amit is gonna be talking to us today about unified intelligence systems. You're gonna be hearing a lot more about this. I think it's a, a very relevant follow-up to the visual intelligence systems lecture we had last week from Andy Blattmann at Black Forest Labs. Um, quick recap on the class and today. We're gonna talk about Amit, and today we're gonna do a field trip into what I think is also one of the most exciting, uh, factories working on how to get work done, especially visual and creative work done in the world, called Luma. But before that, why don't we start by talking about Amit a little bit? I had the privilege to get to know Amit a few years ago when he was still an engineer at Apple. And, uh, I was at Discord at the time, and Amit-- I got an email from Amit saying, um, "Hey, I heard you have a bunch of 3D data."
Amit Jain
Yes.
CS153 Instructor
"Uh, can I have it?" [laughs]
Amit Jain
I remember that.
CS153 Instructor
And I said, "No, you can't."
Amit Jain
[laughs]
CS153 Instructor
Uh, because Discord had acquired the data. But I started asking Amit what-- why he needed the data. If you guys remember, um, I covered this in, in our first lecture, but Ubiquity6, the company I'd started about a decade ago, was a 3D computer vision mapping company, and we had, uh, we had millions of people around the world who were capturing the world in 3D using their smartphones. And all that data, we had terabytes of data, um, uh, that were 3D representations of the world that we'd reconstructed from 2D images. And Amit said, "Well, I wanna build, um, a 3D service that, uh, that is generative. I want to allow people to create gener-- the same kinds of meshes and point clouds and 3D representations of spaces, but through generative models because that's where the world is going." Um, and I s- got interested 'cause I kinda agreed with him, and he was ahead of the curve. And so I had a chance to invest as an angel investor at the time, and then a few years later, I had a chance to partner with Amit again at a16z when I was a general partner. And had-- Thank you for letting me lead your Series-
Amit Jain
B
CS153 Instructor
... B.
Amit Jain
Yeah.
CS153 Instructor
Um, Amit was also one of the first customers of the a16z compute program called Oxygen, and actually helped name Oxygen as well. Um, I think the quote was, he said something like, you know, "If we don't have compute on day one-"
Amit Jain
Let's-
CS153 Instructor
"... can't really breathe."
Amit Jain
Suffocate. Yeah.
CS153 Instructor
So, um, tell us a little bit about what Luma is and how-- what were the dots that led from the insight at Apple that generative modeling was the future that led here?
Amit Jain
My background, b- very briefly. So at Apple, I was working on, uh, first the LiDAR systems that actually now is on our iPhones. Uh, this was called the Jasper sensor if any o- any of you are familiar. And we were trying to build, um-- We were trying to actually build like, you know, what comes after the c- after the camera. This sensor was built, now I can talk about it because, you know, the project is no more for the car, [chuckles] uh, which was called Titan. And we started to work on Vision Pro after that because, you know, the car project got, got canceled, and the Vision Pro had, had a bunch of LiDARs on it. And during that work, it started to become obvious that like, okay, you know, um, the computers of the future, uh, we still don't know what they will look like. Uh, you know, maybe they will have AI or what- whatnot. The computers of the future will need very different interfaces, will need very different kind of media, and will need very different kind of, of ways of actually capturing and creating and building those things into the system. So in 2020, uh, at Apple, we started exploring generative models. Um, and, and think about it, it's 2020, so, you know, before language model scaling was known to be working and before, um... Actually, it was before DALL-E, but NeRF had already come out from Matthew Tancik from Berkeley. So we started to explore those generative systems and, uh, that led me to thinking that, okay, if language scaling is working and here is, is a method where we-- differentiable 3D is possible, what would happen if all of these things are combined together, right? That would basically mean you have the full footprint of every observation in the universe, and you will be able to like, you know, differentiably learn about them. If you can differentiably learn about them, you can understand them, and then finally you can generate them. So that was the genesis of Luma. 
And at that time, because of, of the pedigree we had, 3D seemed like the most logical way of going forward because first of all, 3D tells you-- 3D has a lot more information than images do. Uh, naively, we assumed at the time 3D has a lot more information than videos do as well, and that 4D would be very easy to capture and scale. But again, I say naively because as you will learn in, in a few seconds that that was a bad assumption. But that's kind of where we started, with the idea of building what we now call a world simulator. Uh, at that time it was just like, all right, like, you know, if we can learn this and generate this, we would have something that would allow us world understanding.
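[Editor's sketch] One way to see why NeRF-style methods count as "differentiable 3D": the color of a pixel is a smooth function of the densities and colors sampled along a camera ray, so it can sit inside a training loop. The code below is a minimal, single-channel illustration of the standard front-to-back volume-rendering sum. It is not Luma's or the NeRF authors' implementation, and the names `render_ray` and `finite_difference` and all sample values are hypothetical.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: non-negative density at each sample
    colors:    single-channel color at each sample
    deltas:    distance covered by each sample
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel


def finite_difference(densities, colors, deltas, index, eps=1e-6):
    """Numerically estimate d(pixel)/d(density[index])."""
    bumped = list(densities)
    bumped[index] += eps
    base = render_ray(densities, colors, deltas)
    return (render_ray(bumped, colors, deltas) - base) / eps


# Hypothetical samples along one ray.
densities = [0.5, 2.0, 0.1]
colors = [0.2, 0.9, 0.4]
deltas = [0.1, 0.1, 0.1]

pixel = render_ray(densities, colors, deltas)
slope = finite_difference(densities, colors, deltas, index=1)
```

Because a small change in any density produces a small, well-defined change in the pixel, a loss comparing rendered pixels to photographs can be minimized by gradient descent. Real systems compute that gradient with automatic differentiation rather than finite differences.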
CS153 Instructor
And c- you, you talked about-- You, you said this phrase, which is important, you know, l-learn the world in a differentiable manner.
Amit Jain
Yeah.
CS153 Instructor
What does that mean?
Amit Jain
Right. So I mean, i-if you're-- I-I'm sure you guys are all familiar with like, you know, how transformers work and how AI models work. Differentiable means you can put it in a training loop and, um, you can have a loss function that can be then iteratively optimized. So differentiable allows you to do that. If the function is non-differentiable, then like, you know, you j- really can't do gradient descent on it. And, uh, if you can't do gradient descent on it, then deep learning doesn't work. So the tools that we have for this era, for this generation, is basically compute and gradient descent. Um, and yes, transformers are things that are very, very well susceptible to gradient descent, but the actual, you know, thing underneath it is gradient descent and compute. So how can we take a lot of data, a lot of compute, and gradient descent and produce something useful out of it? Differentiability is the core characteristic of that problem, those problems, basically.
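[Editor's sketch] The loop described above (a differentiable loss, iteratively optimized by gradient descent) can be illustrated with a one-parameter model. This is a toy example under stated assumptions, not anything from Luma's training stack: the model `y = w * x`, the data, the learning rate, and the function names are all hypothetical.

```python
def loss(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(w, xs, ys):
    """Derivative of the loss with respect to w.

    This exists only because the loss is differentiable in w.
    """
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, lr=0.05, steps=200):
    """Plain gradient descent: repeatedly step against the gradient."""
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
    return w


# Toy data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = fit(xs, ys)
final_loss = loss(w, xs, ys)
```

If the loss were a step function of `w`, `grad` would be zero almost everywhere and the loop would never move, which is the sense in which non-differentiable functions defeat gradient descent.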
CS153 Instructor
Yep. That's helpful. Could you just connect the dots on how, what that insight-
Amit Jain
Yeah
CS153 Instructor
... led to then what-
Amit Jain
Right
CS153 Instructor
... Luma's doing today?
Amit Jain
So we started-- When we started the company, the idea was we will, we'll, uh, you know, capture an, uh, ungodly amount of 3D data, build a flywheel that allows people to capture that and like, you know, for us to be able to use it and then like, you know, build, build both simulation systems with it. So we released an app, uh, which is called Luma 3D Capture. It actually was very, very popular because, one, the results were really, really great. It was for the first time that NeRF and Gaussian splats were productionized. And Matthew, uh, you know, he joined our team actually to really push forward the, the frontier of, of that sort of the world. But very soon we realized that it doesn't matter how many people use the app, it will never reach the scale that was necessary to learn enough about the universe.
CS153 Instructor
Why is that?
Amit Jain
Because think about it, right? The number of people that are writing on the internet, that are taking photos on the internet, that are, are, are capturing videos on the internet, substantially outpaces anything one company can actually distribute. Also, there's like, you know, decades and decades and decades of that information that is already available. So it's all about data. It actually... You, you can make the case that like, you know, this particular modality of data is better for learning versus this or versus that. It really doesn't matter. That's a moot point because you're running against the physics of scale. So wherever there is scale in data, that's the only thing that's gonna work.

Episode duration: 57:41


Transcript of episode 6nUl_w5W9Wk
