Stanford Online
Stanford CS153 Frontier Systems | Amit Jain from Luma AI on Unified Intelligence Systems
- 0:10 – 0:52
Course setup and why “unified intelligence systems” matters this week
- CICS153 Instructor
Welcome, gang, to, uh, week three of CS153. We have today with us Amit Jain from Luma. Thank you for joining us, Amit. [laughs]
- AJAmit Jain
Thanks for having me.
- CICS153 Instructor
Okay. [audience applauds and cheers] Uh, Amit is gonna be talking to us today about unified intelligence systems. You're gonna be hearing a lot more about this. I think it's a, a very relevant follow-up to the visual intelligence systems lecture we had last week from Andy Blattmann at Black Forest Labs. Um, quick recap on the class and today. We're gonna talk about Amit, and today we're
- 0:52 – 2:56
How Luma’s origin story started with a blunt ask: “Can I have your 3D data?”
- CICS153 Instructor
gonna do a field trip into what I think is also one of the most exciting, uh, factories working on how to get work done, especially visual and creative work done in the world, called Luma. But before that, why don't we start by talking about Amit a little bit? I had the privilege to get to know Amit a few years ago when he was still an engineer at Apple. And, uh, I was at Discord at the time, and Amit-- I got an email from Amit saying, um, "Hey, I heard you have a bunch of 3D data."
- AJAmit Jain
Yes.
- CICS153 Instructor
"Uh, can I have it?" [laughs]
- AJAmit Jain
I remember that.
- CICS153 Instructor
And I said, "No, you can't."
- AJAmit Jain
[laughs]
- CICS153 Instructor
Uh, because Discord had acquired the data. But I started asking Amit what-- why he needed the data. If you guys remember, um, I covered this in, in our first lecture, but Ubiquity6, the company I'd started about a decade ago, was a 3D computer vision mapping company, and we had, uh, we had millions of people around the world who were capturing the world in 3D using their smartphones. And all that data, we had terabytes of data, um, uh, that were 3D representations of the world that we'd reconstructed from 2D images. And Amit said, "Well, I wanna build, um, a 3D service that, uh, that is generative. I want to allow people to create gener-- the same kinds of meshes and point clouds and 3D representations of spaces, but through generative models because that's where the world is going." Um, and I s- got interested 'cause I kinda agreed with him, and he was ahead of the curve. And so I had a chance to invest as an angel investor at the time, and then a few years later, I had a chance to partner with Amit again at a16z when I was a general partner. And had-- Thank you for letting me lead your Series-
- AJAmit Jain
B
- CICS153 Instructor
... B.
- AJAmit Jain
Yeah.
- CICS153 Instructor
Um, Amit was also one of the first customers of the a16z compute program called Oxygen, and actually helped name Oxygen as well. Um, I think the quote was, he said something like, you know, "If we don't have compute on day one-"
- AJAmit Jain
Let's-
- CICS153 Instructor
"... can't really breathe."
- AJAmit Jain
Suffocate. Yeah.
- CICS153 Instructor
So, um, tell us a little bit about what Luma is and how-- what were the dots that led from the insight at Apple that generative modeling was the future that led here?
- 2:56 – 5:07
From Apple LiDAR to “world simulators”: the technical and product motivation
- AJAmit Jain
My background, b- very briefly. So at Apple, I was working on, uh, first the LiDAR systems that actually now is on our iPhones. Uh, this was called the Jasper sensor if any o- any of you are familiar. And we were trying to build, um-- We were trying to actually build like, you know, what comes after the c- after the camera. This sensor was built, now I can talk about it because, you know, the project is no more for the car, [chuckles] uh, which was called Titan. And we started to work on Vision Pro after that because, you know, the car project got, got canceled, and the Vision Pro had, had a bunch of LiDARs on it. And during that work, it started to become obvious that like, okay, you know, um, the computers of the future, uh, we still don't know what they will look like. Uh, you know, maybe they will have AI or what- whatnot. The computers of the future will need very different interfaces, will need very different kind of media, and will need very different kind of, of ways of actually capturing and creating and building those things into the system. So in 2020, uh, at Apple, we started exploring generative models. Um, and, and think about it, it's 2020, so, you know, before language model scaling was known to be working and before, um... Actually, it was before DALL-E, but NeRF had already come out from Matthew Tancik from Berkeley. So we started to explore those generative systems and, uh, that led me to thinking that, okay, if language scaling is working and here is, is a method where we-- differentiable 3D is possible, what would happen if all of these things are combined together, right? That would basically mean you have the full footprint of every observation in the universe, and you will be able to like, you know, differentiably learn about them. If you can differentiably learn about them, you can understand them, and then finally you can generate them. So that was the genesis of Luma. 
And at that time, because of, of the pedigree we had, 3D seemed like the most logical way of going forward because first of all, 3D tells you-- 3D has a lot more information than images do. Uh, naively, we assumed at the time 3D has a lot more information than videos do as well, and that 4D would be very easy to capture and scale. But again, I say naively because as you will learn in, in a few seconds that that was a bad assumption. But that's kind of where we started, with the idea of building what we now call a world simulator. Uh, at that time it was just like, all right, like, you know, if we can learn this and generate this, we would have something that would allow us world understanding.
- 5:07 – 6:10
What “differentiably learn the world” actually means (and why it’s central)
- CICS153 Instructor
And c- you, you talked about-- You, you said this phrase, which is important, you know, l-learn the world in a differentiable manner.
- AJAmit Jain
Yeah.
- CICS153 Instructor
What does that mean?
- AJAmit Jain
Right. So I mean, i-if you're-- I-I'm sure you guys are all familiar with like, you know, how transformers work and how AI models work. Differentiable means you can put it in a training loop and, um, you can have a loss function that can be then iteratively optimized. So differentiable allows you to do that. If the function is non-differentiable, then like, you know, you j- really can't do gradient descent on it. And, uh, if you can't do gradient descent on it, then deep learning doesn't work. So the tools that we have for this era, for this generation, is basically compute and gradient descent. Um, and yes, transformers are things that are very, very well susceptible to gradient descent, but the actual, you know, thing underneath it is gradient descent and compute. So how can we take a lot of data, a lot of compute, and gradient descent and produce something useful out of it? Differentiability is the core characteristic of that problem, those problems, basically.
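A minimal sketch of what "put it in a training loop" with a loss function and gradient descent can look like. The toy model `y = w * x`, the data, the learning rate, and the step count are all illustrative assumptions, not anything described in the talk.

```python
# Toy example: fit y = w * x by gradient descent on a mean squared error loss.
# Everything here (model, data, hyperparameters) is an illustrative assumption.

def loss(w, xs, ys):
    """Mean squared error of the linear model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Derivative of the loss with respect to w.

    This exists only because the loss is differentiable in w; without it
    there would be no direction in which to update the parameter.
    """
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, w=0.0, lr=0.05, steps=200):
    """Repeatedly step w against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the rule y = 2x
w_fit = train(xs, ys)      # ends up very close to 2.0
```

Deep learning frameworks compute `grad` automatically for models with billions of parameters, but the loop has the same shape: evaluate a loss, take its gradient, update, repeat.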
- CICS153 Instructor
Yep. That's helpful. Could you just connect the dots on how, what that insight-
- AJAmit Jain
Yeah
- CICS153 Instructor
... led to then what-
- AJAmit Jain
Right
- CICS153 Instructor
... Luma's doing today?
- 6:10 – 7:25
Luma’s first flywheel: productionizing NeRFs/Gaussian splats—and hitting a scaling wall
- AJAmit Jain
So we started-- When we started the company, the idea was we will, we'll, uh, you know, capture an, uh, ungodly amount of 3D data, build a flywheel that allows people to capture that and like, you know, for us to be able to use it and then like, you know, build, build both simulation systems with it. So we released an app, uh, which is called Luma 3D Capture. It actually was very, very popular because, one, the results were really, really great. It was for the first time that NeRF and Gaussian splats were productionized. And Matthew, uh, you know, he joined our team actually to really push forward the, the frontier of, of that sort of the world. But very soon we realized that it doesn't matter how many people use the app, it will never reach the scale that was necessary to learn enough about the universe.
- CICS153 Instructor
Why is that?
- AJAmit Jain
Because think about it, right? The number of people that are writing on the internet, that are taking photos on the internet, that are, are, are capturing videos on the internet, substantially outpaces anything one company can actually distribute. Also, there's like, you know, decades and decades and decades of that information that is already available. So it's all about data. It actually... You, you can make the case that like, you know, this particular modality of data is better for learning versus this or versus that. It really doesn't matter. That's a moot point because you're running against the physics of scale. So wherever there is scale in data, that's the only thing that's gonna work.
- CICS153 Instructor
Hmm.
- 7:25 – 10:27
Pivot to video: Hopper compute, Dream Machine, and the next data flywheel
- AJAmit Jain
And you have to design the algorithms around where the, the data is, not the other way around, right? You come up with some pristine algorithm, but you don't have any data, then like, you know, what's the point? Robotics is coming u-up against this problem right now. We're like, "All right, we're gonna build like, you know, these action systems," but well, where is the action data? There's no internet of action data. You can have huge labs in, in China and India and in Vietnam and, and, and everywhere gathering this data, but the scale is really c- not comparable. So you have to just design the systems around data. So that's what, that's what we learned. Um, so in 2023, after that realization and after, uh, you know, NVIDIA Hopper architecture was announced, uh, we started to build the foundations of, um, you know, generative video because video is three-dimensional. It has two dimensions of space and one dimension of time. And human brain actually like, you know, learns about 3D representation through that time proxy. So when Hopper architecture came about, we started to think like, "All right, it might be possible actually to learn video and to learn the world representation through video." So in 2023, uh, Jiaming joined us. Jiaming, uh, was, uh, you know, at NVIDIA at that point. He's a Stanford grad. And a, a few other people from Stanford and Berkeley, uh, started to join the company with this idea of like, "All right, let's learn from video." And we started to build that infrastructure. And in February-- Sorry, in March 2024, we released the first video model, uh, that was called Dream Machine.
- CICS153 Instructor
Yeah.
- AJAmit Jain
And, um, you know, in, in the first three weeks, four weeks, actually, we got up to, uh, uh, s- six million users from that because people had never seen generative video. Uh, Sora was announced but never released, so people had never experienced it, so people really wanted to actually try that out. So we started with video at that point. And then we have had the similar realization again in 2025, early 2025, uh, just like annual cycles now, that just video is not enough because video is good, but it doesn't pair human logic. It doesn't pair why an event is important. What is the sequence of events, and what does that actually lead to? Just having language models in the middle, uh, that are like, you know, u- being used for embedding is not sufficient. You need unified intelligence, so that's kind of where we are now.
- CICS153 Instructor
Yep.
- AJAmit Jain
These are the dots.
- CICS153 Instructor
Well, um, yeah, so th-this is not the first time the class has heard that when you close the loop, you have to, you have to sort of evolve the, the mid-training, the post-training pipeline, the interface. And so can we spend a little bit of time-- So I, I don't think it's a surprise for people to hear that there was sort of an iterative loop every year as you got more and more data from customers.
- AJAmit Jain
Yeah.
- CICS153 Instructor
But can we talk a little bit about that first... You know, the final projects for the class this time are the one person frontier lab, where they're going to be bootstrapping their own flywheels.
- AJAmit Jain
That's very cool.
- CICS153 Instructor
Um, and the first-- You know, before-- I, I remember, you know, how nerve-wracking it was for you and for the team whe-when, when you had the realization that video was gonna be the future, but you didn't have a video model out in the world yet.
- AJAmit Jain
Yeah.
- CICS153 Instructor
And you didn't have a state-of-the-art system to start collecting that, um, that context feedback loop.
- AJAmit Jain
Yeah.
- CICS153 Instructor
So let's, let's, let's take a bit of a journey back in time, time travel to, uh, the launch of Dream Machine 1.
- AJAmit Jain
Yeah.
- CICS153 Instructor
Can you just tell folks how you went about kickstarting or bootstrapping the, the video flywheel at Luma?
- 10:27 – 13:51
Bootstrapping the video flywheel: preference signals, trainers, and product telemetry
- AJAmit Jain
So I think the core problem that you wanna think about whenever you're building these really, really large systems, they have a wild distribution, right? Like, you know, if even if you're talking about language models, well, they have all of language models, uh, language model data, and what is good, what is bad, right? So you wanna think about, okay, from this really raw distribution I get from pre-training, how do I get to a model that humans can use? And what humans find useful is a very narrow band within that distribution, and that narrow band is not like, you know, a predictable, uh, linear band. It's just like, you know, pockets of, of, of greatness that humans think are great. Some other species might find it very different, right? But we have our own aesthetics. We have our own use cases. We have our own value system. So we find those, those distributions valuable. So now the question becomes: How do you find, or how do you basically t- get that distribution out of the model? So we started to think about that problem and, and with Dream Machine, the-- because there were so many users that were using the model, the question became, "All right, can we learn something about that?" And, and preference, uh, or like, you know, preference feedback, uh, at that time, by the way, um, SFT, right, like, you know, was just started to being thought about. RLHF was a hot thing where people were thinking about like, all right, like, you know, human feedback loops. So we built a system where, um, videos that people were liking and people were downloading, we considered that to be a signal of like, all right, this is something that people prefer. Um, it was not 100% accurate because some people were downloading really bad videos as a showcase of how bad AI is at video, right? So our model also learned a lot of that. So we had to then build systems for, uh, humans to be able to, uh, go and filter out, like, uh, people we pay. 
So then it started to emerge what a frontier lab actually looks like. A frontier lab has these components of data, these components of compute and algorithm, but it also has huge parts of, of what we call skills and trainers and tutors and people who are doing the labeling of data and all of these systems. Um- If you don't have that, then it's actually not complete. And a part of that is also the product you built. Can the product actually give you enough information to make sure that the next model is better than the previous one? And hence the experience is better, and hence more people will use it, and hence you'll get more data from it, uh, uh, about this preference of, of human distribution, and can you make the next model actually better? So, I mean, it took us a long time to learn actually how to gather that feedback, how to... You know, and then now the system we have, uh, in, in, in the latest Luma Agent system, ungodly amount of feedback actually we get from, from, um, um, what people are doing. Every interaction that is there, we learn from, like, you know, whether they like it, dislike it, in what way they like it, what way they dislike it, whether the full chain of thought, uh, that, that the model produced and the full chain of work that the model produced is any good. Which elements of that is not good? And then that's how you actually start to get good at it. Yeah.
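The feedback loop described above, where likes and downloads act as implicit preference signals and paid reviewers filter out the misleading ones, could be sketched roughly as follows. The `Generation` record, its field names, and the rule that only reviewer-approved items are kept are hypothetical choices for illustration; the talk does not describe Luma's actual schema or policy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Generation:
    """One generated video plus the feedback collected for it (hypothetical schema)."""
    video_id: str
    liked: bool = False
    downloaded: bool = False
    reviewer_ok: Optional[bool] = None  # None means no human has reviewed it yet

def implicit_positive(g: Generation) -> bool:
    """A like or a download counts as a noisy signal that the user preferred this output."""
    return g.liked or g.downloaded

def preference_set(generations):
    """Keep only items with an implicit positive signal AND explicit reviewer approval.

    Requiring approval is an assumption made here to handle the case mentioned
    in the talk: people downloading bad videos to show how bad AI video is.
    """
    return [g.video_id for g in generations
            if implicit_positive(g) and g.reviewer_ok is True]

log = [
    Generation("a", liked=True, reviewer_ok=True),
    Generation("b", downloaded=True, reviewer_ok=False),  # downloaded as a bad example
    Generation("c"),                                      # no signal at all
    Generation("d", downloaded=True),                     # signal, but not yet reviewed
]
kept = preference_set(log)
```

In this sketch only `"a"` survives: `"b"` is rejected by the reviewer, `"c"` has no signal, and `"d"` waits for review.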
- CICS153 Instructor
Well, let's, um... Why, why don't we do a double-click on how that-
- AJAmit Jain
Yeah
- CICS153 Instructor
... that actually works. So to remind everybody about the field trip we're about to take, right? Um, this is the, the very basic standard AI factory we've talked about, right? Frontier AI, um, sort of pipeline. We've got pre-training, mid-training, and then we have post-training and deployment. And so today, we're gonna hear a little bit from Amit on how the, the Luma version of this works. Why don't you go ahead and just k-kind of talk us through-
- AJAmit Jain
Yeah
- CICS153 Instructor
... what's, what's actually going on under the hood at Luma.
- 13:51 – 20:45
Inside the Luma Factory: multimodal pretraining and the push to end-to-end creative work
- AJAmit Jain
Absolutely. So, um, let me talk about what is informing the design decisions for our architecture and for our models. Um, currently, we are seeing huge amount of alpha coming from, from language models being used for, for adjacent tasks like coding, for adjacent tasks like, you know, system design and, and those kind of processes. But when we start to think about tasks that require more context than what is available in text, so creative work, right? Huge amount of things, a huge amount of information that is in, in visual domain, huge amount of information in auditory domain, actually huge amount of information in the trace of how you arrived at the final output, right? That. When we think about robotics, you can definitely start to build a, a robotic system just based on text models or VLMs or VLAs that people are starting to do now. But they will not generalize, just the same way that, like, you know, a-autonomy didn't-- uh, autonomous driving didn't generalize until people started to build full end-to-end systems that w- that had language, that had video, that had, uh, like, you know, all of the control signals, all of these things in there. So that's the problem we are coming up with, that the real world is way more complicated than coding, right? I mean, coding is a really valuable task, but not everything can be done in coding, right? Like, otherwise, programmers would be the only profession that would be left. Uh, uh, and now they are also, you know, endangered species, actually. But-
- CICS153 Instructor
I'm not sure that's true, but I understand your point.
- AJAmit Jain
[laughs]
- CICS153 Instructor
Yeah.
- AJAmit Jain
As a, as a programmer, uh, it's really fun.
- CICS153 Instructor
Well, the job has evolved, for sure-
- AJAmit Jain
That's right
- CICS153 Instructor
... to become a trainer-
- AJAmit Jain
Yeah
- CICS153 Instructor
... and a tutor. Yeah.
- AJAmit Jain
I-It's a really fun, fun time to be that way. Uh, I-I started pr- coding like, you know, when I was thir-thirteen years old, in order to build s-simulation systems, in order to do like, you know... Uh, so my background is in physics. And in order to actually build like, you know, simulation systems for, for electromagnetism and those kind of things to l- see how these systems behave. That's why-- when I learned, uh, to start coding. And even at that point, it was really obvious that I cannot teach those systems from any observations.
- CICS153 Instructor
Hmm.
- AJAmit Jain
Right? We can write the code, but that is like approximations that we have in our, in our models or in our, in our equations, but we can't actually teach those models from any data. So all of this is informing how we build our systems. So even early on, we started to think about, okay, in our pre-training, how can we learn from all of video, all of images, and all of text, right? It's a really hard problem because they're really different modalities, and they're expressed very differently. If, if you think about the encoding of these modalities, text is discrete, and text performs the best when you encode, encode it in a discrete manner. At least that is the understanding today. Video is kinda somewhere in between, and audio and images are b-be-best performed in, in a continuous space. So our factory, as you call it, is built around this idea of like, how do we learn jointly from all of these systems? In 2025, these were disparate towers that we built. Language tower, image tower, video tower, audio tower, and then like, you know, we would unify them together, um, using just like, you know, some, some fusion techniques so that like, you know, they will do better. Uh, if you look at like, you know, the work from, uh, Andy's lab, right? Like, you know, uh, Stable Diffusion, those kind of things. That's what it does as well, where you have a tiny little language component-
- CICS153 Instructor
Right
- AJAmit Jain
... um, and, and you learn embeddings from that to be able to understand the human instructions. It was just not sufficient. So when we talk to our customers, when they try to use our system, so where, where are systems being used, right? For instance, currently, large studios. So actually, I'm very, very, very excited about, um, a new show that is coming out on Prime Video.
- CICS153 Instructor
Hmm.
- AJAmit Jain
Uh, the trailer's out. Uh, it's called, um, um, Old Stories. Uh, it's a-about Moses, right? So it has, um, Sir Ben Kingsley is the star of it.
- CICS153 Instructor
Oh, cool.
- AJAmit Jain
Uh, it's, it's a proper production. It's not an AI video. Uh, it's a $4 million, uh, sorry, $4.5 million, uh, per episode production, basically. And it's all pretty much all produced using Luma Agents. So they're using it in these, like, really high intense situations where they wanna be able to model the whole world and the physics of the world and light and, and, and, and, uh, uh, fluid and interactions and all of these kind of things.
- CICS153 Instructor
Hmm.
- AJAmit Jain
Now, when you do that, it's just not sufficient to build an image model or a video model. You need a model that understands time and causality and, and language, right? And it, it understands, like, proper instructions. "Okay, well, like, you know, uh, um, this looks good, but what if like, you know, the, the shirt sleeves had like, you know, this particular thing right here?"
- CICS153 Instructor
Hmm.
- AJAmit Jain
How do you express that instru-instruction? "Okay, in time, when this person actually walks, uh, through the door, the whole scene explodes." All right, what does walking through the door actually means? When the person walks through the door, what does this explosion of the scene mean? Give me more instructions, right? The deeper you go into these kind of problems, and this is a very, very, very big market, right? It's about 120 million creatives in the world whose-- this is their job, right? Like, you know, these are not people who paint for a hobby or all these kind of things. These are people who actually are employed in this industry. So about, uh, you know, two times, three times by estimation of coders. Um, their work every day goes into replicating the physics of the real world-
- CICS153 Instructor
Mm
- AJAmit Jain
... into computers. So we wanna build systems for them, and if you wanna do that, you wanna build what we are now calling unified models that have the same understanding and intelligence of a language model that can follow context, that can remember, and the physical understanding and the world model understanding of video models and image models. So that is what the output, that is what the things we want to produce. This was 2025. [laughs] And now in 2026 when the models got really good, what people want to do is like they wanna do the full work end to end. You know, it's like, all right, why is it only producing five-second video for me, right? Why can't it make the whole shot? Uh, if you go to like, you know, w- w- people in advertising world, why can't it make the whole campaign? If you talk to robotics companies, why can't it actually produce the whole action and then judge its own outputs, and then tell me when this is the right action and incorrect action?
- CICS153 Instructor
Mm.
- AJAmit Jain
Like, you know, why can't I get the right force in all of these kind of problems? So people want end-to-end results. So now the Luma Factory is about building systems that can do end-to-end work-
- CICS153 Instructor
Hmm
- AJAmit Jain
... in multimodal domains. So that's, that's kind of what we do. We have massive reserves of like, you know, multimodal data, uh, i-in about-- Uh, the final trainable outputs are in, in about thirty petabytes of, uh, you know, um, scale. We train them on, on, um, currently H100s and very soon GB300, uh, uh, you know, GPUs, uh, in, i-in the 10K scale, basically. So pretty much the same as, as a second-tier language model training. Like, you know-
- CICS153 Instructor
Hmm
- 20:45 – 22:31
Enterprise deployment constraints: studio secrecy, training exclusions, and learning from traces
- CICS153 Instructor
And could you talk a little bit about, you know, when you started deploying these systems in, in, in the first lecture, we talked about mission-critical context, right?
- AJAmit Jain
Mm.
- CICS153 Instructor
And one, one type of mission-critical context is a large studio-
- AJAmit Jain
Yep
- CICS153 Instructor
... for whom their data is super sensitive.
- AJAmit Jain
Yep.
- CICS153 Instructor
They don't want... You know, they're happy to have you train their data, but, uh, with their data for them.
- AJAmit Jain
Yeah.
- CICS153 Instructor
But they don't-- If, if I'm running a studio, I don't want my data being used by another studio.
- AJAmit Jain
Yeah.
- CICS153 Instructor
So how, how do you, how did you navigate the deployment sort of restrictions-
- AJAmit Jain
Yeah
- CICS153 Instructor
... of these, of these professionals?
- AJAmit Jain
So we work with two arch nemesis at the same time, Netflix and, and Amazon Prime Studio, right? Which are the two giants of streaming war at the moment. Um, so basically then, then you have to build systems that are guaranteeing that there is no way that there's any data overlap. We have internal controls and systems that like, you know, are, are some of the standard ones like SOC 2 and those, those kind of things, and then specific ones that are for AI labs on how do you not train on this, on, on this data.
- CICS153 Instructor
Hmm.
- AJAmit Jain
So for instance, uh, if you're producing the next blockbuster, you don't want the next Iron Man, for instance, right, like to show up into the training data. So we have guarantees around that, like, all right, whenever certain stuff is marked or projects are marked, they will never show up in training data. They will never show up in, in any of these loops, basically. But we still learn from like, you know, what users are doing in the product-
- CICS153 Instructor
Hmm
- AJAmit Jain
... which is different from the, the visual artifacts that they're producing, but rather the traces they're producing. We, we're still able to use them and learn from them, actually.
- CICS153 Instructor
This is the interaction data-
- AJAmit Jain
That's right
- CICS153 Instructor
... when people are working with the interface of the agents.
- AJAmit Jain
Yeah. That's right.
- CICS153 Instructor
Okay.
- AJAmit Jain
So there's some limitations on, on these kind of high, higher sensitivity projects. Yeah.
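A rough sketch of the kind of exclusion rule discussed in this exchange: visual artifacts from projects marked as protected never reach training, while interaction traces are treated separately. The `Record` type, the `kind` labels, and the `allow_protected_traces` flag are hypothetical. The talk does not say whether traces from marked projects are used, so this sketch defaults to excluding them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """One candidate training item (hypothetical schema)."""
    project_id: str
    kind: str  # "artifact" for generated visuals, "trace" for interaction data

def training_eligible(records, protected_projects, allow_protected_traces=False):
    """Return the records that may enter a training loop.

    Artifacts from protected projects are always dropped. Traces from
    protected projects are dropped unless the policy flag allows them;
    that flag is an assumption, since the talk leaves this detail open.
    """
    eligible = []
    for r in records:
        if r.project_id in protected_projects:
            if r.kind == "trace" and allow_protected_traces:
                eligible.append(r)
            continue  # protected artifacts (and, by default, traces) are skipped
        eligible.append(r)
    return eligible

records = [
    Record("studio-film", "artifact"),
    Record("studio-film", "trace"),
    Record("hobby-clip", "artifact"),
    Record("hobby-clip", "trace"),
]
protected = {"studio-film"}
default_ok = training_eligible(records, protected)
with_traces = training_eligible(records, protected, allow_protected_traces=True)
```

Placing the check at data-selection time, rather than relying on later cleanup, is one way to make "will never show up in training data" a property of the pipeline itself.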
- CICS153 Instructor
Um, yeah, I think you have, you know, sort of a-
- AJAmit Jain
Yeah
- 22:31 – 25:15
Unified intelligence in action: generating polished slides from a mind map + style prompt
- CICS153 Instructor
... well, one, uh, could you talk a little about how you created these slides? 'Cause these, these-
- AJAmit Jain
Yeah
- CICS153 Instructor
... I believe were created with, with Uni1.
- AJAmit Jain
That's right.
- CICS153 Instructor
Is that right?
- AJAmit Jain
So, uh, these are, uh, you know, I, I basically gave it... Uh, actually, let me start from that first, and then I will actually, actually ta-ta-talk about unified models as well. So here, I created this, uh, like, you know, on the top what you see, I created that, um, mind map, whatever you wanna call it, in our product. And then I basically asked, if you see on the right, um, I asked it like, you know... And I also gave it Anj's slide, uh, that like, you know, the one right here. Sorry, not this one, but... Okay, I don't know. The first one you saw of the factory one.
- CICS153 Instructor
The factory slide. I see.
- AJAmit Jain
That's right.
- CICS153 Instructor
Yeah.
- AJAmit Jain
I gave it that, and I asked it like, "Hey, in this style, actually produce the outputs." Now, this is actually a very, very good example of what unified intelligence that I'm gonna talk about means. People, when they think about image models, video models, or, or any models, not text, they think they are just... They produce beautiful images, right?
- CICS153 Instructor
Mm.
- AJAmit Jain
But that is a really big mental gap that the world has in this area. Just like language models produce words, right? The words can be beautiful. You can just say like, "Hey, it's a poem," and it could mean nothing, right?
- CICS153 Instructor
Hmm.
- AJAmit Jain
And, and simultaneously, you can have a mathematical proof of Euler's problem number, pick your, uh, take your pick, right? 1152. They all are words at the end of the day, but how you string them together determines the information content and determines the informa- uh, the, the intelligence of those.
- CICS153 Instructor
Hmm.
- AJAmit Jain
Just like that, how you arrange the pixels determines what they're conveying and how, how intelligent they are. So unified models that we are producing now, and I'm gonna talk about that in a second, are about how you express intelligence in whatever medium is convenient for the person that they are actually, you know, who's using it. So if a language is, uh, a language output is convenient, fantastic. If it is slides and images, fantastic. If it's a video explainer, great. But they're all basically outputs that are intelligence.
- CICS153 Instructor
Hmm.
- AJAmit Jain
So that, that's what we call unified models. So yeah, basically, uh, it was one shot. Uh, it produced those slides. It produced one that I didn't like, and I deleted it, but- Before you ask me to take a screenshot of that.
- CICS153 Instructor
Yeah.
- AJAmit Jain
But that w- that was pretty much about it. If I would have asked it to do a very detailed overview of that, then that's what it, it would have done. So end-to-end work, this is what we call end-to-end work, right? You know?
- CICS153 Instructor
So ju- just to break down what happened-
- AJAmit Jain
Yeah
- CICS153 Instructor
... you gave it, you gave it my original slide as a prompt.
- AJAmit Jain
Yeah.
- CICS153 Instructor
A screenshot of that prompt.
- AJAmit Jain
Yeah.
- CICS153 Instructor
You then gave it instructions on the right-
- AJAmit Jain
Yeah
- CICS153 Instructor
... in the chat, and then you gave it a little bit, like, guidance, is, is it? That scaffolding?
- AJAmit Jain
Yeah, just my, my, my, my thoughts up there.
- 25:15 – 29:50
Why it’s hard: bridging understanding vs generation, and what “unified” architecture changes
- AJAmit Jain
Right. So I mean, that's a good segue into unified models, basically. So, um-
- CICS153 Instructor
Okay
- AJAmit Jain
... well, LLM, first of all, doesn't generate images, right? I mean, it's a language model.
- CICS153 Instructor
Hmm.
- AJAmit Jain
You can ask an LLM to use a computer and try to generate images, but again, it really falls apart because it doesn't see anything.
- CICS153 Instructor
Hmm.
- AJAmit Jain
So when it tries to reason spatially, when it tries to produce, like, you know, any visual outputs, they're blind models. They see everything as a, a, a full sequence, right? Like, you know, even the grid nature of, of, of images and visual information is not apparent to LLMs. So when you start to do VLMs, which are vision-language models, right? Like, you know, you start to teach them a little bit about image p- part of it, VLMs are still not generative. VLMs understand images, but VLMs can't generate images. So we have this world where, like, you know, you have understanding in language and, and generation of text, and then you have, uh, models like Flux, which are good at generating images-
- CICS153 Instructor
Right
- AJAmit Jain
... right? Uh, which are great models, by the way, right? But then they don't have any of this understanding.
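To make the "blind model" point concrete, here is a minimal sketch, not taken from the talk, of what happens when an image grid is flattened row by row into a single sequence. The function names and the row-major layout are assumptions chosen for illustration. The pixel to the right stays one position away, but the pixel directly below ends up a full row width away, so the grid structure is no longer visible from sequence position alone.

```python
def flatten_index(row: int, col: int, width: int) -> int:
    """Position of pixel (row, col) after row-major flattening of the grid."""
    return row * width + col


def neighbor_distances(width: int) -> tuple:
    """Sequence distance from pixel (1, 1) to its right and lower neighbors."""
    if width < 3:
        raise ValueError("width must be at least 3 so that (1, 2) is in the grid")
    here = flatten_index(1, 1, width)
    right = flatten_index(1, 2, width)   # spatially adjacent, same row
    below = flatten_index(2, 1, width)   # spatially adjacent, next row
    return right - here, below - here


if __name__ == "__main__":
    # In a 1024-pixel-wide image the lower neighbor is 1024 positions away.
    print(neighbor_distances(1024))
```

Both neighbors are equally close in the image, yet only one of them is close in the sequence.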
- CICS153 Instructor
Right.
- AJAmit Jain
Right? And I think Andy talked about that last time as well-
- CICS153 Instructor
Yes
- AJAmit Jain
... that, like, there's this big chasm in between these two things. Understanding is separate and, and, and language is separate-- oh, sorry-
- CICS153 Instructor
Generation
- AJAmit Jain
... generation is separate. But in language, that's not true. An LLM is good because it understands text and generates text all in one go.
- CICS153 Instructor
Hmm.
- AJAmit Jain
Right? There's no, there's no delta in between. There's no two models that are actually doing it. If we want to solve world understanding and, quote-unquote, "world models" that people are calling it, that's what we need to do.
- CICS153 Instructor
But, um, we've-- I mean, for at least about a year, I guess, we've had models that can generate language tokens and image tokens, right, with NanoBanana.
- AJAmit Jain
Right.
- CICS153 Instructor
But they, they were-- like, NanoBanana was still not able to generate... I, I, I remember trying to generate schematics-
- AJAmit Jain
Right
- CICS153 Instructor
... like this.
- AJAmit Jain
Uh-huh.
- CICS153 Instructor
I, I tried to generate the factory slide-
- AJAmit Jain
Yeah
- CICS153 Instructor
... with NanoBanana. I couldn't.
- AJAmit Jain
Okay.
- CICS153 Instructor
Why were the capabilities still not there with basic sort of like these jointly trained models?
- AJAmit Jain
So from what we know of Google's architecture, NanoBanana is still a fused architecture-
- CICS153 Instructor
Mm-hmm
- 29:50 – 34:52
The skills/tools/model stack: how Luma agents turn expert craft into reusable leverage
- AJAmit Jain
Uh, yeah. So actually, uh, let me talk about how we deploy these architectures, first of all. So this is what we are trying to build. If we wanted to do end-to-end work, this should be very familiar. Like, you know, if you've taken CS class, it's the REPL loop, read-eval-print loop. This is how computers work, have worked for a very, very long time. If you think about the von Neumann architecture, it is built around, like, you know, the REPL loop generally. It was not thought about at the time this way, but now we think about it this way. If you want to deploy models to not just produce, like, you know, text tokens or image tokens, but actually to do work, end-to-end work, how do you build these systems? So how do you do that REPL loop? One way is doing the left one, where, like, you know, there you have different models for each kind of things, and there's, like, two schools of thought. You produce federated models or, like, you know, you have this kind of like... tiny models that are each doing specialized work, and then you, you make them combi- or, or you just pass outputs from each other, and you probably have a judge model on top that, like, you know, judges and orchestrates all of that work. That's approach one. And approach two is that you have these, like, you know, mega models in the middle, [chuckles] um, where they have-- where they share this, like, you know, deep connective tissue, and they can reason in one single space. And you give them, you know, inputs, and you expect outputs of them. They're iterative models, so it's not like, you know, one shot all the outputs that are gonna come out. But we are betting on this second approach.
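The two approaches described above can be sketched roughly as follows. This is an illustrative toy, not Luma's implementation: `federated_pipeline`, `unified_loop`, and every callable passed to them are hypothetical names, and the "models" are plain Python functions standing in for real ones.

```python
from typing import Callable, List


def federated_pipeline(
    task: str,
    specialists: List[Callable[[str], str]],
    judge: Callable[[str], bool],
) -> str:
    """Approach one: specialized models hand outputs to each other, with a judge on top."""
    result = task
    for specialist in specialists:
        result = specialist(result)          # each stage only sees the previous output
    if not judge(result):
        raise ValueError("judge rejected the pipeline output")
    return result


def unified_loop(
    task: str,
    model: Callable[[List[str]], str],
    is_done: Callable[[str], bool],
    max_steps: int = 5,
) -> List[str]:
    """Approach two: one model iterates over a shared context in a read-eval-print loop."""
    context = [task]                         # read
    for _ in range(max_steps):
        output = model(context)              # eval: the model sees everything so far
        context.append(output)               # print: the output feeds the next iteration
        if is_done(output):
            break
    return context


if __name__ == "__main__":
    print(federated_pipeline("a", [lambda s: s + "b", lambda s: s + "c"], lambda s: True))
    print(unified_loop("t", lambda ctx: f"step{len(ctx)}", lambda out: out == "step2"))
```

The structural difference is what each stage can see: in the pipeline a stage gets only its predecessor's output, while in the loop the single model reasons over the whole accumulated context on every iteration.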
- CICS153 Instructor
Hmm.
- AJAmit Jain
And the reason is very simple, because we think intelligence is not this pipeline architecture problem. If you think about the systems of intelligence, the systems of intelligence don't look like, you know, this kind of big database problem. The systems of intelligence look more like the human brain, where you let information itself design the architectures and circuits inside it, like what we do during training, and hopefully very soon in continual learning, these circuits will change as we, as we are actually, uh, you know, using these models. And then you sort of step away from that. [chuckles]
- CICS153 Instructor
Hmm.
- AJAmit Jain
You manage context outside. You, you know, manage memory, sometimes outside, sometimes inside, like how you do with caches in, in CPUs today. But the actual processing unit are these unified models. So that is sort of our approach of how we, how we think about building them. And how we think of improving them is a little bit like this. So if you wanna think about, like, what is the computer of the future looks like, actually, what is every agent product today, uh, it's some version of this, basically. Like, you know, this is not a big revelation. This is how things are being built. So you have, like, you know, a tool harness in the middle. Uh, I'm gonna go from the middle up. This tool harness means systems that can use Linux, systems that can use, uh, you know, call APIs, all of these kind of things. But then how does it all work? How does it actually full work gets done? So you have this, like, fat stack of skills on the top. These are domain-specific understanding, right? So you wanna teach a robot, like, you know, how to assemble something, right? That's not a normal thing, right? Like, you know, if you wanna think about, like, how is an iPhone assembled, this is a very domain-specific thing. You can give it all that information. It doesn't need to be in the model. It doesn't even need to be in the tools. You give this information as context, and you can do this across huge amount of verticals, huge amount of, uh, like, you know, uh, different task in those verticals. Then you have tool harnesses, where you give it as general ca-- ability to call tools and, and things like that. And finally, orchestrating all of that and thinking through all of that is this unified model-
- CICS153 Instructor
Hmm
- AJAmit Jain
... at the bottom. That is interpreting all of this multimodal information, generating tool calls, understanding which skills to use, and producing the outputs. So this is how we think the architecture of the future o- of computers will look like, and this is what we have built the current product basically on. This is, this is basically built on this-
- CICS153 Instructor
Right
- AJAmit Jain
... kind of architecture. Yeah.
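As a rough illustration of the three layers just described (skills on top, a tool harness in the middle, a unified model at the bottom), here is a minimal sketch. It is not Luma's code: `toy_model`, `run_agent`, and the keyword matching are assumptions made purely for illustration, with a trivial function standing in for the unified model that chooses skills and emits tool calls.

```python
from typing import Callable, Dict, List, Tuple

Skills = Dict[str, str]                    # skill name -> expert-written guidance text
Tools = Dict[str, Callable[[str], str]]    # tool name -> callable the harness can run


def toy_model(request: str, skills: Skills, tools: Tools) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Stand-in for the unified model: picks skills and tool calls by keyword match."""
    words = request.lower()
    chosen_skills = [name for name in skills if name in words]
    tool_calls = [(name, request) for name in tools if name in words]
    return chosen_skills, tool_calls


def run_agent(request: str, skills: Skills, tools: Tools, model=toy_model) -> str:
    """Load chosen skills as context, execute tool calls, then assemble an output."""
    chosen_skills, tool_calls = model(request, skills, tools)
    context = [skills[name] for name in chosen_skills]          # skills layer: context only
    context += [tools[name](arg) for name, arg in tool_calls]   # tool harness layer
    return f"{request} | context: {'; '.join(context)}"


if __name__ == "__main__":
    skills = {"slides": "Use one idea per slide."}
    tools = {"ocr": lambda text: "ocr-result"}
    print(run_agent("make slides, run ocr on the screenshot", skills, tools))
```

Note that the skill is never compiled into the model or the tools; it is passed in as context, which matches the point that domain knowledge "doesn't need to be in the model."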
- CICS153 Instructor
So c- actually, could you just do a one-to-one mapping? So-
- AJAmit Jain
Yeah
- CICS153 Instructor
... here, where, uh, where was the harness? Where were the skills?
- AJAmit Jain
Yeah.
- CICS153 Instructor
Where was the model?
- AJAmit Jain
So actually, when it generated these slides, right, someone on our team who's really, really good at producing greatly designed slides wrote a, I don't know about, it's a 50-page document on what it means to design good slides, right? And if you see, ac- I don't know if the prompt is there. Um, I've got a clear picture. Now kick off planning and generation. Okay, so after this, it would have, uh, said like, "Oh, let me look up the skills I have, like, access to."
- CICS153 Instructor
Ah, so that was the skill.
- AJAmit Jain
That was the skill.
- CICS153 Instructor
That's a general purpose, um-
- AJAmit Jain
Slide skill
- CICS153 Instructor
... like, best-in-class slide creation skill-
- AJAmit Jain
Correct
- CICS153 Instructor
... that was created internally by a human and then uploaded for anybody else to use-
- AJAmit Jain
Exactly
- CICS153 Instructor
... automatically.
- AJAmit Jain
Exactly. Uh, so that's the skill layer. Then, uh, the model layer is obviously the one that is generating and, and generating the tool calls and all of these kind of things. And the tool layer here, so not many tools were necessary, but I, I, I think, like, you know, your image that, that you gave, that was also passed as context.
- CICS153 Instructor
Right.
- AJAmit Jain
And we probably ran OCR on it just to, like, you know, see, like, you know, what, what kind of things are. So this was not a very tool call heavy thing. But had you asked it to make an interactive webpage-
- CICS153 Instructor
Right
- AJAmit Jain
... that, like, you know, animates all of this stuff, then we'd have gone and called, uh, uh-
- CICS153 Instructor
A different skill
- 34:52 – 42:37
Business and market dynamics: capital intensity, enterprise adoption, and creative productivity
- CICS153 Instructor
Okay, I'm gonna ask you one last question before we switch, which is... Okay, so it took a couple years to put the whole system together-
- AJAmit Jain
Mm-hmm
- CICS153 Instructor
... which is a fairly high-scale system. Can you talk about the business for a sec? You announced earlier this year-
- AJAmit Jain
Yeah
- CICS153 Instructor
... I think you raised about a billion dollars.
- AJAmit Jain
$1.5.
- CICS153 Instructor
$1.5 billion.
- AJAmit Jain
Yeah, total.
- CICS153 Instructor
Yeah. Over your lifetime, Luma's raised about $1.5 billion. Of that, I think a billion was raised this, this-
- AJAmit Jain
This year
- CICS153 Instructor
... these last 12 months. Um, you know, it-- why does, why is this such a capital-intensive effort if it's not as high scale as language?
- AJAmit Jain
If you really wanna do it correctly, it is larger scale than language because it is strictly a superset of, like, you know, the work that is going on in language. But currently, we don't care as much about coding, for instance, so we don't have to spend that much effort towards it. We can go towards all the areas that language models are not good at, and that means we can actually have a subscale compute infrastructure, subscale data infrastructure, things like that, so it doesn't require 100 billion yet. Uh, like, you know, we can do with one billion what, like, you know, generally takes five, 10 billion annual run rate to be able to produce. Um, but if you think about it, like, where things are going, uh, you know, in one year, two year, three years' time, we believe that these systems will far surpass language systems-
- CICS153 Instructor
Hmm
- AJAmit Jain
... just because of the access to more data. More data is better, right? Just because of their understanding of more domains. So I'll give you an example. One of our customers who's using these systems, uh, they work in energy industry. Uh, you can guess who that is. Um-
- CICS153 Instructor
Right
- AJAmit Jain
... and now suddenly, like, you know, our systems have no idea about, uh, like, you know, grid systems. Like, all right, like, you know, how the energy grid actually works and, and, and how they wanna be able to do that. So what we did is we started to ingest their energy grid diagrams and energy grid code and all of these kind of things. And suddenly, our systems are better at producing schematics and planning than Anthropic's coding models are because they can't actually read all that information.
- CICS153 Instructor
Hmm.
- AJAmit Jain
They can't actually see like, you know, how the things are laid out, that sort of problems. It's a very small example. Um, studios have another big example where like, you know, yes, LLMs they have had forever, but a story is not just text. A story is all of the physical stuff that is happening. If it has visual understanding, it can do much better. So we believe like, you know, especially as the age of robotics comes about, you will need these systems to be general-
- CICS153 Instructor
Right
- AJAmit Jain
... and these systems to be able to do everything, including writing code, and, and that's kind of where we're gonna go. But today, this gives us a very great business where language models are not really playing. Um, currently, we are... Like, you know, when we started the company, I mean, we were very small. Today, we work with some of the largest studios in the world. Now, we work with the largest advertising agency in the world, Publicis. They're just deployment channels for us. We work with the second-largest brand in the world, Coke, who is moving three billion dollar of annual production of, of content to Luma, basically. And, um, in addition to that, like, you know, in, in, in some of the areas like how do you do work just in a company, how do you communicate information visually? The- there's starting to be like, you know, these new areas in which previously only designers and, and artists could work. Now, everyone is starting to do that work.
- CICS153 Instructor
Yeah. So th- this was... You know, you had an event earlier this year.
- AJAmit Jain
Mm-hmm.
- CICS153 Instructor
I mean, like I think it was three weeks ago-
- AJAmit Jain
Yeah
- CICS153 Instructor
... in SF, and I came by, and the thing that shocked me was that it was all artists and creatives, and y- I mean, you spoke for a little bit off the stage, but then they got you off the stage.
- AJAmit Jain
Yeah.
- CICS153 Instructor
And then a bunch of folks from Hollywood came by, a bunch of designers, and it was the first time I'd seen so many artists and creators, not, not like machine learning people-
- AJAmit Jain
Yeah
- CICS153 Instructor
... but creatives excited about using tools. W- w- why has... That, that's, and that's very new.
- AJAmit Jain
Mm-hmm.
- 42:37 – 44:59
Q&A: OpenAI Sora pause, focus as organizational physics, and what it signals for the market
- AJAmit Jain
So the question is: What is my hypothesis why Sho- Sora shut down? Whe- whether it's a business reason, it's an architecture reason. And two, what impact does it have on us in the industry, but also, like, you know, on creatives? So, I mean, I can only give you hypothesis. I don't know really what is happening inside, uh, uh, OpenAI. But, I mean, the, the, the one word here is really focus. OpenAI, at the core of it, is a large language model lab. What they do really, really well is produce models that are very good for chat particularly, right? Chat is a vertical that has, uh, about eight billion customers, right? Um, maybe not little kids, but if you... Maybe they too, right? Like, you know, because they wanna talk to a computer. So pretty much all of humanity is a good customer of chat. Executing on that is a really hard problem. Executing on anything at that scale, you need to go into the depths of hell to be like, you know, get everything working really, really well. When you do everything, that's really hard to do. I mean, Luma also had that problem actually, right? Like, you know, in early days when we, we were not really clear about how do we execute on this, so we tried a lot of parallel paths. But doesn't matter how much money you have, doesn't matter how much, how many people you have. Uh, this was also a lesson from Apple. There's, uh, way more things at Apple that they choose not to do than they choose to do, right? That is because it doesn't matter the money, doesn't matter the people, the organizational physics still come into play.
- CICS153 Instructor
Less is more.
- AJAmit Jain
Exactly. There's only so much attention you have as a company, not as a person, but as a company, that you can actually devote to making something. So OpenAI doing literally everything is not good for their business, and I think that is a realization that is setting in, and I think this will not be the last thing that they have actually canceled, right? There might be actually even more. Um, one thing I will challenge is OpenAI was not the largest player in the market. It is actually Google that is doubling down on, on video, on images, on visual generation, right? Like, you know, Gemini are... Gemini has great models that, that do pretty much all of these things actually. It doesn't indicate actually anything on the size of the market. It just indicates that they are getting their [audio cuts out] kicked because of lass of, le- lack of focus by Google, by Anthropic, and those kind of things, and they have to focus if they wanna go IPO, right? And that is the market that we are actually entering at this point. For Luma, what does this mean? Uh, I mean, this is great news. This validates like, you know, our, our, our thesis that, like, you can only do so many things at a time. And, um, this is the area that we have chosen to go in because this is a very, very big market with huge number of people that
- 44:59 – 57:36
Q&A: Copyright, model architectures shifting (GANs → diffusion → hybrid), and the remaining gap to “world models”
- AJAmit Jain
call it their profession. So it actually gives us, uh, very good footing in, in the same market. So that's what I would say. So the question is, given that anyone can make a video about anything and content about anything, what happens to copyright, right? So I think copyright and the ability to produce something are orthogonal problems, right? If you're talented enough, you can make Mickey Mouse in Photoshop, really, uh, and you can actually produce great stuff about Mickey Mouse. Like, let's say you're DreamWorks. You don't have rights to Mickey Mouse, but you have all the people who can actually produce anything related to Mickey Mouse, which you don't. Why? Because the law exists that, like, you know, prevents you from doing that. So I think none of that has changed. Has it become easier to violate other people's copyright? Yes, I think so, right? You didn't ask me, like, what the responsibility of platforms is. Again, the responsibility of the platforms is the same as it was for Photoshop, right? Like, you know, it's not Photoshop's responsibility to prevent you from producing Mickey Mouse. It's your responsibility as a law-abiding citizen, um, to not violate the law of the land.
- CICS153 Instructor
Hmm.
- AJAmit Jain
So I think it is pretty much orthogonal, basically. Like, you know, generative AI doesn't change copyright in any way, shape, or form, um, at least on the output side of it.
- CICS153 Instructor
But specifically, if there's a law that says you can't do XYZ, you'll, you will adhere to it.
- AJAmit Jain
Absolutely.
- CICS153 Instructor
Yeah.
- AJAmit Jain
If we get a DMCA notice, we'll take it down, right? Like, you know, if you're hosting it.
- CICS153 Instructor
Right.
- AJAmit Jain
Um, if that person used to create it, and we get a, a call of like, all right, like, you know, this person made something, it is not our responsibility to point law enforcement to them, right? Like, you know, because that's not the law of the land.
- CICS153 Instructor
Ah, right. That... I see.
- AJAmit Jain
So.
- CICS153 Instructor
You, you have... You, you protect the users in that case.
- AJAmit Jain
That's right.
- CICS153 Instructor
Right.
- AJAmit Jain
So the question is, um, GANs were very popular 2017, 2018, and now-
- CICS153 Instructor
Yeah
- AJAmit Jain
... you know, the world has shifted pretty much entirely towards diffusion models. What is the space of GANs in today's models? Uh, or today's architectures, basically. Um, that's a great question, actually. We still use GANs quite a lot. We use techniques from GANs quite a lot. But GANs are one of the most finicky architectures to ever work with. So GANs, if you don't know, are generative, uh, uh, like, you know, adversarial networks, and as the name says, they're adversarial networks. So, like, you know, you, you design the, the, uh, uh, objective in a very different way to diffusion models, right? Like, you know, very... You have a very predictable gradient descent. I mean, they still explode sometimes, but it's a very predictable system. GANs are still actually used quite heavily in distillation networks. Like, you know, if you wanna do distillation, like, GANs are actually pretty useful. Um, if you wanted to do a real-time system, you would still go to GANs quite a lot. But because GANs are just not very predictable, researchers don't wanna work on them. [chuckles] And that is the laws of, like, you know, physics in, in AI. What researchers wanna work on is generally what will get worked on, right? Like, so I can, I can make the case that, "Hey, Rust is more efficient." Doesn't matter. Everybody wants to code in Python, so that's what will be done. But also, GANs have not shown the kind of scaling that we are seeing with transformers, right? Uh, GANs are primarily, you know, UNet-based and, and, and convolution-based models, and they just don't really show the kind of learning that you can get from transformers. Can GANs be implemented using transformers? Yes, there are some papers about it. But at that point, you're really, really trying very hard to, like, you know, just, just do GANs, and, and that's okay. But diffusion models also now are on the way out. 
So, uh, I know this will be a little bit controversial if people are thinking about it, but diffusion models have physics that is not actually bearing out on scaling side of it. So Luma and, and some other companies are actually moving away to hybrid autoregressive and diffusion regimes. That's what our, our, um, unified models actually are. Because diffusion models actually have some really, really bad habits that are hard to unlearn and hard to, like, you know, get out of, of the system. So they're also on the way out, actually. So yeah.
- CICS153 Instructor
It's a very, uh- Um, if, if you realize basically when we first started teaching the class, Mike, there, there, there were debates about what the right l- programming language was for, for security, and it feels like architectures have come full circle, basically. Don't-
- AJAmit Jain
Sure.
- CICS153 Instructor
It-- Uh, m- good note for office hours this week.
- AJAmit Jain
The question was, as models get more and more powerful, what is the space of human creativity, especially in these unified models that can do pretty much all tasks, visual tasks, language tasks, all these kind of things? My stance on this has been actually very, very sterile from day one. I don't think anything the model is doing is creative or not creative. Whether it's creative or not creative is for humans to judge. That judgment alone is the act of creation, right? What you choose to do is an act of creation. What-- How I tend to spend my time is actually a very creative endeavor, right? Like, you know, it... because it will produce outputs that people will generate, consider creative or not creative. But more importantly, in a practical physical system that is an AI system, how-- where is human-- where is the role of human? It's in that fat skills area. This is why our slides look good, because someone who knows this really, really well went in and taught the model a million times over that, like, you know, this is what good looks like to humans, and this is what bad looks like to humans. This is just like programmers, right? Like, you know, before, artists never had this kind of huge leverage. They did something once, and it will run a billion times, right? Programmers have always had this leverage. You write the program once, and it will run again and again and again and again on everyone's computers, on everyone's phones many, many times over and produce value. Artists produce one thing, and then that's one thing. That's it, right? Now, other creatives in the world, not just programmers, have this leverage because of this architecture. You teach the model once, and they're able to produce like, you know, huge amount of really, really great things in different contexts. This is actually an explosion of creative potential that just never has been, right? So the skills and human creativity are much more important. 
Actually, it will weed out people who were mediocre, and it will, like, elevate people who are really great to even greater heights because now their work will be rerun a trillion times over. Okay, so Hollywood is default dead right now, right? And that has nothing to do with AI. Uh, that has really nothing to do with any of the technological changes recently. Hollywood's business model has been deteriorating for the past thirty years, and COVID really accelerated it, and then the writers' strike just was the nail in the coffin. Um, at this point, we are at a place where, like, you know, this production that I'm talking about is the first production in LA proper in the last five years. First production, because all production has moved out of Los Angeles. Think about it. Hollywood doesn't make movies anymore. Hollywood finances them, but doesn't make them.
- CICS153 Instructor
Where, where are they being made?
- AJAmit Jain
In Greece, in, in Canada, in, in Ireland. Wherever you get tax incentives, you're gonna make it there. So what does so-solving this problem actually mean, right? First of all, Hollywood has to stop thinking like PE. Currently, Hollywood's business model is like a private equity. Oh, Guardians of the Galaxy was a great hit. Let's make number two, number three, number four, number five, number seven, number ten, number twenty, right? Uh, uh, l- how many Avengers are there now? I don't even know, right? And how many crossovers of Avengers and Spider-Man? And I'm-- won't be surprised if, like, you know, Tintin is in there one day, right? [laughs] Uh, like some multiverse thing. Uh, as a physicist, that's a really troubling thing for me, the multiverse universe, right? Um, that's not how it works at all. Um, but it is emblematic of a PE mindset where like, all right, we created a franchise, we created an asset. How do we rent-seek that asset the most we possibly can? But, you know, audience don't think like that. More and more people want to watch great things, want to go to theaters, want to actually like, you know, watch things on their phone. That is emblematic in Netflix's growth. I mean, the, uh, on Thursday, you're gonna see like, you know, the, the, this quarter results. I'm not saying buy or sell anything. Uh, but like you've seen the growth of Netflix, and they produce eight hundred productions a year compared to like, you know, the, the five, ten, twenty that large studios are producing right now. That is a PE mindset. So those eight hundred productions don't have five hundred million dollar budgets, right? They have, uh, uh, like, you know, they're like ten million, twenty million, thirty million, fifty million dollar budgets, and this is how they're scaled. What does this allow? This allows different kinds of, and more kinds of stories to be told, and that means it appeals to a wider audience. So your platform then becomes more appealing to more people. 
Are they making again a Harry Potter? Yes. I'm very happy about it, by the way, right? But this is also PE mindset that like, why are we making Harry Potter when we have so many other great books that should be made again and again and again and again, right? They shou- sorry, shouldn't be made again and again. So Hollywood has been default dead for a long time. If nothing changes, this is not about AI again. If nothing changes, all those jobs are actually already gone, right? Um, the, the people in those industries know that. AI is a chance to actually change the business model once and for all. Because you can move away from these massively expensive production methods. You can move away from these huge, huge, uh, you know, time and capital sinks and go back to an era where like, you know, many, many ideas can be tried and one has a shot of building people into the, you know, into theaters. Ryan Gosling from Hail Mary, if you have not seen it, great movie, uh, makes a great point. It's not the audience's job to come to the theaters to s- you know, keep Hollywood alive. It's Hollywood's job to make great things so the audiences want to watch it. You can't blame your customer-
- CICS153 Instructor
Um, what I'm gonna do is... That was well said, and I think the PE mindset is, uh, is m- is going to come up over and over again, where capital markets, where- wherever they look for predictability-
- AJAmit Jain
Yeah
- CICS153 Instructor
...um, we tend to find a stagnation of innovation, and I think that's, that's hurting a lot of people.
- AJAmit Jain
That's a great question, and that's the one that like, you know, the whole company spends all their time thinking about, honestly. So the question is: What is the delta between where we are to getting to a place where world models, video models, whatever have you, are as generally used and useful as language models are today? Fair?
- SPSpeaker
Fair.
- AJAmit Jain
Right. So i- there's only one word, basically: intelligence. That's the delta. So currently, image models and video models that are not unified models are really, really stupid. And, and I mean that in a non-derogatory way of that, right? When I say stupid, I mean in, in this way. Like, when you work with a person who you don't consider to be intelligent, what are the signs? They forget what you said. You have to tell them the same thing again and again. They don't actually understand what you said. They kinda are a facsimile of understanding, right? It's like you said something, but they're like, "But that's not what I said." Like, you know, "Yes, you, you kind of interpreted my words literally, but that's not what I'm saying. You have no context of what I'm saying." They are able to do small things, but like, you know, when you ask of more of them, then they can't actually do that. This is what today's video models and image models are. All of them, basically, right? Um, that's what we tried to solve with UniOne. They need to be as intelligent as language models are. They need to have multi-turn, right? So when you, when you ask it something, afterwards, it needs to be able to go back and say, "All right. This is what I generated. This is what you had asked. I, I have memory. Let me iterate on it, and let me fix it." How annoying would ChatGPT be if you only had one turn on it, right? And then you had to repeat the thing and then another turn, and repeat the thing and then another turn. That's ridiculous, right? Nobody uses that. See, that was the difference between LLMs being a research project and them becoming generally useful. That was RLHF, right, that enabled chat multi-turn. So that was number one. Number two is, is how much intelligence did they actually have? So current image models and video models are beautiful pixel generators. 
They have really no understanding of what the hell they're generating, the physics of it, the, the introspection on it, all of these kind of things. Unified models are designed to solve this kind of problem. So when you use them in, in things like education, for instance, right? Like, you know, these slides could, uh... Sorry, not this one, but the ones I had, could easily be used for te- by teachers, right? With probably a lot more density, and I can show you really good examples of it producing really high-density things. Videos are some of the best explainers in the world, right? Imagine a history class that is not taught as drab text, but you can actually see, and most importantly, you can do alternatives. What if the Rubicon was not crossed, right? What if Caesar was not murdered? What if, like, you know, these things didn't happen? What if Archduke Franz Ferdinand was not shot in 1914, right? Like, you know, would there still be a World War I? I posit yes, and like, you know, we can go into that. But what if you see that flow out, you had this level of like, you know, uh, temporal understanding and coherency and these kind of things. Language models are getting there. Image and video models don't have that. That's what we need to solve. So that is the distance between them being just tools that can produce almost stock footage to things that can do end-to-end work.
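The multi-turn requirement mentioned above ("This is what I generated. This is what you had asked. I have memory. Let me iterate on it") can be sketched as a session that keeps its own history. This is a toy under stated assumptions, not how any real image or video model works: `Session` and `toy_generate` are hypothetical names, and string formatting stands in for generation.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]   # (request, output) pairs from earlier turns


class Session:
    """Keeps every earlier request and output so each new turn can build on them."""

    def __init__(self, generate: Callable[[History, str], str]) -> None:
        self.generate = generate
        self.history: History = []

    def turn(self, request: str) -> str:
        output = self.generate(self.history, request)
        self.history.append((request, output))   # memory: the user never has to repeat this
        return output


def toy_generate(history: History, request: str) -> str:
    """Stand-in generator: drafts on the first turn, revises its last output afterwards."""
    if not history:
        return f"draft({request})"
    _, previous_output = history[-1]
    return f"revise({previous_output}, {request})"


if __name__ == "__main__":
    session = Session(toy_generate)
    print(session.turn("a red barn"))
    print(session.turn("make it blue"))
```

A single-turn generator corresponds to calling `toy_generate([], request)` every time, which forces the user to restate the whole request on each attempt.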
- CICS153 Instructor
Awesome. Thank you, Amit, for being here.
Episode duration: 57:41
Transcript of episode 6nUl_w5W9Wk