Stanford CS153 Frontier Systems | Amit Jain from Luma AI on Unified Intelligence Systems
Stanford Online

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In week three of CS153, the instructor hosts Amit Jain from Luma to discuss “Unified Intelligence Systems” as a follow-up to a prior lecture on visual intelligence. Jain recounts his Apple work on LiDAR for projects including Titan and Vision Pro, and how early exploration of generative models and differentiable 3D led to founding Luma with an initial focus on large-scale 3D capture. Luma then shifted to generative video in 2023 to leverage the scale of internet video data, releasing the Dream Machine model in March 2024 and rapidly reaching millions of users, while building preference-based feedback loops and human annotation pipelines. Jain explains Luma’s multimodal AI factory (pretraining, post-training, deployment, and reinforcement learning), its security constraints for studio clients, and a move toward unified transformer architectures that jointly reason across text, images, video, and audio to enable end-to-end creative and professional workflows.

Guest speaker: Amit Jain is the CEO and co-founder of Luma AI, a research lab developing multimodal foundation models aimed at "unified intelligence." Under his leadership, Luma has scaled from a 3D-capture pioneer into a leader in generative video, raising a $900M Series C following the success of its Dream Machine and Ray video-reasoning models. By 2026, he has steered the company into large-scale infrastructure projects including Project Halo, a 2-gigawatt AI supercluster, to build the next generation of "world models" capable of simulating physical reality. He founded Luma in 2022 from Apple, where he was a Systems and Machine Learning Engineer. At Apple, he led development of the Passthrough feature for Apple Vision Pro and was instrumental in integrating the first LiDAR sensors into the iPhone, foundational work for modern spatial computing. His background also includes physics and mathematical simulation.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

CS153 Instructor (host) · Amit Jain (guest)
May 6, 2026 · 57m

EVERY SPOKEN WORD

  1. 0:10-0:52

    Course setup and why “unified intelligence systems” matters this week

    1. CI

      Welcome, gang, to, uh, week three of CS153. We have today with us Amit Jain from Luma. Thank you for joining us, Amit. [laughs]

    2. AJ

      Thanks for having me.

    3. CI

      Okay. [audience applauds and cheers] Uh, Amit is gonna be talking to us today about unified intelligence systems. You're gonna be hearing a lot more about this. I think it's a, a very relevant follow-up to the visual intelligence systems lecture we had last week from Andy Blattmann at Black Forest Labs. Um, quick recap on the class and today. We're gonna talk about Amit, and today we're

  2. 0:52-2:56

    How Luma’s origin story started with a blunt ask: “Can I have your 3D data?”

    1. CI

      gonna do a field trip into what I think is also one of the most exciting, uh, factories working on how to get work done, especially visual and creative work done in the world, called Luma. But before that, why don't we start by talking about Amit a little bit? I had the privilege to get to know Amit a few years ago when he was still an engineer at Apple. And, uh, I was at Discord at the time, and Amit-- I got an email from Amit saying, um, "Hey, I heard you have a bunch of 3D data."

    2. AJ

      Yes.

    3. CI

      "Uh, can I have it?" [laughs]

    4. AJ

      I remember that.

    5. CI

      And I said, "No, you can't."

    6. AJ

      [laughs]

    7. CI

      Uh, because Discord had acquired the data. But I started asking Amit what-- why he needed the data. If you guys remember, um, I covered this in, in our first lecture, but Ubiquity6, the company I'd started about a decade ago, was a 3D computer vision mapping company, and we had, uh, we had millions of people around the world who were capturing the world in 3D using their smartphones. And all that data, we had terabytes of data, um, uh, that were 3D representations of the world that we'd reconstructed from 2D images. And Amit said, "Well, I wanna build, um, a 3D service that, uh, that is generative. I want to allow people to create gener-- the same kinds of meshes and point clouds and 3D representations of spaces, but through generative models because that's where the world is going." Um, and I s- got interested 'cause I kinda agreed with him, and he was ahead of the curve. And so I had a chance to invest as an angel investor at the time, and then a few years later, I had a chance to partner with Amit again at a16z when I was a general partner. And had-- Thank you for letting me lead your Series-

    8. AJ

      B

    9. CI

      ... B.

    10. AJ

      Yeah.

    11. CI

      Um, Amit was also one of the first customers of the a16z compute program called Oxygen, and actually helped name Oxygen as well. Um, I think the quote was, he said something like, you know, "If we don't have compute on day one-"

    12. AJ

      Let's-

    13. CI

      "... can't really breathe."

    14. AJ

      Suffocate. Yeah.

    15. CI

      So, um, tell us a little bit about what Luma is and how-- what were the dots that led from the insight at Apple that generative modeling was the future that led here?

  3. 2:56-5:07

    From Apple LiDAR to “world simulators”: the technical and product motivation

    1. AJ

      My background, b- very briefly. So at Apple, I was working on, uh, first the LiDAR systems that actually now is on our iPhones. Uh, this was called the Jasper sensor if any o- any of you are familiar. And we were trying to build, um-- We were trying to actually build like, you know, what comes after the c- after the camera. This sensor was built, now I can talk about it because, you know, the project is no more for the car, [chuckles] uh, which was called Titan. And we started to work on Vision Pro after that because, you know, the car project got, got canceled, and the Vision Pro had, had a bunch of LiDARs on it. And during that work, it started to become obvious that like, okay, you know, um, the computers of the future, uh, we still don't know what they will look like. Uh, you know, maybe they will have AI or what- whatnot. The computers of the future will need very different interfaces, will need very different kind of media, and will need very different kind of, of ways of actually capturing and creating and building those things into the system. So in 2020, uh, at Apple, we started exploring generative models. Um, and, and think about it, it's 2020, so, you know, before language model scaling was known to be working and before, um... Actually, it was before DALL-E, but NeRF had already come out from Matthew Tancik from Berkeley. So we started to explore those generative systems and, uh, that led me to thinking that, okay, if language scaling is working and here is, is a method where we-- differentiable 3D is possible, what would happen if all of these things are combined together, right? That would basically mean you have the full footprint of every observation in the universe, and you will be able to like, you know, differentiably learn about them. If you can differentiably learn about them, you can understand them, and then finally you can generate them. So that was the genesis of Luma. 
And at that time, because of, of the pedigree we had, 3D seemed like the most logical way of going forward because first of all, 3D tells you-- 3D has a lot more information than images do. Uh, naively, we assumed at the time 3D has a lot more information than videos do as well, and that 4D would be very easy to capture and scale. But again, I say naively because as you will learn in, in a few seconds that that was a bad assumption. But that's kind of where we started, with the idea of building what we now call a world simulator. Uh, at that time it was just like, all right, like, you know, if we can learn this and generate this, we would have something that would allow us world understanding.

  4. 5:07-6:10

    What “differentiably learn the world” actually means (and why it’s central)

    1. CI

      And c- you, you talked about-- You, you said this phrase, which is important, you know, l-learn the world in a differentiable manner.

    2. AJ

      Yeah.

    3. CI

      What does that mean?

    4. AJ

      Right. So I mean, i-if you're-- I-I'm sure you guys are all familiar with like, you know, how transformers work and how AI models work. Differentiable means you can put it in a training loop and, um, you can have a loss function that can be then iteratively optimized. So differentiable allows you to do that. If the function is non-differentiable, then like, you know, you j- really can't do gradient descent on it. And, uh, if you can't do gradient descent on it, then deep learning doesn't work. So the tools that we have for this era, for this generation, is basically compute and gradient descent. Um, and yes, transformers are things that are very, very well susceptible to gradient descent, but the actual, you know, thing underneath it is gradient descent and compute. So how can we take a lot of data, a lot of compute, and gradient descent and produce something useful out of it? Differentiability is the core characteristic of that problem, those problems, basically.

    5. CI

      Yep. That's helpful. Could you just connect the dots on how, what that insight-

    6. AJ

      Yeah

    7. CI

      ... led to then what-

    8. AJ

      Right

    9. CI

      ... Luma's doing today?
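
A minimal sketch of the differentiability point Jain makes in this segment: a loss function that can be differentiated can be placed in a training loop and iteratively optimized with gradient descent. The one-parameter model, the toy data, the learning rate, and every function name below are illustrative assumptions and are not from the talk.

```python
# Toy gradient descent on a differentiable loss (illustrative assumptions only).

def loss(w, xs, ys):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Analytic derivative of the loss with respect to w.

    This derivative exists only because the loss is differentiable in w;
    without it there is no gradient to descend along.
    """
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, w=0.0, lr=0.05, steps=200):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]  # toy data generated by y = 3x
w_fit = fit(xs, ys)
final_loss = loss(w_fit, xs, ys)
```

Real systems compute the gradient automatically across billions of parameters, but the loop has the same shape: evaluate a loss, take its gradient, update the parameters.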

  5. 6:10-7:25

    Luma’s first flywheel: productionizing NeRFs/Gaussian splats, and hitting a scaling wall

    1. AJ

      So we started-- When we started the company, the idea was we will, we'll, uh, you know, capture an, uh, ungodly amount of 3D data, build a flywheel that allows people to capture that and like, you know, for us to be able to use it and then like, you know, build, build both simulation systems with it. So we released an app, uh, which is called Luma 3D Capture. It actually was very, very popular because, one, the results were really, really great. It was for the first time that NeRF and Gaussian splats were productionized. And Matthew, uh, you know, he joined our team actually to really push forward the, the frontier of, of that sort of the world. But very soon we realized that it doesn't matter how many people use the app, it will never reach the scale that was necessary to learn enough about the universe.

    2. CI

      Why is that?

    3. AJ

      Because think about it, right? The number of people that are writing on the internet, that are taking photos on the internet, that are, are, are capturing videos on the internet, substantially outpaces anything one company can actually distribute. Also, there's like, you know, decades and decades and decades of that information that is already available. So it's all about data. It actually... You, you can make the case that like, you know, this particular modality of data is better for learning versus this or versus that. It really doesn't matter. That's a moot point because you're running against the physics of scale. So wherever there is scale in data, that's the only thing that's gonna work.

    4. CI

      Hmm.

  6. 7:25-10:27

    Pivot to video: Hopper compute, Dream Machine, and the next data flywheel

    1. AJ

      And you have to design the algorithms around where the, the data is, not the other way around, right? You come up with some pristine algorithm, but you don't have any data, then like, you know, what's the point? Robotics is coming u-up against this problem right now. We're like, "All right, we're gonna build like, you know, these action systems," but well, where is the action data? There's no internet of action data. You can have huge labs in, in China and India and in Vietnam and, and, and everywhere gathering this data, but the scale is really c- not comparable. So you have to just design the systems around data. So that's what, that's what we learned. Um, so in 2023, after that realization and after, uh, you know, NVIDIA Hopper architecture was announced, uh, we started to build the foundations of, um, you know, generative video because video is three-dimensional. It has two dimensions of space and one dimension of time. And human brain actually like, you know, learns about 3D representation through that time proxy. So when Hopper architecture came about, we started to think like, "All right, it might be possible actually to learn video and to learn the world representation through video." So in 2023, uh, Jiaming joined us. Jiaming, uh, was, uh, you know, at NVIDIA at that point. He's a Stanford grad. And a, a few other people from Stanford and Berkeley, uh, started to join the company with this idea of like, "All right, let's learn from video." And we started to build that infrastructure. And in February-- Sorry, in March 2024, we released the first video model, uh, that was called Dream Machine.

    2. CI

      Yeah.

    3. AJ

      And, um, you know, in, in the first three weeks, four weeks, actually, we got up to, uh, uh, s- six million users from that because people had never seen generative video. Uh, Sora was announced but never released, so people had never experienced it, so people really wanted to actually try that out. So we started with video at that point. And then we have had the similar realization again in 2025, early 2025, uh, just like annual cycles now, that just video is not enough because video is good, but it doesn't pair human logic. It doesn't pair why an event is important. What is the sequence of events, and what does that actually lead to? Just having language models in the middle, uh, that are like, you know, u- being used for embedding is not sufficient. You need unified intelligence, so that's kind of where we are now.

    4. CI

      Yep.

    5. AJ

      These are the dots.

    6. CI

      Well, um, yeah, so th-this is not the first time the class has heard that when you close the loop, you have to, you have to sort of evolve the, the mid-training, the post-training pipeline, the interface. And so can we spend a little bit of time-- So I, I don't think it's a surprise for people to hear that there was sort of an iterative loop every year as you got more and more data from customers.

    7. AJ

      Yeah.

    8. CI

      But can we talk a little bit about that first... You know, the final projects for the class this time are the one person frontier lab, where they're going to be bootstrapping their own flywheels.

    9. AJ

      That's very cool.

    10. CI

      Um, and the first-- You know, before-- I, I remember, you know, how nerve-wracking it was for you and for the team whe-when, when you had the realization that video was gonna be the future, but you didn't have a video model out in the world yet.

    11. AJ

      Yeah.

    12. CI

      And you didn't have a state-of-the-art system to start collecting that, um, that context feedback loop.

    13. AJ

      Yeah.

    14. CI

      So let's, let's, let's take a bit of a journey back in time, time travel to, uh, the launch of Dream Machine 1.

    15. AJ

      Yeah.

    16. CI

      Can you just tell folks how you went about kickstarting or bootstrapping the, the video flywheel at Luma?

  7. 10:27-13:51

    Bootstrapping the video flywheel: preference signals, trainers, and product telemetry

    1. AJ

      So I think the core problem that you wanna think about whenever you're building these really, really large systems, they have a wild distribution, right? Like, you know, if even if you're talking about language models, well, they have all of language models, uh, language model data, and what is good, what is bad, right? So you wanna think about, okay, from this really raw distribution I get from pre-training, how do I get to a model that humans can use? And what humans find useful is a very narrow band within that distribution, and that narrow band is not like, you know, a predictable, uh, linear band. It's just like, you know, pockets of, of, of greatness that humans think are great. Some other species might find it very different, right? But we have our own aesthetics. We have our own use cases. We have our own value system. So we find those, those distributions valuable. So now the question becomes: How do you find, or how do you basically t- get that distribution out of the model? So we started to think about that problem and, and with Dream Machine, the-- because there were so many users that were using the model, the question became, "All right, can we learn something about that?" And, and preference, uh, or like, you know, preference feedback, uh, at that time, by the way, um, SFT, right, like, you know, was just started to being thought about. RLHF was a hot thing where people were thinking about like, all right, like, you know, human feedback loops. So we built a system where, um, videos that people were liking and people were downloading, we considered that to be a signal of like, all right, this is something that people prefer. Um, it was not 100% accurate because some people were downloading really bad videos as a showcase of how bad AI is at video, right? So our model also learned a lot of that. So we had to then build systems for, uh, humans to be able to, uh, go and filter out, like, uh, people we pay. 
So then it started to emerge what a frontier lab actually looks like. A frontier lab has these components of data, these components of compute and algorithm, but it also has huge parts of, of what we call skills and trainers and tutors and people who are doing the labeling of data and all of these systems. Um- If you don't have that, then it's actually not complete. And a part of that is also the product you built. Can the product actually give you enough information to make sure that the next model is better than the previous one? And hence the experience is better, and hence more people will use it, and hence you'll get more data from it, uh, uh, about this preference of, of human distribution, and can you make the next model actually better? So, I mean, it took us a long time to learn actually how to gather that feedback, how to... You know, and then now the system we have, uh, in, in, in the latest Luma Agent system, ungodly amount of feedback actually we get from, from, um, um, what people are doing. Every interaction that is there, we learn from, like, you know, whether they like it, dislike it, in what way they like it, what way they dislike it, whether the full chain of thought, uh, that, that the model produced and the full chain of work that the model produced is any good. Which elements of that is not good? And then that's how you actually start to get good at it. Yeah.

    2. CI

      Well, let's, um... Why, why don't we do a double-click on how that-

    3. AJ

      Yeah

    4. CI

      ... that actually works. So to remind everybody about the field trip we're about to take, right? Um, this is the, the very basic standard AI factory we've talked about, right? Frontier AI, um, sort of pipeline. We've got pre-training, mid-training, and then we have post-training and deployment. And so today, we're gonna hear a little bit from Amit on how the, the Luma version of this works. Why don't you go ahead and just k-kind of talk us through-

    5. AJ

      Yeah

    6. CI

      ... what's, what's actually going on under the hood at Luma.
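
The preference loop Jain describes in this segment (likes and downloads treated as an implicit signal, then paid reviewers filtering out the cases where people downloaded bad videos to mock them) could be sketched roughly as follows. This is a hypothetical illustration, not Luma's pipeline: the `Generation` record, its fields, and the rule that a reviewer must approve an item before it counts as preferred are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Generation:
    """One generated video plus product telemetry (hypothetical schema)."""
    video_id: str
    liked: bool
    downloaded: bool
    reviewer_ok: Optional[bool] = None  # None means no human has reviewed it yet

def has_implicit_signal(g: Generation) -> bool:
    """A like or a download is treated as a noisy sign of preference."""
    return g.liked or g.downloaded

def needs_review(gens: List[Generation]) -> List[Generation]:
    """Items with an implicit signal that still await a human reviewer."""
    return [g for g in gens if has_implicit_signal(g) and g.reviewer_ok is None]

def preferred_set(gens: List[Generation]) -> List[Generation]:
    """Items with an implicit signal that a reviewer has explicitly approved."""
    return [g for g in gens if has_implicit_signal(g) and g.reviewer_ok is True]

logs = [
    Generation("v1", liked=True, downloaded=False, reviewer_ok=True),
    Generation("v2", liked=False, downloaded=True, reviewer_ok=False),  # shared as a bad example
    Generation("v3", liked=False, downloaded=False),
    Generation("v4", liked=True, downloaded=True),  # awaiting review
]
preferred_ids = [g.video_id for g in preferred_set(logs)]
review_queue_ids = [g.video_id for g in needs_review(logs)]
```

Requiring explicit approval is the conservative choice here; a looser variant would keep anything a reviewer has not rejected, at the cost of letting some mocking downloads through.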

  8. 13:51-20:45

    Inside the Luma Factory: multimodal pretraining and the push to end-to-end creative work

    1. AJ

      Absolutely. So, um, let me talk about what is informing the design decisions for our architecture and for our models. Um, currently, we are seeing huge amount of alpha coming from, from language models being used for, for adjacent tasks like coding, for adjacent tasks like, you know, system design and, and those kind of processes. But when we start to think about tasks that require more context than what is available in text, so creative work, right? Huge amount of things, a huge amount of information that is in, in visual domain, huge amount of information in auditory domain, actually huge amount of information in the trace of how you arrived at the final output, right? That. When we think about robotics, you can definitely start to build a, a robotic system just based on text models or VLMs or VLAs that people are starting to do now. But they will not generalize, just the same way that, like, you know, a-autonomy didn't-- uh, autonomous driving didn't generalize until people started to build full end-to-end systems that w- that had language, that had video, that had, uh, like, you know, all of the control signals, all of these things in there. So that's the problem we are coming up with, that the real world is way more complicated than coding, right? I mean, coding is a really valuable task, but not everything can be done in coding, right? Like, otherwise, programmers would be the only profession that would be left. Uh, uh, and now they are also, you know, endangered species, actually. But-

    2. CI

      I'm not sure that's true, but I understand your point.

    3. AJ

      [laughs]

    4. CI

      Yeah.

    5. AJ

      As a, as a programmer, uh, it's really fun.

    6. CI

      Well, the job has evolved, for sure-

    7. AJ

      That's right

    8. CI

      ... to become a trainer-

    9. AJ

      Yeah

    10. CI

      ... and a tutor. Yeah.

    11. AJ

      I-It's a really fun, fun time to be that way. Uh, I-I started pr- coding like, you know, when I was thir-thirteen years old, in order to build s-simulation systems, in order to do like, you know... Uh, so my background is in physics. And in order to actually build like, you know, simulation systems for, for electromagnetism and those kind of things to l- see how these systems behave. That's why-- when I learned, uh, to start coding. And even at that point, it was really obvious that I cannot teach those systems from any observations.

    12. CI

      Hmm.

    13. AJ

      Right? We can write the code, but that is like approximations that we have in our, in our models or in our, in our equations, but we can't actually teach those models from any data. So all of this is informing how we build our systems. So even early on, we started to think about, okay, in our pre-training, how can we learn from all of video, all of images, and all of text, right? It's a really hard problem because they're really different modalities, and they're expressed very differently. If, if you think about the encoding of these modalities, text is discrete, and text performs the best when you encode, encode it in a discrete manner. At least that is the understanding today. Video is kinda somewhere in between, and audio and images are b-be-best performed in, in a continuous space. So our factory, as you call it, is built around this idea of like, how do we learn jointly from all of these systems? In 2025, these were disparate towers that we built. Language tower, image tower, video tower, audio tower, and then like, you know, we would unify them together, um, using just like, you know, some, some fusion techniques so that like, you know, they will do better. Uh, if you look at like, you know, the work from, uh, Andy's lab, right? Like, you know, uh, Stable Diffusion, those kind of things. That's what it does as well, where you have a tiny little language component-

    14. CI

      Right

    15. AJ

      ... um, and, and you learn embeddings from that to be able to understand the human instructions. It was just not sufficient. So when we talk to our customers, when they try to use our system, so where, where are systems being used, right? For instance, currently, large studios. So actually, I'm very, very, very excited about, um, a new show that is coming out on Prime Video.

    16. CI

      Hmm.

    17. AJ

      Uh, the trailer's out. Uh, it's called, um, um, Old Stories. Uh, it's a-about Moses, right? So it has, um, Sir Ben Kingsley is the star of it.

    18. CI

      Oh, cool.

    19. AJ

      Uh, it's, it's a proper production. It's not an AI video. Uh, it's a $4 million, uh, sorry, $4.5 million, uh, per episode production, basically. And it's all pretty much all produced using Luma Agents. So they're using it in these, like, really high intense situations where they wanna be able to model the whole world and the physics of the world and light and, and, and, and, uh, uh, fluid and interactions and all of these kind of things.

    20. CI

      Hmm.

    21. AJ

      Now, when you do that, it's just not sufficient to build an image model or a video model. You need a model that understands time and causality and, and language, right? And it, it understands, like, proper instructions. "Okay, well, like, you know, uh, um, this looks good, but what if like, you know, the, the shirt sleeves had like, you know, this particular thing right here?"

    22. CI

      Hmm.

    23. AJ

      How do you express that instru-instruction? "Okay, in time, when this person actually walks, uh, through the door, the whole scene explodes." All right, what does walking through the door actually means? When the person walks through the door, what does this explosion of the scene mean? Give me more instructions, right? The deeper you go into these kind of problems, and this is a very, very, very big market, right? It's about 120 million creatives in the world whose-- this is their job, right? Like, you know, these are not people who paint for a hobby or all these kind of things. These are people who actually are employed in this industry. So about, uh, you know, two times, three times by estimation of coders. Um, their work every day goes into replicating the physics of the real world-

    24. CI

      Mm

    25. AJ

      ... into computers. So we wanna build systems for them, and if you wanna do that, you wanna build what we are now calling unified models that have the same understanding and intelligence of a language model that can follow context, that can remember, and the physical understanding and the world model understanding of video models and image models. So that is what the output, that is what the things we want to produce. This was 2025. [laughs] And now in 2026 when the models got really good, what people want to do is like they wanna do the full work end to end. You know, it's like, all right, why is it only producing five-second video for me, right? Why can't it make the whole shot? Uh, if you go to like, you know, w- w- people in advertising world, why can't it make the whole campaign? If you talk to robotics companies, why can't it actually produce the whole action and then judge its own outputs, and then tell me when this is the right action and incorrect action?

    26. CI

      Mm.

    27. AJ

      Like, you know, why can't I get the right force in all of these kind of problems? So people want end-to-end results. So now the Luma Factory is about building systems that can do end-to-end work-

    28. CI

      Hmm

    29. AJ

      ... in multimodal domains. So that's, that's kind of what we do. We have massive reserves of like, you know, multimodal data, uh, i-in about-- Uh, the final trainable outputs are in, in about thirty petabytes of, uh, you know, um, scale. We train them on, on, um, currently H100s and very soon GB300, uh, uh, you know, GPUs, uh, in, i-in the 10K scale, basically. So pretty much the same as, as a second-tier language model training. Like, you know-

    30. CI

      Hmm
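
Earlier in this segment Jain contrasts modalities by encoding: text works best as discrete symbols, while images and audio work best in a continuous space. The toy sketch below illustrates only that contrast and the idea of placing both in one sequence. It is not Luma's architecture; the whitespace tokenizer, the tiny vocabulary, the patch size, and the tagged-tuple sequence format are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

def encode_text(text: str, vocab: Dict[str, int]) -> List[int]:
    """Discrete encoding: each whitespace token maps to an integer id (0 = unknown)."""
    return [vocab.get(tok, 0) for tok in text.lower().split()]

def encode_image(pixels: List[List[float]], patch: int) -> List[List[float]]:
    """Continuous encoding: cut a 2D grid of floats into flattened square patches.

    Assumes the grid height and width are both multiples of `patch`.
    """
    rows, cols = len(pixels), len(pixels[0])
    patches = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            patches.append(
                [pixels[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            )
    return patches

Element = Tuple[str, Union[int, List[float]]]

def joint_sequence(text_ids: List[int], image_patches: List[List[float]]) -> List[Element]:
    """Tag each element with its modality so one model could consume a single stream."""
    return [("text", t) for t in text_ids] + [("image", p) for p in image_patches]

vocab = {"a": 1, "red": 2, "door": 3}
text_ids = encode_text("A red door", vocab)
pixels = [[0.0, 0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6, 0.7]]
patches = encode_image(pixels, patch=2)
sequence = joint_sequence(text_ids, patches)
```

In the separate-towers setup Jain describes for 2025, each modality would get its own model before a fusion step; a single tagged stream like `sequence` is one simple way to picture the unified alternative.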

  9. 20:45-22:31

    Enterprise deployment constraints: studio secrecy, training exclusions, and learning from traces

    1. CI

      And could you talk a little bit about, you know, when you started deploying these systems in, in, in the first lecture, we talked about mission-critical context, right?

    2. AJ

      Mm.

    3. CI

      And one, one type of mission-critical context is a large studio-

    4. AJ

      Yep

    5. CI

      ... for whom their data is super sensitive.

    6. AJ

      Yep.

    7. CI

      They don't want... You know, they're happy to have you train their data, but, uh, with their data for them.

    8. AJ

      Yeah.

    9. CI

      But they don't-- If, if I'm running a studio, I don't want my data being used by another studio.

    10. AJ

      Yeah.

    11. CI

      So how, how do you, how did you navigate the deployment sort of restrictions-

    12. AJ

      Yeah

    13. CI

      ... of these, of these professionals?

    14. AJ

      So we work with two arch nemesis at the same time, Netflix and, and Amazon Prime Studio, right? Which are the two giants of streaming war at the moment. Um, so basically then, then you have to build systems that are guaranteeing that there is no way that there's any data overlap. We have internal controls and systems that like, you know, are, are some of the standard ones like SOC 2 and those, those kind of things, and then specific ones that are for AI labs on how do you not train on this, on, on this data.

    15. CI

      Hmm.

    16. AJ

      So for instance, uh, if you're producing the next blockbuster, you don't want the next Iron Man, for instance, right, like to show up into the training data. So we have guarantees around that, like, all right, whenever certain stuff is marked or projects are marked, they will never show up in training data. They will never show up in, in any of these loops, basically. But we still learn from like, you know, what users are doing in the product-

    17. CI

      Hmm

    18. AJ

      ... which is different from the, the visual artifacts that they're producing, but rather the traces they're producing. We, we're still able to use them and learn from them, actually.

    19. CI

      This is the interaction data-

    20. AJ

      That's right

    21. CI

      ... when people are working with the interface of the agents.

    22. AJ

      Yeah. That's right.

    23. CI

      Okay.

    24. AJ

      So there's some limitations on, on these kind of high, higher sensitivity projects. Yeah.

    25. CI

      Um, yeah, I think you have, you know, sort of a-

    26. AJ

      Yeah
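
The guarantee Jain describes in this segment, that assets from marked projects never appear in training data, amounts to an exclusion filter applied before any training set is assembled. The sketch below is a hypothetical illustration of that idea only: the `Asset` record, the project identifiers, and the function name are assumptions, and it says nothing about how Luma actually enforces or audits the rule.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class Asset:
    """A generated or uploaded artifact tied to a customer project (hypothetical schema)."""
    asset_id: str
    project_id: str

def training_eligible(assets: Iterable[Asset], restricted_projects: Iterable[str]) -> List[Asset]:
    """Drop every asset whose project has been marked as excluded from training."""
    restricted = set(restricted_projects)
    return [a for a in assets if a.project_id not in restricted]

assets = [
    Asset("a1", "studio-x-feature"),
    Asset("a2", "public-demo"),
    Asset("a3", "studio-y-series"),
]
restricted = ["studio-x-feature", "studio-y-series"]
eligible_ids = [a.asset_id for a in training_eligible(assets, restricted)]
```

Filtering by project rather than by individual asset means a newly created asset inside a marked project is excluded by default, which fits the "never show up" framing better than per-asset flags would.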

  10. 22:31-25:15

    Unified intelligence in action: generating polished slides from a mind map + style prompt

    1. CI

      ... well, one, uh, could you talk a little about how you created these slides? 'Cause these, these-

    2. AJ

      Yeah

    3. CI

      ... I believe were created with, with Uni1.

    4. AJ

      That's right.

    5. CI

      Is that right?

    6. AJ

      So, uh, these are, uh, you know, I, I basically gave it... Uh, actually, let me start from that first, and then I will actually, actually ta-ta-talk about unified models as well. So here, I created this, uh, like, you know, on the top what you see, I created that, um, mind map, whatever you wanna call it, in our product. And then I basically asked, if you see on the right, um, I asked it like, you know... And I also gave it Ant slide, uh, that like, you know, the one right here. Sorry, not this one, but... Okay, I don't know. The first one you saw of the factory one.

    7. CI

      The factory slide. I see.

    8. AJ

      That's right.

    9. CI

      Yeah.

    10. AJ

      I gave it that, and I asked it like, "Hey, in this style, actually produce the outputs." Now, this is actually a very, very good example of what unified intelligence that I'm gonna talk about means. People, when they think about image models, video models, or, or any models, not text, they think they are just... They produce beautiful images, right?

    11. CI

      Mm.

    12. AJ

      But that is a really big mental gap that the world has in this area. Just like language models produce words, right? The words can be beautiful. You can just say like, "Hey, it's a poem," and it could mean nothing, right?

    13. CI

      Hmm.

    14. AJ

      And, and simultaneously, you can have a mathematical proof of Euler's problem number, pick your, uh, take your pick, right? 1152. They all are words at the end of the day, but how you string them together determines the information content and determines the informa- uh, the, the intelligence of those.

    15. CI

      Hmm.

    16. AJ

      Just like that, how you arrange the pixels determines what they're conveying and how, how intelligent they are. So unified models that we are producing now, and I'm gonna talk about that in a second, are about how you express intelligence in whatever medium is convenient for the person that they are actually, you know, who's using it. So if a language is, uh, a language output is convenient, fantastic. If it is slides and images, fantastic. If it's a video explainer, great. But they're all basically outputs that are intelligence.

    17. CI

      Hmm.

    18. AJ

      So that, that's what we call unified models. So yeah, basically, uh, it was one shot. Uh, it produced those slides. It produced one that I didn't like, and I deleted it, but... before you ask me to take a screenshot of that.

    19. CI

      Yeah.

    20. AJ

      But that w- that was pretty much about it. If I would have asked it to do a very detailed overview of that, then that's what it, it would have done. So end-to-end work, this is what we call end-to-end work, right? You know?

    21. CI

      So ju- just to break down what happened-

    22. AJ

      Yeah

    23. CI

      ... you gave it, you gave it my original slide as a prompt.

    24. AJ

      Yeah.

    25. CI

      A screenshot of that prompt.

    26. AJ

      Yeah.

    27. CI

      You then gave it instructions on the right-

    28. AJ

      Yeah

    29. CI

      ... in the chat, and then you gave it a little bit, like, guidance, is, is it? That scaffolding?

    30. AJ

      Yeah, just my, my, my, my thoughts up there.

  11. 25:15 - 29:50

    Why it’s hard: bridging understanding vs generation, and what “unified” architecture changes

    1. AJ

      Right. So I mean, that's a good segue into unified models, basically. So, um-

    2. CI

      Okay

    3. AJ

      ... well, LLM, first of all, doesn't generate images, right? I mean, it's a language model.

    4. CI

      Hmm.

    5. AJ

      You can ask an LLM to use a computer and try to generate images, but again, it really falls apart because it doesn't see anything.

    6. CI

      Hmm.

    7. AJ

      So when it tries to reason spatially, when it tries to produce, like, you know, any visual outputs, they're blind models. They see everything as a, a, a full sequence, right? Like, you know, even the grid nature of, of, of images and visual information is not apparent to LLMs. So when you start to do VLMs, which are vision language models, right? Like, you know, you start to teach them a little bit about image p- part of it, VLMs are still not generative. VLMs understand images, but VLMs can't generate images. So we have this world where, like, you know, you have understanding in language and, and generation of text, and then you have, uh, models like Flux, which are good at generating images-

    8. CI

      Right

    9. AJ

      ... right? Uh, which are great models, by the way, right? But then they don't have any of this understanding.
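The remark above that a language model sees an image as one long sequence, with the grid structure not apparent, can be made concrete with a small sketch. It assumes a row-major flattening of a hypothetical 4 x 6 grid of image patches; the grid size and function names are illustrative and do not come from any real model. Two patches that are vertical neighbours end up a full row apart in the sequence, while two patches at opposite edges of the image can end up adjacent.

```python
# Illustrative only: a tiny "image" treated as a grid of patches,
# flattened row by row into a 1D sequence (an assumed scheme).
HEIGHT, WIDTH = 4, 6


def flat_index(row: int, col: int, width: int = WIDTH) -> int:
    """Position of grid cell (row, col) in a row-major 1D sequence."""
    return row * width + col


def sequence_distance(a, b, width: int = WIDTH) -> int:
    """How far apart two grid cells are once the grid is flattened."""
    return abs(flat_index(*a, width) - flat_index(*b, width))


def grid_distance(a, b) -> int:
    """Manhattan distance between two cells on the original 2D grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Horizontal neighbours stay adjacent in the sequence.
right = sequence_distance((1, 2), (1, 3))
# Vertical neighbours are pushed a whole row (WIDTH positions) apart.
below = sequence_distance((1, 2), (2, 2))
# End of one row and start of the next look adjacent in the sequence,
# even though they sit on opposite sides of the image.
wrap = sequence_distance((1, 5), (2, 0))

print(right, below, wrap, grid_distance((1, 5), (2, 0)))
```

Real vision models typically add positional information to recover the 2D layout; this sketch only shows what is lost when nothing of the kind is provided.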

    10. CI

      Right.

    11. AJ

      Right? And I think Andy talked about that last time as well-

    12. CI

      Yes

    13. AJ

      ... that, like, there's this big chasm in between these two things. Understanding is separate and, and, and language is separate-- oh, sorry-

    14. CI

      Generation

    15. AJ

      ... generation is separate. But in language, that's not true. An LLM is good because it understands text and generates text all in one go.

    16. CI

      Hmm.

    17. AJ

      Right? There's no, there's no delta in between. There's no two models that are actually doing it. If we want to solve world understanding and, quote-unquote, "world models" that people are calling it, that's what we need to do.

    18. CI

      But, um, we've-- I mean, for at least about a year, I guess, we've had models that can generate language tokens and image tokens, right, with NanoBanana.

    19. AJ

      Right.

    20. CI

      But they, they were-- like, NanoBanana was still not able to generate... I, I, I remember trying to generate schematics-

    21. AJ

      Right

    22. CI

      ... like this.

    23. AJ

      Uh-huh.

    24. CI

      I, I tried to generate the factory slide-

    25. AJ

      Yeah

    26. CI

      ... with NanoBanana. I couldn't.

    27. AJ

      Okay.

    28. CI

      Why were the capabilities still not there with basic sort of like these jointly trained models?

    29. AJ

      So from what we know of Google's architecture, NanoBanana is still a fused architecture-

    30. CI

      Mm-hmm

  12. 29:50 - 34:52

    The skills/tools/model stack: how Luma agents turn expert craft into reusable leverage

    1. AJ

      Uh, yeah. So actually, uh, let me talk about how we deploy these architectures, first of all. So this is what we are trying to build. If we wanted to do end-to-end work, this should be very familiar. Like, you know, if you've taken a CS class, it's the REPL loop, read-eval-print loop. This is how computers work, have worked for a very, very long time. If you think about the von Neumann architecture, it is built around, like, you know, the REPL loop generally. It was not thought about at the time this way, but now we think about it this way. If you want to deploy models to not just produce, like, you know, text tokens or image tokens, but actually to do work, end-to-end work, how do you build these systems? So how do you do that REPL loop? One way is doing the left one, where, like, you know, there you have different models for each kind of things, and there's, like, two schools of thought. You produce federated models or, like, you know, you have this kind of like... tiny models that are each doing specialized work, and then you, you make them combi- or, or you just pass outputs from each other, and you probably have a judge model on top that, like, you know, judges and orchestrates all of that work. That's approach one. And approach two is that you have these, like, you know, mega models in the middle, [chuckles] um, where they have-- where they share this, like, you know, deep connective tissue, and they can reason in one single space. And you give them, you know, inputs, and you expect outputs of them. They're iterative models, so it's not like, you know, one shot all the outputs that are gonna come out. But we are betting on this second approach.
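The two approaches described above can be sketched side by side. This is a minimal illustration under stated assumptions, not Luma's implementation: `federated`, `unified`, the stub specialists, the judge, and the "DONE" stop signal are all hypothetical names invented here. The first function fans a task out to specialised models and lets a judge pick; the second calls one model repeatedly over a shared state, in the spirit of a read-eval-print loop.

```python
from typing import Callable, Dict, List


def federated(task: str,
              specialists: Dict[str, Callable[[str], str]],
              judge: Callable[[List[str]], str]) -> str:
    """Approach one: many specialised models, a judge model on top."""
    candidates = [model(task) for model in specialists.values()]
    return judge(candidates)


def unified(task: str,
            model: Callable[[List[str]], str],
            max_steps: int = 3) -> List[str]:
    """Approach two: one model iterating over a single shared state."""
    state = [task]                 # read: everything seen so far
    for _ in range(max_steps):
        output = model(state)      # eval: one model reasons over all of it
        state.append(output)       # print: the output feeds the next read
        if output == "DONE":       # hypothetical stop signal
            break
    return state


# Stubs standing in for real models (assumptions, for illustration only).
specialists = {
    "text": lambda t: f"text:{t}",
    "image": lambda t: f"image:{t}",
}
judge = lambda candidates: max(candidates, key=len)
stub_model = lambda state: "DONE" if len(state) >= 3 else f"draft {len(state)}"

picked = federated("slide", specialists, judge)
trace = unified("slide", stub_model)
print(picked)
print(trace)
```

The structural difference is where the intermediate results live: in the federated version they pass between separate models, while in the unified version a single model sees the full state on every iteration.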

    2. CI

      Hmm.

    3. AJ

      And the reason is very simple, because we think intelligence is not this pipeline architecture problem. If you think about the systems of intelligence, the systems of intelligence don't look like, you know, this kind of big database problem. The systems of intelligence look more like the human brain, where you let information itself design the architectures and circuits inside it, like what we do during training, and hopefully very soon in continual learning, these circuits will change as we, as we are actually, uh, you know, using these models. And then you sort of step away from that. [chuckles]

    4. CI

      Hmm.

    5. AJ

      You manage context outside. You, you know, manage memory, sometimes outside, sometimes inside, like how you do with caches in, in CPUs today. But the actual processing unit are these unified models. So that is sort of our approach of how we, how we think about building them. And how we think of improving them is a little bit like this. So if you wanna think about, like, what is the computer of the future looks like, actually, what is every agent product today, uh, it's some version of this, basically. Like, you know, this is not a big revelation. This is how things are being built. So you have, like, you know, a tool harness in the middle. Uh, I'm gonna go from the middle up. This tool harness means systems that can use Linux, systems that can use, uh, you know, call APIs, all of these kind of things. But then how does it all work? How does it actually full work gets done? So you have this, like, fat stack of skills on the top. These are domain-specific understanding, right? So you wanna teach a robot, like, you know, how to assemble something, right? That's not a normal thing, right? Like, you know, if you wanna think about, like, how is an iPhone assembled, this is a very domain-specific thing. You can give it all that information. It doesn't need to be in the model. It doesn't even need to be in the tools. You give this information as context, and you can do this across huge amount of verticals, huge amount of, uh, like, you know, uh, different task in those verticals. Then you have tool harnesses, where you give it as general ca-- ability to call tools and, and things like that. And finally, orchestrating all of that and thinking through all of that is this unified model-

    6. CI

      Hmm

    7. AJ

      ... at the bottom. That is interpreting all of this multimodal information, generating tool calls, understanding which skills to use, and producing the outputs. So this is how we think the architecture of the future o- of computers will look like, and this is what we have built the current product basically on. This is, this is basically built on this-

    8. CI

      Right

    9. AJ

      ... kind of architecture. Yeah.
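The three layers described above (skills on top, a tool harness in the middle, a model at the bottom) can be sketched as follows. All names here (`Agent`, `select_skill`, `stub_model`, the `ocr` tool) are hypothetical, and the sketch simplifies in two labelled ways: the skill is picked by a keyword match rather than by the model, and the list of tool calls is passed in rather than generated by the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_model(request: str, context: str, observations: List[str]) -> str:
    """Stand-in for the unified model: it only reports what it was given."""
    return (f"output for {request!r} using {len(context)} chars of skill "
            f"context and {len(observations)} tool results")


@dataclass
class Agent:
    skills: Dict[str, str]                    # skill name -> domain guidance text
    tools: Dict[str, Callable[[str], str]]    # tool name -> callable
    log: List[str] = field(default_factory=list)

    def select_skill(self, request: str) -> str:
        # Simplification: keyword match stands in for the model's choice.
        for name in self.skills:
            if name in request.lower():
                return name
        return ""

    def run(self, request: str, tool_calls: List[str]) -> str:
        skill = self.select_skill(request)
        context = self.skills.get(skill, "")  # skill is passed as context
        self.log.append(f"skill={skill or 'none'}")
        observations = []
        for name in tool_calls:               # tool harness executes the calls
            observations.append(self.tools[name](request))
            self.log.append(f"tool={name}")
        return stub_model(request, context, observations)


agent = Agent(
    skills={"slides": "Use one idea per slide."},
    tools={"ocr": lambda request: "text found in image"},
)
result = agent.run("Make slides in this style", ["ocr"])
print(agent.log)
print(result)
```

The point of the design is that the skill text lives outside both the model and the tools, so domain expertise can be added or swapped as context without retraining or new tooling.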

    10. CI

      So c- actually, could you just do a one-to-one mapping? So-

    11. AJ

      Yeah

    12. CI

      ... here, where, uh, where was the harness? Where were the skills?

    13. AJ

      Yeah.

    14. CI

      Where was the model?

    15. AJ

      So actually, when it generated these slides, right, someone on our team who's really, really good at producing greatly designed slides wrote a, I don't know about, it's a 50-page document on what it means to design good slides, right? And if you see, ac- I don't know if the prompt is there. Um, I've got a clear picture. Now kick off planning and generation. Okay, so after this, it would have, uh, said like, "Oh, let me look up the skills I have, like, access to."

    16. CI

      Ah, so that was the skill.

    17. AJ

      That was the skill.

    18. CI

      That's a general purpose, um-

    19. AJ

      Slide skill

    20. CI

      ... like, best-in-class slide creation skill-

    21. AJ

      Correct

    22. CI

      ... that was created internally by a human and then uploaded for anybody else to use-

    23. AJ

      Exactly

    24. CI

      ... automatically.

    25. AJ

      Exactly. Uh, so that's the skill layer. Then, uh, the model layer is obviously the one that is generating and, and generating the tool calls and all of these kind of things. And the tool layer here, so not many tools were necessary, but I, I, I think, like, you know, your image that, that you gave, that was also passed as context.

    26. CI

      Right.

    27. AJ

      And we probably ran OCR on it just to, like, you know, see, like, you know, what, what kind of things are. So this was not a very tool call heavy thing. But had you asked it to make an interactive webpage-

    28. CI

      Right

    29. AJ

      ... that, like, you know, animates all of this stuff, then we would have gone and called, uh, uh-

    30. CI

      A different skill

  13. 34:52 - 42:37

    Business and market dynamics: capital intensity, enterprise adoption, and creative productivity

    1. CI

      Okay, I'm gonna ask you one last question before we switch, which is... Okay, so it took a couple years to put the whole system together-

    2. AJ

      Mm-hmm

    3. CI

      ... which is a fairly high-scale system. Can you talk about the business for a sec? You announced earlier this year-

    4. AJ

      Yeah

    5. CI

      ... I think you raised about a billion dollars.

    6. AJ

      $1.5.

    7. CI

      $1.5 billion.

    8. AJ

      Yeah, total.

    9. CI

      Yeah. Over your lifetime, Luma's raised about $1.5 billion. Of that, I think a billion was raised this, this-

    10. AJ

      This year

    11. CI

      ... these last 12 months. Um, you know, it-- why does, why is this such a capital-intensive effort if it's not as high scale as language?

    12. AJ

      If you really wanna do it correctly, it is larger scale than language because it is strictly a superset of, like, you know, the work that is going on in language. But currently, we don't care as much about coding, for instance, so we don't have to spend that much effort towards it. We can go towards all the areas that language models are not good at, and that means we can actually have a subscale compute infrastructure, subscale data infrastructure, things like that, so it doesn't require 100 billion yet. Uh, like, you know, we can do with one billion what, like, you know, generally takes five, 10 billion annual run rate to be able to produce. Um, but if you think about it, like, where things are going, uh, you know, in one year, two year, three years' time, we believe that these systems will far surpass language systems-

    13. CI

      Hmm

    14. AJ

      ... just because of the access to more data. More data is better, right? Just because of their understanding of more domains. So I'll give you an example. One of our customers who's using these systems, uh, they work in energy industry. Uh, you can guess who that is. Um-

    15. CI

      Right

    16. AJ

      ... and now suddenly, like, you know, our systems have no idea about, uh, like, you know, grid systems. Like, all right, like, you know, how the energy grid actually works and, and, and how they wanna be able to do that. So what we did is we started to ingest their energy grid diagrams and energy grid code and all of these kind of things. And suddenly, our systems are better at producing schematics and planning than Anthropic's coding models are because they can't actually read all that information.

    17. CI

      Hmm.

    18. AJ

      They can't actually see like, you know, how the things are laid out, that sort of problems. It's a very small example. Um, studios have another big example where like, you know, yes, LLMs they have had forever, but a story is not just text. A story is all of the physical stuff that is happening. If it has visual understanding, it can do much better. So we believe like, you know, especially as the age of robotics comes about, you will need these systems to be general-

    19. CI

      Right

    20. AJ

      ... and these systems to be able to do everything, including writing code, and, and that's kind of where we're gonna go. But today, this gives us a very great business where language models are not really playing. Um, currently, we are... Like, you know, when we started the company, I mean, we were very small. Today, we work with some of the largest studios in the world. Now, we work with the largest advertising agency in the world, Publicis. They're just deployment channels for us. We work with the second-largest brand in the world, Coke, who is moving three billion dollar of annual production of, of content to Luma, basically. And, um, in addition to that, like, you know, in, in, in some of the areas like how do you do work just in a company, how do you communicate information visually? The- there's starting to be like, you know, these new areas in which previously only designers and, and artists could work. Now, everyone is starting to do that work.

    21. CI

      Yeah. So th- this was... You know, you had an event earlier this year.

    22. AJ

      Mm-hmm.

    23. CI

      I mean, like I think it was three weeks ago-

    24. AJ

      Yeah

    25. CI

      ... in SF, and I came by, and the thing that shocked me was that it was all artists and creatives, and y- I mean, you spoke for a little bit off the stage, but then they got you off the stage.

    26. AJ

      Yeah.

    27. CI

      And then a bunch of folks from Hollywood came by, a bunch of designers, and it was the first time I'd seen so many artists and creators, not, not like machine learning people-

    28. AJ

      Yeah

    29. CI

      ... but creatives excited about using tools. W- w- why has... That, that's, and that's very new.

    30. AJ

      Mm-hmm.

  14. 42:37 - 44:59

    Q&A: OpenAI Sora pause, focus as organizational physics, and what it signals for the market

    1. AJ

      So the question is: What is my hypothesis why Sora shut down? Whether it's a business reason, it's an architecture reason. And two, what impact does it have on us in the industry, but also, like, you know, on creatives? So, I mean, I can only give you hypothesis. I don't know really what is happening inside, uh, uh, OpenAI. But, I mean, the, the, the one word here is really focus. OpenAI, at the core of it, is a large language model lab. What they do really, really well is produce models that are very good for chat particularly, right? Chat is a vertical that has, uh, about eight billion customers, right? Um, maybe not little kids, but if you... Maybe they too, right? Like, you know, because they wanna talk to a computer. So pretty much all of humanity is a good customer of chat. Executing on that is a really hard problem. Executing on anything at that scale, you need to go into the depths of hell to be like, you know, get everything working really, really well. When you do everything, that's really hard to do. I mean, Luma also had that problem actually, right? Like, you know, in early days when we, we were not really clear about how do we execute on this, so we tried a lot of parallel paths. But doesn't matter how much money you have, doesn't matter how much, how many people you have. Uh, this was also a lesson from Apple. There's, uh, way more things at Apple that they choose not to do than they choose to do, right? That is because it doesn't matter the money, doesn't matter the people, the organizational physics still come into play.

    2. CI

      Less is more.

    3. AJ

      Exactly. There's only so much attention you have as a company, not as a person, but as a company, that you can actually devote to making something. So OpenAI doing literally everything is not good for their business, and I think that is a realization that is setting in, and I think this will not be the last thing that they have actually canceled, right? There might be actually even more. Um, one thing I will challenge is OpenAI was not the largest player in the market. It is actually Google that is doubling down on, on video, on images, on visual generation, right? Like, you know, Gemini are... Gemini has great models that, that do pretty much all of these things actually. It doesn't indicate actually anything on the size of the market. It just indicates that they are getting their [audio cuts out] kicked because of lack of focus by Google, by Anthropic, and those kind of things, and they have to focus if they wanna go IPO, right? And that is the market that we are actually entering at this point. For Luma, what does this mean? Uh, I mean, this is great news. This validates like, you know, our, our, our thesis that, like, you can only do so many things at a time. And, um, this is the area that we have chosen to go in because this is a very, very big market with huge number of people that

  15. 44:59 - 57:36

    Q&A: Copyright, model architectures shifting (GANs → diffusion → hybrid), and the remaining gap to “world models”

    1. AJ

      call it their profession. So it actually gives us, uh, very good footing in, in the same market. So that's what I would say. So the question is, given that anyone can make a video about anything and content about anything, what happens to copyright, right? So I think copyright and the ability to produce something are orthogonal problems, right? If you're talented enough, you can make Mickey Mouse in Photoshop, really, uh, and you can actually produce great stuff about Mickey Mouse. Like, let's say you're DreamWorks. You don't have rights to Mickey Mouse, but you have all the people who can actually produce anything related to Mickey Mouse, which you don't. Why? Because the law exists that, like, you know, prevents you from doing that. So I think none of that has changed. Has it become easier to violate other people's copyright? Yes, I think so, right? You didn't ask me, like, what the responsibility of platforms is. Again, the responsibility of the platforms is the same as it was for Photoshop, right? Like, you know, it's not Photoshop's responsibility to prevent you from producing Mickey Mouse. It's your responsibility as a law-abiding citizen, um, to not violate the law of the land.

    2. CI

      Hmm.

    3. AJ

      So I think it is pretty much orthogonal, basically. Like, you know, generative AI doesn't change copyright in any way, shape, or form, um, at least on the output side of it.

    4. CI

      But specifically, if there's a law that says you can't do XYZ, you'll, you will adhere to it.

    5. AJ

      Absolutely.

    6. CI

      Yeah.

    7. AJ

      If we get a DMCA notice, we'll take it down, right? Like, you know, if you're hosting it.

    8. CI

      Right.

    9. AJ

      Um, if that person used us to create it, and we get a, a call of like, all right, like, you know, this person made something, it is not our responsibility to point law enforcement to them, right? Like, you know, because that's not the law of the land.

    10. CI

      Ah, right. That... I see.

    11. AJ

      So.

    12. CI

      You, you have... You, you protect the users in that case.

    13. AJ

      That's right.

    14. CI

      Right.

    15. AJ

      So the question is, um, GANs were very popular 2017, 2018, and now-

    16. CI

      Yeah

    17. AJ

      ... you know, the world has shifted pretty much entirely towards diffusion models. What is the space of GANs in today's models? Uh, or today's architectures, basically. Um, that's a great question, actually. We still use GANs quite a lot. We use techniques from GANs quite a lot. But GANs are one of the most finicky architectures to ever work with. So GANs, if you don't know, are generative, uh, uh, like, you know, adversarial networks, and as the name says, they're adversarial networks. So, like, you know, you, you design the, the, uh, uh, objective in a very different way to diffusion models, right? Like, you know, very... You have a very predictable gradient descent. I mean, they still explode sometimes, but it's a very predictable system. GANs are still actually used quite heavily in distillation networks. Like, you know, if you wanna do distillation, like, GANs are actually pretty useful. Um, if you wanted to do a real-time system, you would still go to GANs quite a lot. But because GANs are just not very predictable, researchers don't wanna work on them. [chuckles] And that is the laws of, like, you know, physics in, in AI. What researchers wanna work on is generally what will get worked on, right? Like, so I can, I can make the case that, "Hey, Rust is more efficient." Doesn't matter. Everybody wants to code in Python, so that's what will be done. But also, GANs have not shown the kind of scaling that we are seeing with transformers, right? Uh, GANs are primarily, you know, UNet-based and, and, and convolution-based models, and they just don't really show the kind of learning that you can get from transformers. Can GANs be implemented using transformers? Yes, there are some papers about it. But at that point, you're really, really trying very hard to, like, you know, just, just do GANs, and, and that's okay. But diffusion models also now are on the way out. 
So, uh, I know this will be a little bit controversial if people are thinking about it, but diffusion models have physics that is not actually bearing out on scaling side of it. So Luma and, and some other companies are actually moving away to hybrid autoregressive and diffusion regimes. That's what our, our, um, unified models actually are. Because diffusion models actually have some really, really bad habits that are hard to unlearn and hard to, like, you know, get out of, of the system. So they're also on the way out, actually. So yeah.
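The contrast drawn above between the adversarial objective and the diffusion objective can be written down using the standard textbook formulations. These are the commonly cited forms (the original GAN minimax game and the simplified denoising loss used for diffusion models), given here as background; they are not stated in the talk and are not a description of Luma's training objective.

```latex
% GAN: a two-player minimax game between generator G and discriminator D.
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Diffusion (simplified denoising loss): a single regression target,
% predicting the noise \epsilon that was added at step t.
L_{\text{simple}}(\theta) =
  \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \Big[ \big\| \epsilon - \epsilon_\theta\big(
      \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t
  \big) \big\|^2 \Big]
```

The first is a saddle-point problem in which two networks are optimised against each other, which is one common explanation for the training instability mentioned above; the second is a plain minimisation of a squared error, which fits the description of a more predictable gradient descent.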

    18. CI

      It's a very, uh... Um, if, if you realize basically when we first started teaching the class, Mike, there, there, there were debates about what the right l- programming language was for, for security, and it feels like architectures have come full circle, basically. Don't-

    19. AJ

      Sure.

    20. CI

      It-- Uh, m- good note for office hours this week.

    21. AJ

      The question was, as models get more and more powerful, what is the space of human creativity, especially in these unified models that can do pretty much all tasks, visual tasks, language tasks, all these kind of things? My stance on this has been actually very, very sterile from day one. I don't think anything the model is doing is creative or not creative. Whether it's creative or not creative is for humans to judge. That judgment alone is the act of creation, right? What you choose to do is an act of creation. What-- How I tend to spend my time is actually a very creative endeavor, right? Like, you know, it... because it will produce outputs that people will generate, consider creative or not creative. But more importantly, in a practical physical system that is an AI system, how-- where is human-- where is the role of human? It's in that fat skills area. This is why our slides look good, because someone who knows this really, really well went in and taught the model a million times over that, like, you know, this is what good looks like to humans, and this is what bad looks like to humans. This is just like programmers, right? Like, you know, before, artists never had this kind of huge leverage. They did something once, and it will run a billion times, right? Programmers have always had this leverage. You write the program once, and it will run again and again and again and again on everyone's computers, on everyone's phones many, many times over and produce value. Artists produce one thing, and then that's one thing. That's it, right? Now, other creatives in the world, not just programmers, have this leverage because of this architecture. You teach the model once, and they're able to produce like, you know, huge amount of really, really great things in different contexts. This is actually an explosion of creative potential that just never has been, right? So the skills and human creativity are much more important. 
Actually, it will weed out people who were mediocre, and it will, like, elevate people who are really great to even greater heights because now their work will be rerun a trillion times over. Okay, so Hollywood is default dead right now, right? And that has nothing to do with AI. Uh, that has really nothing to do with any of the technological changes recently. Hollywood's business model has been deteriorating for the past thirty years, and COVID really accelerated it, and then the writers' strike just was the nail in the coffin. Um, at this point, we are at a place where, like, you know, this production that I'm talking about is the first production in LA proper in the last five years. First production, because all production has moved out of Los Angeles. Think about it. Hollywood doesn't make movies anymore. Hollywood finances them, but doesn't make them.

    22. CI

      Where, where are they being made?

    23. AJ

      In Greece, in, in Canada, in, in Ireland. Wherever you get tax incentives, you're gonna make it there. So what does so-solving this problem actually mean, right? First of all, Hollywood has to stop thinking like PE. Currently, Hollywood's business model is like a private equity. Oh, Guardians of the Galaxy was a great hit. Let's make number two, number three, number four, number five, number seven, number ten, number twenty, right? Uh, uh, l- how many Avengers are there now? I don't even know, right? And how many crossovers of Avengers and Spider-Man? And I'm-- won't be surprised if, like, you know, Tintin is in there one day, right? [laughs] Uh, like some multiverse thing. Uh, as a physicist, that's a really troubling thing for me, the multiverse universe, right? Um, that's not how it works at all. Um, but it is emblematic of a PE mindset where like, all right, we created a franchise, we created an asset. How do we rent-seek that asset the most we possibly can? But, you know, audience don't think like that. More and more people want to watch great things, want to go to theaters, want to actually like, you know, watch things on their phone. That is emblematic in Netflix's growth. I mean, the, uh, on Thursday, you're gonna see like, you know, the, the, this quarter results. I'm not saying buy or sell anything. Uh, but like you've seen the growth of Netflix, and they produce eight hundred productions a year compared to like, you know, the, the five, ten, twenty that large studios are producing right now. That is a PE mindset. So those eight hundred productions don't have five hundred million dollar budgets, right? They have, uh, uh, like, you know, they're like ten million, twenty million, thirty million, fifty million dollar budgets, and this is how they're scaled. What does this allow? This allows different kinds of, and more kinds of stories to be told, and that means it appeals to a wider audience. 
So your platform then becomes more appealing to more people. Are they making again a Harry Potter? Yes. I'm very happy about it, by the way, right? But this is also PE mindset that like, why are we making Harry Potter when we have so many other great books that should be made again and again and again and again, right? They shou- sorry, shouldn't be made again and again. So Hollywood has been default dead for a long time. If nothing changes, this is not about AI again. If nothing changes, all those jobs are actually already gone, right? Um, the, the people in those industries know that. AI is a chance to actually change the business model once and for all. Because you can move away from these massively expensive production methods. You can move away from these huge, huge, uh, you know, time and capital sinks and go back to an era where like, you know, many, many ideas can be tried and one has a shot of building people into the, you know, into theaters. Ryan Gosling from Hail Mary, if you have not seen it, great movie, uh, makes a great point. It's not the audience's job to come to the theaters to s- you know, keep Hollywood alive. It's Hollywood's job to make great things so the audiences want to watch it. You can't blame your customer-

    24. CI

      Um, what I'm gonna do is... That was well said, and I think the PE mindset is, uh, is m- is going to come up over and over again, where capital markets, where- wherever they look for predictability-

    25. AJ

      Yeah

    26. CI

      ...um, we tend to find a stagnation of innovation, and I think that's, that's hurting a lot of people.

    27. AJ

      That's a great question, and that's the one that like, you know, the whole company spends all their time thinking about, honestly. So the question is: What is the delta between where we are to getting to a place where world models, video models, whatever have you, are as generally used and useful as language models are today? Fair?

    28. SP

      Fair.

    29. AJ

      Right. So i- there's only one word, basically: intelligence. That's the delta. So currently, image models and video models that are not unified models are really, really stupid. And, and I mean that in a non-derogatory way of that, right? When I say stupid, I mean in, in this way. Like, when you work with a person who you don't consider to be intelligent, what are the signs? They forget what you said. You have to tell them the same thing again and again. They don't actually understand what you said. They kinda are a facsimile of understanding, right? It's like you said something, but they're like, "But that's not what I said." Like, you know, "Yes, you, you kind of interpreted my words literally, but that's not what I'm saying. You have no context of what I'm saying." They are able to do small things, but like, you know, when you ask of more of them, then they can't actually do that. This is what today's video models and image models are. All of them, basically, right? Um, that's what we tried to solve with UniOne. They need to be as intelligent as language models are. They need to have multi-turn, right? So when you, when you ask it something, afterwards, it needs to be able to go back and say, "All right. This is what I generated. This is what you had asked. I, I have memory. Let me iterate on it, and let me fix it." How annoying would ChatGPT be if you only had one turn on it, right? And then you had to repeat the thing and then another turn, and repeat the thing and then another turn. That's ridiculous, right? Nobody uses that. See, that was the difference between LLMs being a research project and them becoming generally useful. That was RLHF, right, that enabled chat multi-turn. So that was number one. Number two is, is how much intelligence did they actually have? So current image models and video models are beautiful pixel generators. 
They have really no understanding of what the hell they're generating, the physics of it, the, the introspection on it, all of these kind of things. Unified models are designed to solve this kind of problem. So when you use them in, in things like education, for instance, right? Like, you know, these slides could, uh... Sorry, not this one, but the ones I had, could easily be used for te- by teachers, right? With probably a lot more density, and I can show you really good examples of it producing really high-density things. Videos are some of the best explainers in the world, right? Imagine a history class that is not taught as drab text, but you can actually see, and most importantly, you can do alternatives. What if the Rubicon was not crossed, right? What if Caesar was not murdered? What if, like, you know, these things didn't happen? What if Archduke Franz Ferdinand was not shot in 1914, right? Like, you know, would there still be a World War I? I posit yes, and like, you know, we can go into that. But what if you see that flow out, you had this level of like, you know, uh, temporal understanding and coherency and these kind of things. Language models are getting there. Image and video models don't have that. That's what we need to solve. So that is the distance between them being just tools that can produce almost stock footage to things that can do end-to-end work.

    30. CI

      Awesome. Thank you, Amit, for being here.

Episode duration: 57:41

Transcript of episode 6nUl_w5W9Wk
