Why has generative AI ingested all the world's knowledge but not been able to come up with scientific discoveries of its own, and has it finally started to understand the physical world? We'll discuss it with Meta Chief AI Scientist and Turing Award winner Yann LeCun. Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversations about the tech world and beyond. I'm Alex Kantrowitz, and I am thrilled to welcome Yann LeCun, the chief AI scientist at Meta, Turing Award winner, and a man known as the Godfather of AI, to Big Technology Podcast.
Yann, great to see you again. Welcome to the show.
Pleasure to be here. Let's start with a question about scientific discovery and why AI has not been able to come up with it until this point. This is coming from Dwarkesh Patel; he asked it a couple of months ago: "What do you make of the fact that AIs, generative AI, basically have the entire corpus of human knowledge memorized, and they haven't been able to make a single new connection that has led to a discovery? Whereas if even a moderately intelligent person had this much stuff memorized, they would notice, 'Oh, this thing causes this symptom, this other thing causes this symptom, there might be a medical cure here.' So shouldn't we be expecting that type of stuff from AI?"
Well, from AI, yes; from large language models (LLMs), no. There are several types of AI architectures, and all of a sudden when we talk about AI, we imagine chatbots. Chatbots, or LLMs, are trained on an enormous amount of knowledge that is purely text, and they're trained to basically regurgitate or retrieve, essentially to produce answers that conform to the statistics of whatever text they've been trained on. It's amazing what you can do with them; it's very useful, there's no question about it. We also know that they can hallucinate facts that aren't true, but in their purest form they really are incapable of inventing new things.
Let me throw out this perspective that Tom Wolf from Hugging Face shared on LinkedIn over the past week. I know you were involved in a discussion about it; it's very interesting. He says, "To create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought of or dared to ask, one that writes 'What if everyone is wrong about this?' when all textbooks, experts, and common knowledge suggest otherwise."
Is it possible to teach LLMs to do that?
No. Not in their current form. Whatever form of AI is able to do that will not be an LLM, though it might use an LLM as one component. LLMs are useful for producing text, so future AI systems might use them to turn abstract thoughts into language.
In the human brain, that's done by a tiny little area right here called Broca's area; it's about this big. That's our LLM. But we don't think in language; we think in mental representations of a situation. We have mental models of everything we think about. We can think even if we cannot speak, and that takes place there; that's where real intelligence is, and that's the part we haven't reproduced, certainly not with LLMs.
So the question is: are we eventually going to have AI architectures or systems that are capable not just of answering questions that already exist, but of solving new problems, giving new solutions to problems that we specify? The answer is yes, eventually, but not with current LLMs. And then the next question is: are they going to be able to ask their own questions, to figure out what the good questions to answer are?
And the answer is eventually yes, but it's going to take a while before we get machines capable of this. In humans, we have all these characteristics: we have people who have extremely good memory and can retrieve a lot of things; they have a lot of accumulated knowledge. We have people who are problem solvers: you give them a problem, they'll solve it.
And I think Tom Wolf was actually talking about this kind of thing. He said, if you're good at school, you're a good problem solver: we give you a problem, you can solve it. But in research, the most difficult thing is to ask the right questions, the important questions. It's not just solving the problem; it's also asking the right questions, framing a problem in the right way so that you get new insight. And then after that comes: okay, I need to turn this into equations, or into some practical model. And that may be a different skill from the one that asks the right questions, and it might also be a different skill from solving the equations; the people who write the equations are not necessarily the people who solve them.
And other people will remember that there is some textbook from 100 years ago where similar equations were solved. Those are three different skills. So LLMs are really good at retrieval, not good at solving new problems, at finding new solutions to new problems. They can retrieve existing solutions, and they're certainly not good at all at asking the right questions.
For those tuning in and learning about this for the first time: an LLM, or large language model, is the technology behind things like the GPT models baked into ChatGPT.
But let me ask you this, Yann: the AI field does seem to have moved from standard LLMs to LLMs that can reason and go step by step. I'm curious, can you program this sort of counterintuitive or heretical thinking by imbuing a reasoning model with an instruction to question its directives?
Well, we have to figure out what reasoning really means. Obviously, everyone is trying to get LLMs to reason to some extent, to perhaps be able to check whether the answers they produce are correct. And the way people are approaching the problem at the moment is that they basically try to do this by modifying the current paradigm without completely changing it.
So, can you bolt a couple of bells and whistles on top of an LLM so that you have some primitive reasoning function? That's essentially what a lot of reasoning systems are doing. The way we get LLMs to appear to reason is chain-of-thought: basically, tell them to generate more tokens than they really need to, in the hope that in the process of generating those tokens they will devote more computation to answering the question. To some extent that works, surprisingly, but it's very limited; you don't actually get real reasoning out of it.
Reasoning, at least in classical AI, involves in many domains a search through a space of potential solutions. You have a problem to solve, and you have some way of telling whether the problem is solved. Then you search through a space of solutions for one that actually satisfies the constraints, or is identified as being a solution.
That's the most general form of reasoning you can imagine, and there is no mechanism at all in LLMs for this kind of search. What you have to do is bolt it on top. One way to do this is to get an LLM to produce lots and lots of sequences of tokens that represent answers, and then have a separate system pick which one is good.
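To make that "generate many candidates, then let a separate checker pick" pattern concrete, here is a minimal toy sketch in Python. The generator and verifier below are hypothetical stand-ins (random arithmetic expressions and an exact arithmetic check), not any real model's API; the point is only the structure: sample lots of answers, then filter them with an external test.

```python
import random

def generate_candidates(prompt: str, n: int = 200) -> list[str]:
    """Stand-in for an LLM sampling many candidate 'answers' (here: random
    arithmetic expressions). The prompt is ignored in this toy version."""
    ops = ["+", "-", "*"]
    return [f"{random.randint(1, 9)} {random.choice(ops)} {random.randint(1, 9)}"
            for _ in range(n)]

def verify(candidate: str, target: int) -> bool:
    """The separate system that checks whether a candidate actually works."""
    try:
        return eval(candidate) == target
    except Exception:
        return False

# "Reasoning" as sampling plus filtering: no internal model of the problem,
# just many generated sequences and an external check on each one.
target = 12
solutions = [c for c in generate_candidates("make 12") if verify(c, target)]
print(solutions[:5])  # e.g. ['3 * 4', '6 + 6'] if such candidates were sampled
```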
This is a bit like writing a program by more or less randomly generating instructions, maybe respecting the grammar of the language, and then checking all of those programs to find one that actually works. It's not a very efficient way of producing correct code, and it's not a good way of reasoning either. A big issue is that when humans or animals reason, we don't do it in token space. In other words, when we reason, we don't have to generate a text that expresses a solution, then generate another one, and another, and then pick the good one among those we produced. We reason internally: we have a mental model of the situation, we manipulate it in our head, and we find a good solution.
When we plan a sequence of actions, to build a table or something, we plan that sequence with a mental model in our head, and this has nothing to do with language. If I tell you: imagine a cube floating in front of us right now, and rotate that cube 90° around a vertical axis. You can imagine this taking place, and you can readily see that if I rotate it 90°, it's going to look just like the cube I started with, because you have a mental model of a cube. That reasoning happens in some abstract, continuous space; it's not in text, and it's not related to language at all.
Humans do this all the time, animals do this all the time, and this is what we cannot yet reproduce with machines. Yeah, it reminds me: you're talking about chain-of-thought and how it doesn't produce much novel insight. When DeepSeek came out, one of the big screenshots going around was someone asking DeepSeek for a novel insight on the human condition, and as you read it, it's another one of these very clever tricks. It seems impressive because it runs through all these interesting observations about humans, how we take our violent side and channel it toward cooperation instead of competition, and that helps us build more. But as you read the chain of thought, you realize it's pretty much: read Sapiens and maybe some other books, and that's your chain of thought.
Yeah, a lot of it is regurgitation. Now I'm going to move up a part of the conversation I had planned for later, which is the wall. Effectively, is training standard large language models coming close to hitting a wall? Before, there were somewhat predictable returns: if you put a certain amount of data and a certain amount of compute toward training these models, you could make them predictably better. As we're talking, it seems to me like you believe that will eventually stop being true.
Well, I don't know if I would call it a wall, but it's certainly diminishing returns, in the sense that we've kind of run out of natural text data to train those LLMs. They're already trained on the order of 10^13 or 10^14 tokens; that's a lot. That's the whole internet, all the publicly available internet, and then some companies license content that is not publicly available. And then there is talk about generating artificial data, and about hiring thousands of people, PhDs and professors, to generate more data and more knowledge.
Yeah, but in fact it could be even simpler than that, because most of those systems don't understand basic logic, for example. So to some extent there is going to be slow progress along those lines, with synthetic data and with hiring more people to plug the holes in the knowledge background of those systems, but it's diminishing returns: the costs of generating that data are ballooning, and the returns are not that great. So we need a new paradigm. We need a new kind of architecture of systems that, at the core, are capable of search: searching for a good solution, checking whether that solution is good, planning a sequence of actions to arrive at a particular goal, which is what you would need for an agentic system to really work.
Everybody is talking about agentic systems, but nobody has any idea how to build them, other than basically regurgitating plans the system has already been trained on. It's like everything in computer science: you can engineer a limited solution; in the context of AI, you can make a system based on learning or retrieval with enormous amounts of data. But the really hard thing is how you build a system that can solve new problems without being trained to solve those problems.
We are capable of doing this; animals are capable of doing this. Facing a new situation, we can either solve it zero-shot, without training ourselves to handle it, the first time we encounter it, or we can learn to solve it extremely quickly. For example, we can learn to drive with a couple dozen hours of practice, to the point that after 20 or 30 hours it becomes second nature, kind of subconscious; we don't think about it. Speaking of System 1 and System 2, right? That's right.
You know, the discussion we had with Danny Kahneman a few years ago. The first time you drive, your System 2 is fully engaged: you have to use it to imagine all kinds of catastrophic scenarios, and your full attention is devoted to driving. But after a number of hours, you can talk to someone at the same time; you don't need to think about it. It's become sort of subconscious and more or less automatic; it's become System 1. Pretty much every task we learn, we first have to accomplish using the full power of our minds.
And then eventually, if we repeat them sufficiently many times, they become kind of subconscious. I have this vivid memory of once being at a workshop where one of the participants was a chess grandmaster, and he played a simultaneous game against like 50 of us, going from one person to another. I got wiped out in 10 turns or something; I'm terrible at chess. He would come to my table, I'd make my move in front of him, he'd glance at it and immediately play. He didn't have to think about it. I was not a challenging enough opponent for him to have to call on his System 2.
His System 1 was sufficient to beat me. What that tells you is that when you become familiar with a task and you train yourself, it becomes kind of subconscious. But the essential ability of humans and many animals is that when you face a new situation, you can think about it and figure out a sequence of actions, a course of action, to accomplish a goal, and you don't need to know much about the situation other than your common knowledge of how the world works. That's what we're missing in AI systems.
Now I really have to blow up the order here, because you've said some very interesting things that we have to talk about. You talked about how LLMs, large language models, the things that have gotten us here, have basically hit the point of diminishing returns, and we need a new paradigm. But it also seems to me like that new paradigm isn't here yet. I know you're working on the research for it, and we're going to talk about what the next paradigm might be.
But there's a real timeline issue, don't you think? Because I'm just thinking about the money that's been raised and put into this: $6.6 billion to OpenAI last year, another $3.5 billion to Anthropic a couple of weeks ago after they raised $4 billion last year, and Elon Musk is putting another small fortune into building Grok. These are all LLM-first companies; they're not searching out the new paradigm. I mean, maybe OpenAI is, but that $6.6 billion they got was because of ChatGPT.
So where's this field going to go? Because if that money is being invested into something that is at the point of diminishing returns requiring a new paradigm to progress—that sounds like a real problem.
Well, we have some ideas about what this paradigm could be. The difficulty is making it work; that's what we're working on, and it's not simple. That may take years. So the question is: are all the capabilities we're talking about, perhaps through this new paradigm we're thinking of and working on, going to come quickly enough to justify all of this investment? And if they don't come quickly enough, is the investment still justified?
Okay, so the first thing you can say is that we are not going to get to human-level AI by just scaling up LLMs. This is just not going to happen. That's your perspective? No way. Absolutely no way, and whatever you hear from some of my more adventurous colleagues, it's not going to happen within the next two years. There's absolutely no way in hell, pardon my French. The idea that we're going to have a country of geniuses in a data center, that's complete BS. There's absolutely no way.
What we're going to have, maybe, are systems trained on sufficiently large amounts of data that any question a reasonable person may ask will find an answer through them, and it will feel like you have a PhD sitting next to you. But it's not a PhD you have next to you; it's a system with gigantic memory and retrieval ability, not a system that can invent solutions to new problems, which is really what a PhD is. This is connected to the post Tom Wolf made: inventing new things requires the type of skills and abilities that you're not going to get from LLMs.
So there is a big question here. The investment being made now is not for tomorrow; it's for the next few years. And most of the investment, at least on the Meta side, is investment in infrastructure for inference. Let's imagine that by the end of the year, which is really the plan at Meta, we have one billion users of Meta AI through smart glasses, the standalone app, and so on. You've got to serve those people, and that's a lot of computation.
So that's why you need a lot of investment in infrastructure, to be able to scale this up and build it out over months or years. That's where most of the money is going, at least at companies like Meta, Microsoft, Google, and potentially Amazon. This is just operations, essentially.
Now, is there going to be a market for one billion people using those things regularly, even if there is no change of paradigm? The answer is probably yes. So even if the revolution of the new paradigm doesn't come within three years, this infrastructure is going to be used; there's very little question about that. It's a good investment, and it takes so long to set up data centers and everything else that you need to get started now and plan for progress to be continuous, so that eventually the investment is justified. You can't afford not to do it, because not doing it would be too much of a risk to take if you have the cash.
But let's go back to what you said: the stuff today is still deeply flawed, and there have been questions about whether it's going to be used now. Meta is making this consumer bet, that consumers want to use the AI; that makes sense. OpenAI has 400 million users of ChatGPT; Meta has three-point-something billion users, basically everyone with a phone, and then 600 million users of Meta AI, so more than ChatGPT. Yeah, but it's not used as much; the users are not as intense. But basically, the idea that Meta can get to a billion consumer users seems reasonable.
But a lot of this investment has been made with the idea that this will be useful to enterprises, not just as a consumer app, and there's a problem because, like we've been talking about, it's not good enough yet. Look at Deep Research; this is something Benedict Evans has brought up. Deep Research is pretty good, but it might only get you 95% of the way there, and maybe 5% of it is hallucinated. So if you have a 100-page research report and 5% of it is wrong, and you don't know which 5%, that's a problem.
And similarly in enterprises today: every enterprise is trying to figure out how to make AI, generative AI and other types of AI, useful to them, but only maybe 10% or 20% of proofs of concept make it out the door into production, because it's either too expensive or too fallible. So if we are getting near the top here, what do you anticipate will happen with everything that has been pushed in the anticipation that this is going to get even better from here?
Well, again, it's a question of timeline: when are those systems going to become sufficiently reliable and intelligent that deployment becomes easier? But the situation you're describing, where beyond the impressive demos, actually deploying reliable systems is where things tend to falter, isn't new in the use of computers and technology, and particularly AI.
It's basically why we had super impressive autonomous driving demos ten years ago, but we still don't have Level 5 self-driving cars. It's the last mile that's really difficult, so to speak, for cars: the last few percent of reliability that makes a system practical, how you integrate it with existing systems, how it makes its users more efficient or more reliable.
That's where it's difficult. This is why, if we go back several years and look at what happened with IBM Watson: Watson was going to be the thing IBM would push to generate tons of revenue, by having Watson learn about medicine and then be deployed in every hospital. It was basically a complete failure and was sold for parts, and it cost IBM a lot of money, and the CEO as well.
What happens is that actually deploying those systems in situations where they are reliable, where they actually help people, and where they don't run up against the natural conservatism of the labor force, this is where things become complicated. The process we're seeing now, the difficulty of deploying these systems, is not new; it has happened every time.
This is also why, and some of your listeners are perhaps too young to remember this, there was a big wave of interest in AI in the early 1980s around expert systems. The hottest job of the 1980s was going to be knowledge engineer: your job would be to sit next to an expert and turn that expert's knowledge into rules and facts, which would then be fed to an inference engine that could derive new facts and answer questions. There was a big wave of interest; the Japanese government started a big program called the Fifth Generation Computer project, where the hardware was going to be designed specifically for this. It was mostly a failure, and the wave of interest died out in the mid-1990s.
A few companies were successful, but basically for a narrow set of applications where you could actually reduce human knowledge to a bunch of rules, and where it was economically feasible to do so. The wide-ranging impact on all of society and industry was just not there. That has been the story of AI time and again. That said, the signals are clear that LLMs, with all the bells and whistles, will still play an important role, if nothing else for information retrieval.
Most companies want to have some sort of internal expert that knows all the internal documents, so that any employee can ask any question. We have one at Meta; it's called Metamate. It's really cool and very useful. Yeah, and I'm not suggesting that modern AI, modern generative AI, is not useful. I'm simply pointing out that a lot of money has been invested in the expectation that this stuff will effectively achieve God-level capabilities.
And we're both talking about how there are potentially diminishing returns here, and what happens if there's the timeline mismatch you mentioned. This is the last question I'll ask about it, because we have so much else to cover, but I feel like timeline mismatches might be personal to you. You and I first spoke nine years ago, which is crazy now.
Nine years ago. And you told me about how, in the early days, you and your colleagues had an idea for how AI should be structured and couldn't even get a seat at the conferences, and then eventually, when the right amount of compute came around, those ideas started working, and the entire AI field took off based on the ideas you worked on with Bengio and Hinton, and a bunch of others.
And many others, but for the sake of efficiency we'll say go look it up. Just talking about those mismatched timelines: when there have been overhyped moments in the AI field, maybe like the expert systems you were just talking about, and they don't pan out the way people expect, the field goes into what's called an AI winter. Well, there's a backlash. Yeah, correct. So if we are potentially approaching this moment of mismatched timelines, do you fear there could be another winter now, given the amount of investment and the fact that there are potentially diminishing returns with the main way of training these things?
And maybe we'll add in the fact that the stock market looks like it's going through a bit of a downturn right now. That's a variable, probably the third most important variable in what we're talking about, but it has to factor in. So, yeah, I think there's certainly a question of timing there. But if we dig a little deeper: as I said before, if you think we're going to get to human-level AI by just training on more data and scaling up LLMs, you're making a mistake.
So if you're an investor and you invested in a company that told you we're going to get to human-level, PhD-level AI by just training on more data with a few tricks, I don't know if you're going to lose your shirt, but that was probably not a good idea. However, there are ideas about how to go forward and build systems that are capable of doing what every intelligent animal and human can do, and that current AI systems are not capable of doing.
I'm talking about understanding the physical world, having persistent memory, and being able to reason and plan. Those are the four characteristics that need to be there, and they require systems that can acquire common sense, that can learn from natural sensors like video, as opposed to just text, just human-produced data. That's a big challenge. I've been talking about this for many years now, saying this is where the challenge is; this is what we have to figure out.
And my group and I, people working with me, and others who have listened to me, are making progress along this line: systems that can be trained on video to understand how the world works, and systems that can use mental models of how the physical world works to plan sequences of actions to arrive at a particular goal. We have early results with this kind of system, and there are people at DeepMind and at various universities working on similar things. So the question is: when is this going to go from interesting research papers demonstrating a new capability with a new architecture, to architectures at scale that are practical for a lot of applications and can find solutions to new problems without being trained on them? It's not going to happen within the next three years, but it may happen in three to five years, something like that, and that roughly corresponds to the ramp-up we're seeing in investment now.
So that's the first thing. Now, the second thing that's important is that there's not going to be one secret magic bullet that one company or one group of people invents that just solves the problem. It's going to be a lot of different ideas, a lot of effort, and some principles to base this on, principles that some people may not subscribe to, and they will go in a direction that turns out to be a dead end.
So there's not going to be a day before which there is no AGI and after which we have AGI. That isn't going to happen. It's going to be continuous: conceptual ideas that, as time goes by, are made bigger, scaled up, and made to work better. And it's not going to come from a single entity; it's going to come from the entire research community across the world, and the people who share their research are going to move faster than the ones who don't.
And so if you think there is some startup somewhere with five people who have discovered the secret of AGI and you should invest five billion in them, you're making a huge mistake. You know, Yann, I always enjoy our conversations because we start to get some real answers, and I'm always looking back at those conversations, saying, okay, this is what Yann says; this is what everybody else is saying.
I'm pretty sure this is the grounding point, and it's usually been correct; I know that's going to be the case with this one as well. And now you've set me up for two interesting threads that we're going to pull on as we go on with our conversation: first, the understanding of physics and the real world, and second, open source. We'll do that when we come back, right after this.
And we're back here with Yann LeCun. He is the chief AI scientist at Meta and a Turing Award winner, and we're thrilled to have him on the show, luckily, for the third time. I want to talk to you about physics, Yann, because there's this sort of famous moment in Big Technology Podcast history, and I say famous among our listeners; I don't know if it really extended beyond them. You had me write to ChatGPT: "If I hold a piece of paper horizontally with both hands and let go of the paper with my left hand, what will happen?" I wrote it, and it convincingly walked through the physics and said the paper would float toward my left hand.
And I read it out loud, convinced, and you said: that thing just hallucinated, and you believed it. That is what happened. So listen, it's been two years, and I put the test to ChatGPT today. It says: when you let go of the paper with your left hand, gravity will cause the left side of the paper to drop, while the right side, still held up by your right hand, remains in place. This creates a pivot effect where the paper rotates around the point where your right hand is holding it.
So now it gets it right; it learned the lesson. You know, it's quite possible that someone hired by OpenAI to fix problems like this was fed that question and the answer, and the system was fine-tuned with the answer. Obviously, you can imagine an infinite number of such questions, and this is where the so-called post-training of LLMs becomes expensive: how much coverage of all those styles of questions do you need for the system to cover 90% or 95%, or whatever percentage, of all the questions people may ask? There's a long tail, and there's no way you can train the system to answer all possible questions, because there is an essentially infinite number of them.
And there are way more questions the system cannot answer than questions it can. You cannot cover the set of all possible questions in the training set. Right, because I think in our conversation last time you said that because actions like what happens to the paper when you let go of it with one hand haven't been covered widely in text, the model won't really know how to handle them; unless something has been covered in text, the model won't have that inherent understanding of the real world.
And I've kind of gone with that for a while. Then I said, you know what, let's try to generate some AI videos. And one of the interesting things I've seen with the AI videos is that there is some understanding of how the physical world works there. In our first meeting nine years ago, you said one of the hardest things to do is ask an AI what happens if you hold a pen vertically on a table and let go: will it fall?
And there's an unbelievable number of permutations that can occur, and it's very, very difficult for the AI to figure that out because it just doesn't inherently understand physics. But now you go to something like Sora and you say, show me a video of a man sitting on a chair kicking his legs, and you can get that video: the person sits on the chair and kicks their legs, and the legs don't fall out of their sockets; they bend at the joints, and they don't have three legs. So wouldn't that suggest an improvement in the capabilities of these large models?
No. Why? Because you still have videos produced by those video generation systems where you spill a glass of wine and the wine floats in the air, or flies off, or disappears, or whatever. Of course, for every specific situation you can always collect more data for that situation and then train your model to handle it, but that's not really understanding the underlying reality.
It's just compensating for the lack of understanding with increasingly large amounts of data. Children understand simple concepts like gravity with a surprisingly small amount of data. In fact, there is an interesting calculation you can do, which I've talked about publicly before. A typical LLM is trained on 30 trillion tokens, something like that, which is 3×10^13 tokens. A token is about three bytes, so that's about 0.9×10^14 bytes; let's say 10^14 bytes to round up.
That text would take any of us on the order of 400,000 years to read, at 12 hours a day. Now, a four-year-old has been awake a total of about 16,000 hours. You can multiply by 3,600 to get the number of seconds, and then put a number on how much data has gotten into the visual cortex through the optic nerve. Each optic nerve, and we have two of them, carries about one megabyte per second, roughly, so that's 2 megabytes per second, times 3,600, times 16,000, and that's just about 10^14 bytes.
So in four years, a child has seen, through vision, or touch for that matter, as much data as the biggest LLMs. That tells you clearly that we're not going to get to human level by just training on text; it's just not a rich enough source of information. And by the way, 16,000 hours is not that much video; it's about 30 minutes of YouTube uploads, so we can get that pretty easily. And in nine months, a baby has seen, let's say, 10^13 bytes, which again is not much.
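The back-of-the-envelope comparison can be spelled out directly. The reading speed and the tokens-to-words ratio below are rough assumptions added just for illustration; the token count, bytes per token, waking hours, and optic-nerve bandwidth are the figures cited above.

```python
# LLM side: ~3e13 tokens, ~3 bytes per token.
llm_tokens = 3e13
llm_bytes = llm_tokens * 3                # ~0.9e14, call it ~1e14 bytes

# How long to read it: assume ~0.75 words per token and ~250 words/minute,
# reading 12 hours a day (both assumptions, just to get an order of magnitude).
words = llm_tokens * 0.75
years_to_read = words / 250 / 60 / 12 / 365
print(f"LLM text: ~{llm_bytes:.1e} bytes, ~{years_to_read:,.0f} years to read")

# Child side: ~16,000 waking hours by age four, ~2 MB/s through the two optic nerves.
child_bytes = 16_000 * 3_600 * 2e6        # ~1.2e14 bytes
print(f"Four-year-old's visual input: ~{child_bytes:.1e} bytes")
```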
And in that time, a baby has learned basically all of the intuitive physics that we know about: gravity, conservation of momentum, the fact that objects don't spontaneously disappear, the fact that they still exist even when you hide them. There's all kinds of very basic stuff that we learn about the world in the first few months of life, and this is what we need to reproduce with machines.
This type of learning, figuring out what is possible and impossible in the world and what will result from an action you take, so that you can plan a sequence of actions to arrive at a particular goal, that's the idea of a world model. And connected to this is the question about video generation systems: is the right way to approach this problem to train better and better video generation systems? My answer is absolutely no.
The problem of understanding the world does not go through generating video at the pixel level. If I take this cup of water and spill it, I cannot entirely predict the exact path the water will follow on the table, what shape it's going to take, or what noise it's going to make.
But at a certain level of abstraction, I can make a prediction: the water will spill, and it will probably make my phone wet. I can't predict all the details, but I can predict at some level of abstraction, and I think that's really a critical concept. If you want a system to be able to learn to comprehend the world and understand how it works, it needs to be able to learn an abstract representation of the world that allows it to make those predictions.
And what that means is that those architectures will not be generative. I want to get to your solution here in a moment, but what would a conversation between us be without a demo? So I want to show you, and I'm going to put this on the screen when we do the video, a video I'm pretty proud of: I got this guy sitting on a chair kicking his legs out, and the legs stay attached to his body, and I was like, all right, this stuff is making real progress.
And then I said, can I get a car going into a haystack? And so there are two bales of hay, and then a haystack magically emerges from the hood of a stationary car, and I just said to myself, okay, Yann wins again. It's a nice car, though. Yeah, the thing is, those systems have been fine-tuned with a huge amount of data about humans, because that's what people ask for in most videos.
So there is a lot of data of humans doing various things to train those systems; that's why it works for humans but not for a situation the people training the system hadn't anticipated. So you said the model can't be generative if it's going to understand the real world. You're working on something called V-JEPA, right? The V is for video, and you also have JEPA for images. Right, we have JEPAs for all kinds of stuff, text as well. So explain how that will solve the problem of allowing a machine to abstractly represent what's going on in the real world.
Okay. What has made the success of AI, particularly natural language understanding and chatbots in the last few years, but also to some extent computer vision, is self-supervised learning. So what is self-supervised learning? Take an input, be it an image, a video, or a piece of text, corrupt it in some way, and train a big neural net to reconstruct it: basically, to recover the uncorrupted or undistorted version of it, or a transformed version that would result from taking an action.
In the context of text, for example, that means: take a piece of text, remove some of the words, and train a big neural net to predict the words that are missing. Take an image, remove some pieces of it, and train a big neural net to recover the full image. Take a video, remove a piece of it, and train the net to predict what's missing.
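Here is a minimal numpy sketch of that corrupt-then-reconstruct recipe. Everything in it is a toy assumption: the "data" is a random low-dimensional signal, the "neural net" is a single linear map, and the corruption is random masking. The point is simply that no human labels are involved, only the data itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with hidden low-dimensional structure, so masked entries really are
# predictable from the visible ones (a stand-in for the redundancy of text/images).
D, K = 16, 3
A = rng.normal(size=(D, K))

def sample():
    return A @ rng.normal(size=K)            # an unlabeled example

def corrupt(x, mask_frac=0.5):
    mask = rng.random(D) < mask_frac
    xc = x.copy()
    xc[mask] = 0.0                           # remove ("mask") part of the input
    return xc, mask

W = np.zeros((D, D))                         # the "reconstructor" being trained
lr = 0.01
for _ in range(5000):
    x = sample()
    xc, mask = corrupt(x)
    err = (W @ xc) - x                       # reconstruction error
    W -= lr * np.outer(err * mask, xc)       # gradient step on masked positions only

# The trained model fills in masked entries noticeably better than the trivial
# "predict zero" baseline, whose error is roughly the signal's variance.
x = sample()
xc, mask = corrupt(x)
print("masked-entry squared error:", float(np.mean(((W @ xc) - x)[mask] ** 2)))
print("signal variance (baseline):", float(np.var(x)))
```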
LLMs are a special case of this, where you take a text and train the system to just reproduce it, and you don't need to corrupt the text, because the system is designed in such a way that to predict one particular word or token it can only look at the tokens to the left of it. In effect, the system has hardwired into its architecture the fact that it cannot look at the present or the future to predict the present; it can only look at the past.
So basically you train that system to reproduce its input on its output. This kind of architecture is called a causal architecture, and this is what an LLM, a large language model, is; it's what all the chatbots in the world are based on. Take a piece of text, train the system to reproduce that text on its output, and to predict a particular word it can only look at the words to the left of it. So now you have a system that, given a piece of text, can predict the word that follows that text.
And you can take the word that was predicted, shift it into the input, then predict the second word, shift that into the input, predict the third word; that's called autoregressive prediction. It's not a new concept; it's very old. So self-supervised learning does not train a system to accomplish a particular task other than capturing the internal structure of the data, and it doesn't require any labeling by humans.
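A stripped-down illustration of that shift-and-predict loop, with a tiny bigram table standing in for the neural net (an assumption made purely to keep the sketch self-contained; a real LLM conditions on the whole left context, not just the last token):

```python
import random

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# "Training": record which token followed which in the text.
nxt: dict[str, list[str]] = {}
for a, b in zip(corpus, corpus[1:]):
    nxt.setdefault(a, []).append(b)

def generate(prompt: str, n_tokens: int = 8) -> str:
    tokens = prompt.split()
    for _ in range(n_tokens):
        candidates = nxt.get(tokens[-1])
        if not candidates:                        # no continuation ever observed
            break
        tokens.append(random.choice(candidates))  # predict, then shift into the input
    return " ".join(tokens)

print(generate("the"))   # e.g. "the cat sat on the dog sat on the"
```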
Now apply this to images: take an image, mask a chunk of it, a bunch of patches if you want, and train a big neural net to reconstruct what is missing, and then use the internal representation of the image learned by the system as input to a subsequent downstream task: image recognition, segmentation, whatever it is.
It works to some extent, but not great. There was a big project to do this at FAIR called MAE, masked autoencoder; it's a special case of a denoising autoencoder, which itself is the sort of general framework from which this idea of self-supervised learning derives. It doesn't work so well. And you can apply the same thing to video.
I've been working on this for almost 20 years: take a video, show just a piece of it, and then train the system to predict what's going to happen next in the video. It's the same idea as for text, but for video. And that doesn't work very well either. Why does it work for text and not for video?
The answer is that it's easy to predict a word that comes after a text. You cannot exactly predict which word follows a particular text, but you can produce something like a probability distribution over all the possible words in your dictionary. There are only about 100,000 possible tokens, so you just produce a big vector with 100,000 numbers that are positive and sum to one. Now, what are you going to do to represent a probability distribution over all possible frames of a video, or all possible missing parts of an image?
We don't know how to do this properly; in fact, it's mathematically intractable to represent distributions in high-dimensional continuous spaces, at least not in a useful way. I've tried to do this for video for a long time, and that is the reason those ideas of self-supervised learning using generative models have failed so far, and why trying to train a video generation system as a way to get a system to understand how the world works can't succeed.
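The contrast is easy to state in code. For a finite vocabulary you can write down an exact, normalized distribution with a softmax over one score per token; for a single video frame, a continuous object with hundreds of thousands of dimensions, there is no comparably tractable general recipe. The numbers below are illustrative assumptions.

```python
import numpy as np

# Discrete case: one score per token, normalized with a softmax.
vocab_size = 100_000
logits = np.random.randn(vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # exact normalization: sums to 1
print(probs.sum(), probs.size)

# Continuous case: even a modest 256x256 RGB frame lives in a ~196,608-dimensional
# continuous space, and the prediction target is a whole distribution over that
# space. There is no general, tractable way to normalize such a distribution,
# which is the difficulty described above for generative video prediction.
print(256 * 256 * 3)
```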
So what's the alternative? The alternative is something that is not a generative architecture, which we call JEPA: joint embedding predictive architecture. And we know this works much better than attempting to reconstruct. We've had experimental results on learning good representations of images going back many years, where instead of taking an image, corrupting it, and attempting to reconstruct the image,
we take the original full image and the corrupted version, run them both through neural nets that produce representations of the two images, the initial one and the corrupted one, and we train another neural net, a predictor, to predict the representation of the full image from that of the corrupted one. If you successfully train a system of this type, it is not trained to reconstruct anything; it's trained to learn a representation so that you can make predictions within the representation space.
And you have to make sure the representation contains as much information as possible about the input, which is the difficult part of training those systems. So that's called a JEPA, a joint embedding predictive architecture. To train a system to learn good representations of images, those joint embedding architectures work much better than the generative ones that are trained by reconstruction.
And now we have a version that works for video too. We take a video, corrupt it by masking a big chunk of it, run the full video and the corrupted one through identical encoders, and simultaneously train a predictor to predict the representation of the full video from the partial one. The representation the system learns of videos can then be fed to another system whose job is to tell you, for example, what action is taking place in the video, or whether the video is possible or impossible, things like that.
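Here is a minimal numpy sketch of that joint-embedding training step. Everything in it is a toy assumption: a fixed random linear "encoder", masking the second half of a low-dimensional signal as the corruption, and a linear predictor trained with plain SGD. Real JEPAs also train the encoders and need extra machinery to keep the representations from collapsing; all of that is omitted here. The point is only where the loss lives: in representation space, not pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, R = 32, 4, 8
A = rng.normal(size=(D, K))                  # hidden structure of the "videos"
Enc = rng.normal(scale=0.3, size=(R, D))     # toy encoder, kept fixed here
Pred = np.zeros((R, R))                      # the predictor being trained

def sample():
    return A @ rng.normal(size=K)            # a structured, unlabeled example

def corrupt(x):
    xc = x.copy()
    xc[D // 2:] = 0.0                        # mask out a big chunk of the input
    return xc

lr = 0.005
for _ in range(4000):
    x = sample()
    s_full = Enc @ x                         # representation of the full input
    s_corr = Enc @ corrupt(x)                # representation of the corrupted input
    err = Pred @ s_corr - s_full             # prediction error in latent space
    Pred -= lr * np.outer(err, s_corr)       # squared-error gradient step

x = sample()
print("latent prediction error:",
      float(np.mean((Pred @ Enc @ corrupt(x) - Enc @ x) ** 2)))
print("target representation variance:", float(np.var(Enc @ x)))
```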
It actually works quite well. That's cool, so it gives that abstract thinking? Yeah, in a way. And we have experimental results showing that this joint embedding training works. We have several methods for doing this: one called DINO, another called VICReg, another called I-JEPA, which is a sort of distillation method. So there are several different ways to approach this, but one of them is going to lead to a recipe that gives us a general way of training those JEPA architectures.
So it's not generative, because the system is not trying to regenerate part of the input; it's trying to generate an abstract representation of the input. What that allows it to do is ignore all the details that are really not predictable. Take the pen you put vertically on the table: when you let it go, you cannot predict in which direction it's going to fall, but at some abstract level you can say that the pen is going to fall, without representing the direction.
So that's the idea of JEPA, and we're starting to have good results with these systems. The V-JEPA system, for example, is trained on lots of natural videos, and then you can show it a video that's impossible: a video where, for example, an object disappears or changes shape (you can generate this with a game engine), or a situation where a ball rolls behind a screen, the screen comes down, and the ball is not there anymore.
Things like this. And you measure the prediction error of the system. The system is trained to predict, not necessarily forward in time, but basically to predict the coherence of the video, so you measure the prediction error as you show the video to the system, and when something impossible occurs, the prediction error goes through the roof. So you can detect whether the system has integrated some idea of what is physically possible or not, just from being trained on physically possible natural videos.
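The evaluation idea is simple to sketch: run a predictor along the stream, track its error, and flag the frames where the error spikes. In the toy below, the "world model" is just a constant-velocity extrapolation of a ball's position, and the impossible event is the ball teleporting; a V-JEPA-style system does the analogous thing in a learned representation space.

```python
import numpy as np

# A ball moving at constant velocity, observed for 60 steps.
dt, velocity = 0.1, np.array([1.0, 0.5])
positions = [np.array([0.0, 0.0])]
for _ in range(60):
    positions.append(positions[-1] + velocity * dt)
positions[40] = np.array([9.0, -3.0])       # impossible event: the ball "teleports"

# Prediction error of a naive extrapolating predictor at each frame.
errors = []
for t in range(2, len(positions)):
    pred = positions[t - 1] + (positions[t - 1] - positions[t - 2])
    errors.append(float(np.linalg.norm(positions[t] - pred)))

# Normal frames have near-zero error; the teleport sends it through the roof.
threshold = 1.0
spikes = [t for t, e in enumerate(errors, start=2) if e > threshold]
print("surprise detected at frames:", spikes)   # frames around the teleport at 40
```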
That's really interesting; that's sort of the first hint that a system has acquired some common sense. Yes. We also have versions of those systems that are so-called action-conditioned. Basically, we have a chunk of video, or an image of the state of the world at time t, and then an action is taken, like a robot arm being moved, and then of course we can observe the result of this action.
So now, when we train a JEPA with this, the model can basically say: here is the state of the world at time t, here is an action you might take, and I can predict the state of the world at time t+1 in this abstract representation space. That's learning how the world works, and the cool thing about it is that you can now have the system imagine what the outcome of a sequence of actions would be.
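Here is a toy Python sketch of using such a model to plan: a transition function predicts the next state from the current state and an action, and we search over short action sequences for one whose imagined outcome lands on a goal. The grid-world dynamics and the brute-force search are assumptions made to keep the example tiny; a real system would do this over learned abstract representations with a much smarter search.

```python
import itertools
import numpy as np

ACTIONS = {
    "up": np.array([0, 1]), "down": np.array([0, -1]),
    "left": np.array([-1, 0]), "right": np.array([1, 0]),
    "stay": np.array([0, 0]),
}

def world_model(state, action):
    """Predict the next state given the current state and an action."""
    return state + ACTIONS[action]

def plan(start, goal, horizon=4):
    """Search over action sequences by imagining their outcomes with the model."""
    best_seq, best_dist = None, float("inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        state = start
        for a in seq:
            state = world_model(state, a)        # imagined rollout, no real actions
        dist = float(np.linalg.norm(state - goal))
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist

seq, dist = plan(np.array([0, 0]), np.array([2, -2]))
print(seq, dist)   # e.g. ('right', 'right', 'down', 'down') with distance 0.0
```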
And if you give it a goal, saying, I want the world to look like this at the end, can you figure out a sequence of actions to get me to that point? It can actually find, by search, a sequence of actions that will produce that result. That's planning; that's actual reasoning and actual planning. I have to get you out of here because we're over time, but can you give me 60 seconds on your reaction to DeepSeek, and whether open source has overtaken the proprietary models at this point?
And we've got to limit it to 60 seconds; otherwise I'm going to get killed by your team here. So, "overtaken" is a strong word. I think progress is faster in the open source world, that's for sure, but of course the proprietary shops profit from the progress of the open source world; they get access to that information like everybody else. What's clear is that there are many more interesting ideas coming out of the open source world than any single shop, as big as it may be, can come up with.
Nobody has a monopoly on good ideas, and the magic efficiency of the open source world is that it recruits talent from all over the world. What we've seen with DeepSeek is that if you set up a small team with a relatively long leash and few constraints to come up with the next generation of LLMs, they can actually come up with new ideas that nobody else had, and reinvent a little bit how you do things. And then, if they share that with the rest of the world, the entire world progresses.
It clearly shows that open source progresses faster, and that a lot more innovation can take place in the open source world, which the proprietary world may have a hard time catching up with. And it's cheaper to run. What we see from partners we talk to is that their clients, when they prototype something, may use a proprietary API, but when it comes time to actually deploy the product, they use Llama or other open source engines, because it's cheaper, more secure, and more controllable.
You can run it on premises; there are all kinds of advantages. We've also seen a big evolution in the thinking of some people who were initially worried that open source efforts were going to, for example, help the Chinese, if you have some geopolitical reason to think it's a bad idea. But what DeepSeek has shown is that the Chinese don't need us.
They can come up with really good ideas. We all know there are really, really good scientists in China, and one thing that is not widely known is that the single most cited paper in all of science is a paper on deep learning from ten years ago, from 2015, and it came out of Beijing. Oh, okay. The paper is about ResNet, a particular type of neural net architecture where, by default, every stage in a deep learning system computes the identity function.
It just copies its input to its output, and what the neural net learns is the deviation from that identity. That allows you to train extremely deep neural nets with dozens of layers, perhaps 100 layers. The first author of that paper is a gentleman called Kaiming He; at the time, he was working at Microsoft Research in Beijing.
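The residual trick he is describing can be shown in a few lines of numpy: each block outputs its input plus a small learned deviation, so a block with near-zero weights is simply the identity, and a very deep stack still passes the signal through cleanly. The layer sizes and weight scales below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W1, W2):
    """Output = input + F(input): the skip connection defaults to the identity."""
    h = np.maximum(0.0, W1 @ x)          # small two-layer transformation F(x)
    return x + W2 @ h

D, depth = 64, 100                       # a 100-layer stack, ResNet-era depth
x = rng.normal(size=D)
out = x
for _ in range(depth):
    W1 = rng.normal(scale=0.01, size=(D, D))
    W2 = rng.normal(scale=0.01, size=(D, D))
    out = residual_block(out, W1, W2)

# With near-zero F, the 100-layer output stays close to the input instead of
# exploding or vanishing; the signal always has a clean path through the skips.
print(float(np.linalg.norm(out - x) / np.linalg.norm(x)))
```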
Soon after the publication of that paper, he joined FAIR in California; I hired him. He worked at FAIR for eight years or so, recently left, and is now a professor at MIT. So there are really, really good scientists everywhere around the world. Nobody has a monopoly on good ideas, and Silicon Valley certainly does not.
Another example: the first LLM actually came out of Paris; it came out of the FAIR lab in Paris, from a small team of 12 people. So you have to take advantage of the diversity of ideas, backgrounds, and creative juices of the entire world if you want science and technology to progress fast, and that's enabled by open source.
Yann, it is always great to speak with you. I appreciate it; this is, I think, our fourth or fifth time speaking, going back nine years. You always help me see through all the hype and the buzz and actually figure out what's happening, and I'm sure that's going to be the case for our listeners and viewers as well. Thank you so much for coming on; I hope we do it again soon.
Thank you, everybody. Thank you for watching. We'll be back on Friday to break down the week's news; until then, we'll see you next time on Big Technology Podcast.