*Meta (the company behind Facebook and Instagram) has been designated an extremist organization and is banned in the Russian Federation.
As artificial intelligence continues to reshape the world, the intersection of deep learning and high-performance computing is becoming ever more important. This conversation brings together Yann LeCun, a pioneer of deep learning and Chief AI Scientist at Meta, and Bill Dally, a leading computer architect and Chief Scientist at NVIDIA, to explore the future of AI models, hardware accelerators, and the evolution of the computing landscape.
The discussion covers:
- Future breakthroughs in deep learning and AI architectures
- How hardware innovation drives AI efficiency and scalability
- The challenges of training large-scale models and running AI inference in real time
**Direct link to the video: https://www.youtube.com/watch?v=KspcdjQJZN0
***The video recap below was generated by a neural network: https://300.ya.ru/
Video recap
Greeting and start of the conversation
- Bill Dally and Yann LeCun discuss artificial intelligence.
- Yann LeCun shares his thoughts on recent developments in AI.
Shifting interests in AI
- Yann LeCun is no longer particularly interested in LLMs.
- He believes the more important questions are understanding the physical world, persistent memory, reasoning, and planning.
World models and why they matter
- World models are what let people manipulate thoughts and interact with the real world.
- AI systems need world models to operate in the real world.
Problems with tokens and video
- Tokens are discrete and poorly suited to representing the continuous physical world.
- Training to predict video at the pixel level does not work well.
Joint Embedding Predictive Architecture (JEPA)
- JEPA learns an abstract representation of the data and makes predictions in that space.
- This lets a system predict the next state of the world from the current state and an action.
Criticism of existing approaches
- Today's agentic reasoning systems generate many random token sequences and pick the best one.
- This approach is inefficient for complex tasks.
The future of artificial intelligence
- Yann LeCun prefers the term AMI (advanced machine intelligence) over AGI.
- He believes systems capable of abstract mental models and reasoning could be built within three to five years.
- Reaching human-level intelligence, however, will take more time and effort.
Limits of LLMs and general intelligence
- Scaling up LLMs will not lead to human-level intelligence anytime soon.
- General intelligence might be reached within a decade or so.
Applications of artificial intelligence
- AI has improved people's lives and made their work easier.
- AI's impact on science and medicine will be significant.
- AI is used in medical imaging and in driver-assistance systems.
Limitations and deployment challenges
- Deploying systems with the required accuracy and reliability is harder than expected.
- Autonomous driving demands near-perfect accuracy.
- In some applications, high but imperfect accuracy is good enough.
AI's role in productivity and creativity
- AI helps people be more productive and more creative.
- AI does not replace people; it gives them power tools.
- People will be the bosses of superintelligent systems.
Dangers and remedies
- AI can be used to create deepfakes and fake news.
- Better AI, with common sense and the ability to check its own answers, will help address these problems.
Future innovation in AI
- Innovation can come from anywhere; smart people are everywhere.
- Good ideas come out of interaction and the exchange of ideas.
- China has many excellent scientists, for example Kaiming He, lead author of the ResNet paper.
The move to MIT
- There are good scientists all over the world.
- Ideas can come from anywhere, but putting them into practice takes infrastructure and money.
- An open intellectual community speeds up progress.
Innovative ideas and freedom
- Innovation requires letting people work without a rigid schedule.
- The Llama example: a project started by a small group in Paris became a success thanks to that freedom.
Open source and its advantages
- Llama became innovative thanks to being open source.
- Open source attracts more smart people and enables a diverse range of products.
- Open source matters for the startup ecosystem.
The need for diversity in AI
- In the future, every interaction with the digital world will be mediated by AI systems.
- We need diverse assistants that speak every language and understand different cultures.
- Open source makes it possible to build such assistants.
The future of foundation models
- Foundation models will be open source and trained in a distributed fashion.
- Proprietary platforms will eventually disappear.
The trade-off between training time and inference time
- Jensen used an agentic LLM to plan a wedding, which illustrates the trade-off between training-time and inference-time compute.
- The optimal point depends on the task and the available resources.
New architectures for abstract reasoning
- Current models reason through language, which is not always effective.
- The challenge for the coming years is designing new architectures for abstract reasoning.
JEPA and its future
- JEPA (joint embedding predictive architecture) models are world models that can manipulate abstract representations and work toward goals.
- LeCun believes JEPA is the future of artificial intelligence.
- Running such models will require powerful hardware.
Compute and scalability
- Developing JEPA and other powerful models will take all the compute that can be had.
- Reasoning in an abstract space requires significant resources.
System one and system two
- System one covers automatic actions performed without deliberate thought.
- System two covers deliberate actions that require planning and the use of a world model.
- Current systems cannot accomplish new tasks zero-shot, without training.
The physical world and language
- The physical world is harder than language, which is discrete and noise-resistant.
- Training on text alone is not enough to reach AGI.
Neuromorphic hardware
- Neuromorphic hardware may complement GPUs, but it will not replace them anytime soon.
- Analog neural networks have advantages, but they make hardware reuse difficult.
- The brain uses digital spikes for communication between neurons.
Processing-in-memory technologies
- Processing-in-memory (PIM) technologies could be useful for building low-power smart devices.
- Processing data directly on the sensor can reduce energy consumption.
A promising direction
- Biology shows that the retina and its neurons process visual information efficiently.
- Compression and feature extraction pull out the useful information.
New technologies
- Superconducting logic might be promising, but the speaker does not know enough about it to say.
- Optical implementations of neural networks did not pan out in the 1980s.
- Quantum computing is viewed with skepticism; its applications are limited to simulating quantum systems.
Learning from observation
- Building AI that learns like a baby animal will require significant resources.
- The MAE experiment showed that autoencoders can learn from images, but it is expensive and not as effective as joint-embedding approaches.
- The V-JEPA project works better on video, predicting representations and judging whether a scene is physically plausible.
Bottlenecks and recipes
- Successful model training requires good recipes and engineering tricks.
- The ResNet idea made it possible to train deep neural networks with 100 layers.
- In NLP, systems based on denoising autoencoders were displaced by the GPT-style architecture.
Project success and scaling
- That recipe proved successful and scaled well.
- A similar recipe now needs to be developed for scalable JEPA architectures.
The future of AI
- Progress in AI will continue and will require contributions from everyone.
- Advanced machine intelligence will not appear suddenly from a single organization.
- It will be a gradual process, not a single event.
Open research and affordable hardware
- Research should be open and built on open-source platforms.
- Cheaper hardware is needed so that more people can participate.
AI's role in everyday life
- The future will hold a wide diversity of AI assistants.
- AI will help in everyday life through smart glasses and other devices.
- People will become managers of AI, which can be both good and bad.
Closing the conversation
- Thanks for an intellectually stimulating conversation.
- Hope to have the chance to do it again in the future.
Video transcript
0:00
Please welcome Bill Dally and Yann LeCun.
0:17
Hello, everybody. We’re just going to have a little chat about AI things. Hopefully, you’ll find it interesting.
0:24
So, Yann, there’s been a lot of interesting things going on in the last year in AI. What are you doing?
0:29
What has been the most exciting development, in your opinion, over the past year? Too many to count, but I’ll tell you one thing which may surprise a few of you.
0:40
I’m not so interested in LLMs anymore. You know, they’re kind of the last thing that are in the hands of, you know, industry product people kind of, you know, improving at the margin, trying to get, you know, more data, more compute, generating synthetic data.
1:00
I think there are more interesting questions in four things.
1:06
How do you get machines to understand the physical world? And Jensen talked about this this morning in this keynote.
1:12
How do you get them to have persistent memory, which not too many people talk about?
1:17
And then the last two are, how do you get them to reason and plan? And there is some effort, of course, to get, you know, LLMs to reason.
1:25
But in my opinion, it’s a very kind of simplistic way of viewing reasoning.
1:31
I think there are probably kind of more, you know, better way of doing this.
1:36
So I’m excited about things that a lot of people in this community, in the tech community, might get excited about five years from now.
1:46
But right now it doesn’t look so exciting because it’s some obscure academic paper. But if it’s not an LLM that’s reasoning about the physical world and having persistent memory and planning, what is it?
1:57
What is the underlying model going to be? So a lot of people are working on world models, right?
2:02
So what is a world model? We all have world models in our mind.
2:08
This is what allows us to kind of, you know, manipulate thoughts, essentially.
2:14
So, you know, we have a model of the current world. You know that if I push on this bottle here from the top, it’s probably going to flip.
2:22
But if I push on it at the bottom, it’s going to slide. And, you know, if I press on it too hard, it might pop.
2:30
So we have models of the physical world that we acquire in the first few months of life. And that’s what allows us to deal with the real world.
2:38
And it’s much more difficult to deal with the real world than to deal with language. And so the type of architectures that I think we need for systems that really can deal with the real world is completely different from the ones that we deal with at the moment.
2:52
Right? LLMs predict tokens. Right, but tokens could be anything. I mean, so our autonomous vehicle model uses tokens, tokens from the sensors, and it produces tokens that drive.
3:01
And in some sense, it’s reasoning about the physical world, at least where it’s safe to drive and you won’t run into poles.
3:07
Why aren’t tokens the right way to represent the physical world? Tokens are discrete, okay? So when we talk about tokens, generally we talk about a finite set of possibilities.
3:18
In a typical LLM, the number of possible tokens is on the order of 100,000 or something like that, right?
3:24
So when you train a system to predict tokens, you can never train it to predict the exact token that’s going to follow a sequence in text, for example.
3:34
You can produce a probability distribution of all the possible tokens in your dictionary.
3:40
It’s just a long vector of 400,000 numbers between 0 and 1 that sum to 1. We know how to do this.
3:46
We don’t know how to do this with video, with natural data that is high-dimensional and continuous.
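A minimal sketch of what "a probability distribution over a discrete token vocabulary" looks like in code; the vocabulary size and scores below are made up for illustration, and this is just the standard softmax normalization.

```python
import numpy as np

def next_token_distribution(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores over a finite token vocabulary into a probability
    distribution: every entry in [0, 1], and all entries summing to 1."""
    z = logits - logits.max()       # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative only: a toy "vocabulary" of 5 tokens instead of ~100,000.
scores = np.array([2.0, 0.5, -1.0, 0.1, 1.2])
probs = next_token_distribution(scores)
print(probs, probs.sum())           # five numbers between 0 and 1, summing to 1.0
```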
3:52
And every attempt at trying to get systems to understand the world or build mental models of the world by being trained to predict videos at the pixel level, basically has failed.
4:04
Even to train a system like a neural net of some kind to learn good representations of images, every technique that works by reconstructing an image from a corrupted or transformed version of it basically has failed.
4:18
Not completely failed, it kind of works, but it doesn’t work as well as alternative architectures that we call joint embedding, which essentially don’t attempt to reconstruct at the pixel level.
4:30
They try to learn a representation, an abstract representation of the image or the video or the natural signal that is being trained on so that you can make predictions in that abstract representation space.
4:44
The example I use very often is that if I take a video of this room and I kind of pan the camera and I stop here and I ask the system to predict what’s the continuation of that video, it’s probably going to predict there’s a room and there’s people sitting, blah, blah, blah.
5:02
There’s no way it can predict what every single one of you looks like. That’s completely unpredictable from the initial segment of the video.
5:09
And so there’s a lot of things in the world that are just not predictable and if you train a system to predict at the pixel level, it spends all of its resources trying to come up with details that it just cannot invent.
5:20
And so that’s just a complete waste of resources. And every attempt that we’ve tried, and I’ve been working on this for 20 years, of training a system using self-supervised learning by predicting video, doesn’t work.
5:32
It only works if you do it at a representation level. And what that means is that those architectures are not generative.
5:38
You’re basically saying that a transformer doesn’t have the capability of… No. …vision transformers, and they get good results.
5:45
That’s not what I’m saying, because you can use transformers for that. You can put transformers in those architectures.
5:52
It’s just that the type of architecture I’m talking about is called JEPA, Joint Embedding Predictive Architecture.
5:57
So take a chunk of video, or an image, or whatever it is, even text, run it through an encoder, you get a representation.
6:04
Then take the continuation of that text, video, or transformed version of the image, run it through an encoder as well, and now try to make a prediction in that representation space instead of making it in the input space.
6:18
You can use the same training methodology where you fill in the blanks, but you’re doing it at this latent space rather than down in the raw representation.
6:25
Exactly. And the difficulty there is that if you’re not careful, if you don’t use smart techniques to do this, the system will collapse.
6:33
Basically, it will completely ignore the input and just produce a representation that is constant.
6:39
It’s not very informative of any input. Until five, six years ago, we didn’t have any technique to prevent this from happening.
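A rough sketch in Python (PyTorch-style) of the training step being described: encode both the observed chunk and its continuation, then predict in representation space rather than in pixel space. The sizes, the plain MSE loss, and the frozen target here are illustrative assumptions rather than the actual JEPA recipe; in particular, a real system needs an explicit mechanism against the collapse just mentioned, such as variance/covariance regularization or an EMA target encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IN, D_REP = 512, 128   # illustrative feature / representation sizes

encoder = nn.Sequential(nn.Linear(D_IN, 256), nn.ReLU(), nn.Linear(256, D_REP))
predictor = nn.Sequential(nn.Linear(D_REP, 256), nn.ReLU(), nn.Linear(256, D_REP))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def jepa_style_step(x, y):
    """x: observed chunk, y: its continuation (both already featurized).
    Predict the *representation* of y from the representation of x."""
    s_x = encoder(x)
    with torch.no_grad():            # naive frozen target; real recipes use e.g.
        s_y = encoder(y)             # an EMA target encoder to avoid collapse
    loss = F.mse_loss(predictor(s_x), s_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy tensors standing in for encoded video/text chunks
x, y = torch.randn(32, D_IN), torch.randn(32, D_IN)
print(jepa_style_step(x, y))
```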
6:46
If you want to use this for an agentic system or a system that can reason and plan, then what you need is this predictor which, when it observes a piece of video, it gets some idea of the state of the world, the current state of the world.
7:00
And what it needs to do is being able to predict what is going to be the next state of the world given that I might take an action that I’m imagining taking.
7:10
So what you need is a predictor that, given the state of the world and an action you imagine, can predict the next state of the world.
7:18
And if you have such a system, then you can plan a sequence of actions to arrive at a particular outcome.
7:24
And that’s the real way that all of us do planning and reasoning. We don’t do it in token space.
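One deliberately naive way to use such a predictor for planning is to search over candidate action sequences in representation space and keep the one whose predicted end state is closest to the goal. The talk does not prescribe a particular planner, and `world_model` and `cost` below are hypothetical stand-ins for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Hypothetical learned predictor: next state given current state and action.
    A toy linear dynamics stands in for the real, learned model."""
    return 0.9 * state + 0.5 * action

def cost(state: np.ndarray, goal: np.ndarray) -> float:
    """Distance between a predicted state and the desired outcome."""
    return float(np.sum((state - goal) ** 2))

def plan(state, goal, horizon=5, n_candidates=256, action_dim=2):
    """Sample candidate action sequences, roll each one through the world model,
    and return the sequence whose predicted final state is closest to the goal."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = rng.normal(size=(horizon, action_dim))
        s = state
        for a in seq:
            s = world_model(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost

actions, final_cost = plan(np.zeros(2), np.array([1.0, -1.0]))
print("predicted cost of best action sequence:", final_cost)
```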
7:30
So let me take a very simple example. There’s a lot of so-called agentic reasoning systems today.
7:35
And the way they work is that they generate lots and lots and lots of sequences of tokens using sort of different ways of generating different tokens stochastically.
7:45
And then there is a second neural net that tries to select the best sequence out of all of the ones that are generated.
7:52
It’s sort of like, you know, writing a program without knowing how to write a program.
7:58
You write kind of a random program and then you test them all and then you keep the one that actually gives you the right answer.
8:05
I mean, it’s completely hopeless. Well, there’s actually papers about super optimization that suggest doing exactly that.
8:11
Right. For short programs. For short programs you can, of course, but it goes exponentially with the length.
8:16
So it’s completely hopeless. So many people are saying that AGI, or I guess you would call it AMI, is just around the corner. What’s your view?
8:27
When do you think it will be here? Why? What are the gaps? Yeah.
8:32
I don’t like the term AGI because people use the term to designate systems that have human level intelligence and the sad thing is that human intelligence is super specialized.
8:44
So calling this general, I think, is a misnomer. So I prefer the phrase AMI, that we pronounce AMI.
8:50
That means advanced machine intelligence. Okay. It’s just vocabulary. I think this concept that I’m describing of systems that can learn abstract mental models of the world and use them for reasoning and planning, I think we’re probably going to have a good handle on getting this to work, at least at a small scale, within three years, three to five years.
9:14
And then it’s going to be a matter of scaling them up, et cetera, until we get to human level AI.
9:21
Now here’s the thing. Historically in AI, there’s generation after generation of AI researchers who have discovered a new paradigm and have claimed that’s it, within 10 years we’re going to have, or five years or whatever, we’re going to have human level intelligence, we’re going to have machines that are smarter than humans in all domains.
9:43
And that’s been the case for 70 years. And it’s been those waves every 10 years or so.
9:49
The current wave is also wrong. So the idea that you just need to scale up LLMs or have them generate thousands of sequences of tokens and select the good ones to get to human level intelligence, you’re going to have within a few years, two years, I think, for some predictions, a country of geniuses in a data center, to quote someone who will remain nameless, I think it’s nonsense.
10:15
It’s complete nonsense. I mean, sure, there are going to be a lot of applications for which systems in the near future are going to be PhD level, if you want.
10:25
But in terms of overall intelligence, no, we’re still very far from it. When I say very far, it might happen within a decade or so.
10:34
So it’s not that far. So AI has been applied in many ways that have improved the human condition, made people’s lives easier.
10:43
What application of AI do you see as being the most compelling, the most advantageous?
10:50
So I mean, there are obvious things, of course. I mean, I think the impact of AI on science and medicine is probably going to be a lot bigger than we can currently imagine, even though it’s already pretty big.
11:05
You know, not just in terms of research for things like protein folding and drug design and things like this, understanding the mechanisms of life, but also kind of short term, right?
11:17
I mean, very often now in the U.S., you go through a medical imaging process and there is AI involved, right?
11:25
If it’s a mammogram, it’s probably pre-screened with a deep learning system to detect tumors.
11:32
If you go to an MRI machine, the time you have to spend in that MRI machine is reduced by a factor of four or something, because nowadays we can sort of recover, restore the sort of high resolution versions of MRI images with less data.
11:48
So like a lot of short term consequences. Of course, every one of our cars, and NVIDIA is one of the big suppliers of this, but most cars now come out with at least a driving assistance system or automatic emergency braking system.
12:04
It’s actually been required equipment in Europe for a few years now. Those things reduce collisions by 40%.
12:11
I mean, they save lives. Those are like enormous applications, obviously.
12:16
And this is not generative AI, right? This is not LLMs. This is perception, essentially, and a little bit control, of course, for cars.
12:27
Now, obviously, there are a lot of applications of LLMs as they exist today, or how they will exist within a few years, in industry, in services, et cetera.
12:39
But we have to think about the limitations of this as well, that it’s more difficult than most people had thought to field and deploy systems with the level of accuracy and reliability that is expected.
12:54
That certainly has been the case for autonomous driving, right? It’s been kind of a receding horizon of when do we get level five autonomous driving.
13:05
And I think it’s going to be the same thing. It’s usually where AI fails.
13:10
It’s not like in the basic technique. It’s not in the flashy demos. It’s when you actually have to deploy it and apply it and make it reliable enough for this and integrate it with the existing system.
13:25
That’s where it becomes difficult and expensive and takes more time than expected.
13:31
Certainly, things like autonomous vehicles, where it has to be right all the time, where somebody could be injured or killed, the level of accuracy has to be almost perfect.
13:40
But there are many applications where if it just gets it right most of the time, it’s very beneficial.
13:45
Even some medical applications where a doctor is double-checking it, or certainly entertainment and things like that, education, where you just want to do more good than harm and the consequences of getting it wrong aren’t disastrous.
13:57
Absolutely. So, I mean, for most of those systems, really the most useful ones are the ones that make people more productive or more creative.
14:06
A coding assistant. Basically, it assists them, right? I mean, it’s true in medicine.
14:11
It’s true in art. It’s true in producing text. AI is not replacing people. It’s basically giving them power tools.
14:19
Well, it might at some point, but I don’t think people will go for this, right? Basically, our relationship with future AI systems, including super-intelligence, super-human systems, is that we’re going to be their boss.
14:31
We’re going to have a staff of super-intelligent virtual people kind of working for us. I don’t know about you, but I like working with people who are smarter than me.
14:40
Yeah. Me too. It’s the greatest thing in the world. Yeah. So, the flip side is, just as AI can benefit humanity in many ways, it also has a dark side where people will apply it to do things like create deepfakes and false news and it can sort of cause emotional distress if applied incorrectly.
14:59
What are your biggest concerns about the use of AI and how do we mitigate those?
15:04
Well, so one thing that certainly Meta has been very familiar with is using AI as a countermeasure against attacks, whether they are from AI or not.
15:17
So one thing that may be surprising is that despite the availability of LLMs and various deepfakes and stuff like that for a number of years now, our colleagues who are in charge of detecting and sort of taking down this kind of attack are telling us that we’re not seeing like a big increase in sort of generated content being posted on social networks, or at least not in a…
15:51
There is, of course, a lot of it, but it’s not posted in a nefarious way and usually is labeled as being synthetic.
16:01
And so we’re not seeing all the scenario, catastrophe scenarios that people were warning about three or four years ago of this is going to destroy all of the information and there’s an interesting story that I need to tell you, which is that in fall 2022, my colleagues at Meta, a small team, put together an LLM that was trained on the entire scientific literature, all the technical papers they could put their hands on.
16:39
It was called Galactica. And they put it up, available with a long paper that described how it was trained, open source code, and a demo system that you could just play with, right?
16:55
And this was doused with vitriol by the Twittersphere, essentially. So people saying, oh, this is horrible.
17:03
This is going to get us killed. It’s going to destroy the scientific communication system.
17:12
Now any idiot can write a scientific sounding paper on the benefits of eating crushed glass or something.
17:21
And so, I mean, there was such a backlash that, you know, the small team of five people couldn’t sleep at night, and they took down the demo.
17:31
They left the open source code and the paper. They took down the demo.
17:36
And our conclusion was, like, the world is not ready for this kind of technology, and nobody is interested.
17:45
Three weeks later, ChatGPT came out, okay? And that was, like, the second coming of the Messiah, right?
17:56
And, like, we looked at each other and said, like, what?
18:02
What just happened? Like, we just couldn’t understand the enthusiasm of the public for this, given the reaction.
18:09
You didn’t like the previous one, yeah. And so, and I think, you know, OpenAI was really surprised also, actually, by the success of ChatGPT among the public.
18:20
So a lot of it is perception. And, you know, I think the discourse about…
18:26
But ChatGPT wasn’t trying to write a scholarly paper or to do the science.
18:31
It was basically something you could converse with and ask a question about anything. It was trying to be more general than that, right?
18:37
So to some extent, it was more useful to more people or, you know, more approximately useful.
18:42
So there are dangers, for sure. There is just better AI. There is, you know, unreliable systems, as I was talking about before.
18:51
The fix for this is better AI, systems that, you know, have common sense, maybe, capacity of reasoning and checking whether the answers are correct, and assessing the reliability of their own answers, which is not quite currently the case.
19:07
And, but the catastrophe scenarios, frankly, I don’t, I mean, I don’t believe in them.
19:13
Okay, that’s good. That’s what people adapt, right? I like to think that AI is mostly for good, even though there’s a little bit of bad in there.
19:20
So as somebody with homes on both sides of the Atlantic, you have a very global perspective. Where do you see, you know, future innovation in AI coming from?
19:27
So it can come from anywhere. There’s smart people anywhere. Nobody has a monopoly on good ideas.
19:33
I mean, there are people who have a huge superiority complex and think they can come up with all the good ideas without talking to anyone.
19:41
In my experience as a scientist, it’s not the case. You have to, you know, the good ideas come from interaction of a lot of people, the exchange of ideas, and, you know, in the last decade and a half or so, exchange of code as well.
19:55
And so that’s one reason why, you know, I’ve been a very strong advocate of open source AI platforms, and why Meta, in part, has adopted that philosophy as well.
20:06
We don’t have a monopoly on good ideas. As smart as we think we are, we just don’t. And, you know, the recent story about DeepSeek really shows that good ideas can come from anywhere.
20:17
Now, there is a lot of really good scientists in China. One story that a lot of people should know is that if you ask yourself, what is the paper in all of science that has gathered the largest number of citations over the last 10 years?
20:31
And that paper was published in 2015, exactly 10 years ago. And it was about a particular neural net architecture called ResNet, residual networks.
20:43
That came out of Microsoft research in Beijing by a bunch of Chinese scientists.
20:50
The lead author was Kaiming He. After a year, he joined FAIR, Meta, in California.
20:57
Spent about eight years there. And recently… He’s at MIT now, yeah.
21:04
Moved to MIT, exactly. So, you know, that tells you there’s a lot of good scientists all over the world.
21:12
Ideas can come up from everywhere. But then to actually put those ideas in practice, you need, like, you know, a big infrastructure, a lot of computation.
21:22
You need to give a lot of money to your friends, your colleagues, right, to buy them.
21:27
But having the open intellectual community makes progress go faster, because if somebody comes up with half the good idea over here, and somebody else has the other half, and if they communicate, then it happens.
21:37
But if they’re all very insular and closed, progress just doesn’t take place. That’s right. And the other thing is, you need to… For innovative ideas to emerge, as a chief scientist at NVIDIA, you know this, you need to give people a long leash, right?
21:52
You need to let people really, sort of, innovate, and not, like, pressure them to produce something, you know, every three months or every six months.
22:02
And in fact, that’s pretty much what happened with DeepSeek. That’s what happened with Lama.
22:08
So one story that is not widely known is that there were several LLM projects at FAIR in 2022.
22:14
One that had kind of a lot of resources, support from, you know, a lot of people, from leadership and everything.
22:22
And another one that was kind of a small pirate project by a dozen people in Paris, who basically decided to build their own LLM because they needed it for some reason.
22:34
And that became Llama. The big project, you never heard of it.
22:39
It was stopped. And so you can come up with good ideas even if you don’t have, you know, all the support.
22:48
Basically, if you are somewhat insulated from your management, and they leave you alone, you know, you can come up with better ideas than if you are supposed to innovate on the schedule.
23:02
Okay, so there’s, you know, a dozen people, they produced Llama 1.
23:07
Then, of course, a decision was made to kind of pick this as the platform as opposed to the other project.
23:15
And then a big team was built around it to produce Llama 2, which eventually was open sourced and basically caused a bit of a revolution in the landscape.
23:27
And then Llama 3. And as of yesterday, there have been over 1 billion downloads of Llama.
23:34
I find this astonishing. I mean, I assume that includes a lot of you.
23:39
But, like, who are all those people, right? I mean, you must know them because they all buy NVIDIA hardware, right, to run those things.
23:49
We thank you for selling all those GPUs. So let’s talk a little bit more about open source.
23:55
I think, you know, Llama has been really innovative in that it’s a state-of-the-art, you know, LLM that’s offered, you know, open weight at least so that people can, you know, download and run it themselves.
24:07
What are the pros and cons of that? I mean, you’re, you know, the company is obviously investing enormous amounts of money in developing the model and training it and fine-tuning it and then giving it away.
24:19
So what is good about that and what is the downside? Well, so I think there is a downside if you are a company that expects to make revenue directly from that service.
24:28
If that’s your only business, then, of course, you know, it may not be advantageous for you to kind of reveal all your secrets.
24:36
But if you are a company like Meta or to some extent Google, you know, the revenue comes from other sources.
24:44
Advertising. Advertising in the case of Meta, you know, there’s various sources for Google.
24:49
You know, perhaps there will be other sources in the future. But what matters is not, like, how much revenue can you generate in the short term?
24:59
It’s like, can you build the functionalities that are needed for the product that you want to build?
25:05
And, you know, can you kind of get the largest number of smart people in the world to contribute to it for the entire world?
25:14
Like, it doesn’t hurt Meta if some other company uses Llama for some other purpose.
25:19
Like, you know, they don’t have a social network that they can build on top of this.
25:25
So, I mean, it’s much more of a threat for Google because obviously you can you can build search engines with that.
25:32
So which is why probably they are a little less positive about this kind of approach.
25:38
But the other thing is that we’ve seen the effect of PyTorch, first of all, on the landscape, on the community.
25:45
And of Llama 2, where it basically, you know, jump-started the entire ecosystem of startups.
25:52
I mean, you know, and we see this also in sort of larger industry now, where people sometimes prototype an AI system with an API, proprietary API, and then when it comes time to deploy it, the most cost-effective way of doing it is to do it on Lama, because you can run it on-premise or some other open source.
26:12
But the biggest, you know, philosophically, I think the biggest factor, the biggest, most important reason to want to have open-source platforms is that, you know, in a short time, every single one of our interaction with the digital world will be mediated by AI systems.
26:29
I’m wearing the Ray-Ban Meta smart glasses right now. I can talk to Meta AI through it and ask it any question.
26:36
We don’t believe that people are going to want, like, a single assistant, and that those assistants are going to come up from a handful of companies on the West Coast of the U.S. or China.
26:48
We need assistants that are extremely diverse. They need to speak all the world’s languages, understand all the world’s cultures, all the value systems, all the centers of interest, you know, and they need to have, like, their biases, political opinions, blah, blah, blah.
27:05
And so we need a diversity of assistants for the same reason that we need a diverse press.
27:10
Otherwise, we’ll all have the same information from the same sources, and that’s just not good for democracy and, you know, everything else.
27:19
So we need a platform that anybody can use to build those assistants, a diverse population of assistants.
27:26
And right now, that can only be done through open-source platforms. I think it’s going to be even more important in the future because if you want, you know, foundation models to speak all the languages in the world and everything, no single entity is going to be able to do this by itself.
27:44
You know, who is going to collect all the data in all the languages in the world and just hand it over to, you know, OpenAI, Meta, or Google, or Anthropic?
27:54
Nobody. They want to keep their data. So they want to, you know, regions in the world are going to want to contribute their data to a global foundation model but not actually give out their data.
28:06
They might contribute to training a global model. I think that’s the model of the future.
28:12
Foundation models will be open-source, will be trained in a distributed fashion with various data centers around the world having access to different subsets of data and basically training kind of a consensus model, if you want.
28:26
And so that’s the way, that’s what makes open-source platforms completely inevitable and proprietary platforms, I think, are going to disappear.
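The talk does not say how such a distributed, "consensus" training would work mechanically. One standard family of techniques is federated averaging, where each region trains on data it never shares and only model weights are exchanged; the sketch below is a toy illustration of that idea, with a made-up objective and made-up names.

```python
import numpy as np

def local_update(weights: np.ndarray, local_data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Stand-in for one region/data center training on its own data subset
    (here, a single toy gradient step toward the local data mean)."""
    grad = weights - local_data.mean(axis=0)
    return weights - lr * grad

def consensus_round(global_w: np.ndarray, regional_datasets) -> np.ndarray:
    """Each region updates a copy of the model locally; only the updated
    weights are shared and averaged into a consensus model."""
    local_ws = [local_update(global_w.copy(), d) for d in regional_datasets]
    return np.mean(local_ws, axis=0)

rng = np.random.default_rng(1)
regions = [rng.normal(loc=m, size=(100, 4)) for m in (-1.0, 0.0, 2.0)]
w = np.zeros(4)
for _ in range(20):
    w = consensus_round(w, regions)
print(w)   # drifts toward a compromise over all regions' data, none of which left home
```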
28:35
Yeah, and it also makes sense both for the diversity of languages and things but also for applications.
28:42
So a given company can download Llama and then fine-tune it on proprietary data that they wouldn’t want to upload.
28:48
Well, that’s what’s happening now. I mean, the business model of most AI startups basically is around this, right?
28:53
Specialized system for vertical applications. Yeah, yeah. So, you know, in Jensen’s keynote, he had this great thing about using an agentic LLM to do wedding planning to decide who was going to sit around the table, and that was a great example of the tradeoff between, you know, putting effort into training and putting effort into inference.
29:14
So in one case, you can have a very powerful model that you spend an enormous number of resources on training or you can build a less powerful model but basically run it many passes so it can reason and do it.
29:27
What do you see as the tradeoffs between, you know, training time and inference or test time in building a powerful model?
29:35
Where is the optimum point? So, first of all, I think, you know, Jensen is absolutely right that you get ultimately more power in a system that can sort of, you know, reason.
29:46
I disagree with the fact that the proper way to do reasoning is the way, you know, current LLMs are augmented by reasoning ability.
29:54
You’re saying it works, but it’s not the right way. It’s not the right way. I think, you know, when we reason, when we think, we do this in some sort of abstract mental state that has nothing to do with language.
30:07
So you don’t like kicking the tokens out. You want to be reasoning in your latent space and not in token space.
30:14
It’s our latent abstract space, right? I mean, if I tell you, you know, imagine a cube floating in front of you and I rotate that cube by 90 degrees around a vertical axis, okay?
30:23
You can do this mentally. It has nothing to do with language. You know, a cat could do this.
30:28
We can’t specify the problem to a cat, obviously, through language. But, you know, cats do things that are much more complex than this when they plan, like, you know, some trajectories to jump on a piece of furniture, right?
30:40
They do things that are much more complex than that. And that is not related to language. It’s certainly not done in, you know, token space, which would be kind of actions.
30:50
It’s done in sort of abstract mental space. So that’s kind of the challenge of the next few years, which is to figure out new architectures that allow this type of thing.
30:59
That’s what I’ve been working on for the last several years. So is there a new model we should be expecting that allows us to do reasoning in this abstract space?
31:08
It’s called, we call it JEPA, or JEPA world models. And we’ve, you know, my colleagues and I have kind of put out a bunch of papers on this, kind of, you know, first steps towards this over the last few years.
31:20
So JEPA means joint embedding predictive architecture. This is those world models that learn abstract representations and are capable of sort of manipulating those representations and perhaps reason and produce sequences of actions to, you know, arrive at a particular goal.
31:35
I think that’s the future. I wrote a long paper about this that explains how this might work about three years ago.
31:42
Yeah. So, you know, to run those models, you’re going to need great hardware. And, you know, over the last decade, the capabilities of GPUs have increased by, you know, on the order of 5,000 to 10,000 times on basically both training and inference for AI models from Kepler to Blackwell.
31:59
And we’ve seen today that even more is coming. And then scale out and scale up have provided even additional capabilities.
32:07
In your opinion, what is coming down the road? What sort of things do you expect will enable us to build your JEPA model and other more powerful models?
32:16
Well, so, I mean, keep them coming. You know? Because we’re going to need all the computation that, you know, we can get our hands on.
32:25
So, I mean, this kind of reasoning in abstract space idea is going to be expensive computationally at runtime.
32:31
And it connects with something that we’re all very familiar with, right? So psychologists talk about system one and system two.
32:39
System one is tasks that you can accomplish without really sort of thinking about them.
32:45
You become used to them, and you can accomplish them without thinking too much about them.
32:50
So if you are an experienced driver, you can drive even without driving assistance.
32:56
You can drive without thinking about it much. You know, you can talk to someone at the same time.
33:02
You can, you know, et cetera. But if you are a… If you drive for the first time, for the first few hours, you are… You’re dangerous.
33:10
You have to refocus on what you’re doing, right? And you’re planning all kinds of catastrophe scenarios and stuff like that.
33:18
Imagine all kind of things. So that’s system two. You’re recruiting your entire prefrontal cortex, your world model, your internal world model, to figure out, you know, what’s going to happen and then plan actions so that good things happen.
33:33
Whereas when you’re familiar with this, you can just use system one and sort of do this automatically.
33:39
So this idea that you start by, you know, using your world model, and you’re able to accomplish a task, even a task that you’ve never encountered before, zero shot, right?
33:50
You don’t have to be trained to solve that task. You can just accomplish that task without learning anything, just on the basis of your understanding of the world and your planning abilities.
34:02
That’s what’s missing in current systems. But if you accomplish that task multiple times, then eventually it gets compiled into what’s called a policy, right?
34:12
So a sort of reactive system that allows you to just accomplish that task without planning.
34:17
So the first thing, this reasoning, is system two. And then there’s the sort of automatic, subconscious, reactive policy.
34:24
That’s system one. LLMs can do system one and are inching their way towards system two.
34:30
But ultimately, I think we need a different architecture for system two.
34:36
Okay, and do you think it will be your… Was it JEPA? I think it’s not going to be a generative architecture.
34:41
If you want a system to understand the physical world, the physical world is much, much more difficult.
34:47
to understand than language. We think of language as kind of the epitome of human capability, you know, intellectual capabilities.
34:56
But in fact, language is simple because it’s discrete. And it’s discrete because it’s a communication mechanism and it needs to be discrete.
35:05
Otherwise, it wouldn’t be noise resistant. You wouldn’t be able to understand what I’m saying right now.
35:11
And so, it’s simple for that reason. But the real world is just much more complicated.
35:17
Like, okay, here is something that some of you may have heard me say in the past.
35:22
Current LLMs are trained typically with something like on the order of 30 trillion tokens, right?
35:29
A token typically is about three bytes. So, that’s 9 times 10 to the 13 bytes. Let’s say 10 to the 14 bytes.
35:36
That would take any of us over 400,000 years to read through that because that’s kind of the totality of all the text available on the internet, right?
35:45
Now, psychologists tell us that a four-year-old has been awake a total of 16,000 hours.
35:51
And we have about two megabytes going to our visual cortex through our optic nerve every second, two megabytes per second, roughly.
36:00
Multiply this by 16,000 hours times 3,600. It’s about 10 to the 14 bytes. In four years, through vision, you see as much data as text that would take you 400,000 years to read.
36:11
I mean, that tells you we’re never going to get to AGI, whatever you mean by this, by just training from text.
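The back-of-the-envelope numbers above check out roughly as stated; here is a quick sanity check in Python. The reading-speed figure used to back out the "over 400,000 years" is my own assumption, not something given in the talk.

```python
# Text side: ~30 trillion tokens at ~3 bytes per token
text_bytes = 30e12 * 3                    # = 9e13, i.e. roughly 1e14 bytes

# Vision side: a four-year-old awake ~16,000 hours at ~2 MB/s down the optic nerve
vision_bytes = 16_000 * 3_600 * 2e6       # ~= 1.15e14 bytes, the same order of magnitude

# Assumed reading speed (not from the talk): ~330 tokens/minute, 8 hours a day
tokens_per_year = 330 * 60 * 8 * 365
years_to_read = 30e12 / tokens_per_year   # on the order of 500,000 years

print(f"text: {text_bytes:.1e} bytes, vision: {vision_bytes:.1e} bytes")
print(f"roughly {years_to_read:,.0f} years to read the whole text corpus")
```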
36:19
It’s just not happening. Yeah. So, going back to hardware, there’s been a lot of progress on spiking systems and people who advocate this look at analogies to how biological systems work, suggest that neuromorphic hardware has a role.
36:34
Is there any place where you see neuromorphic hardware either complementing or replacing GPUs in doing AI?
36:43
Not anytime soon. You’ll give me the 20 bucks afterwards. What’s that?
36:48
Okay, I have to tell you, sorry about this. So, when I started at Bell Labs in 1988, the group I was in was actually focused on analog hardware for neural nets.
37:04
And they, you know, built a bunch of generations of completely analog neural nets and then mixed analog digital and then completely digital towards the mid-90s.
37:14
And that’s when people kind of lost interest in neural nets, so then there was no point anymore.
37:21
The problem with exotic underlying principles like this is that the current digital CMOS is in such a deep local minimum that it’s going to take a while before, you know, kind of alternative technologies and enormous amounts of investment before alternative technologies can catch up.
37:39
And it’s not even clear that, at the level of principle, there is any advantage to it.
37:45
So, things like, you know, analog or spiking neurons or, you know, spiking neural nets, there might be some sort of intrinsic advantage except that basically they make hardware reuse very difficult, right?
37:58
I mean, every piece of hardware that we use at the moment is too big and too fast in a sense.
38:04
So, you have to essentially reuse the same piece of hardware, you know, multiplex the same piece of hardware to compute multiple parts of your neural net.
38:15
If you use analog hardware, basically you can’t use multiplexing, so you have to have one physical neuron per neuron in your virtual neural net.
38:24
And that means now that you can’t fit a decent size neural net on a single chip, you have to do multi-chip.
38:31
It’s going to be incredibly fast once you’re able to do this, but it’s not going to be efficient because you need to do cross-chip communication and, you know, memory becomes complicated and in the end you need to actually communicate digitally because that’s the only way to do it efficiently for, you know, noise resistance.
38:53
In fact, the brain, here is an interesting piece of information, most brains or most animals, in the brains of most animals, the neurons communicate through spikes, okay, and spikes are binary signals.
39:06
So, it is digital, it’s not analog. The computation at level of neuron may be analog, but the communication between neurons is actually digital.
39:16
Except for tiny animals, so if you take C. elegans, right, the one millimeter long worm, it’s got 302 neurons, they don’t spike, they don’t need to spike because they don’t need to communicate far away.
39:29
So, they can use analog communication at that scale. So, you know, that tells you that even if we want to use exotic technology like analog computation, we’re going to have to use digital communication somehow, if nothing else, for memory.
39:45
So, it’s not clear. I mean, you’ve gone through this calculation, I know, multiple times, I probably know much less about that than you do, but I don’t see it happening anytime soon.
39:57
There might be some corners of like edge computation. So, if you want like a super cheap microcontroller that is going to run, you know, a perception system for your vacuum cleaner or your lawnmower, then maybe analog computation makes sense.
40:13
If you can fit the whole thing in a single chip and you can use maybe, I don’t know, phase change memory or something like this to store the weights, maybe.
40:24
I know some people are like seriously kind of building those things. That gets on to what people call PIMs, or processor-in-memory technologies.
40:31
Right. With analog and digital. Do you see a role for them? Oh, yeah. I mean, absolutely. So, I mean, some of my colleagues are actually very interested in this because they want to build like successors to those, you know, smart glasses, and what you want is some, you know, processing basically taking place all the time.
40:46
Right now, it’s not possible because of power consumption. Just, you know, a sensor, like an image sensor, you can’t leave it on all the time in a pair of glasses like this.
40:56
You run the battery in minutes. And so, the one potential solution to this is to actually have processing on the sensor directly, so you don’t have to shuffle the data out of the chip, which is really what costs energy, right?
41:09
Shuffling data is what costs energy. It’s not the computation itself. And so, there’s, you know, quite a bit of work on this, but we’re not there yet.
41:17
So, you see that as a promising direction? I see this as a promising direction. In fact, biology has figured this out, right?
41:24
Our retina has on the order of 60 million photo sensors, and in front of our retina, we have four layers of neurons, transparent neurons, that process the signal to squeeze it down to one million optic nerve fibers that go to our visual cortex.
41:38
So, there is compression, feature extraction, you know, all kinds of stuff to really sort of get, you know, most of the useful information out of a visual system.
41:47
Yeah. So, what about other emerging technologies? Do you see, you know, quantum or superconducting logic or anything else on the horizon that’s going to give us a great step forward in AI processing capability?
41:59
Superconductivity, perhaps. I don’t know enough about this to really tell. Optical has been very disappointing.
42:06
I think there have been generations of this. I mean, I remember being totally amazed by talks about optical implementations of neural nets back in the 1980s, and they never panned out.
42:14
I mean, technology is evolving, obviously, so maybe things may change. I think a lot of the cost there is like analog.
42:22
You lose it in the conversion to interface with digital systems. And then for quantum, I’m extremely skeptical of quantum computing.
42:31
I mean, I think the only medium-term application of quantum computing that I see is for simulating quantum systems.
42:38
Like, you know, if you want to do like quantum chemistry or something, maybe.
42:43
Like, for anything else, like, I’m extremely skeptical. Okay. So, you’ve talked about building AI that can learn from observation like a baby animal.
42:53
What kind of demands do you see that putting on the hardware, and how do you think we need to grow the hardware to enable that?
43:04
How much can you give us? Well, it’s a question of how much you’re willing to buy. The more you buy, the more you save.
43:11
And the more you make, as we heard today. Right. Exactly. Yeah. No, it’s not going to be cheap.
43:16
It’s not going to be cheap because, I mean, video, like, okay, let me tell you an experiment that some of my colleagues did until about a year ago.
43:27
So there was a technique for self-supervised learning to learn image representations using reconstruction.
43:34
The stuff I told you doesn’t work. Okay. It’s a project that was called MAE, masked autoencoder.
43:41
So it’s basically an autoencoder, a denoising autoencoder, very much like.
43:46
like, you know, what’s used in NLP, right? So you take an image, you corrupt it by removing some pieces of it, a big chunk of it actually, and you train some gigantic neural net to reconstruct the full image at a pixel level essentially or at the token level.
44:03
And then you use the internal representation as input to a downstream task that you train supervised, object recognition or whatever.
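A stripped-down sketch of the reconstruction objective being described: hide most of the patches and regress the missing pixels. The tiny MLP and the patch sizes are placeholders for illustration, not the actual MAE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, PATCH_DIM = 64, 32   # illustrative; real ViT-style models use far more

model = nn.Sequential(        # placeholder for the "gigantic" encoder-decoder
    nn.Linear(PATCHES * PATCH_DIM, 512), nn.ReLU(),
    nn.Linear(512, PATCHES * PATCH_DIM),
)

def mae_style_loss(images: torch.Tensor) -> torch.Tensor:
    """Mask out a large fraction of patches and reconstruct them at pixel level."""
    b = images.shape[0]
    mask = (torch.rand(b, PATCHES, 1) < 0.75).float()   # hide ~75% of the patches
    corrupted = images * (1 - mask)
    recon = model(corrupted.reshape(b, -1)).reshape(b, PATCHES, PATCH_DIM)
    # the loss is computed on the masked patches, in the input (pixel) space
    return F.mse_loss(recon * mask, images * mask)

imgs = torch.randn(8, PATCHES, PATCH_DIM)  # toy stand-in for patchified images
print(mae_style_loss(imgs).item())
```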
44:13
It works okay. You have to boil a small pond to cool down those liquid-cooled GPU clusters to be able to do this.
44:20
It doesn’t work nearly as well as those joint embedding architectures. You may have heard of DINO, DINOv2, I-JEPA, et cetera.
44:29
So those are joint embedding architectures, and they tend to work better and actually be cheaper to train.
44:36
So in joint embedding, you basically have the two latent spaces for the two input classes.
44:42
That’s right. Rather than converting everything into one kind of token. Well, instead of having an image and then a corrupted or transformed version of it, and then reconstructing the full image from the corrupted or transformed version, you take the full image and the corrupted transformed version, you run them both through encoders, and then you try to… You try to link those embeddings.
45:02
You train it to predict the representation of the full image from the representation of the partially visible one, the corrupted one.
45:09
Okay? So the joint embedding predictive architecture. That works better and is cheaper.
45:15
Okay. Now the MAE team said, okay, this seems to work okay for images. Let’s try to do this for video.
45:21
So now you have to tokenize a video, basically turn a video into 16 by 16 patches, and that’s a lot of patches for even a short video.
45:30
And then train some gigantic neural net to reconstruct the patches that are missing in a video, maybe predict a future video.
45:38
That required boiling a small lake, not a small pond. And it was basically a failure.
45:43
That project was stopped. Okay. So the alternative that we have now is a project called VJEPA, and we are getting close to version two, where basically it’s one of those joint embedding predictive architectures.
45:56
So it does prediction on video, but at the representation level, and it seems to work really well.
46:02
We have an example of this. The first version of this is trained on very short videos, just 16 frames, and it’s trained to basically predict the representation of a full video from a version of a partially masked one.
46:16
And that system apparently is able to tell you whether a particular video is physically possible or not, at least in restricted cases.
46:25
And it gives you a binary output, this is feasible, this is not? Well, no, it’s simpler than this.
46:33
You measure the prediction error that the system produces. So you take a sliding window of those 16 frames on a video, and you look at, you know, can you predict like the next few frames, and you measure the prediction error.
46:47
And when something really strange happens in the video, like an object disappears or changes shape or, you know, something like that, or spontaneously appears, which doesn’t happen in physics, then the prediction error goes up.
46:56
So it’s learning what is physically realistic just by observing videos. Yeah. These are, you know, I mean, you train it on natural videos, and then you test it on synthetic video where something really weird happens.
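A sketch of the measurement just described: slide a window over the video, predict the representation of the next few frames, and watch for spikes in prediction error. `encode` and `predict_next` are hypothetical stand-ins for a trained V-JEPA-style encoder and predictor, not the real models.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained video encoder (frames -> representation)."""
    return frames.mean(axis=0)

def predict_next(rep: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained predictor in representation space."""
    return rep

def surprise_scores(video: np.ndarray, window: int = 16, lookahead: int = 4):
    """Prediction error on the next frames' representation, per window position.
    A spike suggests something the model considers physically implausible."""
    scores = []
    for t in range(len(video) - window - lookahead):
        rep = encode(video[t : t + window])
        target = encode(video[t + window : t + window + lookahead])
        scores.append(float(np.sum((predict_next(rep) - target) ** 2)))
    return scores

video = rng.normal(size=(64, 128))      # toy "video": 64 frames of features
video[40:] += 5.0                       # an abrupt, implausible change at frame 40
scores = surprise_scores(video)
print(int(np.argmax(scores)))           # error peaks where the change enters the prediction target
```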
47:07
Right. So if you trained it on videos where really weird things happen, that would become normal and it wouldn’t detect those as being odd.
47:14
So you don’t do that. No, I mean, it corresponds a bit to, like, if you, you know, baby humans take a while to learn intuitive physics, like the fact that, you know, an object that is not supported falls.
47:24
So basically the effect of gravity. Babies learn this around the age of nine months.
47:30
So if you show a five or six-month baby a scenario where an object appears to float in the air, they’re not surprised.
47:38
But by nine months or ten months, they look at it, you know, with huge eyes, and you can actually measure that, like psychologists have ways of measuring the attention.
47:50
And what that means is that the internal world model, mental model of the world of the infant is being violated.
47:58
The baby is seeing something that she doesn’t think is possible. It’s not, it doesn’t match expectations.
48:06
And so she has to look at it to correct her internal world model to say, like, you know, maybe I should learn about this.
48:14
Yeah. So you’ve talked about, you know, reasoning and planning in this, you know, joint embedding space.
48:22
What do we need to get there? What are the bottlenecks both on the model side and on the hardware side?
48:27
A lot of it is just making it work. So we need a good recipe. You know, like, before people came up with a good recipe to train, let’s say, even simple convolutional nets.
48:39
Okay? So, you know, back in the, you know, until the late 2000s, Geoff Hinton was just, you know, telling everyone, you know, it’s very difficult to train deep network with backprop.
48:50
You know, Yann can do it with ConvNets, but he’s the only one in the world who can do it, which, of course, was true at the time, but was not really true.
48:58
Okay? It turns out it’s not that difficult, but there’s a lot of tricks that you have to figure out, like, you know, engineering tricks or intuitive tricks or, you know, which non-linearity you use, this idea of ResNet, right?
49:11
Paper cited 250,000 times in the last 10 years, the most cited paper in all of science. It’s a very simple idea, right?
49:18
You just have a connection that skips every layer so that, by default, a layer in a deep neural net basically computes the identity function, and what the neural net is doing is a deviation from that.
49:29
Very simple idea, but that allowed to… Keeps you from losing your gradient going backwards, yeah.
49:35
That’s right. And it allowed to train neural nets with 100 layers or something like that, and now we… Because before that, people had all these tricks where they’d pull out intermediate things and have loss functions on those to avoid, because you couldn’t backprop all the way through.
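The skip-connection idea in a few lines of code. This is a generic residual block for illustration; the original ResNet uses convolutions and batch normalization rather than the small MLP below.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + f(x): by default the block is the identity function,
    and the layers only have to learn a deviation from it."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)   # the skip connection that keeps gradients flowing

# Stacking 100 of these stays trainable because gradients pass through the skips.
deep_net = nn.Sequential(*[ResidualBlock(64) for _ in range(100)])
print(deep_net(torch.randn(2, 64)).shape)
```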
49:51
That’s right. You know, a layer would die, and your network would be dead, essentially, so you would have to restart training.
49:58
So, you know, people gave up pretty quickly because they just didn’t have all the tricks. And so, before people came up with a good recipe with all those residual connections, you know, Adam optimizers and normalization and stuff like that, and by the way, we just had a paper about normalization in Transformers, and things like that.
50:15
Before you had this complete recipe and all the tricks, it was really… Nothing worked.
50:21
And, you know, same with NLP, with Natural Language Processing systems, right? There were those systems in the mid-2010s based on basically denoising autoencoders, like BERT-type systems, where you take a piece of text, you corrupt it, train a big neural net to recover the words that are missing, and eventually that was wiped out by the GPT-style architecture where you just train on the entire text.
50:45
You basically train it as an autoencoder, but you don’t need to corrupt the input because the architecture is causal.
50:52
It’s a recipe, right? Turned out to be incredibly successful, to scale really well. So we have to come up with a good recipe for those JEPA architectures that will scale to the same extent.
51:03
That’s what’s missing. Well, we have a flashing red light ahead of us, so are there any final thoughts you’d like to leave the audience with before we adjourn?
51:13
Yeah. I mean, I want to reinforce the point I was making earlier. I think the progress of AI is going to take, and the progress towards, you know, human level AI or advanced machine intelligence or AGI, whatever you want to call it, is going to require contributions from everyone.
51:29
It’s not going to come up, it’s not going to come from like a single entity somewhere that does R&D in secret.
51:36
That’s just not happening. It’s not going to be an event. It’s going to be a lot of kind of successive progress along the way.
51:44
And humanity is not going to get killed within an hour of this happening because it’s not going to be an event, okay?
51:52
And because it’s going to require contributions from basically everywhere around the world, it’s going to have to be open research and based on open source platforms if they require a lot of training.
52:05
And we’re going to need like cheaper hardware. You’re going to need to lower your prices.
52:11
I’m sorry. You have to take that up with Jensen. And we’ll have a future with high diversity of AI assistants that are going to help us in our daily lives, be with us at all times through maybe our smart glasses or other smart devices.
52:26
And we’re going to be their boss. They’re going to be working for us.
52:36
It’s going to be like all of us are going to be managers, okay? That’s a terrible future. Well, on that note, I think I’d like to thank you for just a really intellectually stimulating conversation and hope we get a chance to do this again.
52:49
All right. Yeah. Thanks.

