
Philosophy of Deep Learning, NYU

Do large language models need sensory grounding for meaning and understanding?

Spoiler: YES!

Yann LeCun
Courant Institute & Center for Data Science, NYU
Meta – Fundamental AI Research
2023-03-24
[Title image generated with Make-A-Scene]

Machine Learning sucks! (compared to humans and animals)

Supervised learning (SL) requires large numbers of labeled samples.
Reinforcement learning (RL) requires insane numbers of trials.
Self-Supervised Learning (SSL) requires large numbers of unlabeled samples.
Most current ML-based AI systems make stupid mistakes and do not reason or plan.
Animals and humans:
Can learn new tasks very quickly.
Understand how the world works.
Can reason and plan.
Humans and animals have common sense; current machines, not so much (it's very superficial).
Self-Supervised Learning has taken over the world

For understanding & generation of images, audio, text...

[Image generated with Make-A-Scene]

Self-Supervised Learning = Learning to Fill in the Blanks


Reconstruct the input or predict missing parts of the input.
[Diagram: an input signal laid out along time or space →, with masked parts to be filled in]


SSL via Denoising Auto-Encoder / Masked Auto-Encoder

BERT [Devlin 2018], RoBERTa [Ott 2019], ...
[Diagram: the input y is corrupted, passed through an Encoder to a learned representation, and a Decoder produces a reconstruction ỹ; training minimizes the cost C(y, ỹ)]
Example: "This is a [...] of text extracted [...] a large set of [...] articles" is reconstructed as "This is a piece of text extracted from a large set of news articles".
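A minimal sketch of this fill-in-the-blanks objective, assuming PyTorch and toy sizes; the tiny encoder/decoder, masking rate, and vocabulary below are hypothetical stand-ins, not the actual BERT/RoBERTa setup:

```python
# A toy masked/denoising auto-encoder step (hypothetical sizes and modules).
import torch
import torch.nn as nn

vocab, dim, mask_id = 1000, 64, 0            # toy vocabulary; id 0 reserved for [MASK]
encoder = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim), nn.ReLU())
decoder = nn.Linear(dim, vocab)              # predicts a token at every position

y = torch.randint(1, vocab, (8, 16))         # a batch of 8 "sentences", 16 tokens each
mask = torch.rand(y.shape) < 0.15            # corruption: hide ~15% of the tokens
y_corrupted = y.masked_fill(mask, mask_id)

h = encoder(y_corrupted)                     # learned representation
logits = decoder(h)                          # reconstruction over the vocabulary
# C(y, ỹ): cross-entropy, evaluated only on the masked (missing) positions
loss = nn.functional.cross_entropy(logits[mask], y[mask])
loss.backward()
```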

Auto-Regressive Generative Models

Outputs one "token" after another.
Tokens may represent words, image patches, speech segments...
[Diagram: an encoder/predictor with a stochastic output maps the context x[t-k], ..., x[t-1], x[t] (the prompt) to the predicted token x[t+1], which is appended to the context for the next step]
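A sketch of that loop in Python/PyTorch; `model` is a hypothetical stand-in for any trained predictor that returns next-token logits over the context:

```python
# Auto-regressive generation: sample one token at a time and feed it back.
import torch

def generate(model, prompt_tokens, n_new, temperature=1.0):
    tokens = list(prompt_tokens)                           # the prompt: x[..t]
    for _ in range(n_new):
        context = torch.tensor(tokens).unsqueeze(0)        # shape (1, t)
        logits = model(context)[0, -1]                     # scores for x[t+1]
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1).item()    # stochastic choice
        tokens.append(next_token)                          # becomes part of the context
    return tokens
```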

Auto-Regressive Large Language Models (AR-LLMs)


Outputs one text token after another
Tokens may represent words or subwords
Encoder/predictor is a transformer architecture
With billions of parameters: typically from 1B to 500B
Training data: 1 to 2 trillion tokens
LLMs for dialog/text generation:
BlenderBot, Galactica, LLaMA (FAIR), Alpaca (Stanford), LaMDA/Bard (Google), Chinchilla (DeepMind), ChatGPT (OpenAI), GPT-4??...
Performance is amazing … but … they make stupid mistakes
Factual errors, logical errors, inconsistency, limited reasoning, toxicity...
LLMs have no knowledge of the underlying reality
They have no common sense & they can’t plan their answer

Unpopular Opinion about AR-LLMs

Auto-Regressive LLMs are doomed.
They cannot be made factual, non-toxic, etc.
They are not controllable.
[Diagram: the tree of "correct" answers is a small subset of the tree of all possible token sequences]
Let e be the probability that any produced token takes us outside the set of correct answers.
The probability that an answer of length n is correct is then:

P(correct) = (1 - e)^n

This decays exponentially with n: the answer drifts out of the set of correct answers, and it's not fixable.
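A toy calculation, in plain Python, using the slide's independence assumption to show how fast (1 - e)^n shrinks:

```python
# If each token independently has probability e of stepping outside the set
# of correct answers, P(correct for an n-token answer) = (1 - e)**n.
for e in (0.01, 0.05):
    for n in (10, 100, 1000):
        print(f"e={e}  n={n:>4}  P(correct) = {(1 - e) ** n:.2e}")
# e=0.01 gives roughly 9.0e-01 at n=10, 3.7e-01 at n=100, 4.3e-05 at n=1000
```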

Auto-Regressive Generative Models Suck!

AR-LLMs
Have a constant number of computational steps between input and output: weak representational power.
Do not really reason. Do not really plan.

Humans and many animals


Understand how the world works.
Can predict the consequences of their actions.
Can perform chains of reasoning with an unlimited number of steps.
Can plan complex tasks by decomposing them into sequences of subtasks
How could machines learn like animals and humans?

[Chart adapted from Emmanuel Dupoux: ages (in months, 0-14) at which infants acquire abilities — perception (face tracking, biological motion), social/communication (emotional contagion, pointing, helping vs. hindering, false beliefs), actions (rational goal-directed actions, proto-imitation), physics (gravity, inertia, stability, support, conservation of momentum), objects (object permanence, shape constancy, solidity, rigidity, natural kind categories), production (crawling, walking)]

How can babies learn how the world works?
How can teenagers learn to drive with ~20h of practice?

Three challenges for AI & Machine Learning


1. Learning representations and predictive models of the world
Supervised and reinforcement learning require too many samples/trials
Self-supervised learning / learning dependencies / to fill in the blanks
learning to represent the world in a non task-specific way
Learning predictive models for planning and control
2. Learning to reason, like Daniel Kahneman’s “System 2”
Beyond feed-forward, System 1 subconscious computation.
Making reasoning compatible with learning.
Reasoning and planning as energy minimization.
3. Learning to plan complex action sequences
Learning hierarchical representations of action plans
A Cognitive Architecture capable of reasoning & planning

Position paper: "A path towards autonomous machine intelligence"
https://openreview.net/forum?id=BZ5a1r-kVsf
Longer talk: search "LeCun Berkeley" on YouTube



Modular Architecture for Autonomous AI

Configurator: configures the other modules for the task
Perception: estimates the state of the world
World Model: predicts future world states
Cost (Intrinsic Cost + Critic): computes the "discomfort" cost
Actor: finds optimal action sequences
Short-Term Memory: stores state-cost episodes

Mode-1 Perception-Action Cycle

Perception module: s[0] = Enc(x) extracts a representation of the world.
Policy module: A(s[0]) computes an action reactively.
Cost module: C(s[0]) computes the cost of a state.
Optionally, the world model Pred(s, a) predicts the future state, and states and costs are stored in short-term memory.
[Diagram: x → Enc → s[0]; the Actor computes a[0] = A(s[0]); Pred(s, a) gives s[1]; costs C(s[0]) and C(s[1])]
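A control-flow sketch of this reactive cycle in Python; Enc, A, Pred, and C are hypothetical stand-ins for the trained modules:

```python
# Mode-1 (reactive) step: encode, act without search, store state/cost episode.
def mode1_step(x, Enc, A, Pred, C, memory):
    s = Enc(x)                                # s[0] = Enc(x): state estimate
    a = A(s)                                  # reactive action A(s[0]), no search
    s_next = Pred(s, a)                       # predicted next state s[1]
    memory.append((s, a, C(s), C(s_next)))    # store state-cost episode
    return a

# Toy usage with trivial stand-ins:
memory = []
action = mode1_step(3.0, Enc=lambda x: x * 2, A=lambda s: -s,
                    Pred=lambda s, a: s + a, C=lambda s: s ** 2, memory=memory)
```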

Mode-2 Perception-Planning-Action Cycle

Akin to classical Model-Predictive Control (MPC):
The Actor proposes an action sequence.
The World Model predicts the outcome.
The Actor optimizes the action sequence to minimize the cost, e.g. using gradient descent, dynamic programming, MC tree search...
The Actor sends the first action(s) to the effectors.
[Diagram: starting from s[0], the world model Pred(s, a) is unrolled to s[1], ..., s[T] under actions a[0], ..., a[T-1] from the Actor, accumulating costs C(s[0]), ..., C(s[T])]
[Henaff et al. ICLR 19], [Hafner et al. ICML 19], [Chaplot et al. ICML 21], [Escontrela CoRL 22], ...
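A sketch of this planning loop as gradient-based MPC, assuming PyTorch and a differentiable world model and cost; both are hypothetical stand-ins, and the horizon, learning rate, and action dimension are arbitrary:

```python
# Gradient-based MPC: optimize a candidate action sequence through the
# (assumed differentiable) world model and cost, then return it.
import torch

def plan(world_model, cost, s0, horizon=10, n_steps=50, lr=0.1, action_dim=4):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)   # a[0..T-1]
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(n_steps):
        s, total_cost = s0, 0.0
        for t in range(horizon):
            s = world_model(s, actions[t])      # s[t+1] = Pred(s[t], a[t])
            total_cost = total_cost + cost(s)   # accumulate C(s[t+1])
        optimizer.zero_grad()
        total_cost.backward()                   # gradients flow through the rollout
        optimizer.step()
    return actions.detach()                     # send the first action(s) to effectors
```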

Cost Module

Intrinsic Cost (IC): immutable cost modules, hard-wired drives: IC1(s), IC2(s), ..., ICk(s).
Trainable Cost / Critic (TC): trainable, predicts future values of the IC (equivalent to a critic in RL), implements subgoals, configurable: TC1(s), TC2(s), ..., TCl(s).
All cost modules are differentiable.
Building & Training the World Model
Energy-Based Models, Joint-Embedding Architectures

How do we represent uncertainty in the predictions?

The world is only partially predictable.
How can a predictive model represent multiple predictions?
Probabilistic models are intractable in continuous domains.
Generative models must predict every detail of the world.
My solution: the Joint-Embedding Predictive Architecture.
[Mathieu, Couprie, LeCun ICLR 2016], [Henaff, Canziani, LeCun ICLR 2019]

Architecture for the world model: JEPA

JEPA: Joint-Embedding Predictive Architecture
x: observed past and present
y: future
a: action
z: latent variable (unknown)
D( ): prediction cost
C( ): surrogate cost
JEPA predicts a representation of the future Sy from a representation of the past and present Sx.
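A toy JEPA-style forward pass under these notations, assuming PyTorch; the linear encoders, predictor, and dimensions are hypothetical stand-ins, and a squared error plays the role of the prediction cost D( ):

```python
# Toy JEPA forward pass: encode x and y, predict Sy from Sx and a latent z,
# and measure the prediction cost D in representation space.
import torch
import torch.nn as nn

dim = 32
enc_x, enc_y = nn.Linear(16, dim), nn.Linear(16, dim)   # toy encoders
predictor = nn.Linear(dim + 4, dim)                      # takes Sx and a 4-d latent z

x, y = torch.randn(8, 16), torch.randn(8, 16)            # past/present and future
z = torch.randn(8, 4)                                    # latent variable (unknown factors)
s_x, s_y = enc_x(x), enc_y(y)
s_y_pred = predictor(torch.cat([s_x, z], dim=-1))        # prediction of Sy
prediction_cost = ((s_y_pred - s_y) ** 2).mean()         # D(Sy, predicted Sy)
```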

Architectures: Generative vs Joint Embedding


Generative: predicts y (with all the details, including irrelevant ones)
Joint Embedding: predicts an abstract representation of y

a) Generative Architecture b) Joint Embedding Architecture


Examples: VAE, MAE...

Energy-Based Models: Implicit function

The only way to formalize & understand all model types.
F(x, y) gives low energy to compatible pairs of x and y, and higher energy to incompatible pairs.
[Diagram: data points (x, y) along time or space →, and the corresponding energy landscape F(x, y), low on the data manifold]
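A minimal sketch of the EBM view in PyTorch, with a toy quadratic energy standing in for a trained F(x, y); inference is a gradient-based search for the y that minimizes the energy:

```python
# EBM inference: find the y that minimizes a scalar energy F(x, y).
import torch

def energy(x, y):
    return ((y - 2.0 * x) ** 2).sum()        # toy energy: y compatible with 2*x

def infer_y(x, n_steps=100, lr=0.1):
    y = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        energy(x, y).backward()               # gradient-based search over y
        opt.step()
    return y.detach()

print(infer_y(torch.tensor([1.0, -0.5])))     # ≈ [2.0, -1.0]
```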

EBM Training: two categories of methods

Contrastive methods:
Push down on the energy of training samples.
Pull up on the energy of suitably-generated contrastive samples.
Scales very badly with dimension.
Regularized methods:
A regularizer minimizes the volume of space that can take low energy.
[Diagram: training samples sit in a low-energy region; contrastive methods push up the energy of individual contrastive points, while regularized methods shrink the volume of the low-energy region]

Recommendations:

Abandon generative models in favor of joint-embedding architectures.
Abandon auto-regressive generation.
Abandon probabilistic models in favor of energy-based models.
Abandon contrastive methods in favor of regularized methods.
Abandon Reinforcement Learning in favor of model-predictive control.
Use RL only when planning doesn't yield the predicted outcome, to adjust the world model or the critic.

Training a JEPA non-contrastively

Four terms in the cost:
Maximize the information content of the representation of x.
Maximize the information content of the representation of y.
Minimize the prediction error.
Minimize the information content of the latent variable z.

VICReg: Variance, Invariance, Covariance Regularization

Variance: maintains the variance of the components of the representations.
Covariance: decorrelates the components of the representations by penalizing off-diagonal terms of their covariance matrix.
Invariance: minimizes the prediction error.
Barlow Twins [Zbontar et al. arXiv:2103.03230], VICReg [Bardes, Ponce, LeCun arXiv:2105.04906, ICLR 2022], VICRegL [Bardes et al. NeurIPS 2022], MCR2 [Yu et al. NeurIPS 2020], [Ma, Tsao, Shum, 2022]
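A rough sketch of the three VICReg terms in PyTorch; the coefficients and epsilon follow the general recipe rather than the exact paper implementation:

```python
# VICReg-style loss for two batches of embeddings z_a, z_b of shape (N, D).
import torch

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z_a.shape
    # Invariance: prediction error between the two views
    sim_loss = torch.nn.functional.mse_loss(z_a, z_b)
    # Variance: keep the std of every embedding dimension above 1
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = torch.relu(1 - std_a).mean() + torch.relu(1 - std_b).mean()
    # Covariance: decorrelate dimensions by penalizing off-diagonal covariance
    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov_loss = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_w * sim_loss + var_w * var_loss + cov_w * cov_loss
```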

VICRegL: local matching latent variable for segmentation

Latent variable optimization: finds a pairing between local feature vectors of the two images.
[Bardes, Ponce, LeCun NeurIPS 2022, arXiv:2210.01571]

MC-JEPA: Motion & Content JEPA

Simultaneous SSL for image recognition and motion estimation.
Trained on ImageNet-1k and various video datasets.
Uses VICReg to prevent collapse.
ConvNeXt-T backbone.

MC-JEPA: Motion & Content JEPA

The motion estimation architecture uses a top-down hierarchical predictor that "warps" feature maps.

MC-JEPA: Optical Flow Estimation Results



Image-JEPA: uses masking, transformer, EMA weights

"SSL from images with a JEPA", M. Assran et al., arXiv:2301.08243
Jointly embeds a context and a number of neighboring patches.
Uses predictors.
Uses only masking.

Hierarchical Prediction at Multiple Time-Scales & Abstraction Levels

Low-level representations (JEPA-1) can only predict in the short term: too much detail, prediction is hard.
Higher-level representations (JEPA-2) can predict in the longer term: less detail, prediction is easier.

Hierarchical Planning with Uncertainty

Predictors use latent variables sampled from regularizers.
[Diagram: two-level rollout. Level 1: s[0] = Enc1(x), s[t+1] = Pred1(s[t], a[t], z1[t]) with latents z1 from regularizer R1, actions a[0..3] from the Actor, and costs C(s[2]), C(s[4]). Level 2: s2[0] = Enc2(s[0]), s2 states from Pred2(s2, a, z2) with latents z2 from regularizer R2 and cost C(s2[4]).]
Hierarchical Planning with Uncertainty

Hierarchical world model, hierarchical planning.
An action at level k specifies an objective for level k-1.
Predictions at higher levels are more abstract and longer-range.
This type of planning/reasoning, by minimizing a cost with respect to "action" variables, is what's missing from current architectures, including AR-LLMs, multimodal systems, learning robots, ...
[Diagram: a stack of levels with encoders Enc0/Enc1/Enc2, predictors Pred0/Pred1/Pred2, latents z0/z1/z2 drawn from regularizers R0/R1/R2, and actions a0/a1/a2; each level's action sets the cost C(s_{k-1}, a_k) for the level below.]

Steps towards Autonomous AI Systems


Self-Supervised Learning
To learn representations of the world
To learn predictive models of the world
Handling uncertainty in predictions
Joint-embedding predictive architectures
Energy-Based Model framework
Learning world models from observation
Like animals and human babies?
Reasoning and planning
That is compatible with gradient-based learning
No symbols, no logic → vectors & continuous functions

Positions / Conjectures
Prediction is the essence of intelligence
Learning predictive models of the world is the basis of common sense
Almost everything is learned through self-supervised learning
Low-level features, space, objects, physics, abstract representations…
Almost nothing is learned through reinforcement, supervision or imitation
Reasoning == simulation/prediction + optimization of objectives
Computationally more powerful than auto-regressive generation.
H-JEPA with non-contrastive training is the thing
Probabilistic generative models and contrastive methods are doomed.
Intrinsic cost & architecture drive behavior & determine what is learned
Emotions are necessary for autonomous intelligence
Anticipation of outcomes by the critic or world model+intrinsic cost.

Challenges for AI Research

Finding a general recipe for training Hierarchical Joint-Embedding Architecture-based world models from video, image, audio, text, ...
Designing surrogate costs to drive the H-JEPA to learn relevant representations (prediction is just one of them).
Integrating an H-JEPA into an agent capable of planning/reasoning.
Devising inference procedures for hierarchical planning in the presence of uncertainty (gradient-based methods, beam search, MCTS, ...).
Minimizing the use of RL to situations where the model or the critic are inaccurate and lead to unforeseen outcomes.
Scaling.
Thank you!
