Do Large Language Models Need Sensory Grounding For Meaning and Understanding?
Spoiler: YES!
Yann LeCun
Courant Institute & Center for Data Science, NYU
Meta – Fundamental AI Research
2023-03-24
[Title image generated with Make-A-Scene]
[Diagram: self-supervised learning architectures — encoder, learned representation, corruption of y into ŷ, stochastic encoder, predictor, and context.]
AR-LLMs
Have a constant number of computational steps between input and output: weak representational power.
Do not really reason. Do not really plan.
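To make the fixed-compute point concrete, the sketch below (plain Python; the model interface is a toy, hypothetical one) shows why auto-regressive generation spends the same amount of computation on every output token, however hard the underlying problem is: each token is one pass through the same fixed stack of layers.

```python
# Toy sketch of auto-regressive generation (hypothetical model interface).
# Every token costs exactly one forward pass through the same fixed stack
# of layers, so compute per token is constant regardless of task difficulty.

def forward(layers, tokens):
    """One fixed-depth pass over the current token sequence."""
    h = tokens
    for layer in layers:            # always the same number of steps
        h = layer(h)
    return h                        # scores for the next token

def generate(layers, prompt, n_tokens, sample_next):
    tokens = list(prompt)
    for _ in range(n_tokens):       # T output tokens -> T identical-cost passes
        tokens.append(sample_next(forward(layers, tokens)))
    return tokens
```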
How can babies learn how the world works?
How can teenagers learn to drive with about 20 hours of practice?
[Chart, after Emmanuel Dupoux: approximate ages (0-14 months) at which infants acquire concepts — perception: face tracking, biological motion; production/actions: emotional contagion, proto-imitation, rational goal-directed actions, crawling, walking; physics: gravity, inertia, stability, support, conservation of momentum; objects: object permanence, solidity, rigidity, shape constancy, natural kind categories.]
[Architecture diagram: the Actor finds optimal action sequences from percepts; a Short-Term Memory stores state-cost episodes.]
[Diagram: the Actor optimizes an action sequence a[0], ..., a[t], a[t+1], ..., a[T-1] through the world model.]
[Henaff et al., ICLR 2019], [Hafner et al., ICML 2019], [Chaplot et al., ICML 2021], [Escontrela et al., CoRL 2022], ...
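As a concrete illustration of this planning-by-optimization loop, here is a minimal sketch (PyTorch; world_model and cost_fn are assumed, hypothetical interfaces): instead of emitting actions one by one, the whole action sequence a[0..T-1] is refined by gradient descent so that the trajectory predicted by the world model minimizes the cost.

```python
import torch

def plan_actions(world_model, cost_fn, s0, horizon=10, steps=50, lr=0.1, action_dim=4):
    """Minimal model-predictive-control sketch (assumed interfaces):
    world_model(s, a) -> next predicted state representation
    cost_fn(s)        -> scalar cost of a state representation
    """
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        s = s0
        total_cost = 0.0
        for t in range(horizon):              # roll the world model forward
            s = world_model(s, actions[t])
            total_cost = total_cost + cost_fn(s)
        total_cost.backward()                 # gradients flow through the model
        opt.step()                            # refine the whole action sequence

    return actions.detach()                   # execute a[0], then re-plan
```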
[Diagram: Cost Module; horizontal axis: time or space.]
Contrastive methods
Push down on the energy of training samples, push up on the energy of contrastive samples.
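A minimal sketch (PyTorch; energy_fn is an assumed, hypothetical interface) of that contrastive recipe for an energy-based model: a margin loss that lowers the energy of observed pairs and raises the energy of mismatched, contrastive pairs.

```python
import torch
import torch.nn.functional as F

def contrastive_energy_loss(energy_fn, x, y, y_neg, margin=1.0):
    """energy_fn(x, y) -> scalar energy per pair (assumed interface).
    Push DOWN the energy of observed pairs (x, y);
    push UP the energy of contrastive pairs (x, y_neg), up to a margin."""
    e_pos = energy_fn(x, y)              # energies of compatible pairs
    e_neg = energy_fn(x, y_neg)          # energies of mismatched pairs
    return e_pos.mean() + F.relu(margin - e_neg).mean()
```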
Recommendations:
Abandon generative models in favor of joint-embedding architectures
Abandon Auto-Regressive generation
Abandon probabilistic models in favor of energy-based models
Abandon contrastive methods in favor of regularized methods
Abandon Reinforcement Learning in favor of model-predictive control
Use RL only when planning doesn’t yield the predicted outcome, to adjust the world model or the critic.
VICReg: Variance, Invariance, Covariance regularization
Variance: maintains the variance of each component of the representations
Covariance: decorrelates the components of the representations by penalizing off-diagonal terms of their covariance matrix
Invariance: minimizes the prediction error
Barlow Twins [Zbontar et al., arXiv:2103.03230], VICReg [Bardes, Ponce, LeCun, arXiv:2105.04906, ICLR 2022], VICRegL [Bardes et al., NeurIPS 2022], MCR2 [Yu et al., NeurIPS 2020], [Ma, Tsao, Shum, 2022]
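A minimal sketch (PyTorch; the weights and details are illustrative, not the paper's exact formulation) of those three terms applied to two batches of embeddings z_a, z_b: an invariance (prediction-error) term, a hinge that keeps the standard deviation of every embedding dimension from collapsing, and a penalty on off-diagonal covariance entries that decorrelates the dimensions.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """z_a, z_b: (batch, dim) embeddings of two views of the same content."""
    n, d = z_a.shape

    # Invariance: minimize the prediction (here: matching) error between views.
    inv = F.mse_loss(z_a, z_b)

    # Variance: keep the std of every embedding dimension above 1 (hinge).
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: decorrelate dimensions by penalizing off-diagonal covariances.
    za = z_a - z_a.mean(dim=0)
    zb = z_b - z_b.mean(dim=0)
    cov_a = (za.T @ za) / (n - 1)
    cov_b = (zb.T @ zb) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = (off_diag(cov_a) ** 2).sum() / d + (off_diag(cov_b) ** 2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov
```

In this regularized setup no negative samples are needed: the variance and covariance terms alone prevent the joint embedding from collapsing.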
Hierarchical JEPA: prediction at multiple levels of abstraction
Low-level representations can only predict in the short term: too much detail, prediction is hard.
Higher-level representations can predict in the longer term: less detail, prediction is easier.
[Diagram: a stack of two JEPAs (JEPA-1, JEPA-2) operating at the two levels of abstraction.]
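A minimal sketch (PyTorch; module sizes, the stop-gradient on the target branch, and all names are assumptions for illustration) of a single joint-embedding predictive architecture: both x and y are encoded, and a predictor is trained to predict y's representation from x's, so the prediction error is measured in representation space rather than in input space.

```python
import torch
import torch.nn as nn

class JEPA(nn.Module):
    """Toy joint-embedding predictive architecture (illustrative sizes)."""
    def __init__(self, in_dim=128, rep_dim=64):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU(),
                                   nn.Linear(rep_dim, rep_dim))
        self.enc_y = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU(),
                                   nn.Linear(rep_dim, rep_dim))
        self.predictor = nn.Linear(rep_dim, rep_dim)

    def forward(self, x, y):
        sx = self.enc_x(x)           # representation of the context x
        sy = self.enc_y(y).detach()  # target representation (stop-gradient is
                                     # one common anti-collapse choice)
        return self.predictor(sx), sy

model = JEPA()
x, y = torch.randn(32, 128), torch.randn(32, 128)
pred, target = model(x, y)
# The prediction error lives in representation space; a collapse-preventing
# regularizer (e.g. the VICReg terms above) would be added to this loss.
loss = nn.functional.mse_loss(pred, target)
```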
[Diagram: multi-step planning through the world model — the Actor optimizes actions a[0], a[1], a[2], a[3]; predicted states are scored by costs C(s[2]), C(s[4]); z2 denotes latent variables with regularizer R2.]
Hierarchical Planning with Uncertainty
Hierarchical world model, hierarchical planning.
An action at level k specifies an objective for level k-1.
[Diagram: a level-2 encoder Enc2(s[0]) and predictor Pred2 produce states s2 from the initial state s2initial; latent variables z2 with regularizer R2; level-2 actions a2; costs C(s2) and C(s1, a2).]
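A minimal sketch (plain Python; plan_level2 and plan_level1 are hypothetical planners, e.g. instances of the MPC routine sketched earlier) of that idea: each high-level action becomes the objective of the level below.

```python
def hierarchical_plan(plan_level2, plan_level1, s0, horizon2=4):
    """Two-level planning sketch (assumed interfaces):
    plan_level2(s, horizon)  -> list of high-level actions / subgoals a2
    plan_level1(s, subgoal)  -> (low-level action sequence, resulting state),
                                found by minimizing a cost C(s1, a2) that
                                measures progress toward the subgoal."""
    subgoals = plan_level2(s0, horizon2)       # level-2 actions define objectives
    plan, s = [], s0
    for a2 in subgoals:                        # each a2 is the level-1 objective
        actions1, s = plan_level1(s, a2)
        plan.extend(actions1)
    return plan
```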
Positions / Conjectures
Prediction is the essence of intelligence
  Learning predictive models of the world is the basis of common sense
Almost everything is learned through self-supervised learning
  Low-level features, space, objects, physics, abstract representations…
  Almost nothing is learned through reinforcement, supervision or imitation
Reasoning == simulation/prediction + optimization of objectives
  Computationally more powerful than auto-regressive generation
H-JEPA with non-contrastive training is the thing
  Probabilistic generative models and contrastive methods are doomed
Intrinsic cost & architecture drive behavior & determine what is learned
Emotions are necessary for autonomous intelligence
  Anticipation of outcomes by the critic or world model + intrinsic cost