Parsimony and Self-Consistency
AI in the past 10 years: learning models of the world for perception and prediction.
[Diagram: Perceive, Learn, Predict loop]
A More Magical Decade: the True Origin of Intelligence
• 1943, Artificial Neural Networks, Warren McCulloch and Walter Pitts
• 1944, Game Theory, John von Neumann
• 1940s, Turing Machine, Alan Turing et al.
• 1948, Information Theory, Claude Shannon
• 1948, Feedback Control & Cybernetics, Norbert Wiener
[Diagram: Perceive, Learn, Predict, Communicate loop]
The Magical Decade Since 2012
A Special Case: Fitting Class Labels via a Deep Network
First Phase: Training Perception/Discriminative Models
[Figure: data x on a mixture of submanifolds M_1, M_2, ..., M_j ⊂ R^D are mapped by a deep network f(x, θ) to one-hot class labels y ∈ R^k, e.g. apple ↦ [1, 0, 0]^T, bird ↦ [0, 1, 0]^T, cat ↦ [0, 0, 1]^T.]
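To make this first phase concrete, here is a minimal, self-contained sketch (NumPy, with purely hypothetical toy data and a single linear layer standing in for a deep network f(x, θ)) of fitting one-hot class labels by minimizing cross-entropy:

```python
# Minimal sketch: fit one-hot class labels y with a model f(x, theta) via
# cross-entropy. Data, shapes, and the linear "network" are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
D, k, m = 32, 3, 300                      # input dim, #classes (apple/bird/cat), #samples
X = rng.normal(size=(m, D))               # toy data standing in for images
labels = rng.integers(0, k, size=m)
Y = np.eye(k)[labels]                     # one-hot targets, e.g. apple -> [1, 0, 0]

W = rng.normal(scale=0.1, size=(D, k))    # theta: one linear layer for brevity
lr = 0.5
for step in range(200):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))      # cross-entropy
    grad = X.T @ (P - Y) / m                                    # gradient w.r.t. W
    W -= lr * grad
print("final cross-entropy:", loss)
```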
Generating Natural Language: ChatGPT
Method: model optimization and alignment (to human preferences)
• Pretrain a large language model: GPT-3
• GPT-3 + [instruction data] → GPT-3.5
• GPT-3.5 + [ranking data] → fine-tune a reward model (see the ranking-loss sketch below)
• Reward model + [RL] → ChatGPT
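The reward-model step above can be illustrated with the standard pairwise preference (ranking) loss, which maximizes log σ(r(chosen) − r(rejected)). This is a generic sketch with a hypothetical linear reward model on made-up feature vectors, not the actual ChatGPT pipeline:

```python
# Sketch of the pairwise ranking loss used to fit a reward model from human
# preference data: maximize log sigmoid(r(chosen) - r(rejected)).
# The feature vectors and the linear reward model are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 500
chosen = rng.normal(loc=0.2, size=(n, d))     # features of preferred responses
rejected = rng.normal(loc=0.0, size=(n, d))   # features of dispreferred responses

w = np.zeros(d)                               # reward-model parameters
lr = 0.1
for step in range(300):
    margin = chosen @ w - rejected @ w        # r(chosen) - r(rejected)
    p = 1.0 / (1.0 + np.exp(-margin))         # sigmoid of the margin
    loss = -np.mean(np.log(p + 1e-12))        # negative log-likelihood of preferences
    grad = -((1 - p)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad
print("ranking loss:", loss)
```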
System:
• # parameters: 175B (?)
• Pre-training data: very high quality (45 TB filtered down to 800 GB in GPT-3)
• # GPUs: about 10,000 A100s (costing about 250 million USD)
• Inference cost: a few cents per chat
ChatGPT Is An Engineering Marvel
• A big model is necessary but not sufficient
• Big, high-quality data is necessary
• It is relatively easy to train a model for a specific domain, but such models lack the generality of ChatGPT
Prediction/Generation
Objective of Learning from High Dimensional Data
Find a compact and structured representation for the data perceived from the external world.
Low-Dim Autoencoding for High-Dim Data
Parsimony: What Should We Learn from High-Dimensional Data?

One popular working assumption is that the distribution of each class has relatively low-dimensional intrinsic structures. Hence we may assume the distribution D_j of each class is supported on a low-dimensional submanifold, say M_j with dimension d_j ≪ D, and the distribution D of x is supported on the mixture of those submanifolds, M = ∪_{j=1}^k M_j, in the high-dimensional ambient space R^D, as illustrated in Figure 1 (left).

Goal: for the data X = [x_1, ..., x_m] ⊂ R^D, learn a linear discriminative representation (LDR) Z = [z_1, ..., z_m] ⊂ R^d (d ≪ D) via a map z = f(x, θ), such that the features of different classes lie on maximally uncorrelated (incoherent) subspaces {S_j}:

$$X \subset \mathbb{R}^D \xrightarrow{\;f(x,\theta)\;} Z \subset \mathbb{R}^d \xrightarrow{\;g(z,\eta)\;} \hat{X} \approx X \in \mathbb{R}^D. \tag{2}$$

[Figure 1. Left and Middle: the distribution D of high-dimensional data x ∈ R^D is supported on low-dimensional submanifolds M_j, one per class; we learn a map f(x, θ) such that the features z_i = f(x_i, θ) lie on a union of maximally uncorrelated subspaces {S_j}. Right: cosine similarity between the features learned by our method for the CIFAR-10 training dataset; each class has 5,000 samples, and their features span a subspace of over 10 dimensions (see Figure 3(c)).]
With the manifold assumption in mind, we want to learn a mapping z = f(x, θ) that maps each of the submanifolds M_j ⊂ R^D to a linear subspace S_j ⊂ R^d (see Figure 1, middle). To do so, we require our learned representation to have the following properties:
1. Between-Class Discriminative: Features of samples from different classes/clusters should be highly uncorrelated and belong to different low-dimensional linear subspaces.
2. Within-Class Compressible: Features of samples from the same class/cluster should be relatively correlated in the sense that they belong to a low-dimensional linear subspace (see the numerical sketch after this list).
3. Maximally Diverse Representation: The dimension (or variance) of the features of each class/cluster should be as large as possible, as long as they stay uncorrelated from the other classes.
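Properties 1 and 2 can be checked numerically in the spirit of Figure 1 (right) by comparing within-class and between-class cosine similarities of the features. The sketch below uses synthetic features drawn from random low-dimensional subspaces (an assumption made purely for illustration), not features learned by the method:

```python
# Sketch: within-class vs. between-class cosine similarity of features Z,
# the quantity visualized in Figure 1 (right). The features are synthetic:
# each class occupies its own low-dimensional subspace of R^d.
import numpy as np

rng = np.random.default_rng(2)
d, k, per_class = 30, 3, 100
Z, y = [], []
for j in range(k):
    basis = np.linalg.qr(rng.normal(size=(d, 5)))[0]      # orthonormal basis of a 5-dim subspace S_j
    Z.append(rng.normal(size=(per_class, 5)) @ basis.T)   # class-j samples living in S_j
    y += [j] * per_class
Z, y = np.vstack(Z), np.array(y)

Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
C = np.abs(Zn @ Zn.T)                                     # |cosine similarity| matrix
same = y[:, None] == y[None, :]
print("within-class :", C[same].mean())                   # relatively large (property 2)
print("between-class:", C[~same].mean())                  # noticeably smaller (property 1)
```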
Notice that, although the intrinsic structures of each class/cluster may be low-dimensional, they are by no means simply linear in their original representation x. Here the subspaces {S_j} can be viewed as nonlinear generalized principal components of the data.
Parsimony: Information Gain and Coding Rate for Mixed Representations

Information gain: how do we measure the goodness of a learned representation?
Volume: the number of spheres ("beans") needed to cover the feature set; information: the number of bits needed to specify an instance.
The coding rate reduction measures the information gain for any mixture of subspaces/Gaussians: the whole is to be maximally greater than the sum of the parts.
The optimal representation maximizes the coding rate reduction (MCR²):

$$\Delta R(Z, \Pi, \epsilon) \;\doteq\; R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi), \qquad R(Z, \epsilon) = \tfrac{1}{2}\log\det\!\Big(I + \tfrac{d}{m\epsilon^2}\, Z Z^\top\Big). \tag{3}$$
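The rate and rate-reduction quantities in (3) translate directly into code. Below is a minimal NumPy sketch following the MCR² formulation; ε, the toy features, and the class sizes are arbitrary choices for illustration, not values from the talk:

```python
# Sketch: coding rate R(Z, eps) and coding rate reduction Delta R = R - R_c,
# following the MCR^2 objective (eq. (3)). Z stores one feature per column (d x m).
import numpy as np

def coding_rate(Z, eps=0.5):
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    d, m = Z.shape
    R = coding_rate(Z, eps)                       # rate of the whole feature set
    Rc = 0.0                                      # sum of class-wise rates
    for j in np.unique(labels):
        Zj = Z[:, labels == j]
        mj = Zj.shape[1]
        Rc += (mj / (2 * m)) * np.linalg.slogdet(
            np.eye(d) + (d / (mj * eps**2)) * Zj @ Zj.T)[1]
    return R - Rc                                 # "whole minus sum of parts"

rng = np.random.default_rng(3)
labels = np.repeat([0, 1, 2], 100)
Z = rng.normal(size=(20, 300))
Z /= np.linalg.norm(Z, axis=0)                    # features normalized onto the sphere
print("Delta R =", rate_reduction(Z, labels))
```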
Parsimony: Deep Networks as Nature's Implementation of an Optimization Algorithm

ReduNet: A White-Box Deep Network via Rate Reduction
A white-box, forward-constructed, multi-channel (convolutional) deep neural network obtained by maximizing the rate reduction via projected gradient flow:

$$Z_{\ell+1} \;\propto\; Z_\ell + \eta \cdot \frac{\partial\, \Delta R(Z, \Pi, \epsilon)}{\partial Z}\bigg|_{Z_\ell} \quad \text{s.t.} \quad Z_{\ell+1} \subset \mathbb{S}^{d-1}, \tag{5}$$

$$\frac{\partial R(Z)}{\partial Z}\bigg|_{Z_\ell} = \alpha\,(I + \alpha Z_\ell Z_\ell^{*})^{-1} Z_\ell \;\doteq\; \underbrace{E_\ell\, Z_\ell}_{\text{auto-regression residual}} \;\approx\; \alpha \Big[Z_\ell - \alpha \underbrace{Z_\ell\,(Z_\ell^{*} Z_\ell)}_{\text{self-attention head}}\Big]. \tag{6}$$
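A minimal sketch of one such layer as a projected gradient step, per (5) and (6). Only the expansion operator E_ℓ (the gradient of the global rate R) is included; the class-wise compression operators and the multi-channel convolutional structure of the full ReduNet are omitted, and ε, η, and the data are arbitrary:

```python
# Sketch of one ReduNet-style layer: a projected gradient step on the coding
# rate R(Z), cf. eqs. (5)-(6). Only the expansion term E_l Z_l is shown.
import numpy as np

def redunet_layer(Z, eps=0.5, eta=0.1):
    d, m = Z.shape
    alpha = d / (m * eps**2)
    E = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)   # E_l = alpha (I + alpha Z Z^T)^{-1}
    Z_next = Z + eta * (E @ Z)                               # gradient step, dR/dZ = E_l Z_l
    return Z_next / np.linalg.norm(Z_next, axis=0)           # project back onto the sphere S^{d-1}

rng = np.random.default_rng(4)
Z = rng.normal(size=(20, 300))
Z /= np.linalg.norm(Z, axis=0)
for _ in range(10):                                          # stacking layers = iterating the flow
    Z = redunet_layer(Z)
```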
Transform
The map z = f(x, θ) transforms the mixture of low-dimensional submanifolds M = ∪_{j=1}^k M_j ⊂ R^D into a mixture of incoherent linear subspaces S_j ⊂ R^d (cf. Figure 1).

Comparison with the conventional practice of NNs (since McCulloch-Pitts, 1943): conventional DNNs are designed and trained end to end as black boxes, whereas ReduNets are white-box networks constructed forward, layer by layer, from the rate-reduction objective.
One Low-Dim Nonlinear Submanifold: Nonlinear PCA

$$X \subset \mathcal{M} \xrightarrow{\;f(x,\theta)\;} Z \subset \mathcal{S} \xrightarrow{\;g(z,\eta)\;} \hat{X} \subset \mathcal{M}, \qquad \mathcal{M} \subset \mathbb{R}^D,\; \mathcal{S} \subset \mathbb{R}^d. \tag{15}$$

Solve the corresponding optimization problem over the encoder f(x, θ) and the decoder g(z, η).
A big question: do we need to measure in the data space of x?
Self-Consistency: Closed-Loop Feedback and Self-Critiquing Game

[Diagram: the closed-loop error between the data distributions X_j and X̂_j is measured in feature space.]

Yes! Measure the difference between X_j and X̂_j through their features Z_j and Ẑ_j:

$$X_j \xrightarrow{\;f(x,\theta)\;} Z_j \xrightarrow{\;g(z,\eta)\;} \hat{X}_j \xrightarrow{\;f(x,\theta)\;} \hat{Z}_j, \quad j = 1, \dots, k, \tag{27}$$

with "their distance" measured by the rate reduction:

$$\Delta R\big(Z_j, \hat{Z}_j\big) \;\doteq\; R\big(Z_j \cup \hat{Z}_j\big) - \tfrac{1}{2}\Big(R(Z_j) + R(\hat{Z}_j)\Big), \quad j = 1, \dots, k. \tag{28}$$
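Equation (28) is only a few lines on top of a coding-rate function. The sketch below redefines the same hypothetical coding_rate helper used earlier so that it stays self-contained; feature matrices store one feature per column:

```python
# Sketch: the rate-reduction "distance" of eq. (28) between features Z_j of
# real samples X_j and features Zhat_j of their reconstructions Xhat_j.
import numpy as np

def coding_rate(Z, eps=0.5):
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * Z @ Z.T)[1]

def rate_reduction_distance(Zj, Zhat_j, eps=0.5):
    union = np.concatenate([Zj, Zhat_j], axis=1)             # Z_j union Zhat_j (column-wise)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Zj, eps) + coding_rate(Zhat_j, eps))

rng = np.random.default_rng(5)
Zj = rng.normal(size=(20, 100))
Zj /= np.linalg.norm(Zj, axis=0)
Zhat_j = Zj + 0.05 * rng.normal(size=Zj.shape)               # a close reconstruction
Zhat_j /= np.linalg.norm(Zhat_j, axis=0)
print("Delta R(Z_j, Zhat_j) =", rate_reduction_distance(Zj, Zhat_j))
```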
Self-Consistency: Closed-Loop Self-Critiquing Game for Error Correction
Objective of Learning from High-Dimensional Data
Fitting Class Labels via a Deep Network

[Figure: the compressive map f(x, θ) takes data x on the submanifolds M_j ⊂ R^D to one-hot labels y.]

Structured Representations: no neural collapse, no mode collapse!
Experiments: Incremental and Continuous Learning via Closed-Loop Transcription

Incremental learning of structured memory, one class at a time [9]. No catastrophic forgetting!

[9] Incremental Learning of Structured Memory via Closed-Loop Transcription, S. Tong, Yi Ma, et al., ICLR 2023 (arXiv:2202.05411).
Experiments: Incremental and Continuous Learning (Empirical Verification)

Incremental Learning via Closed-Loop Transcription: memory consolidation via review (ICLR 2023); review strengthens the memory rather than erasing it [10].

[10] Incremental Learning of Structured Memory via Closed-Loop Transcription, S. Tong, Yi Ma, et al., ICLR 2023 (arXiv:2202.05411).
Experiments: Unsupervised or Self-Supervised Learning via Closed-Loop Transcription

Unsupervised learning of structured memory, one sample at a time [11]. No catastrophic forgetting!

[11] Unsupervised Learning of Structured Representations via Closed-Loop Transcription, S. Tong, Yann LeCun, and Yi Ma, arXiv:2210.16782, 2022.
Experiments: Unsupervised or Self-Supervised Learning

The structured representations and structured memory of the system demonstrate the same characteristics as (visual) memory in nature:
• Sparse coding in the visual cortex (Olshausen & Field, Nature 1996).
[Figure: distribution P(a_i) of the sparse coefficients (log scale) across spatial-frequency bands (cycles/window), as in Olshausen & Field's sparse-coding results.]
Connections to Neuroscience?
• Parsimony: what’s in neuroscience to verify this principle?
[Diagram: external multi-modal, high-dimensional sensory data x (visual, tactile, auditory) with low-dimensional structures pass through a sensor/encoder f(x, θ) to an internal compact and structured representation z (parsimony: sparse codes, subspaces, grid/place cells); a decoder g(z, η) closes the loop with x̂ = g(z) and ẑ = f(x̂), and self-consistency requires z ≈ ẑ = f(g(z)). Brain circuits? Perceive, Learn, Predict.]
Challenging Mathematical Problems?
Open Directions: Extensions and Generalizations?
Extensions and Connections
• How to scale up to hundreds or thousands of classes? (variational forms for rate reduction, ...)
A unified purpose: maximize “information gain” with every unit, at every stage!
World Model: From Memory to Language?
A unified purpose: maximize "information gain" at both the individual and the community level!
[Diagram: Perceive, Learn, Predict, Communicate loop]
• ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction, Ryan Chan, Yaodong Yu, Yi Ma, et al., JMLR, 2022 (arXiv:2104.10446).
Open the black box and close the loop for intelligence.
A compressive closed-loop transcription is a universal learning machine for a compact and structured model of real-world data.