Parsimony and Self-Consistency (with Translation)

The document discusses the evolution of artificial intelligence (AI) and neural networks, highlighting significant developments since 2012, including advancements in deep learning and generative models. It emphasizes the importance of parsimony and self-consistency in learning from high-dimensional data, aiming to create compact and structured representations. The document also touches on the future of AI, suggesting a shift from artificial to autonomous intelligence and the need for a universal computational mechanism for intelligence.


On the Principles of Parsimony and Self-Consistency:

From Artificial Intelligence to Autonomous Intelligence

Prof. Yi Ma
Director, Institute of Data Science (IDS), University of Hong Kong
Professor, EECS, UC Berkeley

“Everything should be made as simple as possible, but not any simpler.”
-- Albert Einstein
80 Years of Evolution of (Artificial) Neural Networks

Slide courtesy of Prof. René Vidal

A Magical Decade for AI Since 2012: Revolution of Deep Networks/Learning

The “Origin” of Artificial Intelligence (AI)?
The Dartmouth Project in 1956?
Organized by John McCarthy, Marvin Minsky, Nathaniel Rochester, and
Claude Shannon, with attendees such as Herbert Simon and Allen Newell.

That “AI” program focused mainly on higher-level intelligent functions:
symbolic reasoning, causal and logical deduction, inference, and problem solving.

AI in the past 10 years: learning models of the world for perception and prediction.

(Diagram: Perceive, Learn, Predict)
A More Magical Decade: the True Origin of Intelligence
• 1943, Artificial Neural Networks, Warren McCulloch and Walter Pitts
• 1944, Game Theory, John von Neumann
• 1940s, Turing Machine, Alan Turing et al.
• 1948, Information Theory, Claude Shannon
• 1948, Feedback Control & Cybernetics, Norbert Wiener

(Diagram: Perceive, Learn, Predict, Communicate)
The Magical Decade Since 2012
First Phase: Training Perception/Discriminative Models

A Special Case: Fitting Class Labels via a Deep Network

Figure: a black-box DNN for classification. Data x ∈ R^D, lying on submanifolds M_1, M_2, ..., M_j of M, are mapped by f(x, θ) to class labels y (“apple”, “bird”, “cat”, ...), where y is the class label of x represented as a “one-hot” vector in R^k. The goal is to learn a nonlinear mapping f(·, θ): x ↦ y, say modeled by a deep network, using the cross-entropy (CE) loss.

Perception and Recognition:
• Speech Recognition
• Object Recognition
• Image Segmentation and Parsing
• ...
First Phase: Training Perception/Discriminative Models
Speech recognition, object detection, segmentation, and parsing…
The Magical Decade Since 2012
Second Phase: Training Generative/Predictive Models

Generation and Prediction:
• Generative Adversarial Networks
• Diffusion and Denoising Processes
• Autoregression and Autoencoding
• ……
Generating Natural Images: Midjourney or Stability.ai
The Diffusion and Denoising Method:
based on the Laplace method or Langevin dynamics
(over 250 years or 100 years old, respectively).

Trained on at least billions of images paired with texts, on thousands of GPUs.
Generating Natural Language: ChatGPT

Method (pretrain a large model, then align it with human preferences):
• Pretrain a large language model
• GPT-3 –[instruction data]->
• GPT-3.5 –[rank data] ->
• Finetune a reward model [RL] ->
• ChatGPT

System:
• # parameters: 175B (?)
• Pre-training data: very high quality (45T filtered down to 800G for GPT-3)
• # GPUs: about 10,000 A100s (costing about 250 million USD)
• Inference cost: a few cents per chat
ChatGPT Is an Engineering Marvel
• A big model is necessary but not sufficient
• Big, high-quality data is necessary
• It is relatively easy to train a model for a specific domain, but such models lack the generality of ChatGPT

• Situation in China: many claim to run models with on the order of 1 trillion parameters, yet none has succeeded even at the 100-billion scale. Too much PR, no public demo/prototype.

Like fine-tuning ChatGPT: you get what you reward!

A Chinese Counterpart: 对话写作猫 (秘塔科技)
The first recently released, independently developed, lightweight Chinese prototype, by a former student: Kerui Min
• Model size only 10 billion parameters
• Learned from 100 million Chinese articles
• Good generative quality
• Limited memorization of facts, due to limited GPU resources
  (dozens of A100 GPUs vs. 10,000 A100s for GPT-3.5…)

You may try it yourself:
• Invitation code: meta
• xiezuocat.com/chat
What Is Missing? What Comes Next?
• Theory for deep networks: from black boxes to white boxes?
• Function of learned models: from open-loop to closed-loop?
• Science of intelligence: from artificial to autonomous/natural?

Towards a universal computational mechanism for intelligence at all scales?

(Diagram: Learn, Perceive/Recognize, Predict/Generate)
Objective of Learning from High-Dimensional Data
Parsimony: What to Learn from High-Dimensional Data?
Seek a compact and structured representation of the data sensed from the external world.

Assumption: the data X = [x_1, ..., x_m] ⊂ R^D lie on one or multiple low-dimensional submanifolds: X ⊂ M = ∪_{j=1}^k M_j in the high-dimensional ambient space R^D. That is, the distribution D_j of each class has its support on a low-dimensional submanifold M_j of dimension d_j ≪ D, and the distribution D of x is supported on the mixture M of those submanifolds.

Goal: learn a linear discriminative representation (LDR) Z = [z_1, ..., z_m] ⊂ R^d (d ≪ D), a union of maximally uncorrelated subspaces {S_j}, via a low-dimensional autoencoding of the high-dimensional data:

    X ⊂ R^D  --f(x,θ)-->  Z ⊂ R^d  --g(z,η)-->  X̂ ≈ X ∈ R^D.    (2)

With the manifold assumption in mind, we want to learn a mapping z = f(x, θ) that maps each of the submanifolds M_j ⊂ R^D to a linear subspace S_j ⊂ R^d. To do so, we require the learned representation to have the following properties:

1. Between-Class Discriminative: features of samples from different classes/clusters should be highly uncorrelated and belong to different low-dimensional linear subspaces.
2. Within-Class Compressible: features of samples from the same class/cluster should be relatively correlated, in the sense that they belong to a low-dimensional linear subspace.
3. Maximally Diverse Representation: the dimension (or variance) of the features for each class/cluster should be as large as possible, as long as they stay uncorrelated with the other classes.

Notice that although the intrinsic structures of each class/cluster may be low-dimensional, they are by no means simply linear in the original representation x. Here the subspaces {S_j} can be viewed as nonlinear generalized principal components of x.

Figure 1: the distribution D of high-dimensional data x ∈ R^D is supported on a manifold M, with classes on low-dimensional submanifolds M_j; we learn a map f(x, θ) such that z_i = f(x_i, θ) lie on a union of maximally uncorrelated subspaces {S_j}. Cosine similarity between features learned by our method on the CIFAR10 training dataset: each class has 5,000 samples, and their features span a subspace of over 10 dimensions.
Parsimony: Information Gain and Coding Rate for Mixed Representations

Information gain: how to measure the goodness of a learned representation?
Volume: # of spheres/balls; information: # of bits to specify an instance.

“The whole is more than the sum of its parts.”
-- Aristotle

Parsimony, measured by rate reduction. Multi-class data may be partitioned into subsets Z = Z_1 ∪ Z_2 ∪ ⋯ ∪ Z_k, with memberships encoded by Π = {Π_j}. The difference between the average coding rate of the whole and that of the parts,

    ΔR(Z, Π, ε) = (1/2) log det( I + (d/(mε²)) Z Zᵀ )
                  − Σ_{j=1}^k (tr(Π_j)/(2m)) log det( I + (d/(tr(Π_j) ε²)) Z Π_j Zᵀ ),

with the first term R(Z) and the second R_c(Z | Π, ε), measures the information gain for any mixture of subspaces/Gaussians: the difference in rate distortion between the whole and the parts.

The optimal representation maximizes the coding rate reduction (MCR²):

    max_θ ΔR(Z(θ), Π, ε) = R(Z(θ)) − R_c(Z(θ) | Π, ε),  s.t.  Z ⊂ S^{d−1}.    (4)

The whole is to be maximally greater than the sum of its parts!
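As a concrete illustration (mine, not from the slides), the rate-reduction objective ΔR above takes only a few lines of NumPy. The function names `coding_rate` and `rate_reduction`, the toy data, and the choice ε = 0.5 are all assumptions for this sketch; it also assumes each Π_j is a diagonal 0/1 membership matrix, as in the MCR² formulation.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z): rate needed to encode the columns of Z (d x m) up to distortion eps."""
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (m * eps**2) * Z @ Z.T)[1]

def rate_reduction(Z, Pis, eps=0.5):
    """Delta R(Z, Pi, eps) of Eq. (4): rate of the whole minus rate of the parts."""
    d, m = Z.shape
    parts = 0.0
    for Pi in Pis:  # Pi: (m x m) diagonal 0/1 membership matrix of one class
        tr = np.trace(Pi)
        parts += tr / (2 * m) * np.linalg.slogdet(
            np.eye(d) + d / (tr * eps**2) * Z @ Pi @ Z.T)[1]
    return coding_rate(Z, eps) - parts

# Two classes on orthogonal axes score higher than two collapsed classes:
Z_orth = np.array([[1., 1., 0., 0.],
                   [0., 0., 1., 1.],
                   [0., 0., 0., 0.]])
Z_flat = np.array([[1., 1., 1., 1.],
                   [0., 0., 0., 0.],
                   [0., 0., 0., 0.]])
Pis = [np.diag([1., 1., 0., 0.]), np.diag([0., 0., 1., 1.])]
assert rate_reduction(Z_orth, Pis) > rate_reduction(Z_flat, Pis)
```

Note the collapsed configuration gives ΔR = 0 exactly: when all classes share one subspace, the whole is no bigger than the sum of its parts, which is precisely what MCR² penalizes.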
Parsimony: Deep Neural Networks Are Nature’s Optimization Algorithms

ReduNet: a white-box, forward-constructed, multi-channel (convolutional) deep neural network obtained by maximizing the rate reduction via projected gradient flow:

    Z_{ℓ+1} ∝ Z_ℓ + η · ∂ΔR(Z, Π, ε)/∂Z |_{Z_ℓ},  s.t.  Z_{ℓ+1} ⊂ S^{d−1},    (5)

where the gradient of the expansion term is

    ∂R(Z)/∂Z |_{Z_ℓ} = α(I + αZ_ℓZ_ℓᵀ)^{−1} Z_ℓ ≐ E_ℓ Z_ℓ ≈ α[ Z_ℓ − αZ_ℓ(Z_ℓᵀZ_ℓ) ],    (6)

whose two terms play the roles of an auto-regression residual and a self-attention head, respectively.

ReduNet: A White-box Deep Network from Rate Reduction (JMLR, 2022).
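A minimal sketch (mine, not the authors' released code) of one such forward-constructed layer: the expansion operator E_ℓ of Eq. (6) applied as one gradient-ascent step on R(Z), followed by projection back onto the sphere S^{d−1}. A full ReduNet layer also applies per-class compression operators, which this simplified sketch omits.

```python
import numpy as np

def redunet_layer(Z, eta=0.5, eps=0.5):
    """One simplified ReduNet layer: a projected-gradient step on R(Z).

    Z is d x m with unit-norm columns. Only the expansion operator
    E_l = alpha (I + alpha Z Z^T)^{-1} from Eq. (6) is used here; the
    per-class compression operators of the full network are omitted.
    """
    d, m = Z.shape
    alpha = d / (m * eps**2)
    E = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)  # expansion operator
    Z_next = Z + eta * (E @ Z)                              # one gradient-ascent step
    return Z_next / np.linalg.norm(Z_next, axis=0)          # project back to S^{d-1}

# Each "layer" is one optimization iteration; stacking layers unrolls the flow.
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((8, 4))
Z0 = Z0 / np.linalg.norm(Z0, axis=0)
Z1 = redunet_layer(Z0)
```

The point of the construction is that the layer's operators are computed in a forward fashion from the data, rather than initialized randomly and back-propagated.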


Parsimony: White-Box Objectives, Architectures, and Representations

Comparison with the conventional practice of neural networks (since McCulloch-Pitts, 1943):

                          Conventional DNNs        ReduNets
    Objectives            input/output fitting     information gain
    Deep architectures    trial & error            iterative optimization
    Layer operators       empirical                projected gradient
    Shift invariance      CNNs + augmentation      invariant ReduNets
    Initializations       random/pre-designed      forward unrolled
    Training/fine-tuning  back prop                forward/back prop
    Interpretability      black box                white box
    Representations       hidden/latent            incoherent subspaces
Parsimony: White-Box Sparse-Coding Convolutional Networks
Revisiting Sparse Convolutional Model for Visual Recognition, NeurIPS 2022.

Claim: the performance of such sparse-dictionary (SD) learning networks is as good as ResNet on all datasets, but more stable to noise!
Self-Consistency: How to Learn Correctly & Autonomously?

Goal: transcribe the data X ⊂ ∪_{j=1}^k M_j onto an LDR Z ⊂ ∪_{j=1}^k S_j:

    f(M_j) = S_j (linear),  with S_i ⊥ S_j (discriminative)  and  g(f(M_j)) = M_j (auto-embedding).    (7)

Autoencoding of multiple low-dimensional nonlinear submanifolds:

    X ⊂ ∪_{j=1}^k M_j  --f(x,θ)-->  ∪_{j=1}^k Z_j ⊂ S_j  --g(z,η)-->  X̂ ⊂ ∪_{j=1}^k M_j.    (8)
Self-Consistency: Principal Component Analysis (Auto-Encoding)

One low-dimensional linear subspace: principal component analysis (PCA),

    X ⊂ S ⊂ R^D  --Vᵀ-->  Z ⊂ R^d  --V-->  X̂ ⊂ S ⊂ R^D.    (13)

Solve the following optimization problem:

    min_V ‖X − X̂‖²₂  s.t.  X̂ = V Vᵀ X,  V ∈ O(D, d).    (14)

One low-dimensional nonlinear submanifold: nonlinear PCA,

    X ⊂ M ⊂ R^D  --f(x,θ)-->  Z ⊂ S ⊂ R^d  --g(z,η)-->  X̂ ⊂ M ⊂ R^D.    (15)

Solve the following optimization problem:

    min_{θ,η} ‖X − X̂‖²₂ = d(X, X̂)²  s.t.  X̂ = g(f(X, θ), η).    (16)

What is the right distance d(X, X̂), say, for images?
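For the linear case (13)-(14), the optimum is available in closed form: V stacks the top-d left singular vectors of X. A small NumPy sketch (function name and toy data are mine):

```python
import numpy as np

def pca_autoencode(X, d):
    """Solve min_V ||X - V V^T X||^2 with V in O(D, d), as in Eq. (14).

    The optimal V stacks the top-d left singular vectors of X (the d
    principal components); encoding is Z = V^T X, decoding is X_hat = V Z.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    V = U[:, :d]            # D x d, orthonormal columns
    Z = V.T @ X             # d x m low-dimensional code
    X_hat = V @ Z           # D x m reconstruction
    return V, Z, X_hat

# Data lying exactly on a 2-dim subspace of R^5 is reconstructed exactly:
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 100))
V, Z, X_hat = pca_autoencode(X, d=2)
assert np.allclose(X_hat, X)
```

Nonlinear PCA in (15)-(16) replaces the linear maps Vᵀ and V with a learned encoder f and decoder g, which is exactly where the question of the right distance d(X, X̂) arises.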
Self-Consistency: Closed-Loop Error Feedback

A big question: do we need to measure errors in the data space x at all? Is it possible to measure everything only in the feature space z?

    X  --f(x,θ)-->  Z  --g(z,η)-->  X̂  --f(x,θ)-->  Ẑ.    (26)

Yes! Measure the difference between the data distributions X_j and X̂_j through their features Z_j and Ẑ_j:

    X_j  --f(x,θ)-->  Z_j  --g(z,η)-->  X̂_j  --f(x,θ)-->  Ẑ_j,  j = 1, ..., k,    (27)

with “their distance” measured by the rate reduction:

    ΔR(Z_j, Ẑ_j) ≐ R(Z_j ∪ Ẑ_j) − (1/2)[ R(Z_j) + R(Ẑ_j) ],  j = 1, ..., k.    (28)
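A sketch (mine) of the feature-space distance in Eq. (28), reusing the coding rate R: the distance is zero when the replayed features coincide with the originals, and positive when they drift into a different subspace. The function names and the toy vectors are assumptions for illustration.

```python
import numpy as np

def R(Z, eps=0.5):
    """Coding rate of the feature set Z (d x m) at distortion eps."""
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (m * eps**2) * Z @ Z.T)[1]

def dR(Zj, Zj_hat, eps=0.5):
    """Eq. (28): rate-reduction distance between Z_j and its replay Z_hat_j."""
    union = np.concatenate([Zj, Zj_hat], axis=1)   # Z_j union Z_hat_j
    return R(union, eps) - 0.5 * (R(Zj, eps) + R(Zj_hat, eps))

e1 = np.array([[1.], [0.], [0.]])
e2 = np.array([[0.], [1.], [0.]])
Zj = np.hstack([e1, e1])                 # features of one class
assert abs(dR(Zj, Zj)) < 1e-9            # perfect replay: zero distance
assert dR(Zj, np.hstack([e2, e2])) > 0   # drifted replay: positive distance
```

Everything here is computed from the features alone, which is the whole point: the generator g never needs a pixel-space loss.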
Self-Consistency: Closed-Loop Self-Critiquing Game

Recall the special case of fitting class labels via a black-box deep network: f(x, θ) is compressive, mapping x to a class label y represented as a “one-hot” vector in R^k by minimizing the cross-entropy (CE) loss,

    min_θ CE(θ, x, y) ≐ −E[⟨y, log f(x, θ)⟩] ≈ −(1/m) Σ_{i=1}^m ⟨y_i, log f(x_i, θ)⟩.

See “Prevalence of neural collapse during the terminal phase of deep learning training,” Papyan, Han, and Donoho, 2020.


Self-Consistency: Closed-Loop Feedback and Game

f is both an encoder and a sensor; g is both a decoder and a controller. They form a closed-loop system for feedback and game.

A closed-loop notion of “self-consistency” between Z and Ẑ is achieved by a self-critiquing game between the sensor f and the generator g:

    D(X, X̂) ≐ max_θ min_η Σ_{j=1}^k ΔR( f(X_j, θ), f(g(f(X_j, θ), η), θ) ),    (29)

with Z_j(θ) = f(X_j, θ) and Ẑ_j(θ, η) = f(g(f(X_j, θ), η), θ).
Self-Consistency: Closed-Loop Control and Pursuit-Evasion Game

CTRL: dual roles of the encoder and decoder. f is both an encoder and a sensor; g is both a decoder and a controller. Together they form a closed-loop feedback control system.
Experiments: Supervised Learning on Real-World Datasets

Structured representations:
no neural collapse, no mode collapse!
Experiments: Incremental and Continuous Learning via Closed-Loop Transcription

Incremental learning of structured memory, one class at a time:⁹ no catastrophic forgetting!

    max_θ min_η ΔR(Z) + ΔR(Ẑ) + ΔR(Z_new, Ẑ_new)
    subject to ΔR(Z_old, Ẑ_old) = 0.    (32)

⁹ Incremental Learning of Structured Memory via Closed-Loop Transcription, S. Tong, Yi Ma, et al., ICLR 2023 (arXiv:2202.05411).
Experiments: Incremental and Continuous Learning via Closed-Loop Transcription
Empirical Verification

Memory consolidation via review (ICLR 2023): review strengthens the memory rather than degrading it.

Table: comparison with the latest SOTA on MNIST and CIFAR-10.
    Method                            MNIST    CIFAR10
    INFORS (Sun et al., ICLR 2022)    0.814    0.526
    CLS-ER (Arani et al., ICLR 2022)  0.895    0.662
    i-LDR (ours)                      0.990    0.723

Table: comparison on ImageNet-50; the results of other methods are as reported in the EEC paper.
    iCaRL-S   EEIL-S   DGMw    EEC     EECS    i-LDR
    0.290     0.118    0.178   0.352   0.309   0.523

Figure: visualization of replayed images x̂_old of class 1, ‘airplane’, in CIFAR10: (a) x̂_old before review, (b) x̂_old after review. No catastrophic forgetting!
Experiments: Unsupervised or Self-Supervised Learning via Closed-Loop Transcription

Unsupervised learning of structured memory, one sample at a time:¹¹ no catastrophic forgetting!

    max_θ min_η ΔR(Z) + ΔR(Z, Ẑ)    (33)
    subject to Σ_{i∈N} ΔR(z^i, ẑ^i) = 0  and  Σ_{i∈N} ΔR(z^i, z_a^i) = 0.

¹¹ Unsupervised Learning of Structured Representations via Closed-Loop Transcription, S. Tong, Yann LeCun, and Yi Ma, arXiv:2210.16782, 2022.
Experiments: Unsupervised or Self-Supervised Learning

Structured representations: from 1000 epochs towards one epoch!
Connections to Neuroscience? Empirical Verification

Structured memory demonstrates the same characteristics as (visual) memory in nature:
• Sparse coding in visual cortex (Olshausen, Nature 1996).¹²
• Subspace embedding (Tsao, Cell 2017; Nature 2020).¹³
• Predictive coding in visual cortex (Rao, Nature Neuroscience 1999).

[Figure: sparse coding in visual cortex: learned basis functions grouped by spatial frequency (cycles/window), and the heavy-tailed distribution P(a_i) of the sparse coefficients.]
Connections to Neuroscience?
• Parsimony: what in neuroscience could verify this principle?
• Self-consistency: what in neuroscience could verify this principle?
• Forward optimization versus backward propagation
• Open loop versus closed loop
• Self-critiquing or self-interrogating mechanisms
• From signals, to structures, to symbols, to semantics?

A Universal Learning Engine: Compressive Closed-Loop Transcription

[Diagram] External multi-modal, high-dimensional sensory data x with low-dimensional structures (visual, tactile, auditory, olfactory, ...) are mapped by the sensor/encoder f(x, θ) to internal, compact, and structured representations z = f(x) for parsimony (sparse codes, subspaces, grid/place cells, sparse graphs; brain circuits?). The controller/decoder g(z, η) generates x̂ = g(z) and ẑ = f(x̂); self-consistency is enforced via the closed loop: z ≈ ẑ = f(g(z)). (Cf. the multi-modal sensory cortices and the entorhinal cortex of the primate brain.)
Conclusions: Compressive Closed-Loop Transcription

(External world ↔ internal memory)

• A universal learning engine: transforms sensed data of the external world into a compact and structured (LDR or sparse) internal representation.
• Parsimony: optimization of the information gain (rate reduction) via a sensor and a generator.
• Self-consistency: a self-critiquing game between the sensor and the generator through a closed-loop feedback system.
• A white-box system: learning objectives, network architectures & operators, and learned representations.
A More Magical Decade: the True Origin of Intelligence
• 1943, Artificial Neural Networks, Warren McCulloch and Walter Pitts
• 1944, Game Theory, John von Neumann
• 1940s, Turing Machine, Alan Turing et al.
• 1948, Information Theory, Claude Shannon
• 1948, Feedback Control & Cybernetics, Norbert Wiener

(Diagram: Perceive, Learn, Predict)
Challenging Mathematical Problems?
Open Directions: Extensions and Connections
• How to scale up to hundreds and thousands of classes? (variational forms for rate reduction...)
• White-box architectures for closed-loop transcription (ReduNet-like)?
• Learning dictionaries to detect and align objects or patterns in 1D sequences, 2D images, or 3D space (transformers...)?
• Computational mechanisms for memory forming (in nature)? (recognition and generation integrated...)
• Closed-loop transcription for other types of low-dimensional structures? (dynamical, causal, logical, symbolic, graphical, genetic...)

The principles of parsimony and self-consistency shall always rule!


World Model: An Integrated System of Transcriptions?

Closed-loop transcription:
• Neural networks are nature’s optimization algorithms that maximize information gain (one iteration per layer).
• Closed-loop transcriptors are basic learning units for autonomous self-consistency (error feedback & self-critique).
• Together they robustly and efficiently learn compact, structured representations of the world (parallel & distributed).

A unified purpose: maximize “information gain” (ΔI, or ΔR) with every unit, at every stage!
World Model: From Memory to Language?

A unified purpose of intelligence: maximize “information gain” at both the individual and the community level!

(Diagram: Perceive, Learn, Predict, Communicate)
Visual, Tactile, Auditory… Vocabulary, Languages…
Memory of an individual, and knowledge shared by a community.


References: Close the Loop between the Origin (1948/1961) and the Current (2022)
Sparse coding, low-dimensional structures, error correction, optimization, compression, deep networks.

• On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence, Yi Ma, Doris Tsao, and Heung-Yeung Shum, arXiv:2207.04630, FITEE, 2022.
• CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction, Xili Dai, Shengbang Tong, Yi Ma, et al., arXiv:2111.06636, Entropy, March 2022.
• ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction, Ryan Chan, Yaodong Yu, Yi Ma, et al., arXiv:2104.10446, JMLR, 2022.

Open the black box and close the loop for intelligence.
A compressive closed-loop transcription is a universal learning machine for a compact and structured model of real-world data.

Thank you! Questions, please?

“What I cannot create, I do not understand.”
-- Richard Feynman’s last words

“Learners need endless feedback more than they need endless teaching.”
-- Grant Wiggins
