
Published in Transactions on Machine Learning Research (10/2022)

ZerO Initialization: Initializing Neural Networks with only Zeros and Ones

Jiawei Zhao [email protected]


California Institute of Technology

Florian Schäfer [email protected]


Georgia Institute of Technology

Anima Anandkumar [email protected]


California Institute of Technology
NVIDIA

Reviewed on OpenReview: https://openreview.net/forum?id=1AxQpKmiTc

Abstract

Deep neural networks are usually initialized with random weights, with adequately selected
initial variance to ensure stable signal propagation during training. However, selecting
the appropriate variance becomes challenging especially as the number of layers grows. In
this work, we replace random weight initialization with a fully deterministic initialization
scheme, viz., ZerO, which initializes the weights of networks with only zeros and ones (up
to a normalization factor), based on identity and Hadamard transforms. Through both
theoretical and empirical studies, we demonstrate that ZerO is able to train networks without
damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance
on various datasets, including ImageNet, which suggests random weights may be unnecessary
for network initialization. In addition, ZerO has many benefits, such as training ultra deep
networks (without batch-normalization), exhibiting low-rank learning trajectories that result
in low-rank and sparse solutions, and improving training reproducibility1 .

1 Introduction

An important question in training deep neural networks is how to initialize the weights. Currently, random
weight initialization is the de-facto practice across all architectures and tasks. However, choosing the variance
of the initial weight distribution is a delicate balance when training deep neural networks. If the variance is
too large, it can lead to an excessive amplification of the activations propagating through the network during
training, resulting in exploding gradients. On the other hand, if the weights are initialized too small, the
activations may not propagate at all, resulting in vanishing gradients. These issues become more challenging
as the number of layers in the network grows.
The above challenges can be avoided if identity initialization is used instead. It initializes each layer in the
network as an identity mapping, such that the input data can be identically propagated to the network output.
In this case, there is no need to introduce any randomness or consider its variance. Identity initialization
is well studied theoretically from an optimization perspective. Hardt & Ma (2017) prove the existence of
a global minimum close to the identity parameterization in a deep residual network. Bartlett et al. (2019)
further prove the rate of convergence of gradient-based optimization under identity initialization.
1 Code repository: https://github.com/jiaweizzhao/ZerO-initialization.


Table 1: Comparing ZerO initialization with randomly perturbed identity initialization. $F$ represents a transformation¹ of an arbitrary layer $l$.

Settings | Techniques | Related Works
$x_{l+1} = x_l + F(x_l)$ | $F$ is randomly initialized with small variance | Hardt & Ma (2017)
$x_{l+1} = x_l + \mathrm{Conv}(F(x_l))$ | $F$ is randomly initialized, kernel in $\mathrm{Conv}$ is zero | Zhang et al. (2019)
$x_{l+1} = x_l + \mathrm{Norm}(F(x_l))$ | $F$ is randomly initialized, scale in $\mathrm{Norm}$ is zero | Goyal et al. (2017)
$x_{l+1} = x_l + \alpha(F(x_l))$ | $F$ is randomly initialized, scalar $\alpha$ is zero | Bachlechner et al. (2020)
All settings above | $F$ is a Hadamard/identity transform² | ZerO (ours)

However, prior theoretical works on identity initialization assume that all layers have the same dimensionality,
which does not hold for practical networks. Typically, practical networks have varying dimensionality across
layers, such as the variations of spatial and channel dimensions in ResNet architectures (He et al., 2016).
Directly applying identity initialization to these networks leads to a problem of training degeneracy, as our
theoretical study will demonstrate later.
To avoid the training degeneracy, previous works (summarized in Table 1) employ identity initialization with
random perturbations to facilitate escapes from a saddle point or to break feature symmetry (Blumenfeld
et al., 2020). Broadly, these approaches satisfy the property of dynamical isometry, to preserve the signal
propagation and ensure well-behaved gradients at initialization (Saxe et al., 2014). Despite the efficacy of
random perturbations, they inevitably introduce additional tuning of variances, which can result in gradient
explosion in deep networks without careful tuning.
Our work: we propose ZerO initialization that removes all randomness in the weight initialization. As
illustrated in Figure 1, ZerO initializes networks with Hadamard and identity transforms, which assigns all
the weights to only zeros and ones.
Figure 1: Illustrating ZerO in a 3-layer network with input dimension $N_x$, hidden dimension $N_h$, and output dimension $N_y$, where $N_h > N_x, N_y$. $H$ and $I$ are the $N_h \times N_h$ Hadamard and identity matrices, respectively. The dimension-increasing layer is initialized by columns of the Hadamard matrix. The remaining layers are initialized by the identity matrix or rows of it.

ZerO is not affected by the problem of training degeneracy or accuracy loss. Compared to random initializa-
tion, ZerO provides state-of-the-art performance over various image classification tasks, including ImageNet.
We further discover many unique properties and benefits of ZerO:

Stable training without batch normalization. ZerO ensures well-behaved signal propagation, which
provides stable training without batch normalization. Testing ResNet with over 500 layers, we find ZerO
converges faster than carefully designed random methods, such as Fixup and ReZero (Zhang et al., 2019;
Bachlechner et al., 2020).

Low-rank learning trajectory. We find that ZerO exhibits a low-rank learning trajectory, where the rank
of each matrix gradually increases during training. We believe this is the first time that the greedy low-rank
1 The transformation may contain linear, nonlinear or convolutional operations.
2 Both identity and Hadamard transforms are deterministic parameterizations.


learning (GLRL) trajectory, a theoretical characterization of gradient descent, has been observed in large-
scale deep learning applications. GLRL is a consequence of implicit rank regularization by gradient descent
under infinitesimal initialization (Li et al., 2021; Razin et al., 2021). It can be viewed as performing a rank-
constrained optimization and greedily relaxing the rank restriction by one whenever it fails to reach a global
minimizer. GLRL has been used to explain the excellent generalization in gradient-based deep learning, as
it converges to a global (or local) minima with the minimum rank. However, the GLRL trajectory has never
been observed in practice due to its impractical requirement of infinitesimal initialization.

Sparse and low-rank solutions. We observe that ZerO-initialized networks converge to sparse and
low-rank solutions. Compared to randomly initialized networks, the sub-networks obtained in trained ZerO-
initialized networks achieve 30% lower (matrix or tensor) rank or 25% higher sparsity without sacrificing
accuracy.

Better training reproducibility. Since ZerO does not require any random perturbations, it is a fully
deterministic initialization scheme. Unlike determinism in random initialization, which needs fixing pseu-
dorandom number generators in hardware, the weights initialized by ZerO are fixed regardless of how the
random seed varies or which hardware is used. ZerO significantly reduces the training variation and thus
achieves better training reproducibility (the remaining randomness is only due to batch selection). Com-
pared to random initialization, ZerO produces 20%-40% lower standard deviation of the final accuracy over
repeated experiments with different random seeds.

Theoretical analysis of ZerO. Theoretically, we demonstrate that ZerO breaks a training degeneracy
that arises when applying identity initialization to networks with varying dimensionality across layers. We
prove that the training degeneracy necessarily occurs in standard identity initialization because the rank
of any Nh × Nh matrix in the network is upper bounded by the input and output dimensions Nx and Ny
throughout the entire training, no matter how large the size of Nh is. This limits the expressivity of each
matrix, resulting in the degeneracy of training.
Our contributions are summarized as follows:

1. We design ZerO initialization, the first fully deterministic initialization that achieves state-of-the-art
performance in practice.

2. ZerO is backed with theoretical understanding. As shown in Theorem 1, we prove how ZerO breaks the
training degeneracy by applying Hadamard transforms.

3. ZerO has many benefits, such as training ultra deep networks (without batch-normalization), exhibit-
ing low-rank learning trajectory, converging to sparse and low-rank solutions, and improving training
reproducibility.

2 Is Randomness Necessary in Identity Initialization?

2.1 Background

We wish to train a function $\mathcal{F}(x)$ to learn a particular input-output map given a set of $P$ training samples $(x^\mu, y^\mu) \in \mathbb{R}^{N_x} \times \mathbb{R}^{N_y}$, where $\mu = 1, \dots, P$. Training is accomplished by minimizing the squared error $\mathcal{L} = \frac{1}{2} \sum_{\mu=1}^{P} \| y^\mu - \mathcal{F}(x^\mu) \|_2^2$ using gradient descent with a step size $\eta$.

We model F(x) to be a multilayer perceptron with L > 2 hidden layers, such that:

$$x_l = W_l z_{l-1}, \qquad z_l = \varphi(x_l),$$

with $l \in \{1, \dots, L\}$. Let $z_0 = x$ and $\mathcal{F}(x) = z_L$. $\varphi$ is an element-wise nonlinearity. We assume that $\mathcal{F}$ has a uniform hidden dimension $N_h$, with $W_l \in \mathbb{R}^{N_h \times N_h}$ for $l \in \{2, \dots, L-1\}$, $W_1 \in \mathbb{R}^{N_h \times N_x}$, and $W_L \in \mathbb{R}^{N_y \times N_h}$.


The input-output Jacobian is a well-studied proxy for estimating the stability of signal propagation at initialization (Saxe et al., 2014). It is defined as
$$J_{io} = \frac{\partial z_L}{\partial z_0}.$$
Proposed by Saxe et al. (2014), dynamical isometry is a condition where all singular values of the Jacobian $J_{io}$ concentrate near 1. If $J_{io}$ is well-conditioned, the backpropagation error $\partial \mathcal{L} / \partial z_l$ at any layer $l$ will be well-conditioned as well. This ensures stable signal propagation and well-behaved gradients at initialization.

Consider the case of $N_x = N_y = N_h$. Identity initialization is defined as initializing each matrix to be an identity matrix: $W_l = I$. In this case, the dynamical isometry property for a linear $\mathcal{F}$ can be easily verified, as $J_{io} = I$. It also holds for certain nonlinearities when applying the identity initialization to residual networks: $z_l = \varphi(x_l) + x_{l-1}$ with $W_l = 0$, such that no signal is passed through the residual branches at initialization. Table 1 lists a few related examples.
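As a quick illustration (a sketch with a toy identity-initialized linear network chosen here, not an experiment from the paper), the Jacobian $J_{io}$ can be computed by automatic differentiation, and all of its singular values equal 1, so dynamical isometry holds exactly:

```python
import torch

N = 16
# A 3-layer linear network initialized as an identity map (W_l = I, no biases).
layers = [torch.nn.Linear(N, N, bias=False) for _ in range(3)]
for layer in layers:
    torch.nn.init.eye_(layer.weight)

def f(x: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        x = layer(x)
    return x

x0 = torch.randn(N)
J = torch.autograd.functional.jacobian(f, x0)   # J_io = dz_L / dz_0
print(torch.linalg.svdvals(J))                  # all singular values equal 1
```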
From an optimization perspective, Hardt & Ma (2017) suggest that $\mathcal{F}$ has a global minimum very close to its identity initialization, such that $\max_{1 \le l \le L} \|W_l'\| \le O(1/L)$ for large $L$, where $W' = W - I$. Bartlett et al. (2019) also prove that, under identity initialization, gradient descent learns an $\epsilon$-approximation of $\mathcal{F}$ within a number of iterations polynomial in $\log(1/\epsilon)$.

2.2 Extending to Large Hidden Dimension

So far, we have discussed identity initialization in the special case of fixed dimensionality. Now we extend our discussion to a more practical setting where $\mathcal{F}$ is equipped with a large hidden dimension, such that $N_h \gg N_x, N_y$. We also focus on the specific case where $\varphi$ is a ReLU nonlinearity.

Figure 2: A 3-layer network $\mathcal{F}$ ($N_h > N_x = N_y$) where $W_1, W_3 = I^*$ and $W_2 = I$ at initialization.

A straightforward approach to identity initialization in this case is to initialize $W_1$ such that it projects the input into an $N_x$-dimensional subspace of the $N_h$-dimensional hidden space. This can be achieved by initializing $W_1$ with a partial identity matrix:
Definition 1 (Partial Identity Matrix). Let $I^* \in \mathbb{R}^{l \times r}$. The partial identity matrix $I^*$ is defined as follows:
$$I^* = \begin{cases} (I, 0) & \text{if } l < r, \text{ where } I \in \mathbb{R}^{l \times l},\ 0 \in \mathbb{R}^{l \times (r-l)}, \\ I & \text{if } l = r, \text{ where } I \in \mathbb{R}^{l \times l}, \\ (I, 0)^\top & \text{otherwise, where } I \in \mathbb{R}^{r \times r},\ 0 \in \mathbb{R}^{r \times (l-r)}. \end{cases}$$

For a vector $a \in \mathbb{R}^r$, if $l < r$, then $I^*(a)$ clips the last few dimensions, such that $I^*(a) = (a_1, a_2, \dots, a_l)^\top$. If $l > r$, then $I^*$ pads $l - r$ additional dimensions with zeros, such that $I^*(a) = (a_1, a_2, \dots, a_r, 0, \dots, 0)^\top$. This is also known as a zero-padding operator, as used in channel-expanding layers in ResNet (He et al., 2016). In the network $\mathcal{F}$, $I^*$ only needs to be applied to the dimension-varying matrices $W_1$ and $W_L$, while leaving the remaining $N_h \times N_h$ matrices as the identity matrix $I$.
We visualize this process in Figure 2. This may seem like a natural extension of identity initialization to a
large width setting, but we will show in Section 2.3 it suffers from a problem we call “training degeneracy”.
To avoid the problem, we use the Hadamard matrix H to initialize the dimension-increasing matrices, such
that $W_1 = H I^*$. A Hadamard matrix is defined as follows:
Definition 2 (Hadamard matrix). For any Hadamard matrix $H = H_m \in \mathbb{R}^{2^m \times 2^m}$, where $m$ is a positive integer, we define $H_0 = 1$, and the matrix for larger $m$ is defined recursively:
$$H_m = \begin{pmatrix} H_{m-1} & H_{m-1} \\ H_{m-1} & -H_{m-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 & \cdots \\ 1 & -1 & 1 & -1 & \cdots \\ 1 & 1 & -1 & -1 & \cdots \\ 1 & -1 & -1 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \in \mathbb{R}^{2^m \times 2^m}.$$


The linear transformation described by the Hadamard matrix, called the Hadamard transform, rotates the
coordinate axes to be equally weakly aligned with the standard basis. For example, in a two-dimensional
plane, the Hadamard transform rotates the standard basis by an exact angle of 45 degrees. This turns out to
be an important property for breaking the training degeneracy.
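As a small illustration (a sketch added here, not from the paper; scipy.linalg.hadamard implements the same Sylvester construction as Definition 2), the normalized Hadamard matrix is orthogonal and each of its columns has the same overlap $2^{-m/2}$ with every standard basis vector:

```python
import numpy as np
from scipy.linalg import hadamard   # Sylvester construction, as in Definition 2

m = 3
H = hadamard(2 ** m)                # +1/-1 entries, shape 2^m x 2^m
Q = 2 ** (-m / 2) * H               # orthonormal Hadamard transform

# Q is orthogonal, and every transformed basis vector overlaps each standard
# basis vector with the same magnitude 2^{-m/2} ("equally weakly aligned").
assert np.allclose(Q.T @ Q, np.eye(2 ** m))
assert np.allclose(np.abs(Q), 2 ** (-m / 2))
```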

2.3 Identity Initialization Limits Network Expressivity

We now present our main result differentiating the training behavior of different initialization methods and
describing the problem of training degeneracy.
Theorem 1. Let $\mathcal{F}$ be a neural network with $L$ matrices, where $W_l \in \mathbb{R}^{N_h \times N_h}$ for $l \in \{2, \dots, L-1\}$, $W_1 \in \mathbb{R}^{N_h \times N_x}$, and $W_L \in \mathbb{R}^{N_y \times N_h}$. $\mathcal{F}$ has a uniform hidden dimension $N_h$, input dimension $N_x$, and output dimension $N_y$, where $N_h \ge N_x, N_y$. Define the residual component $W_l' = W_l - I$. Let $z_l(x)$ be the activation in the $l$-th layer under the input $x$. Then we have the following results for different initializations:

(i) Consider a random perturbation $\mu \in \mathbb{R}^{N_h \times N_h}$ where each element is sampled from a Gaussian distribution: $\mu_{ij} \sim \mathcal{N}(0, \sigma^2)$. It is well known that the randomly perturbed matrices $W_l = I + \mu_l$ (for $l \ne 1, L$) are full-rank almost surely:
$$\mathrm{Prob}(\mathrm{rank}(W_l') = N_h) = 1 \quad \text{for } l \in \{2, \dots, L-1\}. \quad (1)$$

(ii) When initializing $W_1, W_L = I^*$ and the remaining matrices as $W_l = I$ (for $l \ne 1, L$), for any $x \in \mathbb{R}^{N_x}$, the linear dimension of the set of all possible activations is bounded throughout training as
$$\dim\big(\mathrm{span}\big(\{ z_l(x) \mid x \in \mathbb{R}^{N_x} \}\big)\big) \le N_x \quad \text{for } l \in \{2, \dots, L-1\}. \quad (2)$$
As a result, the ranks of the weight matrices remain bounded throughout training as
$$\mathrm{rank}(W_l') \le N_x \quad \text{for } l \in \{2, \dots, L-1\}. \quad (3)$$

(iii) When initializing $W_1 = H I^*$, $W_L = I^*$, and the remaining matrices as $W_l = I$ (for $l \ne 1, L$), it is possible for the activations at an intermediate layer to attain
$$\dim\big(\mathrm{span}\big(\{ z_l(x) \mid x \in \mathbb{R}^{N_x} \}\big)\big) > N_x, \quad (4)$$
breaking the constraint on the linear dimension described in Equation 2.


Remark 1. The constraints on linear dimensions and matrix ranks in Equations 2 and 3 suggest that no matter how large the hidden dimension $N_h$ is, the network $\mathcal{F}$ is only optimized within a low-dimensional subspace (depending on the input dimension $N_x$) of the full parameter space. This restricts the maximum expressivity of $\mathcal{F}$, and thus training may only converge to an underfitting regime, leading to training degeneracy.
Remark 2. Under the assumptions of (iii), the breaking of training degeneracy described in Equation 4 appears to be the generic case. As verified empirically in Figure 3, applying the Hadamard transform to the setting of (ii) also breaks the rank constraint in Equation 3.
Remark 3. (i) and (iii) avoid the training degeneracy from different directions. Unlike (i), the proposed (iii) does not introduce any randomness, thanks to the Hadamard transform.

Almost all existing works use random weight initialization, which largely affects the rank of each matrix, as shown in (i). Claim (i) can be proved by showing that any column (or row) of a random matrix is, almost surely, linearly independent of the other columns (or rows). A detailed proof can be found in the appendix.
Now consider identity initialization without any randomness. In (ii), $W_1$ identically maps the input $x$ into a subspace of the hidden layer $z_1$, such that $z_1 = (x^\top, 0, \dots, 0)^\top$. Thus, the linear dimension of $z_l$ (i.e., the linear dimension of the activations $z_l^1, \dots, z_l^P$) is the same as the linear dimension of $x$, which is upper bounded by $N_x$. This result holds for every layer $z_l$ (where $l \ne 1, L$).


Figure 3: Verifying Theorem 1 in practice. We train a network $\mathcal{F}$ with $L = 3$ on MNIST and visualize $\mathrm{rank}(W_2)$ over the course of training (both panels plot rank against training iterations). The red dashed line denotes the rank constraint on $W_2$ predicted by the theorem, which is $N_x = 784$. Left: we verify (ii) in the theorem by varying $N_h \in \{512, 1024, 2048, 4096\}$. No matter how large $N_h$ is, $\mathrm{rank}(W_2)$ obeys the rank constraint throughout training. Right: we verify (iii), where applying the Hadamard transform breaks the rank constraint introduced in (ii), given $N_h = 2048$. We denote the initializations in (ii) and (iii) as standard and Hadamard-based GI-Init, respectively; random initialization is shown for comparison. As predicted in (i), random initialization attains its maximum rank immediately after initialization.

To show the rank constraint in Equation 3, we track a single derivative of the residual component at layer $l$:
$$\frac{\partial \mathcal{L}}{\partial W_l'} = \sum_{\mu=1}^{P} \frac{\partial \mathcal{L}}{\partial x_l^\mu} \otimes z_{l-1}^\mu, \qquad (5)$$
where $\otimes$ denotes the outer product. We use the following well-known fact:
Lemma 1. Consider a matrix $M$ that is a sum of vector outer products: $M = \sum_{\mu=1}^{Q} a^\mu \otimes b^\mu$, where $a^\mu \in \mathbb{R}^{N_a}$ and $b^\mu \in \mathbb{R}^{N_b}$ for $\mu \in \{1, \dots, Q\}$. Let $Q > N_a, N_b$. If the linear dimensions satisfy $\dim(\mathrm{span}(a^1, \dots, a^Q)) \le U$ and $\dim(\mathrm{span}(b^1, \dots, b^Q)) \le V$, where $U \le N_a$ and $V \le N_b$, then
$$\mathrm{rank}(M) \le \min(U, V).$$

By Lemma 1, at initialization, the upper bound $N_x$ on the linear dimension of $z_{l-1}$ results in a rank constraint on $\mathrm{rank}(W_l')$. The rank constraint holds during the entire training, as $\frac{\partial \mathcal{L}}{\partial W_l'}$ has a zero-valued $N_y \times (N_h - N_x)$ sub-matrix at every iteration (as shown in the appendix). Since $W_l' = \sum_{t=1}^{T} -\eta \left.\frac{\partial \mathcal{L}}{\partial W_l'}\right|_t$ after $T$ weight updates (by gradient descent with step size $\eta$), $\mathrm{rank}(W_l')$ is bounded by $N_x$ no matter what $T$ is. This results in the training degeneracy described in Remark 1. We also verify it empirically in Figure 3.
To avoid the training degeneracy, we need to overcome the limitation on the linear dimension of the set of possible activations. This is indeed possible when using the Hadamard matrix as $W_1 = H I^*$, as we will illustrate by means of an example.
Lemma 2. Assume $N_h = 4$ and $N_x = 3$. For any vector $x \in \mathrm{span}(e_2, e_3)$, where $e_2$ and $e_3$ are the coordinate vectors $(0, 1, 0)^\top$ and $(0, 0, 1)^\top$, it holds that
$$\mathrm{span}\big(\{ z_1(x) \mid x \in \mathbb{R}^{N_x} \}\big) = \mathrm{span}\big(\mathrm{ReLU}(H I^* e_2),\, \mathrm{ReLU}(-H I^* e_2),\, \mathrm{ReLU}(H I^* e_3),\, \mathrm{ReLU}(-H I^* e_3)\big), \qquad (6)$$
where $\mathrm{ReLU}(H I^* e_2)$, $\mathrm{ReLU}(-H I^* e_2)$, $\mathrm{ReLU}(H I^* e_3)$, and $\mathrm{ReLU}(-H I^* e_3)$ are linearly independent. This indicates that
$$\dim\big(\mathrm{span}\big(\{ z_1(x) \mid x \in \mathbb{R}^{N_x} \}\big)\big) = 4 = N_h > N_x = 3.$$

When using the Hadamard matrix as $W_1 = H I^*$, the breaking of the training degeneracy described in Lemma 2 appears to be the generic case. As verified empirically in Figure 3, this also breaks the rank constraint in Equation 3.
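As a quick numerical illustration of this effect (a sketch added here, assuming the standard Sylvester ordering of the $4 \times 4$ Hadamard matrix), one can sample inputs $x \in \mathrm{span}(e_2, e_3)$ and check the dimension spanned by the resulting first-layer activations: with $W_1 = H I^*$ it reaches $N_h = 4$, breaking the bound of Equation 2, whereas with $W_1 = I^*$ it stays at most $N_x = 3$:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

# 4x4 Hadamard matrix H_2 of Definition 2 (Sylvester ordering assumed).
H = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]], dtype=float)
I_star = np.vstack([np.eye(3), np.zeros((1, 3))])   # partial identity, N_h=4 by N_x=3

# Sample inputs x in span(e2, e3) and collect the first-layer activations z_1(x).
rng = np.random.default_rng(0)
xs = np.array([[0.0, a, b] for a, b in rng.normal(size=(100, 2))])
z_hadamard = relu(xs @ (H @ I_star).T)   # W_1 = H I*
z_identity = relu(xs @ I_star.T)         # W_1 = I*

print(np.linalg.matrix_rank(z_hadamard))  # 4 = N_h: the bound of Equation 2 is broken
print(np.linalg.matrix_rank(z_identity))  # <= 3 = N_x: the bound of Equation 2 holds
```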


We point out that the increase in the linear dimension of the set of possible $z_l$ is only possible due to the nonlinearity. If $\mathcal{F}$ is a linear network, the linear dimension of every $z_l$ is at most $N_x$, no matter how the weights are initialized.
Nevertheless, the nonlinearity cannot increase the linear dimensionality if we initialize the network with a partial identity matrix. This is because when $W_1 = I^*$, $\mathrm{span}(\{ z_l(x) \mid x \in \mathbb{R}^{N_x} \})$ is aligned with the standard basis, i.e., each vector in the span has at least $N_h - N_x$ zero coefficients when expressed in the standard basis. Thus, an element-wise nonlinearity cannot increase the linear dimension of its input beyond $N_x$.
To break the alignment of $\mathrm{span}(\{ z_l(x) \mid x \in \mathbb{R}^{N_x} \})$ with the standard basis, we use the Hadamard transform. This is because it transforms the subspace such that the new basis is equally weakly aligned with the standard basis. We note that other linear transforms may also detach the subspace from the standard basis, but the Hadamard transform is the most natural choice.

3 ZerO Initialization

The initialization analyzed in (iii) of Theorem 1 is based on a network condition in which all hidden spaces $z_1, \dots, z_{L-1}$ have the same dimension $N_h$. Motivated by our theoretical understanding, we propose ZerO initialization, which initializes the weights of any network with arbitrary hidden dimensions. As described in Algorithm 1, ZerO only initializes dimension-increasing layers with Hadamard matrices to avoid the training degeneracy. Other layers are simply initialized by (partial) identity matrices. We also rescale Hadamard matrices by a normalization factor $2^{-(m-1)/2}$, resulting in an orthonormal Hadamard transform.

Algorithm 1 ZerO Initialization.


Input: a neural network $\mathcal{F}$ with $L$ matrices $W_l \in \mathbb{R}^{P_l \times Q_l}$ for $l$ in $1, \dots, L$. $I^*$ is the partial identity matrix defined in Definition 1. $H_m$ is the Hadamard matrix defined in Definition 2.
For $l$ in $1, \dots, L$:
  If $P_l = Q_l$: $W_l \leftarrow I$    (identity mapping)
  If $P_l < Q_l$: $W_l \leftarrow I^*$    (propagate the first $P_l$ dimensions)
  If $P_l > Q_l$: $W_l \leftarrow c\, I^* H_m I^*$, where $m = \lceil \log_2(P_l) \rceil$ and $c = 2^{-(m-1)/2}$    (apply Hadamard matrix)
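A minimal PyTorch sketch of Algorithm 1 for a single $P_l \times Q_l$ weight matrix is given below. This is an illustrative reimplementation under assumed helper names, not the authors' released code; it uses scipy.linalg.hadamard for the Sylvester construction of Definition 2 and torch.eye(P, Q) for the partial identity of Definition 1:

```python
import math
import torch
from scipy.linalg import hadamard   # Sylvester construction, matching Definition 2

@torch.no_grad()
def zero_init_(weight: torch.Tensor) -> None:
    """Sketch of Algorithm 1 for a single P x Q weight matrix (in place)."""
    P, Q = weight.shape
    if P <= Q:
        # P == Q gives the identity; P < Q gives the partial identity I*.
        # torch.eye(P, Q) produces exactly the matrix of Definition 1.
        weight.copy_(torch.eye(P, Q))
    else:
        # Dimension-increasing layer: c * I* H_m I* with m = ceil(log2 P).
        m = math.ceil(math.log2(P))
        H = torch.from_numpy(hadamard(2 ** m)).to(weight.dtype)
        c = 2 ** (-(m - 1) / 2)
        weight.copy_(c * torch.eye(P, 2 ** m) @ H @ torch.eye(2 ** m, Q))

# Example (hypothetical sizes): a 3-layer network with Nx = 784, Nh = 2048, Ny = 10.
for layer in [torch.nn.Linear(784, 2048), torch.nn.Linear(2048, 2048), torch.nn.Linear(2048, 10)]:
    zero_init_(layer.weight)
    torch.nn.init.zeros_(layer.bias)
```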

We also apply ZerO to the well-developed ResNet architectures of He et al. (2016). As shown in Algorithm 2, we apply ZerO to convolutions in a similar way by considering the variation in channel dimensions. When $K$ is a 1x1 convolution, $K$ can also be viewed as a $c_{out} \times c_{in}$ matrix, which matches the initialization in Algorithm 1. We note that Algorithm 2 can be applied to every convolution in ResNet, including the first 3x3 convolution, the 1x1 convolutions in spatial-downsampling skip connections, and the convolutions in basic and bottleneck blocks.
To achieve dynamical isometry at initialization, we apply a common technique that initializes the last convolution in each residual block to zero. This helps suppress the signals from the residual branches and stabilizes signal propagation, as studied in Zhang et al. (2019); Bachlechner et al. (2020).

Algorithm 2 ZerO Initialization on Convolution.


Input: number of input channels $c_{in}$, number of output channels $c_{out}$, odd kernel size $k$.
Return: a $c_{out} \times c_{in} \times k \times k$ convolutional kernel $K$ (entries not assigned below are zero).
Let $n \leftarrow \lfloor k/2 \rfloor$
  If $c_{out} = c_{in}$: $K[:, :, n, n] \leftarrow I$
  If $c_{out} < c_{in}$: $K[:, :, n, n] \leftarrow I^*$
  If $c_{out} > c_{in}$: $K[:, :, n, n] \leftarrow c\, I^* H_m I^*$, where $m = \lceil \log_2(c_{out}) \rceil$ and $c = 2^{-(m-1)/2}$
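A corresponding sketch of Algorithm 2 for convolutional kernels (again an illustration, not the released implementation): the kernel is zeroed and only its central tap $K[:, :, n, n]$ receives the matrix prescribed by Algorithm 1:

```python
import math
import torch
from scipy.linalg import hadamard

@torch.no_grad()
def zero_init_conv_(kernel: torch.Tensor) -> None:
    """Sketch of Algorithm 2 for a c_out x c_in x k x k kernel with odd k (in place)."""
    c_out, c_in, k, _ = kernel.shape
    n = k // 2                       # central tap of the kernel
    kernel.zero_()                   # every other tap stays zero
    if c_out <= c_in:
        center = torch.eye(c_out, c_in)        # I (c_out == c_in) or I* (c_out < c_in)
    else:
        m = math.ceil(math.log2(c_out))
        H = torch.from_numpy(hadamard(2 ** m)).to(kernel.dtype)
        center = 2 ** (-(m - 1) / 2) * torch.eye(c_out, 2 ** m) @ H @ torch.eye(2 ** m, c_in)
    kernel[:, :, n, n] = center

# Example: a channel-expanding 3x3 convolution receives the Hadamard block at its centre.
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
zero_init_conv_(conv.weight)
```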

We also apply ZerO to networks with or without batch normalization. For ResNet with batch normalization,
we follow the standard practice of initializing the scale and bias in batch normalization to one and zero, respectively.


Dataset | Model | Initialization | Test Error (mean ± std)
CIFAR-10 | ResNet-18 | ZerO Init | 5.13 ± 0.08
CIFAR-10 | ResNet-18 | Kaiming Init | 5.15 ± 0.13
CIFAR-10 | ResNet-18 | Xavier Init | 5.23 ± 0.16
ImageNet | ResNet-50 | ZerO Init | 23.43 ± 0.04
ImageNet | ResNet-50 | Kaiming Init | 23.46 ± 0.07
ImageNet | ResNet-50 | Xavier Init | 23.65 ± 0.11

Table 2: Benchmarking ZerO on CIFAR-10 and ImageNet. We repeat each run 10 times with different random seeds.

For training without batch normalization, we adopt a technique proposed by Zhang et al. (2019), where batch normalization is replaced by learnable scalar multipliers and biases.

4 Experiments

In this section, we empirically benchmark ZerO on CIFAR-10 and ImageNet datasets, where we evaluate
ResNet-18 on CIFAR-10 and ResNet-50 on ImageNet (Krizhevsky, 2009; Deng et al., 2009). Both ResNet
structures follow the design from He et al. (2016), which includes batch normalization by default.

Hyperparameter settings. We find that ZerO can fully utilize the default hyperparameters, which include
a learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. In addition, we observe that learning rate warmup is essential for ZerO to achieve a large maximal learning rate, as most of the weights start from exactly zero. We warm up the learning rate over 5 and 10 epochs for ImageNet and CIFAR-10,
respectively.
We present our main results that compare different initialization schemes. For each dataset, all experiments
use the same hyperparameter settings by default. Each experiment is repeated for ten runs with different
random seeds. As shown in Table 2, ZerO achieves state-of-the-art accuracy on both datasets compared to
other random methods.
In addition, we compare ZerO with a broad range of related works on CIFAR-10 using ResNet-18 and ResNet-50, including ReZero (Bachlechner et al., 2020), Fixup (Zhang et al., 2019), SkipInit (De & Smith, 2020), and ConstNet (Blumenfeld et al., 2020). As shown in Table 3, ZerO consistently achieves top performance compared to the other methods.
We note that the ConstNet proposed by Blumenfeld et al. (2020) is also a deterministic initialization. However, unlike ZerO, which preserves feature diversity, ConstNet is designed to eliminate diversity by averaging features across layers. The feature symmetry problem in ConstNet causes significant degradation, and additional random noise (e.g., non-deterministic GPU operations and dropout) is needed to break the symmetry.

Method | ZerO | ReZero | Fixup | SkipInit | ConstNet | ConstNet*
ResNet-18 | 5.13 | 5.20 | 5.17 | 5.26 | 72.39 | 5.41
ResNet-50 | 4.53 | 4.72 | 4.51 | 4.63 | 71.58 | 4.88

Table 3: Comparing ZerO with other initialization methods on CIFAR-10. ConstNet* denotes ConstNet with non-deterministic GPU operations, as discussed in Blumenfeld et al. (2020). Top-1 test error is reported.


Training ultra-deep networks without batch normalization. Although there are methods attempting to train networks without batch normalization (by achieving dynamical isometry), they inevitably introduce random perturbations at initialization, affecting training stability when the network is sufficiently deep (Zhang et al., 2019; De & Smith, 2020). We benchmark ZerO against state-of-the-art methods for training without batch normalization. As shown in Figure 4 (left), compared to other methods, ZerO achieves the best training stability, even for networks with around 500 layers. It also matches the baseline where batch normalization is enabled.

Improved reproducibility. In addition, as shown in Table 2, ZerO achieves the lowest standard devi-
ation over the repeated runs. On ImageNet, the gap between ZerO and other methods is even more than
40%. Thus, removing the randomness in the weight initialization improves reproducibility, with possible
implications for topics such as trustworthy machine learning and network interpretation.

Figure 4: Training extremely deep ResNets on CIFAR-10 over 15 epochs.

4.1 ZerO on Transformer

We also apply ZerO to the Transformer and evaluate it on the WikiText-2 dataset (Vaswani et al., 2017). In each Transformer layer, we use ZerO to initialize both the multi-head attention and the feed-forward layers. Because the embedding size is fixed in the multi-head attention, we initialize the projection matrix of the queries $W_Q$ as the identity and the projection matrices of the keys and values $W_K$, $W_V$ as zero. For the feed-forward layers, we initialize the connection matrices according to their hidden dimensions using Algorithm 1.
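A sketch of the attention part of this scheme is shown below (our illustration; it assumes PyTorch's nn.MultiheadAttention with its packed in_proj_weight layout, which stacks the query, key, and value projections, and is not tied to the implementation used in the paper):

```python
import torch
from torch import nn

@torch.no_grad()
def zero_init_attention_(attn: nn.MultiheadAttention) -> None:
    """Initialize W_Q as the identity and W_K, W_V as zero."""
    d = attn.embed_dim
    # in_proj_weight stacks [W_Q; W_K; W_V] row-wise, each of shape d x d.
    attn.in_proj_weight.zero_()
    attn.in_proj_weight[:d].copy_(torch.eye(d))
    if attn.in_proj_bias is not None:
        attn.in_proj_bias.zero_()

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
zero_init_attention_(attn)   # the feed-forward matrices would follow Algorithm 1
```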
We train the Transformer models for 20 epochs with a single learning rate decay at epoch 10.² We also vary
the number of layers in the model from 2 to 20. As shown in Table 4, ZerO achieves similar performance
compared to the standard initialization. In addition, it has better training stability over deeper Transformer
models, which is consistent with our previous results on ResNet.

Number of layers | 2 | 4 | 6 | 8 | 10 | 20
Standard | 200.44 | 168.69 | 154.67 | 146.43 | diverged | diverged
ZerO | 192.34 | 169.73 | 151.91 | 149.27 | 145.62 | 141.81

Table 4: Evaluating the Transformer on WikiText-2. We vary the number of layers in the Transformer, where each layer consists of a multi-head attention and a feed-forward layer. Test perplexity is reported (lower is better).

2 We use a transformer architecture (provided by the link here) that is smaller than the transformers typically used for this task, explaining the general degradation of the results.


Figure 5: Low-rank training trajectories in ResNet-18 on CIFAR-10 (top row) and ResNet-50 on ImageNet (bottom
row). We visualize trajectories of the first convolutions in second, third, and fourth groups of residual blocks in ResNet.

5 Low-Rank Learning Trajectory

Although ZerO and random initialization achieve similar test accuracies, their training trajectories differ
significantly. In contrast to random initialization, which begins optimization from a complex network (i.e.,
full-rank weight matrices, as shown in Figure 3), ZerO starts the training from a "simple" network and
gradually increases its complexity.
To show the difference in practice, we track the ranks of convolutional kernels in ResNets during training, where the rank of each kernel reflects its complexity. We measure the stable rank, which is defined as
$$\|W\|_F^2 / \|W\|_2^2 = \sum_{i=1}^{k} \sigma_i^2(W) / \sigma_{\max}^2(W)$$
for any matrix $W$ with $k$ singular values $\sigma_i$. The stable rank is a soft version of the operator rank, and unlike the operator rank, it is insensitive to small singular values. We compare the stable ranks of various kernels between ZerO and random initialization during training. As shown in Figure 5, in contrast to random methods that begin with extremely high stable ranks, ZerO starts with low stable ranks and gradually increases them during training.
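The stable rank is straightforward to compute (a sketch added here; flattening a convolutional kernel to a $c_{out} \times (c_{in} k^2)$ matrix is our assumption about how kernels are treated as matrices):

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2 = sum_i sigma_i^2 / sigma_max^2."""
    if W.dim() > 2:                    # e.g. a c_out x c_in x k x k conv kernel
        W = W.flatten(start_dim=1)     # flatten to a matrix (an assumption)
    s = torch.linalg.svdvals(W)        # singular values, largest first
    return float((s ** 2).sum() / s[0] ** 2)

# A random matrix has many comparable singular values (high stable rank),
# while a rank-one matrix has stable rank exactly 1.
print(stable_rank(torch.randn(512, 512)))
print(stable_rank(torch.outer(torch.randn(512), torch.randn(512))))
```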
We believe ZerO’s learning trajectory is the first demonstration of greedy low-rank learning (GLRL) in large-
scale deep learning applications. GLRL is a theoretical characterization of gradient descent, such that: when
matrices are initialized with infinitesimal values, gradient descent performs a rank-constrained optimization
and greedily relaxes the rank restriction by one whenever it fails to reach a minimizer (Li et al., 2021; Razin
et al., 2021).
For example, when a matrix is initialized sufficiently small (where its rank is approximately zero), gradient
descent first searches the solution over all rank-one matrices. If it fails to find a minimizer, it will relax the
rank constraint by one and search again over all rank-two matrices. The search is stopped at rank-n if it
finds a minimizer among all rank-n matrices.


GLRL suggests that gradient descent implicitly biases the model towards simple solutions by searching
through the solution space in an incremental order of the matrix rank. This helps to explain the excellent
generalization in gradient-based deep learning, as it converges to a global (or local) minimum with the minimum
rank.
Although previous works have proved the existence of the GLRL trajectory, it has never been observed in practice due to its impractical requirement of infinitesimal initialization. The learning trajectory of ZerO that we observe suggests that GLRL arises not only under infinitesimal initialization but also under initialization around the identity. If a matrix $W_l$ is initialized as $I$, the low-rank structure actually lies in its residual component $W_l' = W_l - I$. Note that every convolutional kernel $\mathrm{conv}(x)$ we measure in Figure 5 can be viewed as the residual component of $\mathrm{conv}(x) + I$, where the skip connection is included.
Figure 5 also suggests that the kernels never reach their maximal stable ranks during training under ZerO
initialization. This implies that the searching over the space of full-rank weight matrices may be unnecessary,
suggesting new avenues towards improved computational efficiency. We hope to explore this direction in
future work.
We also observe that ZerO-based networks converge to low-complexity solutions. As shown in both Figure 5
and 6 (left), the convolutional kernels trained by ZerO usually have lower ranks than the kernels trained by
random initialization. We further measure model complexity through both network pruning and low-rank
approximation.
For network pruning, we use a standard magnitude-based pruning that prunes a portion of weights with the
lowest magnitudes in each layer (Frankle & Carbin, 2019). For low-rank approximation, we apply Tucker-2
decomposition over channel dimensions in convolutions to select the most significant components (Kim et al.,
2016).
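A per-layer magnitude-pruning step of this kind can be sketched as follows (an illustration of the standard criterion, not the exact pruning code behind Figure 6):

```python
import torch

@torch.no_grad()
def magnitude_prune_(weight: torch.Tensor, sparsity: float) -> None:
    """Zero out the fraction `sparsity` of entries with the smallest magnitudes."""
    k = int(sparsity * weight.numel())
    if k > 0:
        threshold = weight.abs().flatten().kthvalue(k).values
        weight[weight.abs() <= threshold] = 0.0

layer = torch.nn.Linear(256, 256)
magnitude_prune_(layer.weight, sparsity=0.25)   # prune 25% of the weights in this layer
```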
As shown in Figure 6, compared to randomly initialized networks, the sub-networks obtained in trained
ZerO-initialized networks achieve 25% higher sparsity or 30% lower (matrix or tensor) rank without sacri-
ficing accuracy. This suggests ZerO encourages the networks to converge to low-complexity solutions, which
improves the computational efficiency for inference.

Figure 6: Left: comparing kernel rank in ResNet-18 trained by ZerO and Kaiming methods. Middle: a magnitude-
based network pruning on ResNet-18. Right: a Tucker-2 decomposition for a particular convolution with 512 channels
in ResNet-18.

6 Related Works

To ensure stable training with random weight initialization, previous works such as Glorot & Bengio (2010);
He et al. (2015) study the propagation of variance in the forward and backward pass under different acti-
vations. Several studies provide a more detailed characterization of the signal propagation with dynamical
isometry (Saxe et al., 2014; Pennington et al., 2017; Xiao et al., 2018).
Inspired by the dynamical isometry property, various initialization methods are proposed to increase the
convergence speed and stabilize the signal propagation, including Saxe et al. (2014); Bachlechner et al.


(2020); Gehring et al. (2017); Balduzzi et al. (2017). De & Smith (2020); Hoffer et al. (2018) study the
reason behind the success of batch normalization (Ioffe & Szegedy, 2015), and Zhang et al. (2019); De &
Smith (2020) propose initialization methods to train residual networks without batch normalization.
As summarized in Table 1, many of these methods can be categorized as identity initialization with random
perturbations. However, ZerO eliminates all the randomness using Hadamard and identity transforms. In
another related work, Blumenfeld et al. (2020) discusses whether random initialization is needed from the
perspective of feature diversity. They propose networks with identical features at initialization, which still
require random perturbations to avoid the symmetry problem and improve performance.
Gradient descent biasing models towards low-rank solutions has been well studied in matrix factorization
(Arora et al., 2019). Recent works also demonstrate the existence of greedy low-rank learning trajectory
induced by gradient descent (Li et al., 2021; Razin et al., 2021; Jacot et al., 2021). However, no prior work
demonstrates the greedy low-rank learning trajectory in large-scale applications of deep learning, as most
only consider the problems of matrix factorization or applications of shallow neural networks.

7 Conclusion

In this work, we propose a simple and fully deterministic initialization called ZerO. Extensive experiments
demonstrate that ZerO achieves state-of-the-art performance, suggesting that random weight initialization
may not be necessary for initializing deep neural networks. ZerO has many benefits, such as training ultra
deep networks (without batch-normalization), exhibiting low-rank learning trajectories that result in low-
rank and sparse solutions, and improving training reproducibility.
We believe that ZerO opens up many new possibilities given its various benefits. It can be applied to
networks and tasks sensitive to the variances in weight initialization. Its low-rank learning trajectories enable
the development of rank-constrained training methods that improve computational efficiency. Finally, the
improved training reproducibility can aid model interpretability. We hope our results will inspire other
researchers to consider deterministic initialization schemes and to rethink the role of weight initialization in
training deep neural networks.

Acknowledgments

We are grateful to the anonymous reviewers for their helpful comments and NVIDIA for the computational
support. Dr. Anandkumar is supported by the Bren chair professorship at Caltech.

References
Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factor-
ization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B.
Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Con-
ference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancou-
ver, BC, Canada, pp. 7411–7422, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/
c0c783b5fc0d7d808f1d14a6e9c8280d-Abstract.html.
Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian
McAuley. ReZero is All You Need: Fast Convergence at Large Depth. ArXiv preprint, abs/2003.04887,
2020. URL https://arxiv.org/abs/2003.04887.
David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The
shattered gradients problem: If resnets are the answer, then what is the question? In Doina Precup and
Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017,
Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp.
342–350. PMLR, 2017. URL http://proceedings.mlr.press/v70/balduzzi17b.html.
Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient Descent with Identity Initializa-
tion Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks. Neural


Computation, 31(3):477–502, 2019. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco_a_01164. URL


https://direct.mit.edu/neco/article/31/3/477-502/8456.

Yaniv Blumenfeld, Dar Gilboa, and Daniel Soudry. Beyond signal propagation: Is feature diversity necessary
in deep neural network initialization? In Proceedings of the 37th International Conference on Machine
Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning
Research, pp. 960–969. PMLR, 2020. URL http://proceedings.mlr.press/v119/blumenfeld20a.html.

Soham De and Samuel L. Smith. Batch normalization biases residual blocks towards the iden-
tity function in deep networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-
Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Sys-
tems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
e6b738eca0e6792ba8a9cbcba6c1881d-Abstract.html.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255. IEEE Computer Society, 2009. doi:
10.1109/CVPR.2009.5206848. URL https://doi.org/10.1109/CVPR.2009.5206848.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural
networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence
to sequence learning. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International
Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70
of Proceedings of Machine Learning Research, pp. 1243–1252. PMLR, 2017. URL http://proceedings.
mlr.press/v70/gehring17a.html.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–
256, Chia Laguna Resort, Sardinia, Italy, 2010. PMLR. URL https://proceedings.mlr.press/v9/
glorot10a.html.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew
Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
ArXiv preprint, abs/1706.02677, 2017. URL https://arxiv.org/abs/1706.02677.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In 5th International Conference on
Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
OpenReview.net, 2017. URL https://openreview.net/forum?id=ryxB0Rtxx.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision,
ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1026–1034. IEEE Computer Society, 2015. doi:
10.1109/ICCV.2015.123. URL https://doi.org/10.1109/ICCV.2015.123.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL
https://doi.org/10.1109/CVPR.2016.90.

Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization
schemes in deep networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman,
Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31:
Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,


Montréal, Canada, pp. 2164–2174, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/


a0160709701140704575d499c997b6ca-Abstract.html.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In Francis R. Bach and David M. Blei (eds.), Proceedings of the 32nd International
Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop
and Conference Proceedings, pp. 448–456. JMLR.org, 2015. URL http://proceedings.mlr.press/v37/
ioffe15.html.
Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-Saddle Dy-
namics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity. ArXiv preprint,
abs/2106.15933, 2021. URL https://arxiv.org/abs/2106.15933.
Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of
deep convolutional neural networks for fast and low power mobile applications. In Yoshua Bengio and Yann
LeCun (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto
Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06530.
Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.
Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix
factorization: Greedy low-rank learning. In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.
net/forum?id=AHOs7Sm5H7R.
Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning
through dynamical isometry: theory and practice. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio,
Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural
Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017,
December 4-9, 2017, Long Beach, CA, USA, pp. 4785–4795, 2017. URL https://proceedings.neurips.
cc/paper/2017/hash/d9fc0cdb67638d50f411432d0d41d0ba-Abstract.html.
Noam Razin, Asaf Maman, and Nadav Cohen. Implicit regularization in tensor factorization. In Marina
Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research,
pp. 8913–8924. PMLR, 2021. URL http://proceedings.mlr.press/v139/razin21a.html.
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of
learning in deep linear neural networks. In Yoshua Bengio and Yann LeCun (eds.), 2nd International
Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference
Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6120.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio,
Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural
Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017,
December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.
cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dy-
namical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional
neural networks. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th Interna-
tional Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-
15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 5389–5398. PMLR, 2018. URL
http://proceedings.mlr.press/v80/xiao18a.html.
Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without nor-
malization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA,
USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=H1gsz30cKX.
