INTRODUCTION TO SPARSE COMPUTING
Moffett AI Whitepaper (EN v1.4)
August 2022
Copyright Notice
The copyright of this white paper belongs to Moffett AI and is protected by law. If the words or ideas of this white paper are reproduced, extracted or otherwise used, please cite Moffett AI as the source in a footnote. In case of violation of the above statement, the company will pursue the relevant legal responsibilities.
Contents
How great things are made
The four great computing problems of our age
The processor brick wall
Sparsity in neural networks
Types of sparsity in neural networks
Why aren’t neural networks sparse yet?
Sparsity becomes more promising
The unrealized potential of sparsity
Moffett AI and the path to dual sparsity
Co-designing hardware and software
A vision
Closing thoughts
About Moffett AI
The Sparse ecosystem
Patents
Published papers
When we think of building new things, we tend to think about adding materials
together to create something larger and more complex. But in the case of David,
the creative process was one of gradual subtraction, not addition.
Sparsity is a smart way to process data. It is vital in applications where large amounts of data must be processed quickly, accurately, affordably, and with the least possible amount of energy.
1 We capitalize the initial “S” in “Sparse” when it is used as an adjective. We use “Sparsity” as the noun and “sparsify” as the verb.
2 The paper “Portfolio Selection” by Harry Markowitz of The Rand Corporation, was published in The Journal of
Finance in March 1952
Humanity today faces challenges from climate change and global health
to wellbeing, safety, education and more. The United Nations Sustainable
Development Goals 3 provide a good summary of the main areas of focus and
identify 17 major areas where work is required for humanity to address these
challenges.
Computing plays a very big role in solving these challenges. At one end, modelling climate change and sequencing DNA get better the more data there is, and so computers are processing larger and larger quantities of data in order to develop better insights. The more data they can process, the more accurate their results become.
This data needs to be processed quickly. If you can process billions of images and data points in situations where time is critical, you can save lives.
At the other end, there are the vast server farms which house these computers,
consuming more and more electricity to the extent that they are responsible for
at least a few percentage points of the world’s total energy consumption and a
few percentage points of the world’s total greenhouse gas emissions.
Data centers are a vital part of the solution, but also part of the problem: as they process more data, they consume more power.
3 The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides
a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are
the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed
and developing - in a global partnership. They recognize that ending poverty and other deprivations must go
hand-in-hand with strategies that improve health and education, reduce inequality, and spur economic growth
– all while tackling climate change and working to preserve our oceans and forests. https://sdgs.un.org/goals
1. Increase speed - we need to be able to process more data than ever before in
less time and decrease latency.
2. No loss of accuracy - we require computers which can process data with high
precision.
3. Decrease cost - we can’t simply add more and more processors and electricity; we need to make computing more affordable and reduce the total cost of ownership (TCO) across both capital and operating expenditure.
4. Reduce energy consumption - we need to process more data while consuming less electricity and producing fewer emissions.
Sparse computing helps with all four of these things. The more urgency we see in addressing the four compute problems, the more of a role there is for Sparse computing as a part of the solution.
This “Moore’s Wall” has been anticipated for some time. We saw that the CPU took
processing capabilities only so far before the GPU (a graphics processing unit)
was developed to handle the higher quantities of data which image processing
required.
The GPU has triumphed since then. But it, in turn, is now running out of the
capacity to keep up with demand.
Simulating the behavior of this virtual neuron involves calculating a dot product
between the two vectors and then applying the activation function f to the
result.
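As a minimal sketch of that computation (using NumPy and illustrative values for the weights, the input and f, none of which come from the whitepaper):

    import numpy as np

    # One virtual neuron with N synaptic weights, responding to one input of length N.
    N = 8
    rng = np.random.default_rng(0)
    w = rng.standard_normal(N)      # synaptic weights (one row of W)
    x = rng.standard_normal(N)      # input activations (one row of X)
    f = np.tanh                     # the activation function f

    y = f(np.dot(w, x))             # dot product, then activation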
A group of M of these neurons, all with the same number of synaptic connections,
could be represented with an M x N matrix where each row is a separate neuron. In
the same way, B different activations, each related to a different stimulus, could be stacked atop one another to form a B x N matrix. Using these groupings, we could compute the responses of a group of M neurons to B different inputs with a single matrix multiplication, W × XT.
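Continuing the sketch above, again with illustrative NumPy values rather than anything Moffett-specific, the grouped computation is a single matrix multiply:

    import numpy as np

    M, B, N = 4, 3, 8               # M neurons, B inputs, N synapses per neuron
    rng = np.random.default_rng(0)
    W = rng.standard_normal((M, N)) # each row is one neuron's weights
    X = rng.standard_normal((B, N)) # each row is one input activation vector

    Y = np.tanh(W @ X.T)            # all M responses to all B inputs, shape (M, B)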
So far, we’ve assumed that every neuron has a connection to every element in the
activation vector. But this is not generally the case, either in biological systems
or artificial ones. It’s much more common for a neuron to have a few strong
synapses and let the rest be zero. Neuron activations tend to be sparse as well.
This means that sparse matrices are a natural way to represent both the W and X
matrices.
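A small illustration of that representation, assuming SciPy's general-purpose sparse matrix types rather than any Moffett-specific format:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    M, B, N = 4, 3, 8

    # Mostly-zero weights and activations (about 90% zeros here, for illustration).
    W = rng.standard_normal((M, N)) * (rng.random((M, N)) < 0.1)
    X = rng.standard_normal((B, N)) * (rng.random((B, N)) < 0.1)

    # Store only the nonzero entries in compressed sparse row (CSR) format.
    W_sp = sparse.csr_matrix(W)
    X_sp = sparse.csr_matrix(X)

    Y = W_sp @ X_sp.T               # sparse product W × XT, shape (M, B)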
Synaptic sparsity
By some estimates, the human brain has 86 billion neurons and 150 trillion
synapses 4. These numbers imply that only about 0.000004% 5 of the possible connections between neurons are actually present. In other words, the connectivity of the brain is roughly 99.999996% sparse. In this regime, the total number of
synapses grows linearly with the number of neurons. Each biological neuron gets
a fixed number of connections and this number doesn’t change even as the total
number of neurons increases. Researchers call this property synaptic sparsity.
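The connectivity figure quoted above follows from simple arithmetic; the short calculation below reproduces footnote 5 using the same estimated counts (86 billion neurons, 150 trillion synapses):

    neurons  = 86e9                             # ~86 billion neurons
    synapses = 150e12                           # ~150 trillion synapses

    possible = neurons * (neurons - 1) / 2      # possible pairwise connections
    density  = synapses / possible              # fraction of connections present
    print(f"connection density ~ {density:.1e}")               # ~4e-08, i.e. ~0.000004%
    print(f"synaptic sparsity  ~ {(1 - density) * 100:.6f}%")  # ~99.999996%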
Activation sparsity
The human brain is not only sparse in synapses; it is also sparse in neuron
activations. The energy consumed by a biological neuron is roughly proportional
to the number of times it fires. So the fewer neurons that fire in the brain, the
less energy it consumes. The brain uses this activation sparsity to save energy.
By contrast, a simulated neuron as described above consumes the same amount
of energy regardless of whether it fires or not. If its output is zero, that zero still
gets multiplied with other numbers.
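A small sketch of the point (illustrative NumPy code, not Moffett's implementation): after a ReLU-style activation most outputs are exactly zero, yet a dense next layer still multiplies every one of them.

    import numpy as np

    rng = np.random.default_rng(0)
    pre = rng.standard_normal(1024) - 1.0    # biased so most pre-activations are negative
    act = np.maximum(pre, 0.0)               # ReLU: negative values become exact zeros

    print(f"{np.mean(act == 0.0):.0%} of activations are zero")

    # A dense next layer still performs every multiply, zeros included.
    W_next = rng.standard_normal((256, 1024))
    y = W_next @ act                         # 256 x 1024 multiplies regardless of the zeros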
Dual sparsity
These two types of sparsity are complementary to one another. Activation
sparsity allows signals to be routed through specific subsets of a network, while
synaptic sparsity keeps those subsets small and efficient. Working together, they
lead to much greater efficiency gains than would be possible if only one were
being used. Researchers suspect that this “dual sparsity” is what permits the
brain to be so efficient.
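One way to see the compounding effect: if only 1/8 of the weights and 1/4 of the activations are nonzero, only about 1/32 of the weight-input products can contribute anything. A counting sketch with illustrative numbers (not a description of how any particular chip schedules the work):

    import numpy as np

    rng = np.random.default_rng(0)
    M, N = 256, 1024

    W = rng.standard_normal((M, N)) * (rng.random((M, N)) < 1 / 8)   # synaptic sparsity: ~1/8 nonzero
    x = rng.standard_normal(N)      * (rng.random(N)      < 1 / 4)   # activation sparsity: ~1/4 nonzero

    useful = np.count_nonzero(W * x)         # weight-input pairs where both are nonzero
    print(f"useful multiplies: {useful / (M * N):.3f} of the dense total")   # ~1/32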
Neural network researchers of the 1990’s were aware of the benefits of sparsity
and put a great deal of effort into sparsifying their models, with approaches such
as weight magnitude regularization and magnitude pruning to achieve high levels
of synaptic sparsity. These works show that sparsity was important even in the
early days of AI.
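Magnitude pruning, mentioned above, is simple to state: zero out the weights whose absolute values are smallest. A minimal sketch of the idea (the threshold choice here is illustrative, not a reconstruction of any specific 1990s method):

    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the smallest-magnitude weights so roughly `sparsity` of them become zero."""
        threshold = np.quantile(np.abs(weights), sparsity)
        return np.where(np.abs(weights) < threshold, 0.0, weights)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))
    W_pruned = magnitude_prune(W, sparsity=0.9)      # keep only the largest ~10% of weights
    print(np.mean(W_pruned == 0.0))                  # ~0.9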
4 Not all scientists will agree with these numbers but “The remarkable, yet not extraordinary, human brain as
a scaled-up primate brain and its associated cost” authored by Suzana Herculano-Houzel and published in the
Proceedings of the National Academy of Sciences of the United States of America, Vol 109 in June 2012, is
a good reference.
5 We obtained this number as follows: fraction of connections present = (observed number of connections) / (possible connections between N neurons) = 150 × 10^12 / (N(N − 1)/2) = 150 × 10^12 / ((86 × 10^9)(86 × 10^9 − 1)/2) ≈ 4 × 10^-8.
It is possible that this was because there were so many other fruitful ways to
improve models. First of all, there was better data. By the late 2000’s, the internet
had exploded in size which made it possible for researchers to construct massive
datasets from publicly available data. Second, computing infrastructure grew
much better. Not only did computers in general improve, but researchers found
that they could massively accelerate their models by putting them on GPUs.
A third important event was the rise of automatic differentiation (autodiff)
frameworks like Theano, TensorFlow, and PyTorch. These frameworks made it
easier to design new models, train them on specialized hardware, and run them on
large datasets.
It’s worth noting that neither the GPUs nor the autodiff frameworks were built
with Sparse computing in mind. And so while they enabled big advances in model
size and architecture, they made it very difficult for researchers to reap rewards
from sparsity. As long as significant progress was happening in other areas, this
was to remain the case. But as the 2010’s drew to a close, questions of energy
efficiency and the compute-vs-accuracy tradeoff became more pressing and
sparsity became much more attractive.
[Figure: Moore's Law is near its physical limits and chip capability increases are slowing down. Annual performance gains of roughly 52%, 25%, 23%, 12% and 3.5% across successive eras from 1978 to 2019, as process nodes shrink from 90nm to 10/7nm.]
[Figure: Timeline of AI, 1942-2020: from the Dartmouth Conference and the first expert system, DENDRAL, through the "AI in early phase", "Neural Network 1.0" and "Neural Network 2.0" eras, to NVIDIA releasing the A100, the first AI chip with a sparsity feature, and Moffett releasing the first sparse processing unit, the SPU. Sparsity is one of the most popular AI research fields; its commercialization has started and it will continue to lead the future.]
6 “Things that deal with sparse parallelism,” said Raja Koduri, Intel’s head of chip architecture, “...will give rise to some new architectural ideas that are very different from what we are doing in vector-matrix, which is very mainstream right now.” Quoted in ZDNet (August 2020): https://www.zdnet.com/article/intel-data-bandwidth-sparsity-are-the-two-biggest-challenges-for-ai-chips
Startups
This task, which requires daring and flexibility, is well suited for startups. Indeed,
some of the best work being done in this area is happening at small companies.
Numenta, a Bay Area startup, recently demonstrated a custom chip with hardware support for sparsity that runs a popular vision architecture 100 times faster than
more traditional chips. Another company, NeuralMagic, offers model sparsification
for shrinking foundation models so that they can run on laptop CPUs instead of
expensive data center GPUs. But in order to realize the full potential of sparsity,
the industry is going to need to design both hardware and software together.
So far, only a few startups have tried to do this. One of the most interesting and
ambitious of these companies is Moffett AI.
7 See “Sparse-TPU: adapting systolic arrays for sparse matrices” authored by Xin He and others in ICS ’20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020. https://dl.acm.org/doi/10.1145/3392717.3392751
8 As just one example, in early 2022, Intel advertised an “Intel Neural Compressor” tool aimed at model sparsification. https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Quantizing-ONNX-Models-using-Intel-Neural-Compressor/post/1355237
Dual sparsity
As discussed earlier, dual sparsity refers to sparsity in both the weights and activations of neural networks. The diagram below gives an intuitive comparison of the differences between dense-dense operations, which the majority of AI chips currently use; dense-sparse operations, which some chips like the NVIDIA A100 offer; and “dual sparse” operations, which Moffett AI supports. One thing to
notice is that using dual sparsity permits researchers to evaluate the products
of much larger matrices while using the same amount of memory, compute, and
energy. In the upper image, Moffett’s approach leads to a substantial speedup.
[Figure: Tensor core comparison. NVIDIA V100 Tensor Core (for both AI & graphics) - programmable, small granularity. NVIDIA A100 Tensor Core - bank-balanced sparsity, bank size 4, supports sparsity up to 1/2. Moffett Sparse Tensor Core - bank-balanced dual sparsity, bank size 64, supports sparsity up to 1/32.]
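The “bank-balanced” pattern in the comparison above means that every bank of consecutive weights keeps a fixed number of nonzeros, which keeps the work evenly distributed across hardware lanes. A rough sketch of the idea in NumPy (illustrative only; this is not Moffett's or NVIDIA's actual pruning tooling):

    import numpy as np

    def bank_balanced_prune(row, bank_size, keep):
        """In every bank of `bank_size` consecutive weights, keep the `keep` largest magnitudes."""
        out = np.zeros_like(row)
        for start in range(0, row.size, bank_size):
            bank = row[start:start + bank_size]
            top = np.argsort(np.abs(bank))[-keep:]   # positions of the largest magnitudes
            out[start:start + bank_size][top] = bank[top]
        return out

    rng = np.random.default_rng(0)
    w = rng.standard_normal(128)
    w_half     = bank_balanced_prune(w, bank_size=4,  keep=2)   # 1/2-density pattern
    w_one_32nd = bank_balanced_prune(w, bank_size=64, keep=2)   # 1/32-density pattern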
The first thing to notice is that Antoum can run sparse models 32x and 8x faster than its two respective baselines. These gains are important because model latency
matters a great deal in real-world settings. For example, a self-driving car needs to
be able to process frames at a rate of at least 20 fps in order to avoid obstacles
while moving at 60 miles per hour. As another example, internet users begin to
lose interest in a webpage if its latency grows beyond a few dozen milliseconds,
meaning that models used to filter information on these pages need to perform
inference in about a millisecond.
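The frame-rate figure above can be sanity-checked with back-of-the-envelope arithmetic; at 60 mph and 20 fps, a vehicle travels a little over a metre between processed frames:

    mph_to_mps = 1609.344 / 3600             # miles per hour -> metres per second

    speed = 60 * mph_to_mps                  # ~26.8 m/s at 60 mph
    fps = 20
    print(f"{speed / fps:.2f} m travelled between frames")   # ~1.34 m per processed frame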
The second benefit of dual sparsity is that it allows us to run higher accuracy models for the same computational budget. Because inference is 8-32x faster on Antoum, it can run 8-32x larger models, which tend to be much more accurate.
A similar line of reasoning is behind the third benefit, which is lower energy consumption. The energy that a chip consumes is roughly proportional to the number of mathematical operations it performs. Since dual sparse hardware ignores all the activations and weights that are set to zero, it saves the energy that would have been used to multiply them to get more zeros. These savings end up being between 8x and 32x as well. And since the cost of running a model scales with the amount of energy it uses, Antoum is 8-32x cheaper to operate.
[Figure: Moffett software stack, showing the dense model, the sparse model, Moffett IR, hardware-level optimization, the TVM runtime and the model executable.]
objects occur in a scene. Given that objects, even common ones like wheels and
eyes, occur infrequently throughout most images, Moffett’s researchers realized
that it is possible to make the channel dimension very sparse. Starting from this
software observation, they adjusted the chip’s physical design, allocating less
processing power for the channel dimension.
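One rough way to reproduce this observation in software is to run feature maps through a ReLU and measure, per channel, how often the activations are zero. The sketch below uses random stand-in data; the real analysis would of course use a trained vision model:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in feature maps: batch of 8 images, 32 channels, 56x56 positions,
    # biased negative so the ReLU leaves most activations at zero.
    features = rng.standard_normal((8, 32, 56, 56)) - 1.5
    act = np.maximum(features, 0.0)                          # ReLU

    channel_sparsity = np.mean(act == 0.0, axis=(0, 2, 3))   # fraction of zeros per channel
    print(channel_sparsity.round(2))                         # most channels are largely zero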
A vision
In 1965, Moore’s Law looked ahead to a world where the future would bring
bring more and more transistors. Gordon Moore wrote: “integrated electronics will
allow the advantages of electronics to be applied generally throughout society”9
and his vision, at that time, was about how economics and manufacturing
technology, yields and physics could be mastered by chip makers.
The vision today in 2022, for the whole Sparse community, and particularly for
Moffett AI, no longer lies with the technology, and certainly not with increases in
yields.
The vision now is to reduce computation through algorithms that change the characteristics of the model itself, and to combine that algorithmic optimization with chip design, as Moffett AI is doing.
The spirit and ambition of Moore is alive and well, but the world of plenty in 1965
has now been replaced by a world of caution and prudence where it is now about
doing less, and using sparse techniques to prune and thin out data. Over time,
not only should all AI processing become sparse, but all processing could be
sparse, and every computer could be a sparse computer.
9 From “Cramming more components onto integrated circuits” by Dr Gordon E Moore, Fairchild Semiconductor
published in Electronics magazine, April 1965.
Closing thoughts
Although the industry has started to put more time and energy into sparsity,
there are many inefficiencies that have yet to be chiseled away. In coming years,
we will need to adapt everything in AI, from chip design to low-level compilers and
CUDA kernels to high-level autodiff frameworks, to better accommodate sparsity.
Companies like Moffett AI are in a good position to lead this revolution. Perhaps
the infrastructure they are building now will, in a few years, be running the most
powerful AI models in the world.
10 Dr Gordon E Moore established the Gordon and Betty Moore Foundation in 2000 to make a “significant and positive impact in the world”, tackling large, important issues at scale, with areas of interest including environmental conservation, scientific research and higher education, having observed changes in the natural world and the dependency of all living species on the planet’s health. https://www.moore.org/about/founders-intent
Appendix
About Moffett AI
Moffett AI (commonly referred to as Moffett) is the world leader in Sparse computing. Its Antoum Chip Architecture and S4, S10 and S30 Accelerators allow data centers to quickly and simply deploy sparse computing techniques in AI applications to increase processing speed, deliver higher accuracy, reduce processing cost and use less energy.
Moffett AI designs and produces the Sparse technology products which allow
AI to solve some of humanity’s greatest computing challenges. Our dream is to
touch and improve every AI application and to embrace failure in the relentless
pursuit of success.
Moffett AI was founded in 2018 and released its first product, an FPGA AI accelerator card, in 2019. It owns over 30 patents related to Sparse computing.
AI cannot work without Sparse computing and Moffett is the world leader in this
field. Moffett’s Sparse computing platforms are needed to help meet humanity’s
most demanding and advanced computing requirements.
Patents
The patents below are assigned to Moffett AI
11113601
Published papers
These are papers co-authored by Moffett founder, Ian E H Yen:
Title: Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm.
Authors: Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao.
In Annual Conference of the North American Chapter of the Association for
Computational Linguistics (NAACL), 2021.
Title: Efficient Global String Kernel with Random Features: Beyond Counting
Substructures.
Authors: Lingfei Wu, Ian En-Hsu Yen, Siyu Huo, Liang Zhao, Kun Xu, Liang Ma, Shouling
Ji and Charu Aggarwal.
In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD),
2019.
Title: Scalable Global Alignment Graph Kernel Using Random Features: From Node
Embedding to Graph Embedding.
Authors: Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang, Kun Xu, Liang Zhao, Xi Peng,
Yinglong Xia and Charu Aggarwal.
In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD),
2019.
Title: Doubly Greedy Primal-Dual Coordinate Methods for Sparse Empirical Risk
Minimization.
Authors: Qi Lei, Ian E.H. Yen, Chao-Yuan Wu, Pradeep Ravikumar and Inderjit Dhillon.
In International Conference on Machine Learning (ICML), 2017.
Title: Greedy Direction Method of Multiplier for MAP Inference of Large Output
Domain.
Authors: Xiangru Huang, Ian E.H. Yen, Ruohan Zhang, Qixing Huang, Pradeep
Ravikumar, and Inderjit Dhillon.
In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2017.
Title: Dual Decomposed Learning with Factorwise Oracles for Structural SVMs of
Large Output Domain.
Authors: Ian E.H. Yen, Xiangru Huang, Kai Zhong, Ruohan Zhang, Pradeep Ravikumar
and Inderjit S. Dhillon. In Advances in Neural Information Processing Systems
(NeurIPS), 2016.
Title: PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and
Multilabel Classification.
Authors: Ian E.H. Yen*, Xiangru Huang*, Kai Zhong, Pradeep Ravikumar and Inderjit
S. Dhillon. (* equally contributed) In International Conference on Machine
Learning (ICML), 2016.
Title: Scalable Exemplar Clustering and Facility Location via Augmented Block
Coordinate Descent with Column Generation.
Authors: Ian E.H. Yen, Dmitry Malioutov, and Abhishek Kumar.
In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2016.
Title: Sparse Linear Programming via Primal and Dual Augmented Coordinate
Descent.
Authors: Ian E.H. Yen, Kai Zhong, Cho-Jui Hsieh, Pradeep Ravikumar and Inderjit S.
Dhillon.
In Advances in Neural Information Processing Systems (NeurIPS), 2015.
Title: Tackling the Achilles Heel of Social Networks: Influence Propagation based
Language Model Smoothing.
Authors: Rui Yan, Ian E.H. Yen, Cheng-Te Li, Shiqi Zhao and Hu Xiaohua.
In International World Wide Web Conference (WWW), 2015.
Title: Optimal Tests of Treatment Effects for the Overall Population and Two
Subpopulations in Randomized Trials, using Sparse Linear Programming.
Authors: Michael Rosenblum, Han Liu, Ian E.H. Yen. In Journal of American Statistical
Association (JASA) (Theory and Methodology), 2014.
Title: Indexed Block Coordinate Descent for Large-Scale Linear Classification with
Limited Memory.
Authors: Ian E.H. Yen, Chun-Fu Chang, Ting-Wei Lin, Shan-Wei Lin, Shou-De Lin.
In ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 2013.
Title: Feature engineering and classifier ensemble for KDD Cup 2010.
Authors: H.F. Yu, H.Y. Lo, H.P Hsieh, J.K. Lou, T.G. McKenzie, J.W. Chou, P.H. Chung, C.H.
Ho, C.F. Chang, Y.H. Wei, J.Y. Weng, En-Hsu Yen, C.W. Chang, T.T. Kuo, Y.C. Lo,
P.T. Chang, C. Po, C.Y. Wang, Y.H. Huang, C.W. Hung, Y.X. Ruan, Y.S. Lin, S.D. Lin,
H.T Lin, C.J. Lin. In Proceedings of the KDD Cup 2010 Workshop, July 2010.
First-place winner report of KDD Cup 2010.