0% found this document useful (0 votes)

43 views6 pages

Mxnet: A Flexible and Efficient Machine Learning Library For Heterogeneous Distributed Systems

MXNet is a multi-language machine learning library that aims to combine the advantages of imperative and declarative programming. It allows users to declaratively define neural network architectures as symbolic expressions while also supporting imperative tensor operations. MXNet is efficient, supports heterogeneous systems from mobile to GPU clusters, and integrates with multiple programming languages including Python, R, Julia and C++.

Uploaded by

Arturo Duque

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views6 pages

Mxnet: A Flexible and Efficient Machine Learning Library For Heterogeneous Distributed Systems

Uploaded by

Arturo Duque

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

MXNet: A Flexible and Efficient Machine Learning

Library for Heterogeneous Distributed Systems

Tianqi Chen, Mu Li∗, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang,
U. Washington CMU Stanford NUS TuSimple NYU
Tianjun Xiao, Bing Xu, Chiyuan Zhang, Zheng Zhang
Microsoft U. Alberta MIT NYU Shanghai

Abstract

MXNet is a multi-language machine learning (ML) library to ease the develop-

ment of ML algorithms, especially for deep neural networks. Embedded in the
host language, it blends declarative symbolic expression with imperative tensor
computation. It offers auto differentiation to derive gradients. MXNet is compu-
tation and memory efficient and runs on various heterogeneous systems, ranging
from mobile devices to distributed GPU clusters.
This paper describes both the API design and the system implementation of
MXNet, and explains how embedding of both symbolic expression and tensor
operation is handled in a unified fashion. Our preliminary experiments reveal
promising results on large scale deep neural network applications using multiple
GPU machines.

1 Introduction

The scale and complexity of machine learning (ML) algorithms are becoming increasingly large.
Almost all recent ImageNet challenge [12] winners employ neural networks with very deep layers,
requiring billions of floating-point operations to process one single sample. The rise of structural and
computational complexity poses interesting challenges to ML system design and implementation.
Most ML systems embed a domain-specific language (DSL) into a host language (e.g. Python, Lua,
C++). Possible programming paradigms range from imperative, where the user specifies exactly
“how” computation needs to be performed, and declarative, where the user specification focuses
on “what” to be done. Examples of imperative programming include numpy and Matlab, whereas
packages such as Caffe, CXXNet program over layer definition which abstracts away and hide the
inner-working of actual implementation. The dividing line between the two can be muddy at times.
Frameworks such as Theano and the more recent Tensorflow can also be viewed as a mixture of both,
they declare a computational graph, yet the computation within the graph is imperatively specified.
Related to the issue of programming paradigms is how the computation is carried out. Execution
can be concrete, where the result is returned right away on the same thread, or asynchronize or
delayed, where the statements are gathered and transformed into a dataflow graph as an intermediate
representation first, before released to available devices. These two execution models have different
implications on how inherent parallelisms are discovered. Concrete execution is restrictive (e.g.
parallelized matrix multiplication), whereas asynchronize/delayed execution additionally identified
all parallelism within the scope of an instance of dataflow graph automatically.
The combination of the programming paradigm and execution model yields a large design space,
some of which are more interesting (and valid) than others. In fact, our team has collectively ex-
plored a number of them, as does the rest of the community. For example, Minerva [14] combines
imperative programming with asynchronize execution. While Theano takes an declarative approach,
∗
Corresponding author ([email protected])

1
Imperative Program Declarative Program
Execute Eagerly compute and store the Return a computation graph; bind data to b
a = b+1 results on a as the same type with b. and do the computation later.
Conceptually straightforward, and Obtain the whole computation graph before
often works seamless with the host execution, beneficial for optimizing the
Advan-
language’s build-in data structures, performance and memory utilization. Also
tages
functions, debugger, and third-party convenient to implement functions such as
libraries. load, save, and visualization.

Table 1: Compare the imperative and declarative for domain specific languages.

System Core Binding Devices Distri- Imperative Declarative

Lang Langs (beyond CPU) buted Program Program
√
Caffe [7] C++ Python/Matlab GPU × ×
√
Torch7 [3] Lua - GPU/FPGA × ×
√
Theano [1] Python - GPU ×
√ × √
TensorFlow [11] C++ Python GPU/Mobile √ ×
√ √
MXNet C++ Python/R/Julia/Go GPU/Mobile

Table 2: Compare to other popular open-source ML libraries

enabling more global graph-aware optimization. Similar discipline was adopted in Purine2 [10]. In-
stead, CXXNet adopts declarative programming (over tensor abstraction) and concrete execution,
similar to Caffe [7]. Table 1 gives more examples.
Our combined new effort resulted in MXNet (or “mix-net”), intending to blend advantages of differ-
ent approaches. Declarative programming offers clear boundary on the global computation graph,
discovering more optimization opportunity, whereas imperative programs offers more flexibility. In
the context of deep learning, declarative programming is useful in specifying the computation struc-
ture in neural network configurations, while imperative programming are more natural for parameter
updates and interactive debugging. We also took the effort to embed into multiple host languages,
including C++, Python, R, Go and Julia.
Despite the support of multiple languages and combination of different programming paradigm, we
are able to fuse the execution to the same backend engine. The engine tracks data dependencies
across computation graphs and imperative operations, and schedules them efficiently jointly. We ag-
gressively reduce memory footprint, performing in-place update and memory space reuse whenever
possible. Finally, we designed a compact communication API so that a MXNet program runs on
multiple machines with little change.
Comparing to other open-source ML systems, MXNet provides a superset programming interface
to Torch7 [3], Theano [1], Chainer [5] and Caffe [7], and supports more systems such as GPU clus-
ters. MXNet is close to TensorFlow [11] but can additionally embed imperative tensor operations.
MXNet is lightweight, e.g. the prediction codes fit into a single 50K lines C++ source file with
no other dependency, and has more languages supports. More detailed comparisons are shown in
Table 2.

C/C++ Python R ... Julia

2 Programming Interface
Symbolic Expr
ND KV
2.1 Symbol: Declarative Symbolic Expressions Array Store
Binder

MXNet uses multi-output symbolic expressions, Symbol, declare BLAS Dep Engine Comm

the computation graph. Symbols are composited by operators, such CPU GPU Android ... iOS
as simple matrix operations (e.g. “+”), or a complex neural network
layer (e.g. convolution layer). An operator can take several input Figure 1: MXNet Overview
variables, produce more than one output variables, and have internal
state variables. A variable can be either free, which we can bind with value later, or an output of
another symbol. Figure 2 shows the construction of a multi-layer perception symbol by chaining a
variable , which presents the input data, and several layer operators.

2
using MXNet >>> import mxnet as mx
mlp = @mx.chain mx.Variable(:data) => >>> a = mx.nd.ones((2, 3),
mx.FullyConnected(num_hidden=64) => ... mx.gpu())
mx.Activation(act_type=:relu) => >>> print (a * 2).asnumpy()
mx.FullyConnected(num_hidden=10) => [[ 2. 2. 2.]
mx.Softmax() [ 2. 2. 2.]]

Figure 2: Symbol expression construction in Julia. Figure 3: NDArray interface in Python

To evaluate a symbol we need to bind the free variables with data and declare the required outputs.
Beside evaluation (“forward”), a symbol supports auto symbolic differentiation (“backward”). Other
functions, such as load, save, memory estimation, and visualization, are also provided for symbols.

2.2 NDArray: Imperative Tensor Computation

MXNet offers NDArray with imperative tensor computation to fill the gap between the declarative
symbolic expression and the host language. Figure 3 shows an example which does matrix-constant
multiplication on GPU and then prints the results by numpy.ndarray.
NDArray abstraction works seamlessly with the executions declared by Symbol, we can mix the
imperative tensor computation of the former with the latter. For example, given a symbolic neural
network and the weight updating function, e.g. w = w − ηg. Then we can implement the gradient
descent by
while(1) { net.foward_backward(); net.w -= eta * net.g };
The above is as efficient as the implementation using a single but often much more complex symbolic
expression. The reason is that MXNet uses lazy evaluation of NDArray and the backend engine
can correctly resolve the data dependency between the two.

2.3 KVStore: Data Synchronization Over Devices

The KVStore is a distributed key-value store for data synchronization over multiple devices. It
supports two primitives: push a key-value pair from a device to the store, and pull the value on
a key from the store. In addition, a user-defined updater can specify how to merge the pushed
value. Finally, model divergence is controlled via consistency model [8]. Currently, we support the
sequential and eventual consistency.
The following example implements the distributed gradient descent by data parallelization.
while(1){ kv.pull(net.w); net.foward_backward(); kv.push(net.g); }
where the weight updating function is registered to the KVStore, and each worker repeatedly pulls
the newest weight from the store and then pushes out the locally computed gradient.
The above mixed implementation has the same performance comparing to a single declarative pro-
gram, because the actual data push and pull are executed by lazy evaluation, which are scheduled by
the backend engine just like others.

2.4 Other Modules

MXNet ships with tools to pack arbitrary sized examples into a single compact file to facilitate both
sequential and random seek. Data iterators are also provided. Data pre-fetching and pre-processing
are multi-threaded, reducing overheads due to possible remote file store reads and/or image decoding
and transformation.
The training module implements the commonly used optimization algorithms, such as stochastic
gradient descent. It trains a model on a given symbolic module and data iterators, optionally dis-
tributedly if an additional KVStore is provided.

3
3 Implementation W b ∂b ∂W
X ∂X
3.1 Computation Graph
fullc ∂ fullc

A binded symbolic expression is presented as a computation

relu ∂ relu
graph for evaluation. Figure 4 shows a part of the graph of
both forward and backward of the MLP symbol in Figure 2.
Before evaluation, MXNet transforms the graph to optimize Figure 4: Computation graph for
the efficiency and allocate memory to internal variables. both forward and backward.
Graph Optimization. We explore the following straightforward optimizations. We note first that
only the subgraph required to obtain the outputs specified during binding is needed. For example,
in prediction only the forward graph is needed, while for extracting features from internal layers,
the last layers can be skipped. Secondly, operators can be grouped into a single one. For example,
a × b + 1 is replaced by a single BLAS or GPU call. Finally, we manually implemented well-
optimized “big” operations, such as a layer in neural network.
Memory Allocation. Note that each variable’s life time, namely the period between the creation
and the last time will be used, is known for a computation graph. So we can reuse memory for non-
intersected variables. However, an ideal allocation strategy requires O(n2 ) time complexity, where
n is the number of variables.
We proposed two heuristics strategies with linear time complexity. The first, called inplace, simu-
lates the procedure of traversing the graph, and keeps a reference counter of depended nodes that are
not used so far. If the counter reaches zero, the memory is recycled. The second, named co-share,
allows two nodes to share a piece of memory if only if they cannot be run in parallel. Exploring
co-share imposes one additional dependency constraint. In particular, each time upon scheduling,
among the pending paths in the graph, we find the longest path and perform needed memory alloca-
tions.

3.2 Dependency Engine

In MXNet, each source units, including NDArray, random number generator and temporal space,
is registered to the engine with a unique tag. Any operations, such as a matrix operation or data com-
munication, is then pushed into the engine with specifying the required resource tags. The engine
continuously schedules the pushed operations for execution if dependencies are resolved. Since
there usually exists multiple computation resources such as CPUs, GPUs, and the memory/PCIe
buses, the engine uses multiple threads to scheduling the operations for better resource utilization
and parallelization.
Different to most dataflow engines [14], our engine tracks mutation operations as an existing re-
source unit. That is, ours supports the specification of the tags that a operation will write in addition
to read. This enables scheduling of array mutations as in numpy and other tensor libraries. It also
enables easier memory reuse of parameters, by representing parameter updates as mutating the pa-
rameter arrays. It also makes scheduling of some special operations easier. For example, when
generating two random numbers with the same random seed, we can inform the engine they will
write the seed so that they should not be executed in parallel. This helps reproducibility.

3.3 Data Communication

We implemented KVStore based on the parameter server [8, 9,

4](Figure 5). It differs to previous works in two aspects: First, we
use the engine to schedule the KVStore operations and manage
the data consistency. The strategy not only makes the data synchro-
nization works seamless with computation, and also greatly simpli- dev0 dev1 dev0 dev1
fies the implementation. Second, we adopt an two-level structure. worker0 worker1
A level-1 server manages the data synchronization between the de-
vices within a single machine, while a level-2 server manages inter- Figure 5: Communication.
machine synchronization. Outbound data from a level-1 server can

4
1000
TensorFlow 8 naive 8 naive
Caffe inplace inplace
800 co-share
Torch7 co-share

memory (GB)

memory (GB)
6 inplace & co-share 6 inplace & co-share
Time (ms) MXNet
600
4 4
400
2 2
200

0 0 0
alexnet googlenet vgg alexnet googlenet vgg alexnet googlenet vgg

Figure 6: Compare MXNet Figure 7: Internal memory usage of MXNet under various al-
to others on a single forward- location strategies for only forward (left) and forward-backward
backward performance. (right) with batch size 64.
be aggregated, reducing bandwidth requirement; intra- and inter-machine synchronization can use
different consistency model (e.g. intra- is sequential and inter- is eventual).

4 Evaluation
Raw performance We fist compare MXNet with Torch7, Caffe, and TensorFlow on the popular
“convnet-benchmarks” [2]. All these systems are compiled with CUDA 7.5 and CUDNN 3 except
for TensorFlow, which only supports CUDA 7.0 and CUDNN 2. We use batch size 32 for all
networks and run the experiments on a single Nvidia GTX 980 card. Results are shown in Figure 6.
As expected that MXNet has similar performance comparing to Torch7 and Caffe, because most
computations are spent on the CUDA/CUDNN kernels. TensorFlow is always 2x slower, which
might be due its use of a lower CUDNN version.

Memory usage Figure 7 shows the memory usages of the internal variables excepts for the out-
puts. As can be seen, both “inplace” and “co-share” can effective reduce the memory footprint.
Combing them leads to a 2x reduction for all networks during model training, and further improves
to 4x for model prediction. For instance, even for the most expensive VGG net, training needs less
than 16MB extra.

Scalability We run the experiment on Amazon EC2

g2.8x instances, each of which is shipped with 0.6
four Nvidia GK104 GPUs and 10G Ethernet. We
test accuracy

0.5
train googlenet with batch normalization [6] on the
ILSVRC12 dataset [13] which consists of 1.3 million 0.4
images and 1,000 classes. We fix the learning rate to
.05, momentum to .9, weight decay to 0.05, and feed 0.3
each GPU with 36 images in one batch. 1 machine
0.2 10 machines
The convergence results are shown in Figure 8. As can
be seen, comparing to single machine, the distributed 0 5 10 15 20
training converges slower at the beginning, but outper- data pass
forms after 10 data passes. The average cost of a data
pass is 14K and 1.4K sec on a single machine and 10 Figure 8: Progress of googlenet on
machines, respectively. Consequently, this experiment ILSVRC12 dataset on 1 and 10 machines.
reveals a super-linear speedup.

5 Conclusion
MXNet is a machine learning library combining symbolic expression with tensor computation to
maximize efficiency and flexibility. It is lightweight and embeds in multiple host languages, and can
be run in a distributed setting. Experimental results are encouraging. While we continue to explore
new design choices, we believe it can already benefit the relevant research community. The codes
are available at http://dmlc.io.
Acknowledgment. We sincerely thanks Dave Andersen, Carlos Guestrin, Tong He, Chuntao Hong,
Qiang Kou, Hu Shiwen, Alex Smola, Junyuan Xie, Dale Schuurmans and all other contributors.

5
References
[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud
Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features
and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
[2] Soumith Chintala. Easy benchmarking of all public open-source implementations of convnets,
2015. https://github.com/soumith/convnet-benchmarks.
[3] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like envi-
ronment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376,
2011.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior,
P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information
Processing Systems, 2012.
[5] Chainer Developers. Chainer: A powerful, flexible, and intuitive framework of neural net-
works, 2015. http://chainer.org/.
[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,
Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature
embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–
678. ACM, 2014.
[8] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Amhed, V. Josifovski, J. Long, E. Shekita, and
B. Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[9] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine
learning with the parameter server. In Neural Information Processing Systems, 2014.
[10] Min Lin, Shuo Li, Xuan Luo, and Shuicheng Yan. Purine: A bi-graph based deep learning
framework. arXiv preprint arXiv:1412.6249, 2014.
[11] Abadi Martn, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,
Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden,
Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale
machine learning on heterogeneous systems. 2015.
[12] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual
recognition challenge. International Journal of Computer Vision, pages 1–42, 2014.
[14] Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang, Chuntao Hong, and Zheng Zhang.
Minerva: A scalable and highly efficient training platform for deep learning, 2014.

Parallel Programming
100% (2)
Parallel Programming
410 pages
TensorFlow On Cloud
No ratings yet
TensorFlow On Cloud
13 pages
Deploy Machine Learning Models
100% (1)
Deploy Machine Learning Models
45 pages
CISCO Data Center Architecture Overview
No ratings yet
CISCO Data Center Architecture Overview
10 pages
AIM301 Deep Learning With TensorFlow PyTorch and MXNet On AWS
No ratings yet
AIM301 Deep Learning With TensorFlow PyTorch and MXNet On AWS
29 pages
MLPerf - Vision Behind MLPerf
No ratings yet
MLPerf - Vision Behind MLPerf
62 pages
Unit - 3
No ratings yet
Unit - 3
42 pages
L6 Hardware and Software For DL en
No ratings yet
L6 Hardware and Software For DL en
66 pages
Real-Time Machine Learning: The Missing Pieces
No ratings yet
Real-Time Machine Learning: The Missing Pieces
6 pages
T F E: A - , P - DSL: Ensor LOW Ager Multi Stage Ython Embedded FOR Machine Learning
No ratings yet
T F E: A - , P - DSL: Ensor LOW Ager Multi Stage Ython Embedded FOR Machine Learning
12 pages
Mxnet Documentation: Release 0.0.8
No ratings yet
Mxnet Documentation: Release 0.0.8
93 pages
Kerascv and Kerasnlp: Multi-Framework Models: Lead Authors
No ratings yet
Kerascv and Kerasnlp: Multi-Framework Models: Lead Authors
10 pages
Gen: A General-Purpose Probabilistic Programming System With Programmable Inference
No ratings yet
Gen: A General-Purpose Probabilistic Programming System With Programmable Inference
14 pages
C13-Computational Performance
No ratings yet
C13-Computational Performance
45 pages
13-Gradient Descent With Momentum-08!08!2024
No ratings yet
13-Gradient Descent With Momentum-08!08!2024
26 pages
Compiling ONNX Neural Network Models Using Mlir
No ratings yet
Compiling ONNX Neural Network Models Using Mlir
8 pages
Gluon Tutorials: Deep Learning - The Straight Dope
No ratings yet
Gluon Tutorials: Deep Learning - The Straight Dope
403 pages
Flux (Machine-Learning Framework)
No ratings yet
Flux (Machine-Learning Framework)
3 pages
Chainer
No ratings yet
Chainer
3 pages
C6-Builders Guide
No ratings yet
C6-Builders Guide
26 pages
Running Markov Chain Monte Carlo On Modern Hardware and Software
No ratings yet
Running Markov Chain Monte Carlo On Modern Hardware and Software
26 pages
A Comparative Study of Deep Learning
No ratings yet
A Comparative Study of Deep Learning
6 pages
The Deep Learning Compiler: A Comprehensive Survey
No ratings yet
The Deep Learning Compiler: A Comprehensive Survey
20 pages
SAP Collections Management
100% (1)
SAP Collections Management
37 pages
Machine Learning Assignment-1
No ratings yet
Machine Learning Assignment-1
7 pages
Deep Learning Library PDF
No ratings yet
Deep Learning Library PDF
12 pages
Jittor
No ratings yet
Jittor
21 pages
S06 DNN Tensorflow PyTorch Wip
No ratings yet
S06 DNN Tensorflow PyTorch Wip
24 pages
Unit-3 Aiml
No ratings yet
Unit-3 Aiml
10 pages
What Is TensorFlow
No ratings yet
What Is TensorFlow
38 pages
Large-Scale Deep Learning With Tensorflow: Jeff Dean Google Brain Team
No ratings yet
Large-Scale Deep Learning With Tensorflow: Jeff Dean Google Brain Team
119 pages
Graph Ene
No ratings yet
Graph Ene
12 pages
Pytorch Paper
No ratings yet
Pytorch Paper
12 pages
Assignment1 Blog
No ratings yet
Assignment1 Blog
4 pages
Exocompilation For Productive Programming of Hardware Accelerators
No ratings yet
Exocompilation For Productive Programming of Hardware Accelerators
16 pages
Autograph - Imperative-Style Coding With Graph-Based Performance
No ratings yet
Autograph - Imperative-Style Coding With Graph-Based Performance
18 pages
15 ML
No ratings yet
15 ML
60 pages
Let Us Code: Using Deep Learning Through A Library
No ratings yet
Let Us Code: Using Deep Learning Through A Library
17 pages
Declarative Machine Learning Systems
No ratings yet
Declarative Machine Learning Systems
11 pages
Matlab Matlab Toolbox Deep Learning Toolbox Neural Network Toolbox Libraries Functions How To Use
No ratings yet
Matlab Matlab Toolbox Deep Learning Toolbox Neural Network Toolbox Libraries Functions How To Use
5 pages
GENAI Questions
No ratings yet
GENAI Questions
14 pages
ML Unit-5
No ratings yet
ML Unit-5
19 pages
25-04 Gpu Programming Without Cuda
No ratings yet
25-04 Gpu Programming Without Cuda
38 pages
ML Libraries PPT (3.3)
No ratings yet
ML Libraries PPT (3.3)
10 pages
Department of Computer Science Vidyasagar University: Paschim Medinipur - 721102
No ratings yet
Department of Computer Science Vidyasagar University: Paschim Medinipur - 721102
26 pages
Image Rotation Using CUDA
No ratings yet
Image Rotation Using CUDA
18 pages
DescribeML A Dataset Description Tool For Machine Learning
No ratings yet
DescribeML A Dataset Description Tool For Machine Learning
5 pages
VLIW Architecture
No ratings yet
VLIW Architecture
53 pages
cs9227 Operating System Lab Manual
No ratings yet
cs9227 Operating System Lab Manual
60 pages
AdvNative Week 4 - MVVM
No ratings yet
AdvNative Week 4 - MVVM
69 pages
Lecture 02
No ratings yet
Lecture 02
31 pages
William Stallings Computer Organization and Architecture 8 Edition Instruction Level Parallelism and Superscalar Processors
No ratings yet
William Stallings Computer Organization and Architecture 8 Edition Instruction Level Parallelism and Superscalar Processors
50 pages
Unit 1
No ratings yet
Unit 1
31 pages
Different Types of Computers
No ratings yet
Different Types of Computers
9 pages
Paper Presentation Schedule - ICACCT2012
No ratings yet
Paper Presentation Schedule - ICACCT2012
14 pages
Parallel Database Systems and Their Architecture
No ratings yet
Parallel Database Systems and Their Architecture
17 pages
Introduction To Parallel Programming - Student Workbook With Instructor's Notes PDF
No ratings yet
Introduction To Parallel Programming - Student Workbook With Instructor's Notes PDF
33 pages
A Cluster Computer and Its Architecture
No ratings yet
A Cluster Computer and Its Architecture
3 pages
Cbca2103 SG
No ratings yet
Cbca2103 SG
63 pages
Exam Practice Exercise For AOS
No ratings yet
Exam Practice Exercise For AOS
3 pages
Runtime Environment Routines
No ratings yet
Runtime Environment Routines
10 pages
HPC Int2 Key
No ratings yet
HPC Int2 Key
10 pages
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
No ratings yet
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
5 pages
Computer Organization: Bus Structures
No ratings yet
Computer Organization: Bus Structures
4 pages
Assignment-2 Ami Pandat Parallel Processing: Time Complexity
No ratings yet
Assignment-2 Ami Pandat Parallel Processing: Time Complexity
12 pages
Objective and Subjective Question For PERFORMANCE ISSUES in Computer Architecture
No ratings yet
Objective and Subjective Question For PERFORMANCE ISSUES in Computer Architecture
7 pages
Code Transformation
No ratings yet
Code Transformation
6 pages
Super Computers
No ratings yet
Super Computers
4 pages
Apache Spark - Executors - How Many Tasks Can My Cluster Run in Parallel - by Swetha Murali - Medium
No ratings yet
Apache Spark - Executors - How Many Tasks Can My Cluster Run in Parallel - by Swetha Murali - Medium
8 pages
PIPELINED MULTIPLIER Project Idea
No ratings yet
PIPELINED MULTIPLIER Project Idea
9 pages
Cheat Sheet - Programming Language
No ratings yet
Cheat Sheet - Programming Language
2 pages
Implementation of 8x8 Vedic Multiplier Using Verilog
No ratings yet
Implementation of 8x8 Vedic Multiplier Using Verilog
8 pages
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
C Programming: Core Concepts and Techniques
From Everand
C Programming: Core Concepts and Techniques
William Smith
No ratings yet
Learn Java Programming in 24 Hours
From Everand
Learn Java Programming in 24 Hours
PublishDrive
No ratings yet
The Complete C++ Programming Guide
From Everand
The Complete C++ Programming Guide
gareth thomas
No ratings yet
Mastering C: Advanced Techniques and Tricks
From Everand
Mastering C: Advanced Techniques and Tricks
Ted Norice
No ratings yet
Learn C++
From Everand
Learn C++
Aishik Dutta
No ratings yet
Learn Multithreading with Modern C++
From Everand
Learn Multithreading with Modern C++
James Raynard
No ratings yet
C++ VS JAVA A PERFORMANCE DEEPDIVE: Unraveling the Performance Characteristics of C++ and Java for High-Performance Computing
From Everand
C++ VS JAVA A PERFORMANCE DEEPDIVE: Unraveling the Performance Characteristics of C++ and Java for High-Performance Computing
Manoj R Chakravarthi
No ratings yet
"C Programming for Beginners: A Step-by-Step Guide"
From Everand
"C Programming for Beginners: A Step-by-Step Guide"
Lov kush
No ratings yet
C# For Beginners: An Introduction to C# Programming with Tutorials and Hands-On Examples
From Everand
C# For Beginners: An Introduction to C# Programming with Tutorials and Hands-On Examples
Nathan Metzler
5/5 (1)
C + +: C++ programming
From Everand
C + +: C++ programming
Ummed Singh
No ratings yet
Mastering Data Structures and Algorithms in Python & Java
From Everand
Mastering Data Structures and Algorithms in Python & Java
Sachin Naha
No ratings yet
Ian Talks Python A-Z
From Everand
Ian Talks Python A-Z
Ian Eress
No ratings yet
C# Fundamentals Made Simple: A Practical Guide with Examples
From Everand
C# Fundamentals Made Simple: A Practical Guide with Examples
William E. Clark
No ratings yet
Introduction to Programming Languages
From Everand
Introduction to Programming Languages
IntroBooks Team
4/5 (1)
Mastering Python Programming: A Comprehensive Guide: The IT Collection
From Everand
Mastering Python Programming: A Comprehensive Guide: The IT Collection
Christopher Ford
5/5 (1)
C++ Learn in 24 Hours
From Everand
C++ Learn in 24 Hours
Alex Nordeen
No ratings yet
Code Beneath the Surface: Mastering Assembly Programming
From Everand
Code Beneath the Surface: Mastering Assembly Programming
Kameron Hussain
No ratings yet

Mxnet: A Flexible and Efficient Machine Learning Library For Heterogeneous Distributed Systems

Uploaded by

Mxnet: A Flexible and Efficient Machine Learning Library For Heterogeneous Distributed Systems

Uploaded by

MXNet: A Flexible and Efficient Machine Learning

Library for Heterogeneous Distributed Systems

MXNet is a multi-language machine learning (ML) library to ease the develop-

System Core Binding Devices Distri- Imperative Declarative

Table 2: Compare to other popular open-source ML libraries

C/C++ Python R ... Julia

Figure 2: Symbol expression construction in Julia. Figure 3: NDArray interface in Python

2.2 NDArray: Imperative Tensor Computation

2.3 KVStore: Data Synchronization Over Devices

2.4 Other Modules

A binded symbolic expression is presented as a computation

3.2 Dependency Engine

3.3 Data Communication

We implemented KVStore based on the parameter server [8, 9,

Scalability We run the experiment on Amazon EC2

You might also like