Array Programming with NumPy: Review
Two Python array packages existed before NumPy. The Numeric package was developed in the mid-1990s and provided array objects and array-aware functions in Python. It was written in C and linked to standard fast implementations of linear algebra3,4. One of its earliest uses was to steer C++ applications for inertial confinement fusion research at Lawrence Livermore National Laboratory5. To handle large astronomical images coming from the Hubble Space Telescope, a reimplementation of Numeric, called Numarray, added support for structured arrays, flexible indexing, memory mapping, byte-order variants, more efficient memory use, flexible IEEE 754-standard error-handling capabilities, and better type-casting rules6. Although Numarray was highly compatible with Numeric, the two packages had enough differences that it divided the community; however, in 2005 NumPy emerged as a ‘best of both worlds’ unification7—combining the features of Numarray with the small-array performance of Numeric and its rich C API.

Now, 15 years later, NumPy underpins almost every Python library that does scientific or numerical computation8–11, including SciPy12, Matplotlib13, pandas14, scikit-learn15 and scikit-image16. NumPy is a community-developed, open-source library, which provides a multidimensional Python array object along with array-aware functions that operate on it. Because of its inherent simplicity, the NumPy array is the de facto exchange format for array data in Python.

NumPy operates on in-memory arrays using the central processing unit (CPU). To utilize modern, specialized storage and hardware, there has been a recent proliferation of Python array packages. Unlike with the Numarray–Numeric divide, it is now much harder for these new libraries to fracture the user community—given how much work is already built on top of NumPy. However, to provide the community with access to new and exploratory technologies, NumPy is transitioning into a central coordinating mechanism that specifies a well defined array programming API and dispatches it, as appropriate, to specialized array implementations.

NumPy arrays
The NumPy array is a data structure that efficiently stores and accesses multidimensional arrays17 (also known as tensors), and enables a wide variety of scientific computation. It consists of a pointer to memory, along with metadata used to interpret the data stored there, notably ‘data type’, ‘shape’ and ‘strides’ (Fig. 1a).
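As a minimal sketch of this metadata (an illustration assuming a 4 × 3 array of 8-byte floating-point numbers stored in C, that is row-major, order; not code from the paper), the data type, shape and strides are available as array attributes:

    import numpy as np

    x = np.zeros((4, 3), dtype=np.float64)  # 4 rows and 3 columns of 8-byte floats, C (row-major) order
    print(x.dtype)    # float64
    print(x.shape)    # (4, 3)
    print(x.strides)  # (24, 8): 24 bytes to step to the next row, 8 bytes to the next column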
Fig. 1 | The NumPy array incorporates several fundamental array concepts. a, The NumPy array data structure and its associated metadata fields. b, Indexing an array with slices and steps. These operations return a ‘view’ of the original data. Slices are start:end:step, any of which can be left blank. c, Indexing an array with masks, scalar coordinates or other arrays, so that it returns a ‘copy’ of the original data. In the bottom example, an array is indexed with other arrays; this broadcasts the indexing arguments before performing the lookup. d, Vectorization efficiently applies operations to groups of elements. e, Broadcasting in the multiplication of two-dimensional arrays. f, Reduction operations act along one or more axes. In this example, an array is summed along select axes to produce a vector, or along two axes consecutively to produce a scalar. g, Example NumPy code, illustrating some of these concepts.
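The session shown in Fig. 1g can be approximated as follows; this is a sketch reconstructed from the outputs visible in the figure, and the published figure may use slightly different statements:

    import numpy as np

    x = np.arange(12).reshape(4, 3)  # the 4 x 3 array of integers 0-11 shown in the figure
    print(x)                         # [[ 0  1  2] [ 3  4  5] [ 6  7  8] [ 9 10 11]]
    print(np.mean(x, axis=0))        # column means: [4.5 5.5 6.5]

    x = x - np.mean(x, axis=0)       # broadcasting subtracts the column means from every row
    print(x)                         # rows now run from [-4.5 -4.5 -4.5] to [4.5 4.5 4.5]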
The data type describes the nature of elements stored in an array. An array has a single data type, and each element of an array occupies the same number of bytes in memory. Examples of data types include real and complex numbers (of lower and higher precision), strings, timestamps and pointers to Python objects.

The shape of an array determines the number of elements along each axis, and the number of axes is the dimensionality of the array. For example, a vector of numbers can be stored as a one-dimensional array of shape N, whereas colour videos are four-dimensional arrays of shape (T, M, N, 3).

Strides are necessary to interpret computer memory, which stores elements linearly, as multidimensional arrays. They describe the number of bytes to move forward in memory to jump from row to row, column to column, and so forth. Consider, for example, a two-dimensional array of floating-point numbers with shape (4, 3), where each element occupies 8 bytes in memory. To move between consecutive columns, we need to jump forward 8 bytes in memory, and to access the next row, 3 × 8 = 24 bytes. The strides of that array are therefore (24, 8). NumPy can store arrays in either C or Fortran memory order, iterating first over either rows or columns. This allows external libraries written in those languages to access NumPy array data in memory directly.

Users interact with NumPy arrays using ‘indexing’ (to access subarrays or individual elements), ‘operators’ (for example, +, − and × for vectorized operations and @ for matrix multiplication), as well as ‘array-aware functions’; together, these provide an easily readable, expressive, high-level API for array programming while NumPy deals with the underlying mechanics of making operations fast.

Indexing an array returns single elements, subarrays or elements that satisfy a specific condition (Fig. 1b). Arrays can even be indexed using other arrays (Fig. 1c). Wherever possible, indexing that retrieves a subarray returns a ‘view’ on the original array such that data are shared between the two arrays. This provides a powerful way to operate on subsets of array data while limiting memory usage.

To complement the array syntax, NumPy includes functions that perform vectorized calculations on arrays, including arithmetic, statistics and trigonometry (Fig. 1d). Vectorization—operating on entire arrays rather than their individual elements—is essential to array programming. This means that operations that would take many tens of lines to express in languages such as C can often be implemented as a single, clear Python expression. This results in concise code and frees users to focus on the details of their analysis, while NumPy handles looping over array elements near-optimally—for example, taking strides into consideration to best utilize the computer’s fast cache memory.

When performing a vectorized operation (such as addition) on two arrays with the same shape, it is clear what should happen. Through ‘broadcasting’ NumPy allows the dimensions to differ, and produces results that appeal to intuition. A trivial example is the addition of a scalar value to an array, but broadcasting also generalizes to more complex examples such as scaling each column of an array or generating a grid of coordinates. In broadcasting, one or both arrays are virtually duplicated (that is, without copying any data in memory), so that the shapes of the operands match (Fig. 1d). Broadcasting is also applied when an array is indexed using arrays of indices (Fig. 1c).

Other array-aware functions, such as sum, mean and maximum, perform element-by-element ‘reductions’, aggregating results across one, multiple or all axes of a single array. For example, summing an n-dimensional array over d axes results in an array of dimension n − d (Fig. 1f).

NumPy also includes array-aware functions for creating, reshaping, concatenating and padding arrays; searching, sorting and counting data; and reading and writing files. It provides extensive support for generating pseudorandom numbers, includes an assortment of probability distributions, and performs accelerated linear algebra, using one of several backends such as OpenBLAS18,19 or Intel MKL optimized for the CPUs at hand (see Supplementary Methods for more details).

Altogether, the combination of a simple in-memory array representation, a syntax that closely mimics mathematics, and a variety of array-aware utility functions forms a productive and powerfully expressive array programming language.
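The behaviours described above can be demonstrated in a few lines. The following is an illustrative sketch (not taken from the paper’s figures) of view-versus-copy indexing, broadcasting and reduction over one or more axes:

    import numpy as np

    x = np.arange(12).reshape(4, 3)

    view = x[:, 1:]             # slicing returns a view: no data are copied
    view[0, 0] = 100            # ...so writing through the view also changes x
    copy = x[x > 9]             # boolean-mask indexing returns a copy of the selected elements

    col_means = x.mean(axis=0)  # reduction over axis 0: one value per column, shape (3,)
    centred = x - col_means     # broadcasting: the (3,) means are subtracted from each of the 4 rows
    total = x.sum(axis=(0, 1))  # reducing over both axes produces a scalar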
[Fig. 2 diagram: a stack built on NumPy, with foundation libraries SciPy (algorithms) and Matplotlib (plots); pandas and statsmodels (statistics) and NetworkX (network analysis); and technique-specific libraries scikit-learn (machine learning) and scikit-image (image processing).]

Fig. 2 | NumPy is the base of the scientific Python ecosystem. Essential libraries and projects that depend on NumPy’s API gain access to new array implementations that support NumPy’s array protocols (Fig. 3).
[Fig. 3 diagram: the code of Fig. 1g executed on a Dask array; the result is again a Dask array (dask.array<..., shape=(4, 3), ...>), obtained through the NumPy array protocols.]

Fig. 3 | NumPy’s API and array protocols expose new arrays to the ecosystem. In this example, NumPy’s ‘mean’ function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (in this case, Dask) and results in a new Dask array. Compare this code to the example code in Fig. 1g.
on code, and continuous testing that runs an extensive battery of automated tests for every proposed change to NumPy. The project also has comprehensive, high-quality documentation, integrated with the source code31–33.

This culture of using best practices for producing reliable scientific software has been adopted by the ecosystem of libraries that build on NumPy. For example, in a recent award given by the Royal Astronomical Society to Astropy, they state: “The Astropy Project has provided hundreds of junior scientists with experience in professional-standard software development practices including use of version control, unit testing, code review and issue tracking procedures. This is a vital skill set for modern researchers that is often missing from formal university education in physics or astronomy”34. Community members explicitly work to address this lack of formal education through courses and workshops35–37.

The recent rapid growth of data science, machine learning and artificial intelligence has further and dramatically boosted the scientific use of Python. Examples of its important applications, such as the eht-imaging library, now exist in almost every discipline in the natural and social sciences. These tools have become the primary software environment in many fields. NumPy and its ecosystem are commonly taught in university courses, boot camps and summer schools, and are the focus of community conferences and workshops worldwide. NumPy and its API have become truly ubiquitous.

Array proliferation and interoperability
NumPy provides in-memory, multidimensional, homogeneously typed (that is, single-pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world’s largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific datasets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep-learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs) and field-programmable gate arrays (FPGAs). Owing to its in-memory data model, NumPy is currently unable to directly utilize such storage and specialized hardware. However, both distributed data and the parallel execution of GPUs, TPUs and FPGAs map well to the paradigm of array programming, leading to a gap between available modern hardware architectures and the tools necessary to leverage their computational power.

The community’s efforts to fill this gap led to a proliferation of new array implementations. For example, each deep-learning framework created its own arrays; the PyTorch38, Tensorflow39, Apache MXNet40 and JAX arrays all have the capability to run on CPUs and GPUs in a distributed fashion, using lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays, which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on NumPy arrays as data containers, and extend its capabilities. Distributed arrays are made possible that way by Dask, and labelled arrays—referring to dimensions of an array by name rather than by index for clarity, compare x[:, 1] versus x.loc[:, 'time']—by xarray41.

Such libraries often mimic the NumPy API, because this lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms such as the divergence between Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries (such as Theano and Caffe) have already ceased development. And each time that a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once, and would then benefit from switching between NumPy arrays, GPU arrays, distributed arrays and so forth as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well specified API (Fig. 2).

To facilitate this interoperability, NumPy provides ‘protocols’ (or contracts of operation) that allow specialized arrays to be passed to NumPy functions (Fig. 3). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask. The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy’s high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes42.

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. The NumPy developers—many of whom are authors of this Review—iteratively refine and add protocol designs to improve utility and simplify adoption.
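As a concrete sketch of this dispatch mechanism (not code from the paper; it assumes Dask is installed and a NumPy release recent enough to support the array function protocol), calling a NumPy function on a Dask array returns another Dask array, mirroring Fig. 3:

    import numpy as np
    import dask.array as da

    x = da.ones((4, 3), chunks=(2, 3))  # a chunked Dask array standing in for a distributed dataset

    m = np.mean(x, axis=0)              # NumPy dispatches the call to Dask via the array protocols
    print(type(m))                      # a Dask array, not a NumPy ndarray
    print(m.compute())                  # evaluation is lazy; compute() yields [1. 1. 1.]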