Unit 4
Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, Boolean logic –
fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection –
operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and
grouping – pivot tables
NumPy array manipulation is used to access data and subarrays, and to split, reshape, and join arrays.
2. Attributes:
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the
total size of the array):
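As a minimal sketch of these attributes, the block below builds a small random array and prints them. The variable name x3 and the seed are assumptions chosen only for illustration.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(0)
x3 = rng.integers(10, size=(3, 4, 5))  # three-dimensional array of integers in [0, 10)

print("x3 ndim: ", x3.ndim)   # number of dimensions: 3
print("x3 shape:", x3.shape)  # size of each dimension: (3, 4, 5)
print("x3 size: ", x3.size)   # total number of elements: 3 * 4 * 5 = 60
```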
3. Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in
square brackets, just as with Python lists:
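A short sketch of single-element access follows. The array values are made up for illustration; negative indices count from the end, as with lists.

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])  # illustrative values

first = x1[0]    # first element (index 0): 5
fifth = x1[4]    # fifth element (index 4): 7
last = x1[-1]    # negative indices count from the end: 9

print(first, fifth, last)
```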
4. Array Slicing: Accessing Subarrays
Just as square brackets access individual array elements, they can also be used to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
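The slice forms above can be sketched on a simple one-dimensional array. The array x here is an assumption used only to make the results easy to check.

```python
import numpy as np

x = np.arange(10)      # array([0, 1, 2, ..., 9])

first_five = x[:5]     # start defaults to 0        -> [0 1 2 3 4]
after_five = x[5:]     # stop defaults to the size  -> [5 6 7 8 9]
middle = x[4:7]        # explicit start and stop    -> [4 5 6]
every_other = x[::2]   # step of 2                  -> [0 2 4 6 8]
odd_positions = x[1::2]  # step of 2 starting at 1  -> [1 3 5 7 9]
reversed_x = x[::-1]   # a negative step reverses the array

print(first_five, after_five, middle, every_other, odd_positions, reversed_x)
```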
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas.
For example:
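A minimal sketch with a small two-dimensional array (the values in x2 are illustrative). The last two lines combine an index with an empty slice to pick out a single column or row.

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

top_left = x2[:2, :3]      # first two rows, first three columns
alt_cols = x2[:, ::2]      # all rows, every other column
flipped = x2[::-1, ::-1]   # both dimensions reversed

first_col = x2[:, 0]       # first column: [12 7 1]
first_row = x2[0, :]       # first row:    [12 5 2 4]

print(top_left)
print(alt_cols)
print(flipped)
```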
5. Accessing array rows and columns
NumPy array slicing differs from Python list slicing: in lists, slices are copies, whereas array slices
return views rather than copies of the array data.
Consider our two-dimensional array from before:
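The sketch below contrasts the two behaviours. The array and list values are illustrative; the point is that modifying a slice of the array changes the original, while modifying a slice of the list does not.

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

x2_sub = x2[:2, :2]   # a view onto the top-left 2x2 block, not a copy
x2_sub[0, 0] = 99     # modifying the view ...
print(x2[0, 0])       # ... also changes the original array: 99

lst = [1, 2, 3, 4]
lst_sub = lst[:2]     # list slices are copies
lst_sub[0] = 99
print(lst[0])         # the original list is unchanged: 1
```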
7. Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an
array or a subarray.
This can be most easily done with the copy() method:
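A minimal sketch, using the same illustrative array as a starting point: after copy(), changes to the subarray no longer reach the original.

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

x2_sub_copy = x2[:2, :2].copy()  # explicit copy of the top-left 2x2 block
x2_sub_copy[0, 0] = 42           # modify the copy only

print(x2_sub_copy[0, 0])  # 42
print(x2[0, 0])           # original is untouched: 12
```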
8. Reshaping of Arrays
The most flexible way of reshaping an array is with the reshape() method.
For example, if we want to put the numbers 1 through 9 in a 3×3 grid, we can do the following:
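A short sketch of this reshape. Note that the size of the initial array must match the size of the reshaped array (here 9 = 3 × 3).

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))  # numbers 1..9 laid out row by row
print(grid)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
```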
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or
column matrix.
We can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice
operation:
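Both routes can be sketched side by side. The array x is illustrative; the reshape and newaxis forms produce the same shapes.

```python
import numpy as np

x = np.array([1, 2, 3])

row_reshape = x.reshape((1, 3))   # row matrix via reshape
row_newaxis = x[np.newaxis, :]    # row matrix via newaxis

col_reshape = x.reshape((3, 1))   # column matrix via reshape
col_newaxis = x[:, np.newaxis]    # column matrix via newaxis

print(row_newaxis.shape, col_newaxis.shape)  # (1, 3) (3, 1)
```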
It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines
np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument.
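A minimal sketch of the three routines on small illustrative arrays. np.vstack and np.hstack are convenient when the arrays have mixed dimensions.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
joined = np.concatenate([x, y])               # [1 2 3 3 2 1]

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
side_by_side = np.concatenate([grid, grid], axis=1)  # join along columns, shape (2, 6)

stacked_rows = np.vstack([x, grid])           # stack vertically, shape (3, 3)

col = np.array([[99],
                [99]])
stacked_cols = np.hstack([grid, col])         # stack horizontally, shape (2, 4)

print(joined)
print(side_by_side)
print(stacked_rows)
print(stacked_cols)
```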
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points:
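The sketch below splits illustrative arrays at given indices. Note that N split points produce N + 1 subarrays.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])
x1, x2, x3 = np.split(x, [3, 5])     # split points at indices 3 and 5
print(x1, x2, x3)                    # [1 2 3] [99 99] [3 2 1]

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])  # split rows at index 2
left, right = np.hsplit(grid, [2])   # split columns at index 2
print(upper)
print(left)
```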
II. Aggregations: Min, Max, and Everything in Between
3. Multidimensional aggregates
Other aggregation functions
Example: What Is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let’s consider the heights of all US presidents.
This data is available in the file president_heights.csv, which is a simple comma-separated list of
labels and values:
III. Computation on Arrays: Broadcasting
Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply
a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on
arrays of different sizes.
Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
Broadcasting allows these types of binary operations to be performed on arrays of different sizes—for example,
we can just as easily add a scalar (think of it as a zero dimensional array) to an array:
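A minimal sketch of both cases, using illustrative arrays a and b: same-size addition, then addition of a scalar.

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

same_size = a + b   # element-by-element: [5 6 7]
with_scalar = a + 5 # the scalar behaves as if stretched to [5, 5, 5]: [5 6 7]

print(same_size, with_scalar)
```

The duplication of the scalar does not actually take place in memory; it is only a useful way to think about the result.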
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is
padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that
dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
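The three rules can be traced on small examples. The arrays below are illustrative; the comments follow the shapes through each rule.

```python
import numpy as np

# Rules 1 and 2: shapes (2, 3) and (3,)
M = np.ones((2, 3))
a = np.arange(3)
# Rule 1: a.shape (3,) is padded on the left    -> (1, 3)
# Rule 2: the size-1 dimension is stretched     -> (2, 3)
sum_1 = M + a
print(sum_1.shape)   # (2, 3)

# Both arrays are stretched: shapes (3, 1) and (3,)
a_col = np.arange(3).reshape((3, 1))
b = np.arange(3)
# Rule 1: b.shape (3,) -> (1, 3); Rule 2: both become (3, 3)
sum_2 = a_col + b
print(sum_2)

# Rule 3: shapes (3, 2) and (3,) are incompatible
M2 = np.ones((3, 2))
error_raised = False
try:
    M2 + a           # (3, 2) vs (1, 3): sizes 2 and 3 disagree, neither is 1
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```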
IV. Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking
comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some
criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
One approach to such counting tasks would be to work by hand: loop through the data, incrementing a
counter each time we see values in some desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of
time writing code and time computing the result.
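As a sketch of the alternative, the block below counts and extracts values with Boolean masks and compares the count with a hand-written loop. The array x and the thresholds are made-up values for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # illustrative data

mask = x > 5                        # Boolean array, True where the condition holds
n_large = np.count_nonzero(mask)    # count values greater than 5: 4
large_values = x[mask]              # extract them: [6 7 8 9]

# Conditions combine with the bitwise operators & (and), | (or), ~ (not)
n_between = np.count_nonzero((x > 2) & (x < 7))  # values 3, 4, 5, 6: 4

# Remove "outliers" above a threshold by keeping everything else
no_outliers = x[~(x > 7)]           # [1 2 3 4 5 6 7]

# The equivalent by-hand loop for the first count
loop_count = 0
for value in x:
    if value > 5:
        loop_count += 1

print(n_large, large_values, n_between, no_outliers, loop_count)
```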
This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays:
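The sketch below first shows the three separate sequences, then one possible structured array holding the same data under a compound dtype. The sample names and values, and the field formats ('U10', 'i4', 'f8'), are assumptions chosen for illustration.

```python
import numpy as np

# Three separate sequences: nothing ties them together (illustrative values)
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# A structured array with a compound data type
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight

print(data[0])            # first record: name, age, and weight together
print(data['name'])       # all names

# Boolean masking works on fields too: names of people under 30
young = data[data['age'] < 30]['name']
print(young)
```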
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
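A minimal sketch of this dictionary-like behaviour. The Series contents are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

value_b = data['b']         # look up a value by key: 0.5
has_a = 'a' in data         # membership test on the keys: True
keys = list(data.keys())    # ['a', 'b', 'c', 'd']

data['e'] = 1.25            # assigning to a new key extends the Series

print(value_b, has_a, keys)
print(data)
```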
DataFrame as a dictionary
Operating on Data in Pandas:
In Pandas, ufuncs preserve index and column labels in the output for unary operations such as negation and
trigonometric functions. For binary operations such as addition and multiplication, Pandas
automatically aligns indices when passing the objects to the ufunc.
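Both behaviours can be sketched with small Series objects. The names ser, A, and B and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Unary ufunc: the index is preserved in the result
ser = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])
exp_ser = np.exp(ser)
print(exp_ser)

# Binary operation: indices are aligned, and labels present in only
# one operand produce NaN in the result
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
total = A + B
print(total)       # index 0 and 3 are NaN; 1 -> 5.0; 2 -> 9.0

# The method form allows a fill value for the missing entries
filled = A.add(B, fill_value=0)
print(filled)
```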
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is
similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-
dimensional NumPy array.
For example, by NumPy’s broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.
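A sketch of the parallel between NumPy and Pandas follows. The array values and the column labels Q, R, S, T are illustrative. To operate column-wise in Pandas instead, the method form with an axis argument can be used.

```python
import numpy as np
import pandas as pd

A = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])

np_result = A - A[0]               # first row subtracted from every row

df = pd.DataFrame(A, columns=list('QRST'))
row_wise = df - df.iloc[0]         # same row-wise behaviour by default

col_wise = df.subtract(df['R'], axis=0)  # subtract column R from every column

print(np_result)
print(row_wise)
print(col_wise)
```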
Real-world data is rarely clean and homogeneous. In particular, many interesting
datasets will have some amount of data missing.
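Two general strategies are used to mark missing data: a mask or a sentinel value.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriating
one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, or a more global convention
such as NaN (Not a Number), a special value defined by the IEEE floating-point specification.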
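The two approaches can be sketched as follows. The values are illustrative; NumPy's masked arrays stand in here for the separate-mask approach, and NaN for the sentinel approach.

```python
import numpy as np
import pandas as pd

# Sentinel approach: NaN marks the missing entry in a floating-point array
vals = np.array([1, np.nan, 3, 4])
plain_sum = vals.sum()        # NaN propagates through arithmetic: nan
nan_sum = np.nansum(vals)     # NaN-aware aggregate ignores it: 8.0

# Masking approach: a separate Boolean array flags the missing entry
values = np.array([1, 0, 3, 4])
mask = np.array([False, True, False, False])
masked = np.ma.masked_array(values, mask=mask)
masked_sum = masked.sum()     # masked entries are skipped: 8

# Pandas treats None and NaN interchangeably as missing values
ser = pd.Series([1, np.nan, 2, None])
is_missing = ser.isnull()     # [False, True, False, True]
cleaned = ser.dropna()        # keeps only the non-missing entries

print(plain_sum, nan_sum, masked_sum)
print(is_missing.tolist(), cleaned.tolist())
```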
Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within a
single index. In this way, higher-dimensional data can be compactly represented within the familiar
one-dimensional Series and two-dimensional DataFrame objects.
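A minimal sketch, assuming a two-level index of hypothetical groups ('a', 'b') and years (1, 2) with made-up values. The one-dimensional Series carries two-dimensional data, and unstack() converts it to a conventional DataFrame.

```python
import pandas as pd

# Hypothetical two-level index: every combination of group and year
index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['group', 'year'])
pop = pd.Series([10, 20, 30, 40], index=index)
print(pop)

group_a = pop.loc['a']             # all entries for the first-level key 'a'
year_2 = pop.xs(2, level='year')   # cross-section on the second level

df = pop.unstack()                 # second level becomes the columns
print(df)
```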
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(),
mean(), median(), min(), and max().
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is
illustrated in Figure 3-1.
Figure 3-1 makes clear what the GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
Here it’s important to realize that the intermediate splits do not need to be explicitly instantiated.
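The three steps can be sketched on a small DataFrame. The column names key and data and their values are assumptions for illustration; the dictionary comprehension spells out by hand what groupby does in a single pass.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split on 'key', apply sum to each group, combine into one result
result = df.groupby('key').sum()
print(result)   # A -> 3, B -> 5, C -> 7

# The same computation with the splits instantiated explicitly
manual = {key: df.loc[df['key'] == key, 'data'].sum()
          for key in df['key'].unique()}
print(manual)
```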
The GroupBy object
The GroupBy object is a very flexible abstraction.
Column indexing. The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object. For example:
In[14]: planets.groupby('method')
Out[14]: <pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
In[15]: planets.groupby('method')['orbital_period']
Out[15]: <pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
Iteration over groups. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:
In[17]: for (method, group) in planets.groupby('method'):
            print("{0:30s} shape={1}".format(method, group.shape))
Dispatch methods. Through some Python class magic, any method not explicitly
implemented by the GroupBy object will be passed through and called on the groups,
whether they are DataFrame or Series objects. For example, you can use the
describe() method of DataFrames to perform a set of aggregations that describe each
group in the data:
In[18]: planets.groupby('method')['year'].describe().unstack()
X. Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data.
The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that
provides a multidimensional summarization of the data.
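A minimal sketch, assuming a hypothetical table of sales by store and quarter (all names and values are made up). The pivot table is compared with the equivalent GroupBy followed by unstack().

```python
import pandas as pd

# Hypothetical column-wise input data
df = pd.DataFrame({'store': ['N', 'N', 'S', 'S', 'N', 'S'],
                   'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
                   'sales': [10, 20, 30, 40, 50, 60]})

# Rows are stores, columns are quarters, cells hold the mean of sales
table = df.pivot_table(values='sales', index='store',
                       columns='quarter', aggfunc='mean')
print(table)

# The same summary built with GroupBy
via_groupby = df.groupby(['store', 'quarter'])['sales'].mean().unstack()
print(via_groupby)
```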