Unit III - Data Manipulation Using Python
The topic is very broad: datasets can come from a wide range of sources and in a wide
range of formats, including collections of documents, images, sound clips, numerical
measurements, and more.
Despite this apparent heterogeneity, it helps to think of all data fundamentally as
arrays of numbers. Text, for example, can be converted in various ways into numerical
representations, such as binary digits.
For this reason, efficient storage and manipulation of numerical arrays is absolutely
fundamental to the process of doing data science.
Python has specialized tools for handling such numerical arrays: the NumPy package and
the Pandas package.
The basic array manipulations covered here include:
Attributes of arrays: Determining the size, shape, memory consumption, and data types
of arrays
Indexing of arrays: Getting and setting the value of individual array elements
Slicing of arrays: Getting and setting smaller subarrays within a larger array
Joining and splitting of arrays: Combining multiple arrays into one, and splitting one
array into many
NumPy Array Attributes
import numpy as np
np.random.seed(0) # seed for reproducibility
Each array has attributes including ndim (the number of dimensions), shape (the size of
each dimension), size (the total size of the array), and dtype (the type of each element):
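A minimal sketch of how these attributes might be inspected. The array name x3 and its shape (3, 4, 5) are taken from the output that follows; building it with np.random.randint and an explicit dtype=np.int64 is an assumption, made so that the reported dtype is the same on every platform.

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility

# Assumed construction: a three-dimensional array of random integers in [0, 10)
x3 = np.random.randint(10, size=(3, 4, 5), dtype=np.int64)

print("x3 ndim:", x3.ndim)    # number of dimensions
print("x3 shape:", x3.shape)  # size of each dimension
print("x3 size:", x3.size)    # total number of elements
print("dtype:", x3.dtype)     # data type of each element
```

Memory consumption can be read the same way: x3.itemsize gives the bytes per element and x3.nbytes the total, which equals itemsize times size.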
Output:
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype: int64
In a one-dimensional array, the ith value (counting from zero) can be accessed by
specifying the desired index in square brackets, just as with Python lists:
x1
array([9, 4, 0, 3, 8, 6])
x1[0]
9
x1[4]
8
To index from the end of the array, you can use negative indices:
x1[-1]
6
x1[-2]
8
In a multidimensional array, items are accessed using a comma-separated tuple of
indices:
x2[0, 0]
3
x2[2, 0]
0
x2[2, -1]
9
Values can also be modified using any of the preceding index notation:
x2[0, 0] = 12
x2
array([[12, 1, 3, 7],
[4, 0, 2, 3],
[0, 0, 6, 9]])
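The indexing rules above can be tried on small literal arrays. In this sketch the arrays are written out by hand to match the values displayed above (an assumption, since the notes do not show how x1 and x2 were created), and the names first, last, and corner are illustrative only.

```python
import numpy as np

# Literal arrays matching the values displayed above
x1 = np.array([9, 4, 0, 3, 8, 6])
x2 = np.array([[3, 1, 3, 7],
               [4, 0, 2, 3],
               [0, 0, 6, 9]])

first = x1[0]        # 9, first element
last = x1[-1]        # 6, last element
corner = x2[2, -1]   # 9, last column of the third row

x2[0, 0] = 12        # item assignment uses the same notation

# NumPy arrays have a fixed type: a float assigned into an
# integer array is silently truncated, not rounded.
x1[0] = 3.14159
print(x1)  # [3 4 0 3 8 6]
```

The silent truncation in the last step is worth keeping in mind whenever values are written into an existing integer array.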
Just as we can use square brackets to access individual array elements, we can also use
them to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an
array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=<size of
dimension>, step=1.
One-Dimensional Subarrays
x1
array([9, 4, 0, 3, 8, 6])
x1[1:4] # middle subarray
array([4, 0, 3])
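A short sketch of the start:stop:step defaults on the one-dimensional array shown above. The array is written out as a literal here so the block runs on its own.

```python
import numpy as np

x1 = np.array([9, 4, 0, 3, 8, 6])

print(x1[:3])    # first three elements      -> [9 4 0]
print(x1[3:])    # elements from index 3 on  -> [3 8 6]
print(x1[1:4])   # middle subarray           -> [4 0 3]
print(x1[::2])   # every second element      -> [9 0 8]
print(x1[1::2])  # every second, from 1      -> [4 3 6]
print(x1[::-1])  # negative step reverses    -> [6 8 3 0 4 9]
```

When step is negative, the defaults for start and stop are swapped, which is why x1[::-1] is a convenient way to reverse an array.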
Multidimensional Subarrays
x2
array([[12, 1, 3, 7],
[4, 0, 2, 3],
[0, 0, 6, 9]])
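Unlike Python list slices, NumPy array slices are returned as views rather than copies
of the array data, so modifying a subarray also modifies the original array:
x2_sub = x2[:2, :2]
x2_sub[0, 0] = 99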
print(x2_sub)
[[99 1]
[ 4 0]]
print(x2)
[[99 1 3 7]
[ 4 0 2 3]
[ 0 0 6 9]]
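To work on independent data instead, copy the subarray explicitly with the copy method;
the original array is then left untouched:
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy[0, 0] = 42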
print(x2_sub_copy)
[[42 1]
[ 4 0]]
print(x2)
[[99 1 3 7]
[ 4 0 2 3]
[ 0 0 6 9]]
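The two results above differ because a slice such as x2[:2, :2] is a view onto the original data, whereas calling .copy() on it produces independent data. The following sketch makes the difference explicit with np.shares_memory; the variable names view and copy are illustrative, not taken from the notes.

```python
import numpy as np

x2 = np.array([[12, 1, 3, 7],
               [4, 0, 2, 3],
               [0, 0, 6, 9]])

view = x2[:2, :2]          # a slice is a view onto the same buffer
copy = x2[:2, :2].copy()   # .copy() allocates independent storage

print(np.shares_memory(x2, view))  # True
print(np.shares_memory(x2, copy))  # False

view[0, 0] = 99   # writes through to x2
copy[0, 0] = 42   # leaves x2 untouched
print(x2[0, 0])   # 99
```

Views make it possible to process pieces of a large dataset without duplicating the underlying buffer, at the cost of having to copy explicitly when isolation is needed.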
Reshaping of Arrays
grid = np.arange(1, 10).reshape(3, 3)
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
x = np.array([1, 2, 3])
x.reshape((1, 3)) # row vector via reshape
array([[1, 2, 3]])
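As a rough illustration of reshaping, the sketch below turns the same one-dimensional array into row and column vectors, first with reshape and then with np.newaxis inside a slice. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x.reshape((1, 3))     # row vector, shape (1, 3)
col = x.reshape((3, 1))     # column vector, shape (3, 1)

# np.newaxis inside a slice gives the same shapes
row_na = x[np.newaxis, :]
col_na = x[:, np.newaxis]

print(row.shape, col.shape)        # (1, 3) (3, 1)
print(row_na.shape, col_na.shape)  # (1, 3) (3, 1)

# The total size must match: 9 elements cannot fill a 2x4 grid
try:
    np.arange(1, 10).reshape(2, 4)
except ValueError as err:
    print("reshape failed:", err)
```

Reshaping works only when the size of the initial array matches the size of the reshaped array, which is what the final step demonstrates.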
Concatenation of Arrays
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
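Concatenation also works on two-dimensional arrays. The sketch below assumes a small example grid (not from the notes) and shows np.concatenate along each axis, followed by np.vstack and np.hstack, which tend to be clearer when the inputs have different numbers of dimensions.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# np.concatenate joins along an existing axis (axis=0 by default)
tall = np.concatenate([grid, grid])          # shape (4, 3)
wide = np.concatenate([grid, grid], axis=1)  # shape (2, 6)

# vstack / hstack are clearer for arrays of mixed dimensions
stacked = np.vstack([x, grid])               # shape (3, 3)
y = np.array([[99],
              [99]])
side = np.hstack([grid, y])                  # shape (2, 4)

print(tall.shape, wide.shape, stacked.shape, side.shape)
```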
Splitting of Arrays
The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit.
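A small sketch of the three splitting functions. Each takes a list of indices giving the split points; the example arrays and variable names are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Split points [3, 5] produce three pieces: x[:3], x[3:5], x[5:]
x_a, x_b, x_c = np.split(x, [3, 5])
print(x_a, x_b, x_c)  # [1 2 3] [99 99] [3 2 1]

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])   # split rows at index 2
left, right = np.hsplit(grid, [2])    # split columns at index 2
print(upper.shape, left.shape)        # (2, 4) (4, 2)
```

Note that N split points always lead to N + 1 subarrays.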
Aggregation
NumPy has fast built-in aggregation functions for working on arrays.
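As a brief sketch, the aggregation methods below operate on a small hand-written array (an assumption for illustration). The axis keyword names the dimension that is collapsed, so axis=0 aggregates within each column and axis=1 within each row.

```python
import numpy as np

M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

total = M.sum()            # 78, over all elements
col_min = M.min(axis=0)    # [1 2 3 4], one value per column
row_max = M.max(axis=1)    # [ 4  8 12], one value per row
mean = M.mean()            # 6.5

print(total, col_min, row_max, mean)
```

These array methods are compiled and are generally much faster on large arrays than Python's built-in sum, min, and max.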
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
177 185 188 188 182 185]
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set plot style
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
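With the heights in an array, summary statistics follow directly from the aggregation functions. The sketch below copies the values printed above into a literal array so that it runs without the CSV file; the choice of statistics is illustrative.

```python
import numpy as np

# Heights in cm, copied from the printed array above
heights = np.array([
    189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178,
    183, 193, 178, 173, 174, 183, 183, 168, 170, 178, 182, 180, 183, 178,
    182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185,
])

print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())   # 163
print("Maximum height:    ", heights.max())   # 193

# Quantiles summarize the distribution more robustly than the mean
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))
```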
Computation on NumPy Arrays: Universal Functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the
data scientist are the trigonometric functions.
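A minimal sketch of the trigonometric ufuncs, applied to a small array of angles. The arrays theta and x are assumptions chosen so that the results are easy to check by hand.

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)   # angles 0, pi/2, pi

print("sin(theta) =", np.sin(theta))
print("cos(theta) =", np.cos(theta))
print("tan(theta) =", np.tan(theta))

# Inverse trigonometric functions are available as well
x = np.array([-1, 0, 1])
print("arcsin(x) =", np.arcsin(x))
print("arccos(x) =", np.arccos(x))
print("arctan(x) =", np.arctan(x))
```

The values are computed to within machine precision, so results that should be exactly zero may print as very small nonzero numbers.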
Another common group of operations available as NumPy ufuncs is the exponentials and
logarithms.
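The following sketch runs the main exponential and logarithm ufuncs on small example arrays (the inputs are assumptions for illustration).

```python
import numpy as np

x = np.array([1, 2, 3])
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))      # [2. 4. 8.]
print("3^x =", np.power(3, x))  # [ 3  9 27]

y = np.array([1, 2, 4, 10])
print("ln(y)    =", np.log(y))
print("log2(y)  =", np.log2(y))
print("log10(y) =", np.log10(y))

# Specialized versions keep precision for very small inputs
small = np.array([0, 0.001, 0.01, 0.1])
print("exp(small) - 1 =", np.expm1(small))
print("log(1 + small) =", np.log1p(small))
```

For very small inputs, np.expm1 and np.log1p give more precise results than computing np.exp(x) - 1 or np.log(1 + x) directly.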
Broadcasting
Another means of vectorizing operations is to use NumPy's broadcasting functionality.
Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition,
subtraction, multiplication, etc.) on arrays of different sizes.
import numpy as np
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
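Broadcasting allows these types of binary operations to be performed on arrays of
different sizes. For example, a scalar (which can be thought of as a zero-dimensional
array) can be added to an array: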
a + 5
array([5, 6, 7])
M = np.ones((3, 3))
M
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
M + a
array([[ 1., 2., 3.],
[ 1., 2., 3.],
[ 1., 2., 3.]])
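In the examples above only one operand is stretched. Broadcasting can also stretch both operands at once, and it raises an error when the shapes cannot be matched. NumPy compares shapes from the trailing dimension backwards: a missing leading dimension is treated as 1, a dimension of size 1 is stretched, and any other mismatch is an error. The sketch below illustrates each case; the arrays are assumptions chosen for illustration.

```python
import numpy as np

a = np.arange(3)                 # shape (3,)
b = np.arange(3)[:, np.newaxis]  # shape (3, 1)

# Both operands are stretched to the common shape (3, 3)
print(a + b)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# np.broadcast_shapes reports the resulting shape without computing it
print(np.broadcast_shapes(a.shape, b.shape))  # (3, 3)

# Incompatible shapes raise an error: (3, 2) and (3,) do not line up
M = np.ones((3, 2))
try:
    M + a
except ValueError as err:
    print("broadcast failed:", err)
```

When a mismatch like the last one is intended as a per-row operation, reshaping the one-dimensional array with a[:, np.newaxis] gives it shape (3, 1), which is compatible with (3, 2).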
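NumPy also implements comparison operators such as == (equal) and != (not equal) as
element-wise ufuncs. The result is always an array with a Boolean data type:
x = np.array([1, 2, 3, 4, 5])
x != 3 # not equal
array([ True, True, False, True, True])
x == 3 # equal
array([False, False, True, False, False])
It is also possible to do an element-wise comparison of two arrays:
(2 * x) == (x ** 2)
array([False, True, False, False, False])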
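A summary of the comparison operators and their equivalent ufuncs is shown here:
Operator    Equivalent ufunc
==          np.equal
!=          np.not_equal
<           np.less
<=          np.less_equal
>           np.greater
>=          np.greater_equal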
Given a Boolean array, there are a host of useful operations you can do.
print(x)
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
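A short sketch of counting with Boolean arrays, using the two-dimensional array printed above as a literal. The threshold values are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

n_small = np.count_nonzero(x < 6)   # 8 values are less than 6
n_small_sum = np.sum(x < 6)         # same count: True counts as 1
per_row = np.sum(x < 6, axis=1)     # [4 2 2], count in each row

any_big = np.any(x > 8)             # True, since 9 is present
all_small = np.all(x < 10)          # True, every value is below 10

print(n_small, n_small_sum, per_row, any_big, all_small)
```

Using np.sum instead of np.count_nonzero has the advantage that the count can be taken along rows or columns with the axis argument.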
Boolean operators
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]])
Now to select these values from the array, we can simply index on this Boolean array;
this is known as a masking operation:
x[x < 5]
array([0, 3, 3, 3, 2, 4])
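Masks become more expressive when conditions are combined with the bitwise Boolean operators. The following sketch applies them to the same array; the specific conditions and the names middle, extremes, and same are illustrative.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# & (and), | (or), ^ (xor), ~ (not) act element-wise on Boolean arrays;
# the parentheses are required because & binds tighter than < and >
middle = x[(x > 2) & (x < 7)]
print(middle)             # [5 3 3 3 5 4 6]

extremes = x[(x < 1) | (x > 8)]
print(extremes)           # [0 9]

# ~ inverts a mask, so this selects the same values as `middle`
same = x[~((x <= 2) | (x >= 7))]
print(np.array_equal(middle, same))  # True
```

The Python keywords and and or are not suitable here: they evaluate the truth of an entire object, whereas & and | operate on each element.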
Fancy Indexing
We saw how to access and modify portions of arrays using simple indices (e.g., arr[0]),
slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]).
Here, we'll look at another style of array indexing, known as fancy indexing.
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of
indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's
values.
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
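Suppose we want to access three different elements. We could index each one
separately, as in [x[3], x[7], x[4]].
Alternatively, we can pass a single list or array of indices to obtain the same result: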
ind = [3, 7, 4]
x[ind]
array([71, 86, 60])
When using fancy indexing, the shape of the result reflects the shape of the index
arrays rather than the shape of the array being indexed:
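For instance, indexing a one-dimensional array with a two-dimensional index array gives a two-dimensional result. The sketch below writes x out as a literal matching the values printed above; the index array is an assumption for illustration.

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

ind = np.array([[3, 7],
                [4, 5]])
result = x[ind]
print(result)
# [[71 86]
#  [60 20]]
print(result.shape)  # (2, 2), the shape of ind, not of x
```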
Fancy indexing also works in multiple dimensions. Consider the following array:
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
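As with standard indexing, the first index refers to the row and the second to the column. A minimal sketch, with row and col as illustrative index arrays:

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Indices are paired up: (0, 2), (1, 1), (2, 3)
print(X[row, col])                 # [ 2  5 11]

# Broadcasting a column of rows against a row of columns
# yields every combination instead
print(X[row[:, np.newaxis], col])
# [[ 2  1  3]
#  [ 6  5  7]
#  [10  9 11]]
```

The pairing of indices follows the broadcasting rules, so the result has the broadcasted shape of the index arrays.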
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other
indexing schemes we've seen:
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
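The combinations can be sketched as follows: a fancy index with a simple index, with a slice, and with a Boolean mask. The index values and the mask are assumptions chosen for illustration.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))

print(X[2, [2, 0, 1]])      # simple index + fancy index -> [10  8  9]
print(X[1:, [2, 0, 1]])     # slice + fancy index
# [[ 6  4  5]
#  [10  8  9]]

mask = np.array([True, False, True, False])
row = np.array([0, 1, 2])
print(X[row[:, np.newaxis], mask])   # fancy index + Boolean mask
# [[ 0  2]
#  [ 4  6]
#  [ 8 10]]
```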
Structured Data:
While our data can often be well represented by a homogeneous array of values,
sometimes this is not the case.
This section demonstrates the use of NumPy's structured arrays and record arrays, which
provide efficient storage for compound, heterogeneous data.
While the patterns shown here are useful for simple operations, scenarios like this often
lend themselves to the use of Pandas DataFrames.
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
x = np.zeros(4, dtype=int)
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
Now that we've created an empty container array, we can fill it with our lists of values:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)
('Doug', 19, 61.5)]
The handy thing with structured arrays is that you can now refer to values either by index
or by name:
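A brief sketch of both access styles, rebuilding the structured array from above so the block runs on its own. The final line combines field access with Boolean masking; the age threshold is an assumption for illustration.

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

print(data['name'])           # all names, accessed by field name
print(data[0])                # first row, accessed by index
print(data[-1]['name'])       # name from the last row -> Doug

# Names of the people whose age is under 30
print(data[data['age'] < 30]['name'])
```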
For clarity, numerical types can be specified using Python types or NumPy dtypes
instead:
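The sketch below shows a few equivalent ways of writing a compound dtype: the dictionary form with Python types and NumPy dtypes, a list of tuples, and a comma-separated string. The variable names dt_dict, dt_list, and dt_str are illustrative.

```python
import numpy as np

# Python types or NumPy dtypes in the dictionary form
dt_dict = np.dtype({'names': ('name', 'age', 'weight'),
                    'formats': ((np.str_, 10), int, np.float32)})

# A list of (name, format) tuples
dt_list = np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

# A comma-separated string; fields get default names f0, f1, f2
dt_str = np.dtype('S10,i4,f8')

print(dt_dict)
print(dt_list)
print(dt_str.names)   # ('f0', 'f1', 'f2')
```

In the short string codes, the first optional character gives the byte order, the next the kind of data (for example i for integer, f for float, U for Unicode string, S for byte string), and the trailing number the size in bytes or characters.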