ML Interactively
Supervised learning: Using labeled data to train a model. The labels for
the training dataset represent the class/category that each data
observation belongs to. After training, the model should be able to predict
labels for new data observations (from the same population distribution
as the training data).
Unsupervised learning: Using unlabeled data to train a model; the model must
learn the structure of the data without any provided labels.
Example: Going back to the same picture dataset from above, but
now assume the training dataset is unlabeled. Using unsupervised
learning, a model will be able to pick up on the inherent differences
between pictures with a lake and pictures without a lake, e.g.
differences in pixel color or orientation. This allows the model to
cluster the pictures into two separate groups.
If it is possible to get large enough labeled training datasets, supervised
learning is the way to go. However, it is often difficult to get fully labeled
datasets, which is why many tasks require unsupervised learning or semi-
supervised learning (a mix of supervised and unsupervised learning).
Deciding which type of learning method to use is only the first step towards
creating a machine learning model. You also need to choose the proper model
architecture for your task and, most importantly, be able to process data into a
training pipeline and interpret/analyze model results.
On the other hand, data science deals with gathering insights from datasets.
Traditionally, data scientists have used statistical methods for gathering these
insights. However, as machine learning continues to grow, it has also
penetrated the field of data science.
2. Data Processing and Preparation: Once you’ve gathered the relevant data,
you need to process it and make sure that it is in a usable format for
training a machine learning model. This includes handling missing data,
dealing with outliers, etc.
3. Feature Engineering: Once you’ve collected and processed your dataset,
you will likely need to transform some of the features (and sometimes
even drop some features) in order to optimize how well a model can be
trained on the data.
4. Model Selection: Based on the dataset, you will choose which model
architecture to use. This is one of the main tasks of industry engineers.
Rather than attempting to come up with a completely novel model
architecture, most tasks can be thoroughly performed with an existing
architecture (or combination of model architectures).
5. Model Training and Data Pipeline: After selecting the model architecture,
you will create a data pipeline for training the model. This means
creating a continuous stream of batched data observations to efficiently
train the model. Since training can take a long time, you want your data
pipeline to be as efficient as possible.
6. Model Validation: After training the model for a sufficient amount of
time, you will need to validate the model’s performance on a held-out
portion of the overall dataset. This data needs to come from the same
underlying distribution as the training dataset, but needs to be different
data that the model has not seen before.
7. Model Persistence: Finally, after training and validating the model’s
performance, you need to be able to properly save the model weights and
possibly push the model to production. This means setting up a process
with which new users can easily use your pre-trained model to make
predictions.
Figuring out which features are the most relevant to the task, and picking
out the best combination of features to use.
Picking the correct model architecture to use based on the data. Many
people will always default to using a large neural network for any
machine learning task, but many times this is unnecessary and can even
hurt the model’s final performance if the dataset is not large enough.
Code a machine learning model and train it on processed data. Validate
the model’s performance on held-out data and understand techniques to
improve a model’s performance.
Introduction
In the Data Manipulation section, you will learn how to perform data
manipulation using NumPy.
A. Data Processing
When asked about Google's model for success, Peter Norvig, the director of
research at Google, famously stated,
"We don't have better algorithms than anyone else; we just have more
data."
However, data is not just limited to machine learning. Companies use data to
identify customer trends, political parties use data to determine which
demographics they should target, sports teams use data to analyze players,
etc.
Example baseball data used in sabermetrics. The concept was popularized by the 2011 film,
Moneyball.
The universal usage of data makes data processing, the act of converting raw
data into a meaningful form, an essential skill to have.
B. NumPy
Many scenarios involve mostly numeric datasets. For example, medical data
contains many numeric metrics, such as height, weight, and blood pressure.
Furthermore, the majority of neural networks use input data that is either
numeric or has been converted to a numeric form.
When we deal with numeric data, the best Python library to use is NumPy.
The NumPy library allows us to perform many operations on numeric data,
and convert the data to more usable forms.
Chapter Goals:
Learn about NumPy arrays and how to initialize them
Write code to create several NumPy arrays
A. Arrays
NumPy arrays are basically just Python lists with added features. In fact, you
can easily convert a Python list to a NumPy array using the np.array function,
which takes in a Python list as its required argument. The function also has
quite a few keyword arguments, but the main one to know is dtype . The
dtype keyword argument takes in a NumPy type and manually casts the array
to the specified type.
The code below is an example usage of np.array to create a 2-D matrix. Note
that the array is manually cast to np.float32 .
import numpy as np
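# An illustrative 2-D example (the element values here are arbitrary)
arr = np.array([[0, 1, 2], [3, 4, 5]],
               dtype=np.float32)
print(repr(arr))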
When the elements of a NumPy array are mixed types, then the array's type
will be upcast to the highest level type. This means that if an array input has
mixed int and float elements, all the integers will be cast to their floating-
point equivalents. If an array is mixed with int , float , and string elements,
everything is cast to strings.
The code below is an example of np.array upcasting. Both integers are cast to
their floating-point equivalents.
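A minimal sketch of such upcasting (the values are illustrative):
arr = np.array([0, 0.1, 2])
print(repr(arr))  # the integers 0 and 2 appear as 0.0 and 2.0
print(arr.dtype)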
B. Copying
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print('Array a: {}'.format(repr(a)))
c[0] = 5
print('Array a: {}'.format(repr(a)))
d = b.copy()
d[0] = 6
print('Array b: {}'.format(repr(b)))
C. Casting
We cast NumPy arrays through their inherent astype function. The function's
required argument is the new type for the array. It returns the array cast to
the new type.
The code below shows an example of casting using the astype function. The
dtype property returns the type of an array.
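For example (illustrative values):
arr = np.array([0, 1, 2])
print(arr.dtype)
arr = arr.astype(np.float32)
print(arr.dtype)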
D. NaN
The code below shows an example usage of np.nan . Note that np.nan cannot
take on an integer type.
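A brief sketch (the surrounding values are arbitrary):
arr = np.array([np.nan, 1, 2])
print(repr(arr))
# Would raise a ValueError, since np.nan cannot be converted to an integer:
# np.array([np.nan, 1, 2], dtype=np.int32)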
E. Infinity
To represent infinity in NumPy, we use the np.inf special value. We can also
represent negative infinity with -np.inf .
The code below shows an example usage of np.inf . Note that np.inf cannot
take on an integer type.
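A brief sketch (the surrounding values are arbitrary):
print(np.inf > 1000000)
arr = np.array([np.inf, 5])
print(repr(arr))
arr = np.array([-np.inf, 1])
print(repr(arr))
# Would raise an OverflowError, since np.inf cannot be converted to an integer:
# np.array([np.inf, 3], dtype=np.int32)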
The first array we'll create comes straight from a list of integers and np.nan .
The list contains np.nan as the first element, and the integers from 2 to 5 ,
inclusive, as the next four elements.
# CODE HERE
We now want to copy the array so we can change the first element to 10 . This
way we don't modify the original array.
Set arr2 equal to arr.copy() , then set the first element of arr2 equal to
10 .
# CODE HERE
The next two arrays will use floating point numbers. The first array will be
upcast to floating point numbers, while we manually cast the second array
using np.float32 .
For manual casting, we use an array's inherent astype function, which takes
in the new type as an argument and returns the casted array.
Set float_arr equal to np.array applied to a list with elements 1 , 5.4 , and
3 , in that order.
# CODE HERE
The final array will be a multi-dimensional array, specifically a 2-D matrix.
The 2-D matrix will have the integers 1 , 2 , 3 in its first row, and the integers
4 , 5 , 6 in its second row. We'll also manually set its type to np.float32 .
Set matrix equal to np.array with a list of lists (representing the specified
2-D matrix) as the first argument, and np.float32 as the dtype keyword
argument.
# CODE HERE
NumPy Basics
Chapter Goals:
Learn about some basic NumPy operations
Write code using the basic NumPy functions
A. Ranged data
arr = np.arange(5)
print(repr(arr))
arr = np.arange(5.1)
print(repr(arr))
arr = np.arange(-1, 4)
print(repr(arr))
arr = np.arange(-1.5, 4, 2)
print(repr(arr))
For three arguments, m, n, and s, np.arange will return an array with the
values in the range [m, n), using a step size of s between consecutive values.
Like np.array , np.arange performs upcasting. It also has the dtype
keyword argument to manually cast the array.
To specify the number of elements in the returned array, rather than the step
size, we can use the np.linspace function.
This function takes in a required first two arguments, for the start and end of
the range, respectively. The end of the range is inclusive for np.linspace ,
unless the keyword argument endpoint is set to False . To specify the number
of elements, we set the num keyword argument (its default value is 50 ).
The code below shows example usages of np.linspace . It also takes in the
dtype keyword argument for manual casting.
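For instance (illustrative arguments):
arr = np.linspace(5, 11, num=4)
print(repr(arr))
arr = np.linspace(5, 11, num=4, endpoint=False)
print(repr(arr))
arr = np.linspace(5, 11, num=4, dtype=np.int32)
print(repr(arr))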
B. Reshaping data
We are allowed to use the special value of -1 in at most one dimension of the
new shape. The dimension with -1 will take on the value necessary to allow
the new shape to contain all the elements of the array.
arr = np.arange(8)  # illustrative input
reshaped_arr = np.reshape(arr, (-1, 2))  # the -1 is inferred as 4
print(repr(reshaped_arr))
print('New shape: {}'.format(reshaped_arr.shape))
While the np.reshape function can perform any reshaping utilities we need,
NumPy provides an inherent function for flattening an array. Flattening an
array reshapes it into a 1D array. Since we need to flatten data quite often, it is
a useful function.
The code below flattens an array using the inherent flatten function.
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
flattened = arr.flatten()
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(flattened))
print('flattened shape: {}'.format(flattened.shape))
C. Transposing
The code below shows an example usage of the np.transpose function. The
matrix rows become columns after the transpose.
arr = np.arange(8)
arr = np.reshape(arr, (4, 2))
transposed = np.transpose(arr)
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(transposed))
print('transposed shape: {}'.format(transposed.shape))
The function takes in a required first argument, which will be the array we
want to transpose. It also has a single keyword argument called axes , which
represents the new permutation of the dimensions.
The permutation is a tuple/list of integers, with the same length as the number
of dimensions in the array. Entry i of the permutation specifies which of the old
dimensions becomes new dimension i. For example, if the permutation has 3 at
index 1, it means that old dimension 3 becomes new dimension 1, i.e. the old
fourth dimension becomes the new second dimension.
The code below shows an example usage of the np.transpose function with
the axes keyword argument. The shape property gives us the shape of an
array.
arr = np.arange(24)
arr = np.reshape(arr, (3, 4, 2))
transposed = np.transpose(arr, axes=(1, 2, 0))
print('arr shape: {}'.format(arr.shape))
print('transposed shape: {}'.format(transposed.shape))
In this example, the old first dimension became the new third dimension, the
old second dimension became the new first dimension, and the old third
dimension became the new second dimension. The default value for axes is a
dimension reversal (e.g. for 3-D data the default axes value is [2, 1, 0] ).
If we want to create an array of 0's or 1's with the same shape as another
array, we can use np.zeros_like and np.ones_like .
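For example (the input array here is arbitrary):
arr = np.array([[1, 2], [3, 4]])
print(repr(np.zeros_like(arr)))
print(repr(np.ones_like(arr)))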
Time to Code!
Our initial array will just be all the integers from 0 to 11, inclusive. We'll also
reshape it so it has three dimensions.
Set arr equal to np.arange(12) .
Then, set reshaped equal to np.reshape with arr as the first argument and
(2, 3, 2) as the second argument.
# CODE HERE
Next we want to get a flattened version of the reshaped array (the flattened
version is equivalent to arr ), as well as a transposed version. For the
transposed version of reshaped , we use a permutation of (1, 2, 0) .
Set flattened equal to reshaped.flatten() .
# CODE HERE
We'll create an array of 5 elements, all of which are 0 . We'll also create an
array with the same shape as transposed , but containing only 1 as the
elements.
# CODE HERE
The final array will contain 101 evenly spaced numbers between -3.5 and 1.5,
inclusive. Since they are evenly spaced, the difference between adjacent
numbers is 0.05.
Set points equal to np.linspace with -3.5 and 1.5 as the first two
arguments, respectively, as well as 101 for the num keyword argument.
# CODE HERE
Math
Chapter Goals:
Learn how to perform math operations in NumPy
Write code using NumPy math functions
A. Arithmetic
Using NumPy arithmetic, we can easily modify large amounts of numeric data
with only a few operations. For example, we could convert a dataset of
Fahrenheit temperatures to their equivalent Celsius form.
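For instance, a minimal Fahrenheit-to-Celsius conversion might look like this
(the temperatures are arbitrary):
def f2c(temps):
    return (5.0 / 9.0) * (temps - 32)

fahrenheits = np.array([32, -4, 14, -40])
print('Celsius: {}'.format(repr(f2c(fahrenheits))))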
B. Non-linear functions
Apart from basic arithmetic operations, NumPy also allows you to use non-
linear functions such as exponentials and logarithms.
The code below shows various exponentials and logarithms with NumPy. Note
that np.e and np.pi represent the mathematical constants e and π,
respectively.
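A brief sketch (illustrative values):
arr = np.array([1, 2, 3])
print(repr(np.exp(arr)))     # e raised to each element
print(repr(np.exp2(arr)))    # 2 raised to each element
arr2 = np.array([1, np.e, np.pi])
print(repr(np.log(arr2)))    # natural logarithm
print(repr(np.log10(arr2)))  # base 10 logarithm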
To do a regular power operation with any base, we use np.power . The first
argument to the function is the base, while the second is the power. If the base
or power is an array rather than a single number, the operation is applied to
every element in the array.
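For example (illustrative values):
arr = np.array([1, 2, 3])
print(repr(np.power(3, arr)))          # 3 raised to each element of arr
print(repr(np.power(arr, [4, 5, 6])))  # element-wise powers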
C. Matrix multiplication
Since NumPy arrays are basically vectors and matrices, it makes sense that
there are functions for dot products and matrix multiplication. Specifically,
the main function to use is np.matmul , which takes two vector/matrix arrays
as input and produces a dot product or matrix multiplication.
The code below shows various examples of matrix multiplication. When both
inputs are 1-D, the output is the dot product.
Note that the dimensions of the two input matrices must be valid for a matrix
multiplication. Specifically, the second dimension of the first matrix must
equal the first dimension of the second matrix, otherwise np.matmul will
result in a ValueError .
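A minimal sketch (the matrices here are arbitrary):
arr1 = np.array([1, 2, 3])
arr2 = np.array([-3, 0, 10])
print(np.matmul(arr1, arr2))  # dot product of two 1-D arrays

matrix1 = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
matrix2 = np.array([[-1, 0, 1], [3, 2, -4]])  # shape (2, 3)
print(repr(np.matmul(matrix1, matrix2)))      # shape (3, 3)
print(repr(np.matmul(matrix2, matrix1)))      # shape (2, 2)
# np.matmul(matrix1, matrix1) would raise a ValueError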
We'll create a couple of matrix arrays to perform our math operations on. The
first array will represent the matrix:
[[1.2, 3.1],
 [1.2, 0.3],
 [1.5, 2.2]]
Set arr equal to np.array applied to a list of lists representing the first
matrix.
Then set arr2 equal to np.array applied to a list of lists representing the
second matrix.
# CODE HERE
Then set added equal to the result of adding arr and multiplied .
Finally, set squared equal to added with each of its elements squared.
# CODE HERE
After the arithmetic operations, we'll apply the base e exponential and
logarithm to our array matrices.
# CODE HERE
Note that exponential has shape (2, 3) and logged has shape (3, 2) . So we
can perform matrix multiplication both ways.
Set matmul1 equal to np.matmul with first argument logged and second
argument exponential . Note that matmul1 will have shape (3, 3) .
Then set matmul2 equal to np.matmul with first argument exponential and
second argument logged . Note that matmul2 will have shape (2, 2) .
# CODE HERE
Random
Chapter Goals:
Learn about random operations in NumPy
Write code using the np.random submodule
A. Random integers
Similar to the Python random module, NumPy has its own submodule for
pseudo-random number generation called np.random . It provides all the
necessary randomized operations and extends it to multi-dimensional arrays.
To generate pseudo-random integers, we use the np.random.randint function.
print(np.random.randint(5))
print(np.random.randint(5))
print(np.random.randint(5, high=6))
If high is not None , then the required argument will represent the lower
(inclusive) end of the range, while high represents the upper (exclusive) end.
The size keyword argument specifies the size of the output array, where each
integer in the array is randomly drawn from the specified range. As a default,
np.random.randint returns a single integer.
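For example (illustrative arguments):
random_arr = np.random.randint(-3, high=14,
                               size=(2, 2))
print(repr(random_arr))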
B. Utility functions
The code below uses np.random.seed with the same random seed. Note how
the outputs of the random functions in each subsequent run are identical
when we set the same random seed.
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
size=(2, 2))
print(repr(random_arr))
# New seed
np.random.seed(2)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
size=(2, 2))
print(repr(random_arr))
# Original seed
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
size=(2, 2))
print(repr(random_arr))
The code below shows example usages of np.random.shuffle . Note that only
the rows of matrix are shuffled (i.e. shuffling along first dimension only).
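A brief sketch (the arrays here are arbitrary):
vec = np.array([1, 2, 3, 4, 5])
np.random.shuffle(vec)
print(repr(vec))

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
np.random.shuffle(matrix)  # only the rows are shuffled
print(repr(matrix))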
C. Distributions
Using np.random we can also draw samples from probability distributions. For
example, we can use np.random.uniform to draw pseudo-random real numbers
from a uniform distribution.
print(np.random.uniform())
print(np.random.uniform(low=-1.5, high=2.2))
print(repr(np.random.uniform(size=3)))
print(repr(np.random.uniform(low=-3.4, high=5.9,
size=(2, 2))))
The size keyword argument is the same as the one for np.random.randint , i.e.
it represents the output size of the array.
print(np.random.normal())
print(np.random.normal(loc=1.5, scale=3.5))
print(repr(np.random.normal(loc=-2.4, scale=4.0,
size=(2, 2))))
NumPy provides quite a few more built-in distributions, which are listed here.
D. Custom sampling
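To sample from a custom set of values, we can use np.random.choice , whose
p keyword argument specifies the probability of drawing each value. A brief
sketch:
colors = ['red', 'blue', 'green']
print(np.random.choice(colors))
print(repr(np.random.choice(colors, size=2)))
print(repr(np.random.choice(colors, size=(2, 2),
                            p=[0.8, 0.19, 0.01])))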
In the example, we set p such that 'red' has a probability of 0.8 of being
chosen, 'blue' has a probability of 0.19, and 'green' has a probability of
0.01. When p is not set, the probabilities are equal for each element in the
distribution (and sum to 1).
Time to Code!
Note: it is important you do all the instructions in the order listed. We
test your code by setting a fixed np.random.seed , so in order for your code
to match the reference output, all the functions must be run in the
correct order.
We'll start off by obtaining some random integers. The first integer we get will
be randomly chosen from the range [0, 5). The remaining integers will be part
of a 3x5 NumPy array, each randomly chosen from the range [3, 10).
# CODE HERE
The next two arrays will be drawn randomly from distributions. The first will
contain 5 numbers drawn uniformly from the range [-2.5, 1.5].
# CODE HERE
The second array will contain 50 numbers drawn from a normal distribution
with mean 2.0 and standard deviation 3.5.
Set random_norm equal to np.random.normal with the loc and scale keyword
arguments set to 2.0 and 3.5 , respectively. The size keyword argument
should be set to (10, 5) .
# CODE HERE
We'll now create our own distribution of strings and randomly select from it.
The values for our distribution will be 'a' , 'b' , 'c' , 'd' .
Set choices equal to a list of the specified values, in the order given.
# CODE HERE
# CODE HERE
Indexing
Chapter Goals:
Learn about indexing arrays in NumPy
Write code for indexing and slicing arrays
A. Array accessing
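Accessing individual elements of a NumPy array works like accessing a regular
Python list. For example (arbitrary values):
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])
print(arr[4])

arr2 = np.array([[6, 3], [0, 2]])
print(repr(arr2[0]))  # the first row
print(arr2[0][1])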
B. Slicing
NumPy arrays also support slicing. Similar to Python, we use the colon
operator (i.e. arr[:] ) for slicing. We can also use negative indexing to slice in
the backwards direction.
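For example (arbitrary values):
arr = np.array([1, 2, 3, 4, 5])
print(repr(arr[1:4]))
print(repr(arr[:-1]))

arr2 = np.array([[6, 3, 1], [0, 2, 4]])
print(repr(arr2[:, 1:]))  # all rows, skipping the first column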
In addition to accessing and slicing arrays, it is useful to figure out the actual
indexes of the minimum and maximum elements. To do this, we use the
np.argmin and np.argmax functions.
The code below shows example usages of np.argmin and np.argmax . Note that
the index of element -6 is index 5 in the flattened version of arr .
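For instance, using a small 2-D array chosen so that -6 sits at flattened index 5:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
print(np.argmin(arr))  # 5
print(np.argmax(arr))  # 7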
The np.argmin and np.argmax functions take the same arguments. The
required argument is the input array and the axis keyword argument
specifies which dimension to apply the operation on.
The code below shows how the axis keyword argument is used for these
functions.
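A sketch, using the same illustrative arr :
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
print(repr(np.argmin(arr, axis=0)))
print(repr(np.argmin(arr, axis=1)))
print(repr(np.argmax(arr, axis=-1)))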
In our example, using axis=0 meant the function found the index of the
minimum row element for each column. When we used axis=1 , the function
found the index of the minimum column element for each row.
Setting axis to -1 just means we apply the function across the last dimension.
In this case, axis=-1 is equivalent to axis=1 .
Time to Code!
Each coding exercise in this chapter will be to complete a small function that
takes in a 2-D NumPy matrix ( data ) as input. The first function to complete is
direct_index .
Set elem equal to the third element of the second row in data (remember
that the first row is index 0). Then return elem .
def direct_index(data):
# CODE HERE
pass
The next function, slice_data , will return two slices from the input data .
The first slice will contain all the rows, but will skip the first element in each
row. The second slice will contain all the elements of the first three rows
except the last two elements.
Set slice1 equal to the specified first slice. Remember that NumPy uses a
comma to separate slices along different dimensions.
Set slice2 equal to the specified second slice.
def slice_data(data):
# CODE HERE
pass
The next function, argmin_data , will find minimum indexes in the input data .
We can use np.argmin to find minimum points in the data array. First, we'll
find the index of the overall minimum element.
We can also return the indexes of each row's minimum element. This is
equivalent to finding the minimum column for each row, which means our
operation is done along axis 1 .
Set argmin1 equal to np.argmin with data as the first argument and the
specified axis keyword argument.
def argmin_data(data):
# CODE HERE
pass
The final function, argmax_data , will find the index of each row's maximum
element in data . Since there are only 2 dimensions in data , we can apply the
operation along either axis 1 or -1 .
Set argmax_neg1 equal to np.argmax with data as the first argument and -1
as the axis keyword argument. Then return argmax_neg1 .
def argmax_data(data):
# CODE HERE
pass
Filtering
Chapter Goals:
Learn how to filter data in NumPy
Write code for filtering NumPy arrays
A. Filtering data
Sometimes we have data that contains values we don't want to use. For
example, when tracking the best hitters in baseball, we may want to only use
the batting average data above .300. In this case, we should filter the overall
data for only the values that we want.
The key to filtering data is through basic relation operations, e.g. == , > , etc. In
NumPy, we can apply basic relation operations element-wise on arrays.
The code below shows relation operations on NumPy arrays. The ~ operation
represents a boolean negation, i.e. it flips each truth value in the array.
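A brief sketch (the array values are arbitrary):
arr = np.array([[0, 2, 3],
                [1, 3, -6],
                [-3, -2, 1]])
print(repr(arr == 3))
print(repr(arr > 0))
print(repr(~(arr != 1)))  # equivalent to arr == 1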
Something to note is that np.nan can't be used with any relation operation.
Instead, we use np.isnan to filter for the location of np.nan .
The code below uses np.isnan to determine which locations of the array
contain np.nan values.
arr = np.array([[0, 2, np.nan],
[1, np.nan, -6],
[np.nan, -2, 1]])
print(repr(np.isnan(arr)))
B. Filtering in NumPy
When applied to a single boolean array, the np.where function returns the
locations of the True elements as a tuple of index arrays. The tuple will have
size equal to the number of dimensions in the data, and each array represents
the True indices for the corresponding dimension. Note that the arrays in the
tuple will all have the same length, equal to the number of True elements in
the input argument.
The code below shows how to use np.where with a single argument.
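A minimal sketch (illustrative values):
print(repr(np.where([True, False, True])))

arr = np.array([0, 3, 5, 3, 1])
print(repr(np.where(arr == 3)))

arr = np.array([[0, 2, 3],
                [1, 0, 0],
                [-3, 0, 0]])
x_ind, y_ind = np.where(arr != 0)
print(repr(x_ind))  # row indices of the non-zero elements
print(repr(y_ind))  # column indices of the non-zero elements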
The interesting thing about np.where is that it must be applied with exactly 1
or 3 arguments. When we use 3 arguments, the first argument is still the
boolean array, while the next two arguments represent the values to substitute
at the True and False locations, respectively.
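For example (the arrays here are illustrative):
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
negatives = np.array([[-2, -5], [-1, -8]])
print(repr(np.where(np_filter, positives, negatives)))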
Note that our second and third arguments necessarily had the same shape as
the first argument. However, if we wanted to use a constant replacement
value, e.g. -1 , we could incorporate broadcasting. Rather than using an entire
array of the same value, we can just use the value itself as an argument.
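For instance, using np_filter and positives from above with the constant -1 :
print(repr(np.where(np_filter, positives, -1)))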
C. Axis-wise filtering
The code below shows usage of np.any and np.all with a single argument.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[3, 9, 1]])
print(repr(arr > 0))
print(np.any(arr > 0))
print(np.all(arr > 0))
Setting axis to -1 just means we apply the function across the last dimension.
The code below shows examples of using np.any and np.all with the axis
keyword argument.
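A sketch, reusing the arr defined above:
print(repr(np.any(arr > 0, axis=0)))
print(repr(np.any(arr > 0, axis=1)))
print(repr(np.all(arr > 0, axis=1)))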
We can use np.any and np.all in tandem with np.where to filter for entire
rows or columns of data. For instance, applying np.any along axis=1 tells us
which rows contain at least one positive number; we then pass the resulting
boolean array as the input to np.where , which gives us the actual indices of
the rows with at least one positive number.
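A sketch, continuing with the same arr :
has_positive = np.any(arr > 0, axis=1)
print(repr(has_positive))
print(repr(arr[np.where(has_positive)]))  # keeps only rows with a positive value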
Time to Code!
Each coding exercise in this chapter will be to complete a small function that
takes in a 2-D NumPy matrix ( data ) as input. The first function to complete is
get_positives .
Set a tuple of x_ind, y_ind equal to the output of np.where , applied with
the condition data > 0 .
Next, we'll complete the function replace_zeros . The function replaces each of
the non-positive elements in data with 0. We first create an array of all 0's,
with the same shape as data .
Then we filter the data array and replace the non-positive elements with the
corresponding element from zeros (which will be a 0).
Set zero_replace equal to np.where with the condition of data > 0 . The
second argument will be data and the third argument will be zeros .
Return zero_replace .
def replace_zeros(data):
# CODE HERE
pass
Set neg_one_replace equal to np.where with the condition of data > 0 . The
second argument will be data and the third will be -1 .
Return neg_one_replace .
def replace_neg_one(data):
# CODE HERE
pass
Our final function, coin_flip_filter , will apply a filter using a boolean array
as the condition. We'll first create a boolean coin flip array with the same
shape as data .
Then we filter data using bool_coin_flips as the condition. For the False
values in bool_coin_flips , we replace the corresponding index in data with a
1.
Return one_replace .
def coin_flip_filter(data):
# CODE HERE
pass
Statistics
Chapter Goals:
Learn about basic statistical analysis in NumPy
Write code to obtain statistics for NumPy arrays
A. Analysis
It is often useful to analyze data for its main characteristics and interesting
trends. Though we will go more in-depth on data analysis in the section of this
course titled Data Preprocessing with scikit-learn, there are still a few
techniques in NumPy that allow us to quickly inspect data arrays.
The code below shows example usages of the min and max functions (the
array values here are illustrative).
arr = np.array([[0, 72, 3], [1, 3, -60], [-3, -2, 4]])
print(repr(arr.min(axis=0)))
print(repr(arr.max(axis=-1)))
The axis keyword argument is identical to how it was used in np.argmin and
np.argmax from the chapter on Indexing. In our example, we use axis=0 to
find an array of the minimum values in each column of arr and axis=1 to
find an array of the maximum values in each row of arr .
B. Statistical metrics
NumPy also provides basic statistical functions such as np.mean , np.var , and
np.median , to calculate the mean, variance, and median of the data,
respectively.
The code below shows how to obtain basic statistics with NumPy. Note that
np.median applied without axis takes the median of the flattened array.
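A brief sketch (the array values are arbitrary):
arr = np.array([[0, 72, 3], [1, 3, -60], [-3, -2, 4]])
print(np.mean(arr))
print(np.var(arr))
print(np.median(arr))                 # median of the flattened array
print(repr(np.median(arr, axis=-1)))  # median of each row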
Each of these functions takes in the data array as a required argument and
axis as a keyword argument. For a more comprehensive list of statistical
functions (e.g. calculating percentiles, creating histograms, etc.), check out the
NumPy statistics page.
Time to Code!
Each coding exercise in this chapter will be to complete a small function that
takes in a 2-D NumPy matrix ( data ) as input. The first function to complete is
get_min_max , which returns the overall minimum and maximum element in
data .
def get_min_max(data):
# CODE HERE
pass
Next, we'll complete col_min , which returns the minimums across each
column of data .
Set min0 equal to data.min with the axis keyword argument set to 0 .
def col_min(data):
# CODE HERE
pass
def basic_stats(data):
# CODE HERE
pass
Aggregation
Chapter Goals:
Learn how to aggregate data in NumPy
Write code to obtain sums and concatenations of NumPy arrays
A. Summation
The function takes in a NumPy array as its required argument, and uses the
axis keyword argument in the same way as described in previous chapters. If
the axis keyword argument is not specified, np.sum returns the overall sum
of the array.
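For example (illustrative values):
arr = np.array([[0, 72, 3], [1, 3, -60], [-3, -2, 4]])
print(np.sum(arr))
print(repr(np.sum(arr, axis=0)))  # column sums
print(repr(np.sum(arr, axis=1)))  # row sums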
The code below shows how to use np.cumsum . For a 2-D NumPy array, setting
axis=0 returns an array with cumulative sums across each column, while
axis=1 returns the array with cumulative sums across each row. Not setting
axis returns a cumulative sum across all the values of the flattened array.
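For example (reusing the same illustrative values):
arr = np.array([[0, 72, 3], [1, 3, -60], [-3, -2, 4]])
print(repr(np.cumsum(arr)))          # cumulative sums of the flattened array
print(repr(np.cumsum(arr, axis=0)))  # cumulative sums down each column
print(repr(np.cumsum(arr, axis=1)))  # cumulative sums across each row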
B. Concatenation
The code below shows how to use np.concatenate , which aggregates arrays by
joining them along a specific dimension. For 2-D arrays, not setting the axis
argument (defaults to axis=0 ) concatenates the arrays vertically. When we set
axis=1 , the arrays are concatenated horizontally.
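For example (the arrays here are arbitrary):
arr1 = np.array([[0, 72, 3],
                 [1, 3, -60]])
arr2 = np.array([[-15, 6, 1],
                 [8, 9, -4]])
print(repr(np.concatenate([arr1, arr2])))          # vertical (axis=0)
print(repr(np.concatenate([arr1, arr2], axis=1)))  # horizontal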
Time to Code!
Each coding exercise in this chapter will be to complete a small function that
takes in 2-D NumPy matrices as input. The first function to complete is
get_sums , which returns the overall sum and column sums of data .
def get_sums(data):
# CODE HERE
pass
def get_cumsum(data):
# CODE HERE
pass
The final function, concat_arrays , takes in two 2-D NumPy arrays as input. It
returns the column-wise and row-wise concatenations of the input arrays.
Chapter Goals:
Learn how to save and load data in NumPy
Write code to save NumPy data to a file
A. Saving
After performing data manipulation with NumPy, it's a good idea to save the
data in a file for future use. To do this, we use the np.save function.
The first argument for the function is the name/path of the file we want to
save our data to. The file name/path should have a ".npy" extension. If it does
not, then np.save will append the ".npy" extension to it.
The second argument for np.save is the NumPy data we want to save. The
function has no return value. Also, the contents of a ".npy" file are largely
gibberish when viewed with a text editor.
If np.save is called with the name of a file that already exists, it will overwrite
the previous file.
B. Loading
After saving our data, we can load it again using np.load . The function's
required argument is the file name/path that contains the saved data. It
returns the NumPy data exactly as it was saved.
Note that np.load will not append the ".npy" extension to the file name/path if
it is not there.
The code below shows how to use np.load to load NumPy data.
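A minimal save/load round trip might look like this (the file name and values
are illustrative):
arr = np.array([1, 2, 3])
np.save('arr.npy', arr)
loaded = np.load('arr.npy')
print(repr(loaded))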
Time to Code!
The coding exercise in this chapter will require you to complete the
save_points function, which will save some randomly generated 2-D points in
a file.
You'll generate 100 (x, y) points from a uniform distribution in the range [-2.5,
2.5), then save the points to save_file .
Set points equal to np.random.uniform , with the low and high keyword
arguments representing the lower and upper ends of the range. The size
keyword argument should be set to (100, 2) .
Call np.save with save_file as the first argument and points as the
second argument.
def save_points(save_file):
# CODE HERE
pass
Quiz
1. What does the arange function do?
Introduction
In the Data Processing section, you will be using pandas to analyze Major
League Baseball (MLB) data. The data comes courtesy of Sean Lahman, and
contains statistics for every player, manager, and team in MLB history. The
full database can be found and downloaded here.
A. Data analysis
Before doing any task with a dataset, it is a good idea to perform preliminary
data analysis. Data analysis allows us to understand the dataset, find potential
outlier values, and figure out which features of the dataset are most important
to our application.
B. pandas
In the following chapters we'll dive into the main data analysis functionalities
of pandas. For a complete overview of the pandas toolkit, you can visit the
official pandas website.
An essential part of data analysis is creating charts and plots to visualize the
data. Similar to the saying, "a picture is worth a thousand words", data
visualization can convey key data trends and correlations through a single
figure.
For visualization we use the Matplotlib library, which supports many types of
plots, from simple line plots to heatmaps and 3-D plots. While we will only
touch on the basic necessities for
our data analysis (e.g. line plots, boxplots, etc.), a full overview of Matplotlib
can be found at the official website.
Series
Chapter Goals
Learn about the pandas Series object and its basic utilities
Write code to create several Series objects
A. 1-D data
Similar to NumPy, pandas frequently deals with 1-D and 2-D data. However,
we use two separate objects to deal with 1-D and 2-D data in pandas. For 1-D
data, we use the pandas.Series objects, which we'll refer to simply as a Series.
The first keyword argument is data , which specifies the elements of the
Series. If data is not set, pd.Series returns an empty Series. Since the data
keyword argument is almost always used, we treat it like a regular first
argument (i.e. skip the data= prefix).
The code below shows how to create pandas Series objects using pd.Series .
import pandas as pd

series = pd.Series()
# Newline to separate series print statements
print('{}\n'.format(series))
series = pd.Series(5)
print('{}\n'.format(series))
series = pd.Series([1, 2, 3])
print('{}\n'.format(series))
In our examples, we initialized each Series with its values by setting the first
argument using a scalar, list, or NumPy array. Note that pd.Series upcasts
values in the same way as np.array . Furthermore, since Series objects are 1-D,
the ser variable represents a Series with lists as elements, rather than a 2-D
matrix.
B. Index
In the previous examples, you may have noticed the zero-indexed integers to
the left of the elements in each Series. These integers are collectively referred
to as the index of a Series, and each individual index element is referred to as
a label.
The code below shows how to use the index keyword argument with
pd.Series .
The index keyword argument needs to be a list or array with the same length
as the data argument for pd.Series . The values in the index list can be any
hashable type (e.g. integer, float, string).
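For example (the values and labels here are arbitrary):
series = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print('{}\n'.format(series))

series = pd.Series([1, 2, 3], index=['a', 8, 0.3])
print('{}\n'.format(series))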
C. Dictionary input
Another way to set the index of a Series is by using a Python dictionary for the
data argument. The keys of the dictionary represent the index of the Series,
while each individual key is the label for its corresponding value.
The code below shows how to use pd.Series with a Python dictionary as the
first argument. In our example, we set 'a' , 'b' , and 'c' as the Series index,
with corresponding values 1 , 2 , and 3 .
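A brief sketch:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
print('{}\n'.format(series))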
Time to Code!
The coding exercise for this chapter involves creating various pandas Series
objects.
The first Series we create will contain basic floating point numbers. The list
we use to initialize the Series is [1, 3, 5.2] .
Set s1 equal to pd.Series with the specified list as the only argument.
# CODE HERE
# CODE HERE
We'll create another Series, this time with integers. The list we use to initialize
this Series is [1, 3, 8, np.nan] . This Series will also have row labels, which
will be ['a', 'b', 'c', 'd'] .
Set s3 equal to pd.Series with the specified list of integers as the first
argument and the list of labels as the index keyword argument.
# CODE HERE
The final Series we create will be initialized from a Python dictionary. The
dictionary will have key-value pairs 'a':0 , 'b':1 , and 'c':2 .
# CODE HERE
DataFrame
Chapter Goals:
Learn about the pandas DataFrame object and its basic utilities
Write code to create and manipulate a pandas DataFrame
A. 2-D data
One of the main purposes of pandas is to deal with tabular data, i.e. data that
comes from tables or spreadsheets. Since tabular data contains rows and
columns, it is 2-D. For working with 2-D data, we use the pandas.DataFrame
object, which we'll refer to simply as a DataFrame.
df = pd.DataFrame()
# Newline added to separate DataFrames
print('{}\n'.format(df))
df = pd.DataFrame([5, 6])
print('{}\n'.format(df))
df = pd.DataFrame([[5,6]])
print('{}\n'.format(df))
df = pd.DataFrame([[5, 6], [1, 3]],  # example values
                  index=['r1', 'r2'])
print('{}\n'.format(df))
Note that when we use a Python dictionary for initialization, the DataFrame
takes the dictionary's keys as its column labels.
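For instance (illustrative values):
df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]},
                  index=['r1', 'r2'])
print('{}\n'.format(df))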
B. Upcasting
The code below shows how upcasting works in DataFrames. You'll notice that
upcasting only occurs in the first column for the DataFrame below, because
the second column's values are both integers.
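A minimal sketch consistent with that description (the values are illustrative):
upcast = pd.DataFrame([[5, 6], [1.2, 3]])
print('{}\n'.format(upcast))
print(upcast.dtypes)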
C. Appending rows
Note that the append function returns the modified DataFrame but doesn't
actually change the original. Furthermore, when we append a Series to the
DataFrame, we either need to specify the name for the series or use the
ignore_index keyword argument. Setting ignore_index=True will change the
row labels to integer indexes.
df = pd.DataFrame([[5, 6], [1.2, 3]])  # example values
ser = pd.Series([0, 0], name='r3')
df_app = df.append(ser)
print('{}\n'.format(df_app))
df2 = pd.DataFrame([[0,0],[9,9]])
df_app = df.append(df2)
print('{}\n'.format(df_app))
D. Dropping data
We can drop rows or columns from a given DataFrame through the drop
function. There is no required argument, but the keyword arguments of the
function gives us two ways to drop rows/columns from a DataFrame.
The first way is using the labels keyword argument to specify the labels of
the rows/columns we want to drop. We use this alongside the axis keyword
argument (which has default value of 0 ) to drop from the rows or columns
axis.
The code below shows examples on how to use the drop function.
df_drop = df.drop(index='r2')
print('{}\n'.format(df_drop))
df_drop = df.drop(columns='c2')
print('{}\n'.format(df_drop))
df_drop = df.drop(index='r2', columns='c2')
print('{}\n'.format(df_drop))
Similar to append , the drop function returns the modified DataFrame but
doesn't actually change the original.
Note that when using labels and axis , we can't drop both rows and columns
from the DataFrame.
Time to Code!
The coding exercise for this chapter involves creating various pandas
DataFrame objects.
We'll first create a DataFrame from a Python dictionary. The dictionary will
have key-value pairs 'c1':[0, 1, 2, 3] and 'c2':[5, 6, 7, 8] , in that order.
The index for the DataFrame will come from the list of row labels ['r1',
'r2', 'r3', 'r4'] .
# CODE HERE
We'll create another DataFrame, this one representing a single row. Rather
than a dictionary for the first argument, we use a list of lists, and manually set
the column labels to ['c1', 'c2'] .
Since there is only one row, the row labels will be ['r5'] .
Set row_df equal to pd.DataFrame with [[9, 9]] as the first argument, and
the specified column and row labels for the columns and index keyword
arguments.
# CODE HERE
After creating row_df , we append it to the end of df and drop row 'r2' .
Set df_app equal to df.append with row_df as the only argument.
Then set df_drop equal to df_app.drop with 'r2' as the labels keyword
argument.
# CODE HERE
Combining
Chapter Goals:
Understand the methods used to combine DataFrame objects
Write code for combining DataFrames
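The pd.concat function concatenates a list of DataFrames along rows (the
default) or along columns ( axis=1 ). A brief sketch (the DataFrames here are
illustrative):
df1 = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]},
                   index=['r1', 'r2'])
df2 = pd.DataFrame({'c1': [5, 6], 'c2': [7, 8]},
                   index=['r1', 'r2'])
df3 = pd.DataFrame({'c1': [5, 6], 'c2': [7, 8]})

concat = pd.concat([df1, df2])
print('{}\n'.format(concat))

concat = pd.concat([df1, df2], axis=1)
print('{}\n'.format(concat))

concat = pd.concat([df1, df3], axis=1)
print('{}\n'.format(concat))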
In the code example, the final call to pd.concat resulted in a DataFrame with
many NaN values. This is because the row labels for df1 and df3 did not
match, so result was padded with NaN in locations where values did not exist.
B. Merging
mlb_df1 = pd.DataFrame({'name': ['john doe', 'al smith', 'sam black', 'john doe'],
'pos': ['1B', 'C', 'P', '2B'],
'year': [2000, 2004, 2008, 2003]})
mlb_df2 = pd.DataFrame({'name': ['john doe', 'al smith', 'jack lee'],
'year': [2000, 2004, 2012],
'rbi': [80, 100, 12]})
print('{}\n'.format(mlb_df1))
print('{}\n'.format(mlb_df2))
merged_df = pd.merge(mlb_df1, mlb_df2)
print('{}\n'.format(merged_df))
Without using any keyword arguments, pd.merge joins two DataFrames using
all their common column labels. In the code example, the common labels
between mlb_df1 and mlb_df2 were name and year .
The rows that contain the exact same values for the common column labels
will be merged. Since 'john doe' for year 2000 was in both mlb_df1 and
mlb_df2 , its row was merged. However, 'john doe' for year 2003 was only in
mlb_df1 , so its row was not merged.
The pd.merge function takes in many keyword arguments, but often none are
needed to properly merge two DataFrames.
Time to Code!
The coding exercises for this chapter involve completing small functions that
take in two DataFrame objects as input.
The first function, concat_rows will concatenate the rows of the two
DataFrames.
Set row_concat equal to pd.concat with [df1, df2] as the only argument.
Then return row_concat .
The next function, concat_cols will concatenate the columns of the two input
DataFrames.
The final function, merge_dfs will merge the two input DataFrames along
their columns.
Set merged_df equal to pd.merge with df1 and df2 as the first and second
arguments, respectively.
Chapter Goals:
Learn how to index a DataFrame to retrieve rows and columns
Write code for indexing a DataFrame
A. Direct indexing
The code below shows how to directly index into a DataFrame's columns.
col1 = df['c1']
print('{}\n'.format(col1))
col1_df = df[['c1']]
print('{}\n'.format(col1_df))
Note that when we use a single column label inside the bracket (as was the
case for col1 in the code example), the output is a Series representing the
corresponding column. When we use a list of column labels (as was the case
for col1_df and col23 ), the output is a DataFrame that contains the
corresponding columns.
We can also use direct indexing to retrieve a subset of the rows (as a
DataFrame). However, we can only retrieve rows based on slices, rather than
specifying particular rows.
The code below shows how to directly index into a DataFrame's rows.
print('{}\n'.format(df))
first_two_rows = df[0:2]
print('{}\n'.format(first_two_rows))
last_two_rows = df['r2':'r3']
print('{}\n'.format(last_two_rows))
# Results in KeyError
df['r1']
You'll notice that when we used integer indexing for the rows, the end index
was exclusive (e.g. first_two_rows excluded the row at index 2). However,
when we use row labels, the end index is inclusive (e.g. last_two_rows
included the row labeled 'r3' ).
B. Other indexing
Apart from direct indexing, a DataFrame object also contains the loc and
iloc properties for indexing.
We use iloc to access rows based on their integer index. Using iloc we can
access a single row as a Series, and specify particular rows to access through a
list of integers or a boolean array.
The code below shows how to use iloc to access a DataFrame's rows.
print('{}\n'.format(df))
print('{}\n'.format(df.iloc[1]))
print('{}\n'.format(df.iloc[[0, 2]]))
The loc property provides the same row indexing functionality as iloc , but
uses row labels rather than integer indexes. Furthermore, with loc we can
perform column indexing along with row indexing, and set new values in a
DataFrame for specific rows and columns.
print('{}\n'.format(df))
print('{}\n'.format(df.loc['r2']))
print('{}\n'.format(df.loc[['r1', 'r3'], 'c2']))  # row and column indexing
df.loc[['r1', 'r3'], 'c2'] = 0  # setting new values for specific rows/columns
You'll notice that the way we access rows and columns together with loc is
similar to how we access 2-D NumPy arrays.
Since we can't access columns on their own with loc or iloc , we still use
bracket indexing when retrieving columns of a DataFrame.
Time to Code!
The coding exercises for this chapter involve directly indexing into a
predefined DataFrame, df .
We'll initially use direct indexing to get the first column of df as well as the
first two rows.
# CODE HERE
Next, we'll use iloc to retrieve the first and third rows of df .
# CODE HERE
Finally, we use loc to set each value of the second column, in the third and
fourth rows, equal to 12. The row key we use for indexing will be ['r3','r4'] ,
while the column key will be 'c2' .
Set df.loc , indexed with the specified row and column keys, equal to 12 .
# CODE HERE
File I/O
Chapter Goals:
Learn how to handle file input/output using pandas
Write code for processing data files
A. Reading data
One of the most important features in pandas is the ability to read from data
files. pandas accepts a variety of file formats, ranging from CSV and Excel
spreadsheets to SQL and even HTML. A full list of the available file formats for
pandas can be found here.
In this chapter we'll focus on three of the most common file types: CSV, XLSX
(Microsoft Excel), and JSON. For reading data from a file, we use either the
read_csv , read_excel , or read_json function, depending on the file type.
Each of the file reading functions takes in a file path as the only required
argument. Each function has numerous keyword arguments, so we won't get
into most of them. However, we'll still discuss a couple of the more commonly
used keyword arguments.
CSV
df = pd.read_csv('data.csv', index_col=0)
print('{}\n'.format(df))
df = pd.read_csv('data.csv', index_col=1)
print('{}\n'.format(df))
Excel
When we don't use any keyword arguments, the returned DataFrame from
pd.read_excel contains the first sheet of the Excel workbook. However, when
we set the sheet_name keyword argument, we can obtain a specific
spreadsheet by passing in its integer index or name.
print('MIL DataFrame:')
df = pd.read_excel('data.xlsx', sheet_name='MIL')
print('{}\n'.format(df))
# Sheets 0 and 1
df_dict = pd.read_excel('data.xlsx', sheet_name=[0, 1])
print('{}\n'.format(df_dict[1]))
# All Sheets
df_dict = pd.read_excel('data.xlsx', sheet_name=None)
print(df_dict.keys())
JSON
JSON data is pretty similar to a Python dictionary. In fact, you can use the
json module (part of the Python standard library) to convert between
dictionaries and JSON data. The file path for pd.read_json specifies a file
containing JSON data.
When we don't use any keyword arguments, pd.read_json treats each outer
key of the JSON data as a column label and each inner key as a row label. In
the code example below, you can see df1 treats the player's names as column
labels.
However, when we set orient='index' , the outer keys are treated as row
labels and the inner keys are treated as column labels.
df1 = pd.read_json('data.json')
print('{}\n'.format(df1))
df2 = pd.read_json('data.json', orient='index')
print('{}\n'.format(df2))
B. Writing to files
We can also use pandas to write data to a file. Focusing again on CSV, Excel,
and JSON, the functions we use to write to files are to_csv , to_excel , and
to_json .
Similar to the file reading functions, each of the writing functions has dozens
of keyword arguments. Therefore, we'll only go over a few of the commonly
used ones.
CSV
Note that when we don't use any keyword arguments, to_csv will write the
row labels as the first column in the CSV file. This is fine if the row labels are
meaningful, but if they are just integers we don't really want them in the CSV
file. In that case, we set index=False , to specify that we don't write the row
labels into the CSV file.
# Predefined mlb_df
print('{}\n'.format(mlb_df))
mlb_df.to_csv('data.csv')
mlb_df.to_csv('data.csv', index=False)
Excel
# Predefined DataFrames
print('{}\n'.format(mlb_df1))
print('{}\n'.format(mlb_df2))
with pd.ExcelWriter('data.xlsx') as writer:  # sheet names below are illustrative
    mlb_df1.to_excel(writer, index=False, sheet_name='NYY')
    mlb_df2.to_excel(writer, index=False, sheet_name='BOS')
The to_json function also uses the orient keyword argument that was part of
pd.read_json . Like in pd.read_json , setting orient='index' will set the outer
keys of the JSON data to the row labels and the inner keys to the column
labels.
# Predefined df
print('{}\n'.format(df))
df.to_json('data.json')
df2 = pd.read_json('data.json')
print('{}\n'.format(df2))
df.to_json('data.json', orient='index')
df2 = pd.read_json('data.json')
print('{}\n'.format(df2))
df2 = pd.read_json('data.json', orient='index')
print('{}\n'.format(df2))
Time to Code!
The coding exercises in this chapter involve reading from CSV files containing
baseball data, manipulating the data, and then writing the resulting data into a
new CSV file.
First, we'll read from the two CSV files 'stats.csv' and 'salary.csv' . These
files contain the stats and salaries, respectively, of various baseball players.
# CODE HERE
Rather than having two separate DataFrames, we want a single DataFrame
that contains the yearly stats and salaries for each player. Therefore, we can
just merge the stats_df and salary_df DataFrames.
Set df equal to pd.merge with stats_df and salary_df as the first two
arguments, in that order.
# CODE HERE
Finally, we write the merged DataFrame into the file named 'out.csv' . Since
the original CSV files didn't label the rows, we'll make sure not to label the
rows of 'out.csv' .
Call df.to_csv with 'out.csv' as the first argument and False for the
index keyword argument.
# CODE HERE
Grouping
Chapter Goals:
Learn how to group DataFrames by columns
Write code to retrieve home run statistics through DataFrame grouping
A. Grouping by column
When dealing with large amounts of data, it is usually a good idea to group
the data by common categories. For example, we could group a large dataset
of MLB player statistics by year, so we can deal with each year's data
separately.
With pandas DataFrames, we can perform dataset grouping with the groupby
function. A common usage of the function is to group a DataFrame by values
from a particular column, e.g. a column representing years.
The code below shows how to use the groupby function, with the example of
grouping MLB data by year.
groups = df.groupby('yearID')
for name, group in groups:
print('Year: {}'.format(name))
print('{}\n'.format(group))
print('{}\n'.format(groups.get_group(2016)))
print('{}\n'.format(groups.sum()))
print('{}\n'.format(groups.mean()))
The grouping code example produced three DataFrames for the years 2015,
2016, and 2017. The three DataFrame groups are contained in the groups
variable, and we used its sum and mean functions to retrieve the total and
average per-year statistics.
In addition to aggregation functions like sum and mean , we can also filter the
groups using filter . The filter function takes in another function as its
required argument, which specifies how we want to filter the groups. The
output of filter is the concatenation of all the groups that pass the filter.
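For instance, using the year-separated groups from above (a sketch):
no2015 = groups.filter(lambda x: x.name > 2015)
print('{}\n'.format(no2015))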
In the above code example, the lambda function passed into filter returns
True if the group (represented as x ) represents a year greater than 2015. The
output is the concatenation of the 2016 and 2017 groups.
B. Multiple columns
# player_df is predefined
groups = player_df.groupby(['yearID', 'teamID'])
print(groups.sum())
In the code above, we grouped the MLB data by both year and team, resulting
in each group's name being a tuple of year and team. Using the sum function,
we obtained the annual total hits for each team.
Time to Code!
The coding exercises for this chapter involve performing grouping operations
on df , which contains all MLB batting data from 1871-2017. Using df , our
goal is to retrieve home run (HR) statistics for 2017.
To do this, we need to calculate the total number of home runs hit each year.
This involves first grouping df by year.
# CODE HERE
The yearly stats can be obtained from summing the values across the year-
separated groups.
# CODE HERE
The year_stats DataFrame represents the total value for each stat per year.
The row labels are the years and the column labels are the stat categories, e.g.
home runs. Using the loc property, we'll retrieve the home run total for 2017.
Set hr_2017 equal to year_stats.loc with 2017 as the row index and 'HR'
as the column index.
# CODE HERE
Next we want to get the yearly totals for each batting statistic per team. To do
this, we group the data by both the year and team.
# CODE HERE
Once again, to obtain the yearly stats we just sum over all the groups.
# CODE HERE
Features
Learn about the different feature types that can be part of a dataset.
Chapter Goals:
Understand the difference between quantitative and categorical features
Learn methods to manipulate features and add them to a DataFrame
Write code to add MLB statistics to a DataFrame
A categorical feature, e.g. gender or birthplace, is one where the values are
categories that could be used to group the dataset. These are the features we
would use with the groupby function from the previous chapter.
B. Quantitative features
Two of the most important functions to use with quantitative features are sum
and mean . In the previous chapter we also introduced sum and mean
functions, which were used to aggregate quantitative features for each
group.
However, while the functions from the previous chapter were applied to the
output of groupby , the ones we use in this chapter are applied to individual
DataFrames.
The code below shows example usages of sum and mean . The df DataFrame
represents three different speed tests (columns) for three different processors
(rows). The data values correspond to the seconds taken for a given speed test
and processor.
df = pd.DataFrame({
'T1': [10, 15, 8],
'T2': [25, 27, 25],
'T3': [16, 15, 10]})
print('{}\n'.format(df))
print('{}\n'.format(df.sum()))
print('{}\n'.format(df.sum(axis=1)))
print('{}\n'.format(df.mean()))
print('{}\n'.format(df.mean(axis=1)))
In the code example, we used a DataFrame representing speed tests for three
different processors (measured in seconds). When we used no argument,
equivalent to using axis=0 , the sum and mean functions calculated total and
average times for each test. When we used axis=1 , the sum and mean
functions calculated total and average test times (across all three tests) for
each processor.
C. Weighted features
To apply weights to a DataFrame's values, we use its multiply function. The
required argument is either a constant or a list of weights, which is multiplied
across the rows or columns (depending on the value of axis ). If a list is used,
then the position of each weight in the list corresponds to which row/column it
is multiplied to.
In contrast with sum and mean , the default axis for multiply is the columns
axis. Therefore, to multiply weights along the rows of a DataFrame, we need
to explicitly set axis=0 .
df = pd.DataFrame({
 'T1': [0.1, 150.],
 'T2': [0.25, 240.],
 'T3': [0.16, 100.]})
print('{}\n'.format(df))
print('{}\n'.format(df.multiply(2)))
df_ms = df.multiply([1000, 1], axis=0)  # convert the first row from seconds to ms
print('{}\n'.format(df_ms))
df_w = df_ms.multiply([1,0.5,1])
print('{}\n'.format(df_w))
print('{}\n'.format(df_w.sum(axis=1)))
In the code above, the test times for processor 'p1' were measured in
seconds, while the times for 'p2' were in milliseconds. So we made all the
times in milliseconds by multiplying the values of 'p1' by 1000 .
Then we multiplied the values in 'T2' by 0.5 , since those tests were done
with two processors rather than one. This makes the final sum a weighted sum
across the three columns.
Time to Code!
The code exercises for this chapter involve calculating various baseball
statistics based on the values of other statistics. The mlb_df DataFrame is
predefined, and contains all historic MLB hitting statistics.
We also provide a col_list_sum function. This is a utility function to calculate
the sum of multiple columns across a DataFrame.
The weights keyword argument represents the weight coefficients we use for
a weighted column sum. Note that if weights is not None , it must have the
same length as col_list .
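One possible implementation of such a utility (the provided version may differ
in details) is:
def col_list_sum(df, col_list, weights=None):
    col_df = df[col_list]
    if weights is not None:
        # weighted column sum; weights must match the length of col_list
        col_df = col_df.multiply(weights)
    return col_df.sum(axis=1)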
The mlb_df doesn't contain one of the key stats in baseball, batting average.
Therefore, we'll calculate the batting average and add it as a column in
mlb_df .
To calculate the batting average, simply divide a player's hits ( 'H' ) by their
number of at-bats ( 'AB' ).
# CODE HERE
Though mlb_df contains columns for doubles, triples, and home runs (labeled
'2B' , '3B' , 'HR' ), it does not contain a column for singles.
# CODE HERE
Now that mlb_df contains columns for all four types of hits, we can calculate
slugging percentage (column label 'SLG' ). The formula for slugging
percentage is:
SLG = (1B + 2 ⋅ 2B + 3 ⋅ 3B + 4 ⋅ HR) / AB
# CODE HERE
We can now calculate the slugging percentage by dividing the weighted sum
by the number of at-bats.
# CODE HERE
Filtering
Chapter Goals:
Understand how to filter a DataFrame based on filter conditions
Write code to filter a dataset of MLB statistics
A. Filter conditions
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
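As with NumPy, applying a relation operation (e.g. == , > ) to a column
produces a boolean filter condition. For example:
cruzne02 = df['playerID'] == 'cruzne02'
print('{}\n'.format(cruzne02))

hr40 = df['HR'] > 40
print('{}\n'.format(hr40))

notbos = df['teamID'] != 'BOS'
print('{}\n'.format(notbos))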
The code below shows various examples of string filter conditions. In the final
example using str.contains , we prepend the ~ operation, which negates the
filter condition. This means our final filter condition checked for player IDs
that do not contain 'o' .
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
str_f1 = df['playerID'].str.startswith('c')
print('{}\n'.format(str_f1))
str_f2 = df['teamID'].str.endswith('S')
print('{}\n'.format(str_f2))
str_f3 = ~df['playerID'].str.contains('o')
print('{}\n'.format(str_f3))
We can also create filter conditions that check for values in a specific set, by
using the isin function. The function only takes in one argument, which is a
list of values that we want to filter for.
The code below demonstrates how to use the isin function for filter
conditions.
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
isin_f1 = df['playerID'].isin(['cruzne02',
'ortizda01'])
print('{}\n'.format(isin_f1))
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'doejo01'],
'yearID': [2016, 2016, 2017],
'teamID': ['BOS', 'SEA', np.nan],
'HR': [31, 39, 99]})
print('{}\n'.format(df))
isna = df['teamID'].isna()
print('{}\n'.format(isna))
notna = df['teamID'].notna()
print('{}\n'.format(notna))
The isna function returns True in the locations that contain NaN and False
in the locations that don't, while the notna function does the opposite.
C. Feature filtering
The code below shows how to filter using square brackets and filter
conditions.
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'bettsmo01'],
'yearID': [2016, 2016, 2016, 2016, 2015],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'BOS'],
'HR': [31, 39, 43, 38, 18]})
print('{}\n'.format(df))
str_df = df[df['teamID'].str.startswith('B')]
print('{}\n'.format(str_df))
Time to Code!
In this chapter's code exercises, we'll apply various filters to a predefined
DataFrame, mlb_df , which contains MLB statistics.
We'll first filter mlb_df for the top MLB hitting seasons in history, which we
define as having a batting average above .300.
Set top_hitters equal to mlb_df[] applied with mlb_df['BA'] > .300 as the
filter condition.
# CODE HERE
Next we filter for the players whose player ID does not start with the letter a.
# CODE HERE
We'll now retrieve the statistics for two specific players. Their player IDs are
'bondsba01' and 'troutmi01' .
Set two_ids equal to a list containing the two specified player IDs.
# CODE HERE
Sorting
Chapter Goals:
Learn how to sort a DataFrame by its features
Write code to sort an MLB player's statistics
A. Sorting by feature
When we deal with a dataset that has many features, it is often useful to sort
the dataset. This makes it easier to view the data and spot trends in the values.
The code below demonstrates how to use sort_values with a single column
label. The first example sorts by 'yearID' in ascending order, while the
second sorts 'playerID' in descending lexicographic (alphabetical) order.
# df is predefined
print('{}\n'.format(df))
sort1 = df.sort_values('yearID')
print('{}\n'.format(sort1))
sort2 = df.sort_values('playerID', ascending=False)
print('{}\n'.format(sort2))
When sorting with a list of column labels, each additional label is used to
break ties. Specifically, label i in the list acts as a tiebreaker for label i - 1.
The code below demonstrates how to sort with a list of column labels.
# df is predefined
print('{}\n'.format(df))
sort1 = df.sort_values(['yearID', 'playerID'])
print('{}\n'.format(sort1))
When using two column labels to sort, the list's first label represents the main
sorting criterion, while the second label is used to break ties. In the example
with sorting by 'yearID' and 'playerID' , the DataFrame is first sorted by
year (in ascending order). For identical years, we sort again by player ID (in
ascending order).
Time to Code!
The code exercises in this chapter involve sorting a DataFrame of yearly MLB
player stats, yearly_stats_df .
We'll sort yearly_stats_df using two different methods. The first method sorts
by 'yearID' in ascending order.
# CODE HERE
The next sorting method will sort by 'HR' in descending order.
# CODE HERE
# CODE HERE
Metrics
Chapter Goals:
Understand the common metrics used to summarize numeric data
Learn how to describe categorical data using histograms
A. Numeric metrics
The most common summary metrics for numeric data are the count of values, the mean, the standard deviation, the minimum and maximum, and the 25th/50th/75th percentiles. In pandas, the describe function computes all of these at once.
# df is predefined
print('{}\n'.format(df))
metrics1 = df.describe()
print('{}\n'.format(metrics1))
hr_rbi = df[['HR','RBI']]
metrics2 = hr_rbi.describe()
print('{}\n'.format(metrics2))
Using describe with a DataFrame will return a summary of metrics for each
of the DataFrame's numeric features. In our example, df had three features
with numerical values: yearID , HR , and RBI .
metrics1 = hr_rbi.describe(percentiles=[.5])
print('{}\n'.format(metrics1))
metrics2 = hr_rbi.describe(percentiles=[.1])
print('{}\n'.format(metrics2))
metrics3 = hr_rbi.describe(percentiles=[.2,.8])
print('{}\n'.format(metrics3))
Note that the 50th percentile, i.e. the median, is always returned. The values
specified in the percentiles list will replace the default 25th and 75th
percentiles.
B. Categorical features
The frequency count for a specific category of a feature refers to how many
times that category appears in the dataset. In pandas, we use the value_counts
function to obtain the frequency counts for each category in a column feature.
The code below uses the value_counts function to get frequency counts of the
'playerID' feature.
p_ids = df['playerID']
print('{}\n'.format(p_ids.value_counts()))
print('{}\n'.format(p_ids.value_counts(normalize=True)))
print('{}\n'.format(p_ids.value_counts(ascending=True)))
If we just want the names of each unique category in a column, rather than
the frequencies, we use the unique function.
unique_players = df['playerID'].unique()
print('{}\n'.format(repr(unique_players)))
unique_teams = df['teamID'].unique()
print('{}\n'.format(repr(unique_teams)))
y_ids = df['yearID']
print('{}\n'.format(y_ids))
print('{}\n'.format(repr(y_ids.unique())))
print('{}\n'.format(y_ids.value_counts()))
Time to Code!
The coding exercises for this chapter involve getting metrics from a
DataFrame of MLB players, player_df .
# CODE HERE
Next, we want to get summaries specifically for the home run totals. The first
summary will contain the default metrics from describe , while the second
summary will contain the 10th and 90th percentiles.
Finally, we'll treat the 'HR' feature as a categorical variable, with each unique
home run total as a separate category. We then get the frequency counts for
each category.
# CODE HERE
Plotting
Learn how to plot DataFrames using the pyplot API from Matplotlib.
Chapter Goals:
Learn how to plot DataFrames using the pyplot API
A. Basics
The main function used for plotting DataFrames is plot . This function is used
in tandem with the show function from the pyplot API, to produce plot
visualizations. We import the pyplot API with the line:
import matplotlib.pyplot as plt
# predefined df
print('{}\n'.format(df))
df.plot()
plt.show()
# predefined df
print('{}\n'.format(df))
df.plot()
plt.savefig('df.png') # save to PNG file
The plot we created has no title or y-axis label. We can manually set the plot's
title and axis labels using the pyplot API.
# predefined df
print('{}\n'.format(df))
df.plot()
plt.title('HR vs. Year')
plt.xlabel('Year')
plt.ylabel('HR Count')
plt.show()
We use the title function to set the title of our plot, and the xlabel and
ylabel functions to set the axis labels.
B. Other plots
In addition to basic line plots, we can create other plots like histograms or
boxplots by setting the kind keyword argument in plot .
# predefined df
print('{}\n'.format(df))
df.plot(kind='hist')
df.plot(kind='box')
plt.show()
The above code results in these plots:
There are numerous different kinds of plots we can create by setting the kind
keyword argument. A list of the accepted values for kind can be found in the
documentation for plot .
C. Multiple features
We can also plot multiple features on the same graph. This can be extremely
useful when we want visualizations to compare different features.
# predefined df
print('{}\n'.format(df))
df.plot()
df.plot(kind='box')
plt.show()
The above code results in these plots:
These are a line plot and boxplot showing both hits ( H ) and walks ( BB ). Note
that the circles in the boxplot represent outlier values.
To NumPy
Chapter Goals:
Learn how to convert a DataFrame to a NumPy matrix
Write code to modify an MLB dataset and convert it to a NumPy matrix
A. Machine learning
The DataFrame object is great for storing a dataset and performing data
analysis in Python. However, most machine learning frameworks (e.g.
TensorFlow) work directly with NumPy data. Furthermore, the NumPy data
used as input to machine learning models must solely contain quantitative
values.
B. Indicator features
The easiest way to make categorical data usable in this format is to convert each categorical feature into a set of
indicator features for each of its categories. The indicator feature for a specific
category represents whether or not a given data sample belongs to that
category.
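The example code is not included in this excerpt; a minimal sketch of what it might look like (the 'Color' values below are illustrative):

import pandas as pd

df = pd.DataFrame({'Color': ['red', 'blue', 'green', 'red']})
# one 0/1 indicator column per category of 'Color'
indicator_df = pd.get_dummies(df['Color'], dtype=int)
print('{}\n'.format(indicator_df))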
In the code above, the DataFrame df has a single categorical feature called
Color . The corresponding indicator features for Color are shown in
indicator_df .
Note that an indicator feature contains 1 when the row has that particular
category, and 0 if the row does not.
C. Converting to indicators
# predefined df
print('{}\n'.format(df))
converted = pd.get_dummies(df)
print('{}\n'.format(converted.columns))
print('{}\n'.format(converted[['teamID_BOS',
'teamID_PIT']]))
print('{}\n'.format(converted[['lgID_AL',
'lgID_NL']]))
Note that the indicator features have the original categorical feature's label as
a prefix. This makes it easy to see where each indicator feature originally
came from.
D. Converting to NumPy
# predefined indicator df
print('{}\n'.format(df))
n_matrix = df.values
print(repr(n_matrix))
The rows and columns of the output matrix correspond to the rows and
columns of the same position in the DataFrame. In the code above, the first
column of the NumPy matrix represents HR , the second column represents
teamID_BOS , and the third column represents teamID_PIT .
Time to Code!
The code exercise for this chapter will be to convert a DataFrame of MLB
statistics ( df ) into a NumPy matrix.
Filter df for rows where 'yearID' is at least 2000 , then reset df equal to
the filtered output.
# CODE HERE
We also don't want any of the NaN values in our data. We can filter those out
using the special dropna function.
# CODE HERE
Finally, we want to convert each categorical feature into a set of indicator
features for each of its categories.
# CODE HERE
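A possible end-to-end sketch for this exercise (assuming df is the predefined DataFrame and pandas is imported as pd):

df = df[df['yearID'] >= 2000]   # keep only seasons from 2000 onwards
df = df.dropna()                # drop rows that contain NaN values
df = pd.get_dummies(df)         # convert categorical features to indicator features
matrix = df.values              # final NumPy matrix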
Quiz
1
Which of the following are methods for indexing into a DataFrame?
Introduction
The main task for machine learning engineers is to first analyze the data for
viable trends, then create an efficient input pipeline for training a model. This
process involves using libraries like NumPy and pandas for handling data,
along with machine learning frameworks like TensorFlow for creating the
model and input pipeline. For more information on ML engineering and the
NumPy and pandas libraries, check out the previous two sections in this
course.
While the NumPy and pandas libraries are also used in data science, the Data
Preprocessing section will cover one of the core libraries that is specific to
industry-level data science: scikit-learn. Data scientists tend to work on
smaller datasets than machine learning engineers, and their main goal is to
analyze the data and quickly extract usable results. Therefore, they focus
more on traditional data inference models (found in scikit-learn), rather than
deep neural networks.
The scikit-learn library includes tools for data preprocessing and data mining.
It is imported in Python via the statement import sklearn .
Standardizing Data
Chapter Goals:
Learn about data standardization
Data can contain all sorts of different values. For example, Olympic 100m
sprint times will range from 9.5 to 10.5 seconds, while calorie counts in large
pepperoni pizzas can range from 1500 to 3000 calories. Even data measuring
the exact same quantities can range in value (e.g. weight in kilograms vs.
weight in pounds).
When data can take on any range of values, it can be difficult to interpret.
Therefore, data scientists will convert the data into a standard format to make
it easier to understand. The standard format refers to data that has 0 mean
and unit variance (i.e. standard deviation = 1), and the process of converting
data into this format is called data standardization.
The formula for standardizing a data value x is:
z = (x − μ) / σ
where μ is the data's mean and σ is its standard deviation.
B. NumPy and scikit-learn
For most scikit-learn functions, the input data comes in the form of a NumPy
array.
Note: The array’s rows represent individual data observations, while each
column represents a particular feature of the data, i.e. the same format as a
spreadsheet data table.
For example, the second data observation in pizza_data has a net weight of
1.6 standard deviations above the mean pizza weight in the dataset.
If for some reason we need to standardize the data across rows, rather than
columns, we can set the axis keyword argument in the scale function to 1.
This may be the case when analyzing data within observations, rather than
within a feature. An example of this would be analyzing a particular student's
test scores in terms of standard deviations from that student's average test
score.
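As a quick illustration, here is a minimal sketch of column-wise and row-wise standardization with scikit-learn's scale function (the data values are made up):

import numpy as np
from sklearn.preprocessing import scale

data = np.array([
  [2100., 10., 800.],
  [2500., 11., 850.],
  [1800., 10., 760.],
  [2000., 12., 790.]])

# standardize each column (feature) to 0 mean and unit variance
col_standardized = scale(data)

# standardize each row (data observation) instead
row_standardized = scale(data, axis=1)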
Time to Code!
The coding exercise in this chapter is to complete a generic data
standardization function, standardize_data .
This function will standardize the input NumPy array, data , by using the
scale function (imported in the backend).
Set scaled_data equal to scale applied with data as the only argument.
Then return scaled_data .
def standardize_data(data):
# CODE HERE
pass
Data Range
Chapter Goals:
Learn how to compress data values to a specified range
A. Range scaling
Apart from standardizing data, we can also scale data by compressing it into a
fixed range. One of the biggest use cases for this is compressing data into the
range [0, 1]. This allows us to view the data in terms of proportions, or
percentages, based on the minimum and maximum values in the data.
The formula for scaling based on a range is a two-step process. For a given data value, x, we first compute the proportion of the value with respect to the minimum and maximum of the data (denoted by d_min and d_max, respectively).
x_prop = (x − d_min) / (d_max − d_min)
The formula above computes the proportion of the data value, x_prop. Note that this only works if not all the data values are the same (i.e. d_max ≠ d_min).
We then use the proportion of the value to scale to the specified range, [r_min, r_max]. The new scaled value, x_scale, is calculated as:
x_scale = x_prop · (r_max − r_min) + r_min
The code below shows how to use the MinMaxScaler (with the default range
and a custom range).
# predefined data
print('{}\n'.format(repr(data)))
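The example code itself is not included above; a minimal sketch of using MinMaxScaler with the default [0, 1] range and with a custom range (the custom range is illustrative):

from sklearn.preprocessing import MinMaxScaler

default_scaler = MinMaxScaler()   # default feature_range is (0, 1)
transformed = default_scaler.fit_transform(data)

custom_scaler = MinMaxScaler(feature_range=(-2, 3))
transformed = custom_scaler.fit_transform(data)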
Now let's run the fit and transform functions separately and compare them with the fit_transform function. fit takes in an input data array, and transform transforms a (possibly different) array based on the statistics computed from the data passed into fit.
# predefined new_data
print('{}\n'.format(repr(new_data)))
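A sketch of this two-step usage (assuming data and new_data are the predefined arrays):

scaler = MinMaxScaler()
scaler.fit(data)                          # learn the min/max from data
transformed = scaler.transform(new_data)  # scale new_data using data's min/max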
Time to Code!
The coding exercise in this chapter uses MinMaxScaler (imported in backend)
to complete the ranged_data function.
The function will compress the input NumPy array, data , into the range given
by value_range .
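One possible solution sketch (assuming the function receives data and value_range as arguments and MinMaxScaler is already imported in the backend):

def ranged_data(data, value_range):
  min_max_scaler = MinMaxScaler(feature_range=value_range)
  scaled_data = min_max_scaler.fit_transform(data)
  return scaled_data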
Robust Scaling
Understand how outliers can affect data and implement robust scaling.
Chapter Goals:
Learn how to scale data without being affected by outliers
A. Data outliers
A 2-D data plot with the outlier data points circled. Note that the outliers in this plot are
exaggerated, and in real life outliers are not usually this far from the non-outlier data.
The data scaling methods from the previous two chapters are both affected by
outliers. Data standardization uses each feature's mean and standard
deviation, while ranged scaling uses the maximum and minimum feature
values, meaning that they're both susceptible to being skewed by outlier
values.
We can robustly scale the data, i.e. avoid being affected by outliers, by using
the data's median and Interquartile Range (IQR). Since the median and
IQR are percentile measurements of the data (50% for median, 25% to 75% for
the IQR), they are not affected by outliers. For the scaling method, we just
subtract the median from each data value then scale to the IQR.
# predefined data
print('{}\n'.format(repr(data)))
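The scaling code itself is not shown above; a minimal sketch of applying RobustScaler:

from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()   # scales each column using its median and IQR
transformed = robust_scaler.fit_transform(data)
print('{}\n'.format(repr(transformed)))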
Time to Code!
The coding exercise in this chapter uses RobustScaler (imported in backend)
to complete the robust_scaling function.
The function will apply outlier-independent scaling to the input NumPy array,
data .
def robust_scaling(data):
# CODE HERE
pass
Normalizing Data
Chapter Goals:
Learn how to apply L2 normalization to data
L2 normalization
So far, each of the scaling techniques we've used has been applied to the data
features (i.e. columns). However, in certain cases we want to scale the
individual data observations (i.e. rows). For instance, when clustering data we
need to apply L2 normalization to each row, in order to calculate cosine
similarity scores. The Clustering section will cover data clustering and cosine
similarities in greater depth.
X = [x_1, x_2, ..., x_m]
X_L2 = [x_1/ℓ, x_2/ℓ, ..., x_m/ℓ], where ℓ = sqrt(Σ_{i=1}^{m} x_i²)
# predefined data
print('{}\n'.format(repr(data)))
normalizer = Normalizer()
transformed = normalizer.fit_transform(data)
print('{}\n'.format(repr(transformed)))
Time to Code!
The coding exercise in this chapter uses Normalizer (imported in backend) to
complete the normalize_data function.
The function will apply L2 normalization to the input NumPy array, data .
def normalize_data(data):
# CODE HERE
pass
Data Imputation
Learn about data imputation and the various methods to accomplish it.
Chapter Goals:
Learn different methods for imputing data
In real life, we often have to deal with data that contains missing values.
Sometimes, if the dataset is missing too many values, we just don't use it.
However, if only a few of the values are missing, we can perform data
imputation to substitute the missing data with some other value(s).
There are many different methods for data imputation. In scikit-learn, the
SimpleImputer transformer performs four different data imputation methods.
The code below shows how to perform data imputation using mean values
from each column.
# predefined data
print('{}\n'.format(repr(data)))
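The imputation code itself is not shown above; a minimal sketch (assuming the data array contains NaN values):

from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer()   # the default strategy is 'mean'
transformed = imp_mean.fit_transform(data)
print('{}\n'.format(repr(transformed)))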
Each missing (NaN) value is replaced with the mean of the non-missing values in its column.
The default imputation method for SimpleImputer is using the column means.
By using the strategy keyword argument when initializing a SimpleImputer
object, we can specify a different imputation method.
# predefined data
print('{}\n'.format(repr(data)))
imp_frequent = SimpleImputer(strategy='most_frequent')
transformed = imp_frequent.fit_transform(data)
print('{}\n'.format(repr(transformed)))
The 'median' strategy fills in missing data with the median from each column,
while the 'most_frequent' strategy uses the value that appears the most for
each column.
The code below demonstrates how to fill in missing data with a specific
constant. The fill_value keyword argument is used when initializing the
SimpleImputer object, to specify the constant.
# predefined data
print('{}\n'.format(repr(data)))
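A minimal sketch of constant-value imputation, continuing with SimpleImputer from above (the fill value is illustrative):

imp_constant = SimpleImputer(strategy='constant', fill_value=-1)
transformed = imp_constant.fit_transform(data)
print('{}\n'.format(repr(transformed)))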
In most industry cases more advanced imputation methods (e.g. k-nearest neighbors or iterative, regression-based imputation) are not required, since the
data is either perfectly cleaned or the missing values are scarce. Nevertheless,
the advanced methods could be useful when dealing with open source
datasets, since these tend to be more incomplete.
PCA
Learn about PCA and why it's useful for data preprocessing.
Chapter Goals:
Learn about principal component analysis and why it's used
A. Dimensionality reduction
Real-world datasets often contain many features, some of which are redundant or highly correlated. Principal component analysis (PCA) performs dimensionality reduction: it extracts a smaller set of uncorrelated principal components that capture most of the variance in the original data.
B. PCA in scikit-learn
Like every other data transformation, we can apply PCA to a dataset in scikit-
learn with a transformer, in this case the PCA module. When initializing the
PCA module, we can use the n_components keyword to specify the number of
principal components. If n_components is not set, the default is to keep as many components as possible, i.e. min(n, m), where n is the number of data observations and m is the number of features in the dataset.
The code below shows examples of applying PCA with various numbers of
principal components.
# predefined data
print('{}\n'.format(repr(data)))
pca_obj = PCA(n_components=4)
pc = pca_obj.fit_transform(data).round(3)
print('{}\n'.format(repr(pc)))
pca_obj = PCA(n_components=3)
pc = pca_obj.fit_transform(data).round(3)
print('{}\n'.format(repr(pc)))
pca_obj = PCA(n_components=2)
pc = pca_obj.fit_transform(data).round(3)
print('{}\n'.format(repr(pc)))
In the code output above, notice that when PCA is applied with 4 principal
components, the final column (last principal component) is all 0's. This means
that there are actually only a maximum of three uncorrelated principal
components that can be extracted.
Time to Code!
The coding exercise in this chapter uses PCA (imported in backend) to
complete the pca_data function.
The function will apply principal component analysis (PCA) to the input
NumPy array, data .
Set pca_obj equal to PCA initialized with n_components for the n_components
keyword argument.
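One possible solution sketch (assuming the function receives data and n_components as arguments):

def pca_data(data, n_components):
  pca_obj = PCA(n_components=n_components)
  component_data = pca_obj.fit_transform(data)
  return component_data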
Chapter Goals:
Learn about labeled datasets
Separate principal component data by class label
A. Class labels
The code below separates a breast cancer dataset into malignant and benign
categories. The load_breast_cancer function is part of the scikit-learn library,
and its data comes from the Breast Cancer Wisconsin dataset.
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()

# Class labels
print('{}\n'.format(repr(bc.target)))
print('Labels shape: {}\n'.format(bc.target.shape))
# Label names
print('{}\n'.format(list(bc.target_names)))
malignant = bc.data[bc.target == 0]
print('Malignant shape: {}\n'.format(malignant.shape))
benign = bc.data[bc.target == 1]
print('Benign shape: {}\n'.format(benign.shape))
In the example above, the bc.data array contains all the dataset values, while
the bc.target array contains the class ID labels for each row in bc.data. A label of 0 corresponds to a malignant observation, while a label of 1 corresponds to a benign observation.
Using the bc.target class IDs, we separated the dataset into malignant and
benign data arrays. In other words, the malignant array contains the rows of
bc.data corresponding to the indexes in bc.target containing 0, while the
benign array contains the rows of bc.data corresponding to the indexes in
bc.target containing 1. There are 212 malignant data observations and 357
benign observations.
Time to Code!
The coding exercise in this chapter involves completing the
separate_components function, which will separate principal component data
by class.
The labels input is a 1-D array containing the class label IDs corresponding to
each row of component_data . We can use it to separate the principal
components by class.
The label_names input represents all the string names for the class labels.
Inside the for loop, we can use our helper function to obtain the separated
data for each class.
1
What is the main purpose of standardizing data?
Introduction
In the Data Modeling section, you will be creating a variety of models for
linear regression and classifying data. You will also learn how to perform
hyperparameter tuning and model evaluation through cross-validation.
The main job of a data scientist is analyzing data and creating models for
obtaining results from the data. Oftentimes, data scientists will use simple
statistical models for their data, rather than machine learning models like
neural networks. This is because data scientists tend to work with smaller
datasets than machine learning engineers, so they can quickly extract good
results using statistical models.
The scikit-learn library provides many statistical models for linear regression.
It also provides a few good models for classifying data, which will be
introduced in later chapters.
When creating these models, data scientists need to figure out the optimal
hyperparameters to use. Hyperparameters are values that we set when
creating a model, e.g. certain constant coefficients used in the model's
calculations. We'll talk more about hyperparameter tuning, the process of
finding the optimal hyperparameter settings, in later chapters.
Linear Regression
Chapter Goals:
Create a basic linear regression model based on input data and labels
One of the main objectives in both machine learning and data science is
finding an equation or distribution that best fits a given dataset. This is known
as data modeling, where we create a model that uses the dataset's features as
independent variables to predict output values for some dependent variable
(with minimal error). However, it is incredibly difficult to find an optimal
model for most datasets, given the amount of noise (i.e. random
errors/fluctuations) in real world data.
Since finding an optimal model for a dataset is difficult, we instead try to find
a good approximating distribution. In many cases, a linear model (a linear
combination of the dataset's features) can approximate the data well. The
term linear regression refers to using a linear model to represent the
relationship between a set of independent variables and a dependent
variable.
The simplest form of linear regression is called least squares regression. This
strategy produces a regression model, which is a linear combination of the
independent variables, that minimizes the sum of squared residuals between
the model's predictions and actual values for the dependent variable.
In scikit-learn, the least squares regression model is implemented with the
LinearRegression object, which is a part of the linear_model module in
sklearn . The object contains a fit function, which takes in an input dataset
of features (independent variables) and an array of labels (dependent
variables) for each data observation (rows of the dataset).
After calling the fit function, the model is ready to use. The predict function
allows us to make predictions on new data.
We can also get the specific coefficients and intercept for the linear
combination using the coef_ and intercept_ properties, respectively.
Finally, we can retrieve the coefficient of determination (or R2 value) using the
score function applied to the dataset and labels. The R2 value tells us how
close of a fit the linear model is to the data, or in other words, how good of a
fit the model is for the data.
# pizza_data, pizza_prices, and new_pizzas are predefined
reg = linear_model.LinearRegression()
reg.fit(pizza_data, pizza_prices)

price_predicts = reg.predict(new_pizzas)
print('{}\n'.format(repr(price_predicts)))
print('Coefficients: {}\n'.format(repr(reg.coef_)))
print('Intercept: {}\n'.format(reg.intercept_))
# Using previously defined pizza_data, pizza_prices
r2 = reg.score(pizza_data, pizza_prices)
print('R2: {}\n'.format(r2))
Time to Code!
The coding exercise in this chapter uses the LinearRegression object of the
linear_model module (imported in backend) to complete the linear_reg
function.
The function will fit a basic least squares regression model to the input data
and labels.
Call reg.fit with data and labels as the two input arguments. Then
return reg .
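One possible solution sketch for the full function (assuming linear_model is imported in the backend):

def linear_reg(data, labels):
  reg = linear_model.LinearRegression()
  reg.fit(data, labels)
  return reg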
Chapter Goals:
Learn about regularization in linear regression
Learn about hyperparameter tuning using cross-validation
Implement a cross-validated ridge regression model in scikit-learn
While ordinary least squares regression is a good way to fit a linear model
onto a dataset, it relies on the fact that the dataset's features are each
independent, i.e. uncorrelated. When many of the dataset features are
linearly correlated, e.g. if a dataset has multiple features depicting the same
price in different currencies, it makes the least squares regression model
highly sensitive to noise in the data.
Because real life data tends to have noise, and will often have some linearly
correlated features in the dataset, we combat this by performing
regularization. For ordinary least squares regression, the goal is to find the
weights (coefficients) for the linear model that minimize the sum of squared
residuals:
Σ_{i=1}^{n} (x_i · w − y_i)²
For regularization, the goal is to not only minimize the sum of squared
residuals, but to do this with coefficients as small as possible. The smaller the
coefficients, the less susceptible they are to random noise in the data. The
most commonly used form of regularization is ridge regularization.
With ridge regularization, the goal is now to find the weights that minimize
the following quantity:
α||w||_2² + Σ_{i=1}^{n} (x_i · w − y_i)²
Here, α||w||_2² is the regularization penalty term, where α ≥ 0 is a hyperparameter that controls how strongly large weights are penalized.
The plot above shows an example of ordinary least squares regression models
vs. ridge regression models. The two red crosses mark the points (0.5, 0.5) and
(1, 1), and the blue lines are the regression lines for those two points. Each of
the grey lines are the regression lines for the original points with added noise
(which is signified by the grey points).
We can specify the value of the α hyperparameter when initializing the Ridge
object (the default is 1.0). However, rather than manually choosing a value,
we can use cross-validation to help us choose the optimal α from a list of
values.
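The example code is not included in this excerpt; a minimal sketch of a manually configured Ridge model and a cross-validated RidgeCV model (the α values are illustrative):

from sklearn import linear_model

reg = linear_model.Ridge(alpha=0.5)   # manually chosen alpha
reg.fit(data, labels)

reg_cv = linear_model.RidgeCV(alphas=[0.1, 0.3, 1.0])   # alpha chosen by cross-validation
reg_cv.fit(data, labels)
print('Chosen alpha: {}\n'.format(reg_cv.alpha_))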
Time to Code!
The coding exercise in this chapter uses the RidgeCV object of the
linear_model module (imported in backend) to complete the cv_ridge_reg
function.
The function will fit a ridge regression model to the input data and labels. The
model is cross-validated to choose the best α value from the input list alphas .
Chapter Goals:
Learn about sparse linear regression via LASSO
A. Sparse regularization
LASSO regularization penalizes the L1 norm of the weights rather than the squared L2 norm, which tends to drive many of the weights to exactly 0 (i.e. sparse weights). The goal is to find the weights that minimize:
α||w||_1 + Σ_{i=1}^{n} (x_i · w − y_i)²
The code below demonstrates how to use the Lasso object on a dataset with
150 observations and 4 features.
# predefined dataset
print('Data shape: {}\n'.format(data.shape))
print('Labels shape: {}\n'.format(labels.shape))
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit(data, labels)
print('Coefficients: {}\n'.format(repr(reg.coef_)))
print('Intercept: {}\n'.format(reg.intercept_))
print('R2: {}\n'.format(reg.score(data, labels)))
In the example above, note that a majority of the weights are 0, due to the
LASSO sparse weight preference.
Time to Code!
The coding exercise in this chapter uses the Lasso object of the linear_model
module (imported in backend) to complete the lasso_reg function.
The function will fit a LASSO regression model to the input data and labels.
The α hyperparameter for the model is provided to the function via the alpha
input argument.
Set reg equal to linear_model.Lasso initialized with alpha for the alpha
keyword argument.
Call reg.fit with data and labels as the two input arguments. Then
return reg .
Chapter Goals:
Learn about Bayesian regression techniques
A. Bayesian techniques
In Bayesian statistics, the main idea is to make certain assumptions about the
probability distributions of a model's parameters before being fitted on data.
These initial distribution assumptions are called priors for the model's
parameters.
B. Hyperparameter priors
There's no need to know the specifics of a gamma distribution, other than the
fact that it's a probability distribution defined by a shape parameter and scale
parameter.
Specifically, the α hyperparameter has the prior Γ(α_1, α_2), while the λ hyperparameter has the prior Γ(λ_1, λ_2).
This can all be done with the BayesianRidge object (part of the linear_model
module). Like all the previous regression objects, this one can be initialized
with no required arguments.
We can manually specify the α1 and α2 gamma parameters for α with the
alpha_1 and alpha_2 keyword arguments when initializing BayesianRidge .
Similarly, we can manually set λ1 and λ2 with the lambda_1 and lambda_2
keyword arguments. The default value for each of the four gamma
parameters is 10^-6.
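A minimal sketch of initializing and fitting a BayesianRidge model (the gamma parameter values are illustrative):

reg = linear_model.BayesianRidge(
  alpha_1=0.1, alpha_2=0.2,
  lambda_1=0.3, lambda_2=0.4)
reg.fit(data, labels)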
Time to Code!
The coding exercise in this chapter uses the BayesianRidge object of the
linear_model module (imported in backend) to complete the bayes_ridge
function.
The function will fit a Bayesian ridge regression model to the input data and
labels.
Call reg.fit with data and labels as the two input arguments. Then
return reg .
Chapter Goals:
Learn about logistic regression for linearly separable datasets
A. Classification
Thus far we've learned about several linear regression models and
implemented them with scikit-learn. The logistic regression model, despite its
name, is actually a linear model for classification. It is called logistic
regression because it performs regression on logits, which then allows us to
classify the data based on model probability predictions.
For a more detailed explanation of logistic regression, check out the Intro to
Deep Learning section of this course, which implements logistic regression
via a single layer perceptron model in TensorFlow.
# predefined dataset
print('Data shape: {}\n'.format(data.shape))
# Binary labels
print('Labels:\n{}\n'.format(repr(labels)))
reg = linear_model.LogisticRegression()
reg.fit(data, labels)

new_data = np.array([
[ 0.3, 0.5, -1.2, 1.4],
[ -1.3, 1.8, -0.6, -8.2]])
print('Prediction classes: {}\n'.format(
repr(reg.predict(new_data))))
The code above created a logistic regression model from a labeled dataset. The
model predicts 1 and 0, respectively, as the labels for the observations in
new_data .
For multiclass classification, i.e. when there are more than two labels, we
initialize the LogisticRegression object with the multi_class keyword
argument. The default value is 'ovr' , which signifies a One-Vs-Rest strategy.
In multiclass classification, we want to use the 'multinomial' strategy.
The code below demonstrates multiclass classification. Note that to use the
'multinomial' strategy, we need to choose a proper solver (see below for
details on solvers). In this case, we choose 'lbfgs' .
# predefined dataset
print('Data shape: {}\n'.format(data.shape))
# Multiclass labels
print('Labels:\n{}\n'.format(repr(labels)))
reg = linear_model.LogisticRegression(
  solver='lbfgs', multi_class='multinomial')
reg.fit(data, labels)

new_data = np.array([
[ 1.8, -0.5, 6.2, 1.4],
[ 3.3, 0.8, 0.1, 2.5]])
print('Prediction classes: {}\n'.format(
repr(reg.predict(new_data))))
B. Solvers
We can choose a particular solver using the solver keyword argument. The
default solver is currently 'liblinear' (although it will change to 'lbfgs' in a
future version). For the 'newton-cg' , 'sag' , and 'lbfgs' solvers, we can also
set the maximum number of iterations the solver takes until the model's
weights converge using the max_iter keyword argument. Since the default
max_iter value is 100, we may want to let the solver run for a higher number
of iterations in certain applications.
The code below demonstrates usage of the solver and max_iter keyword
arguments.
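The example code is not included above; a minimal sketch of setting these keyword arguments (the max_iter value is illustrative):

reg = linear_model.LogisticRegression(
  solver='lbfgs', max_iter=1000)
reg.fit(data, labels)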
C. Cross-validated model
Like the ridge and LASSO regression models, the logistic regression model
comes with a cross-validated version in scikit-learn. The cross-validated
logistic regression object, LogisticRegressionCV , is initialized and used in the
same way as the regular LogisticRegression object.
Time to Code!
The coding exercise in this chapter uses the LogisticRegression object of the
linear_model module (imported in backend) for multiclass classification.
Call reg.fit with data and labels as the two input arguments. Then
return reg .
Chapter Goals:
Learn about decision trees and how they are constructed
Learn how decision trees are used for classification and regression
A. Making decisions
The leaves of the decision tree determine the class label to predict (in
classification) or the real number value to predict (in regression).
A decision tree for deciding what to eat. This is an example of multiclass classification.
The code below demonstrates how to create decision trees for classification
and regression. Each decision tree uses the fit function for fitting on data
and labels.
# predefined dataset
print('Data shape: {}\n'.format(data.shape))
# Binary labels
print('Labels:\n{}\n'.format(repr(labels)))
clf_tree1 = tree.DecisionTreeClassifier()   # tree is sklearn's tree module
clf_tree1.fit(data, labels)
The max_depth keyword argument lets us manually set the maximum number
of layers allowed in the decision tree (i.e. the tree's maximum depth). The
default value is None , meaning that the decision tree will continue to be
constructed until no nodes can have anymore children. Since large decision
trees are prone to overfit data, it can be beneficial to manually set a maximum
depth for the tree.
B. Choosing features
Since a decision tree makes decisions based on feature values, the question
now becomes how we choose the features to decide on at each node. In
general terms, we want to choose the feature value that "best" splits the
remaining dataset at each node.
How we define "best" depends on the decision tree algorithm that's used.
Since scikit-learn uses the CART algorithm, we use Gini Impurity, MSE (mean
squared error), and MAE (mean absolute error) to decide on the best feature
at each node.
Specifically, for classification trees we choose the feature at each node that
minimizes the remaining dataset observations' Gini Impurity. For regression
trees we choose the feature at each node that minimizes the remaining
dataset observations' MSE or MAE, depending on which you choose to use (the
default for DecisionTreeRegressor is MSE).
Time to Code!
The coding exercises in this chapter use the DecisionTreeRegressor object of
the tree module for regression modeling.
You will create a decision tree with max depth equal to 5, then fit the tree on
(predefined) data and labels .
# CODE HERE
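One possible solution sketch (the variable name d_tree is an assumption; the tree module is imported in the backend):

d_tree = tree.DecisionTreeRegressor(max_depth=5)
d_tree.fit(data, labels)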
Training and Testing
Chapter Goals:
Learn about splitting a dataset into training and testing sets
We've discussed in depth how to fit a model on data and labels. However,
once we fit the model, how do we evaluate it? It is a bad idea to evaluate a
model solely on the same dataset it was fitted on, because the model's
parameters are already tuned for that dataset. Instead, we need to split the
original dataset into two datasets: one for training and one for testing.
The training set is used for fitting the model on data (i.e. training the model),
while the testing set is used for evaluating the model. Therefore, the training
set is much larger than the testing set. Exactly how much larger depends on
the application and requirements.
Increasing the size of the training set will give more data for the model to be
fitted on, which can increase the model's performance. However, because this
decreases the size of the testing set, there's a higher chance that the testing set
may not be representative of the original dataset (which can lead to
inaccurate evaluation).
In general, the testing set is around 10-30% of the original dataset, while the
training set makes up the remaining 70-90%.
The code below demonstrates how to split a dataset into training and testing
sets.
data = np.array([
[10.2 , 0.5 ],
[ 8.7 , 0.9 ],
[ 9.3 , 0.8 ],
[10.1 , 0.4 ],
[ 9.5 , 0.77],
[ 9.1 , 0.68],
[ 7.7 , 0.9 ],
[ 8.3 , 0.8 ]])
labels = np.array(
[1.4, 1.2, 1.6, 1.5, 1.6, 1.3, 1.1, 1.2])
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(data, labels)

print('{}\n'.format(repr(train_data)))
print('{}\n'.format(repr(train_labels)))
print('{}\n'.format(repr(test_data)))
print('{}\n'.format(repr(test_labels)))
Note that the train_test_split function randomly shuffles the dataset and
corresponding labels prior to splitting. This is good practice to remove any
systematic orderings in the dataset, which could potentially impact the model
into training on the orderings rather than the actual data.
The default size of the testing set is 25% of the original dataset. We can use the
test_size keyword argument to manually specify the proportion of the
original dataset that will go into the testing set.
data = np.array([
[10.2 , 0.5 ],
[ 8.7 , 0.9 ],
[ 9.3 , 0.8 ],
[10.1 , 0.4 ],
[ 9.5 , 0.77],
[ 9.1 , 0.68],
[ 7.7 , 0.9 ],
[ 8.3 , 0.8 ]])
labels = np.array(
[1.4, 1.2, 1.6, 1.5, 1.6, 1.3, 1.1, 1.2])
# the 0.375 here is illustrative: it sends 3 of the 8 observations to the testing set
train_data, test_data, train_labels, test_labels = train_test_split(
  data, labels, test_size=0.375)

print('{}\n'.format(repr(train_data)))
print('{}\n'.format(repr(train_labels)))
print('{}\n'.format(repr(test_data)))
print('{}\n'.format(repr(test_labels)))
In later chapters, we'll discuss how we use the testing set to evaluate a trained
(fitted) model.
Time to Code!
The coding exercise for this chapter will be to finish a utility function called
dataset_splitter , which will be used in future chapters.
The function will split the input dataset into training and testing sets, and then
group the data and labels based on type of set.
Set train_set equal to a tuple containing the first and third elements of
split_dataset . Also set test_set equal to a tuple containing the second
and fourth elements of split_dataset .
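One possible solution sketch; the function's exact signature is an assumption, and train_test_split is assumed to be imported from sklearn.model_selection:

def dataset_splitter(data, labels, test_size=0.25):
  split_dataset = train_test_split(data, labels, test_size=test_size)
  train_set = (split_dataset[0], split_dataset[2])  # (train_data, train_labels)
  test_set = (split_dataset[1], split_dataset[3])   # (test_data, test_labels)
  return (train_set, test_set)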
Chapter Goals:
Learn about the purpose of cross-validation
Implement a function that applies the K-Fold cross-validation algorithm
to a model
Sometimes, it's not enough to just have a single testing set for model
evaluation. Having additional sets of data for evaluation gives us a more
accurate measurement of how good the model is for the original dataset.
If the original dataset is big enough, we can actually split it into three subsets:
training, testing, and validation. The validation set is about the same size as
the testing set, and it is used for evaluating the model after training. The
testing set is then used for final evaluation once the model is done training
and tuning.
However, partitioning the original dataset into three distinct sets will cut into
the size of the training set. This can reduce the performance of the model if
our original dataset is not large enough. A solution to this problem is cross-
validation (CV).
The K-Fold algorithm partitions the dataset into k approximately equal-sized subsets, called folds. In each of the k rounds, one fold is held out as the validation fold, while the remaining k − 1 folds are combined into that round's training set. The model is trained on that round's training set and then evaluated on the single validation fold. The evaluation metric depends on the model. For classification models, this is usually classification accuracy on the validation set. For regression models, this can be the model's mean squared error, mean absolute error, or R2 value on the validation set.
B. Scored cross-validation
The code below demonstrates K-Fold CV with 3 folds for classification. The
evaluation metric is classification accuracy.
# an illustrative classification model; cross_val_score is from sklearn.model_selection
clf = linear_model.LogisticRegression()
cv_score = cross_val_score(clf, data, labels, cv=3)  # 3 folds
print('{}\n'.format(repr(cv_score)))
The code below demonstrates K-Fold CV with 4 folds for regression. The
evaluation metric is R2 value.
# an illustrative regression model
reg = linear_model.LinearRegression()
cv_score = cross_val_score(reg, data, labels, cv=4)  # 4 folds
print('{}\n'.format(repr(cv_score)))
Note that we don't call fit with the model prior to using cross_val_score .
This is because the cross_val_score function will use fit for training the
model each round.
Chapter Goals:
Apply K-Fold cross-validation to a decision tree
The code below demonstrates how to apply K-Fold CV to tune a decision tree's
maximum depth. It uses the cv_decision_tree function that you will
implement later in this chapter.
In the above code, we use the cv_decision_tree function to apply 5-Fold cross-
validation to a classification decision tree. We tune its maximum depth
hyperparameter across depths of 3, 4, 5, 6, and 7. For each max_depth value,
we print the 95% confidence interval for the cross-validated scores across the
5 folds.
For the most part, the maximum depth of 4 produces the best 95% confidence
interval of cross-validated scores. This would be the value of max_depth that
we choose for the final decision tree.
Time to Code!
The coding exercise for this chapter is to complete the aforementioned
cv_decision_tree function. The function's first argument defines whether the
decision tree is for classification/regression, the next two arguments represent
the data/labels, and the final two arguments represent the tree's maximum
depth and number of folds, respectively.
First, we'll create the decision tree (using the tree module imported in the
backend).
Set scores equal to cross_val_score applied with d_tree , data , and labels
for the first three arguments. Use cv=cv for the keyword argument, then
return scores .
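One possible solution sketch; the argument names and order are assumptions based on the description above (the tree module and cross_val_score are imported in the backend):

def cv_decision_tree(is_clf, data, labels, max_depth, cv):
  if is_clf:
    d_tree = tree.DecisionTreeClassifier(max_depth=max_depth)
  else:
    d_tree = tree.DecisionTreeRegressor(max_depth=max_depth)
  scores = cross_val_score(d_tree, data, labels, cv=cv)
  return scores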
Chapter Goals:
Learn how to evaluate regression and classification models
A. Making predictions
Each of the models we've worked with has a predict function, which is used
to predict values for new data observations (i.e. data observations not in the
training set).
reg = tree.DecisionTreeRegressor()
# predefined train and test sets
reg.fit(train_data, train_labels)
predictions = reg.predict(test_data)
B. Evaluation metrics
For classification models, we use the classification accuracy on the test set as
the evaluation metric. For regression models, we normally use either the R2
value, mean squared error, or mean absolute error on the test set. The most
commonly used regression metric is mean absolute error, since it represents
the natural definition of error. We use mean squared error when we want to
penalize really bad predictions, since the error is squared. We use the R2 value
when we want to evaluate the fit of the regression model on the data.
reg = tree.DecisionTreeRegressor()
# predefined train and test sets
reg.fit(train_data, train_labels)
predictions = reg.predict(test_data)
clf = tree.DecisionTreeClassifier()
# predefined train and test sets
clf.fit(train_data, train_labels)
predictions = clf.predict(test_data)
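The metric calculations themselves are not shown above; a minimal sketch using scikit-learn's metrics module:

from sklearn import metrics

# regression metrics (using the DecisionTreeRegressor's predictions)
r2 = metrics.r2_score(test_labels, predictions)
mse = metrics.mean_squared_error(test_labels, predictions)
mae = metrics.mean_absolute_error(test_labels, predictions)

# classification metric (using the DecisionTreeClassifier's predictions)
acc = metrics.accuracy_score(test_labels, predictions)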
Chapter Goals:
Learn how to use grid search cross-validation for exhaustive
hyperparameter tuning
A. Grid-search cross-validation
reg = linear_model.BayesianRidge()
params = {
'alpha_1':[0.1,0.2,0.3],
'alpha_2':[0.1,0.2,0.3]
}
reg_cv = GridSearchCV(reg, params, cv=5, iid=False)
# predefined train and test sets
reg_cv.fit(train_data, train_labels)
print(reg_cv.best_params_)
In the code example above, we searched through each possible pair of α1 and
α2 values based on the two lists in the params dictionary. The search resulted
in an α1 value of 0.3 and an α2 value of 0.1. For each of the models we've
covered, you can take a look at their respective scikit-learn code
documentation pages to determine the model's hyperparameters that can be
used as the params argument for GridSearchCV .
The cv keyword argument represents the number of folds used in the K-Fold
cross-validation for grid search. The iid keyword argument relates to how
the cross-validation score is calculated. We use False to match the standard
definition of cross-validation. Note that in later updates of scikit-learn, the
iid argument will be removed from GridSearchCV .
1
Why is regularization important in regression modeling?
Introduction
A. Unsupervised learning
So far, we've only used supervised learning methods, since we've exclusively
been dealing with labeled datasets. However, in the real world many datasets
are completely unlabeled, since labeling datasets involves additional work
and foresight. Rather than just ignoring all these unlabeled datasets, we can
still extract meaningful insights using unsupervised learning.
Learn about the cosine similarity metric and how it's used.
Chapter Goals:
Understand how the cosine similarity metric measures the similarity
between two data observations
The cosine similarity for two vectors, u and v, is calculated as the dot product
between the L2-normalization of the vectors. The exact formula for cosine
similarity is:
cossim(u, v) = (u / ||u||_2) · (v / ||v||_2)
where ||u||_2 represents the L2 norm of u, and ||v||_2 represents the L2 norm of v.
In scikit-learn, cosine similarity is implemented via the cosine_similarity
function (which is part of the metrics.pairwise module). It calculates the
cosine similarities for pairs of data observations in a single dataset, or pairs of
data observations between two datasets.
When we only pass in one dataset into cosine_similarity , the function will
compute cosine similarities between pairs of observations within the dataset.
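The example code is not included in this excerpt; a minimal sketch (data is the predefined array of 4 observations):

from sklearn.metrics.pairwise import cosine_similarity

cos_sims = cosine_similarity(data)   # pairwise similarities within data
print('{}\n'.format(repr(cos_sims)))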
In the code above, we passed in data (which contains 4 data observations), so
the output of cosine_similarity is a 4x4 array of cosine similarity values.
The value at index (i, j) of cos_sims is the cosine similarity between data
observations i and j in data . Since cosine similarity is symmetric, the
cos_sims array contains the same values at index (i, j) and (j, i).
Note that the cosine similarity between a data observation and itself is 1,
unless the data observation contains only 0's as feature values (in which case
the cosine similarity is 0).
[ 4.2, 1.25],
[-8.1, 1.2]])
cos_sims = cosine_similarity(data, data2)
print('{}\n'.format(repr(cos_sims)))
In the code above, the value at index (i, j) of cos_sims is the cosine similarity
between data observation i in data and data observation j in data2 . Note that
cos_sims is a 4x3 array, since data contains 4 data observations and data2
contains 3.
Time to Code!
The code exercise for this chapter will be to use the cosine_similarity
function to compute the most similar data observations for each data
observation in data . Both cosine_similarity and data are
imported/initialized in the backend.
First, we need to calculate the pairwise cosine similarities for each data
observation.
# CODE HERE
For each row of cos_sims , the column containing the largest cosine similarity
score represents the most similar data observation. We can find the column
indexes with the largest value by using the argmax function of cos_sims .
We set the keyword argument axis equal to 1 to specify the largest column
indexes for each row.
# CODE HERE
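One possible solution sketch for the two steps above:

cos_sims = cosine_similarity(data)
# note: each row's self-similarity (the diagonal) is 1, so in practice you may
# want to zero it out first, e.g. with np.fill_diagonal(cos_sims, 0)
similar_indexes = cos_sims.argmax(axis=1)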
Nearest Neighbors
Chapter Goals:
Learn how to find the nearest neighbors for a data observation
The code below finds the 5 nearest neighbors for a new data observation
( new_obs ) based on its fitted dataset ( data ).
data = np.array([
[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(data)
# a new observation (the values here are illustrative placeholders)
new_obs = np.array([[5. , 3.5, 1.6, 0.3]])
only_nbrs = nbrs.kneighbors(new_obs,
                            return_distance=False)
print('{}\n'.format(repr(only_nbrs)))
The NearestNeighbors object is fitted with a dataset, which is then used as the
pool of possible neighbors for new data observations. The kneighbors
function takes in new data observation(s) and returns the k nearest neighbors
along with their respective distances from the input data observations. Note
that the nearest neighbors are the neighbors with the smallest distances from
the input data observation. We can choose not to return the distances by
setting the return_distance keyword argument to False .
data = np.array([
[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
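The rest of this example is not shown above; a minimal sketch (the new_obs values are illustrative placeholders):

nbrs = NearestNeighbors()   # the default n_neighbors is 5
nbrs.fit(data)
new_obs = np.array([
  [5. , 3.5, 1.6, 0.3],
  [4.8, 3.2, 1.5, 0.1]])
dists, knbrs = nbrs.kneighbors(new_obs)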
In the code above, the first row of knbrs and dists corresponds to the first data observation in new_obs, while the second row of knbrs and dists corresponds to the second observation in new_obs.
K-Means Clustering
Chapter Goals:
Learn about K-means clustering and how it works
Understand why mini-batch clustering is used for large datasets
A. K-means algorithm
The idea behind clustering data is pretty simple: partition a dataset into
groups of similar data observations. How we go about finding these clusters is
a bit more complex, since there are a number of different methods for
clustering datasets.
cluster = np.array([
[ 1.2, 0.6],
[ 2.4, 0.8],
[-1.6, 1.4],
[ 0. , 1.2]])
print('Cluster:\n{}\n'.format(repr(cluster)))
centroid = cluster.mean(axis=0)
print('Centroid:\n{}\n'.format(repr(centroid)))
An example of K-means clustering on a dataset with 10 clusters chosen (K = 10). Clusters are
distinguished by color. The white crosses represent the centroids of each cluster.
The code below demonstrates how to use the KMeans object (with 3 clusters).
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(data)   # predefined data

# cluster assignments
print('{}\n'.format(repr(kmeans.labels_)))
# centroids
print('{}\n'.format(repr(kmeans.cluster_centers_)))
new_obs = np.array([
[5.1, 3.2, 1.7, 1.9],
[6.9, 3.2, 5.3, 2.2]])
# predict clusters
print('{}\n'.format(repr(kmeans.predict(new_obs))))
The KMeans object uses K-means++ centroid initialization by default. The
n_clusters keyword argument lets us set the number of clusters we want. In
the example above, we applied clustering to data using 3 clusters.
The labels_ attribute of the object tells us the final cluster assignments for
each data observation, and the cluster_centers_ attribute represents the final
centroids. We use the predict function to assign new data observations to one
of the clusters.
B. Mini-batch clustering
When working with very large datasets, regular K-means clustering can be
quite slow. To reduce the computation time, we can perform mini-batch K-
means clustering, which is just regular K-means clustering applied to
randomly sampled subsets of the data (mini-batches) at a time.
from sklearn.cluster import MiniBatchKMeans

# the mini-batch size can be set with the batch_size keyword argument
kmeans = MiniBatchKMeans(n_clusters=3)
kmeans.fit(data)

# cluster assignments
print('{}\n'.format(repr(kmeans.labels_)))
# centroids
print('{}\n'.format(repr(kmeans.cluster_centers_)))
new_obs = np.array([
[5.1, 3.2, 1.7, 1.9],
[6.9, 3.2, 5.3, 2.2]])
# predict clusters
print('{}\n'.format(repr(kmeans.predict(new_obs))))
Note that the clusterings can have different permutations, i.e. different cluster
labelings (0, 1, 2 vs. 1, 2, 0). Otherwise, the cluster assignments for both
KMeans and MiniBatchKMeans should be relatively the same.
Time to Code!
The coding exercise for this chapter will be to complete the kmeans_clustering
function, which will use either KMeans or MiniBatchKMeans for clustering data .
Call kmeans.fit with data as the only argument. Then return kmeans .
Chapter Goals:
Learn about hierarchical clustering using the agglomerative approach
A major assumption that the K-means clustering algorithm makes is that the
dataset consists of spherical (i.e. circular) clusters. With this assumption, the
K-means algorithm will create clusters of data observations that are circular
around the centroids. However, real life data often does not contain spherical
clusters, meaning that K-means clustering might end up producing inaccurate
clusters due to its assumption.
B. Agglomerative clustering
agg = AgglomerativeClustering(n_clusters=3)   # from sklearn.cluster; the cluster count here is illustrative
agg.fit(data)
# cluster assignments
print('{}\n'.format(repr(agg.labels_)))
Since agglomerative clustering doesn't make use of centroids, there's no
cluster_centers_ attribute in the AgglomerativeClustering object. There's also
no predict function for making cluster predictions on new data (whereas K-means clustering can use its final centroids to make predictions for new data).
Mean Shift Clustering
Chapter Goals:
Learn about the mean shift clustering algorithm
Each of the clustering algorithms we've used so far require us to pass in the
number of clusters. This is fine if we already know the number of clusters we
want, or have a good guess for the number of clusters. However, if we don't
have a good idea of what the actual number of clusters for the dataset should
be, there exist algorithms that can automatically choose the number of
clusters for us.
One such algorithm is the mean shift clustering algorithm. Like the K-means
clustering algorithm, the mean shift algorithm is based on finding cluster
centroids. Since we don't provide the number of clusters, the algorithm will
look for "blobs" in the data that can be potential candidates for clusters.
from sklearn.cluster import MeanShift

mean_shift = MeanShift()
mean_shift.fit(data)

# cluster assignments
print('{}\n'.format(repr(mean_shift.labels_)))
# centroids
print('{}\n'.format(repr(mean_shift.cluster_centers_)))
new_obs = np.array([
[5.1, 3.2, 1.7, 1.9],
[6.9, 3.2, 5.3, 2.2]])
# predict clusters
print('{}\n'.format(repr(mean_shift.predict(new_obs))))
Chapter Goals:
Learn about the DBSCAN algorithm
A. Clustering by density
The mean shift clustering algorithm in the previous chapter usually performs
sufficiently well and can choose a reasonable number of clusters. However, it
is not very scalable due to computation time and still makes the assumption
that clusters have a "blob"-like shape (although this assumption is not as
strong as the one made by K-means).
High-density regions are defined by core samples, which are just data
observations with many neighbors. Each cluster consists of several core
samples and all the observations that are neighbors to a core sample.
Unlike the mean shift algorithm, the DBSCAN algorithm is both highly scalable
and makes no assumptions about the underlying shape of clusters in the
dataset.
When using scikit-learn's DBSCAN object, the ε value is set with the eps keyword argument; it defines the maximum distance between two observations for them to be considered neighbors, so smaller ε values result in smaller and more tightly packed clusters. We also specify the minimum number of points in the neighborhood of a data observation for the observation to be considered a core sample (the min_samples keyword argument); the neighborhood consists of the data observation and all its neighbors.
The code below demonstrates how to use the DBSCAN object, with ε equal to 1.2
and a minimum size of 30 for a core sample's neighborhood.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=1.2, min_samples=30)
dbscan.fit(data)   # predefined data

# cluster assignments
print('{}\n'.format(repr(dbscan.labels_)))
# core samples
print('{}\n'.format(repr(dbscan.core_sample_indices_)))
num_core_samples = len(dbscan.core_sample_indices_)
print('Num core samples: {}\n'.format(num_core_samples))
In the code above, we used DBSCAN to cluster the 150 data observations in
data . The algorithm found two clusters. In this case all the data observations
fit in a cluster, but in general the non-cluster observations would be labeled
with -1 .
Chapter Goals:
Learn how to evaluate clustering algorithms
A. Evaluation metrics
When we don't have access to any true cluster assignments (labels), the best
we can do to evaluate clusters is to just take a look at them and see if they
make sense with respect to the dataset and domain. However, if we do have
access to the true cluster labels for the data observations, we can apply a
number of metrics to evaluate our clustering algorithm.
One popular evaluation metric is the adjusted Rand index. The regular Rand
index gives a measurement of similarity between the true clustering
assignments (true labels) and the predicted clustering assignments (predicted
labels). The adjusted Rand index (ARI) is a corrected-for-chance version of the
regular one, meaning that the score is adjusted so that random clustering
assignments will not have a good score.
The ARI value ranges from -1 to 1, inclusive. Negative scores represent bad
labelings, random labelings will get a score near 0, and perfect labelings get a
score of 1.
from sklearn.metrics import adjusted_rand_score
# true_labels is a predefined array of true cluster assignments

# Perfect labeling
perf_labels = np.array([0, 0, 0, 1, 1, 1])
ari = adjusted_rand_score(true_labels, perf_labels)
print('{}\n'.format(ari))
from sklearn.metrics import adjusted_mutual_info_score
# pred_labels is a predefined array of predicted cluster assignments

# symmetric
ami = adjusted_mutual_info_score(pred_labels, true_labels)
print('{}\n'.format(ami))
# Perfect labeling
perf_labels = np.array([0, 0, 0, 1, 1, 1])
ami = adjusted_mutual_info_score(true_labels, perf_labels)
print('{}\n'.format(ami))
# Perfect labeling, permuted
permuted_labels = np.array([1, 1, 1, 0, 0, 0])
ami = adjusted_mutual_info_score(true_labels, permuted_labels)
print('{}\n'.format(ami))
The ARI and AMI metrics are very similar. They both assign a score of 1.0 to
perfect labelings, a score near 0.0 to random labelings, and negative scores to
poor labelings.
A general rule of thumb for when to use which: ARI is used when the true
clusters are large and approximately equal in size, while AMI is used when the
true clusters are unbalanced in size and small clusters exist.
Feature Clustering
Chapter Goals:
Learn how to use agglomerative clustering for feature dimensionality
reduction
# predefined data
print('Original shape: {}\n'.format(data.shape))
print('First 10:\n{}\n'.format(repr(data[:10])))
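As a rough sketch of what feature clustering looks like in scikit-learn, we can
use the FeatureAgglomeration object to merge similar features together. The
object below and the choice of 2 output features are illustrative, not
necessarily the chapter's exact code.
from sklearn.cluster import FeatureAgglomeration

# merge the original features into 2 clustered features
agg = FeatureAgglomeration(n_clusters=2)
new_data = agg.fit_transform(data)
print('New shape: {}\n'.format(new_data.shape))
print('First 10:\n{}\n'.format(repr(new_data[:10])))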
Quiz
What does cosine similarity measure (in terms of a dataset)?
Introduction
The previous 3 sections of this course used scikit-learn for a variety of tasks.
In this section, we cover XGBoost, a state-of-the-art data science library for
performing classification and regression. XGBoost makes use of gradient
boosted decision trees, which provide better performance than regular
decision trees.
In data science and machine learning competitions that use small to medium
sized datasets (e.g. Kaggle), XGBoost models are consistently among the top
performers.
The problem with regular decision trees is that they are often not complex
enough to capture the intricacies of many large datasets. We could
continuously increase the maximum depth of a decision tree to fit larger
datasets, but decision trees with many nodes tend to overfit the data.
Instead, we make use of gradient boosting to combine many decision trees into
a single model for classification or regression. Gradient boosting starts off
with a single decision tree, then iteratively adds more decision trees to the
overall model to correct the model's errors on the training dataset.
The XGBoost API handles the gradient boosting process for us, which produces
a much better model than if we had used a single decision tree.
XGBoost Basics
Chapter Goals:
Learn about the XGBoost data matrix
Train a Booster object in XGBoost
The basic data structure for XGBoost is the DMatrix , which represents a data
matrix. The DMatrix can be constructed from NumPy arrays.
The code below creates DMatrix objects with and without labels.
data = np.array([
[1.2, 3.3, 1.4],
[5.1, 2.2, 6.6]])
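As a hedged sketch, the complete snippet would look something like the
following (the label values below are illustrative, not from the original
example):
import xgboost as xgb

# data is the NumPy array defined above
dmat1 = xgb.DMatrix(data)                # DMatrix without labels
labels = np.array([0, 1])                # illustrative label values
dmat2 = xgb.DMatrix(data, label=labels)  # DMatrix with labels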
The DMatrix object can be used for training and using a Booster object, which
represents the gradient boosted decision tree. The train function in XGBoost
lets us train a Booster with a specified set of parameters.
The code below trains a Booster object using a predefined labeled dataset.
# training parameters
params = {
  'max_depth': 0,
  'objective': 'binary:logistic'
}
print('Start training')
# dtrain is a predefined labeled DMatrix
bst = xgb.train(params, dtrain)  # booster
print('Finish training')
A list of the possible parameters and their values can be found in the XGBoost
documentation. In the example above, we set the 'max_depth' parameter to 0
(which means no limit on the tree depths, equivalent to None in scikit-learn).
We also set the 'objective' parameter (the objective function) to binary
classification via logistic regression. For the remaining available
parameters, we used their default settings (so we didn't include them in
params ).
B. Using a Booster
Note that the model's predictions (from the predict function) are
probabilities, rather than class labels. The actual label classifications are just
the rounded probabilities. In the example above, the Booster predicts classes
of 0 and 1, respectively.
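As a sketch of what making predictions looks like, the Booster's predict
function takes in a DMatrix of observations. Here, deval is an assumed DMatrix
of new, unlabeled observations (not part of the original example).
predictions = bst.predict(deval)
print(predictions)  # probabilities for the positive class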
Time to Code!
The coding exercise for this chapter will be to train a Booster object on input
data and labels (predefined in the backend). The Booster will perform
multiclass classification on a dataset with 3 classes, using trees with a
maximum depth of 2.
First, set dtrain equal to xgb.DMatrix initialized with data as the required
argument and labels for the label keyword argument.
# CODE HERE
This means that the parameters for the Booster object will have 'max_depth'
set to 2 , 'objective' set to 'multi:softmax' , and 'num_class' set to 3 .
Set params equal to a dictionary with the specified keys and values.
# CODE HERE
Using the data matrix and parameters, we'll train the Booster .
Set bst equal to xgb.train applied with params and dtrain as the
required arguments.
# CODE HERE
Cross-Validation
Chapter Goals:
Learn how to cross-validate parameters in XGBoost
A. Choosing parameters
Since there are many parameters in XGBoost and several possible values for
each parameter, it is usually necessary to tune the parameters. In other words,
we want to try out different parameter settings and see which one gives us the
best results.
XGBoost provides the cv function for evaluating a set of parameters with
K-fold cross-validation. The output of cv is a pandas DataFrame (see the Data
Processing section for details). It contains the training and testing results
(mean and standard deviation) of a K-fold cross-validation applied for a given
number of boosting iterations. The value of K for the K-fold cross-validation
is set with the nfold keyword argument (default is 3).
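As a hedged sketch of applying cv (dtrain is assumed to be a predefined
labeled DMatrix, and the parameter values below are illustrative):
params = {'max_depth': 2, 'objective': 'binary:logistic'}
cv_results = xgb.cv(params, dtrain, num_boost_round=10, nfold=5)
# pandas DataFrame of per-iteration train/test metric means and stds
print(cv_results)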
Chapter Goals:
Learn how to save and load Booster models in XGBoost
After finding the best parameters for a Booster and training it on a dataset,
we can save the model into a binary file. Each Booster contains a function
called save_model , which saves the model's binary data into the specified file.
The code below saves a trained Booster object, bst , into a binary file called
model.bin.
bst.save_model('model.bin')
We can restore a Booster from a binary file using the load_model function.
This requires us to initialize an empty Booster and load the file's data into it.
The code below loads the previously saved Booster from model.bin.
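A minimal sketch of that pattern:
new_bst = xgb.Booster()          # initialize an empty Booster
new_bst.load_model('model.bin')  # load the saved binary data into it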
Chapter Goals:
Learn how to create a scikit-learn style classifier in XGBoost
While XGBoost provides a more efficient model than scikit-learn, using the
model can be a bit convoluted. For people who are used to scikit-learn,
XGBoost provides wrapper APIs around its model for classification and
regression. These wrapper APIs allow us to use XGBoost's efficient model in
the same style as scikit-learn.
model = xgb.XGBClassifier()
# predefined data and labels
model.fit(data, labels)
Note that the predict function for XGBClassifier returns actual predictions
(not probabilities).
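As a sketch of making predictions with the wrapper API, new_data below is a
hypothetical NumPy array with the same number of features as the training
data:
print(model.predict(new_data))        # class label predictions
print(model.predict_proba(new_data))  # class probabilities, if needed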
All the parameters for the original Booster object are now keyword
arguments for the XGBClassifier . For instance, we can specify the type of
classification, i.e. the 'objective' parameter for Booster objects, with the
objective keyword argument (the default is binary classification).
model = xgb.XGBClassifier(objective='multi:softmax')
# predefined data and labels (multiclass dataset)
model.fit(data, labels)
Chapter Goals:
Learn how to create a scikit-learn style regression model in XGBoost
model = xgb.XGBRegressor(max_depth=2)
# predefined data and labels (for regression)
model.fit(data, labels)
Just like the XGBClassifier object, we can specify the model's parameters with
keyword arguments. In the code above, we set the max_depth parameter
(representing the depth of the boosted decision tree) to 2.
Feature Importance
Chapter Goals:
Understand how to measure each dataset feature's importance in making
model predictions
Use the matplotlib pyplot API to save a feature importance plot to a file
Not every feature in a dataset is used equally for helping a boosted decision
tree make predictions. Certain features are more important than others, and it
is useful to figure out which features are the most important.
The code below prints out the relative feature importances of a model trained
on a dataset of 4 features.
model = xgb.XGBClassifier()
# predefined data and labels
model.fit(data, labels)
# relative importance of each of the 4 features
print(model.feature_importances_)
We can plot the feature importances for a model using the plot_importance
function.
model = xgb.XGBRegressor()
# predefined data and labels (for regression)
model.fit(data, labels)
xgb.plot_importance(model)
plt.show() # matplotlib plot
Plotting the feature importances and showing the plot using matplotlib.pyplot (plt).
The resulting plot is a bar graph of the F score for each feature (the number
next to each bar is the exact F score). By default, a feature's F score counts
how many times the feature is used to split the data across all of the boosted
trees; note that this is not the same as the F1-score used in classification
metrics. The features are labeled as "fN", where N is the index of the column
in the dataset, and the F score serves as a standardized measurement of a
feature's importance based on the specified importance metric.
model = xgb.XGBRegressor()
# predefined data and labels (for regression)
model.fit(data, labels)
xgb.plot_importance(model, importance_type='gain')
plt.show() # matplotlib plot
Plotting the feature importances with information gain as the importance metric
In the code above, we set importance_type equal to 'gain' , which means that
we use information gain as the importance metric. Information gain is a
commonly used metric for determining how good a feature is at
differentiating the dataset, which is important in making predictions with a
decision tree.
Finally, if we don't want to show the exact F-score next to each bar, we can set
the show_values keyword argument to False .
model = xgb.XGBRegressor()
# predefined data and labels (for regression)
model.fit(data, labels)
xgb.plot_importance(model, show_values=False)
plt.savefig('importances.png') # save to PNG file
Chapter Goals:
Apply grid search cross-validation to an XGBoost model
One of the benefits of using XGBoost's scikit-learn style models is that we can
use the models with the actual scikit-learn API. A common scikit-learn object
used with XGBoost models is the GridSearchCV wrapper. For more on
GridSearchCV see the Data Modeling section.
model = xgb.XGBClassifier()
params = {'max_depth': range(2, 5)}
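A hedged sketch completing the grid search above (data and labels are assumed
to be predefined, and cv=4 is an illustrative fold count):
from sklearn.model_selection import GridSearchCV

cv_model = GridSearchCV(model, params, cv=4)
cv_model.fit(data, labels)
print(cv_model.best_params_)  # the best 'max_depth' value found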
Chapter Goals:
Save and load XGBoost models with the joblib API
Since the XGBClassifier and XGBRegressor models follow the same format as
scikit-learn models, we can save and load them in the same way, using the
joblib API.
With the joblib API, we save and load models using the dump and load
functions, respectively. See the earlier scikit-learn chapters for specific
examples of model persistence with scikit-learn models.
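As a minimal sketch (the filename below is illustrative):
from joblib import dump, load

dump(model, 'xgb_model.joblib')          # save the trained model
loaded_model = load('xgb_model.joblib')  # restore it later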
Quiz
What type of model does XGBoost use?
Introduction
An overview of the multilayer perceptron neural network and deep learning in TensorFlow.
In the Intro to Deep Learning section, you will learn about one of the most
essential neural networks used in deep learning, the multilayer perceptron.
A. Multilayer perceptron
The multilayer perceptron (MLP) is used for a variety of tasks, such as stock
analysis, spam detection, and election voting predictions. In the following
chapters you will learn how to code your own MLP and apply it to the task of
classifying 2-D points in the Cartesian plane.
You will also learn the basics of creating a computation graph, i.e. the
structure of a neural network. The structure of a neural network can be
viewed in layers of neurons:
Multilayer perceptron
In the diagram, the circles represent neurons and the connections are the
arrows going from neurons in one layer to the next. There are three layers in
this diagram's neural network:
Input layer: The first (bottom) layer.
Output layer: The last (top) layer.
Hidden layer(s): The layer(s) between the input and output layers. In the
diagram, there is 1.
The number of hidden layers represents how "deep" a model is, and you'll see
the power of adding hidden layers to a model.
B. TensorFlow
To code our neural network model, we will be using TensorFlow, one of the
most popular deep learning frameworks. The name for TensorFlow is derived
from tensors, which are basically multidimensional (i.e. generalized)
vectors/matrices. When writing the code, it may be easier to think of anything
with numeric values as being a tensor.
Chapter Goals:
Define a class for an MLP model
Initialize the model
A. Placeholder
The required argument for the placeholder is its type. In Deep Learning with
TensorFlow, our input data is pairs of (x, y) points, so self.inputs has type
tf.float32 . The labels (which are explained in a later chapter) have type
tf.int32 .
The shape argument is a tuple of integers representing the size of each of the
placeholder tensor's dimensions. In Deep Learning with TensorFlow, and
many real world problems, the shape of the input data will be a two integer
tuple. If we view the input data as coming from a data table, the shape is akin
to the dimensions of the table.
x_1 y_1
x_2 y_2
... ...
x_d y_d
The shape of this data is (d, 2). There are d data points, each of dimension 2.
The first integer represents the number of data points we pass in (i.e. number
of rows in the data table). When training the model, we refer to the first
dimension as the batch size.
The second integer represents the number of features in the dataset (i.e.
number of columns). In the remaining chapters of Deep Learning with
TensorFlow, our input data will have two features: the x and y coordinates of
the data point. So input_size is 2.
Each data point also has a label, which is used to identify and categorize the
data. The labels we use for our model's data will have a two dimension shape,
with output_size as the second dimension (the first dimension is still the
batch size). The output_size refers to the number of possible classes a label
can have (explained in a later chapter).
One thing to note about TensorFlow is the use of None in place of a dimension
size. When we use None in the shape tuple, we are allowing that dimension to
take on any size.
x_1 y_1
x_2 y_2
... ...
x_? y_?
The shape of this data is (None, 2). There can be a variable number of data points, each of
dimension 2.
This is particularly useful because we will run our neural network on batches
with different numbers of data observations.
When we feed in multiple data observations, we don't actually use multiple
neural networks. Rather, we use the same neural network on each of the
observations simultaneously to obtain the network's output for each one.
Time to Code!
The code for this chapter will focus on initializing the placeholders for the
MLP model.
First you'll define the placeholder for the model's input data, in the function
init_inputs . Note that TensorFlow is already imported in the backend via the
import statement import tensorflow as tf .
Set inputs equal to a tf.placeholder with data type tf.float32 and keyword
arguments shape=(None, input_size) and name='inputs' .
def init_inputs(input_size):
# CODE HERE
pass
Next, we'll define placeholders for the labels, in the function init_labels .
Set labels equal to a tf.placeholder with data type tf.int32 and keyword
arguments shape=(None, output_size) and name='labels' .
def init_labels(output_size):
# CODE HERE
pass
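One possible completion of these two functions is sketched below (the name of
the inputs placeholder is an assumption; this is not the course's official
solution):
def init_inputs(input_size):
    # placeholder for batches of (x, y) input points
    inputs = tf.placeholder(tf.float32, shape=(None, input_size), name='inputs')
    return inputs

def init_labels(output_size):
    # placeholder for the corresponding labels
    labels = tf.placeholder(tf.int32, shape=(None, output_size), name='labels')
    return labels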
Logits
Dive into the inner layers of a neural network and understand the importance of logits.
Chapter Goals:
Build a single fully-connected model layer
Output logits, AKA log-odds
A. Fully-connected layer
Before we can get into multilayer perceptrons, we need to start off with a
single layer perceptron. The single fully-connected layer means that the input
layer, i.e. self.inputs , is directly connected to the output layer, which has
output_size neurons. Each of the input_size neurons in the input layer has a
connection to each neuron in the output layer, hence the fully-connected
layer.
In addition to the input and output layers, the diagram above includes a bias
neuron. The bias neuron helps our neural network produce better results, by
allowing each fully-connected layer to model a true linear combination of the
input values.
B. Weighted connections
The forces that drive a neural network are the real number weights attached
to each connection. The weight on a connection from neuron A into neuron B
tells how strongly A affects B as well as whether that effect is positive or
negative, i.e. direct vs. inverse relationship.
The diagram above has three weighted connections:
A → B: Direct relationship.
A → C: No relationship.
A → D: Inverse relationship.
Fully-connected layer with input neuron values x1 and x2, a bias neuron, and weight values w1,
w2, and w3
w1 ⋅ x1 + w2 ⋅ x2 + w3
The logits produced by our single layer perceptron are therefore just a linear
combination of the input data feature values.
C. Logits
So what exactly are logits? In classification problems they represent log-odds,
which maps a probability between 0 and 1 to a real number. When
output_size = 1 , our model outputs a single logit per data point. The logits
will then be converted to probabilities representing how likely it is for the
data point to be labeled 1 (as opposed to 0).
In the above diagram, the x-axis represents the probability and the y-axis
represents the logit.
Note the vertical asymptotes at x = 0 and x = 1.
We want our neural network to produce logits with large absolute values,
since those represent probabilities close to 0 (meaning we are very sure the
label is 0/False) or probabilities close to 1 (meaning we are very sure the label
is 1/True).
D. Regression
In the next chapter, you'll be producing actual probabilities from the logits.
This makes our single layer perceptron model equivalent to logistic
regression. Despite the name, logistic regression is used for classification, not
regression problems. If we wanted a model for regression problems (i.e.
predicting a real number such as a stock price), we would have our model
directly output the logits rather than convert them to probabilities. In this
case, it would be better to rename logits , since they don't map to a
probability anymore.
Time to Code!
The code for this chapter will build a single layer perceptron, whose output is
logits . The code goes inside the model_layers function.
We're going to obtain the logits by applying a dense layer to inputs (the
placeholder from Chapter 2) to return a tensor with shape (None,
self.output_size) .
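As a sketch of this single fully-connected layer (using the TF 1.x tf.layers
API; the course wraps this in a model class, so the standalone function below
is only an illustration):
def model_layers(inputs, output_size):
    # one fully-connected layer mapping the inputs directly to the logits
    logits = tf.layers.dense(inputs, output_size, name='logits')
    return logits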
Discover the most commonly used metrics for evaluating a neural network.
Chapter Goals:
Convert logits to probabilities
Obtain model predictions from probabilities
Calculate prediction accuracy
A. Sigmoid
As discussed in the previous chapter, our model outputs logits based on the
input data. These logits represent real number mappings from probabilities.
Therefore, if we had the inverse mapping we could obtain the original
probabilities. Luckily, we have exactly that, in the form of the sigmoid
function.
The above plot shows a sigmoid function graph. The x-axis represents logits,
while the y-axis represents probabilities.
A probability closer to 1 means the model is more sure that the label is 1,
while a probability closer to 0 means the model is more sure that the label is
0. A probability close to 0.5 means the model is unsure (0.5 is equivalent to a
random guess of the label's value). Therefore, we can obtain model predictions
just by rounding each probability to the nearest integer, which would be 0 or
1. Then our prediction accuracy would be the number of correct predictions
divided by the number of labels.
We use these metrics to evaluate how good our model is both during and after
training. This way we can experiment with different computation graphs,
such as different numbers of neurons or layers, and find which structure
works best for our model.
Time to Code!
The coding exercise for this chapter focuses on obtaining predictions and
accuracy based on model logits.
In the backend, we've loaded the logits tensor from the previous chapter's
model_layers function. We'll now obtain probabilities from the logits using
the sigmoid function.
Set probs equal to tf.nn.sigmoid applied with logits as the required
argument.
# CODE HERE
Next, we'll round each probability to the nearest integer.
Set rounded_probs equal to tf.round applied with probs as the required
argument.
# CODE HERE
The problem with rounded_probs is that it's still type tf.float32 , which
doesn't match the type of the labels placeholder. This is an issue, since we
need the types to match to compare the two. We can fix this problem using
tf.cast .
# CODE HERE
The final metric we want is the accuracy of our predictions. We'll directly
compare predictions to labels by using tf.equal , which returns a tensor
that has True at each position where our prediction matches the label and
False elsewhere. Let's call this tensor is_correct .
# CODE HERE
The neat thing about is_correct is that the number of True values divided by
the number of total values in the tensor gives us our accuracy. We can use
tf.reduce_mean to do this calculation. We just need to cast is_correct to type
tf.float32 , which converts True to 1.0 and False to 0.0.
Set is_correct_float equal to tf.cast with is_correct as the first
argument and data type tf.float32 as the second argument.
# CODE HERE
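Putting the steps above together, a possible sketch of the metric code looks
like this (logits and labels are the tensors from the previous chapters; this
is not the official solution):
probs = tf.nn.sigmoid(logits)               # logits -> probabilities
rounded_probs = tf.round(probs)             # round to 0.0 or 1.0
predictions = tf.cast(rounded_probs, tf.int32)
is_correct = tf.equal(predictions, labels)  # elementwise comparison
is_correct_float = tf.cast(is_correct, tf.float32)
accuracy = tf.reduce_mean(is_correct_float)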
Optimization
Chapter Goals:
Know the relationship between training, weights, and loss
Understand the intuitive definition of loss
Obtain the model's loss from logits
Write a training operation to minimize the loss
A. What is training?
For any neural network, training involves setting up a loss function. The loss
function tells us how bad the neural network's output is compared to the
actual labels.
Since a larger loss means a worse model, we want to train the model to output
values that minimize the loss function. The model does this by learning the
optimal weight settings. Remember, the weights are just real numbers, so the
model is essentially just figuring out the best numbers to set the weights to.
B. Loss as error
A common way to define the error is with the L1 norm, which sums the absolute
differences between the actual labels and the predicted values:
∑ᵢ |actualᵢ − predictedᵢ|
or the L2 norm, which sums the squared differences:
∑ᵢ (actualᵢ − predictedᵢ)²
These provide an error metric for how far the predictions are from the labels,
so the goal is to minimize the prediction error by minimizing the L1 or L2
norm.
C. Cross entropy
Rather than defining error as being right or wrong in our prediction, we can
instead define it in terms of probability. Therefore, we want a loss function
that is small when the probability is close to the label (i.e. a probability of 0.99
for a label of 1) and large when the probability is far from the label (i.e. a
probability of 0.99 for a label of 0). The loss function that achieves this is
known as cross entropy, also referred to as log loss.
Cross entropy (log loss) for a label of 1. The x-axis represents the probability and the y-axis
represents the log loss.
D. Optimization
Now we can just minimize the cross entropy based on the model's logits and
labels to get our optimal weights. We do this through gradient descent, where
the model updates its weights based on the gradient of the loss function until
it reaches the minimum loss (at which point the weights converge to the
optimum). We use backpropagation to compute the gradient of the loss with
respect to each of the model's weights. Gradient descent is implemented as an
object in TensorFlow, called tf.train.GradientDescentOptimizer .
In the above graph, the colored shape represents values of the loss function,
and the x and y axes represent weight values in the model. The model follows
a gradient (red line) towards the minimum of the loss function.
The size of each weight update depends on something called the learning rate.
A larger learning rate means the model could potentially reach the minimum
loss quicker, but could also overshoot the minimum. Smaller learning rates
are more likely to reach the minimum, but may take longer. Usually we test
out learning rates between 0.001 and 0.1 to find the best one for model
training.
You can set the learning rate via the learning_rate argument when
initializing a TensorFlow Optimizer (e.g. GradientDescentOptimizer ).
Time to Code!
The coding exercises for this chapter build on top of the code from the
previous chapter. Specifically, the optimization code for this chapter is only
needed for model training (not evaluation or testing).
We'll first set up the loss parameter, a variable named loss . Since loss will
be a floating-point number, we'll need to cast labels to type tf.float32 .
Set labels_float equal to tf.cast with labels as the first argument and
data type tf.float32 as the second argument.
# CODE HERE
Next, set cross_entropy equal to tf.nn.sigmoid_cross_entropy_with_logits
applied with the keyword arguments labels=labels_float and logits=logits .
# CODE HERE
Since cross_entropy represents the sigmoid cross entropy for each input data
label (so its shape is (None, 1) ), we'll use the overall mean of the cross
entropy as our loss.
# CODE HERE
We'll now use the Adam optimization algorithm to set our training operation.
First, we'll initialize an Adam optimizer object called adam .
Then we'll use adam to minimize loss , which becomes our training operation
(i.e. how we train the model's weights).
# CODE HERE
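A possible sketch of the loss and training operation described above (logits
and labels come from the previous chapters; this is not the official solution):
labels_float = tf.cast(labels, tf.float32)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=labels_float, logits=logits)
loss = tf.reduce_mean(cross_entropy)   # mean cross entropy over the batch
adam = tf.train.AdamOptimizer()
train_op = adam.minimize(loss)         # the training operation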
Training
Initialize and train a TensorFlow neural network using actual training data.
Chapter Goals:
Learn how to feed data values into a neural network
Understand how to run a neural network using input values
Train the neural network on batched input data and labels
In this chapter and the next, you will be running your model on input data,
using a tf.Session object and the tf.placeholder variables from the previous
chapters.
The tf.Session object has an extremely important function called run . All the
code written in the previous chapters was to build the computation graph of
the neural network, i.e. its layers and operations. However, we can only train
or evaluate the model on real input data using run . The function takes in a
single required argument and a few keyword arguments.
B. Using run
t = tf.constant([1, 2, 3])
sess = tf.Session()
arr = sess.run(t)
print('{}\n'.format(repr(arr)))
t2 = tf.constant(4)
tup = sess.run((t, t2))
print('{}\n'.format(repr(tup)))
Of the keyword arguments for run , the important one for most applications is
feed_dict . The feed_dict is a python dictionary. Each key is a tensor from the
model's computation graph. The key's value can be a Python scalar, list, or
NumPy array.
C. Initializing variables
When we call run , every tensor in the model's computation graph must either
already have a value or must be fed in a value through feed_dict . However,
when we start training from scratch, none of our variables (e.g. weights) have
values yet. We need to initialize all the variables using
tf.global_variables_initializer . This returns an operation that, when used
as the required argument in run , initializes all the variables in the model.
In the code below, the variables that are initialized are part of
tf.layers.dense . The variable initialization process is defined internally by
the function: by default, the layer's weights are randomly initialized and its
bias starts at zero.
inputs = tf.placeholder(tf.float32, shape=(None, 2))
feed_dict = {
inputs: [[1.1, -0.3],
[0.2, 0.1]]
}
logits = tf.layers.dense(inputs, 1, name='logits')
init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op) # variable initialization
arr = sess.run(logits, feed_dict=feed_dict)
print('{}\n'.format(repr(arr)))
D. Training logistics
# predefined dataset
print('Input data:')
print('{}\n'.format(repr(input_data)))
print('Labels:')
print('{}\n'.format(repr(input_labels)))
The batch size determines how the model trains. Larger batch sizes usually
result in faster training but less accurate models, and vice-versa for smaller
batch sizes. Choosing the batch size is a speed-precision tradeoff.
When training a neural network, it's usually a good idea to print out the loss
every so often, so you know that the model is training correctly and to stop the
training when the loss has converged to a minimum.
Time to Code!
The coding exercise for this chapter sets up the training utility, using the
ADAM training operation and model layers from the previous chapters.
We first need to initialize all the model variables (e.g. the variables created
by tf.layers.dense ).
Set init_op equal to tf.global_variables_initializer() . Then create a
tf.Session object named sess , and call its run function on init_op .
# CODE HERE
After initializing the variables, we can run our model training. We'll run the
training for 1000 steps. We provide a for loop for you, which iterates 1000
steps.
The rest of the code for this chapter goes inside the for loop.
In each iteration of the for loop, we will feed the ith data observation (and
its corresponding label) into the model through the feed_dict .
The ADAM training operation ( train_op ) is defined in the backend, using the
code from Chapter 5. Using sess.run , we can run training on the input data
and labels.
Using the sess object defined earlier, call the run function with first
argument train_op and keyword argument feed_dict=feed_dict .
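A sketch of the full training setup is shown below. Here input_data,
input_labels, inputs, labels, and train_op are assumed to be defined as
described above, input_data is assumed to contain at least 1000 observations,
and feeding one observation per step is just the scheme used in this sketch.
init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)  # variable initialization
for i in range(1000):
    # feed in the ith data observation and its label
    feed_dict = {
        inputs: input_data[i:i+1],
        labels: input_labels[i:i+1]
    }
    sess.run(train_op, feed_dict=feed_dict)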
Evaluate a fully trained neural network using the model accuracy as the evaluation metric.
Chapter Goals:
Evaluate model performance on a test set
The code for this chapter makes use of the accuracy metric defined in Chapter
4. The accuracy represents the classification accuracy of an already trained
model, i.e. the proportion of correct predictions the model makes on a test set.
Since we used None when defining our placeholder shapes, it allows us to run
training or evaluation on any number of data points, which is super helpful
since we normally want to evaluate on many more data points than the
training batch size.
Training set (~80% of dataset): Used for model training and optimization
Validation set (~10% of dataset): Used to evaluate the model in between
training runs, e.g. when tweaking model parameters like batch size
Test set (~10% of dataset): Used to evaluate the final model, usually
through some accuracy metric
Time to Code!
The coding exercise for this chapter uses the accuracy metric from Chapter 4
(which is initialized in the backend). We also provide the test set's data and
labels ( test_data and test_labels ) as NumPy arrays initialized in the
backend, as well as the inputs and labels placeholders.
We've taken the liberty of loading a pretrained model in the backend using a
tf.Session object called sess . You'll be evaluating the accuracy of the
pretrained model on the test data and labels.
Set eval_acc equal to the output of sess.run , with first argument accuracy
and a keyword argument feed_dict=feed_dict .
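As a sketch of the evaluation step (accuracy, inputs, labels, test_data, and
test_labels are all predefined as described above):
feed_dict = {inputs: test_data, labels: test_labels}
eval_acc = sess.run(accuracy, feed_dict=feed_dict)
print('Test accuracy: {}'.format(eval_acc))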
Chapter Goals:
Understand the limitations of a single layer perceptron model
The input data we've been using to train and evaluate the single layer
perceptron model has been pairs of (x, y) points with labels indicating
whether the point is above (labeled 1) or below (labeled 0) the y = x line. We
trained the pretrained model on the generated data for 100,000 iterations at a
batch size of 50.
Running the model on some randomly generated 2-D points will give a plot
like this:
In the plot above, there are 100 randomly generated 2-D points. The model
was tasked with classifying whether each point is above or below the y = x
line (red for above, blue for below). As you can see, the model classifies each
of the points correctly.
However, if we instead train the single layer perceptron on 2-D points being
inside or outside a circle, the plot will look something like this:
In this case, red means the model thinks the point is inside the circle, while
blue means the model thinks the point is outside the circle. In this case, the
single layer perceptron does a very poor job of classifying the points, despite
being trained on the circle dataset until the loss converged.
The lack of performance from the model is not due to undertraining, but
instead due to an inherent limitation in the abilities of a single layer
perceptron. With just a single fully-connected layer, the model is only able to
learn linear decision boundaries. So for any set of points that are divided by a
line in the 2-D plane, the model can be trained to correctly classify those
points. However, for non-linear decision boundaries, such as this circle
example, no matter how much you train the model it will not be able to
perform well in classification.
In the next chapter, we delve into how we can create perceptron models that
are able to learn non-linear decision boundaries.
Hidden Layer
Chapter Goals:
Add a hidden layer to the model's layers
Understand the purpose of non-linear activations
Learn about the ReLU activation function
In the previous chapter we saw that the single layer perceptron was unable to
classify points as being inside or outside a circle centered around the origin.
This was because the output of the model (the logits) only had connections
directly from the input features.
Why exactly is this a limitation? Think about how the connections work in a
single layer neural network. The neurons in the output layer have a
connection coming in from each of the neurons in the input layer, but the
connection weights are all just real numbers.
The single layer neural network architecture. The weight values are denoted w1, w2, and w3.
logits = w1 ⋅ x + w2 ⋅ y + w3
The above linear combination shows that the single layer perceptron can model
any linear boundary. However, the equation of the circle boundary is:
x² + y² = r²
B. Hidden layers
If a single linear combination doesn't work, what can we do? The answer is to
add more linear combinations, as well as non-linearities. We add more linear
combinations to our model by adding more layers. The single layer
perceptron has only an input and output layer. We will now add an additional
hidden layer between the input and output layers, officially making our model
a multilayer perceptron. The hidden layer will have 5 neurons, which means
that it will add an additional 5 linear combinations to the model's
computation.
The multilayer perceptron architecture.
The 5 neuron hidden layer is more than enough for our circle example.
However, for very complex datasets a model could have multiple hidden
layers with hundreds of neurons per layer. In the next chapter, we discuss
some tips on choosing the number of neurons and hidden layers in a model.
C. Non-linearity
The 3 most common activation functions used in deep learning are tanh,
ReLU, and the aforementioned sigmoid. Each has its uses in deep learning, so
it's normally best to choose activation functions based on the problem.
However, the ReLU activation function has been shown to work well in most
general-purpose situations, so we'll apply ReLU activation for our hidden
layer.
D. ReLU
You might wonder why ReLU even works. While tanh and sigmoid are both
inherently non-linear, the ReLU function seems pretty linear (it's just f(x) = 0
for x < 0 and f(x) = x for x ≥ 0). However, let's take a look at the following
function:
Though a bit rough around the edges, it looks somewhat like the quadratic
function, f(x) = x². In fact, with enough linear combinations and ReLU
activations, a model can learn to approximate virtually any non-linear
function.
Time to Code!
The coding exercise for this chapter involves modifying the model_layers
function from Chapter 3. You will be adding an additional hidden layer to the
model, to change it from a single layer to a multilayer perceptron.
The additional hidden layer will go directly before the 'logits' layer.
We also need to update the layer which produces the logits, so that it takes in
hidden1 as input.
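A possible sketch of the updated model_layers function is shown below (TF 1.x
tf.layers API; the layer names are illustrative, and this is not the official
solution):
def model_layers(inputs, output_size):
    # 5-neuron hidden layer with ReLU activation
    hidden1 = tf.layers.dense(inputs, 5, activation=tf.nn.relu, name='hidden1')
    # the logits layer now takes hidden1 as its input
    logits = tf.layers.dense(hidden1, output_size, name='logits')
    return logits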
After adding in the hidden layer to the model, the multilayer perceptron will
be able to classify the 2-D circle dataset. Specifically, the classification plot will
now look like:
Blue represents points the model thinks are outside the circle and red
represents points the model thinks are inside. As you can see, the model is a
lot more accurate now.
Multiclass
Chapter Goals:
Learn about multiclass classification
Understand the purpose of multiple hidden layers
Learn the pros and cons of adding hidden layers
The example is an extension of the previous circle example, but now there is
an additional circle with radius 1 centered at the origin. The classes are now
based on where each point falls: inside both circles, between the two circles,
or outside both circles.
B. One-hot
With multiclass classification, each label is a one-hot vector: a vector with
length equal to the number of classes, containing a 1 at the index of the
label's class (the "hot" index) and 0 everywhere else.
0 1 0
0 0 1
An example set of one-hot labels. In this case there are 3 possible classes, and each label has
exactly one hot index.
When deciding how many hidden layers a model needs (i.e. how deep it is)
and how many neurons are at each hidden layer, it is a good idea to base the
decision on the problem itself. There are a few general rules of thumb, but
they do not apply to every scenario. For example, it is common not to need
more than 3 hidden layers in a neural network, but if you are working on a
complicated problem you would most likely need more (Google's AlphaGo
needed more than a dozen layers).
If you don't have much domain knowledge for the particular problem you're
working on, it's usually best to only add extra layers or neurons when they're
needed. The fewer layers and neurons, the faster your model trains, and the
quicker you can evaluate how good it is. It then becomes easier to optimize
the number of layers and neurons in your model through experimentation.
D. Overfitting
One thing to note is that the more hidden layers or neurons a neural network
has, the more prone it is to overfitting the training data. Overfitting refers to
the model becoming very accurate in classifying the training data, but then
performing poorly on other data. Since we want models that can generalize
well and accurately classify data it has never seen before, it is best to avoid
going overboard in adding hidden layers.
Time to Code!
The coding exercise for this chapter involves modifying the model_layers
function from the previous chapter. You will be adding an additional hidden
layer to the model, bringing the total number of hidden layers to 2.
The additional hidden layer will go directly before the 'logits' layer.
We also need to update the layer which produces the logits, so that it takes in
hidden2 as input.
Use the softmax function to convert a neural network from binary to multiclass classification.
Chapter Goals:
Update the model to use the softmax function
Perform multiclass classification
The softmax function takes in a vector of numbers (logits for each class), and
converts the numbers to a probability distribution. This means that the sum of
the probabilities across all the classes equals 1, and each class's individual
probability is based on how large its logit was relative to the sum of all the
classes' logits.
When training our model, we also replace the sigmoid cross entropy with a
softmax cross entropy, since there are now more than two possible classes. The
cross entropy is now calculated for each class and then averaged at the end.
B. Predictions
Our model's prediction now becomes the class with the highest probability.
Since we label each class with a unique index, we need to return the index
with the maximum probability. TensorFlow provides a function that lets us do
this, called tf.argmax .
Time to Code!
The coding exercises for this chapter calculates multiclass predictions from
logits, and then applies training with softmax cross entropy loss.
We'll first calculate the probabilities from logits (predefined in the backend),
using the softmax function. Then we can use the tf.argmax function to
generate the predictions.
Since our labels are one-hot vectors (predefined in the backend), we need to
convert them back to class indexes to calculate our accuracy.
# CODE HERE
From this point, the calculation of the model's accuracy (using the is_correct
variable) is the same as in Chapter 4.
For training the model, the main change we need to make is going from
sigmoid cross entropy to softmax cross entropy. In TensorFlow, softmax cross
entropy is applied via the tf.nn.softmax_cross_entropy_with_logits_v2
function.
# CODE HERE
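A possible sketch of the multiclass metrics and loss is shown below (logits
and the one-hot labels tensor are assumed to come from the backend; the cast
is included in case the labels placeholder is integer-typed, and this is not
the official solution):
probs = tf.nn.softmax(logits)
predictions = tf.argmax(probs, axis=-1)    # predicted class indexes
class_labels = tf.argmax(labels, axis=-1)  # true class indexes from one-hot labels
is_correct = tf.equal(predictions, class_labels)

labels_float = tf.cast(labels, tf.float32)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=labels_float, logits=logits)
loss = tf.reduce_mean(cross_entropy)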
Now if we run the trained 2-hidden layer MLP model on a multiclass circle
dataset, the plot will look like this:
Blue represents points that the model believes are outside both circles, green
represents points the model believes are between the two circles, and red
represents points the model believes are inside both circles.
As you can see, the multiple hidden layer MLP model performs quite well on
this basic multiclass classification task.
Quiz
What is a benefit to using more hidden layers in a neural network?
Introduction
In the Intro to Keras section, you will learn about the Keras API, a simple and
compact API for creating neural networks. You will use Keras to build a
multilayer perceptron model for multiclass classification.
While Keras is excellent for building small deep learning projects, TensorFlow
is still the preferred framework for industry-level projects since it provides
more utilities and efficient training mechanisms.
B. Multilayer perceptron
The MLP model is one of the most important neural networks for deep
learning. It is a relatively simple model, but versatile enough for a variety of
different applications. In this section, we'll be focusing on the Keras
implementation of an MLP, rather than go into details on how it works or
what it can be used for.
For specific details on the MLP model, check out the Intro to Deep Learning
section.
Sequential Model
Chapter Goals:
Initialize an MLP model in Keras
The most commonly used Keras neural network layer is the Dense layer. This
represents a fully-connected layer in the neural network, and it is the most
important building block of an MLP model.
# Sequential and Dense come from the Keras API
# (e.g. from keras.models import Sequential; from keras.layers import Dense)
model = Sequential()
layer1 = Dense(5, input_dim=4)
model.add(layer1)
layer2 = Dense(3, activation='relu')
model.add(layer2)
The Dense object takes in a single required argument, which is the number of
neurons in the fully-connected layer. The activation keyword argument
specifies the activation function for the layer (the default is no activation). In
the code snippets above, we used no activation for layer1 and ReLU
activation for layer2 .
The first layer of the Sequential model represents the input layer. Therefore,
in the first layer we need to specify the feature dimension of the input data for
the model, which we do with the input_dim keyword argument.
In the code snippets above, we set the input feature dimension to 4, meaning
that the input data has shape (batch_size, 4) (where batch_size is the data's
batch size, decided at runtime).
Time to code!
The coding exercise for this chapter involves setting up a Keras Sequential
model with a single Dense layer. We start off with an empty initialized
Sequential object.
# CODE HERE
We'll build a three layer MLP model. The first layer will consist of 5 neurons
and use ReLU activation. It will also act as the input layer for the model.
To create the input layer, we'll initialize a Dense object with the requisite
number of neurons and activation. We'll also set the input_dim keyword
argument to 2 , which represents the feature dimension of the input data for
the model.
Set layer1 equal to a Dense with 5 as the required argument, 'relu' for
the activation keyword argument, and 2 for the input_dim keyword
argument.
Then call model.add on layer1 .
# CODE HERE
Model Output
Chapter Goals:
Add the final layers to the MLP for multiclass classification
In the Intro to Deep Learning section, we built the MLP classification models
such that each model produced logits. This is because the TensorFlow cross-
entropy loss functions applied the sigmoid/softmax function to the output of
the MLP.
model = Sequential()
layer1 = Dense(5, activation='relu', input_dim=4)
model.add(layer1)
layer2 = Dense(1, activation='sigmoid')
model.add(layer2)
model = Sequential()
layer1 = Dense(5, input_dim=4)
model.add(layer1)
layer2 = Dense(3, activation='softmax')
model.add(layer2)
Creating an MLP model for multiclass classification with 3 classes (softmax activation).
Time to code!
The coding exercise will complete the Keras Sequential model that was set up
in the previous chapter. Note that the output size of the model will be 3 (there
are 3 possible classes for each data observation).
Set layer2 equal to a Dense with 5 as the required argument and 'relu'
for the activation keyword argument. Then call model.add on layer2 .
model = Sequential()
layer1 = Dense(5, activation='relu', input_dim=2)
model.add(layer1)
# CODE HERE
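For reference, the completed three layer model described above would look
something like the sketch below (it matches the model used in the next
chapter, but is not the official solution):
model = Sequential()
layer1 = Dense(5, activation='relu', input_dim=2)
model.add(layer1)
layer2 = Dense(5, activation='relu')
model.add(layer2)
layer3 = Dense(3, activation='softmax')
model.add(layer3)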
Model Configuration
Chapter Goals:
Learn how to configure a Keras model for training
When it comes to configuring the model for training, Keras shines with its
simplicity. A single call to the compile function allows us to set up all of
the training requirements for the model.
The two main keyword arguments to know are loss and metrics . The loss
keyword argument specifies the loss function to use. For binary classification,
we set the value to 'binary_crossentropy' , which is the binary cross-entropy
function. For multiclass classification, we set the value to
'categorical_crossentropy' , which is the multiclass cross-entropy function.
model = Sequential()
layer1 = Dense(5, activation='relu', input_dim=4)
model.add(layer1)
layer2 = Dense(1, activation='sigmoid')
model.add(layer2)
model.compile('adam',
loss='binary_crossentropy',
metrics=['accuracy'])
The code example above creates an MLP model for binary classification and
configures it for training. We specified classification accuracy as the metric to
track.
Time to code!
The coding exercise will complete the Keras multiclass classification model for
training.
To configure the model for training, we need to use the compile function. The
function sets up the model's loss, optimizer, and evaluation metrics.
For our model, we'll use the ADAM optimizer. Since the model performs
multiclass classification, we'll use 'categorical_crossentropy' for the loss.
The only metric we need to know during training and evaluation is the
model's classification accuracy.
model = Sequential()
layer1 = Dense(5, activation='relu', input_dim=2)
model.add(layer1)
layer2 = Dense(5, activation='relu')
model.add(layer2)
layer3 = Dense(3, activation='softmax')
model.add(layer3)
# CODE HERE
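As a sketch, the configuration step would look like the following (ADAM
optimizer, multiclass cross-entropy loss, and accuracy as the tracked metric):
model.compile('adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])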
Model Execution
Learn how to train, evaluate, and make predictions with a Keras model.
Chapter Goals:
Understand the facets of model execution for Keras models
A. Training
After configuring a Keras model for training, it only takes a single line of code
to actually perform the training. We use the Sequential model's fit function
to train the model on input data and labels.
The first two arguments of the fit function are the input data and labels,
respectively. Unlike TensorFlow (where we need to use tensor objects for any
sort of data), we can simply use NumPy arrays as the input arguments for the
fit function.
The training batch size can be specified using the batch_size keyword
argument (the default is a batch size of 32). We can also specify the number of
epochs, i.e. number of full run-throughs of the dataset during training, using
the epochs keyword argument (the default is 1 epoch).
model = Sequential()
layer1 = Dense(200, activation='relu', input_dim=4)
model.add(layer1)
layer2 = Dense(200, activation='relu')
model.add(layer2)
layer3 = Dense(3, activation='softmax')
model.add(layer3)
model.compile('adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# predefined data and labels; the batch size and epoch count are illustrative
train_output = model.fit(data, labels, batch_size=20, epochs=5)
The output of the fit function is a History object, which records the training
metrics. The object's history attribute is a dictionary that contains the metric
values at each epoch of training.
print(train_output.history)
The history for the train_output object from the previous code snippet.
B. Evaluation
Evaluating a trained Keras model is just as simple as training it. We use the
Sequential model's evaluate function, which also takes in data and labels
(NumPy arrays) as its first two arguments.
Calling the evaluate function will evaluate the model over the entire input
dataset and labels. The function returns a list containing the evaluation loss as
well as the values for each metric specified during model configuration.
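As a minimal sketch (eval_data and eval_labels below are hypothetical NumPy
arrays with the same format as the training data and labels):
results = model.evaluate(eval_data, eval_labels)
print(results)  # [loss, accuracy]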
C. Predictions
Finally, we can make predictions with a Keras model using the predict
function. The function takes in a NumPy array dataset as its required
argument, which represents the data observations that the model will make
predictions for.
The output of the predict function is the output of the model. That means for
classification, the predict function returns the model's class probabilities
for each data observation.
# new_data: 3 new data observations (predefined in the backend)
print('{}'.format(repr(model.predict(new_data))))
In the code above, the model returned class probabilities for the 3 new data
observations. Based on the probabilities, the first observation would be
classified as class 0, the second observation would be classified as class 2, and
the third observation would be classified as class 1.
Quiz
What is the main benefit of Keras over TensorFlow?
Course Conclusion
A. Closing thoughts
Machine Learning (ML) is one of the hottest fields in technology right now,
and is inherently relevant to artificial intelligence and data science. From
healthcare and agriculture to manufacturing and retail, many companies
across a variety of different industries are leveraging these technologies to get
ahead.
B. Course recap
This course has provided an overview of how to write useful code and
impactful machine learning applications. You’ll be able to take the practical
lessons and actionable insights from this course and apply them to your
projects.
C. Topics covered
Data analysis/visualization
Feature engineering
Supervised learning
Unsupervised learning
Deep learning