Machine Learning with
Python Cookbook
SECOND EDITION
Practical Solutions from Preprocessing to Deep Learning

With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.

Kyle Gallatin and Chris Albon


Machine Learning with Python Cookbook
by Kyle Gallatin and Chris Albon
Copyright © 2023 Kyle Gallatin. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Interior Designer: David Futato
Cover Designer: Karen Montgomery
April 2018: First Edition
October 2023: Second Edition
Revision History for the Early Release
2022-08-24: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098135720 for
release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Machine Learning with Python Cookbook, the cover image, and
related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do
not represent the publisher’s views. While the publisher and the
authors have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher
and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from
the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is
subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof
complies with such licenses and/or rights.
978-1-098-13566-9
Chapter 1. Working with
Vectors, Matrices and Arrays
in NumPy

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the authors’
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 1st chapter of the final book.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the authors at [email protected].

1.0 Introduction
NumPy is a foundational tool of the Python machine learning stack.
NumPy allows for efficient operations on the data structures often
used in machine learning: vectors, matrices, and tensors. While
NumPy is not the focus of this book, it will show up frequently
throughout the following chapters. This chapter covers the most
common NumPy operations we are likely to run into while working
on machine learning workflows.

1.1 Creating a Vector

Problem
You need to create a vector.
Solution
Use NumPy to create a one-dimensional array:

# Load library
import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

Discussion
NumPy’s main data structure is the multidimensional array. A vector
is just an array with a single dimension. In order to create a vector,
we simply create a one-dimensional array. Just like vectors, these
arrays can be represented horizontally (i.e., rows) or vertically (i.e.,
columns).

See Also
Vectors, Math Is Fun
Euclidean vector, Wikipedia

1.2 Creating a Matrix

Problem
You need to create a matrix.

Solution
Use NumPy to create a two-dimensional array:
# Load library
import numpy as np

# Create a matrix
matrix = np.array([[1, 2],
[1, 2],
[1, 2]])

Discussion
To create a matrix we can use a NumPy two-dimensional array. In
our solution, the matrix contains three rows and two columns (a
column of 1s and a column of 2s).
NumPy actually has a dedicated matrix data structure:

matrix_object = np.mat([[1, 2],
                        [1, 2],
                        [1, 2]])

matrix([[1, 2],
        [1, 2],
        [1, 2]])

However, the matrix data structure is not recommended for two
reasons. First, arrays are the de facto standard data structure of
NumPy. Second, the vast majority of NumPy operations return
arrays, not matrix objects.

See Also
Matrix, Wikipedia
Matrix, Wolfram MathWorld
1.3 Creating a Sparse Matrix

Problem
Given data with very few nonzero values, you want to efficiently
represent it.

Solution
Create a sparse matrix:

# Load libraries
import numpy as np
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
[0, 1],
[3, 0]])

# Create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

Discussion
A frequent situation in machine learning is having a huge amount of
data; however, most of the elements in the data are zeros. For
example, imagine a matrix where the columns are every movie on
Netflix, the rows are every Netflix user, and the values are how many
times a user has watched that particular movie. This matrix would
have tens of thousands of columns and millions of rows! However,
since most users do not watch most movies, the vast majority of
elements would be zero.
A sparse matrix is a matrix in which most elements are 0. Sparse
matrices only store nonzero elements and assume all other values
will be zero, leading to significant computational savings. In our
solution, we created a NumPy array with two nonzero values, then
converted it into a sparse matrix. If we view the sparse matrix we
can see that only the nonzero values are stored:

# View sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

There are a number of types of sparse matrices. However, in
compressed sparse row (CSR) matrices, (1, 1) and (2, 0)
represent the (zero-indexed) indices of the non-zero values 1 and 3,
respectively. For example, the element 1 is in the second row and
second column. We can see the advantage of sparse matrices if we
create a much larger matrix with many more zero elements and then
compare this larger matrix with our original sparse matrix:

# Create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

# View larger sparse matrix
print(matrix_large_sparse)

(1, 1) 1
(2, 0) 3
As we can see, despite the fact that we added many more zero
elements in the larger matrix, its sparse representation is exactly the
same as our original sparse matrix. That is, the addition of zero
elements did not change the size of the sparse matrix.
As mentioned, there are many different types of sparse matrices,
such as compressed sparse column, list of lists, and dictionary of
keys. While an explanation of the different types and their
implications is outside the scope of this book, it is worth noting that
while there is no “best” sparse matrix type, there are meaningful
differences between them and we should be conscious about why
we are choosing one type over another.
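As a brief sketch of why the choice matters (format names follow the scipy.sparse documentation), a dictionary-of-keys (DOK) matrix is convenient for building a matrix incrementally, and can then be converted to CSR for efficient arithmetic:

```python
# Load library
from scipy import sparse

# Build a matrix incrementally using the dictionary-of-keys (DOK) format
matrix_dok = sparse.dok_matrix((3, 10))
matrix_dok[1, 1] = 1
matrix_dok[2, 0] = 3

# Convert to CSR once construction is finished, for fast row operations
matrix_csr = matrix_dok.tocsr()

# The stored nonzero elements are unchanged by the conversion
print(matrix_csr.nnz)
```

The values here mirror the earlier solution; only the construction path differs.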

See Also
Sparse matrices, SciPy documentation
101 Ways to Store a Sparse Matrix

1.4 Pre-allocating NumPy Arrays

Problem
You need to pre-allocate arrays of a given size with some value.

Solution
NumPy has functions for generating vectors and matrices of any size
using 0s, 1s, or values of your choice.

# Load library
import numpy as np

# Generate a vector of shape (5,) containing all zeros
vector = np.zeros(shape=5)

# View the vector
print(vector)

[0. 0. 0. 0. 0.]

# Generate a matrix of shape (3,3) containing all ones
matrix = np.full(shape=(3,3), fill_value=1)

# View the matrix
print(matrix)

[[1 1 1]
 [1 1 1]
 [1 1 1]]

Discussion
Generating arrays prefilled with data is useful for a number of
purposes, such as making code more performant or having synthetic
data to test algorithms with. In many programming languages, pre-
allocating an array of default values (such as 0s) is considered
common practice.
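For completeness, a quick sketch of the other common pre-allocation helpers mentioned in the solution (np.ones, and np.full with a custom fill value; the values here are just illustrative):

```python
# Load library
import numpy as np

# Generate a vector of shape (5,) containing all ones
ones_vector = np.ones(shape=5)

# Generate a matrix of shape (2,2) filled with the value 7
sevens_matrix = np.full(shape=(2,2), fill_value=7)

print(ones_vector)
print(sevens_matrix)
```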

1.5 Selecting Elements

Problem
You need to select one or more elements in a vector or matrix.

Solution
NumPy’s arrays make it easy to select elements in vectors or
matrices:

# Load library
import numpy as np

# Create row vector


vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Select third element of vector
vector[2]

3

# Select second row, second column
matrix[1,1]

5

Discussion
Like most things in Python, NumPy arrays are zero-indexed, meaning
that the index of the first element is 0, not 1. With that caveat,
NumPy offers a wide variety of methods for selecting (i.e., indexing
and slicing) elements or groups of elements in arrays:

# Select all elements of a vector
vector[:]

array([1, 2, 3, 4, 5, 6])

# Select everything up to and including the third element
vector[:3]

array([1, 2, 3])

# Select everything after the third element
vector[3:]

array([4, 5, 6])

# Select the last element
vector[-1]

6

# Reverse the vector
vector[::-1]

array([6, 5, 4, 3, 2, 1])

# Select the first two rows and all columns of a matrix
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

# Select all rows and the second column
matrix[:,1:2]

array([[2],
       [5],
       [8]])

1.6 Describing a Matrix

Problem
You want to describe the shape, size, and dimensions of the matrix.

Solution
Use the shape, size, and ndim attributes of a NumPy object:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])

# View number of rows and columns
matrix.shape

(3, 4)

# View number of elements (rows * columns)
matrix.size

12

# View number of dimensions
matrix.ndim

2

Discussion
This might seem basic (and it is); however, time and again it will be
valuable to check the shape and size of an array both for further
calculations and simply as a gut check after some operation.
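As a minimal sketch of that gut check (the reshape here is only an illustrative operation), we can assert the shape we expect:

```python
# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# After an operation, confirm the result has the shape we expect
reshaped = matrix.reshape(2, 6)
assert reshaped.shape == (2, 6)
assert reshaped.size == matrix.size
```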

1.7 Applying Functions Over Each Element

Problem
You want to apply some function to all elements in an array.

Solution
Use NumPy’s vectorize method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Create function that adds 100 to something
add_100 = lambda i: i + 100

# Create vectorized function
vectorized_add_100 = np.vectorize(add_100)

# Apply function to all elements in matrix
vectorized_add_100(matrix)

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

Discussion
NumPy’s vectorize class converts a function into a function that
can apply to all elements in an array or slice of an array. It’s worth
noting that vectorize is essentially a for loop over the elements
and does not increase performance. Furthermore, NumPy arrays
allow us to perform operations between arrays even if their
dimensions are not the same (a process called broadcasting). For
example, we can create a much simpler version of our solution using
broadcasting:

# Add 100 to all elements
matrix + 100

array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])
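Broadcasting is not limited to scalars. As a quick sketch, a one-dimensional array whose length matches the number of columns is added to every row:

```python
# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Add a vector to every row of the matrix via broadcasting
matrix + np.array([100, 200, 300])
```

array([[101, 202, 303],
       [104, 205, 306],
       [107, 208, 309]])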
1.8 Finding the Maximum and Minimum Values

Problem
You need to find the maximum or minimum value in an array.

Solution
Use NumPy’s max and min methods:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Return maximum element
np.max(matrix)

9

# Return minimum element
np.min(matrix)

1

Discussion
Often we want to know the maximum and minimum value in an
array or subset of an array. This can be accomplished with the max
and min methods. Using the axis parameter we can also apply the
operation along a certain axis:

# Find maximum element in each column
np.max(matrix, axis=0)

array([7, 8, 9])

# Find maximum element in each row
np.max(matrix, axis=1)

array([3, 6, 9])

1.9 Calculating the Average, Variance, and Standard Deviation

Problem
You want to calculate some descriptive statistics about an array.

Solution
Use NumPy’s mean, var, and std:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Return mean
np.mean(matrix)

5.0

# Return variance
np.var(matrix)

6.666666666666667

# Return standard deviation
np.std(matrix)
2.5819888974716112

Discussion
Just like with max and min, we can easily get descriptive statistics
about the whole matrix or do calculations along a single axis:

# Find the mean value in each column
np.mean(matrix, axis=0)

array([ 4., 5., 6.])

1.10 Reshaping Arrays

Problem
You want to change the shape (number of rows and columns) of an
array without changing the element values.

Solution
Use NumPy’s reshape:

# Load library
import numpy as np

# Create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])
Discussion
reshape allows us to restructure an array so that we maintain the
same data but it is organized as a different number of rows and
columns. The only requirement is that the shape of the original and
new matrix contain the same number of elements (i.e., the same
size). We can see the size of a matrix using size:

matrix.size

12

One useful argument in reshape is -1, which effectively means “as
many as needed,” so reshape(1, -1) means one row and as many
columns as needed:

matrix.reshape(1, -1)

array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a 1D array of
that length:

matrix.reshape(12)

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

1.11 Transposing a Vector or Matrix

Problem
You need to transpose a vector or matrix.
Solution
Use the T attribute:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Transpose matrix
matrix.T

array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])

Discussion
Transposing is a common operation in linear algebra where the
column and row indices of each element are swapped. One nuanced
point that is typically overlooked outside of a linear algebra class is
that, technically, a vector cannot be transposed because it is just a
collection of values:

# Transpose vector
np.array([1, 2, 3, 4, 5, 6]).T

array([1, 2, 3, 4, 5, 6])

However, it is common to refer to transposing a vector as converting
a row vector to a column vector (notice the second pair of brackets)
or vice versa:

# Transpose row vector
np.array([[1, 2, 3, 4, 5, 6]]).T
array([[1],
[2],
[3],
[4],
[5],
[6]])

1.12 Flattening a Matrix

Problem
You need to transform a matrix into a one-dimensional array.

Solution
Use flatten:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Flatten matrix
matrix.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Discussion
flatten is a simple method to transform a matrix into a one-
dimensional array. Alternatively, we can use reshape to create a
row vector:

matrix.reshape(1, -1)
array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

One more common method to flatten arrays is ravel. Unlike
flatten, which always returns a copy of the original array, ravel
returns a view of the original data whenever possible and is
therefore slightly faster. It also lets us flatten lists of arrays, which
we can’t do with the flatten method. This operation is useful for
flattening very large arrays and speeding up code.

# Create one matrix
matrix_a = np.array([[1, 2],
                     [3, 4]])

# Create a second matrix
matrix_b = np.array([[5, 6],
                     [7, 8]])

# Create a list of matrices
matrix_list = [matrix_a, matrix_b]

# Flatten the entire list of matrices
np.ravel(matrix_list)

array([1, 2, 3, 4, 5, 6, 7, 8])

1.13 Finding the Rank of a Matrix

Problem
You need to know the rank of a matrix.

Solution
Use NumPy’s linear algebra method matrix_rank:

# Load library
import numpy as np
# Create matrix
matrix = np.array([[1, 1, 1],
[1, 1, 10],
[1, 1, 15]])

# Return matrix rank
np.linalg.matrix_rank(matrix)

2

Discussion
The rank of a matrix is the dimensions of the vector space spanned
by its columns or rows. Finding the rank of a matrix is easy in
NumPy thanks to matrix_rank.

See Also
The Rank of a Matrix, CliffsNotes

1.14 Getting the Diagonal of a Matrix

Problem
You need to get the diagonal elements of a matrix.

Solution
Use diagonal:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[2, 4, 6],
[3, 8, 9]])
# Return diagonal elements
matrix.diagonal()

array([1, 4, 9])

Discussion
NumPy makes getting the diagonal elements of a matrix easy with
diagonal. It is also possible to get a diagonal off from the main
diagonal by using the offset parameter:

# Return diagonal one above the main diagonal
matrix.diagonal(offset=1)

array([2, 6])

# Return diagonal one below the main diagonal
matrix.diagonal(offset=-1)

array([2, 8])

1.15 Calculating the Trace of a Matrix

Problem
You need to calculate the trace of a matrix.

Solution
Use trace:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[2, 4, 6],
[3, 8, 9]])
# Return trace
matrix.trace()

14

Discussion
The trace of a matrix is the sum of the diagonal elements and is
often used under the hood in machine learning methods. Given a
NumPy multidimensional array, we can calculate the trace using
trace. We can also return the diagonal of a matrix and calculate its
sum:

# Return diagonal and sum elements
sum(matrix.diagonal())

14

See Also
The Trace of a Square Matrix

1.16 Calculating Dot Products

Problem
You need to calculate the dot product of two vectors.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create two vectors
vector_a = np.array([1,2,3])
vector_b = np.array([4,5,6])

# Calculate dot product
np.dot(vector_a, vector_b)

32

Discussion
The dot product of two vectors, a and b, is defined as:

a · b = a₁b₁ + a₂b₂ + ⋯ + aₙbₙ

where aᵢ is the ith element of vector a. We can use NumPy’s dot
function to calculate the dot product. Alternatively, in Python 3.5+
we can use the new @ operator:

# Calculate dot product
vector_a @ vector_b

32

See Also
Vector dot product and vector length, Khan Academy
Dot Product, Paul’s Online Math Notes

1.17 Adding and Subtracting Matrices

Problem
You want to add or subtract two matrices.
Solution
Use NumPy’s add and subtract:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1, 1],
[1, 1, 1],
[1, 1, 2]])

# Create matrix
matrix_b = np.array([[1, 3, 1],
[1, 3, 1],
[1, 3, 8]])

# Add two matrices
np.add(matrix_a, matrix_b)

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])

# Subtract two matrices
np.subtract(matrix_a, matrix_b)

array([[ 0, -2,  0],
       [ 0, -2,  0],
       [ 0, -2, -6]])

Discussion
Alternatively, we can simply use the + and - operators:

# Add two matrices
matrix_a + matrix_b

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])
1.18 Multiplying Matrices

Problem
You want to multiply two matrices.

Solution
Use NumPy’s dot:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1],
[1, 2]])

# Create matrix
matrix_b = np.array([[1, 3],
[1, 2]])

# Multiply two matrices
np.dot(matrix_a, matrix_b)

array([[2, 5],
[3, 7]])

Discussion
Alternatively, in Python 3.5+ we can use the @ operator:

# Multiply two matrices
matrix_a @ matrix_b

array([[2, 5],
[3, 7]])

If we want to do element-wise multiplication, we can use the *
operator:

# Multiply two matrices element-wise
matrix_a * matrix_b

array([[1, 3],
[1, 4]])

See Also
Array vs. Matrix Operations, MathWorks

1.19 Inverting a Matrix

Problem
You want to calculate the inverse of a square matrix.

Solution
Use NumPy’s linear algebra inv method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 4],
[2, 5]])

# Calculate inverse of matrix
np.linalg.inv(matrix)

array([[-1.66666667, 1.33333333],
[ 0.66666667, -0.33333333]])

Discussion
The inverse of a square matrix, A, is a second matrix, A⁻¹, such that:

A A⁻¹ = I

where I is the identity matrix. In NumPy we can use linalg.inv
to calculate A⁻¹ if it exists. To see this in action, we can multiply a
matrix by its inverse and the result is the identity matrix:

# Multiply matrix and its inverse
matrix @ np.linalg.inv(matrix)

array([[ 1., 0.],
       [ 0., 1.]])

See Also
Inverse of a Matrix

1.20 Generating Random Values

Problem
You want to generate pseudorandom values.

Solution
Use NumPy’s random:

# Load library
import numpy as np

# Set seed
np.random.seed(0)

# Generate three random floats between 0.0 and 1.0
np.random.random(3)

array([ 0.5488135 , 0.71518937, 0.60276338])
Discussion
NumPy offers a wide variety of means to generate random numbers,
many more than can be covered here. In our solution we generated
floats; however, it is also common to generate integers:

# Generate three random integers between 0 and 10
np.random.randint(0, 11, 3)

array([3, 7, 9])

Alternatively, we can generate numbers by drawing them from a
distribution:

# Draw three numbers from a normal distribution with mean 0.0
# and standard deviation of 1.0
np.random.normal(0.0, 1.0, 3)

array([-1.42232584, 1.52006949, -0.29139398])

# Draw three numbers from a logistic distribution with mean 0.0
# and scale of 1.0
np.random.logistic(0.0, 1.0, 3)

array([-0.98118713, -0.08939902, 1.46416405])

# Draw three numbers greater than or equal to 1.0 and less than 2.0
np.random.uniform(1.0, 2.0, 3)

array([ 1.47997717, 1.3927848 , 1.83607876])

Finally, it can sometimes be useful to return the same random
numbers multiple times to get predictable, repeatable results. We
can do this by setting the “seed” (an integer) of the pseudorandom
generator. Random processes with the same seed will always
produce the same output. We will use seeds throughout this book so
that the code you see in the book and the code you run on your
computer produces the same results.
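A minimal sketch of this behavior: resetting the seed before each draw yields identical numbers (the seed value 42 here is arbitrary):

```python
# Load library
import numpy as np

# Draw three numbers after setting the seed
np.random.seed(42)
first_draw = np.random.random(3)

# Reset the same seed and draw again
np.random.seed(42)
second_draw = np.random.random(3)

# The two draws are identical
print(np.array_equal(first_draw, second_draw))
```

True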
Chapter 2. Loading Data

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the authors’
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
This will be the 2nd chapter of the final book.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the authors at [email protected].

2.0 Introduction
The first step in any machine learning endeavor is to get the raw
data into our system. The raw data might be a logfile, dataset file,
database, or cloud blob store such as Amazon S3. Furthermore,
often we will want to retrieve data from multiple sources.
The recipes in this chapter look at methods of loading data from a
variety of sources, including CSV files and SQL databases. We also
cover methods of generating simulated data with desirable
properties for experimentation. Finally, while there are many ways to
load data in the Python ecosystem, we will focus on using the
pandas library’s extensive set of methods for loading external data,
and using scikit-learn—an open source machine learning library in
Python—for generating simulated data.
2.1 Loading a Sample Dataset

Problem
You want to load a preexisting sample dataset from the scikit-learn
library.

Solution
scikit-learn comes with a number of popular datasets for you to use:

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
Discussion
Often we do not want to go through the work of loading,
transforming, and cleaning a real-world dataset before we can
explore some machine learning algorithm or method. Luckily, scikit-
learn comes with some common datasets we can quickly load. These
datasets are often called “toy” datasets because they are far smaller
and cleaner than a dataset we would see in the real world. Some
popular sample datasets in scikit-learn are:
load_boston
Contains 506 observations on Boston housing prices. It is a good
dataset for exploring regression algorithms.

load_iris
Contains 150 observations on the measurements of Iris flowers.
It is a good dataset for exploring classification algorithms.

load_digits
Contains 1,797 observations from images of handwritten digits. It
is a good dataset for teaching image classification.
To see more details on any of the datasets above, you can print the
DESCR attribute:

# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Print the attribute
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**

:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
...

See Also
scikit-learn toy datasets
The Digit Dataset

2.2 Creating a Simulated Dataset

Problem
You need to generate a dataset of simulated data.

Solution
scikit-learn offers many methods for creating simulated data. Of
those, three methods are particularly useful: make_regression,
make_classification, and make_blobs.
When we want a dataset designed to be used with linear regression,
make_regression is a good choice:

# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ 1.29322588 -0.61736206 -0.11044703]
[-2.793085 0.36633201 1.93752881]
[ 0.80186103 -0.18656977 0.0465673 ]]
Target Vector
[-10.37865986 25.5124503 19.67705609]

If we are interested in creating a simulated dataset for classification,
we can use make_classification:

# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])
Feature Matrix
[[ 1.06354768 -1.42632219 1.02163151]
[ 0.23156977 1.49535261 0.33251578]
[ 0.15972951 0.83533515 -0.40869554]]
Target Vector
[1 0 0]

Finally, if we want a dataset designed to work well with clustering
techniques, scikit-learn offers make_blobs:

# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ -1.22685609 3.25572052]
[ -9.57463218 -4.38310652]
[-10.71976941 -4.20558148]]
Target Vector
[0 1 1]

Discussion
As might be apparent from the solutions, make_regression
returns a feature matrix of float values and a target vector of float
values, while make_classification and make_blobs return a
feature matrix of float values and a target vector of integers
representing membership in a class.
scikit-learn’s simulated datasets offer extensive options to control the
type of data generated. scikit-learn’s documentation contains a full
description of all the parameters, but a few are worth noting.
In make_regression and make_classification,
n_informative determines the number of features that are used
to generate the target vector. If n_informative is less than the
total number of features (n_features), the resulting dataset will
have redundant features that can be identified through feature
selection techniques.
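For example (a minimal sketch, not from the original text), calling make_regression with coef = True shows this directly: the true coefficients of uninformative features are exactly zero:

```python
from sklearn.datasets import make_regression

# Five features, only two of which actually drive the target
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 5,
                                                 n_informative = 2,
                                                 coef = True,
                                                 random_state = 1)

# Three of the five true coefficients are zero
print(coefficients)
```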
In addition, make_classification contains a weights
parameter that allows us to simulate datasets with imbalanced
classes. For example, weights = [.25, .75] would return a
dataset with 25% of observations belonging to one class and 75% of
observations belonging to a second class.
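We can confirm this (a quick sketch using the same call as in the solution above) by counting class membership with NumPy's bincount:

```python
import numpy as np
from sklearn.datasets import make_classification

# Recreate the imbalanced dataset from the solution
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# Count the observations in each class — roughly a 25/75 split
print(np.bincount(target))
```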
For make_blobs, the centers parameter determines the number
of clusters generated. Using the matplotlib visualization library,
we can visualize the clusters generated by make_blobs:

# Load library
import matplotlib.pyplot as plt

# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()
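If we also want the coordinates of each true cluster center, make_blobs accepts a return_centers parameter (available in recent scikit-learn versions):

```python
from sklearn.datasets import make_blobs

# Generate blobs and also return the true cluster centers
features, target, centers = make_blobs(n_samples = 100,
                                       n_features = 2,
                                       centers = 3,
                                       cluster_std = 0.5,
                                       return_centers = True,
                                       random_state = 1)

# One row of coordinates per cluster
print(centers.shape)  # prints: (3, 2)
```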
See Also
make_regression documentation
make_classification documentation
make_blobs documentation

2.3 Loading a CSV File

Problem
You need to import a comma-separated values (CSV) file.

Solution
Use the pandas library’s read_csv to load a local or hosted CSV
file:

# Load library
import pandas as pd

# Create URL (hypothetical placeholder — substitute any local path or hosted CSV)
url = 'https://example.com/data.csv'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)