
Python Codes

Chapter 2

Example            Description
len(list)          Finds the length of the list
list.append(x)     Adds x to the end of the list
list.pop(x)        Removes the element with an index of x and returns that element
list.remove(x)     Removes x from the list
list1 + list2      Concatenates the two lists

 list = [10, 'abc'] creates a list with elements 10 and 'abc'.

 list = [] creates an empty list.

 len(tuple) returns the number of elements in tuple.

 tuple1 + tuple2 returns a tuple consisting of tuple1 followed by tuple2.

 set = {33, 4,'abc'} creates a set of three elements.

 {33, 4,'abc'} and {'abc', 4, 33} are the same set, since sets are not ordered.

 dict = {'LAX': 161, 'DEN': 141} creates a dictionary with keys 'LAX' and 'DEN' and
values 161 and 141.

 dict = {} creates an empty dictionary.

 del dict['Sofia'] removes the element with key 'Sofia'.

 dict['Rajesh'] = 'A+' either changes the value of an existing 'Rajesh' element to 'A+' or adds a new 'Rajesh' element.
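
A short illustrative sketch (the variable names and values are made up, not from the notes above) tying these list, tuple, set, and dictionary operations together:

airports = ['LAX', 'DEN']
airports.append('JFK')          # ['LAX', 'DEN', 'JFK']
airports.remove('DEN')          # ['LAX', 'JFK']
first = airports.pop(0)         # first is 'LAX'; airports is now ['JFK']
print(len(airports))            # 1
print(airports + ['SFO'])       # ['JFK', 'SFO']

coords = (33.9, -118.4)         # tuple
print(len(coords))              # 2

codes = {33, 4, 'abc'}          # set; element order is not significant
print(codes == {'abc', 4, 33})  # True

grades = {'Sofia': 'B', 'Rajesh': 'A'}
grades['Rajesh'] = 'A+'         # change an existing element
del grades['Sofia']             # remove an element by key
print(grades)                   # {'Rajesh': 'A+'}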

Functions example:

def calcPizzaVolume(pizzaDiameter, pizzaHeight):
    piVal = 3.14159265
    pizzaRadius = pizzaDiameter / 2.0
    pizzaArea = piVal * pizzaRadius * pizzaRadius
    pizzaVolume = pizzaArea * pizzaHeight
    return pizzaVolume

print('12.0 x 0.3 inch pizza is', calcPizzaVolume(12.0, 0.3), 'cubic inches')
print('16.0 x 0.8 inch pizza is', calcPizzaVolume(16.0, 0.8), 'cubic inches')

Example 2:

def changeName():
    employeeName = 'Juliet'

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)

PRINTS: "Employee name: Romeo"

def changeName():
    global employeeName
    employeeName = 'Juliet'

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)

PRINTS: "Employee name: Juliet"

Without global, the assignment inside changeName() creates a local variable and leaves the module-level employeeName ('Romeo') unchanged; with global, changeName() overwrites the module-level variable with 'Juliet'.

# Define function that prints full name
def printName(first, last, lastFirst=False):
    if lastFirst:
        print(last + ', ' + first)
    if not lastFirst:
        print(first + ' ' + last)

# Call with keyword arguments
printName(first='Dana', last='Patel', lastFirst=True)

2.4: Data science packages


Import name (common alias): Description

numpy (np): NumPy includes functions and classes that aid in numerical computation. NumPy is used in many other data science packages.
pandas (pd): pandas provides methods and classes for tabular and time-series data.
sklearn (sk): scikit-learn provides implementations of many machine learning algorithms with a uniform syntax for preprocessing data, specifying models, fitting models with cross-validation, and assessing models.
matplotlib.pyplot (plt): matplotlib allows the creation of data visualizations in Python. The functions mostly expect NumPy arrays.
seaborn (sns): seaborn also allows the creation of data visualizations but works better with pandas DataFrames.
scipy.stats (sp.stats): SciPy provides algorithms and functions for computing problems that arise in science, engineering, and statistics. scipy.stats provides the functions for statistics.
statsmodels (sm): statsmodels adds functionality to Python to estimate many different kinds of statistical models, make inferences from those models, and explore data.

# Importing everything at the top of a notebook in this style prevents running much of the notebook only to find that a package still needs to be installed.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

2.5: NumPy package


Array functions

 NumPy functions are written with the prefix 'numpy' or an alias. The tables omit
this prefix. Ex: sort(array) stands for numpy.sort(array).

array(object, dtype=None, ndmin=0): Returns an array constructed from object. object must be a scalar or an ordered container, such as a tuple or list. The array element type is inferred from object unless a dtype is specified. ndmin is the minimum number of array dimensions.
delete(arr, obj, axis=None): Deletes a slice of input array arr. axis is the axis along which to remove a slice. obj is the index of the slice along the axis.
full(shape, fill_value, dtype=None): Returns an array filled with fill_value. The shape tuple specifies array shape. dtype specifies the array type. If dtype=None, the type is inferred from fill_value.
insert(arr, obj, values, axis=None): Inserts values into input array arr. axis is the axis along which to insert. obj is the index before which values is inserted.
zeros(shape, dtype=float): Returns an array filled with zeros. The shape tuple specifies array shape. dtype specifies the array type.
ones(shape, dtype=None): Returns an array filled with ones. The shape tuple specifies array shape. dtype specifies the array type. If dtype=None, the type is float64.
sort(a, axis=-1): Sorts array a along axis. The default axis=-1 sorts along the last axis in a. axis=None flattens a before sorting.
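
A minimal sketch exercising the array functions above (the values are arbitrary):

import numpy as np

a = np.array([3, 1, 2], dtype=float)  # element type set by dtype
z = np.zeros((2, 3))                  # 2x3 array of 0.0
o = np.ones(4, dtype=int)             # [1 1 1 1]
f = np.full((2, 2), 7)                # 2x2 array filled with 7

b = np.insert(a, 1, 99.0)             # insert before index 1: [ 3. 99.  1.  2.]
c = np.delete(b, 0)                   # delete index 0: [99.  1.  2.]
print(np.sort(c))                     # [ 1.  2. 99.]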

Shape functions

ravel(a, order='C'): Returns flattened array a.
reshape(a, newshape, order='C'): Returns an array with the same data as a but a different shape. newshape is an integer or tuple of integers that specifies the new shape. The new shape must have the same number of elements as the original shape.
resize(a, newshape): Returns an array with the same data as a but a different shape. newshape is an integer or tuple of integers that specifies the new shape. The new and original arrays may have a different number of elements. If the new array is larger than the original array, then the new array is filled with repeated copies of a.
transpose(a): Returns a transposed copy of array a. Zero- and one-dimensional arrays are not changed. Equivalent to the attribute array.T.

Variable array is assigned with [ [1, 2, 3, 4], [5, 6, 7, 8] ].

reshape(array, (2, 2, 2))   [ [ [1, 2], [3, 4] ], [ [5, 6], [7, 8] ] ]
ravel(array, order='F')     [1, 5, 2, 6, 3, 7, 4, 8]
transpose(array)            [ [1, 5], [2, 6], [3, 7], [4, 8] ]
ravel(array, order='C')     [1, 2, 3, 4, 5, 6, 7, 8]
resize(array, (2, 5))       [ [1, 2, 3, 4, 5], [6, 7, 8, 1, 2] ]
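
The same results can be reproduced directly; a small sketch verifying the table above:

import numpy as np

array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(np.reshape(array, (2, 2, 2)))  # [[[1 2] [3 4]] [[5 6] [7 8]]]
print(np.ravel(array, order='F'))    # [1 5 2 6 3 7 4 8]
print(np.transpose(array))           # [[1 5] [2 6] [3 7] [4 8]]
print(np.ravel(array, order='C'))    # [1 2 3 4 5 6 7 8]
print(np.resize(array, (2, 5)))      # [[1 2 3 4 5] [6 7 8 1 2]]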

Math operator and function examples.

Arithmetic operators:
array1 + array2: Element-wise addition
array1 - array2: Element-wise subtraction
array1 * array2: Element-wise multiplication
array1 / array2: Element-wise division

Simple functions:
sqrt(array1): Square root of array elements
log(array1): Logarithm of array elements
sin(array1): Sine of array elements

Aggregate functions:
max(array1): Maximum of array elements
median(array1): Median of array elements
std(array1): Standard deviation of array elements
var(array1): Variance of array elements

Matrix functions:
dot(array1, array2): Dot product of array1 rows with array2 columns
matmul(array1, array2): Matrix product of array1 and array2
cross(array1, array2): Cross product of array1 and array2
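
A short sketch of the operators and functions above on made-up arrays:

import numpy as np

array1 = np.array([1.0, 4.0, 9.0])
array2 = np.array([2.0, 2.0, 3.0])

print(array1 + array2)         # element-wise addition: [ 3.  6. 12.]
print(array1 * array2)         # element-wise multiplication: [ 2.  8. 27.]
print(np.sqrt(array1))         # [1. 2. 3.]
print(np.max(array1))          # 9.0
print(np.median(array1))       # 4.0
print(np.dot(array1, array2))  # 1*2 + 4*2 + 9*3 = 37.0

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
print(np.matmul(m1, m2))       # matrix product: [[19 22] [43 50]]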

2.6 pandas package


Slice notation.

a:b  index values from a to b-1
:b   index values before b
a:   index values from a onwards
Comparison operators.

==  Outputs True if the two operands are equal.
!=  Outputs True if the two operands are not equal.
>   Outputs True if the left operand is greater than the right operand.
>=  Outputs True if the left operand is greater than or equal to the right operand.
<   Outputs True if the left operand is less than the right operand.
<=  Outputs True if the left operand is less than or equal to the right operand.

Logical operators.

&  Outputs True if the two operands are both True.
|  Outputs True if at least one of the operands is True.
~  Outputs the opposite truth value of the expression.
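
In pandas, comparison operators produce Boolean Series, the logical operators &, |, and ~ combine them, and slice notation selects rows by position. A small sketch with a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'airport': ['LAX', 'DEN', 'JFK', 'SFO'],
                   'delay':   [161, 141, 90, 200]})

longDelay = df['delay'] > 150                                 # Boolean Series
subset = df[(df['delay'] > 100) & ~(df['airport'] == 'DEN')]  # combine conditions
print(subset)                                                 # LAX and SFO rows

print(df[1:3])                                                # rows 1 and 2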

Example dataframe methods.

at[index, column]: Returns the dataframe value stored at index and column.
drop(labels=None, axis=0, inplace=False): Removes rows (axis=0) or columns (axis=1) from dataframe. labels specifies the labels of rows or columns to drop.
drop_duplicates(subset=None, inplace=False): Removes duplicate rows from dataframe. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
dropna(axis=0, how='any', subset=None, inplace=False): Removes rows (axis=0) or columns (axis=1) containing missing values from dataframe. subset specifies labels on the opposite axis to consider for missing values. how indicates whether to drop the row or column if any or if all values are missing.
insert(loc, column, value): Inserts a column into dataframe. loc specifies the integer position of the new column. column specifies a string or numeric column label. value specifies column values as a Scalar or Series.
replace(to_replace=None, value=NoDefault.no_default, inplace=False): Replaces to_replace values in dataframe with value. to_replace and value may be string, dictionary, list, regular expressions, or other data types.
sort_values(by, axis=0, ascending=True, inplace=False): Sorts dataframe columns or rows. by specifies indexes or labels on which to sort. axis specifies whether to sort rows (0) or columns (1). ascending specifies whether to sort ascending or descending. inplace specifies whether to sort dataframe or return a new dataframe.
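
A brief sketch applying several of these dataframe methods to a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'city': ['Austin', 'Dallas', 'Dallas'],
                   'sales': [120, 95, 95]})

print(df.at[0, 'city'])                  # 'Austin'
df = df.drop_duplicates()                # remove the repeated Dallas row
df = df.sort_values(by='sales')          # sort rows by the sales column
df.insert(1, 'state', ['TX', 'TX'])      # insert a column at position 1
df = df.replace('Dallas', 'DFW')         # replace matching values
df = df.drop(labels='state', axis=1)     # drop the column again
print(df)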

2.7 matplotlib package


import matplotlib.pyplot as plt

 plt.figure(): creates a new figure.
 plt.show(): displays the figure and all the objects the figure contains.
 plt.savefig(fname): saves the figure in the current working directory with the filename fname.
 plt.title(): adds a title to a figure.
 plt.xlabel(): adds text for the x-axis.
 plt.ylabel(): adds text for the y-axis.
 plt.text(x, y, s): adds string s to the figure at coordinates (x, y).
 plt.annotate(s, xy, xytext): links string s at coordinates given by xytext to a point given by xy.
 plt.legend(): adds a legend to the figure.
Characters for line color, line style, and marker style.

Line color/style:
b   Blue
g   Green
r   Red
w   White
k   Black
y   Yellow
m   Magenta
-   Solid line
:   Dotted line
--  Dashed line
-.  Dashed-dot line

Marker style:
.   Point marker
,   Pixel marker
o   Circle marker
+   Plus marker
x   X marker
v   Triangle-down marker
^   Triangle-up marker
<   Triangle-left marker
>   Triangle-right marker
*   Star marker
p   Pentagon marker
1   Tri-down marker
2   Tri-up marker
3   Tri-left marker
4   Tri-right marker
h   Hexagon1 marker
H   Hexagon2 marker
D   Diamond marker
d   Thin diamond marker
|   Vertical line marker
_   Horizontal line marker
s   Square marker
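
The color, line style, and marker characters can be combined into a single format string passed to plt.plot(). A small sketch with made-up data:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, 'ro--')             # red (r), circle markers (o), dashed line (--)
plt.plot(x, [2, 3, 5, 7], 'k^:')   # black (k), triangle-up markers (^), dotted line (:)
plt.show()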

The plt.grid() function adds grid lines to plots.


plt.subplot() function takes three parameters: nrows, ncols, and index.
plt.suptitle() adds a title to the entire figure, not just the individual plots.

Example:
# Load packages
import matplotlib.pyplot as plt
import pandas as pd
# Load oldfaithfulCluster.csv data
df = pd.read_csv('oldfaithfulCluster.csv')
plt.subplot(2, 1, 1)
plt.scatter(df['Eruption'], df['Waiting'])
plt.suptitle('Eruption time vs. waiting time', fontsize=20, c='black')
plt.ylabel('Waiting time', fontsize=14)
plt.subplot(2, 1, 2)
group1 = df[df['Cluster'] == 1]
group0 = df[df['Cluster'] == 0]
plt.scatter(group1['Eruption'], group1['Waiting'], label='1', edgecolors='white')
plt.scatter(group0['Eruption'], group0['Waiting'], label='0', edgecolors='white')
plt.xlabel('Eruption time', fontsize=14)
plt.ylabel('Waiting time', fontsize=14)
plt.legend()

LAB: Importing packages


Import the necessary modules and read in a csv file. The homes dataset contains 18
features giving the characteristics of 76 homes being sold. The modules will be used with
the homes.csv file to perform a linear regression. Linear regression will be covered in a
different chapter.
 Import the NumPy using the alias np and pandas using the alias pd.
 Import the function LinearRegression from the sklearn.linear_model package.
 Read in the csv file homes.csv.
Ex: If the csv file homes_small.csv is used instead of homes.csv, the output is:
The intercept of the regression is 249.522
The slope of the regression is 36.758
# Import NumPy and pandas
import numpy as np
import pandas as pd

# Import the LinearRegression function from sklearn.linear_model


from sklearn.linear_model import LinearRegression # Your code here

# Read in the csv file homes.csv


homes= pd.read_csv("homes.csv")

# Store relevant columns as variables


y = homes['Price']
y = np.reshape(y.values, (-1,1))
X = homes['Floor']
X = np.reshape(X.values, (-1,1))

# Fit a least squares regression model


linModel = LinearRegression()
linModel.fit(X,y)

# Print the intercept and slope of the regression


print('The intercept of the regression is ', end="")
print('%.3f' % linModel.intercept_)

print('The slope of the regression is ', end="")


print('%.3f' % linModel.coef_)

CHAPTER 3
Pandas descriptive statistics methods.

DataFrame.mean(), DataFrame.median() (axis=None, skipna=True): Returns the mean or median of the values over the requested axis. skipna=True excludes NA/null values.
DataFrame.var(), DataFrame.std() (axis=None, skipna=True, ddof=1): Returns the unbiased sample variance (divides by n-1) or standard deviation of the values over the requested axis. The divisor used is n-ddof, where n represents the number of non-NA/null values.
DataFrame.min(), DataFrame.max() (axis=None, skipna=True): Returns the minimum or maximum of the values over the requested axis.
DataFrame.quantile() (q=0.5, axis=None, interpolation='linear'): Returns the value of the given quantile(s), q, over the requested axis. interpolation specifies the method to determine a quantile when the quantile lies between two values.
DataFrame.skew() (axis=None, skipna=True): Returns the skewness of the values over the requested axis.
DataFrame.kurtosis() (axis=None, skipna=True): Returns the kurtosis of the values over the requested axis. Computes Fisher's definition of kurtosis, where a normal distribution has 0 kurtosis.
DataFrame.describe() (percentiles=None): Returns descriptive statistics. For numerical features, results include the count, mean, standard deviation, minimum, maximum, 0.25 quantile, 0.50 quantile or median, and 0.75 quantile. The returned percentiles can be modified with percentiles.
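
A quick sketch of these methods on a made-up dataframe (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({'sales': [10, 20, 30, 40], 'listings': [5, 7, 9, 11]})

print(df.mean())              # mean of each column
print(df['sales'].median())   # 25.0
print(df.std(ddof=1))         # unbiased sample standard deviation
print(df.quantile(q=0.75))    # 0.75 quantile of each column
print(df.describe())          # count, mean, std, min, quartiles, max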

 Using a descriptive statistics method, calculate the mean number of homes sold
("sales") over all cities.

# Import packages and functions


import pandas as pd
housing = pd.read_csv('txhousing.csv')
meanHomes = housing['sales'].mean()# Your code goes here
print('Mean:', meanHomes)

SciPy functions for probability distributions.

Bernoulli: bernoulli.pmf(k, p), bernoulli.cdf(k, p). p=π sets the probability of a "success". bernoulli.pmf() returns the probability P(X = k), and bernoulli.cdf() returns the probability P(X ≤ k).
Binomial: binom.pmf(k, n, p), binom.cdf(k, n, p). n sets the number of observations and p=π sets the probability of a "success". binom.pmf() returns the probability P(X = k), and binom.cdf() returns the probability P(X ≤ k).
Normal: norm.pdf(x, loc, scale), norm.cdf(x, loc, scale). loc=μ sets the mean and scale=σ sets the standard deviation. norm.pdf() returns the density curve's value at x, and norm.cdf() returns the probability P(X ≤ x).
t: t.pdf(x, df), t.cdf(x, df). df sets the degrees of freedom for the distribution. t.pdf() returns the density curve's value at x, and t.cdf() returns the probability P(X ≤ x).

# Requires: from scipy.stats import norm, t

# Calculate the probability of less than a value, P(X<=8), using cdf()
norm.cdf(x=8, loc=10, scale=2)

# Calculate the probability of greater than a value, P(X>8)=1-P(X<=8), using cdf()


1 - norm.cdf(x=8, loc=10, scale=2)

# Calculate the probability between two values, P(8<X<12), using cdf()


norm.cdf(x=12, loc=10, scale=2) - norm.cdf(x=8, loc=10, scale=2)

# Calculate P(X<=0)
t.cdf(x=0, df=4)

# Using the symmetry of the t-distribution curve, calculate P(X < -2 or X > 2)
t.cdf(x=-2, df=4) * 2

# Calculate probability in the tails P(X < -2 or X > 2)


t.cdf(x=-2, df=4) + (1 - t.cdf(x=2, df=4))

Functions for inference about proportions.

proportions_ztest(count, nobs, value, alternative, prop_var=False): Returns the test statistic and p-value for a hypothesis test based on a normal (z) test. count is the number/array of successes and nobs is the number/array of observations; both take a single value for a one-proportion test and an array of values for a two-proportion test. value is the value in the null hypothesis, alternative is the type of the alternative hypothesis, and prop_var=False estimates the variance based on the sample proportions.
proportion_confint(count, nobs, alpha, method='normal'): Returns a (1-alpha)*100% confidence interval for a population proportion. count is the number of successes, nobs is the number of observations, alpha is the significance level, and method='normal' uses the normal approximation to calculate the interval.
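
A minimal sketch of both functions with made-up counts (42 successes out of 100 observations):

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

count = 42
nobs = 100

# Test H0: p = 0.5 against a two-sided alternative
stat, pvalue = proportions_ztest(count=count, nobs=nobs, value=0.5,
                                 alternative='two-sided')
print(stat, pvalue)

# 95% confidence interval for the population proportion
lower, upper = proportion_confint(count=count, nobs=nobs, alpha=0.05,
                                  method='normal')
print(lower, upper)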

Functions for inference about means.

ttest_1samp(a, popmean, alternative): Returns the t-statistic and p-value from a one-sample t-test for the null hypothesis that the population mean of a sample, a, is equal to a specified value. a is an array of values, popmean is the value in the null hypothesis, and alternative is the type of alternative hypothesis.
ttest_ind(a, b, equal_var=False, alternative): Returns the t-statistic and p-value from a two-sample t-test for the null hypothesis that two independent samples, a and b, have equal population means. a and b are arrays of values from sample 1 and sample 2, equal_var=False assumes non-equal variances, and alternative is the type of alternative hypothesis.
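
A minimal sketch using the scipy.stats versions of these tests on made-up samples:

from scipy.stats import ttest_1samp, ttest_ind

sample1 = [5.1, 4.8, 5.3, 5.0, 4.9]
sample2 = [5.6, 5.4, 5.8, 5.5, 5.7]

# One-sample t-test: is the population mean of sample1 equal to 5.0?
stat1, p1 = ttest_1samp(sample1, popmean=5.0)
print(stat1, p1)

# Two-sample t-test: do sample1 and sample2 have equal population means?
stat2, p2 = ttest_ind(sample1, sample2, equal_var=False)
print(stat2, p2)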

Lab: The mtcars dataset contains data from the 1974 Motor Trends magazine, and
includes 10 features of performance and design from a sample of 32 cars.
 Import the csv file mtcars.csv as a data frame using a pandas module function.
 Find the mean, median, and mode of the column wt.
 Print the mean and median.

import pandas as pd
# Read in the file mtcars.csv
cars = pd.read_csv('mtcars.csv') # Your code here
# Find the mean of the column wt
mean = cars['wt'].mean()# Your code here
# Find the median of the column wt
median = cars['wt'].median()# Your code here
print("mean = {:.5f}, median = {:.3f}".format(mean, median))

The intelligence quotient (IQ) of a randomly selected person follows a normal distribution
with a mean of 100 and a standard deviation of 15. Use the scipy function norm and user
input values for IQ1 and IQ2 to perform the following tasks:
 Calculate the probability that a randomly selected person will have an IQ less than
or equal to IQ1.
 Calculate the probability that a randomly selected person will have an IQ
between IQ1 and IQ2.

# Import norm from scipy.stats


from scipy.stats import norm
# Input two IQs, making sure that IQ1 is less than IQ2
IQ1 = float(input())
IQ2 = float(input())
mean = 100
std_dev = 15

while IQ1 > IQ2:
    print("IQ1 should be less than IQ2. Enter numbers again.")
    IQ1 = float(input())
    IQ2 = float(input())

# Calculate the probability that a randomly selected person has an IQ less than or equal to IQ1
probLT = norm.cdf(IQ1, loc=mean, scale=std_dev)  # Your code here

# Calculate the probability that a randomly selected person has an IQ between IQ1 and IQ2
probBetw = norm.cdf(IQ2, loc=mean, scale=std_dev) - norm.cdf(IQ1, loc=mean, scale=std_dev)  # Your code here

print("The probability that a randomly selected person \n has an IQ less than or equal to " + str(IQ1) + " is ", end="")
print('%.3f' % probLT + ".")
print("The probability that a randomly selected person \n has an IQ between " + str(IQ1) + " and " + str(IQ2) + " is ", end="")
print('%.3f' % probBetw + ".")

The gpa dataset is a toy dataset containing the features height and gpa for 35 students.
Use the statsmodels function proportions_ztest and the user defined values for the
proportion for the null hypothesis value and the gpa cutoff cutoff to perform the following
tasks:
 Load the gpa.csv data set.
 Find the number of students with a gpa greater than cutoff.
 Find the total number of students.
 Perform a z-test for the user input expected proportion. Modify
the prop_var parameter to use the user input expected proportion instead of the
sample proportion to calculate the standard error.
 Determine if the hypothesis that the actual proportion is different from the
expected proportion should be rejected at the alpha = 0.01 significance level.

import statsmodels.stats as st
from statsmodels.stats.proportion import proportions_ztest
import pandas as pd

# Read in gpa.csv
gpa = pd.read_csv('gpa.csv')# Your code here

# Get the value of the proportion for the null hypothesis


value = float(input())
# Get the gpa cutoff
cutoff = float(input())

# Determine the number of students with a gpa higher than cutoff


counts = (gpa['gpa'] > cutoff).sum() # Your code here

# Determine the total number of students


nobs = len(gpa)# Your code here

# Perform z-test for counts, nobs, and value


# Modify prop_var parameter
ztest = proportions_ztest(count=counts, nobs=nobs, value=value, alternative='two-sided', prop_var=value)  # Your code here
print("(", end="")
print('%.3f' % ztest[0] + ", ", end="")
print('%.3f' % ztest[1] + ")")

if ztest[1] < 0.01:
    print("The two-tailed p-value, ", end="")
    print('%.3f' % ztest[1] + ", is less than \u03B1. Thus, sufficient evidence exists to support the hypothesis that the proportion is different from", value)
else:
    print("The two-tailed p-value, ", end="")
    print('%.3f' % ztest[1] + ", is greater than \u03B1. Thus, insufficient evidence exists to support the hypothesis that the proportion is different from", value)

CHAPTER 4
Common operators.
Arithmetic operators:
+           Adds two numeric values. Ex: 4 + 3 evaluates to 7.
- (unary)   Reverses the sign of one numeric value. Ex: -(-2) evaluates to 2.
- (binary)  Subtracts one numeric value from another. Ex: 11 - 5 evaluates to 6.
*           Multiplies two numeric values. Ex: 3 * 5 evaluates to 15.
/           Divides one numeric value by another. Ex: 4 / 2 evaluates to 2.
% (modulo)  Divides one numeric value by another and returns the integer remainder. Ex: 5 % 2 evaluates to 1.
^           Raises one numeric value to the power of another. Ex: 5 ^ 2 evaluates to 25.

Comparison operators:
=    Compares two values for equality. Ex: 1 = 2 evaluates to FALSE.
!=   Compares two values for inequality. Ex: 1 != 2 evaluates to TRUE.
<    Compares two values with <. Ex: 2 < 2 evaluates to FALSE.
<=   Compares two values with ≤. Ex: 2 <= 2 evaluates to TRUE.
>    Compares two values with >. Ex: '2019-08-13' > '2021-08-13' evaluates to FALSE.
>=   Compares two values with ≥. Ex: 'apple' >= 'banana' evaluates to FALSE.

Logical operators:
AND  Returns TRUE only when both values are TRUE. Ex: TRUE AND FALSE evaluates to FALSE.
OR   Returns FALSE only when both values are FALSE. Ex: TRUE OR FALSE evaluates to TRUE.
NOT  Reverses a logical value. Ex: NOT FALSE evaluates to TRUE.

Operator precedence.

Precedence  Operators
1           - (unary)
2           ^
3           * / %
4           + - (binary)
5           = != < > <= >=
6           NOT
7           AND
8           OR

SELECT with expressions.


SELECT Expression1, Expression2, ...
FROM TableName;

SELECT with columns.


SELECT Column1, Column2, ...
FROM TableName;
SELECT with asterisk.
SELECT *
FROM TableName;

WHERE clause.
SELECT Expression1, Expression2, ...
FROM TableName
WHERE Condition;

The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies released before January 1, 2000.
Modify the SELECT statement to select the title and release date of PG-13 movies that
are released after January 1, 2008.
Run your solution and verify the result table shows just the titles and release dates
for The Dark Knight and Crazy Rich Asians.

CREATE TABLE Movie (
    ID INT AUTO_INCREMENT,
    Title VARCHAR(100),
    Rating CHAR(5) CHECK (Rating IN ('G', 'PG', 'PG-13', 'R')),
    ReleaseDate DATE,
    PRIMARY KEY (ID)
);

INSERT INTO Movie (Title, Rating, ReleaseDate) VALUES
    ('Casablanca', 'PG', '1943-01-23'),
    ('Bridget Jones\'s Diary', 'PG-13', '2001-04-13'),
    ('The Dark Knight', 'PG-13', '2008-07-18'),
    ('Hidden Figures', 'PG', '2017-01-06'),
    ('Toy Story', 'G', '1995-11-22'),
    ('Rocky', 'PG', '1976-11-21'),
    ('Crazy Rich Asians', 'PG-13', '2018-08-15');

-- Modify the SELECT statement:


SELECT *
FROM Movie
WHERE ReleaseDate < '2000-01-01';

LIKE
 % matches any number of characters. Ex: LIKE 'L%t' matches "Lt", "Lot", "Lift", and
"Lol cat".
 _ matches exactly one character. Ex: LIKE 'L_t' matches "Lot" and "Lit" but not "Lt"
and "Loot".

The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies.
Modify the SELECT statement to select movies with the word "star" somewhere in the
title.
Run your solution and verify the result table shows just the movies Rogue One: A Star
Wars Story, Star Trek and Stargate.
CREATE TABLE Movie (
    ID INT AUTO_INCREMENT,
    Title VARCHAR(100),
    Rating CHAR(5) CHECK (Rating IN ('G', 'PG', 'PG-13', 'R')),
    ReleaseDate DATE,
    PRIMARY KEY (ID)
);

INSERT INTO Movie (Title, Rating, ReleaseDate) VALUES
    ('Rogue One: A Star Wars Story', 'PG-13', '2016-12-16'),
    ('Star Trek', 'PG-13', '2009-05-08'),
    ('The Dark Knight', 'PG-13', '2008-07-18'),
    ('Stargate', 'PG-13', '1994-10-28'),
    ('Avengers: Endgame', 'PG-13', '2019-04-26');

-- Modify the SELECT statement:


SELECT *
FROM Movie;

Simple functions.

Numeric:
ABS(n)        Absolute value of n. Ex: SELECT ABS(-5); returns 5.
LOG(n)        Natural logarithm of n. Ex: SELECT LOG(10); returns 2.302585.
POW(x, y)     x to the power of y. Ex: SELECT POW(2, 3); returns 8.
RAND()        Random number between 0 (inclusive) and 1 (exclusive). Ex: SELECT RAND(); returns 0.118318.
ROUND(n, d)   n rounded to d decimal places. Ex: SELECT ROUND(16.25, 1); returns 16.3.
SQRT(n)       Square root of n. Ex: SELECT SQRT(25); returns 5.

String:
CONCAT(s1, s2, ...)      Concatenation of the strings s1, s2, ... Ex: SELECT CONCAT('Dis', 'en', 'gage'); returns 'Disengage'.
LOWER(s)                 s converted to lower case. Ex: SELECT LOWER('MySQL'); returns 'mysql'.
UPPER(s)                 s converted to upper case. Ex: SELECT UPPER('mysql'); returns 'MYSQL'.
REPLACE(s, from, to)     s with all occurrences of from replaced by to. Ex: SELECT REPLACE('Orange', 'O', 'St'); returns 'Strange'.
SUBSTRING(s, pos, len)   Substring of s that starts at position pos with length len. Ex: SELECT SUBSTRING('Boomerang', 1, 4); returns 'Boom'.

Date/Time:
CURDATE(), CURTIME(), NOW()              Current date, time, or date and time in 'YYYY-MM-DD', 'HH:MM:SS', or 'YYYY-MM-DD HH:MM:SS' format. Ex: SELECT CURDATE(); returns '2019-01-25'.
DAY(d), MONTH(d), YEAR(d)                Day, month, or year of d. Ex: SELECT MONTH('2016-10-25'); returns 10.
HOUR(t), MINUTE(t), SECOND(t)            Hour, minute, or second of t. Ex: SELECT MINUTE('22:11:45'); returns 11.
DATEDIFF(dt1, dt2), TIMEDIFF(dt1, dt2)   Difference of dt1 - dt2, in number of days or amount of time. Ex: SELECT DATEDIFF('2013-03-10', '2013-03-04'); returns 6.

 COUNT() counts the number of selected values.


 MIN() finds the minimum of selected values.
 MAX() finds the maximum of selected values.
 SUM() sums selected values.
 AVG() computes the arithmetic mean of selected values.
 VARIANCE() computes the standard variance of selected values.
GROUP BY clause
 One or more columns are listed after GROUP BY, separated by commas.
 GROUP BY clause returns one row for each group.
 Each group may be ordered with the ORDER BY clause.
 GROUP BY clause must appear before the ORDER BY clause and after the WHERE
clause (if present).

import mysql.connector
from mysql.connector import errorcode

try:
    reservationConnection = mysql.connector.connect(
        user='samsnead',
        password='*jksi72$',
        host='127.0.0.1',
        database='Reservation')

except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print('Invalid credentials')
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print('Database not found')
    else:
        print('Cannot connect to database:', err)

else:
    # Execute database operations...
    reservationConnection.close()

 The cursor.rowcount property is the number of rows returned or altered by a


query.
 The cursor.column_names property is a list of column names in a query result.
 The cursor.fetchwarnings() method returns a list of warnings generated by a
query.
 The connection.commit() method saves all changes.
 The connection.rollback() method discards all changes.
 cursor.fetchone() returns a tuple containing a single result row or the value None if no rows are
selected. If a query returns multiple rows, cursor.fetchone() may be executed repeatedly until it
returns None.
 cursor.fetchall() returns a list of tuples containing all result rows. The tuple list can be processed
in a loop. Ex: for rowTuple in cursor.fetchall() assigns each row to rowTuple and terminates
when all rows are processed.

flightCursor = reservationConnection.cursor()
flightQuery = ('SELECT FlightNumber, DepartureTime FROM Flight '
'WHERE AirportCode = %s AND AirlineName = %s')
flightData = ('PEK', 'China Airlines')
flightCursor.execute(flightQuery, flightData)

for row in flightCursor.fetchall():
    print('Flight', row[0], 'departs at', row[1])

flightCursor.close()

CHAPTER 5

Data wrangling with Python and pandas.

read_csv(filepath_or_buffer, sep=NoDefault.no_default): Returns a dataframe constructed from a CSV file. filepath_or_buffer is a string containing the full path for the CSV file. When the file is in the same directory as the code, only the file name is needed. sep specifies the character that separates values in the CSV file.
read_excel(io, sheet_name=0): Returns a dataframe constructed from an Excel spreadsheet. io is a string containing the full path for the Excel file. When the file is in the same directory as the code, only the file name is needed. sheet_name is a string or integer that specifies which Excel sheet to read.
read_sql_table(table_name, con, schema=None, columns=None): Returns a dataframe constructed from an SQL table. table_name specifies the table name. con specifies a database server connection string. schema specifies the schema in the database server. columns specifies which table columns to include in the dataframe.
DataFrame(data=None, index=None, columns=None): Returns a new dataframe. data specifies dataframe values as an array, dictionary, or another dataframe. index and columns specify row and column labels. The defaults index=None and columns=None generate integer labels.
dataframe.at[index, column]: Returns the dataframe value stored at index and column.
dataframe.info(verbose=None): Returns information about dataframe, such as number of rows and columns, data types, and memory usage. If verbose=False, shows only summary dataframe information and hides column details.
dataframe.loc[indexRange, columnRange]: Returns a slice of dataframe. indexRange specifies rows in the slice as startIndex:endIndex. columnRange specifies columns in the slice as startLabel:endLabel.
dataframe.sort_values(by, axis=0, ascending=True, inplace=False): Sorts dataframe columns or rows. by specifies indexes or labels on which to sort. axis specifies whether to sort rows (0) or columns (1). ascending specifies whether to sort ascending or descending. inplace specifies whether to sort dataframe or return a new dataframe.

Python data structuring methods.

string[start:end]: Returns the substring of string that begins at the index start and ends at the index end - 1.
string.capitalize(), string.upper(), string.lower(), string.title(): Returns a copy of string with the initial character uppercase, all characters uppercase, all characters lowercase, or the initial character of all words uppercase.
to_datetime(arg): Converts arg to datetime data type and returns the converted object. Data type of arg may be int, float, str, datetime, list, tuple, one-dimensional array, Series, or DataFrame.
to_numeric(arg): Converts arg to numeric data type and returns the converted object. Data type of arg may be scalar, list, tuple, one-dimensional array, or Series.
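
A short sketch of these structuring methods (the strings and values are made up):

import pandas as pd

s = 'data science'
print(s[0:4])           # 'data'
print(s.capitalize())   # 'Data science'
print(s.upper())        # 'DATA SCIENCE'
print(s.title())        # 'Data Science'

dates = pd.to_datetime(['2021-01-15', '2021-02-20'])   # datetime64 values
values = pd.to_numeric(['3.5', '7', '10.25'])          # float values
print(dates)
print(values)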

pandas data structuring methods.

df.astype(dtype, copy=True): Converts the data type of all dataframe df columns to dtype. To alter individual columns, specify dtype as {col: dtype, col: dtype, ...}.
df.insert(loc, column, value): Inserts a new column with label column at location loc in dataframe df. value is a Scalar, Series, or Array of values for the new column.

scikit-learn data structuring methods.

preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True): Standardizes data in input X of data type Array or DataFrame. axis indicates whether to standardize along columns (0) or rows (1). with_mean=True centers the data at the mean value. with_std=True scales the data so that one represents a standard deviation.
preprocessing.MinMaxScaler().fit_transform(X): Normalizes data in input X, a fit_transform() parameter of data type Array or DataFrame. feature_range=(0, 1) specifies the range of scaled data. feature_range and copy are MinMaxScaler() parameters.
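
A small sketch contrasting the two scalers on a made-up dataframe:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [10.0, 20.0, 30.0, 40.0]})

# Standardize: each column is centered at 0 with standard deviation 1
standardized = preprocessing.scale(df)
print(standardized)

# Normalize: each column is rescaled to the default feature_range (0, 1)
normalized = preprocessing.MinMaxScaler().fit_transform(df)
print(normalized)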

pandas data cleaning methods.

df.drop(labels=None, axis=0, inplace=False): Removes rows (axis=0) or columns (axis=1) from dataframe df. labels specifies the labels of rows or columns to drop.
df.drop_duplicates(subset=None, inplace=False): Removes duplicate rows from df. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
df.dropna(axis=0, how='any', subset=None, inplace=False): Removes rows (axis=0) or columns (axis=1) containing missing values from df. subset specifies labels on the opposite axis to consider for missing values. how indicates whether to drop the row or column if any or if all values are missing.
df.duplicated(subset=None): Returns a Boolean series that identifies duplicate rows in df. True indicates a duplicate row. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
df.fillna(value=None, inplace=False): Replaces NA and NaN values in df with value, which may be a scalar, dict, Series, or DataFrame.
df.isnull(), df.isna(): Returns a dataframe of Boolean values. True in the returned dataframe indicates the corresponding value of the input df is None, NaT, or NaN.
df.mean(axis=0, skipna=True, numeric_only=None): Returns the mean values of rows (axis=0) or columns (axis=1) of df. skipna indicates whether to exclude unknown values in the calculation. numeric_only indicates whether to exclude non-numeric rows or columns.
df.replace(to_replace=None, value=NoDefault.no_default, inplace=False): Replaces to_replace values in df with value. to_replace and value may be str, dict, list, regex, or other data types.

Python data enriching methods.

pd.concat(objs, axis=0, join='outer', ignore_index=False): Appends dataframes specified in the objs parameter. Appends rows if axis=0 or columns if axis=1. join specifies whether to perform an 'outer' or 'inner' join. Resulting index values are unchanged if ignore_index=False or renumbered if ignore_index=True.
df.apply(func, axis=0): Applies the function specified in the func parameter to a dataframe df. Applies the function to each column if axis=0 or to each row if axis=1. Returns a Series or DataFrame.
df.insert(loc, column, value): Inserts a column into df. loc specifies the integer position of the new column. column specifies a string or numeric column label. value specifies column values as a Scalar or Series.
df.merge(right, how='inner', on=None, sort=False): Joins df with the right dataframe. how specifies whether to perform a 'left', 'right', 'outer', or 'inner' join. on specifies join column labels, which must appear in both dataframes. If on=None, all matching labels become join columns. sort=True sorts rows on the join columns.
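
A short sketch of merge(), concat(), and apply() on made-up dataframes:

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'score': [88, 92, 79]})
right = pd.DataFrame({'id': [2, 3, 4], 'grade': ['A', 'B', 'C']})

# merge() joins on matching column labels ('id' here)
print(left.merge(right, how='inner', on='id'))

# concat() stacks dataframes; ignore_index=True renumbers the rows
print(pd.concat([left, left], axis=0, ignore_index=True))

# apply() runs a function on each column (axis=0) or each row (axis=1)
print(left[['score']].apply(lambda col: col - col.mean(), axis=0))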

LAB: Cleaning data using dropna() and fillna()


The hmeq_small dataset contains information on 5960 home equity loans, including 7
features on the characteristics of the loan.
 Load the data set hmeq_small.csv as a data frame.
 Create a new data frame with all the rows with missing data deleted.
 Create a second data frame with all missing data filled in with the mean value of
the column.
 Find the means of the columns for both new data frames.
import pandas as pd

# Read in hmeq_small.csv
hmeq = pd.read_csv('hmeq_small.csv')# Your code here

# Create a new data frame with the rows with missing values dropped
hmeqDelete = hmeq.dropna() # Your code here

# Create a new data frame with the missing values filled in by the mean of the column
hmeqReplace = hmeq.fillna(hmeq.mean(numeric_only=True)) # Your code here

# Print the means of the columns for each new data frame
print("Means for hmeqDelete are ",hmeqDelete.mean(numeric_only=True)) # Your code
here)

print("Means for hmeqReplace are ", hmeqReplace.mean(numeric_only=True)) # Your


code here)

LAB: Structuring data using scale() and MinMaxScaler()


The hmeq_small dataset contains information on 5960 home equity loans, including 7
features on the characteristics of the loan.
 Load the hmeq_small.csv data set as a data frame.
 Standardize the data set as a new data frame.
 Normalize the data set as a new data frame.
 Print the means and standard deviations of both the standardized and normalized
data.

import pandas as pd
from sklearn import preprocessing

# Read in the file hmeq_small.csv


hmeq = pd.read_csv('hmeq_small.csv')

# Standardize the data


standardized = preprocessing.scale(hmeq)

# Output the standardized data as a data frame with column names


hmeqStand = pd.DataFrame(standardized, columns=hmeq.columns)

# Normalize the data (min-max scaling)


normalized = preprocessing.minmax_scale(hmeq)

# Output the normalized data as a data frame with column names


hmeqNorm = pd.DataFrame(normalized, columns=hmeq.columns)

# Print the means and standard deviations of hmeqStand and hmeqNorm


print("The means of hmeqStand are ", hmeqStand.mean())
print("The standard deviations of hmeqStand are ", hmeqStand.std())
print("The means of hmeqNorm are ", hmeqNorm.mean())
print("The standard deviations of hmeqNorm are ", hmeqNorm.std())
The forestfires dataset contains meteorological information and the area burned for 517
forest fires that occurred in Montesinho Natural Park in Portugal. The columns of interest
are FFMC, DMC, DC, ISI, temp, RH, wind, and rain.
 Read in the file forestfires.csv.
 Create a new data frame X from the columns FFMC, DMC, DC, ISI, temp, RH, wind,
and rain, in that order.
 Calculate the correlation matrix for the data in X.
 Scale the data.
 Use sklearn's PCA function to perform four-component factor analysis on the scaled
data.
 Print the factors and the explained variance.

# Import the necessary modules


import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Read in forestfires.csv
fires = pd.read_csv('forestfires.csv')# Your code here

# Create a new data frame with the columns FFMC, DMC, DC, ISI, temp, RH, wind, and rain, in that order
X = fires[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]  # Your code here

# Calculate the correlation matrix for the data in the data frame X
XCorr = X.corr()# Your code here
print(XCorr)

# Scale the data.


scaler = StandardScaler()# Your code here
firesScaled = scaler.fit_transform(X) # Your code here

# Perform four-component factor analysis on the scaled data.


# Your code here
pca = PCA(n_components=4)
firesPCA = pca.fit_transform(firesScaled)

# Print the factors and the explained variance.


print("Factors: ", pca.components_) # Your code here)
print("Explained variance: ", pca.explained_variance_) # Your code here)

CHAPTER 6

Seaborn single feature plots.

sns.histplot(df, x='Feature'): Creates a histogram of the named numerical feature from the dataframe.
sns.kdeplot(df, x='Feature'): Creates a density plot of the named numerical feature from the dataframe.
sns.countplot(df, x='Feature'): Creates a bar chart of the named categorical feature from the dataframe.
sns.boxplot(df, x='Feature'): Creates a box plot of the named numerical feature from the dataframe.
sns.violinplot(df, x='Feature'): Creates a violin plot of the named numerical feature from the dataframe.

Two feature plots in seaborn.

sns.scatterplot(df, x='Horizontal feature', y='Vertical feature'): Creates a scatter plot of the features provided.
sns.swarmplot(df, x='Numerical feature', y='Categorical feature'): Creates a swarm plot displaying the distribution of x for each group in y.
sns.stripplot(df, x='Numerical feature', y='Categorical feature'): Creates a strip plot displaying the distribution of x for each group in y.

Dataset summary functions.

df.shape: Returns the dataframe's dimensions, displayed as (number of instances, number of features). df.shape is useful when code needs one of these dimensions.
df.info(): Displays the name, number of non-null values, and type of each feature in the dataframe.
df.describe(include="all"): Displays summary statistics (count, mean, standard deviation, min/max, and quartiles) for each numerical feature. Including include="all" displays the count, number of categories, and mode's name and frequency for categorical features.

Many-feature relationship visualization in pandas.

df.hist(): Plots a histogram for every column in the dataframe.
df.boxplot(): Plots a box plot for every column in the dataframe.
pd.plotting.scatter_matrix(df): Plots every pair of numerical features as an individual scatter plot. For more control, seaborn provides the function sns.pairplot(df).
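
A brief sketch of these plotting and summary functions, using seaborn's built-in tips dataset only as a stand-in:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

print(tips.shape)                    # (number of instances, number of features)
print(tips.describe(include='all'))  # summary statistics

sns.histplot(tips, x='total_bill')   # single numerical feature
plt.figure()
sns.scatterplot(tips, x='total_bill', y='tip')   # two numerical features
plt.figure()
pd.plotting.scatter_matrix(tips[['total_bill', 'tip', 'size']])
plt.show()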

LAB: Visualizing mpg data using matplotlib


The dataset mpg contains information on miles per gallon (mpg) and engine size for cars
sold from 1970 through 1982. The dataset has the
features mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, ori
gin, and name.
 Load the dataset mpg.csv.
 Create a new dataframe using the columns weight and mpg.
 Use matplotlib to make a scatter plot of weight vs mpg labelling the x-
axis Weight and the y-axis MPG.

import matplotlib.pyplot as plt


import pandas as pd
import seaborn as sns

# Load the mpg data set


mpg = sns.load_dataset('mpg')# Your code here

# Create a new data frame with the columns "weight" and "mpg"
mpgSmall = mpg[['weight', 'mpg']]# Your code here

print(mpgSmall)

# Create a scatter plot of weight vs mpg with x label "Weight" and y label "MPG"
# Your code here
plt.scatter(mpgSmall['weight'], mpgSmall['mpg'])
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.title('Weight vs MPG')

plt.savefig('mpg_scatter.png')

LAB: Visualizing Titanic passenger statistics using bar charts


The titanic dataset contains data on 887 Titanic passengers, including each passenger's
survival status, embarkation location, cabin class, and sex. Write a program that
performs the following tasks:
 Load the dataset in titanic.csv as titanic.
 Create a new data frame, firstSouth, by subsetting titanic to include instances
where a passenger is in the first class cabin (pclass feature is 1) and boarded from
Southampton (embarked feature is S).
 Create a new data frame, secondThird, by subsetting titanic to include instances
where a passenger is either in the second (pclass feature is 2) or third class
(pclass feature is 3) cabin.
 Create bar charts for the following:
o Passengers in first class who embarked in Southampton grouped by sex.
o Passengers in second and third class grouped by survival status.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load titanic.csv
titanic = sns.load_dataset('titanic') # Your code here

# Subset the titanic dataset to include first class passengers who embarked in Southampton
firstSouth = titanic[(titanic['pclass'] == 1) & (titanic['embarked'] == 'S')]  # Your code here

# Subset the titanic dataset to include either second or third class passengers
secondThird = titanic[(titanic['pclass'] == 2) | (titanic['pclass'] == 3)]  # Your code here

print(firstSouth.head())
print(secondThird.head())

# Create a bar chart for the first class passengers who embarked in Southampton grouped by sex
sns.countplot(data=firstSouth, x='sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.title('First-Class Passengers from Southampton by Sex')

# Your code here


plt.savefig('titanic_bar_1.png')

# Create a bar chart for the second and third class passengers grouped by survival status
sns.countplot(data=secondThird, x='survived')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Survival Count of 2nd and 3rd Class Passengers')
# Your code here
plt.legend(labels=["0","1"], title = "survived")
plt.savefig('titanic_bar_2.png')

CHAPTER 7

Simple linear regression


# Import packages

import matplotlib.pyplot as plt


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import r_regression

# Import data
crabs = pd.read_csv('crab-groups.csv')

# Store relevant columns as variables


X = crabs[['latitude']].values.reshape(-1, 1)
y = crabs[['mean_mm']].values.reshape(-1, 1)

# Fit a least squares regression model


linModel = LinearRegression()
linModel.fit(X, y)
yPredicted = linModel.predict(X)

# Graph the model


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=2)
plt.xlabel('Latitude', fontsize=14)
plt.ylabel('Mean length (mm)', fontsize=14)

# Graph the residuals


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=2)
for i in range(len(X)):
plt.plot([X[i], X[i]], [y[i], yPredicted[i]], color='grey', linewidth=1)
plt.xlabel('Latitude', fontsize=14)
plt.ylabel('Mean length (mm)', fontsize=14)

# Output the intercept of the least squares regression


intercept = linModel.intercept_
print(intercept[0])

# Output the slope of the least squares regression


slope = linModel.coef_
print(slope[0][0])

# Write the least squares model as an equation


print("Predicted mean length = ", intercept[0], " + ", slope[0][0], "* (latitude)")

# Compute the sum of squared errors for the least squares model
SSEreg = sum((y - yPredicted) ** 2)[0]
SSEreg

# Compute the sum of squared errors for the horizontal line model
SSEyBar = sum((y - np.mean(y)) ** 2)[0]
SSEyBar

# Compute the proportion of variation explained by the linear regression


# using the sum of squared errors
(SSEyBar - SSEreg) / (SSEyBar)

# Compute the correlation coefficient r


r = r_regression(X, np.ravel(y))[0]
r

# Compute the proportion of variation explained by the linear regression


# using correlation coefficient
r**2

# Compute the proportion of variation explained by the linear regression


# using the LinearModel object's score method
linModel.score(X, y)

John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Initialize a linear regression model for predicting arrival delay based on departure
delay.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


linearModel = LinearRegression() # Your code goes here

# Fit the linear model


linearModel = linearModel.fit(X, y)

print('Intercept:', linearModel.intercept_[0])

Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Initialize a linear regression model for predicting arrival delay based on departure
delay.
 Fit the linear regression model.
The code contains all imports, loads the dataset, and prints the model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsEWR.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize and fit a linear regression model


linearModel= LinearRegression() # Your code goes here
linearModel = linearModel.fit(X,y)

print('Intercept:', linearModel.intercept_[0])
John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Predict the arrival delay for a flight that departed 8 minutes late, and assign
variable yHat with the prediction.
 Assign variable slope with the slope coefficient of the model.
The code contains all imports, loads the dataset, initializes and fits the model, and
prints yHat and slope once calculated.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


linearModel = LinearRegression()

# Fit the linear model


linearModel = linearModel.fit(X, y)

# Predict the arrival delay and assign the slope


# Your code goes here
yHat = linearModel.predict([[8]])
slope =linearModel.coef_

print('Predicted arrival delay:', yHat[0][0])


print('Slope coefficient:', slope[0][0])

Residual plots with Python.

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Import data
crabs = pd.read_csv('crab-groups.csv')

# Store relevant columns as variables


X = crabs[['latitude']].values.reshape(-1, 1)
y = crabs[['mean_mm']].values.reshape(-1, 1)

# Fit a least squares regression model


linModel = LinearRegression();
linModel.fit(X, y);

# regplot() creates a scatter plot with the regression line overlaid


p = sns.regplot(data=crabs, x='latitude', y='mean_mm', ci=False,
scatter_kws={'color':'black'})
p.set_xlabel('Latitude', fontsize=14);
p.set_ylabel('Mean length (mm)', fontsize=14);

# Calculate predicted values and residuals


yPredicted = linModel.predict(X)
yResid = yPredicted - y

# Scatter plot with predicted values vs. residuals


# Points should be scattered around a horizontal line at y=0 with no obvious pattern
p = sns.regplot(x=yPredicted, y=yResid, ci=False, scatter_kws={'color':'black'})
p.set_xlabel('Fitted values', fontsize=14);
p.set_ylabel('Residuals', fontsize=14);
p.set_title('Fitted value vs. residual plot', fontsize=16);

# Residuals must be stored as a flattened array


resid = np.ravel(yResid)

# Use qqplot() from statsmodels to make a QQ plot


p = sm.qqplot(resid, line='45')

plt.title('Normal Q-Q plot', fontsize=16);


plt.xlabel('Theoretical quantiles', fontsize=14);
plt.ylabel('Sample quantiles', fontsize=14);
Multiple linear regression in Python
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from mpl_toolkits import mplot3d

# Load the dataset


mpg = pd.read_csv('mpg.csv')

# Remove rows that have missing fields


mpg = mpg.dropna()

# Store relevant columns as variables


X = mpg[['acceleration', 'weight']].values.reshape(-1, 2)
y = mpg[['mpg']].values.reshape(-1, 1)

# Graph acceleration vs MPG


plt.scatter(X[:, 0], y, color='black')
plt.xlabel('Acceleration', fontsize=14);
plt.ylabel('MPG', fontsize=14);

# Graph weight vs MPG


plt.scatter(X[:, 1], y, color='black')
plt.xlabel('Weight', fontsize=14);
plt.ylabel('MPG', fontsize=14);

# Fit a least squares multiple linear regression model


linModel = LinearRegression()
linModel.fit(X, y)

# Write the least squares model as an equation


print(
"Predicted MPG = ",
linModel.intercept_[0],
" + ",
linModel.coef_[0][0],
"* (Acceleration)",
" + ",
linModel.coef_[0][1],
"* (Weight)",
)

# Set up the figure


fig = plt.figure()
ax = plt.axes(projection='3d')
# Plot the points
ax.scatter3D(X[:, 0], X[:, 1], y, color="Black")
# Plot the regression as a plane
xDeltaAccel, xDeltaWeight = np.meshgrid(
np.linspace(X[:, 0].min(), X[:, 0].max(), 2),
np.linspace(X[:, 1].min(), X[:, 1].max(), 2),
)
yDeltaMPG = (
linModel.intercept_[0]
+ linModel.coef_[0][0] * xDeltaAccel
+ linModel.coef_[0][1] * xDeltaWeight
)
ax.plot_surface(xDeltaAccel, xDeltaWeight, yDeltaMPG, alpha=0.5)
# Axes labels
ax.set_xlabel('Acceleration');
ax.set_ylabel('Weight');
ax.set_zlabel('MPG');
# Set the view angle
ax.view_init(30, 50);
ax.set_xlim(28, 9);

# Make a prediction
yMultyPredicted = linModel.predict([[20, 3000]])
print(
"Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds \
n",
"using the multiple linear regression is ",
yMultyPredicted[0][0],
"miles per gallon",
)

# Store weight as an array


X2 = X[:, 1].reshape(-1, 1)

# Fit a quadratic regression model using just Weight


polyFeatures = PolynomialFeatures(degree=2, include_bias=False)
xPoly = polyFeatures.fit_transform(X2)
polyModel = LinearRegression()
polyModel.fit(xPoly, y)

# Graph the quadratic regression


plt.scatter(X2, y, color='black')
xDelta = np.linspace(X2.min(), X2.max(), 1000)
yDelta = polyModel.predict(polyFeatures.fit_transform(xDelta.reshape(-1, 1)))
plt.plot(xDelta, yDelta, color='blue', linewidth=2)
plt.xlabel('Weight', fontsize=14)
plt.ylabel('MPG', fontsize=14)

# Write the quadratic model as an equation


print(
"Predicted MPG = ",
polyModel.intercept_[0],
" + ",
polyModel.coef_[0][0],
"* (Weight)",
" + ",
polyModel.coef_[0][1],
"* (Weight)^2",
)

# Make a prediction
polyInputs = polyFeatures.fit_transform([[3000]])
yPolyPredicted = polyModel.predict(polyInputs)
print(
"Predicted MPG for a car with Weight = 3000 pounds \n",
"using the simple polynomial regression is ", yPolyPredicted[0][0], "miles per gallon",
)

# Fit a quadratic regression model using acceleration and weight


polyFeatures2 = PolynomialFeatures(degree=2, include_bias=False)
xPoly2 = polyFeatures2.fit_transform(X)
polyModel2 = LinearRegression()
polyModel2.fit(xPoly2, y)

# Write the quadratic regression as an equation


print(
"Predicted MPG =", polyModel2.intercept_[0], "\n",
" + ", polyModel2.coef_[0][0], "* (Acceleration)\n",
" + ", polyModel2.coef_[0][1], "* (Weight)", "\n",
" + ", polyModel2.coef_[0][2], "* (Acceleration)^2 \n",
" + ", polyModel2.coef_[0][3], "* (Acceleration)*(Weight) \n",
" + ", polyModel2.coef_[0][4], "* (Weight)^2 \n",
)

# Make a prediction
polyInputs2 = polyFeatures2.fit_transform([[20, 3000]])
yPolyPredicted2 = polyModel2.predict(polyInputs2)
print(
"Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds \
n",
"using the polynomial regression is ", yPolyPredicted2[0][0], "miles per gallon",
)

LaGuardia Airport (LGA) is a major airport serving New York City. LGA wanted to predict
the arrival delay of an incoming flight based on the departure delay. 50 recent flights
were randomly selected, and the arrival delays (in minutes) were recorded.
 Initialize a multiple regression model for predicting arrival delay based on
departure delay and flight distance.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsLGA.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay', 'distance']].values.reshape(-1, 2)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


multipleModel = LinearRegression()# Your code goes here

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

print('Intercept:', multipleModel.intercept_)

John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
 Create a dataframe containing month (month) and distance (distance) in that
order. Use the reshape() function to ensure the input features are in the proper
format.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['month','distance']].values.reshape(-1,2)# Your code goes here
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


multipleModel = LinearRegression()

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

print('Intercept:', multipleModel.intercept_)

Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
 Predict the arrival delay for a flight with departure time of 1868 and distance of
1752, and assign variable yHat with the prediction.
 Calculate the slope coefficients for multipleModel and assign slope with the result.
The code contains all imports, loads the dataset, fits the multiple regression model, and
prints yHat and slope once calculated.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsEWR.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_time', 'distance']].values.reshape(-1, 2)
y = flights[['arr_delay']].values.reshape(-1, 1)
# Initialize a linear regression model
multipleModel = LinearRegression()

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

# Predict the arrival delay and save the slope coefficient


# Your code goes here
yHat = multipleModel.predict([[1868, 1752]])
slope = multipleModel.coef_
print('Predicted arrival delay:', yHat)
print('Slope coefficients:', slope)

Logistic regression in Python.

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Load the Wisconsin Breast Cancer dataset


WBCD = pd.read_csv("WisconsinBreastCancerDatabase.csv")
# Convert Diagnosis to 0 and 1.
WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1
WBCD

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression()
logisticModel.fit(X, np.ravel(y.astype(int)))

# Graph logistic regression probabilities


plt.scatter(X, y)
xDelta = np.linspace(X.min(), X.max(), 10000)
yPredicted = logisticModel.predict(X).reshape(-1, 1).astype(int)
yDeltaProb = logisticModel.predict_proba(xDelta.reshape(-1, 1))[:, 1]
plt.plot(xDelta, yDeltaProb, color='red')
plt.xlabel('Radius', fontsize=14);
plt.ylabel('Probability of malignant tumor', fontsize=14);

# Display the slope parameter estimate


logisticModel.coef_

# Display the intercept parameter estimate


logisticModel.intercept_

# Predict the probability a tumor with radius mean 13 is benign / malignant


pHatProb = logisticModel.predict_proba([[13]])
pHatProb[0]

# Classify whether tumor with radius mean 13 is benign (0) or malignant (1)
pHat = logisticModel.predict([[13]])
pHat[0]

print(
"A tumor with radius mean 13 has predicted probability: \n",
pHatProb[0][0],
"of being benign\n",
pHatProb[0][1],
"of being malignant\n",
"and overall is classified to be benign",
)
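As a sanity check, the same probability can be recomputed directly from the fitted intercept and slope with the logistic function; a minimal sketch using the parameter estimates above:

# Manually apply the logistic function at radius mean 13
z = logisticModel.intercept_[0] + logisticModel.coef_[0][0] * 13
print('Manual probability of malignant:', 1 / (1 + np.exp(-z)))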

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on wind speed.
 Fit the logistic regression model, logisticModel, to predict whether a wildfire will
occur.
The code contains all imports, loads the dataset, and prints the model coefficients.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['wind'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel.fit(X,y)# Your code goes here

# Print the estimated coefficients


print('Slope:', logisticModel.coef_[0][0])
print('Intercept:', logisticModel.intercept_[0])

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on temperature.
 Use the fitted logistic regression model, logisticModel, to predict whether a wildfire
will occur on a day with temperature = 25. Assign the prediction to pred.
The code contains all imports, loads the dataset, and prints the prediction.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['temp'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel = logisticModel.fit(X, y)

# Calculate the predicted value and assign to pred


pred = logisticModel.predict([[25]]) # Your code goes here

# Print the predicted value


print('Is a wildfire predicted? (0 = no, 1 = yes):', pred[0])

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on daily rainfall.
 Use the fitted logistic regression model, logisticModel, to calculate the probabilities
of each outcome on a day with daily rainfall = 2. Assign the probabilities to prob.
The code contains all imports, loads the dataset, and prints the probabilities.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['rain'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel = logisticModel.fit(X, y)

# Calculate the probabilities and assign to prob


prob = logisticModel.predict_proba([[2]])# Your code goes here

# Print the predicted value


print('Probability of no wildfire:', prob[0][0])
print('Probability of a wildfire:', prob[0][1])

LAB: Creating simple linear regression models


The nbaallelo_slr dataset contains information on 126315 NBA games between 1947 and
2015. The columns report the points made by one team, the Elo rating of that team
coming into the game, the Elo rating of the team after the game, and the points made by
the opposing team. The Elo score measures the relative skill of teams in a league.
 Load the dataset into a data frame.
 Create a new column y in the data frame that is the difference between the points
made by the two teams.
 Use sklearn's LinearRegression() function to perform a simple linear regression on
the y and elo_i columns.
 Compute the proportion of variation explained by the linear regression using
the LinearRegression object's score method.

# Import the necessary modules


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# Read in nbaallelo_slr.csv
nba = pd.read_csv('nbaallelo_slr.csv')
# Your code here

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']
# Your code here

# Store relevant columns as variables


X = nba[['elo_i']].values.reshape (-1,1)
# Your code here
y = nba[['y']].values.reshape (-1,1)
# Your code here

# Initialize the linear regression model


SLRModel = LinearRegression()
# Your code here
# Fit the model on X and y
SLRModel.fit(X,y)

# Your code here

# Print the intercept


intercept = SLRModel.intercept_
# Your code here
print('The intercept of the linear regression line is ', end="")
print('%.3f' % intercept[0] + ". ")

# Print the slope


slope = SLRModel.coef_
# Your code here
print('The slope of the linear regression line is ', end="")
print('%.3f' % slope[0][0] + ". ")

# Compute the proportion of variation explained by the linear regression
# using the LinearRegression object's score method
score = SLRModel.score(X,y)
# Your code here
print('The proportion of variation explained by the linear regression model is ', end="")
print('%.3f' % score + ". ")

LAB: Performing logistic regression using LogisticRegression()


The nbaallelo_log file contains data on 126314 NBA games from 1947 to 2015. The
dataset includes the features pts, elo_i, win_equiv, and game_result. Using the csv
file nbaallelo_log.csv and scikit-learn's LogisticRegression function, construct a logistic
regression model to classify whether a team will win or lose a game based on the team's
elo_i score.
 Hot encode the game_result variable as a numeric variable with 0 for L and 1 for W
 Use the LogisticRegression function to construct a logistic regression model
with game_result as the target and elo_i as the predictor.
 Predict the probability of a win from an elo_i score of 1310.
 Predict whether a team with an elo_i score of 1310 will win.

# Import the necessary libraries


import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load nbaallelo_log.csv into a dataframe


NBA = pd.read_csv("nbaallelo_log.csv")

# Hot encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1
# Your code here

# Store relevant columns as variables


X = NBA[['elo_i']].values.reshape(-1, 1)
y = NBA[['game_result']].values.ravel().astype(int)

# Initialize and fit the logistic model using the LogisticRegression function
NBAmodel = LogisticRegression()
NBAmodel.fit(X,y)
# Your code here

# Predict the probability that an elo_i score of 1310 is a win / loss


outcomeProb = NBAmodel.predict_proba([[1310]])
# Your code here

# Predict whether an elo_i score of 1310 is a win (1) or loss (0)


outcomePred = NBAmodel.predict([[1310]])

# Your code here

print("A team with the given elo_i score has predicted probability: \n", end="")
print('%.3f' % outcomeProb[0][0] + " losing\n", end="")
print('%.3f' % outcomeProb[0][1] + " winning")
print("and the overall prediction is",
outcomePred[0])
Chapter 8
Binary classification metrics in Python.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Load breast cancer data and hot encodes categorical variable


WBCD = pd.read_csv("WisconsinBreastCancerDatabase.csv")
WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression()
logisticModel.fit(X, np.ravel(y.astype(int)))
cutoff = 0.5
yPredictedProb = logisticModel.predict_proba(X)[:, 1]
yPredLowCutoff = []
for i in range(0, yPredictedProb.size):
    if yPredictedProb[i] < cutoff:
        yPredLowCutoff.append(0)
    else:
        yPredLowCutoff.append(1)

# Display confusion matrix


metrics.confusion_matrix(y, yPredLowCutoff)

# Display accuracy
metrics.accuracy_score(y, yPredLowCutoff)

# Display precision
metrics.precision_score(y, yPredLowCutoff)

# Display recall
metrics.recall_score(y, yPredLowCutoff)
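The same three metrics can be recovered from the confusion matrix counts; a minimal sketch (for binary labels, ravel() returns TN, FP, FN, TP):

# Relate the confusion matrix counts to accuracy, precision, and recall
tn, fp, fn, tp = metrics.confusion_matrix(y, yPredLowCutoff).ravel()
print('Accuracy: ', (tp + tn) / (tp + tn + fp + fn))
print('Precision:', tp / (tp + fp))
print('Recall:   ', tp / (tp + fn))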

# Plot the ROC curve


pred = logisticModel.predict_proba(X)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(
fpr=fpr, tpr=tpr, roc_auc=roc_auc, pos_label='Malignant, 1'
)
display.plot()
plt.show()

Loss functions for regression in Python.


# Import packages
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Load tortoise data


tortoise = pd.read_csv("Tortoises.csv")

# Store relevant columns as variables


X = tortoise["Length"]
y = tortoise["Clutch"]

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123
)

# Create a linear model using the training set and predictions using the test set
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
linModel = LinearRegression()
linModel.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
y_pred = np.ravel(linModel.predict(X_test.reshape(-1, 1)))

# Display linear model and scatter plot of the test set


plt.scatter(X_test, y_test)
plt.xlabel("Length (mm)", fontsize=14)
plt.ylabel("Clutch size", fontsize=14)
plt.plot(X_test, y_pred, color='red')
plt.ylim([0, 14])
for i in range(5):
    plt.plot([X_test[i], X_test[i]], [y_test[i], y_pred[i]], color='grey', linewidth=2)

# Display MSE
metrics.mean_squared_error(y_test, y_pred)

# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)

# Display MAE
metrics.mean_absolute_error(y_test, y_pred)
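These three loss functions can also be computed directly from the residuals; a minimal sketch using NumPy:

# MSE, RMSE, and MAE computed from the residuals
residuals = y_test - y_pred
print('MSE: ', np.mean(residuals ** 2))
print('RMSE:', np.sqrt(np.mean(residuals ** 2)))
print('MAE: ', np.mean(np.abs(residuals)))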

# Create a quadratic model using the training set and predictions using the test set
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
poly = PolynomialFeatures().fit_transform(X_train.reshape(-1, 1))
poly_reg_model = LinearRegression().fit(poly, y_train)
poly_test = PolynomialFeatures().fit_transform(X_test.reshape(-1, 1))
y_pred = poly_reg_model.predict(poly_test)

# Display quadratic model and scatter plot of the test set


plt.scatter(X_test, y_test)
plt.xlabel("Length (mm)", fontsize=14)
plt.ylabel("Clutch size", fontsize=14)
x = np.linspace(X_test.min(), X_test.max(), 100)
y=(
poly_reg_model.coef_[2] * x**2
+ poly_reg_model.coef_[1] * x
+ poly_reg_model.intercept_
)
plt.plot(x, y, color='red', linewidth=2)
plt.ylim([0, 14])
for i in range(5):
    plt.plot([X_test[i], X_test[i]], [y_test[i], y_pred[i]], color='grey', linewidth=2)

# Display MSE
metrics.mean_squared_error(y_test, y_pred)

# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)

# Display MAE
metrics.mean_absolute_error(y_test, y_pred)

Loss functions for classification in Python.


# Import packages and functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Wisconsin Breast Cancer dataset


WBCD = pd.read_csv('WisconsinBreastCancerDatabase.csv')

# Convert Diagnosis to 0 and 1


WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123
)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression();
logisticModel.fit(X_train, np.ravel(y_train.astype(int)));

# Graph logistic regression probabilities


plt.scatter(X_test, y_test)
x_prob = np.linspace(X_test.min(), X_test.max(), 1000)
y_prob = logisticModel.predict_proba(x_prob.reshape(-1, 1))[:, 1]
plt.plot(x_prob, y_prob, color='red')
plt.xlabel('Radius mean', fontsize=14);
plt.ylabel('Probability of malignant tumor', fontsize=14);

# Predict the probabilities for the test set


p_hat = logisticModel.predict_proba(X_test)

# Display the log-loss


metrics.log_loss(y_test, p_hat)
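Log-loss can also be computed directly from its definition; a minimal sketch using the malignant-class probabilities:

# log-loss = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) )
yTrue = np.ravel(y_test)
pMalignant = p_hat[:, 1]
print(-np.mean(yTrue * np.log(pMalignant) + (1 - yTrue) * np.log(1 - pMalignant)))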

Training-validation-test split in Python.


# Import packages
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Load bad drivers data


badDrivers = pd.read_csv('bad-drivers.csv')

# Set the proportions of the training-validation-test split


trainingProportion = 0.70
validationProportion = 0.10
testProportion = 0.20

# Split off the test data


trainingAndValidationData, testData = train_test_split(
badDrivers, test_size=testProportion
)

# Split the remaining into training and validation data


trainingData, validationData = train_test_split(
trainingAndValidationData,
train_size=trainingProportion / (trainingProportion + validationProportion),
)

# Display the scatter plot for the entire sample data


plt.scatter(
    badDrivers[['Losses incurred by insurance companies for collisions per insured driver ($)']],
badDrivers[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Sample data')
plt.show()

# Display the scatter plot for the training data


plt.scatter(
    trainingData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
trainingData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Training data')
plt.show()

# Display the scatter plot for the validation data


plt.scatter(
    validationData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
validationData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Validation data')
plt.show()

# Display the scatter plot for the test data


plt.scatter(
    testData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
testData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Test data')
plt.show()

train_test_split(df, train_size=0.90)
A, B = train_test_split(df, test_size=0.05)
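These calls are generic; applied to the badDrivers data loaded above (the variable names here are only illustrative), the resulting pieces hold roughly 90/10 and 95/5 percent of the rows:

# Illustrative splits of the badDrivers data
train90, test10 = train_test_split(badDrivers, train_size=0.90)
A, B = train_test_split(badDrivers, test_size=0.05)
print(len(train90), len(test10))
print(len(A), len(B))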

Researchers collected measurements from loblolly pines.


 Set the proportions for the training dataset to 70%, validation dataset to 10%, and
testing dataset to 20%.
The code provided contains all imports, loads the dataset, splits the dataset into training,
validation, and test datasets, prints the sizes of these samples, and prints the test
dataset.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2)

# Load the dataset


pines = pd.read_csv('pinesSample.csv')

# Set proportions of train-validate-test split

# Your code goes here


trainingPropPercent = 0.70
validatingPropPercent = 0.10
testingPropPercent = 0.20

# Split dataset into training/validation data and testing data


trainAndValidate, testingDataPercent = train_test_split(pines,
test_size=testingPropPercent, random_state=rng)

# Split training/validation data into training data and validation data


trainingDataPercent, validatingDataPercent = train_test_split(trainAndValidate,
train_size=trainingPropPercent/(trainingPropPercent+validatingPropPercent),
random_state=rng)

# Print split sizes and test dataset


print('original dataset:', len(pines),
'\ntrain_data:', len(trainingDataPercent),
'\nvalidation_data:', len(validatingDataPercent),
'\n\ntest_data:', len(testingDataPercent),
'\n', testingDataPercent)
# Import packages and functions
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Load the dataset


pines = pd.read_csv('pinesSample.csv')

# Set proportions of train-validate-test split


trainingPropPercent = 0.6
validatingPropPercent = 0.2
testingPropPercent = 0.2

# Split dataset into training/validation data and testing data

trainAndValidate, testingDataPercent = train_test_split(
    pines, test_size=testingPropPercent, random_state=rng)  # Your code goes here

# Split training/validation data into training data and validation data


trainingDataPercent, validatingDataPercent = train_test_split(
trainAndValidate,
train_size=trainingPropPercent/(trainingPropPercent+validatingPropPercent),
random_state=rng
)

# Print split sizes and test dataset


print('original dataset:', len(pines),
      '\ntrain_data:', len(trainingDataPercent),
      '\nvalidation_data:', len(validatingDataPercent),
      '\n\ntest_data:', len(testingDataPercent),
      '\n', testingDataPercent)

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(33)

# Load the dataset


loblolly = pd.read_csv('loblollySample.csv')
# Set proportions of train-validate-test split
trainPropPercent = 0.6
validatePropPercent = 0.2
testPropPercent = 0.2

# Split dataset into training/validation data and testing data


trainAndValidate, testDataPercent = train_test_split(
loblolly,
test_size=testPropPercent,
random_state=rng
)

# Split training/validation data into training data and validation data

trainDataPercent, validateDataPercent = train_test_split(


trainAndValidate,
train_size=trainPropPercent / (trainPropPercent + validatePropPercent),
random_state=rng
) # Your code goes here

# Print split sizes and test dataset


print('original dataset:', len(loblolly),
'\ntrain_data:', len(trainDataPercent),
'\nvalidation_data:', len(validateDataPercent),
'\n\ntest_data:', len(testDataPercent),
'\n', testDataPercent
)

k-fold cross-validation in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Import dataset
badDrivers = pd.read_csv('bad-drivers.csv')

# Split off 20% of the data to be left out as test data


badDriversTrainingdata, testData = train_test_split(badDrivers, test_size=0.20)

# Store relevant columns as variables


X = badDriversTrainingdata[
['Losses incurred by insurance companies for collisions per insured driver ($)']
].values.reshape(-1, 1)
y = badDriversTrainingdata[['Car Insurance Premiums ($)']].values.reshape(-1, 1)

# Fit a linear model to the data


linModel = LinearRegression()
linModel.fit(X, y)
yPredicted = linModel.predict(X)

# Plot the fitted model


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=1)
plt.xlabel('Losses incurred by insurance companies', fontsize=14);
plt.ylabel('Car insurance premiums', fontsize=14);

# neg_mean_squared_error is the negative MSE, so negate it so the scores are positive.


ten_fold_scores = -cross_val_score(
linModel, X, y, scoring='neg_mean_squared_error', cv=10
)

# neg_mean_squared_error is the negative MSE, so negate it so the scores are positive.


LOOCV_scores = -cross_val_score(linModel, X, y, scoring='neg_mean_squared_error',
cv=40)

# Plot the errors for both scores


plt.plot(np.zeros_like(ten_fold_scores), ten_fold_scores, '.')
plt.plot(np.zeros_like(LOOCV_scores) + 1, LOOCV_scores, '.')
plt.ylabel('Mean squared errors', fontsize=14);
plt.xticks([0, 1], ['10-fold', 'LOOCV']);

cross_val_score(M, X, y, cv=5)
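In the call above, M stands for any scikit-learn estimator; applied to the linear model from this section with the default scorer (R-squared for regression), a sketch looks like:

# 5-fold cross-validation with the default scorer
fiveFoldScores = cross_val_score(linModel, X, y, cv=5)
print(fiveFoldScores)
print(fiveFoldScores.mean())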

Bootstrap method in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data set


badDrivers = pd.read_csv('bad-drivers.csv')

# Create bootstrap samples and collect errors

bootstrapErrors = []
for i in range(0, 30):
    # Create the bootstrap sample and the out-of-bag sample
    boot = resample(badDrivers, replace=True, n_samples=51)
    oob = badDrivers[~badDrivers.index.isin(boot.index)]

    # Fit a linear model to the bootstrap sample
    XBoot = boot[
        ['Losses incurred by insurance companies for collisions per insured driver ($)']
    ].values.reshape(-1, 1)
    yBoot = boot[['Car Insurance Premiums ($)']].values.reshape(-1, 1)
    linModel = LinearRegression()
    linModel.fit(XBoot, yBoot)

    # Predict y values for the out-of-bag sample
    XOob = oob[
        ['Losses incurred by insurance companies for collisions per insured driver ($)']
    ].values.reshape(-1, 1)
    YOob = oob[['Car Insurance Premiums ($)']].values.reshape(-1, 1)
    YOobPredicted = linModel.predict(XOob)

    # Calculate the error
    bootError = mean_squared_error(YOob, YOobPredicted)
    bootstrapErrors.append(bootError)

# Calculate the mean of the errors


np.mean(bootstrapErrors)

# Calculate the standard deviation of the errors


np.std(bootstrapErrors)

# Plot the errors


plt.plot(bootstrapErrors, np.zeros_like(bootstrapErrors), '.')
plt.xlabel('Bootstrap errors (MSE)', fontsize=14)
plt.gca().axes.yaxis.set_ticks([]);
Model selection in Python.

# Import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Import dataset
thurber = pd.read_csv('Thurber.csv')

# Split off 20% of the data to be left out as test data


thurberTrainingData, test_data = train_test_split(thurber, test_size=0.20)

# Store relevant columns as variables


X = thurberTrainingData[['log(Density)']].values.reshape(-1, 1)
y = thurberTrainingData[['Electron mobility']].values.reshape(-1, 1)

# Fit a cubic regression model


polyFeatures = PolynomialFeatures(degree=3, include_bias=False)
XPoly = polyFeatures.fit_transform(X)
polyModel = LinearRegression()
polyModel.fit(XPoly, y)

# Graph the scatterplot and the polynomial regression


plt.scatter(X, y, color='black')
xDelta = np.linspace(X.min(), X.max(), 1000)
yDelta = polyModel.predict(polyFeatures.fit_transform(xDelta.reshape(-1, 1)))
plt.plot(xDelta, yDelta, color='blue', linewidth=2)
plt.xlabel('log(Density)', fontsize=14);
plt.ylabel('Electron mobility', fontsize=14);

# Collect cross-validation metrics


cvMeans = []
cvStdDev = []

for i in range(1, 7):
    # Fit a degree i polynomial regression model
    polyFeatures = PolynomialFeatures(degree=i, include_bias=False)
    XPoly = polyFeatures.fit_transform(X)
    polyModel = LinearRegression()
    polyModel.fit(XPoly, y)

    # Carry out 10-fold cross-validation for the degree i polynomial regression model
    polyscore = -cross_val_score(
        polyModel, XPoly, y, scoring='neg_mean_squared_error', cv=10
    )

    # Store the mean and standard deviation of the 10-fold cross-validation
    # for the degree i polynomial regression model
    cvMeans.append(np.mean(polyscore))
    cvStdDev.append(np.std(polyscore))

# Graph the errorbar chart using the cross-validation means and std deviations
plt.errorbar(x=range(1, 7), y=cvMeans, yerr=cvStdDev, marker='o', color='black')
plt.xlabel('Degree of regression polynomial', fontsize=14)
plt.ylabel('Mean squared error', fontsize=14)
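One way to read a selected model off the chart is to take the degree with the smallest mean cross-validation error; a minimal sketch using cvMeans from above:

# Choose the polynomial degree with the lowest mean 10-fold CV error
bestDegree = np.argmin(cvMeans) + 1  # +1 because degrees start at 1
print('Degree with lowest mean CV error:', bestDegree)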

Linear model for predicting house prices.

# Import packages and functions


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression

# Import and view data


homes = pd.read_csv('homes.csv').dropna()
homes

# Set seed
seed = 123

# Set proportion of data for the test set


test_p = 0.20

# Define input and output features


X = homes[['Floor']]
y = homes[['Price']]
# Create training and testing data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_p, random_state=seed
)

# Plot training dataset and regression line


p = sns.regplot(x=X_train, y=y_train, ci=False, line_kws={'color': 'black'})
p.set_xlabel('Square feet (1000s)', fontsize=14);
p.set_ylabel('Price ($1000s)', fontsize=14);
p.set_title('Training model', fontsize=16);

# Initialize and fit the linear model


linearModel = LinearRegression()
linearModel = linearModel.fit(X_train, y_train)

# Print model coefficients


print('beta1 =', linearModel.coef_)
print('beta0 =', linearModel.intercept_)

# Regression metrics on training dataset


y_pred = linearModel.predict(X_train)
print('MSE =', mean_squared_error(y_train, y_pred))
print('MAE =', mean_absolute_error(y_train, y_pred))
print('R-squared =', r2_score(y_train, y_pred))

# Regression metrics on testing dataset


y_pred = linearModel.predict(X_test)
print('MSE =', mean_squared_error(y_test, y_pred))
print('MAE =', mean_absolute_error(y_test, y_pred))
print('R-squared =', r2_score(y_test, y_pred))

# Plot the model for the training and testing sets


plt.rcParams["figure.figsize"] = (12, 5)

x = pd.array([1, 2, 3])
yhat = 213.13396131 + 37.92605345 * x

plt.subplot(1, 2, 1)

# Training set subplot


p = sns.scatterplot(x=X_train['Floor'], y=y_train['Price'])
plt.plot(x, yhat, color='black')
p.set_xlabel('Square feet (1000s)', fontsize=14)
p.set_ylabel('Price ($1000s)', fontsize=14)
p.set_title('Training dataset', fontsize=16)
p.set_ylim(140, 460)

plt.subplot(1, 2, 2)
# Testing set subplot
p = sns.scatterplot(x=X_test['Floor'], y=y_test['Price'])
plt.plot(x, yhat, color='black')
p.set_xlabel('Square feet (1000s)', fontsize=14);
p.set_ylabel('Price ($1000s)', fontsize=14);
p.set_title('Testing dataset', fontsize=16);
p.set_ylim(140, 460);

8.10 LAB: Evaluating linear regression using cross-validation

The nbaallelo_slr dataset contains information on 126315 NBA games between 1947 and
2015. The columns report the points made by one team, the Elo rating of that team
coming into the game, the Elo rating of the team after the game, and the points made by
the opposing team. The Elo rating measures the relative skill of teams in a league.
 The code creates a new column y in the data frame that is the difference
between pts and opp_pts.
 Split the data into 70 percent training set and 30 percent testing set
using sklearn's train_test_split function. Set random_state=0.
 Store elo_i and y from the training data as the variables X and y.
 The code performs a simple linear regression on X and y.
 Perform 10-fold cross-validation with the default scorer using scikit-
learn's cross_val_score function.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

nba = pd.read_csv("nbaallelo_slr.csv")

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']

# Split the data into training and test sets


train, test = # Your code here

# Store relevant columns as variables


X = # Your code here
y = # Your code here

# Initialize the linear regression model


SLRModel = LinearRegression()
# Fit the model on X and y
SLRModel.fit(X,y)

# Perform 10-fold cross-validation with the default scorer


tenFoldScores = # Your code here
print('The cross-validation scores are', tenFoldScores)

Solution:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

nba = pd.read_csv("nbaallelo_slr.csv")

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']

# Split the data into training and test sets


# Your code here

# Store relevant columns as variables


X = nba[['elo_i']].values # Your code here
y = nba[['y']].values# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=0)

# Initialize the linear regression model


SLRModel = LinearRegression()
# Fit the model on X and y
SLRModel.fit(X,y)

# Perform 10-fold cross-validation with the default scorer


tenFoldScores = cross_val_score(SLRModel, X_train, y_train, scoring='r2', cv=10)
# Your code here
print('The cross-validation scores are', tenFoldScores)

Chapter 9

k-nearest neighbors classification in Python

# Import needed packages for classification


from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Import packages for visualization of results


import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from mlxtend.plotting import plot_decision_regions

# Import packages for evaluation


from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read data, clean up names

beans = pd.read_csv('Dry_Bean_Dataset.csv')
beans['Class'] = beans['Class'].str.capitalize()
print(beans.shape)
beans.describe()

# Initialize model
beanKnnClassifier = KNeighborsClassifier(n_neighbors=5)
# Split data
X = beans[['MajorAxisLength', 'MinorAxisLength']]
y = beans[['Class']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Train model and make predictions for the test set.


beanKnnClassifier.fit(X_train_scaled, np.ravel(y_train))
y_pred = beanKnnClassifier.predict(scaler.transform(X_test))

# Predict one bean


bean = pd.DataFrame(data={'MajorAxisLength': [400], 'MinorAxisLength': [200]})
beanKnnClassifier.predict(scaler.transform(bean))

# Take a sample to keep runtime low while seeing what areas are classified as each bean
beanSample = beans.sample(200, random_state=123)
beanSample.describe()

# Create integer-valued labels for plot_decision_regions()


beanSample['Int'] = beanSample['Class'].replace(
to_replace = ['Barbunya', 'Bombay', 'Cali', 'Dermason', 'Horoz', 'Seker', 'Sira'],
value = [int(0), int(1), int(2), int(3), int(4), int(5), int(6)])

# Define input and output features


X = beanSample[['MajorAxisLength', 'MinorAxisLength']]
y = beanSample[['Int']]

# Fit model
beanKnnClassifier.fit(X, np.ravel(y))

# Set background opacity to 20%


contourf_kwargs = {'alpha': 0.2}

# Plot decision boundary regions


p = plot_decision_regions(X.to_numpy(), np.ravel(y), clf=beanKnnClassifier,
contourf_kwargs=contourf_kwargs)

# Add title and axis labels


p.set_xlabel('MajorAxisLength', fontsize=14)
p.set_ylabel('MinorAxisLength', fontsize=14)

# Add legend
L = plt.legend()

L.get_texts()[0].set_text('Barbunya')
L.get_texts()[1].set_text('Bombay')
L.get_texts()[2].set_text('Cali')
L.get_texts()[3].set_text('Dermason')
L.get_texts()[4].set_text('Horoz')
L.get_texts()[5].set_text('Seker')
L.get_texts()[6].set_text('Sira')

This dataset contains data on sleep habits for 30 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
 Initialize a k-nearest neighbors classification model with k=4.
The code contains all imports, loads the dataset, fits the model, and applies the model

# Import packages and functions


import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Import dataset
sleep = pd.read_csv('sleep.csv')

# Create input matrix X and output matrix y


X = sleep[['awake', 'sleep_rem']]
y = sleep[['vore']]

knnModel= KNeighborsClassifier(n_neighbors=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Your code goes here

knnModel = knnModel.fit(X, np.ravel(y))

# Print predictions
print(knnModel.predict(X))

This dataset contains data on sleep habits for 25 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
REM sleep cycles of guinea pigs average 0.8 hours. Guinea pigs are awake on average
14.6 hours per day.
 Use the kneighbors() method to find the instances in the training data that are
closest to guinea pigs. Assign the instances, but not the distances, to neighbors.
The code contains all imports, loads the dataset, initializes the model, and applies the
model to a test dataset.

# Import packages and functions


import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier


from sklearn.model_selection import train_test_split

# Import dataset
sleep = pd.read_csv('sleep.csv')

# Create input matrix X and output matrix y


X = sleep[['sleep_rem', 'awake']]
y = sleep[['vore']]
knnModel = KNeighborsClassifier(n_neighbors=5)
knnModel = knnModel.fit(X.values, np.ravel(y.values))
guinea_pig = np.array([[0.8, 14.6]])
neighbors = knnModel.kneighbors(guinea_pig, return_distance=False)  # Your code goes here

# Print neighbors
print(neighbors)
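If the distances to those neighbors are also of interest, kneighbors() can return them as well; a minimal sketch:

# return_distance=True returns the distances along with the neighbor indices
distances, indices = knnModel.kneighbors(guinea_pig, return_distance=True)
print(distances)
print(indices)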

Naive Bayes classification in Python.


# Import packages and functions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Read in the data and view the first five instances.


# File does not include column headers so they are provided via names.
messages = pd.read_table('SMSSpamCollection.csv', names=['Class', 'Message'])
messages.head()

# Split into testing and training sets


X_train, X_test, Y_train, Y_test = train_test_split(
messages['Message'], messages['Class'], random_state=123
)

# Count the words that appear in the messages


vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(X_train)
# Uncomment the line below to see the words.
#vectorizer.vocabulary_

# Count the words in the training set and store in a matrix


X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

# Initialize the model and fit with the training data


NBmodel = MultinomialNB()
NBmodel.fit(X_train_vectorized, Y_train)
# Make predictions onto the training and testing sets.
trainPredictions = NBmodel.predict(vectorizer.transform(X_train))
testPredictions = NBmodel.predict(vectorizer.transform(X_test))

# How does the model work on the training set?


confusion_matrix(Y_train, trainPredictions)

# Display that in terms of correct porportions


confusion_matrix(Y_train, trainPredictions, normalize='true')

# How does the model work on the test set?


confusion_matrix(Y_test, testPredictions, normalize='true')

# Predict some phrases. Add your own.


NBmodel.predict(
vectorizer.transform(
["Big sale today! Free cash.",
"I'll be there in 5"]))

Re: 1/200 * (200-25)/200 * (200-21)/200 * 200/400 = 0.0019578125

Not Re: 19/200 * 87/200 * (200-5)/200 = 0.040291875
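A quick check of the arithmetic in the two products above (the counts themselves come from the exercise's frequency table, which is not reproduced here):

# Verify the two Naive Bayes products
print(1/200 * (200-25)/200 * (200-21)/200 * 200/400)  # approximately 0.0019578125
print(19/200 * 87/200 * (200-5)/200)                  # approximately 0.040291875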

Support vector machine classification in Python.


# Load packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Load and view data


penguins = sns.load_dataset('penguins')
penguins

# Remove the penguins with missing data


penguinsClean = penguins[~penguins['body_mass_g'].isna()]

# Only use numeric values. Categorical values could be encoded as dummy variables.

X = penguinsClean[
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
]
Y = penguinsClean['species']

# Split the data into training and testing sets.


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=20220621)

# Scale the input variables because SVM is dependent on differences in scale for distances
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define and fit the model.


# Adjust C from 0.01 to 100 by changing the number of decimal places or zeros.
# C controls the slope of the hinge function. Larger values make misclassification less frequent.

penguinsSVMlinear = svm.SVC(kernel='linear', C=0.01)


penguinsSVMlinear.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMlinear.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)
# Adjust the number of decimal places in gamma and C.
# gamma affects the distance over which a point has influence; smaller values of gamma
# allow its influence to spread more.

penguinsSVMrbf = svm.SVC(kernel='rbf', C=10, gamma=0.01)


penguinsSVMrbf.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMrbf.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)
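Because C and gamma jointly control how much misclassification is tolerated and how far each point's influence reaches, it can help to compare a few settings on the test set; a minimal sketch assuming the scaled splits above (accuracy_score is an extra import not used elsewhere in this example):

from sklearn.metrics import accuracy_score

# Compare test-set accuracy for a small grid of C and gamma values
for C in [0.01, 1, 100]:
    for gamma in [0.01, 0.1, 1]:
        model = svm.SVC(kernel='rbf', C=C, gamma=gamma)
        model.fit(X_train_scaled, Y_train)
        acc = accuracy_score(Y_test, model.predict(X_test_scaled))
        print('C =', C, 'gamma =', gamma, 'accuracy =', round(acc, 3))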

# Adjust the number of decimal places in C and change degree by steps of 1.


# Degree impacts the degree of the polynomial for the kernel.

penguinsSVMpoly = svm.SVC(kernel='poly', C=0.1, degree=5)


penguinsSVMpoly.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMpoly.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)

# The number of support vectors for each class


penguinsSVMrbf.n_support_

# Which instances in the training set are support vectors


penguinsSVMrbf.support_

# The coefficients of the hyperplanes for each pair of classes, where each hyperplane
# has the form coefficient1*variable1 + coefficient2*variable2 + ... + intercept = 0
penguinsSVMlinear.coef_

# The intercept of the hyperplanes for each pair of classes.


penguinsSVMlinear.intercept_
K-nearest neighbors classification

The dataset SDSS contains 17 observational features and one class feature for 10000
deep sky objects observed by the Sloan Digital Sky Survey.
Use sklearn's KNeighborsClassifier function to perform kNN classification to classify each
object by the object's redshift and u-g color.
 Import the necessary modules for kNN classification.
 Create a dataframe X with features redshift and u_g.
 Create dataframe y with feature class.
 Initialize a kNN model with k=3.
 Fit the model using the training data.
 Find the predicted classes for the test data.
 Calculate the accuracy score and confusion matrix.

# Import needed packages for classification


from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Import packages for visualization of results
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from mlxtend.plotting import plot_decision_regions

# Import packages for evaluation


from sklearn.model_selection import train_test_split
from sklearn import metrics
skySurvey = pd.read_csv('SDSS.csv')
skySurvey['u_g'] = skySurvey['u']-skySurvey['g']

# Initialize model with k=3


skySurveyKnn =KNeighborsClassifier(n_neighbors=3) # Your code here
X = skySurvey[['redshift', 'u_g']] # Features
y = skySurvey['class']

# Fit model using X_train and y_train


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
skySurveyKnn.fit(X_train, y_train)# Your code here

# Find the predicted classes for X_test


y_pred = skySurveyKnn.predict(X_test) # Your code here

# Calculate accuracy score


score = metrics.accuracy_score(y_test, y_pred)# Your code here

# Print accuracy score


print('Accuracy score is ', end="")
print('%.3f' % score)

# Print confusion matrix


print(metrics.confusion_matrix(y_test, y_pred))  # Your code here

Chapter 10
K-means clustering in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.cluster import KMeans

# Load dataset
geyser = pd.read_csv('oldfaithful.csv')
geyser

# Visual exploration
p = sns.scatterplot(data=geyser, x='Eruption', y='Waiting')
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);

# Initialize a k-means model with k=2


kmModel = KMeans(n_clusters=2)

# Fit the model


kmModel = kmModel.fit(geyser)
# Save the cluster centroids
centroids = kmModel.cluster_centers_
centroids[1]

# Save the cluster assignments


clusters = kmModel.fit_predict(geyser[['Eruption', 'Waiting']])

# View the clusters for the first five instances


clusters[0:5]

# Plot clusters
p = sns.scatterplot(
data=geyser, x='Eruption', y='Waiting', hue=clusters, style=clusters
)
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);

# Add centroid for cluster 0


plt.scatter(x=centroids[0, 0], y=centroids[0, 1], c='black')

# Add centroid for cluster 1


plt.scatter(x=centroids[1, 0], y=centroids[1, 1], c='black', marker='X')

# Fit k-means clustering with k=1,...,5 and save WCSS for each
WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    kmModel = KMeans(n_clusters=j)
    kmModel = kmModel.fit(geyser)
    WCSS.append(kmModel.inertia_)

# Plot the WCSS for each cluster


ax = plt.figure().gca()
plt.plot(k, WCSS, '*-')
plt.xlabel('Number of clusters (k)', fontsize=14);
plt.ylabel('Within-cluster sum of squares (WCSS)', fontsize=14);

K-means clustering using scikit-learn.

Researchers studying chemical properties of wines collected data on a sample of white
wines in northern Portugal. One of the research goals was to cluster wines based on
similar chemical properties.
 Fit the k-means clustering model to cluster wines based on alcohol concentration
(alcohol) and total sulfur dioxide (total_sulfur_dioxide).
The code provided initializes the model with n_clusters=5 and random_state=rng and
prints the cluster centers

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Read in the data


wines = pd.read_csv('whitewine.csv')

# Seed random number generator


rng = np.random.RandomState(43)

# Initialize k-means clustering model


kmeansModel = KMeans(n_clusters=5, random_state=rng)

# Your code goes here

Solution:


kmeansModel=kmeansModel.fit(wines[['alcohol','total_sulfur_dioxide']])

print(kmeansModel.cluster_centers_)

 Initialize a k-means clustering model with n_clusters=3 and random_state=rng.


 Fit the model to cluster wines based on free sulfur dioxide (free_sulfur_dioxide) and
density (density).
kmeansModel = KMeans(n_clusters=3, random_state=rng)
clusters = kmeansModel.fit_predict(wines[['free_sulfur_dioxide', 'density']])

Agglomerative clustering in Python.

# Import packages and functions


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage


from scipy.spatial.distance import squareform

# Load the dataset


cytochrome = pd.read_csv('cytochrome.csv', header=None, usecols=range(1, 14))
cytochrome

# Add labels for each species and save as a data frame


species = [
"Human",
"Monkey",
"Horse",
"Cow",
"Dog",
"Whale",
"Rabbit",
"Kangaroo",
"Chicken",
"Penguin",
"Duck",
"Turtle",
"Frog",
]

pd.DataFrame(data=cytochrome.to_numpy(), index=species, columns=species)

# Format the data as a distance matrix


# In this case, the data already represents distance between points (species)
differences = squareform(cytochrome)

# Define a clustering model with single linkage


clusterModel = linkage(differences, method='single')

# Create the dendrogram


dendrogram(clusterModel, labels=species, leaf_rotation=90)

# Plot the dendrogram


plt.ylabel('Amino acid differences', fontsize=14)
plt.yticks(np.arange(0, 11, step=1))
plt.xlabel('Species', fontsize=14)
plt.title('Single linkage clustering', fontsize=16)
plt.show()
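If flat cluster labels are needed from the single-linkage model above, scipy's fcluster() can cut the dendrogram at a chosen height; a minimal sketch (the threshold of 8 amino acid differences is only an illustration):

from scipy.cluster.hierarchy import fcluster

# Cut the tree at a height of 8 amino acid differences
flatClusters = fcluster(clusterModel, t=8, criterion='distance')
print(dict(zip(species, flatClusters)))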

Hierarchical clustering using scipy and scikit-learn.

import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

wine = pd.read_csv('wine1.csv')

# Calculate a distance matrix with selected variables


X = wine[['alcohol', 'fixed_acidity']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

# pdist() calculates pairs of distances between each instance in the dataset


dist = pdist(X)

# Your code goes here

print(clusterModel)

 Cluster wines with complete linkage.

clusterModel = linkage(dist, method='complete')

 Using pdist(), calculate a distance matrix for wines. The matrix of input features, X,
has already been created.
 Use the distance matrix to cluster the wines with centroid linkage.

dist = pdist(X)
clusterModel = linkage(dist, method='centroid')

DBSCAN in Python.

# Import packages and functions


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import DBSCAN


from numpy import where
from sklearn.preprocessing import StandardScaler

# Load the dataset


homes = pd.read_csv('homes.csv')
homes

# Create a smaller data frame with two variables: Price and Floor
homes_pf = homes[['Price', 'Floor']]
homes_pf.describe()

# Define a scaler to transform values


scaler = StandardScaler()

# Apply scaler and view result


homes_scaled = pd.DataFrame(scaler.fit_transform(homes_pf), columns=['Price', 'Floor'])
homes_scaled.describe()

# Initialize DBSCAN model


# Setting a large epsilon will cluster all "middle" values and detect outliers
dbscanModel = DBSCAN(eps=1, min_samples=12)

# Fit the model


dbscanModel = dbscanModel.fit(homes_scaled)

# Predict clusters
clusters = dbscanModel.fit_predict(homes_scaled)
clusters = pd.Categorical(clusters)
clusters

# Visualize scaled outliers


p = sns.scatterplot(data=homes_scaled, x='Floor', y='Price', hue=clusters)
p.set_xlabel('Scaled floor', fontsize=14);
p.set_ylabel('Scaled price', fontsize=14);

# Points where the prediction is -1 are considered outliers


outliers_scaled = homes_scaled[clusters == -1]
outliers_scaled

# Outliers on original scale (price and square footage in thousands)


outliers_unscaled = homes[clusters == -1]
outliers_unscaled

# Visualize outliers on original scale


p = sns.scatterplot(data=homes, x='Floor', y='Price', hue=clusters)
p.set_xlabel('Floors', fontsize=14);
p.set_ylabel('Price', fontsize=14);
Researchers studying chemical properties of wines collected data on a sample of white
wines in Northern Portugal. A research goal was to cluster wines based on similar
chemical properties.
 Fit the DBSCAN model to cluster wines.
The code provided creates a dataframe with two features (citric_acid and fixed_acidity),
normalizes the dataframe, initializes the DBSCAN model, and prints the cluster labels for
each point in the dataset.

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

wine = pd.read_csv('wine1.csv')

# Create an input matrix with selected features


X = wine[['citric_acid', 'fixed_acidity']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

# Cluster using DBSCAN with default options


dbscanModel = DBSCAN()

# Your code goes here

print(dbscanModel.labels_)

Fit the DBSCAN model to cluster wines


dbscanModel = dbscanModel.fit(wine)

 Use the DBSCAN clustering function to cluster wines. Keep eps and min_samples at
default values.
 Fit the DBSCAN model to cluster wines.
dbscanModel = DBSCAN()
dbscanModel = dbscanModel.fit(wine)

 Use the DBSCAN clustering function to cluster wines.


Set eps=0.75 and min_samples=3.
 Fit the DBSCAN model to cluster wines.

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
wine = pd.read_csv('wine1.csv')

# Create an input matrix with selected features


X = wine[['chlorides', 'density']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

dbscanModel = DBSCAN(eps=0.75, min_samples=3)


dbscanModel = dbscanModel.fit(X)
# Your code goes here

print(dbscanModel.labels_)

dbscanModel = DBSCAN(eps=0.75, min_samples=3)


dbscanModel = dbscanModel.fit(X)

Factor analysis in Python.

# Load the pandas package


import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

# Load the rock.csv dataset


rock = pd.read_csv('rock.csv')

# Display the correlation matrix using a heatmap


plt.figure(figsize=(4, 4))
sns.heatmap(rock.corr(), cmap="YlGnBu", annot=True)

# Create a scatter plot using perimeter and area


plt.figure(figsize=(4, 4))
plt.scatter(rock['Perimeter'], rock['Area'])
plt.xlabel('Perimeter', fontsize=14);
plt.ylabel('Area', fontsize=14);

# Create a scatter plot with a linear regression line


model = st.linregress(rock['Perimeter'], rock['Area'])
plt.figure(figsize=(4, 4))
plt.scatter(rock['Perimeter'], rock['Area'])
x = np.linspace(0, 5000, 10000)
y = model[0] * x + model[1]
plt.plot(x, y, '-r', linewidth=2.5)
plt.xlabel('Perimeter', fontsize=14);
plt.ylabel('Area', fontsize=14);

# Scale the data


scaler = StandardScaler()
rock = pd.DataFrame(
scaler.fit_transform(rock), columns=['Area', 'Perimeter', 'Shape', 'Permeability']
)

# Initialize and fit a PCA model on the rock data


pcaModel = PCA(n_components=4);
pcaModel.fit(rock);

# Display the explained variance (eigenvalues)


pcaModel.explained_variance_

# Show the factor loadings


pcaModel.components_.T * np.sqrt(pcaModel.explained_variance_)

# Create a scree plot


xint = range(0, 5)
plt.xticks(xint)
plt.plot([1, 2, 3, 4], pcaModel.explained_variance_, 'b*-')
plt.xlabel('Factors', fontsize='14');
plt.ylabel('Eigenvalues', fontsize='14');

Researchers studying chemical properties of wines collected data on a sample of white
wines in northern Portugal. Several chemical components in the wines were highly
correlated.
 Create a dataframe, X, that contains five features in the following
order: fixed_acidity, quality, total_sulfur_dioxide, volatile_acidity, and pH.
The code provided prints the correlation matrix for the features in X.
import pandas as pd
white_wine = pd.read_csv('white_wine.csv')

# Your code goes here

print(X.corr())

 Fit the principal components model to the dataframe X.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wines = pd.read_csv('wines.csv')

X = wines[['citric_acid', 'fixed_acidity', 'free_sulfur_dioxide', 'density']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

pcaModel = PCA(n_components=2)

# Your code goes here

print(pcaModel.explained_variance_ratio_)

 Fit the principal components model to the dataframe X.


 Use print() to calculate and display the factor loading matrix.
model = PCA(n_components=2)
model.fit(X)
print(model.components_.T * np.sqrt(model.explained_variance_))

Principal components with the travel ratings dataset.


# Import packages and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

reviews = pd.read_csv('tripadvisor_review.csv').dropna()
# Drop user ratings
X = reviews.drop(axis=1, labels='User ID')

# Standardize input features to mean=0 and sd=1


scaler = StandardScaler()
X = pd.DataFrame(
scaler.fit_transform(X),
columns=[
'Art',
'Clubs',
'Juice bars',
'Restaurants',
'Museums',
'Resorts',
'Parks',
'Beaches',
'Theaters',
'Religious',
],
)
X.describe().round(2)

# Plot correlation matrix for input features


plt.figure(figsize=(15, 10))
plt.rcParams.update({'font.size': 14})
sns.heatmap(X.corr().round(2), cmap="RdBu", annot=True, vmin=-1, vmax=1)
plt.show()

# Initialize and fit a PCA model on the travel ratings data


pcaModel = PCA(n_components=10);
pcaModel.fit(X);

# Display eigenvalues
pcaModel.explained_variance_.round(3)

# Calculate PC1 and PC2


pca = PCA(n_components=2)
pca_result = pca.fit_transform(X.values)
pca_result

# Add PC1 and PC2 to X and display updated correlations


X['PC1'] = pca_result[:, 0]
X['PC2'] = pca_result[:, 1]
plt.figure(figsize=(15, 10))
plt.rcParams.update({'font.size': 14})
sns.heatmap(X.corr().round(2), cmap="RdBu", annot=True, vmin=-1, vmax=1)
plt.show()

Clustering with the travel ratings dataset.

# Import packages and data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

reviews = pd.read_csv('tripadvisor_review.csv').dropna()

# seed for reproducibility


seed = 123

# Drop user ID from dataset


X = reviews.drop(axis=1, labels=['User ID'])
X

# Initialize a k-means model with k=4


kmModel = KMeans(n_clusters=4, random_state=seed, n_init=10)
kmModel = kmModel.fit(X)
clusters = kmModel.fit_predict(X)
centroids = kmModel.cluster_centers_
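As an optional sanity check (not part of the original example), the number of reviewers assigned to each cluster can be printed before examining the rating distributions:

print(pd.Series(clusters).value_counts().sort_index())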

# Show cluster ratings for juice bars


p = sns.kdeplot(data=X, x='Juice bars', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Juice bars', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for juice bars


X[['Juice bars']].groupby(by=clusters).describe().round(2)

# Show cluster ratings for resorts


p = sns.kdeplot(data=X, x='Resorts', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Resorts', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for resorts


X[['Resorts']].groupby(by=clusters).describe().round(2)

# Show cluster ratings for religious sites


p = sns.kdeplot(data=X, x='Religious', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Religious sites', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for religious sites


X[['Religious']].groupby(by=clusters).describe().round(2)

Outlier detection with the travel ratings dataset.

# Import packages and data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN

reviews = pd.read_csv('tripadvisor_review.csv').dropna()

# Drop user ID
X = reviews.drop(axis=1, labels='User ID')

# Define DBSCAN model


dbscanModel = DBSCAN(eps=1, min_samples=20)

# Fit the model


dbscanModel = dbscanModel.fit(X)
clusters = dbscanModel.fit_predict(X)

# Subset of outliers
outliers = X[clusters == -1]
outliers.describe()

# Subset of non-outliers (points not labeled -1 by DBSCAN)
nonoutliers = X[clusters != -1]
nonoutliers.describe()
# Plot art gallery and club ratings
p = sns.scatterplot(
data=X, x='Art', y='Clubs', hue=clusters, style=clusters, palette='Paired_r'
)
p.set_xlabel('Art galleries', fontsize=14)
p.set_ylabel('Clubs', fontsize=14)
plt.legend(labels=['Non-outlier', 'Outlier'])
plt.show()

# Plot restaurant and beach ratings


p = sns.scatterplot(
data=X,
x='Restaurants',
y='Beaches',
hue=clusters,
style=clusters,
palette='Paired_r',
)
p.set_xlabel('Restaurants', fontsize=14)
p.set_ylabel('Beaches', fontsize=14)
plt.legend(labels=['Non-outlier', 'Outlier'])
plt.show()

LAB: Grouping mammal sleep habits using k-means clustering

The msleep dataset contains information on sleep habits for 83 mammals. Features
include total sleep, length of the sleep cycle, time spent awake, brain weight, and body
weight. Animals are also labeled with their name, genus, and conservation status.
 Load the dataset msleep.csv into a data frame.
 Create a new data frame X with sleep_total and sleep_cycle.
 Initialize a k-means clustering model with 4 clusters and random_state = 0.
 Fit the model to the data subset X.
 Find the centroids of the clusters in the model.
 Graph the clusters using the cluster numbers to specify colors.
 Find the within-cluster sum of squares for 1, 2, 3, 4, and 5 clusters.

from sklearn.cluster import KMeans


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Load the dataset
mammalSleep = pd.read_csv('msleep.csv') # Your code here

# Clean the data


mammalSleep = mammalSleep.dropna()

# Create a dataframe with the columns sleep_total and sleep_cycle


X = mammalSleep[['sleep_total', 'sleep_cycle']] # Your code here

# Initialize a k-means clustering model with 4 clusters and random_state = 0


km = KMeans(n_clusters=4, random_state=0) # Your code here

# Fit the model


km.fit(X) # Your code here

# Find the centroids of the clusters


mammalSleepCentroids = km.cluster_centers_ # Your code here
print(mammalSleepCentroids)

# Predict the cluster for each data point in mammal_sleep


mammalSleep['cluster'] = km.predict(X) # Your code here

plt.figure(figsize=(6, 6))

# Graph the clusters


# Your code here
sns.scatterplot(data=mammalSleep, x='sleep_total', y='sleep_cycle', hue='cluster',
palette='Set2')
plt.xlabel('Total sleep', fontsize=14)
plt.ylabel('Length of sleep cycle',fontsize=14)
plt.savefig('msleep_clusters.png')

WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    km = KMeans(n_clusters=j)
    mammalSleepKmWCSS = km.fit(X)
    intermediateWCSS = km.inertia_  # find the within-cluster sum of squares
    WCSS.append(round(intermediateWCSS, 1))

print(WCSS)
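An optional follow-up (not required by the lab): plotting WCSS against the number of clusters makes the elbow easier to spot.

plt.figure(figsize=(6, 4))
plt.plot(k, WCSS, 'b*-')
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('WCSS', fontsize=14)
plt.show()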

Analyzing factors in forest fire data using PCA


The forestfires dataset contains meteorological information and the area burned for 517
forest fires that occurred in Montesinho Natural Park in Portugal. The columns of interest
are FFMC, DMC, DC, ISI, temp, RH, wind, and rain.
 Read in the file forestfires.csv.
 Create a new data frame X from the columns FFMC, DMC, DC, ISI, temp, RH, wind,
and rain, in that order.
 Calculate the correlation matrix for the data in X.
 Scale the data.
 Use sklearn's PCA function to perform four-component factor analysis on the scaled
data.
 Print the factors and the explained variance.
# Import the necessary modules
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

# Read in forestfires.csv
fires = pd.read_csv('forestfires.csv') # Your code here

# Create a new data frame with the columns FFMC, DMC, DC, ISI, temp, RH, wind, and rain, in that order
X = fires[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']] # Your code here

# Calculate the correlation matrix for the data in the data frame X
XCorr = X.corr() # Your code here
print(XCorr)

# Scale the data.
scaler = StandardScaler() # Your code here
firesScaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']
)
# Your code here

# Perform four-component factor analysis on the scaled data.
pca = PCA(n_components=4)
firesPCA = pca.fit_transform(firesScaled)
# Your code here

# Print the factors and the explained variance.
print("Factors: ", pca.components_) # Your code here

print("Explained variance: ", pca.explained_variance_) # Your code here
Chapter 11
Building a regression tree using scikit-learn.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Seed the random number generator


rng = np.random.RandomState(39)

# Read in the data


raptorExample = pd.read_csv('raptorExample.csv')

# Encode sex as a dummy variable


raptorExampleWithDummy = pd.get_dummies(raptorExample, drop_first=True)

# Assign outcome to y and features to X


y = raptorExampleWithDummy['Wing']
X = raptorExampleWithDummy.drop('Wing', axis=1)

# Define model
raptorRT = DecisionTreeRegressor(max_depth=2, min_samples_leaf=3,
random_state=rng)

# Fit the model


# Your code goes here
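# A possible completion (sketch): fit the regression tree to the features and outcome
raptorRT = raptorRT.fit(X, y)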

# Print regression tree


print(export_text(raptorRT, feature_names=X.columns.to_list()))

Classification trees in Python.

# Import packages and functions


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn import metrics, tree

# Load the penguins data


penguins = pd.read_csv('palmer_penguins.csv')
# Drop penguins with missing values
penguins = penguins.dropna()

# Calculate summary statistics using .describe()


penguins.describe(include='all')

# Save output features as y


y = penguins[['species']]

# Save input features as X


X = penguins[['flipper_length_mm', 'bill_length_mm']]

# Initialize the model


classtreeModel = DecisionTreeClassifier(max_depth=2)

# Fit the model


classtreeModel = classtreeModel.fit(X, y)

# Print tree as text


print(export_text(classtreeModel, feature_names=X.columns.to_list()))

# Resize the plotting window


plt.figure(figsize=[12, 8])

# Values in brackets represent classes in alphabetical order


# [Adelie, Chinstrap, Gentoo]
p = tree.plot_tree(classtreeModel, feature_names=X.columns, filled=False, fontsize=10)

# Calculate cross-entropy and error rate

print("Cross-entropy: ", metrics.log_loss(y, classtreeModel.predict_proba(X)))


print("Error rate: ", 1 - metrics.accuracy_score(y, classtreeModel.predict(X)))

# Calculate the confusion matrix


metrics.confusion_matrix(y, classtreeModel.predict(X))

# Plot the confusion matrix


metrics.ConfusionMatrixDisplay.from_predictions(y, classtreeModel.predict(X))

# Calculate the Gini index


probs = pd.DataFrame(data=classtreeModel.predict_proba(X))

print("Gini index: ", (probs * (1 - probs)).mean().sum())


11.3.2: Building a classification tree using scikit-learn.

The dataset contains age and body measurements for a sample of hawks observed near
Iowa City, Iowa.
 Initialize the model using the DecisionTreeClassifier() type of classification tree
with min_samples_split of 3 and the random number generator random_state set
to rng.
The code contains all imports, loads the dataset, fits the model, and prints the tree.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# Seed random number generator


rng = np.random.RandomState(35)

# Load the dataset


raptor = pd.read_csv('raptor_Example.csv')

# Assign outcome to y and features to X


y = raptor['Age']
X = raptor.drop('Age', axis=1)

# Initialize the model -- decision tree classifier

# Your code goes here


raptorCT =

raptorCT = DecisionTreeClassifier(min_samples_split=3, random_state=rng)

# Fit the model


raptorCT.fit(X,y)

# Print classification tree


print(export_text(raptorCT, feature_names=X.columns.to_list()))

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
# Seed random number generator
rng = np.random.RandomState(49)

# Load the dataset


birdOfPrey = pd.read_csv('birdOfPrey_Example.csv')

# Assign outcome to y and features to X


y = birdOfPrey['Age']
X = birdOfPrey.drop('Age', axis=1)

# Initialize the model -- decision tree classifier


birdOfPreyCT = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
min_samples_leaf=1, random_state=rng)

# Fit the model

# Your code goes here
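# A possible completion (sketch): fit the classification tree to the features and outcome
birdOfPreyCT = birdOfPreyCT.fit(X, y)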

# Print classification tree
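# A possible completion (sketch): print the fitted tree as text
print(export_text(birdOfPreyCT, feature_names=X.columns.to_list()))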

Classification random forests in Python.


# Import packages and functions
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics, tree


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the penguins data


penguins = pd.read_csv('palmer_penguins.csv')

# Drop penguins with missing values


penguins = penguins.dropna()

# Calculate summary statistics using .describe()


penguins.describe(include='all')

# y = output features
y = penguins['species']

# X = input features
X = penguins.drop('species', axis=1)

# Convert categorical inputs like species and island into dummy variables
X = pd.get_dummies(X, drop_first=True)

# Create a training/testing split


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=8675309
)

# Initialize the random forest model


rfModel = RandomForestClassifier(max_depth=2, max_features='sqrt',
random_state=99);

# Fit the random forest model on the training data


rfModel.fit(X_train, y_train);

pd.DataFrame(
data={
'feature': rfModel.feature_names_in_,
'importance': rfModel.feature_importances_,
}
).sort_values('importance', ascending=False)

# Predict species on the testing data


y_pred = rfModel.predict(X_test)

# Calculate a confusion matrix


metrics.confusion_matrix(y_test, y_pred)

# Plot the confusion matrix


metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

# Calculate the Gini index


probs = pd.DataFrame(data=rfModel.predict_proba(X_test))
print("Gini index ", (probs * (1 - probs)).mean().sum())

# Save the first random forest tree as singleTree


singleTree = rfModel.estimators_[0]

# Set image size


plt.figure(figsize=[15, 8])
# Plot a single classification tree
tree.plot_tree(singleTree, feature_names=X.columns, filled=False, fontsize=10);

Building random forest classification trees using scikit-learn.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Seed random number generator


rng = np.random.RandomState(29)

# Load the dataset


birdOfPrey = pd.read_csv('birdOfPrey_Example.csv')

# Assign outcome to y and features to X


y = birdOfPrey['Species']
X = birdOfPrey.drop('Species', axis=1)

# Split dataset into training data and testing data


XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=.3, random_state=rng)

# Initialize the model -- random forest classification trees


birdOfPreyRFC = # Your code goes here

birdOfPreyRFC = RandomForestClassifier(n_estimators=74, criterion='gini',
                                       max_features='sqrt', bootstrap=True, random_state=rng) # Your code goes here

# Fit the model with training data


birdOfPreyRFC = birdOfPreyRFC.fit(XTrain, yTrain)

# Print first and last random trees generated in the forest


print('First tree:')
print(export_text(birdOfPreyRFC[0], feature_names=X.columns.to_list()))
print('Last tree:')
print(export_text(birdOfPreyRFC[74-1], feature_names=X.columns.to_list()))

LAB: Creating a regression tree using mpg data


The dataset mpg contains information on miles per gallon (mpg) and engine size for cars
sold from 1970 through 1982. The dataset has the
features mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, and name.
 Load the mpg.csv dataset.
 Create a dataframe, X, using weight and model_year as features.
 Create a dataframe, y, using mpg.
 Initialize a regression tree with random_state = 100 that has depth 3 and a
minimum number of samples in each leaf of 5.
 Fit the regression tree on X.

# Load the necessary libraries


import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Load the mpg dataset


mpg = # Your code here

# Subset the data containing weight and model_year


X = # Your code here

# Subset the data containing mpg


y = # Your code here

# Initialize a regression tree with random_state = 100


# that has depth 3 and a minimum number of samples in each leaf of 5
mpgRT = # Your code here

# Fit the X and y data


# Your code here

# Print regression tree


print("max_depth = %s, %s"% (mpgRT.max_depth, mpgRT.random_state))
# Print tree
print(export_text(mpgRT, feature_names=X.columns.to_list()))
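A possible set of completions for the placeholders above (a sketch; it assumes mpg.csv uses the column names listed in the lab description):

mpg = pd.read_csv('mpg.csv')
X = mpg[['weight', 'model_year']]
y = mpg[['mpg']]
mpgRT = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=100)
mpgRT.fit(X, y)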

Chapter 12

Perceptron models in Python.

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Loads haberman.csv
haberman = pd.read_csv('haberman.csv')

# Slices the features of the dataset


X = haberman[['Age', 'Year', 'Nodes']]
y = haberman[["Survived"]]

# Scales the features


scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=['Age', 'Year', 'Nodes'])

# Splits the data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=123
)

# Initializes and fits a perceptron model


clf = Perceptron(tol=0.00001, eta0=0.1, max_iter=20000);
clf.fit(X_train, np.ravel(y_train));

# Creates a list of predictions from the test features


y_pred = clf.predict(X_test)

# Finds the accuracy score


accuracy_score(y_pred, y_test)

# Displays a heatmap for the confusion matrix


sns.heatmap(confusion_matrix(y_pred, y_test), annot=True)

Single-layer perceptron using scikit-learn.


# Import packages and functions
import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the dataset
heart = pd.read_csv('heart.csv')

# Slices the features of the dataset


X = heart[['trestbps', 'age', 'thalach']]
y = heart[['target']]

# Scales the features


scaler = StandardScaler()
XScaled = pd.DataFrame(scaler.fit_transform(X), columns=['trestbps','age','thalach'])

# Splits the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.2,
random_state=123)

# Initializes and fits a perceptron model


pModel = # Your code goes here
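# A possible completion (sketch). The exercise text above does not list specific
# hyperparameters, so these values mirror the earlier haberman example and are assumptions.
pModel = Perceptron(tol=0.00001, eta0=0.1, max_iter=20000)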
pModel.fit(XTrain, np.ravel(yTrain))

print(pModel.coef_)
print(pModel.intercept_)

Multilayer perceptron models in Python.

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

homes = pd.read_csv('homes.csv')

# Loads input and output features


X = homes[['Bed', 'Floor']]
y = homes[['Price']]

# Splits the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), random_state=123)
# Initializes and trains a multilayer perceptron regressor model on the training set
# This cell takes a long time to run.
mlpReg_train = MLPRegressor(
random_state=1, max_iter=500000, hidden_layer_sizes=[1]
).fit(X_train, np.ravel(y_train))

# Predicts the price of a 5 bedroom house with 2,896 sq ft


mlpReg_train.predict([[5, 2.896]])

# Plots the loss curves for the training sets


f, ax = plt.subplots(1, 1)
sns.lineplot(
x=range(len(mlpReg_train.loss_curve_)), y=mlpReg_train.loss_curve_, label='Training'
)
ax.set_xlabel('Epochs', fontsize=14);
ax.set_ylabel('Loss', fontsize=14);

# Compare the final loss between train and test sets


print(mlpReg_train.loss_)
print(
    mean_squared_error(y_test, mlpReg_train.predict(X_test)) / 2
)  # divided by 2 because MLPRegressor's training loss is half the mean squared error

# Obtains the final weights and biases


print(mlpReg_train.coefs_)
print(mlpReg_train.intercepts_)

Multilayer perceptron using scikit-learn.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Seed random number generator


rng = np.random.RandomState(26)

# Loads the cabsNY.csv dataset


cabsNY = pd.read_csv('cabsNY.csv')

# Loads predictor and target variables


X = cabsNY[['fare','toll']].to_numpy() # converted to numpy type array
y = cabsNY[['distance']]
# Splits the data into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, np.ravel(y),random_state=rng)

# Initializes and trains a multilayer perceptron regressor model on the training and
validation sets

multLayerPercModelTrain = # Your code goes here


multLayerPercModelValidation = # Your code goes here
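# A possible completion (sketch). The hidden layer size and iteration limit are assumptions
# (they mirror the earlier homes example), and fitting the 'validation' model on the test
# split is only one reading of the prompt.
multLayerPercModelTrain = MLPRegressor(random_state=rng, max_iter=500000, hidden_layer_sizes=[1]).fit(XTrain, yTrain)
multLayerPercModelValidation = MLPRegressor(random_state=rng, max_iter=500000, hidden_layer_sizes=[1]).fit(XTest, yTest)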

# Predicts the distance of a taxi ride with a specific fare and toll cost
print(multLayerPercModelTrain.predict([[4, 7]]))

# Prints the final weights, biases, and losses


weights = multLayerPercModelTrain.coefs_
biases = multLayerPercModelTrain.intercepts_
loss = multLayerPercModelTrain.loss_
print('{}\n{}\n{}'.format(weights, biases, loss))

LAB: Single-layer perceptron


The nbaallelo_log file contains data on 126314 NBA games from 1947 to 2015. The
dataset includes the features pts, elo_i, win_equiv, and game_result. Using the csv
file nbaallelo_log.csv and sklearn's Perceptron function, construct a perceptron model to
classify whether a team will win or lose a game based on the
features pts, elo_i, win_equiv. Complete the program with the following tasks:
 Scale the features in X and y.
 Use the Perceptron function to initialize and fit a perceptron model with a learning
rate of 0.05 and 20000 epochs.
 Print the weights for the input variables and bias term.
 Find the accuracy score.
Note: The program reads the csv file's name from the user.

import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Load input into a dataframe


NBA = pd.read_csv(input())
# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1

# Store relevant columns as variables


X = NBA[['pts','elo_i','win_equiv']]
y = NBA[['game_result']].astype(int)

# Scale the input features


scaler = StandardScaler()
XScaled = # Your code here

np.random.seed(42)

# Split the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.3,
random_state=123)

# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = # Your code here
# Fit the perceptron model
# Your code here

# Create a list of predictions from the test features


yPred = # Your code here

# Find the weights for the input variables


weightVar = # Your code here
print(weightVar)

# Find the weights for the bias term


weightBias = # Your code here
print(weightBias)

# Find the accuracy score


score = # Your code here
print('%.3f' % score)

import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Load input into a dataframe


NBA = pd.read_csv(input())

# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1

# Store relevant columns as variables


X = NBA[['pts','elo_i','win_equiv']]
y = NBA[['game_result']].astype(int)

# Scale the input features


scaler = StandardScaler()
XScaled = pd.DataFrame(scaler.fit_transform(X), columns=['pts', 'elo_i', 'win_equiv'])
yScaled = pd.DataFrame(scaler.fit_transform(y), columns=['game_result']) # Your code here

np.random.seed(42)

# Split the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.3,
random_state=123)

# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = Perceptron(eta0=0.05, max_iter=20000) # Your code here

# Fit the perceptron model
classifyNBA.fit(XTrain, np.ravel(yTrain)) # Your code here

# Create a list of predictions from the test features


yPred = classifyNBA.predict(XTest) # Your code here

# Find the weights for the input variables


weightVar = classifyNBA.coef_ # Your code here
print(weightVar)

# Find the weights for the bias term


weightBias = classifyNBA.intercept_ # Your code here
print(weightBias)

# Find the accuracy score
score = accuracy_score(yPred, yTest) # Your code here
print('%.3f' % score)
