Python Codes
Chapter 2
Example            Description
list.append(x)     Adds x to the end of the list
list.remove(x)     Removes x from the list
list1 + list2      Concatenates the two lists
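A short illustration of these list operations:

groceries = ['milk', 'eggs']
groceries.append('bread')    # ['milk', 'eggs', 'bread']
groceries.remove('eggs')     # ['milk', 'bread']
groceries + ['jam']          # ['milk', 'bread', 'jam']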
{33, 4, 'abc'} and {'abc', 4, 33} are the same set, since sets are not ordered.
dict = {'LAX': 161, 'DEN': 141} creates a dictionary with keys 'LAX' and 'DEN' and
values 161 and 141.
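A quick sketch of both points (using the name airports instead of dict, which would shadow the built-in):

airports = {'LAX': 161, 'DEN': 141}
airports['LAX']                     # 161
{33, 4, 'abc'} == {'abc', 4, 33}    # True, since sets are unordered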
Functions examples:

Example 1 (local variable):
def changeName():
    employeeName = 'Juliet'    # local variable; does not change the global name

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)    # prints 'Employee name: Romeo'

Example 2 (global variable):
def changeName():
    global employeeName
    employeeName = 'Juliet'    # reassigns the global name

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)    # prints 'Employee name: Juliet'
NumPy functions are written with the prefix 'numpy' or an alias. The tables omit
this prefix. Ex: sort(array) stands for numpy.sort(array).
delete(arr, obj, axis=None)
    Deletes a slice of input array arr. axis is the axis along which to remove a slice. obj is the index of the slice along the axis.

full(shape, fill_value, dtype=None)
    Returns an array filled with fill_value. The shape tuple specifies the array shape. dtype specifies the array type. If dtype=None, the type is inferred from fill_value.

insert(arr, obj, values, axis=None)
    Inserts array values into input array arr. axis is the axis along which to insert. obj is the index before which values is inserted.

zeros(shape, dtype=float)
    Returns an array filled with zeros. The shape tuple specifies the array shape. dtype specifies the array type.

ones(shape, dtype=None)
    Returns an array filled with ones. The shape tuple specifies the array shape. dtype specifies the array type. If dtype=None, the type is float64.

sort(a, axis=-1)
    Sorts array a along axis. The default axis=-1 sorts along the last axis in a. axis=None flattens a before sorting.
Shape functions.

ravel(a, order='C')
    Returns flattened array a.

reshape(a, newshape, order='C')
    Returns an array with the same data as a but a different shape. newshape is an integer or tuple of integers that specifies the new shape. The new shape must have the same number of elements as the original shape.
Matrix functions.

matmul(array1, array2)
    Matrix product of array1 and array2.

cross(array1, array2)
    Cross product of array1 and array2.
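A few worked calls showing these functions (all assume import numpy as np):

import numpy as np
arr = np.array([1, 2, 3, 4])
np.delete(arr, 1)                  # array([1, 3, 4])
np.insert(arr, 1, 9)               # array([1, 9, 2, 3, 4])
np.full((2, 2), 7)                 # 2x2 array of sevens
np.sort(np.array([3, 1, 2]))       # array([1, 2, 3])
m = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]
np.ravel(m)                        # array([0, 1, 2, 3, 4, 5])
np.matmul(m, m.T)                  # 2x2 matrix product
np.cross([1, 0, 0], [0, 1, 0])     # array([0, 0, 1])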
Slicing.

a:b    Index values from a to b-1
a:     Index values from a onwards
Comparison operators.

Operator    Description
==          Equal to
!=          Not equal to
<           Less than
<=          Less than or equal to
>           Greater than
>=          Greater than or equal to

Logical operators.

Operator    Description
and         True if both operands are True
or          True if at least one operand is True
not         Inverts the truth value of the operand
at[index, column]
    Returns the dataframe value stored at index and column.

replace(to_replace=None, value=NoDefault.no_default, inplace=False)
    Replaces to_replace values in the dataframe with value. to_replace and value may be strings, dictionaries, lists, regular expressions, or other data types.
Format string    Description
y                Yellow
v                Triangle-down marker
H                Hexagon2 marker
:                Dotted line
>                Triangle-right marker
|                Vertical line marker
-.               Dash-dot line
p                Pentagon marker
s                Square marker
Example:
# Load packages
import matplotlib.pyplot as plt
import pandas as pd
# Load oldfaithfulCluster.csv data
df = pd.read_csv('oldfaithfulCluster.csv')
plt.subplot(2, 1, 1)
plt.scatter(df['Eruption'], df['Waiting'])
plt.suptitle('Eruption time vs. waiting time', fontsize=20, c='black')
plt.ylabel('Waiting time', fontsize=14)
plt.subplot(2, 1, 2)
group1 = df[df['Cluster'] == 1]
group0 = df[df['Cluster'] == 0]
plt.scatter(group1['Eruption'], group1['Waiting'], label='1', edgecolors='white')
plt.scatter(group0['Eruption'], group0['Waiting'], label='0', edgecolors='white')
plt.xlabel('Eruption time', fontsize=14)
plt.ylabel('Waiting time', fontsize=14)
plt.legend()
Chapter 3
Pandas descriptive statistics methods.

DataFrame.mean(axis=None, skipna=True)
DataFrame.median(axis=None, skipna=True)
    Returns the mean or median of the values over the requested axis. skipna=True excludes NA/null values.

DataFrame.skew(axis=None, skipna=True)
    Returns the skewness of the values over the requested axis.
Using a descriptive statistics method, calculate the mean number of homes sold
("sales") over all cities.
Normal: norm.pdf(x, loc, scale), norm.cdf(x, loc, scale)
    loc=μ sets the mean and scale=σ sets the standard deviation. norm.pdf() returns the density curve's value at x, and norm.cdf() returns the probability P(X ≤ x).

t: t.pdf(x, df), t.cdf(x, df)
    df sets the degrees of freedom for the distribution. t.pdf() returns the density curve's value at x, and t.cdf() returns the probability P(X ≤ x).
# Calculate P(X<=0)
t.cdf(x=0, df=4)
# Using the symmetry of the t-distribution curve, calculate P(X < -2 or X > 2)
t.cdf(x=-2, df=4) * 2
proportions_ztest(count, nobs, value, alternative, prop_var=False)
    Returns the test statistic and p-value for a hypothesis test based on a normal (z) test. count is the number (or array) of successes and nobs is the number (or array) of observations; both take a single value for a one-proportion test and an array of values for a two-proportion test. value is the value in the null hypothesis, alternative is the type of the alternative hypothesis, and prop_var=False estimates the variance based on the sample proportions.
proportion_confint(count, nobs, alpha, method='normal')
    Returns a (1-alpha)*100% confidence interval for a population proportion. count is the number of successes, nobs is the number of observations, alpha is the significance level, and method='normal' uses the normal approximation to calculate the interval.
ttest_1samp(a, popmean, alternative)
    Returns the t-statistic and p-value from a one-sample t-test for the null hypothesis that the population mean of a sample, a, is equal to a specified value. popmean is the value in the null hypothesis and alternative is the type of the alternative hypothesis.
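Illustrative calls for these three functions (the numbers are made up for the example):

from statsmodels.stats.proportion import proportions_ztest, proportion_confint
from scipy.stats import ttest_1samp

stat, pvalue = proportions_ztest(count=42, nobs=100, value=0.5)    # one-proportion z-test
low, high = proportion_confint(count=42, nobs=100, alpha=0.05)     # 95% confidence interval
tstat, tpvalue = ttest_1samp(a=[5.1, 4.8, 5.3, 5.0], popmean=5)    # one-sample t-test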
Lab: The mtcars dataset contains data from the 1974 Motor Trends magazine, and
includes 10 features of performance and design from a sample of 32 cars.
Import the csv file mtcars.csv as a data frame using a pandas module function.
Find the mean, median, and mode of the column wt.
Print the mean and median.
import pandas as pd
# Read in the file mtcars.csv
cars = pd.read_csv('mtcars.csv')  # Your code here
# Find the mean of the column wt
mean = cars['wt'].mean()  # Your code here
# Find the median of the column wt
median = cars['wt'].median()  # Your code here
# Find the mode of the column wt
mode = cars['wt'].mode()  # Your code here
print("mean = {:.5f}, median = {:.3f}".format(mean, median))
The intelligence quotient (IQ) of a randomly selected person follows a normal distribution
with a mean of 100 and a standard deviation of 15. Use the scipy function norm and user
input values for IQ1 and IQ2 to perform the following tasks:
Calculate the probability that a randomly selected person will have an IQ less than
or equal to IQ1.
Calculate the probability that a randomly selected person will have an IQ
between IQ1 and IQ2.
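A minimal sketch using norm.cdf(), assuming IQ1 and IQ2 come from user input:

from scipy.stats import norm
IQ1 = float(input())
IQ2 = float(input())
# Probability that a randomly selected person has an IQ of at most IQ1
probLess = norm.cdf(IQ1, loc=100, scale=15)
# Probability that a randomly selected person has an IQ between IQ1 and IQ2
probBetween = norm.cdf(IQ2, loc=100, scale=15) - norm.cdf(IQ1, loc=100, scale=15)
print(probLess, probBetween)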
The gpa dataset is a toy dataset containing the features height and gpa for 35 students.
Use the statsmodels function proportions_ztest and the user defined values for the
proportion for the null hypothesis value and the gpa cutoff cutoff to perform the following
tasks:
Load the gpa.csv data set.
Find the number of students with a gpa greater than cutoff.
Find the total number of students.
Perform a z-test for the user input expected proportion. Modify
the prop_var parameter to use the user input expected proportion instead of the
sample proportion to calculate the standard error.
Determine if the hypothesis that the actual proportion is different from the
expected proportion should be rejected at the alpha = 0.01 significance level.
import statsmodels.stats as st
from statsmodels.stats.proportion import proportions_ztest
import pandas as pd
# Read in gpa.csv
gpa = pd.read_csv('gpa.csv')  # Your code here
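A sketch of the remaining steps, assuming value and cutoff are the user-defined null proportion and gpa cutoff:

# Number of students with a gpa greater than cutoff
count = (gpa['gpa'] > cutoff).sum()
# Total number of students
nobs = gpa.shape[0]
# z-test using the expected proportion (not the sample proportion) for the standard error
stat, pvalue = proportions_ztest(count, nobs, value=value, prop_var=value)
print(stat, pvalue)
# Reject the null hypothesis if pvalue < 0.01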
Chapter 4
Common operators.

Operator    Description                     Example                Value
<           Compares two values with <      2 < 2                  FALSE
>=          Compares two values with ≥      'apple' >= 'banana'    FALSE
Operator precedence.

Precedence    Operators
1             - (unary)
2             ^
3             * / %
4             + - (binary)
6             NOT
7             AND
8             OR
WHERE clause.
SELECT Expression1, Expression2, ...
FROM TableName
WHERE Condition;
The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies released before January 1, 2000.
Modify the SELECT statement to select the title and release date of PG-13 movies that
are released after January 1, 2008.
Run your solution and verify the result table shows just the titles and release dates
for The Dark Knight and Crazy Rich Asians.
LIKE
% matches any number of characters. Ex: LIKE 'L%t' matches "Lt", "Lot", "Lift", and
"Lol cat".
_ matches exactly one character. Ex: LIKE 'L_t' matches "Lot" and "Lit" but not "Lt"
and "Loot".
The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies.
Modify the SELECT statement to select movies with the word "star" somewhere in the
title.
Run your solution and verify the result table shows just the movies Rogue One: A Star
Wars Story, Star Trek and Stargate.
CREATE TABLE Movie (
ID INT AUTO_INCREMENT,
Title VARCHAR(100),
Rating CHAR(5) CHECK (Rating IN ('G', 'PG', 'PG-13', 'R')),
ReleaseDate DATE,
PRIMARY KEY (ID)
);
Simple functions.

Numeric:
LOG(n)
    Natural logarithm of n. Ex: SELECT LOG(10); returns 2.302585.

String:
CONCAT(s1, s2, ...)
    Concatenation of the strings s1, s2, ... Ex: SELECT CONCAT('Dis', 'en', 'gage'); returns 'Disengage'.
LOWER(s)
    s converted to lower case. Ex: SELECT LOWER('MySQL'); returns 'mysql'.
UPPER(s)
    s converted to upper case. Ex: SELECT UPPER('mysql'); returns 'MYSQL'.
REPLACE(s, from, to)
    s with all occurrences of from replaced by to. Ex: SELECT REPLACE('Orange', 'O', 'St'); returns 'Strange'.
SUBSTRING(s, pos, len)
    Substring of s that starts at position pos with length len. Ex: SELECT SUBSTRING('Boomerang', 1, 4); returns 'Boom'.

Date/time:
DATEDIFF(dt1, dt2), TIMEDIFF(dt1, dt2)
    Difference of dt1 − dt2, in number of days or amount of time. Ex: SELECT DATEDIFF('2013-03-10', '2013-03-04'); returns 6.
import mysql.connector
from mysql.connector import errorcode

try:
    reservationConnection = mysql.connector.connect(
        user='samsnead',
        password='*jksi72$',
        host='127.0.0.1',
        database='Reservation')
except mysql.connector.Error as err:
    # An except clause is required before else; a minimal handler is shown here
    print(err)
else:
    # Execute database operations...
    reservationConnection.close()

flightCursor = reservationConnection.cursor()
flightQuery = ('SELECT FlightNumber, DepartureTime FROM Flight '
               'WHERE AirportCode = %s AND AirlineName = %s')
flightData = ('PEK', 'China Airlines')
flightCursor.execute(flightQuery, flightData)
flightCursor.close()
Chapter 5
string[start:end]
    Returns the substring of string that begins at the index start and ends at the index end - 1.

string.capitalize(), string.upper(), string.lower(), string.title()
    Returns a copy of string with the initial character uppercase, all characters uppercase, all characters lowercase, or the initial character of all words uppercase, respectively.
to_datetime(arg)
    Converts arg to datetime data type and returns the converted object. The data type of arg may be int, float, str, datetime, list, tuple, one-dimensional array, Series, or DataFrame.

to_numeric(arg)
    Converts arg to numeric data type and returns the converted object. The data type of arg may be scalar, list, tuple, one-dimensional array, or Series.
df.astype(dtype, copy=True)
    Converts the data type of all dataframe df columns to dtype. To alter individual columns, specify dtype as {col: dtype, col: dtype, ...}.

df.insert(loc, column, value)
    Inserts a new column into dataframe df. loc specifies the integer position of the new column, column specifies a string or numeric column label, and value is a Scalar, Series, or Array of values for the new column.

pd.concat(objs, axis=0, join='outer', ignore_index=False)
    Appends the dataframes specified in the objs parameter. Appends rows if axis=0 or columns if axis=1. join specifies whether to perform an 'outer' or 'inner' join. Resulting index values are unchanged if ignore_index=False or renumbered if ignore_index=True.

df.merge(right, how='inner', on=None, sort=False)
    Joins df with the right dataframe. how specifies whether to perform a 'left', 'right', 'outer', or 'inner' join. on specifies join column labels, which must appear in both dataframes. If on=None, all matching labels become join columns. sort=True sorts rows on the join columns.
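Short worked calls for these dataframe operations:

import pandas as pd
df1 = pd.DataFrame({'key': [1, 2], 'a': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'b': [30, 40]})
pd.concat([df1, df2], axis=0, ignore_index=True)   # stack rows, renumber the index
df1.merge(df2, how='inner', on='key')              # join on the shared 'key' column
df1.insert(1, 'c', [5, 6])                         # new column 'c' at position 1
df1.astype({'a': float})                           # convert column 'a' to float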
import pandas as pd

# Read in hmeq_small.csv
hmeq = pd.read_csv('hmeq_small.csv')  # Your code here
# Create a new data frame with the rows with missing values dropped
hmeqDelete = hmeq.dropna()  # Your code here
# Create a new data frame with the missing values filled in by the mean of the column
hmeqReplace = hmeq.fillna(hmeq.mean(numeric_only=True))  # Your code here
# Print the means of the columns for each new data frame
print("Means for hmeqDelete are ", hmeqDelete.mean(numeric_only=True))
print("Means for hmeqReplace are ", hmeqReplace.mean(numeric_only=True))
import pandas as pd
from sklearn import preprocessing

# Read in forestfires.csv
fires = pd.read_csv('forestfires.csv')  # Your code here
# Create a new data frame with the columns FFMC, DMC, DC, ISI, temp, RH, wind, and rain, in that order
X = fires[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]  # Your code here
# Calculate the correlation matrix for the data in the data frame X
XCorr = X.corr()  # Your code here
print(XCorr)
Chapter 6
sns.stripplot(df, x='Numerical feature', y='Categorical feature')
    Creates a strip plot displaying the distribution of x for each group in y.

df.info()
    Displays the name, number of non-null values, and type of each feature in the dataframe.

df.boxplot()
    Plots a box plot for every column in the dataframe.

pd.plotting.scatter_matrix(df)
    Plots every pair of numerical features as an individual scatter plot. For more control, seaborn provides the function sns.pairplot(df).
# Create a new data frame with the columns "weight" and "mpg"
mpgSmall = mpg[['weight', 'mpg']]  # Your code here
print(mpgSmall)
# Create a scatter plot of weight vs mpg with x label "Weight" and y label "MPG"
# Your code here
plt.scatter(mpgSmall['weight'], mpgSmall['mpg'])
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.title('Weight vs MPG')
plt.savefig('mpg_scatter.png')
# Load titanic.csv
titanic = sns.load_dataset('titanic')  # Your code here
# Subset the titanic dataset to include first class passengers who embarked in Southampton
firstSouth = titanic[(titanic['pclass'] == 1) & (titanic['embarked'] == 'S')]  # Your code here
# Subset the titanic dataset to include either second or third class passengers
secondThird = titanic[(titanic['pclass'] == 2) | (titanic['pclass'] == 3)]  # Your code here
print(firstSouth.head())
print(secondThird.head())
# Create a bar chart for the first class passengers who embarked in Southampton, grouped by sex
sns.countplot(data=firstSouth, x='sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.title('First-Class Passengers from Southampton by Sex')
# Create a bar chart for the second and third class passengers grouped by survival status
sns.countplot(data=secondThird, x='survived')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Survival Count of 2nd and 3rd Class Passengers')
plt.legend(labels=["0", "1"], title="survived")  # Your code here
plt.savefig('titanic_bar_2.png')
Chapter 7
# Import data
crabs = pd.read_csv('crab-groups.csv')
# Compute the sum of squared errors for the least squares model
SSEreg = sum((y - yPredicted) ** 2)[0]
SSEreg
# Compute the sum of squared errors for the horizontal line model
SSEyBar = sum((y - np.mean(y)) ** 2)[0]
SSEyBar
John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
Initialize a linear regression model for predicting arrival delay based on departure
delay.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.
print('Intercept:', linearModel.intercept_[0])
Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
Initialize a linear regression model for predicting arrival delay based on departure
delay.
Fit the linear regression model.
The code contains all imports, loads the dataset, and prints the model's intercept.
print('Intercept:', linearModel.intercept_[0])
John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
Predict the arrival delay for a flight that departed 8 minutes late, and assign
variable yHat with the prediction.
Assign variable slope with the slope coefficient of the model.
The code contains all imports, loads the dataset, initializes and fits the model, and
prints yHat and slope once calculated.
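A minimal sketch covering these three flight-delay exercises, assuming a dataframe flights with hypothetical columns 'departure_delay' and 'arrival_delay':

from sklearn.linear_model import LinearRegression
X = flights[['departure_delay']]
y = flights[['arrival_delay']]
linearModel = LinearRegression()     # initialize the model
linearModel.fit(X, y)                # fit the model
yHat = linearModel.predict([[8]])    # arrival delay for a flight that departed 8 minutes late
slope = linearModel.coef_            # slope coefficient of the model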
# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
# Import data
crabs = pd.read_csv('crab-groups.csv')
# Make a prediction
yMultyPredicted = linModel.predict([[20, 3000]])
print(
    "Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds\n",
    "using the multiple linear regression is ",
    yMultyPredicted[0][0],
    "miles per gallon",
)
# Make a prediction
polyInputs = polyFeatures.fit_transform([[3000]])
yPolyPredicted = polyModel.predict(polyInputs)
print(
    "Predicted MPG for a car with Weight = 3000 pounds\n",
    "using the simple polynomial regression is ", yPolyPredicted[0][0], "miles per gallon",
)
# Make a prediction
polyInputs2 = polyFeatures2.fit_transform([[20, 3000]])
yPolyPredicted2 = polyModel2.predict(polyInputs2)
print(
    "Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds\n",
    "using the polynomial regression is ", yPolyPredicted2[0][0], "miles per gallon",
)
LaGuardia Airport (LGA) is a major airport serving New York City. LGA wanted to predict
the arrival delay of an incoming flight based on the departure delay. 50 recent flights
were randomly selected, and the arrival delays (in minutes) were recorded.
Initialize a multiple regression model for predicting arrival delay based on
departure delay and flight distance.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression
print('Intercept:', multipleModel.intercept_)
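A minimal sketch of the initialization step (the given code fits the model afterward):

# Initialize a multiple regression model; sklearn's LinearRegression handles multiple inputs
multipleModel = LinearRegression()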
John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
Create a dataframe containing month (month) and distance (distance) in that
order. Use the reshape() function to ensure the input features are in the proper
format.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.
print('Intercept:', multipleModel.intercept_)
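A sketch of the dataframe construction, assuming month and distance are one-dimensional arrays taken from the dataset:

import numpy as np
# Reshape each array into a column, then assemble the dataframe in the given order
X = pd.DataFrame(np.hstack((month.reshape(-1, 1), distance.reshape(-1, 1))),
                 columns=['month', 'distance'])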
Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
Predict the arrival delay for a flight with departure time of 1868 and distance of
1752, and assign variable yHat with the prediction.
Calculate the slope coefficients for multipleModel and assign slope with the result.
The code contains all imports, loads the dataset, fits the multiple regression model, and
prints yHat and slope once calculated.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression
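A sketch of the prediction step, assuming multipleModel was fit on departure time and distance in that order:

yHat = multipleModel.predict([[1868, 1752]])   # departure time 1868, distance 1752
slope = multipleModel.coef_                    # slope coefficients of the model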
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Classify whether tumor with radius mean 13 is benign (0) or malignant (1)
pHat = logisticModel.predict([[13]])
pHat[0]
# Probability of each class for a tumor with radius mean 13 (used by the print below)
pHatProb = logisticModel.predict_proba([[13]])
print(
    "A tumor with radius mean 13 has predicted probability: \n",
    pHatProb[0][0],
    "of being benign\n",
    pHatProb[0][1],
    "of being malignant\n",
    "and overall is classified to be benign",
)
The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on wind speed.
Fit the logistic regression model, logisticModel, to predict whether a wildfire will
occur.
The code contains all imports, loads the dataset, and prints the model coefficients.
The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on temperature.
Use the fitted logistic regression model, logisticModel, to predict whether a wildfire
will occur on a day with temperature = 25. Assign the prediction to pred.
The code contains all imports, loads the dataset, and prints the prediction.
The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on daily rainfall.
Use the fitted logistic regression model, logisticModel, to calculate the probabilities
of each outcome on a day with daily rainfall = 2. Assign the probabilities to prob.
The code contains all imports, loads the dataset, and prints the probabilities.
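A minimal sketch for these wildfire exercises, assuming X holds the single input feature and y the fire/no-fire labels:

import numpy as np
from sklearn.linear_model import LogisticRegression
logisticModel = LogisticRegression()
logisticModel.fit(X, np.ravel(y))            # fit the model
pred = logisticModel.predict([[25]])         # classify a day with temperature = 25
prob = logisticModel.predict_proba([[2]])    # outcome probabilities for daily rainfall = 2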
# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']
# Your code here
# Compute the proportion of variation explained by the linear regression using the
LinearRegression object's score method
score = SLRModel.score(X,y)
# Your code here
print('The proportion of variation explained by the linear regression model is ', end="")
print('%.3f' % score + ". ")
# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1
# Your code here
# Initialize and fit the logistic model using the LogisticRegression function
NBAmodel = LogisticRegression()
NBAmodel.fit(X,y)
# Your code here
print("A team with the given elo_i score has predicted probability: \n", end="")
print('%.3f' % outcomeProb[0][0] + " losing\n", end="")
print('%.3f' % outcomeProb[0][1] + " winning")
print("and the overall prediction is",
outcomePred[0])
Chapter 8
Binary classification metrics in Python.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Display accuracy
metrics.accuracy_score(y, yPredLowCutoff)
# Display precision
metrics.precision_score(y, yPredLowCutoff)
# Display recall
metrics.recall_score(y, yPredLowCutoff)
# Create a linear model using the training set and predictions using the test set
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
linModel = LinearRegression()
linModel.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
y_pred = np.ravel(linModel.predict(X_test.reshape(-1, 1)))
# Display MSE
metrics.mean_squared_error(y_test, y_pred)
# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)
# Display MAE
metrics.mean_absolute_error(y_test, y_pred)
# Create a quadratic model using the training set and predictions using the test set
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
poly = PolynomialFeatures().fit_transform(X_train.reshape(-1, 1))
poly_reg_model = LinearRegression().fit(poly, y_train)
poly_test = PolynomialFeatures().fit_transform(X_test.reshape(-1, 1))
y_pred = poly_reg_model.predict(poly_test)
# Display MSE
metrics.mean_squared_error(y_test, y_pred)
# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)
# Display MAE
metrics.mean_absolute_error(y_test, y_pred)
train_test_split(df, train_size=0.90)
A, B = train_test_split(df, test_size=0.05)
rng = np.random.RandomState(2)
rng = np.random.RandomState(42)
trainAndValidate, testingDataPercent = train_test_split(pines, test_size=testingPropPercent, random_state=rng)  # Your code goes here
rng = np.random.RandomState(33)
# Import dataset
badDrivers = pd.read_csv('bad-drivers.csv')
cross_val_score(M, X, y, cv=5)
bootstrapErrors = []
for i in range(0, 30):
    # Create the bootstrap sample and the out-of-bag sample
    boot = resample(badDrivers, replace=True, n_samples=51)
    oob = badDrivers[~badDrivers.index.isin(boot.index)]
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
# Import dataset
thurber = pd.read_csv('Thurber.csv')
# Carry out 10-fold cross-validation for a degree i polynomial regression model
polyscore = -cross_val_score(
    polyModel, XPoly, y, scoring='neg_mean_squared_error', cv=10
)
# Store the mean and standard deviation of the 10-fold cross-validation for the degree i polynomial regression model
cvMeans.append(np.mean(polyscore))
cvStdDev.append(np.std(polyscore))
# Graph the errorbar chart using the cross-validation means and std deviations
plt.errorbar(x=range(1, 7), y=cvMeans, yerr=cvStdDev, marker='o', color='black')
plt.xlabel('Degree of regression polynomial', fontsize=14)
plt.ylabel('Mean squared error', fontsize=14)
# Set seed
seed = 123
x = pd.array([1, 2, 3])
yhat = 213.13396131 + 37.92605345 * x
plt.subplot(1, 2, 1)
plt.subplot(1, 2, 2)
# Testing set subplot
p = sns.scatterplot(x=X_test['Floor'], y=y_test['Price'])
plt.plot(x, yhat, color='black')
p.set_xlabel('Square feet (1000s)', fontsize=14);
p.set_ylabel('Price ($1000s)', fontsize=14);
p.set_title('Testing dataset', fontsize=16);
p.set_ylim(140, 460);
The nbaallelo_slr dataset contains information on 126315 NBA games between 1947 and
2015. The columns report the points made by one team, the Elo rating of that team
coming into the game, the Elo rating of the team after the game, and the points made by
the opposing team. The Elo rating measures the relative skill of teams in a league.
The code creates a new column y in the data frame that is the difference
between pts and opp_pts.
Split the data into 70 percent training set and 30 percent testing set
using sklearn's train_test_split function. Set random_state=0.
Store elo_i and y from the training data as the variables X and y.
The code performs a simple linear regression on X and y.
Perform 10-fold cross-validation with the default scorer using scikit-
learn's cross_val_score function.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
nba = pd.read_csv("nbaallelo_slr.csv")
# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']
Solution:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
nba = pd.read_csv("nbaallelo_slr.csv")
# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']
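A sketch of the remaining solution steps described above:

# Split the data into 70 percent training and 30 percent testing sets
train, test = train_test_split(nba, train_size=0.7, random_state=0)
# Store elo_i and y from the training data
X = train[['elo_i']]
y = train['y']
# Simple linear regression on X and y
SLRModel = LinearRegression()
SLRModel.fit(X, y)
# 10-fold cross-validation with the default scorer
scores = cross_val_score(SLRModel, X, y, cv=10)
print(scores)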
Chapter 9
beans = pd.read_csv('Dry_Bean_Dataset.csv')
beans['Class'] = beans['Class'].str.capitalize()
print(beans.shape)
beans.describe()
# Initialize model
beanKnnClassifier = KNeighborsClassifier(n_neighbors=5)
# Split data
X = beans[['MajorAxisLength', 'MinorAxisLength']]
y = beans[['Class']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Take a sample to keep runtime low while seeing what areas are classified as each bean
beanSample = beans.sample(200, random_state=123)
beanSample.describe()
# Fit model
beanKnnClassifier.fit(X, np.ravel(y))
# Add legend
L = plt.legend()
L.get_texts()[0].set_text('Barbunya')
L.get_texts()[1].set_text('Bombay')
L.get_texts()[2].set_text('Cali')
L.get_texts()[3].set_text('Dermason')
L.get_texts()[4].set_text('Horoz')
L.get_texts()[5].set_text('Seker')
L.get_texts()[6].set_text('Sira')
This dataset contains data on sleep habits for 30 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
Initialize a k-nearest neighbors classification model with k=4.
The code contains all imports, loads the dataset, fits the model, and applies the model to a test dataset.
# Import dataset
sleep = pd.read_csv('sleep.csv')
knnModel = KNeighborsClassifier(n_neighbors=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Your code goes here
# Print predictions
print(knnModel.predict(X))
This dataset contains data on sleep habits for 25 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
REM sleep cycles of guinea pigs average 0.8 hours. Guinea pigs are awake on average
14.6 hours per day.
Use the kneighbors() method to find the instances in the training data that are
closest to guinea pigs. Assign the instances, but not the distances, to neighbors.
The code contains all imports, loads the dataset, initializes the model, and applies the
model to a test dataset.
# Import dataset
sleep = pd.read_csv('sleep.csv')
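A sketch of the kneighbors() step, assuming knnModel was fit on REM sleep hours and awake hours in that order:

# Indices of the training instances closest to guinea pigs (distances omitted)
neighbors = knnModel.kneighbors([[0.8, 14.6]], return_distance=False)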
# Print neighbors
print(neighbors)
1/200 * (200-25)/200 * (200-21)/200 * 200/400 = 0.0019578125
# Only use numeric values. Categorical values could be encoded as dummy variables.
X = penguinsClean[
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
]
Y = penguinsClean['species']
# Scale the input variables because SVM is dependent on differences in scale for distances
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# The coefficients of the hyperplanes for each pair of classes; each hyperplane has the form
# coefficient1*variable1 + coefficient2*variable2 + ... + intercept = 0
penguinsSVMlinear.coef_
The dataset SDSS contains 17 observational features and one class feature for 10000
deep sky objects observed by the Sloan Digital Sky Survey.
Use sklearn's KNeighborsClassifier function to perform kNN classification to classify each
object by the object's redshift and u-g color.
Import the necessary modules for kNN classification.
Create a dataframe X with features redshift and u_g.
Create dataframe y with feature class.
Initialize a kNN model with k=3.
Fit the model using the training data.
Find the predicted classes for the test data.
Calculate the accuracy score and confusion matrix.
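A minimal sketch of these steps, assuming the file name SDSS.csv and a 70/30 train/test split:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

sdss = pd.read_csv('SDSS.csv')    # file name assumed
X = sdss[['redshift', 'u_g']]
y = sdss['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
knnModel = KNeighborsClassifier(n_neighbors=3)
knnModel.fit(X_train, y_train)
pred = knnModel.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))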
Chapter 10
K-means clustering in Python.
# Load dataset
geyser = pd.read_csv('oldfaithful.csv')
geyser
# Visual exploration
p = sns.scatterplot(data=geyser, x='Eruption', y='Waiting')
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);
# Plot clusters
p = sns.scatterplot(
data=geyser, x='Eruption', y='Waiting', hue=clusters, style=clusters
)
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);
# Fit k-means clustering with k=1,...,5 and save WCSS for each
WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    kmModel = KMeans(n_clusters=j)
    kmModel = kmModel.fit(geyser)
    WCSS.append(kmModel.inertia_)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
print(kmeansModel.cluster_centers_)
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler
wine = pd.read_csv('wine1.csv')
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))
print(clusterModel)
Using pdist(), calculate a distance matrix for wines. The matrix of input features, X,
has already been created.
Use the distance matrix to cluster the wines with centroid linkage.
dist = pdist(X)
clusterModel = linkage(dist, method='centroid')
DBSCAN in Python.
# Create a smaller data frame with two variables: Price and Floor
homes_pf = homes[['Price', 'Floor']]
homes_pf.describe()
# Predict clusters
clusters = dbscanModel.fit_predict(homes_scaled)
clusters = pd.Categorical(clusters)
clusters
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
wine = pd.read_csv('wine1.csv')
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))
print(dbscanModel.labels_)
Use the DBSCAN clustering function to cluster wines. Keep eps and min_samples at
default values.
Fit the DBSCAN model to cluster wines.
dbscanModel = DBSCAN()
dbscanModel = dbscanModel.fit(wine)
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
print(X.corr())
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
wines = pd.read_csv('wines.csv')
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))
pcaModel = PCA(n_components=2)
pcaModel.fit(X)  # fit the PCA model before inspecting the explained variance
print(pcaModel.explained_variance_ratio_)
reviews = pd.read_csv('tripadvisor_review.csv').dropna()
# Drop user ratings
X = reviews.drop(axis=1, labels='User ID')
# Display eigenvalues
pcaModel.explained_variance_.round(3)
reviews = pd.read_csv('tripadvisor_review.csv').dropna()
# Drop user ID
X = reviews.drop(axis=1, labels='User ID')
# Subset of outliers
outliers = X[clusters == -1]
outliers.describe()
# Subset of non-outliers
nonoutliers = X[clusters == 0]
nonoutliers.describe()
# Plot art gallery and club ratings
p = sns.scatterplot(
data=X, x='Art', y='Clubs', hue=clusters, style=clusters, palette='Paired_r'
)
p.set_xlabel('Art galleries', fontsize=14)
p.set_ylabel('Clubs', fontsize=14)
plt.legend(labels=['Non-outlier', 'Outlier'])
plt.show()
The msleep dataset contains information on sleep habits for 83 mammals. Features
include total sleep, length of the sleep cycle, time spent awake, brain weight, and body
weight. Animals are also labeled with their name, genus, and conservation status.
Load the dataset msleep.csv into a data frame.
Create a new data frame X with sleep_total and sleep_cycle.
Initialize a k-means clustering model with 4 clusters and random_state = 0.
Fit the model to the data subset X.
Find the centroids of the clusters in the model.
Graph the clusters using the cluster numbers to specify colors.
Find the within-cluster sum of squares for 1, 2, 3, 4, and 5 clusters.
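A minimal sketch of the first steps, assuming the column names listed above (the WCSS loop below covers the last step):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
msleep = pd.read_csv('msleep.csv')
X = msleep[['sleep_total', 'sleep_cycle']]
kmModel = KMeans(n_clusters=4, random_state=0)    # initialize with 4 clusters
kmModel.fit(X)                                    # fit the model to X
print(kmModel.cluster_centers_)                   # centroids of the clusters
plt.scatter(X['sleep_total'], X['sleep_cycle'], c=kmModel.labels_)   # color by cluster number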
plt.figure(figsize=(6, 6))
WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    km = KMeans(n_clusters=j)
    mammalSleepKmWCSS = km.fit(X)
    intermediateWCSS = km.inertia_  # find the within-cluster sum of squares
    WCSS.append(round(intermediateWCSS, 1))
print(WCSS)
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text
# Define model
raptorRT = DecisionTreeRegressor(max_depth=2, min_samples_leaf=3,
                                 random_state=rng)
The dataset contains age and body measurements for a sample of hawks observed near
Iowa City, Iowa.
Initialize the model using the DecisionTreeClassifier() type of classification tree
with min_samples_split of 3 and the random number generator random_state set
to rng.
The code contains all imports, loads the dataset, fits the model, and prints the tree.
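A minimal sketch of the initialization, assuming rng is the random number generator defined by the given code:

from sklearn.tree import DecisionTreeClassifier
hawkTree = DecisionTreeClassifier(min_samples_split=3, random_state=rng)   # variable name hypothetical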
# y = output features
y = penguins['species']
# X = input features
X = penguins.drop('species', axis=1)
# Convert categorical inputs such as island and sex into dummy variables
X = pd.get_dummies(X, drop_first=True)
pd.DataFrame(
data={
'feature': rfModel.feature_names_in_,
'importance': rfModel.feature_importances_,
}
).sort_values('importance', ascending=False)
Chapter 12
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Loads haberman.csv
haberman = pd.read_csv('haberman.csv')
print(pModel.coef_)
print(pModel.intercept_)
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
homes = pd.read_csv('homes.csv')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
# Initializes and trains a multilayer perceptron regressor model on the training and validation sets
# Predicts the distance of a taxi ride with a specific fare and toll cost
print(multLayerPercModelTrain.predict([[4, 7]]))
import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
np.random.seed(42)
# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = # Your code here
# Fit the perceptron model
# Your code here
import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1
np.random.seed(42)
# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = Perceptron(eta0=0.05, max_iter=20000)  # Your code here
classifyNBA.fit(XTrain, np.ravel(yTrain))  # Fit the perceptron model
# Your code here