
Python Codes

Chapter 2

Example            Description
len(list)          Finds the length of the list
list.append(x)     Adds x to the end of the list
list.pop(x)        Removes the element with an index of x and returns that element
list.remove(x)     Removes x from the list
list1 + list2      Concatenates the two lists

 list = [10, 'abc'] creates a list with elements 10 and 'abc'.

 list = [] creates an empty list.

 len(tuple) returns the number of elements in tuple.

 tuple1 + tuple2 returns a tuple consisting of tuple1 followed by tuple2.

 set = {33, 4,'abc'} creates a set of three elements.

 {33, 4,'abc'} and {'abc', 4, 33} are the same set, since sets are not ordered.

 dict = {'LAX': 161, 'DEN': 141} creates a dictionary with keys 'LAX' and 'DEN' and
values 161 and 141.

 dict = {} creates an empty dictionary.

 del dict['Sofia'] removes the element with key 'Sofia'.

 dict['Rajesh'] = 'A+' either changes the value of an existing 'Rajesh' element to 'A+' or adds a new 'Rajesh' element.
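
A short illustrative sketch (the variable names and values are made up, not from the notes above) tying these list, tuple, set, and dictionary operations together:

airports = ['LAX', 'DEN']
airports.append('JFK')          # ['LAX', 'DEN', 'JFK']
airports.remove('DEN')          # ['LAX', 'JFK']
first = airports.pop(0)         # first is 'LAX'; airports is now ['JFK']
print(len(airports))            # 1
print(airports + ['SFO'])       # ['JFK', 'SFO']

coords = (33.9, -118.4)         # tuple
print(len(coords))              # 2

codes = {33, 4, 'abc'}          # set; element order is not significant
print(codes == {'abc', 4, 33})  # True

grades = {'Sofia': 'B', 'Rajesh': 'A'}
grades['Rajesh'] = 'A+'         # change an existing element
del grades['Sofia']             # remove an element by key
print(grades)                   # {'Rajesh': 'A+'}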

Functions example:

def calcPizzaVolume(pizzaDiameter, pizzaHeight):
    piVal = 3.14159265
    pizzaRadius = pizzaDiameter / 2.0
    pizzaArea = piVal * pizzaRadius * pizzaRadius
    pizzaVolume = pizzaArea * pizzaHeight
    return pizzaVolume

print('12.0 x 0.3 inch pizza is', calcPizzaVolume(12.0, 0.3), 'cubic inches')
print('16.0 x 0.8 inch pizza is', calcPizzaVolume(16.0, 0.8), 'cubic inches')

Example 2:

def changeName():
    employeeName = 'Juliet'

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)

PRINTS: "Employee name: Romeo"

def changeName():
    global employeeName
    employeeName = 'Juliet'

employeeName = 'Romeo'
changeName()
print('Employee name:', employeeName)

PRINTS: "Employee name: Juliet"

Without global, the assignment inside changeName() creates a local variable and leaves the module-level employeeName ('Romeo') unchanged; with global, changeName() overwrites the module-level variable with 'Juliet'.

# Define function that prints full name
def printName(first, last, lastFirst=False):
    if lastFirst:
        print(last + ', ' + first)
    if not lastFirst:
        print(first + ' ' + last)

# Call with keyword arguments
printName(first='Dana', last='Patel', lastFirst=True)

2.4: Data science packages


Import name (common alias): Description

numpy (np): NumPy includes functions and classes that aid in numerical computation. NumPy is used in many other data science packages.
pandas (pd): pandas provides methods and classes for tabular and time-series data.
sklearn (sk): scikit-learn provides implementations of many machine learning algorithms with a uniform syntax for preprocessing data, specifying models, fitting models with cross-validation, and assessing models.
matplotlib.pyplot (plt): matplotlib allows the creation of data visualizations in Python. The functions mostly expect NumPy arrays.
seaborn (sns): seaborn also allows the creation of data visualizations but works better with pandas DataFrames.
scipy.stats (sp.stats): SciPy provides algorithms and functions for computing problems that arise in science, engineering, and statistics. scipy.stats provides the functions for statistics.
statsmodels (sm): statsmodels adds functionality to Python to estimate many different kinds of statistical models, make inferences from those models, and explore data.

# Importing everything at the top of a notebook in this style prevents running much of the notebook only to find that a package still needs to be installed.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

2.5: NumPy package


Array functions

 NumPy functions are written with the prefix 'numpy' or an alias. The tables omit
this prefix. Ex: sort(array) stands for numpy.sort(array).

array(object, dtype=None, ndmin=0): Returns an array constructed from object. object must be a scalar or an ordered container, such as a tuple or list. The array element type is inferred from object unless a dtype is specified. ndmin is the minimum number of array dimensions.
delete(arr, obj, axis=None): Deletes a slice of input array arr. axis is the axis along which to remove a slice. obj is the index of the slice along the axis.
full(shape, fill_value, dtype=None): Returns an array filled with fill_value. The shape tuple specifies array shape. dtype specifies the array type. If dtype=None, the type is inferred from fill_value.
insert(arr, obj, values, axis=None): Inserts values into input array arr. axis is the axis along which to insert. obj is the index before which values is inserted.
zeros(shape, dtype=float): Returns an array filled with zeros. The shape tuple specifies array shape. dtype specifies the array type.
ones(shape, dtype=None): Returns an array filled with ones. The shape tuple specifies array shape. dtype specifies the array type. If dtype=None, the type is float64.
sort(a, axis=-1): Sorts array a along axis. The default axis=-1 sorts along the last axis in a. axis=None flattens a before sorting.
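
A minimal sketch exercising the array functions above (the values are arbitrary):

import numpy as np

a = np.array([3, 1, 2], dtype=float)  # element type set by dtype
z = np.zeros((2, 3))                  # 2x3 array of 0.0
o = np.ones(4, dtype=int)             # [1 1 1 1]
f = np.full((2, 2), 7)                # 2x2 array filled with 7

b = np.insert(a, 1, 99.0)             # insert before index 1: [ 3. 99.  1.  2.]
c = np.delete(b, 0)                   # delete index 0: [99.  1.  2.]
print(np.sort(c))                     # [ 1.  2. 99.]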

Shape functions

ravel(a, order='C'): Returns flattened array a.
reshape(a, newshape, order='C'): Returns an array with the same data as a but a different shape. newshape is an integer or tuple of integers that specifies the new shape. The new shape must have the same number of elements as the original shape.
resize(a, newshape): Returns an array with the same data as a but a different shape. newshape is an integer or tuple of integers that specifies the new shape. The new and original arrays may have a different number of elements. If the new array is larger than the original array, then the new array is filled with repeated copies of a.
transpose(a): Returns a transposed copy of array a. Zero- and one-dimensional arrays are not changed. Equivalent to the attribute array.T.

Variable array is assigned with [ [1, 2, 3, 4], [5, 6, 7, 8] ].

reshape(array, (2, 2, 2))   [ [ [1, 2], [3, 4] ], [ [5, 6], [7, 8] ] ]
ravel(array, order='F')     [1, 5, 2, 6, 3, 7, 4, 8]
transpose(array)            [ [1, 5], [2, 6], [3, 7], [4, 8] ]
ravel(array, order='C')     [1, 2, 3, 4, 5, 6, 7, 8]
resize(array, (2, 5))       [ [1, 2, 3, 4, 5], [6, 7, 8, 1, 2] ]
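
The same results can be reproduced directly; a small sketch verifying the table above:

import numpy as np

array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(np.reshape(array, (2, 2, 2)))  # [[[1 2] [3 4]] [[5 6] [7 8]]]
print(np.ravel(array, order='F'))    # [1 5 2 6 3 7 4 8]
print(np.transpose(array))           # [[1 5] [2 6] [3 7] [4 8]]
print(np.ravel(array, order='C'))    # [1 2 3 4 5 6 7 8]
print(np.resize(array, (2, 5)))      # [[1 2 3 4 5] [6 7 8 1 2]]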

Math operator and function examples.

Arithmetic operators:
array1 + array2: Element-wise addition
array1 - array2: Element-wise subtraction
array1 * array2: Element-wise multiplication
array1 / array2: Element-wise division

Simple functions:
sqrt(array1): Square root of array elements
log(array1): Logarithm of array elements
sin(array1): Sine of array elements

Aggregate functions:
max(array1): Maximum of array elements
median(array1): Median of array elements
std(array1): Standard deviation of array elements
var(array1): Variance of array elements

Matrix functions:
dot(array1, array2): Dot product of array1 rows with array2 columns
matmul(array1, array2): Matrix product of array1 and array2
cross(array1, array2): Cross product of array1 and array2
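
A short sketch of the operators and functions above on made-up arrays:

import numpy as np

array1 = np.array([1.0, 4.0, 9.0])
array2 = np.array([2.0, 2.0, 3.0])

print(array1 + array2)         # element-wise addition: [ 3.  6. 12.]
print(array1 * array2)         # element-wise multiplication: [ 2.  8. 27.]
print(np.sqrt(array1))         # [1. 2. 3.]
print(np.max(array1))          # 9.0
print(np.median(array1))       # 4.0
print(np.dot(array1, array2))  # 1*2 + 4*2 + 9*3 = 37.0

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
print(np.matmul(m1, m2))       # matrix product: [[19 22] [43 50]]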

2.6 pandas package


Slice notation.

a:b  index values from a to b-1
:b   index values before b
a:   index values from a onwards
Comparison operators.

==  Outputs True if the two operands are equal.
!=  Outputs True if the two operands are not equal.
>   Outputs True if the left operand is greater than the right operand.
>=  Outputs True if the left operand is greater than or equal to the right operand.
<   Outputs True if the left operand is less than the right operand.
<=  Outputs True if the left operand is less than or equal to the right operand.

Logical operators.

&  Outputs True if the two operands are both True.
|  Outputs True if at least one of the operands is True.
~  Outputs the opposite truth value of the expression.
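
In pandas, comparison operators produce Boolean Series, the logical operators &, |, and ~ combine them, and slice notation selects rows by position. A small sketch with a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'airport': ['LAX', 'DEN', 'JFK', 'SFO'],
                   'delay':   [161, 141, 90, 200]})

longDelay = df['delay'] > 150                                 # Boolean Series
subset = df[(df['delay'] > 100) & ~(df['airport'] == 'DEN')]  # combine conditions
print(subset)                                                 # LAX and SFO rows

print(df[1:3])                                                # rows 1 and 2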

Example dataframe methods.

at[index, column]: Returns the dataframe value stored at index and column.
drop(labels=None, axis=0, inplace=False): Removes rows (axis=0) or columns (axis=1) from dataframe. labels specifies the labels of rows or columns to drop.
drop_duplicates(subset=None, inplace=False): Removes duplicate rows from dataframe. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
dropna(axis=0, how='any', subset=None, inplace=False): Removes rows (axis=0) or columns (axis=1) containing missing values from dataframe. subset specifies labels on the opposite axis to consider for missing values. how indicates whether to drop the row or column if any or if all values are missing.
insert(loc, column, value): Inserts a column into dataframe. loc specifies the integer position of the new column. column specifies a string or numeric column label. value specifies column values as a Scalar or Series.
replace(to_replace=None, value=NoDefault.no_default, inplace=False): Replaces to_replace values in dataframe with value. to_replace and value may be string, dictionary, list, regular expressions, or other data types.
sort_values(by, axis=0, ascending=True, inplace=False): Sorts dataframe columns or rows. by specifies indexes or labels on which to sort. axis specifies whether to sort rows (0) or columns (1). ascending specifies whether to sort ascending or descending. inplace specifies whether to sort dataframe or return a new dataframe.
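
A brief sketch applying several of these dataframe methods to a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'city': ['Austin', 'Dallas', 'Dallas'],
                   'sales': [120, 95, 95]})

print(df.at[0, 'city'])                  # 'Austin'
df = df.drop_duplicates()                # remove the repeated Dallas row
df = df.sort_values(by='sales')          # sort rows by the sales column
df.insert(1, 'state', ['TX', 'TX'])      # insert a column at position 1
df = df.replace('Dallas', 'DFW')         # replace matching values
df = df.drop(labels='state', axis=1)     # drop the column again
print(df)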

2.7 matplotlib package


import matplotlib.pyplot as plt

 plt.figure(): creates a new figure.
 plt.show(): displays the figure and all the objects the figure contains.
 plt.savefig(fname): saves the figure in the current working directory with the filename fname.
 plt.title(): adds a title to a figure.
 plt.xlabel(): adds text for the x-axis.
 plt.ylabel(): adds text for the y-axis.
 plt.text(x, y, s): adds string s to the figure at coordinates (x, y).
 plt.annotate(s, xy, xytext): links string s at coordinates given by xytext to a point given by xy.
 plt.legend(): adds a legend to the figure.
Characters for line color, line style, and marker style.

Line color/style:
b   Blue
g   Green
r   Red
w   White
k   Black
y   Yellow
m   Magenta
-   Solid line
:   Dotted line
--  Dashed line
-.  Dashed-dot line

Marker style:
.   Point marker
,   Pixel marker
o   Circle marker
+   Plus marker
x   X marker
v   Triangle-down marker
^   Triangle-up marker
<   Triangle-left marker
>   Triangle-right marker
*   Star marker
p   Pentagon marker
1   Tri-down marker
2   Tri-up marker
3   Tri-left marker
4   Tri-right marker
h   Hexagon1 marker
H   Hexagon2 marker
D   Diamond marker
d   Thin diamond marker
|   Vertical line marker
_   Horizontal line marker
s   Square marker
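
The color, line style, and marker characters can be combined into a single format string passed to plt.plot(). A small sketch with made-up data:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, 'ro--')             # red (r), circle markers (o), dashed line (--)
plt.plot(x, [2, 3, 5, 7], 'k^:')   # black (k), triangle-up markers (^), dotted line (:)
plt.show()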

The plt.grid() function adds grid lines to plots.


plt.subplot() function takes three parameters: nrows, ncols, and index.
plt.suptitle() adds a title to the entire figure, not just the individual plots.

Example:
# Load packages
import matplotlib.pyplot as plt
import pandas as pd
# Load oldfaithfulCluster.csv data
df = pd.read_csv('oldfaithfulCluster.csv')
plt.subplot(2, 1, 1)
plt.scatter(df['Eruption'], df['Waiting'])
plt.suptitle('Eruption time vs. waiting time', fontsize=20, c='black')
plt.ylabel('Waiting time', fontsize=14)
plt.subplot(2, 1, 2)
group1 = df[df['Cluster'] == 1]
group0 = df[df['Cluster'] == 0]
plt.scatter(group1['Eruption'], group1['Waiting'], label='1', edgecolors='white')
plt.scatter(group0['Eruption'], group0['Waiting'], label='0', edgecolors='white')
plt.xlabel('Eruption time', fontsize=14)
plt.ylabel('Waiting time', fontsize=14)
plt.legend()

LAB: Importing packages


Import the necessary modules and read in a csv file. The homes dataset contains 18
features giving the characteristics of 76 homes being sold. The modules will be used with
the homes.csv file to perform a linear regression. Linear regression will be covered in a
different chapter.
 Import the NumPy using the alias np and pandas using the alias pd.
 Import the function LinearRegression from the sklearn.linear_model package.
 Read in the csv file homes.csv.
Ex: If the csv file homes_small.csv is used instead of homes.csv, the output is:
The intercept of the regression is 249.522
The slope of the regression is 36.758
# Import NumPy and pandas
import numpy as np
import pandas as pd

# Import the LinearRegression function from sklearn.linear_model


from sklearn.linear_model import LinearRegression # Your code here

# Read in the csv file homes.csv


homes= pd.read_csv("homes.csv")

# Store relevant columns as variables


y = homes['Price']
y = np.reshape(y.values, (-1,1))
X = homes['Floor']
X = np.reshape(X.values, (-1,1))

# Fit a least squares regression model


linModel = LinearRegression()
linModel.fit(X,y)

# Print the intercept and slope of the regression


print('The intercept of the regression is ', end="")
print('%.3f' % linModel.intercept_)

print('The slope of the regression is ', end="")


print('%.3f' % linModel.coef_)

CHAPTER 3
Pandas descriptive statistics methods.

DataFrame.mean(), DataFrame.median() (axis=None, skipna=True): Returns the mean or median of the values over the requested axis. skipna=True excludes NA/null values.
DataFrame.var(), DataFrame.std() (axis=None, skipna=True, ddof=1): Returns the unbiased sample variance (divides by n-1) or standard deviation of the values over the requested axis. The divisor used is n-ddof, where n represents the number of non-NA/null values.
DataFrame.min(), DataFrame.max() (axis=None, skipna=True): Returns the minimum or maximum of the values over the requested axis.
DataFrame.quantile() (q=0.5, axis=None, interpolation='linear'): Returns the value of the given quantile(s), q, over the requested axis. interpolation specifies the method to determine a quantile when the quantile lies between two values.
DataFrame.skew() (axis=None, skipna=True): Returns the skewness of the values over the requested axis.
DataFrame.kurtosis() (axis=None, skipna=True): Returns the kurtosis of the values over the requested axis. Computes Fisher's definition of kurtosis, where a normal distribution has 0 kurtosis.
DataFrame.describe() (percentiles=None): Returns descriptive statistics. For numerical features, results include the count, mean, standard deviation, minimum, maximum, 0.25 quantile, 0.50 quantile or median, and 0.75 quantile. The returned percentiles can be modified with percentiles.
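
A quick sketch of these methods on a made-up dataframe (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({'sales': [10, 20, 30, 40], 'listings': [5, 7, 9, 11]})

print(df.mean())              # mean of each column
print(df['sales'].median())   # 25.0
print(df.std(ddof=1))         # unbiased sample standard deviation
print(df.quantile(q=0.75))    # 0.75 quantile of each column
print(df.describe())          # count, mean, std, min, quartiles, max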

 Using a descriptive statistics method, calculate the mean number of homes sold
("sales") over all cities.

# Import packages and functions


import pandas as pd
housing = pd.read_csv('txhousing.csv')
meanHomes = housing['sales'].mean()# Your code goes here
print('Mean:', meanHomes)

SciPy functions for probability distributions.

Bernoulli: bernoulli.pmf(k, p), bernoulli.cdf(k, p). p=π sets the probability of a "success". bernoulli.pmf() returns the probability P(X = k), and bernoulli.cdf() returns the probability P(X ≤ k).
Binomial: binom.pmf(k, n, p), binom.cdf(k, n, p). n sets the number of observations and p=π sets the probability of a "success". binom.pmf() returns the probability P(X = k), and binom.cdf() returns the probability P(X ≤ k).
Normal: norm.pdf(x, loc, scale), norm.cdf(x, loc, scale). loc=μ sets the mean and scale=σ sets the standard deviation. norm.pdf() returns the density curve's value at x, and norm.cdf() returns the probability P(X ≤ x).
t: t.pdf(x, df), t.cdf(x, df). df sets the degrees of freedom for the distribution. t.pdf() returns the density curve's value at x, and t.cdf() returns the probability P(X ≤ x).

# Requires: from scipy.stats import norm, t

# Calculate the probability of less than a value, P(X<=8), using cdf()
norm.cdf(x=8, loc=10, scale=2)

# Calculate the probability of greater than a value, P(X>8)=1-P(X<=8), using cdf()


1 - norm.cdf(x=8, loc=10, scale=2)

# Calculate the probability between two values, P(8<X<12), using cdf()


norm.cdf(x=12, loc=10, scale=2) - norm.cdf(x=8, loc=10, scale=2)

# Calculate P(X<=0)
t.cdf(x=0, df=4)

# Using the symmetry of the t-distribution curve, calculate P(X < -2 or X > 2)
t.cdf(x=-2, df=4) * 2

# Calculate probability in the tails P(X < -2 or X > 2)


t.cdf(x=-2, df=4) + (1 - t.cdf(x=2, df=4))

Functions for inference about proportions.

proportions_ztest(count, nobs, value, alternative, prop_var=False): Returns the test statistic and p-value for a hypothesis test based on a normal (z) test. count is the number/array of successes and nobs is the number/array of observations; both take a single value for a one-proportion test and an array of values for a two-proportion test. value is the value in the null hypothesis, alternative is the type of the alternative hypothesis, and prop_var=False estimates the variance based on the sample proportions.
proportion_confint(count, nobs, alpha, method='normal'): Returns a (1-alpha)*100% confidence interval for a population proportion. count is the number of successes, nobs is the number of observations, alpha is the significance level, and method='normal' uses the normal approximation to calculate the interval.
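
A minimal sketch of both functions with made-up counts (42 successes out of 100 observations):

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

count = 42
nobs = 100

# Test H0: p = 0.5 against a two-sided alternative
stat, pvalue = proportions_ztest(count=count, nobs=nobs, value=0.5,
                                 alternative='two-sided')
print(stat, pvalue)

# 95% confidence interval for the population proportion
lower, upper = proportion_confint(count=count, nobs=nobs, alpha=0.05,
                                  method='normal')
print(lower, upper)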

Functions for inference about means.

ttest_1samp(a, popmean, alternative): Returns the t-statistic and p-value from a one-sample t-test for the null hypothesis that the population mean of a sample, a, is equal to a specified value. a is an array of values, popmean is the value in the null hypothesis, and alternative is the type of alternative hypothesis.
ttest_ind(a, b, equal_var=False, alternative): Returns the t-statistic and p-value from a two-sample t-test for the null hypothesis that two independent samples, a and b, have equal population means. a and b are arrays of values from sample 1 and sample 2, equal_var=False assumes non-equal variances, and alternative is the type of alternative hypothesis.
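
A minimal sketch using the scipy.stats versions of these tests on made-up samples:

from scipy.stats import ttest_1samp, ttest_ind

sample1 = [5.1, 4.8, 5.3, 5.0, 4.9]
sample2 = [5.6, 5.4, 5.8, 5.5, 5.7]

# One-sample t-test: is the population mean of sample1 equal to 5.0?
stat1, p1 = ttest_1samp(sample1, popmean=5.0)
print(stat1, p1)

# Two-sample t-test: do sample1 and sample2 have equal population means?
stat2, p2 = ttest_ind(sample1, sample2, equal_var=False)
print(stat2, p2)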

Lab: The mtcars dataset contains data from the 1974 Motor Trends magazine, and
includes 10 features of performance and design from a sample of 32 cars.
 Import the csv file mtcars.csv as a data frame using a pandas module function.
 Find the mean, median, and mode of the column wt.
 Print the mean and median.

import pandas as pd
# Read in the file mtcars.csv
cars = pd.read_csv('mtcars.csv') # Your code here
# Find the mean of the column wt
mean = cars['wt'].mean()# Your code here
# Find the median of the column wt
median = cars['wt'].median()# Your code here
print("mean = {:.5f}, median = {:.3f}".format(mean, median))

The intelligence quotient (IQ) of a randomly selected person follows a normal distribution
with a mean of 100 and a standard deviation of 15. Use the scipy function norm and user
input values for IQ1 and IQ2 to perform the following tasks:
 Calculate the probability that a randomly selected person will have an IQ less than
or equal to IQ1.
 Calculate the probability that a randomly selected person will have an IQ
between IQ1 and IQ2.

# Import norm from scipy.stats


from scipy.stats import norm
# Input two IQs, making sure that IQ1 is less than IQ2
IQ1 = float(input())
IQ2 = float(input())
mean = 100
std_dev = 15

while IQ1 > IQ2:
    print("IQ1 should be less than IQ2. Enter numbers again.")
    IQ1 = float(input())
    IQ2 = float(input())

# Calculate the probability that a randomly selected person has an IQ less than or equal to IQ1
probLT = norm.cdf(IQ1, loc=mean, scale=std_dev)  # Your code here

# Calculate the probability that a randomly selected person has an IQ between IQ1 and IQ2
probBetw = norm.cdf(IQ2, loc=mean, scale=std_dev) - norm.cdf(IQ1, loc=mean, scale=std_dev)  # Your code here

print("The probability that a randomly selected person \n has an IQ less than or equal to " + str(IQ1) + " is ", end="")
print('%.3f' % probLT + ".")
print("The probability that a randomly selected person \n has an IQ between " + str(IQ1) + " and " + str(IQ2) + " is ", end="")
print('%.3f' % probBetw + ".")

The gpa dataset is a toy dataset containing the features height and gpa for 35 students.
Use the statsmodels function proportions_ztest and the user defined values for the
proportion for the null hypothesis value and the gpa cutoff cutoff to perform the following
tasks:
 Load the gpa.csv data set.
 Find the number of students with a gpa greater than cutoff.
 Find the total number of students.
 Perform a z-test for the user input expected proportion. Modify
the prop_var parameter to use the user input expected proportion instead of the
sample proportion to calculate the standard error.
 Determine if the hypothesis that the actual proportion is different from the
expected proportion should be rejected at the alpha = 0.01 significance level.

import statsmodels.stats as st
from statsmodels.stats.proportion import proportions_ztest
import pandas as pd

# Read in gpa.csv
gpa = pd.read_csv('gpa.csv')# Your code here

# Get the value of the proportion for the null hypothesis


value = float(input())
# Get the gpa cutoff
cutoff = float(input())

# Determine the number of students with a gpa higher than cutoff


counts = (gpa['gpa'] > cutoff).sum() # Your code here

# Determine the total number of students


nobs = len(gpa)# Your code here

# Perform z-test for counts, nobs, and value


# Modify prop_var parameter
ztest = proportions_ztest(count=counts, nobs=nobs, value=value, alternative='two-sided', prop_var=value)  # Your code here
print("(", end="")
print('%.3f' % ztest[0] + ", ", end="")
print('%.3f' % ztest[1] + ")")

if ztest[1] < 0.01:
    print("The two-tailed p-value, ", end="")
    print('%.3f' % ztest[1] + ", is less than \u03B1. Thus, sufficient evidence exists to support the hypothesis that the proportion is different from", value)
else:
    print("The two-tailed p-value, ", end="")
    print('%.3f' % ztest[1] + ", is greater than \u03B1. Thus, insufficient evidence exists to support the hypothesis that the proportion is different from", value)

CHAPTER 4
Common operators.
Arithmetic operators:
+           Adds two numeric values. Ex: 4 + 3 evaluates to 7.
- (unary)   Reverses the sign of one numeric value. Ex: -(-2) evaluates to 2.
- (binary)  Subtracts one numeric value from another. Ex: 11 - 5 evaluates to 6.
*           Multiplies two numeric values. Ex: 3 * 5 evaluates to 15.
/           Divides one numeric value by another. Ex: 4 / 2 evaluates to 2.
% (modulo)  Divides one numeric value by another and returns the integer remainder. Ex: 5 % 2 evaluates to 1.
^           Raises one numeric value to the power of another. Ex: 5 ^ 2 evaluates to 25.

Comparison operators:
=    Compares two values for equality. Ex: 1 = 2 evaluates to FALSE.
!=   Compares two values for inequality. Ex: 1 != 2 evaluates to TRUE.
<    Compares two values with <. Ex: 2 < 2 evaluates to FALSE.
<=   Compares two values with ≤. Ex: 2 <= 2 evaluates to TRUE.
>    Compares two values with >. Ex: '2019-08-13' > '2021-08-13' evaluates to FALSE.
>=   Compares two values with ≥. Ex: 'apple' >= 'banana' evaluates to FALSE.

Logical operators:
AND  Returns TRUE only when both values are TRUE. Ex: TRUE AND FALSE evaluates to FALSE.
OR   Returns FALSE only when both values are FALSE. Ex: TRUE OR FALSE evaluates to TRUE.
NOT  Reverses a logical value. Ex: NOT FALSE evaluates to TRUE.

Operator precedence.

Precedence  Operators
1           - (unary)
2           ^
3           * / %
4           + - (binary)
5           = != < > <= >=
6           NOT
7           AND
8           OR

SELECT with expressions.


SELECT Expression1, Expression2, ...
FROM TableName;

SELECT with columns.


SELECT Column1, Column2, ...
FROM TableName;
SELECT with asterisk.
SELECT *
FROM TableName;

WHERE clause.
SELECT Expression1, Expression2, ...
FROM TableName
WHERE Condition;

The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies released before January 1, 2000.
Modify the SELECT statement to select the title and release date of PG-13 movies that
are released after January 1, 2008.
Run your solution and verify the result table shows just the titles and release dates
for The Dark Knight and Crazy Rich Asians.

CREATE TABLE Movie (
    ID INT AUTO_INCREMENT,
    Title VARCHAR(100),
    Rating CHAR(5) CHECK (Rating IN ('G', 'PG', 'PG-13', 'R')),
    ReleaseDate DATE,
    PRIMARY KEY (ID)
);

INSERT INTO Movie (Title, Rating, ReleaseDate) VALUES
    ('Casablanca', 'PG', '1943-01-23'),
    ('Bridget Jones\'s Diary', 'PG-13', '2001-04-13'),
    ('The Dark Knight', 'PG-13', '2008-07-18'),
    ('Hidden Figures', 'PG', '2017-01-06'),
    ('Toy Story', 'G', '1995-11-22'),
    ('Rocky', 'PG', '1976-11-21'),
    ('Crazy Rich Asians', 'PG-13', '2018-08-15');

-- Modify the SELECT statement:


SELECT *
FROM Movie
WHERE ReleaseDate < '2000-01-01';

LIKE
 % matches any number of characters. Ex: LIKE 'L%t' matches "Lt", "Lot", "Lift", and
"Lol cat".
 _ matches exactly one character. Ex: LIKE 'L_t' matches "Lot" and "Lit" but not "Lt"
and "Loot".

The given SQL creates a Movie table and inserts some movies. The SELECT statement
selects all movies.
Modify the SELECT statement to select movies with the word "star" somewhere in the
title.
Run your solution and verify the result table shows just the movies Rogue One: A Star
Wars Story, Star Trek and Stargate.
CREATE TABLE Movie (
    ID INT AUTO_INCREMENT,
    Title VARCHAR(100),
    Rating CHAR(5) CHECK (Rating IN ('G', 'PG', 'PG-13', 'R')),
    ReleaseDate DATE,
    PRIMARY KEY (ID)
);

INSERT INTO Movie (Title, Rating, ReleaseDate) VALUES
    ('Rogue One: A Star Wars Story', 'PG-13', '2016-12-16'),
    ('Star Trek', 'PG-13', '2009-05-08'),
    ('The Dark Knight', 'PG-13', '2008-07-18'),
    ('Stargate', 'PG-13', '1994-10-28'),
    ('Avengers: Endgame', 'PG-13', '2019-04-26');

-- Modify the SELECT statement:


SELECT *
FROM Movie;

Simple functions.

Numeric:
ABS(n)        Absolute value of n. Ex: SELECT ABS(-5); returns 5.
LOG(n)        Natural logarithm of n. Ex: SELECT LOG(10); returns 2.302585.
POW(x, y)     x to the power of y. Ex: SELECT POW(2, 3); returns 8.
RAND()        Random number between 0 (inclusive) and 1 (exclusive). Ex: SELECT RAND(); returns 0.118318.
ROUND(n, d)   n rounded to d decimal places. Ex: SELECT ROUND(16.25, 1); returns 16.3.
SQRT(n)       Square root of n. Ex: SELECT SQRT(25); returns 5.

String:
CONCAT(s1, s2, ...)      Concatenation of the strings s1, s2, ... Ex: SELECT CONCAT('Dis', 'en', 'gage'); returns 'Disengage'.
LOWER(s)                 s converted to lower case. Ex: SELECT LOWER('MySQL'); returns 'mysql'.
UPPER(s)                 s converted to upper case. Ex: SELECT UPPER('mysql'); returns 'MYSQL'.
REPLACE(s, from, to)     s with all occurrences of from replaced by to. Ex: SELECT REPLACE('Orange', 'O', 'St'); returns 'Strange'.
SUBSTRING(s, pos, len)   Substring of s that starts at position pos with length len. Ex: SELECT SUBSTRING('Boomerang', 1, 4); returns 'Boom'.

Date/Time:
CURDATE(), CURTIME(), NOW()              Current date, time, or date and time in 'YYYY-MM-DD', 'HH:MM:SS', or 'YYYY-MM-DD HH:MM:SS' format. Ex: SELECT CURDATE(); returns '2019-01-25'.
DAY(d), MONTH(d), YEAR(d)                Day, month, or year of d. Ex: SELECT MONTH('2016-10-25'); returns 10.
HOUR(t), MINUTE(t), SECOND(t)            Hour, minute, or second of t. Ex: SELECT MINUTE('22:11:45'); returns 11.
DATEDIFF(dt1, dt2), TIMEDIFF(dt1, dt2)   Difference of dt1 - dt2, in number of days or amount of time. Ex: SELECT DATEDIFF('2013-03-10', '2013-03-04'); returns 6.

 COUNT() counts the number of selected values.


 MIN() finds the minimum of selected values.
 MAX() finds the maximum of selected values.
 SUM() sums selected values.
 AVG() computes the arithmetic mean of selected values.
 VARIANCE() computes the standard variance of selected values.
GROUP BY clause
 One or more columns are listed after GROUP BY, separated by commas.
 GROUP BY clause returns one row for each group.
 Each group may be ordered with the ORDER BY clause.
 GROUP BY clause must appear before the ORDER BY clause and after the WHERE
clause (if present).

import mysql.connector
from mysql.connector import errorcode

try:
    reservationConnection = mysql.connector.connect(
        user='samsnead',
        password='*jksi72$',
        host='127.0.0.1',
        database='Reservation')

except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print('Invalid credentials')
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print('Database not found')
    else:
        print('Cannot connect to database:', err)

else:
    # Execute database operations...
    reservationConnection.close()

 The cursor.rowcount property is the number of rows returned or altered by a


query.
 The cursor.column_names property is a list of column names in a query result.
 The cursor.fetchwarnings() method returns a list of warnings generated by a
query.
 The connection.commit() method saves all changes.
 The connection.rollback() method discards all changes.
 cursor.fetchone() returns a tuple containing a single result row or the value None if no rows are
selected. If a query returns multiple rows, cursor.fetchone() may be executed repeatedly until it
returns None.
 cursor.fetchall() returns a list of tuples containing all result rows. The tuple list can be processed
in a loop. Ex: for rowTuple in cursor.fetchall() assigns each row to rowTuple and terminates
when all rows are processed.

flightCursor = reservationConnection.cursor()
flightQuery = ('SELECT FlightNumber, DepartureTime FROM Flight '
'WHERE AirportCode = %s AND AirlineName = %s')
flightData = ('PEK', 'China Airlines')
flightCursor.execute(flightQuery, flightData)

for row in flightCursor.fetchall():
    print('Flight', row[0], 'departs at', row[1])

flightCursor.close()

CHAPTER 5

Data wrangling with Python and pandas.

read_csv(filepath_or_buffer, sep=NoDefault.no_default): Returns a dataframe constructed from a CSV file. filepath_or_buffer is a string containing the full path for the CSV file. When the file is in the same directory as the code, only the file name is needed. sep specifies the character that separates values in the CSV file.
read_excel(io, sheet_name=0): Returns a dataframe constructed from an Excel spreadsheet. io is a string containing the full path for the Excel file. When the file is in the same directory as the code, only the file name is needed. sheet_name is a string or integer that specifies which Excel sheet to read.
read_sql_table(table_name, con, schema=None, columns=None): Returns a dataframe constructed from an SQL table. table_name specifies the table name. con specifies a database server connection string. schema specifies the schema in the database server. columns specifies which table columns to include in the dataframe.
DataFrame(data=None, index=None, columns=None): Returns a new dataframe. data specifies dataframe values as an array, dictionary, or another dataframe. index and columns specify row and column labels. The defaults index=None and columns=None generate integer labels.
dataframe.at[index, column]: Returns the dataframe value stored at index and column.
dataframe.info(verbose=None): Returns information about dataframe, such as number of rows and columns, data types, and memory usage. If verbose=False, shows only summary dataframe information and hides column details.
dataframe.loc[indexRange, columnRange]: Returns a slice of dataframe. indexRange specifies rows in the slice as startIndex:endIndex. columnRange specifies columns in the slice as startLabel:endLabel.
dataframe.sort_values(by, axis=0, ascending=True, inplace=False): Sorts dataframe columns or rows. by specifies indexes or labels on which to sort. axis specifies whether to sort rows (0) or columns (1). ascending specifies whether to sort ascending or descending. inplace specifies whether to sort dataframe or return a new dataframe.

Python data structuring methods.

string[start:end]: Returns the substring of string that begins at the index start and ends at the index end - 1.
string.capitalize(), string.upper(), string.lower(), string.title(): Returns a copy of string with the initial character uppercase, all characters uppercase, all characters lowercase, or the initial character of all words uppercase.
to_datetime(arg): Converts arg to datetime data type and returns the converted object. Data type of arg may be int, float, str, datetime, list, tuple, one-dimensional array, Series, or DataFrame.
to_numeric(arg): Converts arg to numeric data type and returns the converted object. Data type of arg may be scalar, list, tuple, one-dimensional array, or Series.
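
A short sketch of these structuring methods (the strings and values are made up):

import pandas as pd

s = 'data science'
print(s[0:4])           # 'data'
print(s.capitalize())   # 'Data science'
print(s.upper())        # 'DATA SCIENCE'
print(s.title())        # 'Data Science'

dates = pd.to_datetime(['2021-01-15', '2021-02-20'])   # datetime64 values
values = pd.to_numeric(['3.5', '7', '10.25'])          # float values
print(dates)
print(values)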

pandas data structuring methods.

df.astype(dtype, copy=True): Converts the data type of all dataframe df columns to dtype. To alter individual columns, specify dtype as {col: dtype, col: dtype, ...}.
df.insert(loc, column, value): Inserts a new column with label column at location loc in dataframe df. value is a Scalar, Series, or Array of values for the new column.

scikit-learn data structuring methods.

preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True): Standardizes data in input X of data type Array or DataFrame. axis indicates whether to standardize along columns (0) or rows (1). with_mean=True centers the data at the mean value. with_std=True scales the data so that one represents a standard deviation.
preprocessing.MinMaxScaler().fit_transform(X): Normalizes data in input X, a fit_transform() parameter of data type Array or DataFrame. feature_range=(0, 1) specifies the range of scaled data. feature_range and copy are MinMaxScaler() parameters.
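
A small sketch contrasting the two scalers on a made-up dataframe:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [10.0, 20.0, 30.0, 40.0]})

# Standardize: each column is centered at 0 with standard deviation 1
standardized = preprocessing.scale(df)
print(standardized)

# Normalize: each column is rescaled to the default feature_range (0, 1)
normalized = preprocessing.MinMaxScaler().fit_transform(df)
print(normalized)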

pandas data cleaning methods.

df.drop(labels=None, axis=0, inplace=False): Removes rows (axis=0) or columns (axis=1) from dataframe df. labels specifies the labels of rows or columns to drop.
df.drop_duplicates(subset=None, inplace=False): Removes duplicate rows from df. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
df.dropna(axis=0, how='any', subset=None, inplace=False): Removes rows (axis=0) or columns (axis=1) containing missing values from df. subset specifies labels on the opposite axis to consider for missing values. how indicates whether to drop the row or column if any or if all values are missing.
df.duplicated(subset=None): Returns a Boolean series that identifies duplicate rows in df. True indicates a duplicate row. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.
df.fillna(value=None, inplace=False): Replaces NA and NaN values in df with value, which may be a scalar, dict, Series, or DataFrame.
df.isnull(), df.isna(): Returns a dataframe of Boolean values. True in the returned dataframe indicates the corresponding value of the input df is None, NaT, or NaN.
df.mean(axis=0, skipna=True, numeric_only=None): Returns the mean values of rows (axis=0) or columns (axis=1) of df. skipna indicates whether to exclude unknown values in the calculation. numeric_only indicates whether to exclude non-numeric rows or columns.
df.replace(to_replace=None, value=NoDefault.no_default, inplace=False): Replaces to_replace values in df with value. to_replace and value may be str, dict, list, regex, or other data types.

Python data enriching methods.

pd.concat(objs, axis=0, join='outer', ignore_index=False): Appends dataframes specified in the objs parameter. Appends rows if axis=0 or columns if axis=1. join specifies whether to perform an 'outer' or 'inner' join. Resulting index values are unchanged if ignore_index=False or renumbered if ignore_index=True.
df.apply(func, axis=0): Applies the function specified in the func parameter to a dataframe df. Applies the function to each column if axis=0 or to each row if axis=1. Returns a Series or DataFrame.
df.insert(loc, column, value): Inserts a column into df. loc specifies the integer position of the new column. column specifies a string or numeric column label. value specifies column values as a Scalar or Series.
df.merge(right, how='inner', on=None, sort=False): Joins df with the right dataframe. how specifies whether to perform a 'left', 'right', 'outer', or 'inner' join. on specifies join column labels, which must appear in both dataframes. If on=None, all matching labels become join columns. sort=True sorts rows on the join columns.
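
A short sketch of merge(), concat(), and apply() on made-up dataframes:

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'score': [88, 92, 79]})
right = pd.DataFrame({'id': [2, 3, 4], 'grade': ['A', 'B', 'C']})

# merge() joins on matching column labels ('id' here)
print(left.merge(right, how='inner', on='id'))

# concat() stacks dataframes; ignore_index=True renumbers the rows
print(pd.concat([left, left], axis=0, ignore_index=True))

# apply() runs a function on each column (axis=0) or each row (axis=1)
print(left[['score']].apply(lambda col: col - col.mean(), axis=0))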

LAB: Cleaning data using dropna() and fillna()


The hmeq_small dataset contains information on 5960 home equity loans, including 7
features on the characteristics of the loan.
 Load the data set hmeq_small.csv as a data frame.
 Create a new data frame with all the rows with missing data deleted.
 Create a second data frame with all missing data filled in with the mean value of
the column.
 Find the means of the columns for both new data frames.
import pandas as pd

# Read in hmeq_small.csv
hmeq = pd.read_csv('hmeq_small.csv')# Your code here

# Create a new data frame with the rows with missing values dropped
hmeqDelete = hmeq.dropna() # Your code here

# Create a new data frame with the missing values filled in by the mean of the column
hmeqReplace = hmeq.fillna(hmeq.mean(numeric_only=True)) # Your code here

# Print the means of the columns for each new data frame
print("Means for hmeqDelete are ",hmeqDelete.mean(numeric_only=True)) # Your code
here)

print("Means for hmeqReplace are ", hmeqReplace.mean(numeric_only=True)) # Your


code here)

LAB: Structuring data using scale() and MinMaxScaler()


The hmeq_small dataset contains information on 5960 home equity loans, including 7
features on the characteristics of the loan.
 Load the hmeq_small.csv data set as a data frame.
 Standardize the data set as a new data frame.
 Normalize the data set as a new data frame.
 Print the means and standard deviations of both the standardized and normalized
data.

import pandas as pd
from sklearn import preprocessing

# Read in the file hmeq_small.csv


hmeq = pd.read_csv('hmeq_small.csv')

# Standardize the data


standardized = preprocessing.scale(hmeq)

# Output the standardized data as a data frame with column names


hmeqStand = pd.DataFrame(standardized, columns=hmeq.columns)

# Normalize the data (min-max scaling)


normalized = preprocessing.minmax_scale(hmeq)

# Output the normalized data as a data frame with column names


hmeqNorm = pd.DataFrame(normalized, columns=hmeq.columns)

# Print the means and standard deviations of hmeqStand and hmeqNorm


print("The means of hmeqStand are ", hmeqStand.mean())
print("The standard deviations of hmeqStand are ", hmeqStand.std())
print("The means of hmeqNorm are ", hmeqNorm.mean())
print("The standard deviations of hmeqNorm are ", hmeqNorm.std())
The forestfires dataset contains meteorological information and the area burned for 517
forest fires that occurred in Montesinho Natural Park in Portugal. The columns of interest
are FFMC, DMC, DC, ISI, temp, RH, wind, and rain.
 Read in the file forestfires.csv.
 Create a new data frame X from the columns FFMC, DMC, DC, ISI, temp, RH, wind,
and rain, in that order.
 Calculate the correlation matrix for the data in X.
 Scale the data.
 Use sklearn's PCA function to perform four-component factor analysis on the scaled
data.
 Print the factors and the explained variance.

# Import the necessary modules


import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Read in forestfires.csv
fires = pd.read_csv('forestfires.csv')# Your code here

# Create a new data frame with the columns FFMC, DMC, DC, ISI, temp, RH, wind, and rain, in that order
X = fires[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]  # Your code here

# Calculate the correlation matrix for the data in the data frame X
XCorr = X.corr()# Your code here
print(XCorr)

# Scale the data.


scaler = StandardScaler()# Your code here
firesScaled = scaler.fit_transform(X) # Your code here

# Perform four-component factor analysis on the scaled data.


# Your code here
pca = PCA(n_components=4)
firesPCA = pca.fit_transform(firesScaled)

# Print the factors and the explained variance.


print("Factors: ", pca.components_) # Your code here)
print("Explained variance: ", pca.explained_variance_) # Your code here)

CHAPTER 6

Seaborn single feature plots.

sns.histplot(df, x='Feature'): Creates a histogram of the named numerical feature from the dataframe.
sns.kdeplot(df, x='Feature'): Creates a density plot of the named numerical feature from the dataframe.
sns.countplot(df, x='Feature'): Creates a bar chart of the named categorical feature from the dataframe.
sns.boxplot(df, x='Feature'): Creates a box plot of the named numerical feature from the dataframe.
sns.violinplot(df, x='Feature'): Creates a violin plot of the named numerical feature from the dataframe.

Two feature plots in seaborn.

sns.scatterplot(df, x='Horizontal feature', y='Vertical feature'): Creates a scatter plot of the features provided.
sns.swarmplot(df, x='Numerical feature', y='Categorical feature'): Creates a swarm plot displaying the distribution of x for each group in y.
sns.stripplot(df, x='Numerical feature', y='Categorical feature'): Creates a strip plot displaying the distribution of x for each group in y.

Dataset summary functions.

df.shape: Returns the dataframe's dimensions, displayed as (number of instances, number of features). df.shape is useful when code needs one of these dimensions.
df.info(): Displays the name, number of non-null values, and type of each feature in the dataframe.
df.describe(include="all"): Displays summary statistics (count, mean, standard deviation, min/max, and quartiles) for each numerical feature. Including include="all" displays the count, number of categories, and mode's name and frequency for categorical features.

Many-feature relationship visualization in pandas.

df.hist(): Plots a histogram for every column in the dataframe.
df.boxplot(): Plots a box plot for every column in the dataframe.
pd.plotting.scatter_matrix(df): Plots every pair of numerical features as an individual scatter plot. For more control, seaborn provides the function sns.pairplot(df).
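
A brief sketch of these plotting and summary functions, using seaborn's built-in tips dataset only as a stand-in:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

print(tips.shape)                    # (number of instances, number of features)
print(tips.describe(include='all'))  # summary statistics

sns.histplot(tips, x='total_bill')   # single numerical feature
plt.figure()
sns.scatterplot(tips, x='total_bill', y='tip')   # two numerical features
plt.figure()
pd.plotting.scatter_matrix(tips[['total_bill', 'tip', 'size']])
plt.show()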

LAB: Visualizing mpg data using matplotlib


The dataset mpg contains information on miles per gallon (mpg) and engine size for cars
sold from 1970 through 1982. The dataset has the
features mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, ori
gin, and name.
 Load the dataset mpg.csv.
 Create a new dataframe using the columns weight and mpg.
 Use matplotlib to make a scatter plot of weight vs mpg labelling the x-
axis Weight and the y-axis MPG.

import matplotlib.pyplot as plt


import pandas as pd
import seaborn as sns

# Load the mpg data set


mpg = sns.load_dataset('mpg')# Your code here

# Create a new data frame with the columns "weight" and "mpg"
mpgSmall = mpg[['weight', 'mpg']]# Your code here

print(mpgSmall)

# Create a scatter plot of weight vs mpg with x label "Weight" and y label "MPG"
# Your code here
plt.scatter(mpgSmall['weight'], mpgSmall['mpg'])
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.title('Weight vs MPG')

plt.savefig('mpg_scatter.png')

LAB: Visualizing Titanic passenger statistics using bar charts


The titanic dataset contains data on 887 Titanic passengers, including each passenger's
survival status, embarkation location, cabin class, and sex. Write a program that
performs the following tasks:
 Load the dataset in titanic.csv as titanic.
 Create a new data frame, firstSouth, by subsetting titanic to include instances
where a passenger is in the first class cabin (pclass feature is 1) and boarded from
Southampton (embarked feature is S).
 Create a new data frame, secondThird, by subsetting titanic to include instances
where a passenger is either in the second (pclass feature is 2) or third class
(pclass feature is 3) cabin.
 Create bar charts for the following:
o Passengers in first class who embarked in Southampton grouped by sex.
o Passengers in second and third class grouped by survival status.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load titanic.csv
titanic = sns.load_dataset('titanic') # Your code here

# Subset the titanic dataset to include first class passengers who embarked in Southampton
firstSouth = titanic[(titanic['pclass'] == 1) & (titanic['embarked'] == 'S')]  # Your code here

# Subset the titanic dataset to include either second or third class passengers
secondThird = titanic[(titanic['pclass'] == 2) | (titanic['pclass'] == 3)]  # Your code here

print(firstSouth.head())
print(secondThird.head())

# Create a bar chart for the first class passengers who embarked in Southampton grouped by sex
sns.countplot(data=firstSouth, x='sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.title('First-Class Passengers from Southampton by Sex')

# Your code here


plt.savefig('titanic_bar_1.png')

# Create a bar chart for the second and third class passengers grouped by survival status
sns.countplot(data=secondThird, x='survived')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Survival Count of 2nd and 3rd Class Passengers')
# Your code here
plt.legend(labels=["0","1"], title = "survived")
plt.savefig('titanic_bar_2.png')

CHAPTER 7

Simple linear regression


# Import packages

import matplotlib.pyplot as plt


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import r_regression

# Import data
crabs = pd.read_csv('crab-groups.csv')

# Store relevant columns as variables


X = crabs[['latitude']].values.reshape(-1, 1)
y = crabs[['mean_mm']].values.reshape(-1, 1)

# Fit a least squares regression model


linModel = LinearRegression()
linModel.fit(X, y)
yPredicted = linModel.predict(X)

# Graph the model


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=2)
plt.xlabel('Latitude', fontsize=14)
plt.ylabel('Mean length (mm)', fontsize=14)

# Graph the residuals


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=2)
for i in range(len(X)):
plt.plot([X[i], X[i]], [y[i], yPredicted[i]], color='grey', linewidth=1)
plt.xlabel('Latitude', fontsize=14)
plt.ylabel('Mean length (mm)', fontsize=14)

# Output the intercept of the least squares regression


intercept = linModel.intercept_
print(intercept[0])

# Output the slope of the least squares regression


slope = linModel.coef_
print(slope[0][0])

# Write the least squares model as an equation


print("Predicted mean length = ", intercept[0], " + ", slope[0][0], "* (latitude)")

# Compute the sum of squared errors for the least squares model
SSEreg = sum((y - yPredicted) ** 2)[0]
SSEreg

# Compute the sum of squared errors for the horizontal line model
SSEyBar = sum((y - np.mean(y)) ** 2)[0]
SSEyBar

# Compute the proportion of variation explained by the linear regression


# using the sum of squared errors
(SSEyBar - SSEreg) / (SSEyBar)

# Compute the correlation coefficient r


r = r_regression(X, np.ravel(y))[0]
r

# Compute the proportion of variation explained by the linear regression


# using correlation coefficient
r**2

# Compute the proportion of variation explained by the linear regression


# using the LinearModel object's score method
linModel.score(X, y)

John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Initialize a linear regression model for predicting arrival delay based on departure
delay.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


linearModel = LinearRegression() # Your code goes here

# Fit the linear model


linearModel = linearModel.fit(X, y)

print('Intercept:', linearModel.intercept_[0])

Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Initialize a linear regression model for predicting arrival delay based on departure
delay.
 Fit the linear regression model.
The code contains all imports, loads the dataset, and prints the model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsEWR.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize and fit a linear regression model


linearModel= LinearRegression() # Your code goes here
linearModel = linearModel.fit(X,y)

print('Intercept:', linearModel.intercept_[0])
John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival and departure delays (in minutes)
were recorded.
 Predict the arrival delay for a flight that departed 8 minutes late, and assign
variable yHat with the prediction.
 Assign variable slope with the slope coefficient of the model.
The code contains all imports, loads the dataset, initializes and fits the model, and
prints yHat and slope once calculated.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay']].values.reshape(-1, 1)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


linearModel = LinearRegression()

# Fit the linear model


linearModel = linearModel.fit(X, y)

# Predict the arrival delay and assign the slope


# Your code goes here
yHat = linearModel.predict([[8]])
slope =linearModel.coef_

print('Predicted arrival delay:', yHat[0][0])


print('Slope coefficient:', slope[0][0])

Residual plots with Python.

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Import data
crabs = pd.read_csv('crab-groups.csv')

# Store relevant columns as variables


X = crabs[['latitude']].values.reshape(-1, 1)
y = crabs[['mean_mm']].values.reshape(-1, 1)

# Fit a least squares regression model


linModel = LinearRegression();
linModel.fit(X, y);

# regplot() creates a scatter plot with the regression line overlaid


p = sns.regplot(data=crabs, x='latitude', y='mean_mm', ci=False,
scatter_kws={'color':'black'})
p.set_xlabel('Latitude', fontsize=14);
p.set_ylabel('Mean length (mm)', fontsize=14);

# Calculate predicted values and residuals


yPredicted = linModel.predict(X)
yResid = yPredicted - y

# Scatter plot with predicted values vs. residuals


# Points should be scattered around a horizontal line at y=0 with no obvious pattern
p = sns.regplot(x=yPredicted, y=yResid, ci=False, scatter_kws={'color':'black'})
p.set_xlabel('Fitted values', fontsize=14);
p.set_ylabel('Residuals', fontsize=14);
p.set_title('Fitted value vs. residual plot', fontsize=16);

# Residuals must be stored as a flattened array


resid = np.ravel(yResid)

# Use qqplot() from statsmodels to make a QQ plot


p = sm.qqplot(resid, line='45')

plt.title('Normal Q-Q plot', fontsize=16);


plt.xlabel('Theoretical quantiles', fontsize=14);
plt.ylabel('Sample quantiles', fontsize=14);
Multiple linear regression in Python
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from mpl_toolkits import mplot3d

# Load the dataset


mpg = pd.read_csv('mpg.csv')

# Remove rows that have missing fields


mpg = mpg.dropna()

# Store relevant columns as variables


X = mpg[['acceleration', 'weight']].values.reshape(-1, 2)
y = mpg[['mpg']].values.reshape(-1, 1)

# Graph acceleration vs MPG


plt.scatter(X[:, 0], y, color='black')
plt.xlabel('Acceleration', fontsize=14);
plt.ylabel('MPG', fontsize=14);

# Graph weight vs MPG


plt.scatter(X[:, 1], y, color='black')
plt.xlabel('Weight', fontsize=14);
plt.ylabel('MPG', fontsize=14);

# Fit a least squares multiple linear regression model


linModel = LinearRegression()
linModel.fit(X, y)

# Write the least squares model as an equation


print(
"Predicted MPG = ",
linModel.intercept_[0],
" + ",
linModel.coef_[0][0],
"* (Acceleration)",
" + ",
linModel.coef_[0][1],
"* (Weight)",
)

# Set up the figure


fig = plt.figure()
ax = plt.axes(projection='3d')
# Plot the points
ax.scatter3D(X[:, 0], X[:, 1], y, color="Black")
# Plot the regression as a plane
xDeltaAccel, xDeltaWeight = np.meshgrid(
np.linspace(X[:, 0].min(), X[:, 0].max(), 2),
np.linspace(X[:, 1].min(), X[:, 1].max(), 2),
)
yDeltaMPG = (
linModel.intercept_[0]
+ linModel.coef_[0][0] * xDeltaAccel
+ linModel.coef_[0][1] * xDeltaWeight
)
ax.plot_surface(xDeltaAccel, xDeltaWeight, yDeltaMPG, alpha=0.5)
# Axes labels
ax.set_xlabel('Acceleration');
ax.set_ylabel('Weight');
ax.set_zlabel('MPG');
# Set the view angle
ax.view_init(30, 50);
ax.set_xlim(28, 9);

# Make a prediction
yMultyPredicted = linModel.predict([[20, 3000]])
print(
"Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds \
n",
"using the multiple linear regression is ",
yMultyPredicted[0][0],
"miles per gallon",
)

# Store weight as an array


X2 = X[:, 1].reshape(-1, 1)

# Fit a quadratic regression model using just Weight


polyFeatures = PolynomialFeatures(degree=2, include_bias=False)
xPoly = polyFeatures.fit_transform(X2)
polyModel = LinearRegression()
polyModel.fit(xPoly, y)

# Graph the quadratic regression


plt.scatter(X2, y, color='black')
xDelta = np.linspace(X2.min(), X2.max(), 1000)
yDelta = polyModel.predict(polyFeatures.fit_transform(xDelta.reshape(-1, 1)))
plt.plot(xDelta, yDelta, color='blue', linewidth=2)
plt.xlabel('Weight', fontsize=14)
plt.ylabel('MPG', fontsize=14)

# Write the quadratic model as an equation


print(
"Predicted MPG = ",
polyModel.intercept_[0],
" + ",
polyModel.coef_[0][0],
"* (Weight)",
" + ",
polyModel.coef_[0][1],
"* (Weight)^2",
)

# Make a prediction
polyInputs = polyFeatures.fit_transform([[3000]])
yPolyPredicted = polyModel.predict(polyInputs)
print(
"Predicted MPG for a car with Weight = 3000 pounds \n",
"using the simple polynomial regression is ", yPolyPredicted[0][0], "miles per gallon",
)

# Fit a quadratic regression model using acceleration and weight


polyFeatures2 = PolynomialFeatures(degree=2, include_bias=False)
xPoly2 = polyFeatures2.fit_transform(X)
polyModel2 = LinearRegression()
polyModel2.fit(xPoly2, y)

# Write the quadratic regression as an equation


print(
"Predicted MPG =", polyModel2.intercept_[0], "\n",
" + ", polyModel2.coef_[0][0], "* (Acceleration)\n",
" + ", polyModel2.coef_[0][1], "* (Weight)", "\n",
" + ", polyModel2.coef_[0][2], "* (Acceleration)^2 \n",
" + ", polyModel2.coef_[0][3], "* (Acceleration)*(Weight) \n",
" + ", polyModel2.coef_[0][4], "* (Weight)^2 \n",
)

# Make a prediction
polyInputs2 = polyFeatures2.fit_transform([[20, 3000]])
yPolyPredicted2 = polyModel2.predict(polyInputs2)
print(
"Predicted MPG for a car with acceleration = 20 seconds and Weight = 3000 pounds \
n",
"using the polynomial regression is ", yPolyPredicted2[0][0], "miles per gallon",
)

LaGuardia Airport (LGA) is a major airport serving New York City. LGA wanted to predict
the arrival delay of an incoming flight based on the departure delay. 50 recent flights
were randomly selected, and the arrival delays (in minutes) were recorded.
 Initialize a multiple regression model for predicting arrival delay based on
departure delay and flight distance.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsLGA.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_delay', 'distance']].values.reshape(-1, 2)
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


multipleModel = LinearRegression()# Your code goes here

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

print('Intercept:', multipleModel.intercept_)

John F. Kennedy International Airport (JFK) is a major airport serving New York City. JFK
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
 Create a dataframe containing month (month) and distance (distance) in that
order. Use the reshape() function to ensure the input features are in the proper
format.
The code contains all imports, loads the dataset, fits the regression model, and prints the
model's intercept.

# Import packages and functions


import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsJFK.csv').dropna()

# Define X and y and convert to proper format


X = flights[['month','distance']].values.reshape(-1,2)# Your code goes here
y = flights[['arr_delay']].values.reshape(-1, 1)

# Initialize a linear regression model


multipleModel = LinearRegression()

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

print('Intercept:', multipleModel.intercept_)

Newark Liberty International Airport (EWR) is a major airport serving New York City. EWR
wanted to predict the arrival delay of an incoming flight based on the departure delay. 50
recent flights were randomly selected, and the arrival delays (in minutes) were recorded.
 Predict the arrival delay for a flight with departure time of 1868 and distance of
1752, and assign variable yHat with the prediction.
 Calculate the slope coefficients for multipleModel and assign slope with the result.
The code contains all imports, loads the dataset, fits the multiple regression model, and
prints yHat and slope once calculated.
# Import packages and functions
import pandas as pd
from sklearn.linear_model import LinearRegression

# Import flights and remove missing values


flights = pd.read_csv('flightsEWR.csv').dropna()

# Define X and y and convert to proper format


X = flights[['dep_time', 'distance']].values.reshape(-1, 2)
y = flights[['arr_delay']].values.reshape(-1, 1)
# Initialize a linear regression model
multipleModel = LinearRegression()

# Fit the linear model


multipleModel = multipleModel.fit(X, y)

# Predict the arrival delay and save the slope coefficient


# Your code goes here
yHat = multipleModel.predict([[1868, 1752]])
slope = multipleModel.coef_
print('Predicted arrival delay:', yHat)
print('Slope coefficients:', slope)

Logistic regression in Python.

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Load the Wisconsin Breast Cancer dataset


WBCD = pd.read_csv("WisconsinBreastCancerDatabase.csv")
# Convert Diagnosis to 0 and 1.
WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1
WBCD

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression()
logisticModel.fit(X, np.ravel(y.astype(int)))

# Graph logistic regression probabilities


plt.scatter(X, y)
xDelta = np.linspace(X.min(), X.max(), 10000)
yPredicted = logisticModel.predict(X).reshape(-1, 1).astype(int)
yDeltaProb = logisticModel.predict_proba(xDelta.reshape(-1, 1))[:, 1]
plt.plot(xDelta, yDeltaProb, color='red')
plt.xlabel('Radius', fontsize=14);
plt.ylabel('Probability of malignant tumor', fontsize=14);

# Display the slope parameter estimate


logisticModel.coef_

# Display the intercept parameter estimate


logisticModel.intercept_

# Predict the probability a tumor with radius mean 13 is benign / malignant


pHatProb = logisticModel.predict_proba([[13]])
pHatProb[0]

# Classify whether tumor with radius mean 13 is benign (0) or malignant (1)
pHat = logisticModel.predict([[13]])
pHat[0]

print(
"A tumor with radius mean 13 has predicted probability: \n",
pHatProb[0][0],
"of being benign\n",
pHatProb[0][1],
"of being malignant\n",
"and overall is classified to be benign",
)
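As a sanity check, the same probability can be recomputed directly from the fitted intercept and slope with the logistic function; a minimal sketch using the parameter estimates above:

# Manually apply the logistic function at radius mean 13
z = logisticModel.intercept_[0] + logisticModel.coef_[0][0] * 13
print('Manual probability of malignant:', 1 / (1 + np.exp(-z)))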

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on wind speed.
 Fit the logistic regression model, logisticModel, to predict whether a wildfire will
occur.
The code contains all imports, loads the dataset, and prints the model coefficients.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['wind'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel.fit(X,y)# Your code goes here

# Print the estimated coefficients


print('Slope:', logisticModel.coef_[0][0])
print('Intercept:', logisticModel.intercept_[0])

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on temperature.
 Use the fitted logistic regression model, logisticModel, to predict whether a wildfire
will occur on a day with temperature = 25. Assign the prediction to pred.
The code contains all imports, loads the dataset, and prints the prediction.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['temp'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel = logisticModel.fit(X, y)

# Calculate the predicted value and assign to pred


pred = logisticModel.predict([[25]]) # Your code goes here

# Print the predicted value


print('Is a wildfire predicted? (0 = no, 1 = yes):', pred[0])

The US Forest Service regularly monitors weather conditions to predict which areas are
at risk of wildfires. Data scientists working with the US Forest Service would like to
predict whether a wildfire will occur based on daily rainfall.
 Use the fitted logistic regression model, logisticModel, to calculate the probabilities
of each outcome on a day with daily rainfall = 2. Assign the probabilities to prob.
The code contains all imports, loads the dataset, and prints the probabilities.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the dataset


fires = pd.read_csv('fires.csv')

# Create input matrix X and output matrix y


X = fires['rain'].values.reshape(-1, 1)
y = np.ravel(fires['fire'])

# Define and fit the logistic regression model


logisticModel = LogisticRegression()
logisticModel = logisticModel.fit(X, y)

# Calculate the probabilities and assign to prob


prob = logisticModel.predict_proba([[2]])# Your code goes here

# Print the predicted value


print('Probability of no wildfire:', prob[0][0])
print('Probability of a wildfire:', prob[0][1])

LAB: Creating simple linear regression models


The nbaallelo_slr dataset contains information on 126315 NBA games between 1947 and
2015. The columns report the points made by one team, the Elo rating of that team
coming into the game, the Elo rating of the team after the game, and the points made by
the opposing team. The Elo score measures the relative skill of teams in a league.
 Load the dataset into a data frame.
 Create a new column y in the data frame that is the difference between the points
made by the two teams.
 Use sklearn's LinearRegression() function to perform a simple linear regression on
the y and elo_i columns.
 Compute the proportion of variation explained by the linear regression using
the LinearRegression object's score method.

# Import the necessary modules


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# Read in nbaallelo_slr.csv
nba = pd.read_csv('nbaallelo_slr.csv')
# Your code here

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']
# Your code here

# Store relevant columns as variables


X = nba[['elo_i']].values.reshape (-1,1)
# Your code here
y = nba[['y']].values.reshape (-1,1)
# Your code here

# Initialize the linear regression model


SLRModel = LinearRegression()
# Your code here
# Fit the model on X and y
SLRModel.fit(X,y)

# Your code here

# Print the intercept


intercept = SLRModel.intercept_
# Your code here
print('The intercept of the linear regression line is ', end="")
print('%.3f' % intercept[0] + ". ")

# Print the slope


slope = SLRModel.coef_
# Your code here
print('The slope of the linear regression line is ', end="")
print('%.3f' % slope[0][0] + ". ")

# Compute the proportion of variation explained by the linear regression
# using the LinearRegression object's score method
score = SLRModel.score(X,y)
# Your code here
print('The proportion of variation explained by the linear regression model is ', end="")
print('%.3f' % score + ". ")

LAB: Performing logistic regression using LogisticRegression()


The nbaallelo_log file contains data on 126314 NBA games from 1947 to 2015. The
dataset includes the features pts, elo_i, win_equiv, and game_result. Using the csv
file nbaallelo_log.csv and scikit-learn's LogisticRegression function, construct a logistic
regression model to classify whether a team will win or lose a game based on the team's
elo_i score.
 Hot encode the game_result variable as a numeric variable with 0 for L and 1 for W
 Use the LogisticRegression function to construct a logistic regression model
with game_result as the target and elo_i as the predictor.
 Predict the probability of a win from an elo_i score of 1310.
 Predict whether a team with an elo_i score of 1310 will win.

# Import the necessary libraries


import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load nbaallelo_log.csv into a dataframe


NBA = pd.read_csv("nbaallelo_log.csv")

# Hot encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1
# Your code here

# Store relevant columns as variables


X = NBA[['elo_i']].values.reshape(-1, 1)
y = NBA[['game_result']].values.ravel().astype(int)

# Initialize and fit the logistic model using the LogisticRegression function
NBAmodel = LogisticRegression()
NBAmodel.fit(X,y)
# Your code here

# Predict the probability that an elo_i score of 1310 is a win / loss


outcomeProb = NBAmodel.predict_proba([[1310]])
# Your code here

# Predict whether an elo_i score of 1310 is a win (1) or loss (0)


outcomePred = NBAmodel.predict([[1310]])

# Your code here

print("A team with the given elo_i score has predicted probability: \n", end="")
print('%.3f' % outcomeProb[0][0] + " losing\n", end="")
print('%.3f' % outcomeProb[0][1] + " winning")
print("and the overall prediction is",
outcomePred[0])
Chapter 8
Binary classification metrics in Python.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Load breast cancer data and hot encodes categorical variable


WBCD = pd.read_csv("WisconsinBreastCancerDatabase.csv")
WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression()
logisticModel.fit(X, np.ravel(y.astype(int)))
cutoff = 0.5
yPredictedProb = logisticModel.predict_proba(X)[:, 1]
yPredLowCutoff = []
for i in range(0, yPredictedProb.size):
    if yPredictedProb[i] < cutoff:
        yPredLowCutoff.append(0)
    else:
        yPredLowCutoff.append(1)

# Display confusion matrix


metrics.confusion_matrix(y, yPredLowCutoff)

# Display accuracy
metrics.accuracy_score(y, yPredLowCutoff)

# Display precision
metrics.precision_score(y, yPredLowCutoff)

# Display recall
metrics.recall_score(y, yPredLowCutoff)
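The same three metrics can be recovered from the confusion matrix counts; a minimal sketch (for binary labels, ravel() returns TN, FP, FN, TP):

# Relate the confusion matrix counts to accuracy, precision, and recall
tn, fp, fn, tp = metrics.confusion_matrix(y, yPredLowCutoff).ravel()
print('Accuracy: ', (tp + tn) / (tp + tn + fp + fn))
print('Precision:', tp / (tp + fp))
print('Recall:   ', tp / (tp + fn))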

# Plot the ROC curve


pred = logisticModel.predict_proba(X)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(
fpr=fpr, tpr=tpr, roc_auc=roc_auc, pos_label='Malignant, 1'
)
display.plot()
plt.show()

Loss functions for regression in Python.


# Import packages
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Load tortoise data


tortoise = pd.read_csv("Tortoises.csv")

# Store relevant columns as variables


X = tortoise["Length"]
y = tortoise["Clutch"]

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123
)

# Create a linear model using the training set and predictions using the test set
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
linModel = LinearRegression()
linModel.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
y_pred = np.ravel(linModel.predict(X_test.reshape(-1, 1)))

# Display linear model and scatter plot of the test set


plt.scatter(X_test, y_test)
plt.xlabel("Length (mm)", fontsize=14)
plt.ylabel("Clutch size", fontsize=14)
plt.plot(X_test, y_pred, color='red')
plt.ylim([0, 14])
for i in range(5):
    plt.plot([X_test[i], X_test[i]], [y_test[i], y_pred[i]], color='grey', linewidth=2)

# Display MSE
metrics.mean_squared_error(y_test, y_pred)

# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)

# Display MAE
metrics.mean_absolute_error(y_test, y_pred)
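These three loss functions can also be computed directly from the residuals; a minimal sketch using NumPy:

# MSE, RMSE, and MAE computed from the residuals
residuals = y_test - y_pred
print('MSE: ', np.mean(residuals ** 2))
print('RMSE:', np.sqrt(np.mean(residuals ** 2)))
print('MAE: ', np.mean(np.abs(residuals)))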

# Create a quadratic model using the training set and predictions using the test set
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
poly = PolynomialFeatures().fit_transform(X_train.reshape(-1, 1))
poly_reg_model = LinearRegression().fit(poly, y_train)
poly_test = PolynomialFeatures().fit_transform(X_test.reshape(-1, 1))
y_pred = poly_reg_model.predict(poly_test)

# Display quadratic model and scatter plot of the test set


plt.scatter(X_test, y_test)
plt.xlabel("Length (mm)", fontsize=14)
plt.ylabel("Clutch size", fontsize=14)
x = np.linspace(X_test.min(), X_test.max(), 100)
y=(
poly_reg_model.coef_[2] * x**2
+ poly_reg_model.coef_[1] * x
+ poly_reg_model.intercept_
)
plt.plot(x, y, color='red', linewidth=2)
plt.ylim([0, 14])
for i in range(5):
    plt.plot([X_test[i], X_test[i]], [y_test[i], y_pred[i]], color='grey', linewidth=2)

# Display MSE
metrics.mean_squared_error(y_test, y_pred)

# Display RMSE
metrics.mean_squared_error(y_test, y_pred, squared=False)

# Display MAE
metrics.mean_absolute_error(y_test, y_pred)

Loss functions for classification in Python.


# Import packages and functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Wisconsin Breast Cancer dataset


WBCD = pd.read_csv('WisconsinBreastCancerDatabase.csv')

# Convert Diagnosis to 0 and 1


WBCD.loc[WBCD['Diagnosis'] == 'B', 'Diagnosis'] = 0
WBCD.loc[WBCD['Diagnosis'] == 'M', 'Diagnosis'] = 1

# Store relevant columns as variables


X = WBCD[['Radius mean']].values.reshape(-1, 1)
y = WBCD[['Diagnosis']].values.reshape(-1, 1).astype(int)

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123
)

# Logistic regression predicting diagnosis from tumor radius


logisticModel = LogisticRegression();
logisticModel.fit(X_train, np.ravel(y_train.astype(int)));

# Graph logistic regression probabilities


plt.scatter(X_test, y_test)
x_prob = np.linspace(X_test.min(), X_test.max(), 1000)
y_prob = logisticModel.predict_proba(x_prob.reshape(-1, 1))[:, 1]
plt.plot(x_prob, y_prob, color='red')
plt.xlabel('Radius mean', fontsize=14);
plt.ylabel('Probability of malignant tumor', fontsize=14);

# Predict the probabilities for the test set


p_hat = logisticModel.predict_proba(X_test)

# Display the log-loss


metrics.log_loss(y_test, p_hat)
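Log-loss can also be computed directly from its definition; a minimal sketch using the malignant-class probabilities:

# log-loss = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) )
yTrue = np.ravel(y_test)
pMalignant = p_hat[:, 1]
print(-np.mean(yTrue * np.log(pMalignant) + (1 - yTrue) * np.log(1 - pMalignant)))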

Training-validation-test split in Python.


# Import packages
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Load bad drivers data


badDrivers = pd.read_csv('bad-drivers.csv')

# Set the proportions of the training-validation-test split


trainingProportion = 0.70
validationProportion = 0.10
testProportion = 0.20

# Split off the test data


trainingAndValidationData, testData = train_test_split(
badDrivers, test_size=testProportion
)

# Split the remaining into training and validation data


trainingData, validationData = train_test_split(
trainingAndValidationData,
train_size=trainingProportion / (trainingProportion + validationProportion),
)

# Display the scatter plot for the entire sample data


plt.scatter(
    badDrivers[['Losses incurred by insurance companies for collisions per insured driver ($)']],
badDrivers[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Sample data')
plt.show()

# Display the scatter plot for the training data


plt.scatter(
    trainingData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
trainingData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Training data')
plt.show()

# Display the scatter plot for the validation data


plt.scatter(
    validationData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
validationData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Validation data')
plt.show()

# Display the scatter plot for the test data


plt.scatter(
    testData[['Losses incurred by insurance companies for collisions per insured driver ($)']],
testData[['Car Insurance Premiums ($)']],
)
plt.xlabel('Losses incurred by insurance companies', fontsize=14)
plt.ylabel('Car insurance premiums', fontsize=14)
plt.xlim(80, 200)
plt.ylim(600, 1400)
plt.title('Test data')
plt.show()

train_test_split(df, train_size=0.90)
A, B = train_test_split(df, test_size=0.05)
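These calls are generic; applied to the badDrivers data loaded above (the variable names here are only illustrative), the resulting pieces hold roughly 90/10 and 95/5 percent of the rows:

# Illustrative splits of the badDrivers data
train90, test10 = train_test_split(badDrivers, train_size=0.90)
A, B = train_test_split(badDrivers, test_size=0.05)
print(len(train90), len(test10))
print(len(A), len(B))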

Researchers collected measurements from loblolly pines.


 Set the proportions for the training dataset to 70%, validation dataset to 10%, and
testing dataset to 20%.
The code provided contains all imports, loads the dataset, splits the dataset into training,
validation, and test datasets, prints the sizes of these samples, and prints the test
dataset.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2)

# Load the dataset


pines = pd.read_csv('pinesSample.csv')

# Set proportions of train-validate-test split

# Your code goes here


trainingPropPercent = 0.70
validatingPropPercent = 0.10
testingPropPercent = 0.20

# Split dataset into training/validation data and testing data


trainAndValidate, testingDataPercent = train_test_split(pines,
test_size=testingPropPercent, random_state=rng)

# Split training/validation data into training data and validation data


trainingDataPercent, validatingDataPercent = train_test_split(trainAndValidate,
train_size=trainingPropPercent/(trainingPropPercent+validatingPropPercent),
random_state=rng)

# Print split sizes and test dataset


print('original dataset:', len(pines),
'\ntrain_data:', len(trainingDataPercent),
'\nvalidation_data:', len(validatingDataPercent),
'\n\ntest_data:', len(testingDataPercent),
'\n', testingDataPercent)
# Import packages and functions
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Load the dataset


pines = pd.read_csv('pinesSample.csv')

# Set proportions of train-validate-test split


trainingPropPercent = 0.6
validatingPropPercent = 0.2
testingPropPercent = 0.2

# Split dataset into training/validation data and testing data

trainAndValidate, testingDataPercent = train_test_split(
    pines, test_size=testingPropPercent, random_state=rng)  # Your code goes here

# Split training/validation data into training data and validation data


trainingDataPercent, validatingDataPercent = train_test_split(
trainAndValidate,
train_size=trainingPropPercent/(trainingPropPercent+validatingPropPercent),
random_state=rng
)

# Print split sizes and test dataset


print('original dataset:', len(pines),
      '\ntrain_data:', len(trainingDataPercent),
      '\nvalidation_data:', len(validatingDataPercent),
      '\n\ntest_data:', len(testingDataPercent),
      '\n', testingDataPercent)

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(33)

# Load the dataset


loblolly = pd.read_csv('loblollySample.csv')
# Set proportions of train-validate-test split
trainPropPercent = 0.6
validatePropPercent = 0.2
testPropPercent = 0.2

# Split dataset into training/validation data and testing data


trainAndValidate, testDataPercent = train_test_split(
loblolly,
test_size=testPropPercent,
random_state=rng
)

# Split training/validation data into training data and validation data

trainDataPercent, validateDataPercent = train_test_split(


trainAndValidate,
train_size=trainPropPercent / (trainPropPercent + validatePropPercent),
random_state=rng
) # Your code goes here

# Print split sizes and test dataset


print('original dataset:', len(loblolly),
'\ntrain_data:', len(trainDataPercent),
'\nvalidation_data:', len(validateDataPercent),
'\n\ntest_data:', len(testDataPercent),
'\n', testDataPercent
)

k-fold cross-validation in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Import dataset
badDrivers = pd.read_csv('bad-drivers.csv')

# Split off 20% of the data to be left out as test data


badDriversTrainingdata, testData = train_test_split(badDrivers, test_size=0.20)

# Store relevant columns as variables


X = badDriversTrainingdata[
['Losses incurred by insurance companies for collisions per insured driver ($)']
].values.reshape(-1, 1)
y = badDriversTrainingdata[['Car Insurance Premiums ($)']].values.reshape(-1, 1)

# Fit a linear model to the data


linModel = LinearRegression()
linModel.fit(X, y)
yPredicted = linModel.predict(X)

# Plot the fitted model


plt.scatter(X, y, color='black')
plt.plot(X, yPredicted, color='blue', linewidth=1)
plt.xlabel('Losses incurred by insurance companies', fontsize=14);
plt.ylabel('Car insurance premiums', fontsize=14);

# neg_mean_squared_error is the negative MSE, so negate it so the scores are positive.


ten_fold_scores = -cross_val_score(
linModel, X, y, scoring='neg_mean_squared_error', cv=10
)

# neg_mean_squared_error is the negative MSE, so negate it so the scores are positive.


LOOCV_scores = -cross_val_score(linModel, X, y, scoring='neg_mean_squared_error',
cv=40)

# Plot the errors for both scores


plt.plot(np.zeros_like(ten_fold_scores), ten_fold_scores, '.')
plt.plot(np.zeros_like(LOOCV_scores) + 1, LOOCV_scores, '.')
plt.ylabel('Mean squared errors', fontsize=14);
plt.xticks([0, 1], ['10-fold', 'LOOCV']);

cross_val_score(M, X, y, cv=5)
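In the call above, M stands for any scikit-learn estimator; applied to the linear model from this section with the default scorer (R-squared for regression), a sketch looks like:

# 5-fold cross-validation with the default scorer
fiveFoldScores = cross_val_score(linModel, X, y, cv=5)
print(fiveFoldScores)
print(fiveFoldScores.mean())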

Bootstrap method in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data set


badDrivers = pd.read_csv('bad-drivers.csv')

# Create bootstrap samples and collect errors

bootstrapErrors = []
for i in range(0, 30):
    # Create the bootstrap sample and the out-of-bag sample
    boot = resample(badDrivers, replace=True, n_samples=51)
    oob = badDrivers[~badDrivers.index.isin(boot.index)]

    # Fit a linear model to the bootstrap sample
    XBoot = boot[
        ['Losses incurred by insurance companies for collisions per insured driver ($)']
    ].values.reshape(-1, 1)
    yBoot = boot[['Car Insurance Premiums ($)']].values.reshape(-1, 1)
    linModel = LinearRegression()
    linModel.fit(XBoot, yBoot)

    # Predict y values for the out-of-bag sample
    XOob = oob[
        ['Losses incurred by insurance companies for collisions per insured driver ($)']
    ].values.reshape(-1, 1)
    YOob = oob[['Car Insurance Premiums ($)']].values.reshape(-1, 1)
    YOobPredicted = linModel.predict(XOob)

    # Calculate the error
    bootError = mean_squared_error(YOob, YOobPredicted)
    bootstrapErrors.append(bootError)

# Calculate the mean of the errors


np.mean(bootstrapErrors)

# Calculate the standard deviation of the errors


np.std(bootstrapErrors)

# Plot the errors


plt.plot(bootstrapErrors, np.zeros_like(bootstrapErrors), '.')
plt.xlabel('Bootstrap errors (MSE)', fontsize=14)
plt.gca().axes.yaxis.set_ticks([]);
Model selection in Python.

# Import packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Import dataset
thurber = pd.read_csv('Thurber.csv')

# Split off 20% of the data to be left out as test data


thurberTrainingData, test_data = train_test_split(thurber, test_size=0.20)

# Store relevant columns as variables


X = thurberTrainingData[['log(Density)']].values.reshape(-1, 1)
y = thurberTrainingData[['Electron mobility']].values.reshape(-1, 1)

# Fit a cubic regression model


polyFeatures = PolynomialFeatures(degree=3, include_bias=False)
XPoly = polyFeatures.fit_transform(X)
polyModel = LinearRegression()
polyModel.fit(XPoly, y)

# Graph the scatterplot and the polynomial regression


plt.scatter(X, y, color='black')
xDelta = np.linspace(X.min(), X.max(), 1000)
yDelta = polyModel.predict(polyFeatures.fit_transform(xDelta.reshape(-1, 1)))
plt.plot(xDelta, yDelta, color='blue', linewidth=2)
plt.xlabel('log(Density)', fontsize=14);
plt.ylabel('Electron mobility', fontsize=14);

# Collect cross-validation metrics


cvMeans = []
cvStdDev = []

for i in range(1, 7):
    # Fit a degree i polynomial regression model
    polyFeatures = PolynomialFeatures(degree=i, include_bias=False)
    XPoly = polyFeatures.fit_transform(X)
    polyModel = LinearRegression()
    polyModel.fit(XPoly, y)

    # Carry out 10-fold cross-validation for the degree i polynomial regression model
    polyscore = -cross_val_score(
        polyModel, XPoly, y, scoring='neg_mean_squared_error', cv=10
    )

    # Store the mean and standard deviation of the 10-fold cross-validation
    # for the degree i polynomial regression model
    cvMeans.append(np.mean(polyscore))
    cvStdDev.append(np.std(polyscore))

# Graph the errorbar chart using the cross-validation means and std deviations
plt.errorbar(x=range(1, 7), y=cvMeans, yerr=cvStdDev, marker='o', color='black')
plt.xlabel('Degree of regression polynomial', fontsize=14)
plt.ylabel('Mean squared error', fontsize=14)
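One way to read a selected model off the chart is to take the degree with the smallest mean cross-validation error; a minimal sketch using cvMeans from above:

# Choose the polynomial degree with the lowest mean 10-fold CV error
bestDegree = np.argmin(cvMeans) + 1  # +1 because degrees start at 1
print('Degree with lowest mean CV error:', bestDegree)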

Linear model for predicting house prices.

# Import packages and functions


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression

# Import and view data


homes = pd.read_csv('homes.csv').dropna()
homes

# Set seed
seed = 123

# Set proportion of data for the test set


test_p = 0.20

# Define input and output features


X = homes[['Floor']]
y = homes[['Price']]
# Create training and testing data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_p, random_state=seed
)

# Plot training dataset and regression line


p = sns.regplot(x=X_train, y=y_train, ci=False, line_kws={'color': 'black'})
p.set_xlabel('Square feet (1000s)', fontsize=14);
p.set_ylabel('Price ($1000s)', fontsize=14);
p.set_title('Training model', fontsize=16);

# Initialize and fit the linear model


linearModel = LinearRegression()
linearModel = linearModel.fit(X_train, y_train)

# Print model coefficients


print('beta1 =', linearModel.coef_)
print('beta0 =', linearModel.intercept_)

# Regression metrics on training dataset


y_pred = linearModel.predict(X_train)
print('MSE =', mean_squared_error(y_train, y_pred))
print('MAE =', mean_absolute_error(y_train, y_pred))
print('R-squared =', r2_score(y_train, y_pred))

# Regression metrics on testing dataset


y_pred = linearModel.predict(X_test)
print('MSE =', mean_squared_error(y_test, y_pred))
print('MAE =', mean_absolute_error(y_test, y_pred))
print('R-squared =', r2_score(y_test, y_pred))

# Plot the model for the training and testing sets


plt.rcParams["figure.figsize"] = (12, 5)

x = pd.array([1, 2, 3])
yhat = 213.13396131 + 37.92605345 * x

plt.subplot(1, 2, 1)

# Training set subplot


p = sns.scatterplot(x=X_train['Floor'], y=y_train['Price'])
plt.plot(x, yhat, color='black')
p.set_xlabel('Square feet (1000s)', fontsize=14)
p.set_ylabel('Price ($1000s)', fontsize=14)
p.set_title('Training dataset', fontsize=16)
p.set_ylim(140, 460)

plt.subplot(1, 2, 2)
# Testing set subplot
p = sns.scatterplot(x=X_test['Floor'], y=y_test['Price'])
plt.plot(x, yhat, color='black')
p.set_xlabel('Square feet (1000s)', fontsize=14);
p.set_ylabel('Price ($1000s)', fontsize=14);
p.set_title('Testing dataset', fontsize=16);
p.set_ylim(140, 460);

8.10 LAB: Evaluating linear regression using cross-validation

The nbaallelo_slr dataset contains information on 126315 NBA games between 1947 and
2015. The columns report the points made by one team, the Elo rating of that team
coming into the game, the Elo rating of the team after the game, and the points made by
the opposing team. The Elo rating measures the relative skill of teams in a league.
 The code creates a new column y in the data frame that is the difference
between pts and opp_pts.
 Split the data into 70 percent training set and 30 percent testing set
using sklearn's train_test_split function. Set random_state=0.
 Store elo_i and y from the training data as the variables X and y.
 The code performs a simple linear regression on X and y.
 Perform 10-fold cross-validation with the default scorer using scikit-
learn's cross_val_score function.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

nba = pd.read_csv("nbaallelo_slr.csv")

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']

# Split the data into training and test sets


train, test = # Your code here

# Store relevant columns as variables


X = # Your code here
y = # Your code here

# Initialize the linear regression model


SLRModel = LinearRegression()
# Fit the model on X and y
SLRModel.fit(X,y)

# Perform 10-fold cross-validation with the default scorer


tenFoldScores = # Your code here
print('The cross-validation scores are', tenFoldScores)

Solution:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

nba = pd.read_csv("nbaallelo_slr.csv")

# Create a new column in the data frame that is the difference between pts and opp_pts
nba['y'] = nba['pts'] - nba['opp_pts']

# Split the data into training and test sets


# Your code here

# Store relevant columns as variables


X = nba[['elo_i']].values # Your code here
y = nba[['y']].values# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=0)

# Initialize the linear regression model


SLRModel = LinearRegression()
# Fit the model on X and y
SLRModel.fit(X,y)

# Perform 10-fold cross-validation with the default scorer


tenFoldScores = cross_val_score(SLRModel, X_train, y_train, scoring='r2', cv=10)
# Your code here
print('The cross-validation scores are', tenFoldScores)

Chapter 9

k-nearest neighbors classification in Python

# Import needed packages for classification


from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Import packages for visualization of results


import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from mlxtend.plotting import plot_decision_regions

# Import packages for evaluation


from sklearn.model_selection import train_test_split
from sklearn import metrics

# Read data, clean up names

beans = pd.read_csv('Dry_Bean_Dataset.csv')
beans['Class'] = beans['Class'].str.capitalize()
print(beans.shape)
beans.describe()

# Initialize model
beanKnnClassifier = KNeighborsClassifier(n_neighbors=5)
# Split data
X = beans[['MajorAxisLength', 'MinorAxisLength']]
y = beans[['Class']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Train model and make predictions for the test set.


beanKnnClassifier.fit(X_train_scaled, np.ravel(y_train))
y_pred = beanKnnClassifier.predict(scaler.transform(X_test))

# Predict one bean


bean = pd.DataFrame(data={'MajorAxisLength': [400], 'MinorAxisLength': [200]})
beanKnnClassifier.predict(scaler.transform(bean))

# Take a sample to keep runtime low while seeing what areas are classified as each bean
beanSample = beans.sample(200, random_state=123)
beanSample.describe()

# Create integer-valued labels for plot_decision_regions()


beanSample['Int'] = beanSample['Class'].replace(
to_replace = ['Barbunya', 'Bombay', 'Cali', 'Dermason', 'Horoz', 'Seker', 'Sira'],
value = [int(0), int(1), int(2), int(3), int(4), int(5), int(6)])

# Define input and output features


X = beanSample[['MajorAxisLength', 'MinorAxisLength']]
y = beanSample[['Int']]

# Fit model
beanKnnClassifier.fit(X, np.ravel(y))

# Set background opacity to 20%


contourf_kwargs = {'alpha': 0.2}

# Plot decision boundary regions


p = plot_decision_regions(X.to_numpy(), np.ravel(y), clf=beanKnnClassifier,
contourf_kwargs=contourf_kwargs)

# Add title and axis labels


p.set_xlabel('MajorAxisLength', fontsize=14)
p.set_ylabel('MinorAxisLength', fontsize=14)

# Add legend
L = plt.legend()

L.get_texts()[0].set_text('Barbunya')
L.get_texts()[1].set_text('Bombay')
L.get_texts()[2].set_text('Cali')
L.get_texts()[3].set_text('Dermason')
L.get_texts()[4].set_text('Horoz')
L.get_texts()[5].set_text('Seker')
L.get_texts()[6].set_text('Sira')

This dataset contains data on sleep habits for 30 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
 Initialize a k-nearest neighbors classification model with k=4.
The code contains all imports, loads the dataset, fits the model, and applies the model

# Import packages and functions


import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Import dataset
sleep = pd.read_csv('sleep.csv')

# Create input matrix X and output matrix y


X = sleep[['awake', 'sleep_rem']]
y = sleep[['vore']]

knnModel= KNeighborsClassifier(n_neighbors=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Your code goes here

knnModel = knnModel.fit(X, np.ravel(y))

# Print predictions
print(knnModel.predict(X))

This dataset contains data on sleep habits for 25 randomly selected mammals. Each
mammal is categorized as an omnivore, herbivore, carnivore, or insectivore.
REM sleep cycles of guinea pigs average 0.8 hours. Guinea pigs are awake on average
14.6 hours per day.
 Use the kneighbors() method to find the instances in the training data that are
closest to guinea pigs. Assign the instances, but not the distances, to neighbors.
The code contains all imports, loads the dataset, initializes the model, and applies the
model to a test dataset.

# Import packages and functions


import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier


from sklearn.model_selection import train_test_split

# Import dataset
sleep = pd.read_csv('sleep.csv')

# Create input matrix X and output matrix y


X = sleep[['sleep_rem', 'awake']]
y = sleep[['vore']]
knnModel = KNeighborsClassifier(n_neighbors=5)
knnModel = knnModel.fit(X.values, np.ravel(y.values))
guinea_pig = np.array([[0.8, 14.6]])
neighbors = knnModel.kneighbors(guinea_pig, return_distance=False)  # Your code goes here

# Print neighbors
print(neighbors)
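If the distances to those neighbors are also of interest, kneighbors() can return them as well; a minimal sketch:

# return_distance=True returns the distances along with the neighbor indices
distances, indices = knnModel.kneighbors(guinea_pig, return_distance=True)
print(distances)
print(indices)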

Naive Bayes classification in Python.


# Import packages and functions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Read in the data and view the first five instances.


# File does not include column headers so they are provided via names.
messages = pd.read_table('SMSSpamCollection.csv', names=['Class', 'Message'])
messages.head()

# Split into testing and training sets


X_train, X_test, Y_train, Y_test = train_test_split(
messages['Message'], messages['Class'], random_state=123
)

# Count the words that appear in the messages


vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(X_train)
# Uncomment the line below to see the words.
#vectorizer.vocabulary_

# Count the words in the training set and store in a matrix


X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

# Initialize the model and fit with the training data


NBmodel = MultinomialNB()
NBmodel.fit(X_train_vectorized, Y_train)
# Make predictions onto the training and testing sets.
trainPredictions = NBmodel.predict(vectorizer.transform(X_train))
testPredictions = NBmodel.predict(vectorizer.transform(X_test))

# How does the model work on the training set?


confusion_matrix(Y_train, trainPredictions)

# Display that in terms of correct porportions


confusion_matrix(Y_train, trainPredictions, normalize='true')

# How does the model work on the test set?


confusion_matrix(Y_test, testPredictions, normalize='true')

# Predict some phrases. Add your own.


NBmodel.predict(
vectorizer.transform(
["Big sale today! Free cash.",
"I'll be there in 5"]))

Re: 1/200 * (200-25)/200 * (200-21)/200 * 200/400 = 0.0019578125

Not Re: 19/200 * 87/200 * (200-5)/200 = 0.040291875
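A quick check of the arithmetic in the two products above (the counts themselves come from the exercise's frequency table, which is not reproduced here):

# Verify the two Naive Bayes products
print(1/200 * (200-25)/200 * (200-21)/200 * 200/400)  # approximately 0.0019578125
print(19/200 * 87/200 * (200-5)/200)                  # approximately 0.040291875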

Support vector machine classification in Python.


# Load packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Load and view data


penguins = sns.load_dataset('penguins')
penguins

# Remove the penguins with missing data


penguinsClean = penguins[~penguins['body_mass_g'].isna()]

# Only use numeric values. Categorical values could be encoded as dummy variables.

X = penguinsClean[
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
]
Y = penguinsClean['species']

# Split the data into training and testing sets.


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=20220621)

# Scale the input variables because SVM is dependent on differences in scale for distances
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define and fit the model.


# Adjust C from 0.01 to 100 by changing the number of decimal places or zeros.
# C controls the slope of the hinge function. Larger values make misclassification less frequent.

penguinsSVMlinear = svm.SVC(kernel='linear', C=0.01)


penguinsSVMlinear.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMlinear.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)
# Adjust the number of decimal places in gamma and C.
# gamma affects the distance over which a point has influence; smaller values of gamma
# allow its influence to spread more.

penguinsSVMrbf = svm.SVC(kernel='rbf', C=10, gamma=0.01)


penguinsSVMrbf.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMrbf.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)
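Because C and gamma jointly control how much misclassification is tolerated and how far each point's influence reaches, it can help to compare a few settings on the test set; a minimal sketch assuming the scaled splits above (accuracy_score is an extra import not used elsewhere in this example):

from sklearn.metrics import accuracy_score

# Compare test-set accuracy for a small grid of C and gamma values
for C in [0.01, 1, 100]:
    for gamma in [0.01, 0.1, 1]:
        model = svm.SVC(kernel='rbf', C=C, gamma=gamma)
        model.fit(X_train_scaled, Y_train)
        acc = accuracy_score(Y_test, model.predict(X_test_scaled))
        print('C =', C, 'gamma =', gamma, 'accuracy =', round(acc, 3))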

# Adjust the number of decimal places in C and change degree by steps of 1.


# Degree impacts the degree of the polynomial for the kernel.

penguinsSVMpoly = svm.SVC(kernel='poly', C=0.1, degree=5)


penguinsSVMpoly.fit(X_train_scaled, Y_train)

# Predict for the test set


Y_pred = penguinsSVMpoly.predict(X_test_scaled)

# Display the confusion matrix


confusion_matrix(Y_test, Y_pred)

# The number of support vectors for each class


penguinsSVMrbf.n_support_

# Which instances in the training set are support vectors


penguinsSVMrbf.support_

# The coefficients of the hyperplanes for each pair of classes, where each hyperplane
# has the form coefficient1*variable1 + coefficient2*variable2 + ... + intercept = 0
penguinsSVMlinear.coef_

# The intercept of the hyperplanes for each pair of classes.


penguinsSVMlinear.intercept_
K-nearest neighbors classification

The dataset SDSS contains 17 observational features and one class feature for 10000
deep sky objects observed by the Sloan Digital Sky Survey.
Use sklearn's KNeighborsClassifier function to perform kNN classification to classify each
object by the object's redshift and u-g color.
 Import the necessary modules for kNN classification.
 Create a dataframe X with features redshift and u_g.
 Create dataframe y with feature class.
 Initialize a kNN model with k=3.
 Fit the model using the training data.
 Find the predicted classes for the test data.
 Calculate the accuracy score and confusion matrix.

# Import needed packages for classification


from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Import packages for visualization of results
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from mlxtend.plotting import plot_decision_regions

# Import packages for evaluation


from sklearn.model_selection import train_test_split
from sklearn import metrics
skySurvey = pd.read_csv('SDSS.csv')
skySurvey['u_g'] = skySurvey['u']-skySurvey['g']

# Initialize model with k=3


skySurveyKnn =KNeighborsClassifier(n_neighbors=3) # Your code here
X = skySurvey[['redshift', 'u_g']] # Features
y = skySurvey['class']

# Fit model using X_train and y_train


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
skySurveyKnn.fit(X_train, y_train)# Your code here

# Find the predicted classes for X_test


y_pred = skySurveyKnn.predict(X_test) # Your code here

# Calculate accuracy score


score = metrics.accuracy_score(y_test, y_pred)# Your code here

# Print accuracy score


print('Accuracy score is ', end="")
print('%.3f' % score)

# Print confusion matrix


print(metrics.confusion_matrix(y_test, y_pred))  # Your code here

Chapter 10
K-means clustering in Python.

# Import packages and functions


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.cluster import KMeans

# Load dataset
geyser = pd.read_csv('oldfaithful.csv')
geyser

# Visual exploration
p = sns.scatterplot(data=geyser, x='Eruption', y='Waiting')
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);

# Initialize a k-means model with k=2


kmModel = KMeans(n_clusters=2)

# Fit the model


kmModel = kmModel.fit(geyser)
# Save the cluster centroids
centroids = kmModel.cluster_centers_
centroids[1]

# Save the cluster assignments


clusters = kmModel.fit_predict(geyser[['Eruption', 'Waiting']])

# View the clusters for the first five instances


clusters[0:5]

# Plot clusters
p = sns.scatterplot(
data=geyser, x='Eruption', y='Waiting', hue=clusters, style=clusters
)
p.set_xlabel('Eruption time (min)', fontsize=14);
p.set_ylabel('Waiting time (min)', fontsize=14);

# Add centroid for cluster 0


plt.scatter(x=centroids[0, 0], y=centroids[0, 1], c='black')

# Add centroid for cluster 1


plt.scatter(x=centroids[1, 0], y=centroids[1, 1], c='black', marker='X')

# Fit k-means clustering with k=1,...,5 and save WCSS for each
WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    kmModel = KMeans(n_clusters=j)
    kmModel = kmModel.fit(geyser)
    WCSS.append(kmModel.inertia_)

# Plot the WCSS for each cluster


ax = plt.figure().gca()
plt.plot(k, WCSS, '*-')
plt.xlabel('Number of clusters (k)', fontsize=14);
plt.ylabel('Within-cluster sum of squares (WCSS)', fontsize=14);

K-means clustering using scikit-learn.

Researchers studying chemical properties of wines collected data on a sample of white
wines in northern Portugal. One of the research goals was to cluster wines based on
similar chemical properties.
 Fit the k-means clustering model to cluster wines based on alcohol concentration
(alcohol) and total sulfur dioxide (total_sulfur_dioxide).
The code provided initializes the model with n_clusters=5 and random_state=rng and
prints the cluster centers

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Read in the data


wines = pd.read_csv('whitewine.csv')

# Seed random number generator


rng = np.random.RandomState(43)

# Initialize k-means clustering model


kmeansModel = KMeans(n_clusters=5, random_state=rng)

# Your code goes here

Solution:


kmeansModel=kmeansModel.fit(wines[['alcohol','total_sulfur_dioxide']])

print(kmeansModel.cluster_centers_)

 Initialize a k-means clustering model with n_clusters=3 and random_state=rng.


 Fit the model to cluster wines based on free sulfur dioxide (free_sulfur_dioxide) and
density (density).
kmeansModel = KMeans(n_clusters=3, random_state=rng)
clusters = kmeansModel.fit_predict(wines[['free_sulfur_dioxide', 'density']])

Agglomerative clustering in Python.

# Import packages and functions


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage


from scipy.spatial.distance import squareform

# Load the dataset


cytochrome = pd.read_csv('cytochrome.csv', header=None, usecols=range(1, 14))
cytochrome

# Add labels for each species and save as a data frame


species = [
"Human",
"Monkey",
"Horse",
"Cow",
"Dog",
"Whale",
"Rabbit",
"Kangaroo",
"Chicken",
"Penguin",
"Duck",
"Turtle",
"Frog",
]

pd.DataFrame(data=cytochrome.to_numpy(), index=species, columns=species)

# Format the data as a distance matrix


# In this case, the data already represents distance between points (species)
differences = squareform(cytochrome)

# Define a clustering model with single linkage


clusterModel = linkage(differences, method='single')

# Create the dendrogram


dendrogram(clusterModel, labels=species, leaf_rotation=90)

# Plot the dendrogram


plt.ylabel('Amino acid differences', fontsize=14)
plt.yticks(np.arange(0, 11, step=1))
plt.xlabel('Species', fontsize=14)
plt.title('Single linkage clustering', fontsize=16)
plt.show()
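If flat cluster labels are needed from the single-linkage model above, scipy's fcluster() can cut the dendrogram at a chosen height; a minimal sketch (the threshold of 8 amino acid differences is only an illustration):

from scipy.cluster.hierarchy import fcluster

# Cut the tree at a height of 8 amino acid differences
flatClusters = fcluster(clusterModel, t=8, criterion='distance')
print(dict(zip(species, flatClusters)))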

Hierarchical clustering using scipy and scikit-learn.

import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

wine = pd.read_csv('wine1.csv')

# Calculate a distance matrix with selected variables


X = wine[['alcohol', 'fixed_acidity']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

# pdist() calculates pairs of distances between each instance in the dataset


dist = pdist(X)

# Your code goes here

print(clusterModel)

 Cluster wines with complete linkage.

clusterModel = linkage(dist, method='complete')

 Using pdist(), calculate a distance matrix for wines. The matrix of input features, X,
has already been created.
 Use the distance matrix to cluster the wines with centroid linkage.

dist = pdist(X)
clusterModel = linkage(dist, method='centroid')

DBSCAN in Python.

# Import packages and functions


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import DBSCAN


from numpy import where
from sklearn.preprocessing import StandardScaler

# Load the dataset


homes = pd.read_csv('homes.csv')
homes

# Create a smaller data frame with two variables: Price and Floor
homes_pf = homes[['Price', 'Floor']]
homes_pf.describe()

# Define a scaler to transform values


scaler = StandardScaler()

# Apply scaler and view result


homes_scaled = pd.DataFrame(scaler.fit_transform(homes_pf), columns=['Price', 'Floor'])
homes_scaled.describe()

# Initialize DBSCAN model


# Setting a large epsilon will cluster all "middle" values and detect outliers
dbscanModel = DBSCAN(eps=1, min_samples=12)

# Fit the model


dbscanModel = dbscanModel.fit(homes_scaled)

# Predict clusters
clusters = dbscanModel.fit_predict(homes_scaled)
clusters = pd.Categorical(clusters)
clusters

# Visualize scaled outliers


p = sns.scatterplot(data=homes_scaled, x='Floor', y='Price', hue=clusters)
p.set_xlabel('Scaled floor', fontsize=14);
p.set_ylabel('Scaled price', fontsize=14);

# Points where the prediction is -1 are considered outliers


outliers_scaled = homes_scaled[clusters == -1]
outliers_scaled

# Outliers on original scale (price and square footage in thousands)


outliers_unscaled = homes[clusters == -1]
outliers_unscaled

# Visualize outliers on original scale


p = sns.scatterplot(data=homes, x='Floor', y='Price', hue=clusters)
p.set_xlabel('Floors', fontsize=14);
p.set_ylabel('Price', fontsize=14);
Researchers studying chemical properties of wines collected data on a sample of white
wines in Northern Portugal. A research goal was to cluster wines based on similar
chemical properties.
 Fit the DBSCAN model to cluster wines.
The code provided creates a dataframe with two features (citric_acid and fixed_acidity),
normalizes the dataframe, initializes the DBSCAN model, and prints the cluster labels for
each point in the dataset.

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

wine = pd.read_csv('wine1.csv')

# Create an input matrix with selected features


X = wine[['citric_acid', 'fixed_acidity']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

# Cluster using DBSCAN with default options


dbscanModel = DBSCAN()

# Your code goes here

print(dbscanModel.labels_)

Fit the DBSCAN model to cluster wines


dbscanModel = dbscanModel.fit(wine)

 Use the DBSCAN clustering function to cluster wines. Keep eps and min_samples at
default values.
 Fit the DBSCAN model to cluster wines.
dbscanModel = DBSCAN()
dbscanModel = dbscanModel.fit(wine)

 Use the DBSCAN clustering function to cluster wines.


Set eps=0.75 and min_samples=3.
 Fit the DBSCAN model to cluster wines.

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
wine = pd.read_csv('wine1.csv')

# Create an input matrix with selected features


X = wine[['chlorides', 'density']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

dbscanModel = DBSCAN(eps=0.75, min_samples=3)


dbscanModel = dbscanModel.fit(X)
# Your code goes here

print(dbscanModel.labels_)

dbscanModel = DBSCAN(eps=0.75, min_samples=3)


dbscanModel = dbscanModel.fit(X)

Factor analysis in Python.

# Load the pandas package


import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

# Load the rock.csv dataset


rock = pd.read_csv('rock.csv')

# Display the correlation matrix using a heatmap


plt.figure(figsize=(4, 4))
sns.heatmap(rock.corr(), cmap="YlGnBu", annot=True)

# Create a scatter plot using perimeter and area


plt.figure(figsize=(4, 4))
plt.scatter(rock['Perimeter'], rock['Area'])
plt.xlabel('Perimeter', fontsize=14);
plt.ylabel('Area', fontsize=14);

# Create a scatter plot with a linear regression line


model = st.linregress(rock['Perimeter'], rock['Area'])
plt.figure(figsize=(4, 4))
plt.scatter(rock['Perimeter'], rock['Area'])
x = np.linspace(0, 5000, 10000)
y = model[0] * x + model[1]
plt.plot(x, y, '-r', linewidth=2.5)
plt.xlabel('Perimeter', fontsize=14);
plt.ylabel('Area', fontsize=14);

# Scale the data


scaler = StandardScaler()
rock = pd.DataFrame(
scaler.fit_transform(rock), columns=['Area', 'Perimeter', 'Shape', 'Permeability']
)

# Initialize and fit a PCA model on the rock data


pcaModel = PCA(n_components=4);
pcaModel.fit(rock);

# Display the explained variance (eigenvalues)


pcaModel.explained_variance_

# Show the factor loadings


pcaModel.components_.T * np.sqrt(pcaModel.explained_variance_)

# Create a scree plot


xint = range(0, 5)
plt.xticks(xint)
plt.plot([1, 2, 3, 4], pcaModel.explained_variance_, 'b*-')
plt.xlabel('Factors', fontsize='14');
plt.ylabel('Eigenvalues', fontsize='14');

Researchers studying chemical properties of wines collected data on a sample of white
wines in northern Portugal. Several chemical components in the wines were highly
correlated.
 Create a dataframe, X, that contains five features in the following
order: fixed_acidity, quality, total_sulfur_dioxide, volatile_acidity, and pH.
The code provided prints the correlation matrix for the features in X.
import pandas as pd
white_wine = pd.read_csv('white_wine.csv')

# Your code goes here

print(X.corr())

 Fit the principal components model to the dataframe X.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wines = pd.read_csv('wines.csv')

X = wines[['citric_acid', 'fixed_acidity', 'free_sulfur_dioxide', 'density']]

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X))

pcaModel = PCA(n_components=2)

# Your code goes here

print(pcaModel.explained_variance_ratio_)

 Fit the principal components model to the dataframe X.


 Use print() to calculate and display the factor loading matrix.
model = PCA(n_components=2)
model.fit(X)
print(model.components_.T * np.sqrt(model.explained_variance_))

Principal components with the travel ratings dataset.


# Import packages and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

reviews = pd.read_csv('tripadvisor_review.csv').dropna()
# Drop user ratings
X = reviews.drop(axis=1, labels='User ID')

# Standardize input features to mean=0 and sd=1


scaler = StandardScaler()
X = pd.DataFrame(
scaler.fit_transform(X),
columns=[
'Art',
'Clubs',
'Juice bars',
'Restaurants',
'Museums',
'Resorts',
'Parks',
'Beaches',
'Theaters',
'Religious',
],
)
X.describe().round(2)

# Plot correlation matrix for input features


plt.figure(figsize=(15, 10))
plt.rcParams.update({'font.size': 14})
sns.heatmap(X.corr().round(2), cmap="RdBu", annot=True, vmin=-1, vmax=1)
plt.show()

# Initialize and fit a PCA model on the travel ratings data


pcaModel = PCA(n_components=10);
pcaModel.fit(X);

# Display eigenvalues
pcaModel.explained_variance_.round(3)

# Calculate PC1 and PC2


pca = PCA(n_components=2)
pca_result = pca.fit_transform(X.values)
pca_result

# Add PC1 and PC2 to X and display updated correlations


X['PC1'] = pca_result[:, 0]
X['PC2'] = pca_result[:, 1]
plt.figure(figsize=(15, 10))
plt.rcParams.update({'font.size': 14})
sns.heatmap(X.corr().round(2), cmap="RdBu", annot=True, vmin=-1, vmax=1)
plt.show()

Clustering with the travel ratings dataset.

# Import packages and data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

reviews = pd.read_csv('tripadvisor_review.csv').dropna()

# seed for reproducibility


seed = 123

# Drop user ID from dataset


X = reviews.drop(axis=1, labels=['User ID'])
X

# Initialize a k-means model with k=4


kmModel = KMeans(n_clusters=4, random_state=seed, n_init=10)
kmModel = kmModel.fit(X)
clusters = kmModel.fit_predict(X)
centroids = kmModel.cluster_centers_
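As an optional sanity check (not part of the original example), the number of reviewers assigned to each cluster can be printed before examining the rating distributions:

print(pd.Series(clusters).value_counts().sort_index())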

# Show cluster ratings for juice bars


p = sns.kdeplot(data=X, x='Juice bars', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Juice bars', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for juice bars


X[['Juice bars']].groupby(by=clusters).describe().round(2)

# Show cluster ratings for resorts


p = sns.kdeplot(data=X, x='Resorts', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Resorts', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for resorts


X[['Resorts']].groupby(by=clusters).describe().round(2)

# Show cluster ratings for religious sites


p = sns.kdeplot(data=X, x='Religious', hue=clusters, palette='viridis', linewidth=2.5)
p.set_xlabel('Religious sites', fontsize=14)
p.set_ylabel('Density', fontsize=14)
plt.show()

# Describe cluster ratings for religious sites


X[['Religious']].groupby(by=clusters).describe().round(2)

Outlier detection with the travel ratings dataset.

# Import packages and data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN

reviews = pd.read_csv('tripadvisor_review.csv').dropna()

# Drop user ID
X = reviews.drop(axis=1, labels='User ID')

# Define DBSCAN model


dbscanModel = DBSCAN(eps=1, min_samples=20)

# Fit the model


dbscanModel = dbscanModel.fit(X)
clusters = dbscanModel.fit_predict(X)

# Subset of outliers
outliers = X[clusters == -1]
outliers.describe()

# Subset of non-outliers (points not labeled -1 by DBSCAN)
nonoutliers = X[clusters != -1]
nonoutliers.describe()
# Plot art gallery and club ratings
p = sns.scatterplot(
data=X, x='Art', y='Clubs', hue=clusters, style=clusters, palette='Paired_r'
)
p.set_xlabel('Art galleries', fontsize=14)
p.set_ylabel('Clubs', fontsize=14)
plt.legend(labels=['Non-outlier', 'Outlier'])
plt.show()

# Plot restaurant and beach ratings


p = sns.scatterplot(
data=X,
x='Restaurants',
y='Beaches',
hue=clusters,
style=clusters,
palette='Paired_r',
)
p.set_xlabel('Restaurants', fontsize=14)
p.set_ylabel('Beaches', fontsize=14)
plt.legend(labels=['Non-outlier', 'Outlier'])
plt.show()

LAB: Grouping mammal sleep habits using k-means clustering

The msleep dataset contains information on sleep habits for 83 mammals. Features
include total sleep, length of the sleep cycle, time spent awake, brain weight, and body
weight. Animals are also labeled with their name, genus, and conservation status.
 Load the dataset msleep.csv into a data frame.
 Create a new data frame X with sleep_total and sleep_cycle.
 Initialize a k-means clustering model with 4 clusters and random_state = 0.
 Fit the model to the data subset X.
 Find the centroids of the clusters in the model.
 Graph the clusters using the cluster numbers to specify colors.
 Find the within-cluster sum of squares for 1, 2, 3, 4, and 5 clusters.

from sklearn.cluster import KMeans


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Load the dataset
mammalSleep = pd.read_csv('msleep.csv') # Your code here

# Clean the data


mammalSleep = mammalSleep.dropna()

# Create a dataframe with the columns sleep_total and sleep_cycle


X = mammalSleep[['sleep_total', 'sleep_cycle']] # Your code here

# Initialize a k-means clustering model with 4 clusters and random_state = 0


km = KMeans(n_clusters=4, random_state=0) # Your code here

# Fit the model


km.fit(X) # Your code here

# Find the centroids of the clusters


mammalSleepCentroids = km.cluster_centers_ # Your code here
print(mammalSleepCentroids)

# Predict the cluster for each data point in mammal_sleep


mammalSleep['cluster'] = km.predict(X) # Your code here

plt.figure(figsize=(6, 6))

# Graph the clusters


# Your code here
sns.scatterplot(data=mammalSleep, x='sleep_total', y='sleep_cycle', hue='cluster',
palette='Set2')
plt.xlabel('Total sleep', fontsize=14)
plt.ylabel('Length of sleep cycle',fontsize=14)
plt.savefig('msleep_clusters.png')

WCSS = []
k = [1, 2, 3, 4, 5]
for j in k:
    km = KMeans(n_clusters=j)
    mammalSleepKmWCSS = km.fit(X)
    intermediateWCSS = km.inertia_  # find the within-cluster sum of squares
    WCSS.append(round(intermediateWCSS, 1))

print(WCSS)
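An optional follow-up (not required by the lab): plotting WCSS against the number of clusters makes the elbow easier to spot.

plt.figure(figsize=(6, 4))
plt.plot(k, WCSS, 'b*-')
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('WCSS', fontsize=14)
plt.show()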

Analyzing factors in forest fire data using PCA


The forestfires dataset contains meteorological information and the area burned for 517
forest fires that occurred in Montesinho Natural Park in Portugal. The columns of interest
are FFMC, DMC, DC, ISI, temp, RH, wind, and rain.
 Read in the file forestfires.csv.
 Create a new data frame X from the columns FFMC, DMC, DC, ISI, temp, RH, wind,
and rain, in that order.
 Calculate the correlation matrix for the data in X.
 Scale the data.
 Use sklearn's PCA function to perform four-component factor analysis on the scaled
data.
 Print the factors and the explained variance.
# Import the necessary modules
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

# Read in forestfires.csv
fires = pd.read_csv('forestfires.csv') # Your code here

# Create a new data frame with the columns FFMC, DMC, DC, ISI, temp, RH, wind, and rain, in that order
X = fires[['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']] # Your code here

# Calculate the correlation matrix for the data in the data frame X
XCorr = X.corr() # Your code here
print(XCorr)

# Scale the data.
scaler = StandardScaler() # Your code here
firesScaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']
)
# Your code here

# Perform four-component factor analysis on the scaled data.
pca = PCA(n_components=4)
firesPCA = pca.fit_transform(firesScaled)
# Your code here

# Print the factors and the explained variance.
print("Factors: ", pca.components_) # Your code here

print("Explained variance: ", pca.explained_variance_) # Your code here
Chapter 11
Building a regression tree using scikit-learn.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Seed the random number generator


rng = np.random.RandomState(39)

# Read in the data


raptorExample = pd.read_csv('raptorExample.csv')

# Encode sex as a dummy variable


raptorExampleWithDummy = pd.get_dummies(raptorExample, drop_first=True)

# Assign outcome to y and features to X


y = raptorExampleWithDummy['Wing']
X = raptorExampleWithDummy.drop('Wing', axis=1)

# Define model
raptorRT = DecisionTreeRegressor(max_depth=2, min_samples_leaf=3,
random_state=rng)

# Fit the model


# Your code goes here
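# A possible completion (sketch): fit the regression tree to the features and outcome
raptorRT = raptorRT.fit(X, y)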

# Print regression tree


print(export_text(raptorRT, feature_names=X.columns.to_list()))

Classification trees in Python.

# Import packages and functions


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn import metrics, tree

# Load the penguins data


penguins = pd.read_csv('palmer_penguins.csv')
# Drop penguins with missing values
penguins = penguins.dropna()

# Calculate summary statistics using .describe()


penguins.describe(include='all')

# Save output features as y


y = penguins[['species']]

# Save input features as X


X = penguins[['flipper_length_mm', 'bill_length_mm']]

# Initialize the model


classtreeModel = DecisionTreeClassifier(max_depth=2)

# Fit the model


classtreeModel = classtreeModel.fit(X, y)

# Print tree as text


print(export_text(classtreeModel, feature_names=X.columns.to_list()))

# Resize the plotting window


plt.figure(figsize=[12, 8])

# Values in brackets represent classes in alphabetical order


# [Adelie, Chinstrap, Gentoo]
p = tree.plot_tree(classtreeModel, feature_names=X.columns, filled=False, fontsize=10)

# Calculate cross-entropy and error rate

print("Cross-entropy: ", metrics.log_loss(y, classtreeModel.predict_proba(X)))


print("Error rate: ", 1 - metrics.accuracy_score(y, classtreeModel.predict(X)))

# Calculate the confusion matrix


metrics.confusion_matrix(y, classtreeModel.predict(X))

# Plot the confusion matrix


metrics.ConfusionMatrixDisplay.from_predictions(y, classtreeModel.predict(X))

# Calculate the Gini index


probs = pd.DataFrame(data=classtreeModel.predict_proba(X))

print("Gini index: ", (probs * (1 - probs)).mean().sum())


11.3.2: Building a classification tree using scikit-learn.

The dataset contains age and body measurements for a sample of hawks observed near
Iowa City, Iowa.
 Initialize the model using the DecisionTreeClassifier() type of classification tree
with min_samples_split of 3 and the random number generator random_state set
to rng.
The code contains all imports, loads the dataset, fits the model, and prints the tree.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# Seed random number generator


rng = np.random.RandomState(35)

# Load the dataset


raptor = pd.read_csv('raptor_Example.csv')

# Assign outcome to y and features to X


y = raptor['Age']
X = raptor.drop('Age', axis=1)

# Initialize the model -- decision tree classifier

# Your code goes here


raptorCT =

raptorCT = DecisionTreeClassifier(min_samples_split=3, random_state=rng)

# Fit the model


raptorCT.fit(X,y)

# Print classification tree


print(export_text(raptorCT, feature_names=X.columns.to_list()))

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
# Seed random number generator
rng = np.random.RandomState(49)

# Load the dataset


birdOfPrey = pd.read_csv('birdOfPrey_Example.csv')

# Assign outcome to y and features to X


y = birdOfPrey['Age']
X = birdOfPrey.drop('Age', axis=1)

# Initialize the model -- decision tree classifier


birdOfPreyCT = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
min_samples_leaf=1, random_state=rng)

# Fit the model

# Your code goes here
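# A possible completion (sketch): fit the classification tree to the features and outcome
birdOfPreyCT = birdOfPreyCT.fit(X, y)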

# Print classification tree
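# A possible completion (sketch): print the fitted tree as text
print(export_text(birdOfPreyCT, feature_names=X.columns.to_list()))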

Classification random forests in Python.


# Import packages and functions
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics, tree


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the penguins data


penguins = pd.read_csv('palmer_penguins.csv')

# Drop penguins with missing values


penguins = penguins.dropna()

# Calculate summary statistics using .describe()


penguins.describe(include='all')

# y = output features
y = penguins['species']

# X = input features
X = penguins.drop('species', axis=1)

# Convert categorical inputs like species and island into dummy variables
X = pd.get_dummies(X, drop_first=True)

# Create a training/testing split


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=8675309
)

# Initialize the random forest model


rfModel = RandomForestClassifier(max_depth=2, max_features='sqrt',
random_state=99);

# Fit the random forest model on the training data


rfModel.fit(X_train, y_train);

pd.DataFrame(
data={
'feature': rfModel.feature_names_in_,
'importance': rfModel.feature_importances_,
}
).sort_values('importance', ascending=False)

# Predict species on the testing data


y_pred = rfModel.predict(X_test)

# Calculate a confusion matrix


metrics.confusion_matrix(y_test, y_pred)

# Plot the confusion matrix


metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

# Calculate the Gini index


probs = pd.DataFrame(data=rfModel.predict_proba(X_test))
print("Gini index ", (probs * (1 - probs)).mean().sum())

# Save the first random forest tree as singleTree


singleTree = rfModel.estimators_[0]

# Set image size


plt.figure(figsize=[15, 8])
# Plot a single classification tree
tree.plot_tree(singleTree, feature_names=X.columns, filled=False, fontsize=10);

Building random forest classification trees using scikit-learn.

# Import packages and functions


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Seed random number generator


rng = np.random.RandomState(29)

# Load the dataset


birdOfPrey = pd.read_csv('birdOfPrey_Example.csv')

# Assign outcome to y and features to X


y = birdOfPrey['Species']
X = birdOfPrey.drop('Species', axis=1)

# Split dataset into training data and testing data


XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=.3, random_state=rng)

# Initialize the model -- random forest classification trees


birdOfPreyRFC = # Your code goes here

birdOfPreyRFC = RandomForestClassifier(n_estimators=74, criterion='gini',
                                       max_features='sqrt', bootstrap=True, random_state=rng) # Your code goes here

# Fit the model with training data


birdOfPreyRFC = birdOfPreyRFC.fit(XTrain, yTrain)

# Print first and last random trees generated in the forest


print('First tree:')
print(export_text(birdOfPreyRFC[0], feature_names=X.columns.to_list()))
print('Last tree:')
print(export_text(birdOfPreyRFC[74-1], feature_names=X.columns.to_list()))

LAB: Creating a regression tree using mpg data


The dataset mpg contains information on miles per gallon (mpg) and engine size for cars
sold from 1970 through 1982. The dataset has the
features mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, and name.
 Load the mpg.csv dataset.
 Create a dataframe, X, using weight and model_year as features.
 Create a dataframe, y, using mpg.
 Initialize a regression tree with random_state = 100 that has depth 3 and a
minimum number of samples in each leaf of 5.
 Fit the regression tree on X.

# Load the necessary libraries


import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Load the mpg dataset


mpg = # Your code here

# Subset the data containing weight and model_year


X = # Your code here

# Subset the data containing mpg


y = # Your code here

# Initialize a regression tree with random_state = 100


# that has depth 3 and a minimum number of samples in each leaf of 5
mpgRT = # Your code here

# Fit the X and y data


# Your code here

# Print regression tree


print("max_depth = %s, %s"% (mpgRT.max_depth, mpgRT.random_state))
# Print tree
print(export_text(mpgRT, feature_names=X.columns.to_list()))
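A possible set of completions for the placeholders above (a sketch; it assumes mpg.csv uses the column names listed in the lab description):

mpg = pd.read_csv('mpg.csv')
X = mpg[['weight', 'model_year']]
y = mpg[['mpg']]
mpgRT = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=100)
mpgRT.fit(X, y)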

Chapter 12

Perceptron models in Python.

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Loads haberman.csv
haberman = pd.read_csv('haberman.csv')

# Slices the features of the dataset


X = haberman[['Age', 'Year', 'Nodes']]
y = haberman[["Survived"]]

# Scales the features


scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=['Age', 'Year', 'Nodes'])

# Splits the data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=123
)

# Initializes and fits a perceptron model


clf = Perceptron(tol=0.00001, eta0=0.1, max_iter=20000);
clf.fit(X_train, np.ravel(y_train));

# Creates a list of predictions from the test features


y_pred = clf.predict(X_test)

# Finds the accuracy score


accuracy_score(y_pred, y_test)

# Displays a heatmap for the confusion matrix


sns.heatmap(confusion_matrix(y_pred, y_test), annot=True)

Single-layer perceptron using scikit-learn.


# Import packages and functions
import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the dataset
heart = pd.read_csv('heart.csv')

# Slices the features of the dataset


X = heart[['trestbps', 'age', 'thalach']]
y = heart[['target']]

# Scales the features


scaler = StandardScaler()
XScaled = pd.DataFrame(scaler.fit_transform(X), columns=['trestbps','age','thalach'])

# Splits the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.2,
random_state=123)

# Initializes and fits a perceptron model


pModel = # Your code goes here
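# A possible completion (sketch). The exercise text above does not list specific
# hyperparameters, so these values mirror the earlier haberman example and are assumptions.
pModel = Perceptron(tol=0.00001, eta0=0.1, max_iter=20000)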
pModel.fit(XTrain, np.ravel(yTrain))

print(pModel.coef_)
print(pModel.intercept_)

Multilayer perceptron models in Python.

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

homes = pd.read_csv('homes.csv')

# Loads input and output features


X = homes[['Bed', 'Floor']]
y = homes[['Price']]

# Splits the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), random_state=123)
# Initializes and trains a multilayer perceptron regressor model on the training set
# This cell takes a long time to run.
mlpReg_train = MLPRegressor(
random_state=1, max_iter=500000, hidden_layer_sizes=[1]
).fit(X_train, np.ravel(y_train))

# Predicts the price of a 5 bedroom house with 2,896 sq ft


mlpReg_train.predict([[5, 2.896]])

# Plots the loss curves for the training sets


f, ax = plt.subplots(1, 1)
sns.lineplot(
x=range(len(mlpReg_train.loss_curve_)), y=mlpReg_train.loss_curve_, label='Training'
)
ax.set_xlabel('Epochs', fontsize=14);
ax.set_ylabel('Loss', fontsize=14);

# Compare the final loss between train and test sets


print(mlpReg_train.loss_)
print(
    mean_squared_error(y_test, mlpReg_train.predict(X_test)) / 2
)  # divided by 2 because MLPRegressor's training loss is half the mean squared error

# Obtains the final weights and biases


print(mlpReg_train.coefs_)
print(mlpReg_train.intercepts_)

Multilayer perceptron using scikit-learn.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Seed random number generator


rng = np.random.RandomState(26)

# Loads the cabsNY.csv dataset


cabsNY = pd.read_csv('cabsNY.csv')

# Loads predictor and target variables


X = cabsNY[['fare','toll']].to_numpy() # converted to numpy type array
y = cabsNY[['distance']]
# Splits the data into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, np.ravel(y),random_state=rng)

# Initializes and trains a multilayer perceptron regressor model on the training and
validation sets

multLayerPercModelTrain = # Your code goes here


multLayerPercModelValidation = # Your code goes here
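# A possible completion (sketch). The hidden layer size and iteration limit are assumptions
# (they mirror the earlier homes example), and fitting the 'validation' model on the test
# split is only one reading of the prompt.
multLayerPercModelTrain = MLPRegressor(random_state=rng, max_iter=500000, hidden_layer_sizes=[1]).fit(XTrain, yTrain)
multLayerPercModelValidation = MLPRegressor(random_state=rng, max_iter=500000, hidden_layer_sizes=[1]).fit(XTest, yTest)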

# Predicts the distance of a taxi ride with a specific fare and toll cost
print(multLayerPercModelTrain.predict([[4, 7]]))

# Prints the final weights, biases, and losses


weights = multLayerPercModelTrain.coefs_
biases = multLayerPercModelTrain.intercepts_
loss = multLayerPercModelTrain.loss_
print('{}\n{}\n{}'.format(weights, biases, loss))

LAB: Single-layer perceptron


The nbaallelo_log file contains data on 126314 NBA games from 1947 to 2015. The
dataset includes the features pts, elo_i, win_equiv, and game_result. Using the csv
file nbaallelo_log.csv and sklearn's Perceptron function, construct a perceptron model to
classify whether a team will win or lose a game based on the
features pts, elo_i, win_equiv. Complete the program with the following tasks:
 Scale the features in X and y.
 Use the Perceptron function to initialize and fit a perceptron model with a learning
rate of 0.05 and 20000 epochs.
 Print the weights for the input variables and bias term.
 Find the accuracy score.
Note: The program reads the csv file's name from the user.

import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Load input into a dataframe


NBA = pd.read_csv(input())
# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1

# Store relevant columns as variables


X = NBA[['pts','elo_i','win_equiv']]
y = NBA[['game_result']].astype(int)

# Scale the input features


scaler = StandardScaler()
XScaled = # Your code here

np.random.seed(42)

# Split the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.3,
random_state=123)

# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = # Your code here
# Fit the perceptron model
# Your code here

# Create a list of predictions from the test features


yPred = # Your code here

# Find the weights for the input variables


weightVar = # Your code here
print(weightVar)

# Find the weights for the bias term


weightBias = # Your code here
print(weightBias)

# Find the accuracy score


score = # Your code here
print('%.3f' % score)

import pandas as pd
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Load input into a dataframe


NBA = pd.read_csv(input())

# Encode the game_result variable as a numeric variable with 0 for L and 1 for W
NBA.loc[NBA['game_result']=='L','game_result']=0
NBA.loc[NBA['game_result']=='W','game_result']=1

# Store relevant columns as variables


X = NBA[['pts','elo_i','win_equiv']]
y = NBA[['game_result']].astype(int)

# Scale the input features


scaler = StandardScaler()
XScaled = pd.DataFrame(scaler.fit_transform(X), columns=['pts', 'elo_i', 'win_equiv'])
yScaled = pd.DataFrame(scaler.fit_transform(y), columns=['game_result']) # Your code here

np.random.seed(42)

# Split the data into train and test sets


XTrain, XTest, yTrain, yTest = train_test_split(XScaled, y, test_size=0.3,
random_state=123)

# Initialize a perceptron model with a learning rate of 0.05 and 20000 epochs
classifyNBA = Perceptron(eta0=0.05, max_iter=20000) # Your code here

# Fit the perceptron model
classifyNBA.fit(XTrain, np.ravel(yTrain)) # Your code here

# Create a list of predictions from the test features


yPred = classifyNBA.predict(XTest) # Your code here

# Find the weights for the input variables


weightVar = classifyNBA.coef_ # Your code here
print(weightVar)

# Find the weights for the bias term


weightBias = classifyNBA.intercept_ # Your code here
print(weightBias)

# Find the accuracy score
score = accuracy_score(yPred, yTest) # Your code here
print('%.3f' % score)
