DS-05 Introduction To Machine Learning
Agenda
• Introduction to ML
– Types of ML
– Using ML to make predictions
• Regression
– Defining the problem
– Linear regression models
– Regression model evaluation
• Classification
– Logistic regression
– Classification metrics
Paradigm shift
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
Understanding how machines learn
What is machine learning?
• Unsupervised learning
– Labels not provided
– Training data does not include desired outputs
– Making sense of data
– Understanding the past
– Learning the structure of data
• Reinforcement learning
– Rewards and/or punishments from sequence of actions
• Semi-supervised learning
Algorithms
Machine Learning Capabilities
• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
Optimization
• Combinatorial optimization
– E.g.: Greedy search
• Convex optimization
– E.g.: Gradient descent
• Constrained optimization
– E.g.: Linear programming
ML in Practice
• When the relationship between the inputs and the response is
complicated, models such as logistic regression can be limited, and we
need to use more complicated models.
Predicting a Variable
• Let's imagine a scenario where we'd like to predict one variable using another
(or a set of other) variables.
• Thus, we'd like to define two categories of variables:
• variables whose value we want to predict
• variables whose values we use to make our prediction
• Examples:
• Predicting the number of views a YouTube video will get next week based
on video length, the date it was posted, previous number of views, etc.
• Predicting which movies a Netflix user will rate highly, based on their
previous movie ratings, demographic data etc.
Translating Between Statistics and Machine Learning
other terms:
residual, loss (statistics) = error (machine learning)
Data
The Advertising data set consists of the sales of a product in 200
different markets, along with the advertising budgets for the product in
each of those markets for three different media: TV, radio, and
newspaper. Everything is given in units of $1,000.
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with
applications in R" (Springer, 2013)
Response vs. Predictor Variables
X (p predictors): predictors, features, covariates
Y: outcome, response variable, dependent variable
Response vs. Predictor Variables
X = (X₁, …, Xₚ)
Xⱼ = (x₁ⱼ, …, xᵢⱼ, …, xₙⱼ)   Y = (y₁, …, yₙ)
X (p predictors): predictors, features, covariates
Y: outcome, response variable, dependent variable
True vs. Statistical Model
• We will assume that the response variable, Y, relates to the
predictors, X, through some unknown function expressed generally as:
Y = f(X) + ε
Statistical Model
How do we find f̂(x)?
What is the value of y at this x?
Statistical Model
How do we find f̂(x)?
or this one?
Statistical Model
A simple idea is to take the mean of all y's: f̂(x) = (1/n) Σᵢ yᵢ
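This mean-of-all-y's baseline can be sketched in a few lines (the toy observations are made up for illustration, not taken from the Advertising data):

```python
import numpy as np

# Hypothetical toy observations of the response variable.
y_train = np.array([7.0, 9.0, 12.0, 14.0, 16.0])

def f_hat_mean(x_new):
    """Simplest possible model: ignore x entirely and predict the mean
    of all observed y's, f̂(x) = (1/n) Σ yᵢ."""
    return float(np.mean(y_train))
```

Whatever value of x we ask about, the prediction is the same — which is exactly why this model is only a starting point.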
Prediction vs. Estimation
• When we use a set of measurements, (xᵢ,₁, …, xᵢ,ₚ), to predict a value for the response
variable, we denote the predicted value by:
ŷᵢ = f̂(xᵢ,₁, …, xᵢ,ₚ)
Simple Prediction Model
Simple Prediction Model
Do the same for “all” 𝑥′𝑠
Extend the Prediction Model
Predict ŷ as the average of the k nearest neighbors' responses: ŷ = (1/k) Σᵢ∈N(x) yᵢ
Simple Prediction Models
Simple Prediction Models
We can try different k models on more data.
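The k-nearest-neighbors prediction rule above can be sketched as follows (a minimal 1-D version with made-up toy data):

```python
import numpy as np

def knn_predict(x_new, x_train, y_train, k=3):
    """1-D k-NN regression: average the y's of the k training points
    whose x is closest to x_new."""
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    return float(np.mean(y_train[nearest]))

# Hypothetical toy data for illustration.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
```

Trying different values of k gives the different models compared on these slides: small k follows the data closely, large k smooths toward the overall mean.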
Error Evaluation
Start with some data.
Error Evaluation
Hide some of the data from the model. This is called train-test split.
We use the train set to estimate ŷ, and the test set to evaluate the model.
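A train-test split like the one described can be sketched as follows (a hand-rolled version for illustration; in practice a library routine such as scikit-learn's would typically be used):

```python
import numpy as np

def train_test_split(x, y, test_frac=0.25, seed=0):
    """Randomly hide a fraction of the data; the held-out (test) part is
    used only to evaluate the model, never to fit it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # shuffle the indices
    n_test = int(len(x) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(8.0)
y = 2.0 * x
x_tr, x_te, y_tr, y_te = train_test_split(x, y)
```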
Error Evaluation
Estimate ŷ for k=1.
Error Evaluation
Now, we look at the data we have not used, the test data (red crosses).
Error Evaluation
Calculate the residuals (yᵢ − ŷᵢ).
Error Evaluation
Do the same for k=3.
Error Evaluation
• In order to quantify how well a model performs, we define a loss or error function.
• A common loss function for quantitative outcomes is the Mean Squared Error (MSE):
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• The quantity yᵢ − ŷᵢ is called a residual and measures the error at the i-th
prediction.
• Note: the square root of the mean of the squared errors (RMSE) is also
commonly used:
RMSE = √MSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]
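The MSE and RMSE formulas translate directly into code (a minimal sketch):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error: the average of the squared residuals (yᵢ - ŷᵢ)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root MSE: same units as the response, so it is easier to interpret."""
    return float(np.sqrt(mse(y, y_hat)))
```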
Model Comparison
Do the same for all k’s and compare the RMSEs. k=3 seems to be the best model.
Model fitness
For a subset of the data, calculate the RMSE for k=3. Is RMSE=5.0 good enough?
Model fitness
What if we measure the Sales in cents instead of dollars?
RMSE is now 5004.93. Is that good?
Model fitness
It is better if we compare it to something.
R-squared - The Coefficient of Determination
R² = 1 − Σᵢ (ŷᵢ − yᵢ)² / Σᵢ (ȳ − yᵢ)²
• If our model is as good as the mean value, ȳ, then R² = 0.
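The R² formula can be sketched as:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    A model that always predicts the mean of y gets R² = 0;
    a perfect model gets R² = 1."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y_hat - y) ** 2)
    ss_tot = np.sum((np.mean(y) - y) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Because R² is a ratio, it is unchanged if sales are measured in cents rather than dollars — which is exactly the problem with raw RMSE noted above.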
Linear Models
• Note that in building our k-NN model for prediction, we did not
compute a closed form for f̂.
• If we assume a linear form for f:
Y = f(X) + ε = β₁X + β₀ + ε
• … then it follows that our estimate is:
Ŷ = f̂(X) = β̂₁X + β̂₀
• where β̂₁ and β̂₀ are estimates of β₁ and β₀, respectively, that we compute using
observations.
Estimate of the regression coefficients (cont)
Is this line good?
50
Estimate of the regression coefficients (cont)
Maybe this one?
Estimate of the regression coefficients (cont)
Or this one?
Estimate of the regression coefficients (cont)
Question: Which line is the best?
First, calculate the residuals.
Estimate of the regression coefficients (cont)
• We choose β̂₀ and β̂₁ in order to minimize the predictive errors made by our
model, i.e., to minimize our loss function.
• Then the optimal values for β̂₀ and β̂₁ should be:
β̂₀, β̂₁ = argmin_{β₀,β₁} L(β₀, β₁)
Estimate of the regression coefficients: brute force
• One way to estimate argmin_{β₀,β₁} L is to calculate the loss function for every
possible β₀ and β₁, and then select the β₀ and β₁ where the loss function is
minimum.
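The brute-force search can be sketched as follows (the toy data and the grid ranges are made up for illustration):

```python
import numpy as np

def brute_force_fit(x, y, b0_grid, b1_grid):
    """Evaluate the MSE loss at every (beta0, beta1) pair on a grid and
    keep the pair with the smallest loss."""
    best_b0, best_b1, best_loss = None, None, np.inf
    for b0 in b0_grid:
        for b1 in b1_grid:
            loss = np.mean((y - (b0 + b1 * x)) ** 2)
            if loss < best_loss:
                best_b0, best_b1, best_loss = b0, b1, loss
    return best_b0, best_b1, best_loss

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x          # true beta0 = 1, beta1 = 2
b0, b1, loss = brute_force_fit(x, y, np.linspace(0, 2, 21), np.linspace(0, 3, 31))
```

This is only feasible for a couple of parameters and a coarse grid; it motivates the exact and gradient-descent methods that follow.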
Estimate of the regression coefficients: exact method
Take the partial derivatives of L with respect to β₀ and β₁, set them to zero, and
solve. This procedure gives us explicit formulae for β̂₀ and β̂₁:
β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁ x̄
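The closed-form formulae translate directly into code (toy data made up for illustration):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares estimates:
    b1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²,  b0 = ȳ - b1·x̄."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return float(b0), float(b1)

b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [5.0, 8.0, 11.0, 14.0])
```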
Proof:
L(β₀, β₁) = (1/n) Σᵢ [yᵢ − (β₀ + β₁xᵢ)]²

Set both partial derivatives to zero: dL/dβ₀ = 0 and dL/dβ₁ = 0.

From dL/dβ₀ = 0:
−(2/n) Σᵢ (yᵢ − β₀ − β₁xᵢ) = 0
⟹ (1/n) Σᵢ yᵢ − β₀ − β₁ (1/n) Σᵢ xᵢ = 0
⟹ β₀ = ȳ − β₁x̄

From dL/dβ₁ = 0:
−(2/n) Σᵢ (yᵢ − β₀ − β₁xᵢ) xᵢ = 0
⟹ −Σᵢ xᵢyᵢ + β₀ Σᵢ xᵢ + β₁ Σᵢ xᵢ² = 0
⟹ −Σᵢ xᵢyᵢ + (ȳ − β₁x̄) Σᵢ xᵢ + β₁ Σᵢ xᵢ² = 0
⟹ β₁ (Σᵢ xᵢ² − nx̄²) = Σᵢ xᵢyᵢ − nx̄ȳ
⟹ β₁ = (Σᵢ xᵢyᵢ − nx̄ȳ) / (Σᵢ xᵢ² − nx̄²)
⟹ β₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
Estimate of the regression coefficients: gradient descent
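This slide's figure is not reproduced here; a minimal gradient-descent sketch for the two regression coefficients, on made-up toy data, might look like:

```python
import numpy as np

def gd_fit(x, y, lr=0.05, n_iter=5000):
    """Fit y ≈ b0 + b1·x by gradient descent on the MSE loss.
    Gradients: dL/db0 = -(2/n) Σ rᵢ,  dL/db1 = -(2/n) Σ rᵢ·xᵢ,
    where rᵢ = yᵢ - (b0 + b1·xᵢ)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        r = y - (b0 + b1 * x)          # current residuals
        b0 += lr * (2.0 / n) * np.sum(r)
        b1 += lr * (2.0 / n) * np.sum(r * x)
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
b0, b1 = gd_fit(x, y)
```

The learning rate and iteration count here are arbitrary choices that happen to converge for this toy problem; in practice they must be tuned.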
Interpretation of Predictors
• If we increase the TV budget by $1,000, what would you expect the increase in sales to be?
• The interpretation of the predictors depends on the estimated values, but decisions depend on
how much we trust these values.
Confidence intervals for the predictors estimates
Confidence intervals for the predictors estimates (cont)
• But due to error, every time we measure the response Y for a fixed value of X,
we will obtain a different observation.
• We have 3 measurements for several different values of X; each set is a “realization” in the
picture.
Confidence intervals for the predictors estimates (cont)
• For each one of those “realizations”, we could fit a model and estimate β̂₀ and β̂₁.
Bootstrapping for Estimating Sampling Error
Definition
• Bootstrapping is the practice of estimating properties of an
estimator by measuring those properties on samples drawn from the
observed data.
• Bootstrap samples are drawn from the data with replacement.
• The region containing 95% of the bootstrap estimates is a 95% confidence interval (CI).
Confidence intervals for the predictors estimates (cont)
We sample multiple times and calculate β̂₀ and β̂₁.
Confidence intervals for the predictors estimates (cont)
Another sample
Confidence intervals for the predictors estimates (cont)
Another sample
Confidence intervals for the predictors estimates (cont)
And another sample
Confidence intervals for the predictors estimates (cont)
Repeat this for 100 times
Confidence intervals for the predictors estimates (cont)
We can now estimate the mean and standard deviation of all the estimates β̂₀ and β̂₁.
The standard deviations of β̂₀ and β̂₁ are also called their standard errors, SE(β̂₀) and SE(β̂₁).
Confidence intervals for the predictors estimates (cont)
Finally, we can calculate the confidence intervals: ranges of values such that
the true value of β₁ is contained in the interval with a given probability
(e.g., 95% or 68%).
Confidence intervals for the predictors estimates: Standard Errors
• If for each bootstrapped sample b the estimated betas are β̂₀,ᵦ and β̂₁,ᵦ, then:
SE(β̂₀) = √var(β̂₀)
SE(β̂₁) = √var(β̂₁)
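The bootstrap standard errors can be sketched as follows (noiseless toy data is used here so that the check is deterministic; real data would give nonzero SEs):

```python
import numpy as np

def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Refit the regression on bootstrap samples (drawn with replacement)
    and report the standard deviation of the estimates: SE(b0), SE(b1)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    b0s, b1s = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample indices with replacement
        xb, yb = x[idx], y[idx]
        sxx = np.sum((xb - xb.mean()) ** 2)
        if sxx == 0:                          # degenerate resample (all x equal); skip
            continue
        b1 = np.sum((xb - xb.mean()) * (yb - yb.mean())) / sxx
        b0s.append(yb.mean() - b1 * xb.mean())
        b1s.append(b1)
    return float(np.std(b0s)), float(np.std(b1s))

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x          # noiseless line: every refit recovers the same betas
se_b0, se_b1 = bootstrap_se(x, y)
```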
Confidence intervals for the predictors estimates: Standard Errors
• Alternatively:
• If we know the variance σ² of the noise ε, we can compute SE(β̂₀) and SE(β̂₁)
analytically using the formulae below (no need to bootstrap).
Standard Errors
• Since σ² is usually unknown, we can estimate it empirically from the data and our regression line.
Remember:
yᵢ = f(xᵢ) + εᵢ ⟹ εᵢ = yᵢ − f(xᵢ)
Standard Errors
SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )
SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )
σ ≈ √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )

• More data: n ↑ and Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x) or Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ ↓ ⟹ SE ↓
• Better model: (f̂ − yᵢ) ↓ ⟹ σ ↓ ⟹ SE ↓
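These formulae can be sketched in code (a perfect-fit toy example is used, so both standard errors come out zero):

```python
import numpy as np

def analytic_se(x, y, y_hat):
    """Analytic SEs for simple linear regression; sigma is estimated from
    the residuals with n - 2 degrees of freedom."""
    x, y, y_hat = (np.asarray(a, float) for a in (x, y, y_hat))
    n = len(x)
    sigma = np.sqrt(np.sum((y_hat - y) ** 2) / (n - 2))
    sxx = np.sum((x - x.mean()) ** 2)
    se_b0 = sigma * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    se_b1 = sigma / np.sqrt(sxx)
    return float(se_b0), float(se_b1)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
se_b0, se_b1 = analytic_se(x, y, y)   # perfect fit: zero residuals
```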
Classification
• Up to this point, the methods we have seen have centered around modeling and
predicting a quantitative response variable (e.g., number of taxi pickups,
number of bike rentals, etc.). Linear regression performs well in these
situations.
• When the response variable is categorical, then the problem is no longer called a
regression problem, but instead is called a classification problem.
• The goal is to attempt to classify each observation into a category (also known as a
class or cluster), defined by Y, based on a set of predictor variables X.
Example
Why not Linear Regression?
Logistic Regression
• Rather than modelling this response Y directly, logistic regression
models the probability that Y belongs to a particular category
• For the Default data, logistic regression models the probability of
default.
– The probability of default given balance can be written as
Pr(default = Yes|balance), which we abbreviate as p(balance)
– The values p(balance), will range between 0 and 1.
– Then for any given value of balance, a prediction can be made for default.
– For example, one might predict default = Yes for any individual for whom
p(balance) > 0.5.
– Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then they may choose to use a lower
threshold, such as p(balance) > 0.1.
The Logistic Model
The Logistic Model
• The quantity p(X)/[1 − p(X)] is called the odds, and can take on any
value between 0 and ∞.
• Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of default, respectively
• For example,
– on average 1 in 5 people with an odds of 1/4 will default, since p(X)=0.2
implies an odds of 0.2/(1−0.2) = 1/4.
– Likewise, on average nine out of every ten people with an odds of 9 will
default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9.
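The odds/probability conversions used in these examples are simple to express in code (function names are illustrative):

```python
def odds(p):
    """Odds for probability p: p / (1 - p); ranges over (0, ∞)."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Invert: the probability corresponding to odds o."""
    return o / (1.0 + o)
```

For instance, odds(0.2) gives 1/4 and odds(0.9) gives 9, matching the default examples above.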
The Logistic Model
• By taking the logarithm of both sides, we arrive at the log-odds, or logit:
log[ p(X) / (1 − p(X)) ] = β₀ + β₁X
Making Predictions
• For an individual with a low balance, the predicted probability of
default is below 1%.
• In contrast, the predicted probability of default for an individual with a balance
of $2,000 is much higher, and equals 0.586, or 58.6%.
Multiple Logistic Regression
Important measures for classification and diagnostic testing
• Accuracy
• Precision
• Recall
• Sensitivity
• Specificity
Basic quantities for performance evaluation
Two of the most important performance measures
• Precision: by this we mean the percentage of true positives, NTP, among all
examples that the classifier has labeled as positive, NTP + NFP. It is
obtained by the following formula:
Precision = NTP / (NTP + NFP)
Why not use “accuracy” directly
Accuracy = (NTP + NTN) / (NTP + NTN + NFP + NFN)
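These measures can be computed directly from the four confusion-matrix counts; a minimal sketch (the example counts are illustrative, not from the slides' data):

```python
def accuracy(tp, tn, fp, fn):
    """(N_TP + N_TN) / all examples."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """N_TP / (N_TP + N_FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """N_TP / (N_TP + N_FN): fraction of actual positives that are found."""
    return tp / (tp + fn)
```

On a heavily imbalanced data set, accuracy alone can look excellent while precision or recall on the rare class is poor — which is why it should not be used directly.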
Precision and Recall
Important measures for classification and diagnostic testing
Confusion matrix
An ROC curve that rises at 45° is a poor model: it represents a random allocation of
cases to the classes, and is the ROC curve for the baseline model.
ROC AUC