ML PPT 2

K Nearest Neighbor Algorithm

What is Machine Learning?
What is KNN Classifier?
K-Nearest Neighbor Decision Rule
Classes and Labeling
How do we find the Label?
Example of a 2-D Space
Practical Example of Cats and Dogs
2-D Euclidean Space Feature Representation
Unknown Test Sample
K-NN where K = 5
What we did?
What is ‘k’ and its significance?
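The slides behind this outline are mostly figures, so here is a compact sketch of the k-NN decision rule they describe, in Python with made-up "cats vs dogs" data (names and numbers are ours, not from the slides):

import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    # Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(train_X - query, axis=1)
    # Take the k closest samples and return the majority label
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D feature space (e.g., two measured features per animal)
X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],
              [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])
print(knn_predict(X, y, np.array([2.8, 3.1]), k=5))   # -> dog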

Linear Models
• While most real-valued features are not intrinsically
geometric – think of a person’s age or an object’s
temperature – we can still imagine them being plotted
in a d-dimensional Cartesian coordinate system.
• We can then use geometric concepts such as lines and
planes to impose structure on this space, for instance
in order to build a classification model.
• Linearity plays a fundamental role in mathematics and
related disciplines, and the mathematics of linear
models is well-understood.
• In machine learning, linear models are of particular
interest because of their simplicity.
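For instance (our illustration, not from the slides), the simplest linear classifier scores a point x with a weight vector w and a threshold t, and predicts by which side of the resulting hyperplane the point falls on:

ŷ = sign(w · x − t)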

Some Pointers on Linear Models
 Linear models are parametric, meaning that they have a fixed
form with a small number of numeric parameters that need to
be learned from data.
 Linear models are stable, which is to say that small variations in
the training data have only limited impact on the learned model.
 Linear models are less likely to overfit the training data than
some other models, largely because they have relatively few
parameters.
 The flipside of this is that they sometimes lead to underfitting: e.g., if you are learning where the border runs between two countries from labeled samples, a linear model is unlikely to give a good approximation.

Linear Regression
• Linear models exist for all predictive tasks,
including classification, probability estimation
and regression.
• In statistics, linear regression is a linear
approach to modeling the relationship
between a dependent variable and one or
more independent variables.
• Let X be the independent variable and Y be
the dependent variable.

Linear Regression

The model is the line Y = mX + c, where m is the slope of the line and c is the y-intercept.


We will use this equation to train our model with a given dataset and predict the
value of Y for any given value of X.
Our challenge is to determine the value of m and c, such that the line corresponding
to those values is the best fitting line or gives the minimum error.
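As a sketch of what determining m and c looks like in practice (Python with made-up data; this uses the closed-form least-squares solution, not code from the slides):

import numpy as np

# Made-up (X, Y) pairs for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 3.9, 6.1, 8.0, 9.9])

# Closed-form least-squares estimates of the slope m and intercept c
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
c = Y.mean() - m * X.mean()
print(m, c)          # fitted parameters
print(m * 6.0 + c)   # predicted Y for a new X = 6.0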
Least Squares Method
• We start by introducing a method that can be
used to learn linear models for classification
and regression.
• The regression problem is to learn a function estimator f̂(x) from labeled training examples (xᵢ, yᵢ).
• The differences between the actual and the estimated function values on the training examples are called residuals.
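In symbols (standard least-squares notation; the slide's own equations are images): the residuals are εᵢ = f(xᵢ) − f̂(xᵢ), and the least-squares method picks the estimator that minimizes the sum of squared residuals, Σᵢ εᵢ².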
Difference between Univariate, Bivariate, and Multivariate
• Univariate Data: This type of data consists
of only one variable.
• It does not deal with causes or relationships
and the main purpose of the analysis is to
describe the data and find patterns that exist
within it.

Difference between Univariate, Bivariate, and Multivariate
• Bivariate Data: This type of data involves two
different variables.
• The analysis of this type of data deals with
causes and relationships and the analysis is
done to find out the relationship among the
two variables.

Covariance (Industrial Average & Monthly Returns)
When One Rises, the Other Also Rises (Linear Relationship)

Co-Vary
• This is called covariance, i.e., how the two variables co-vary: how do they change together?
• Covariance is one of a family of statistical
measures used to analyze the linear
relationships between two variables.
• How do two variables behave as a pair?

How do we measure them?
• Covariance
• Correlation
• Linear Regression

Covariance
• A descriptive measure of the linear association
between two variables.
• A positive value indicates a direct or increasing
linear relationship.
• A negative value indicates a decreasing
relationship.
• Covariance indicates the direction of the relationship (from negative to positive) but does not indicate its strength.
Variables move in the Same or
Opposite Direction

A Negative Slope

Variables have no Relationship

Calculation of Covariance
• Sample Covariance is the estimation for the
Population Covariance.
• Like all estimators, it is computed from sample data and is therefore empirical.
• The population statistic, on the other hand, is theoretical and can be calculated when you know the joint distribution.

Sample and Population Covariance
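The formulas themselves are an image on the slide; the standard definitions are:

Sample covariance:      s_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Population covariance:  σ_xy = Σᵢ (xᵢ − μx)(yᵢ − μy) / N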

Moving Tables from a Classroom (x → Students, y → Tables Moved)

Covariance v/s Correlation
• Covariance provides the Direction (positive,
negative, near zero) of the linear relationship
between variables.
• Correlation provides Direction and Strength.
• Covariance result has no upper or lower bound
and its size is dependent on the scale of the
variables.
• Correlation, by contrast, is always between -1 and +1, and its scale is independent of the scale of the variables themselves.
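A quick numeric illustration of this difference (Python, made-up data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance changes with the scale of the variables...
print(np.cov(x, y)[0, 1])             # sample covariance of x and y
print(np.cov(x * 100, y)[0, 1])       # 100x larger after rescaling x

# ...while correlation does not.
print(np.corrcoef(x, y)[0, 1])        # close to +1
print(np.corrcoef(x * 100, y)[0, 1])  # unchanged by the rescaling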
Correlation is only applicable to Linear
Relationships

How do we compute Correlation?
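The formula is an image on the slide; the standard Pearson correlation coefficient is:

r_xy = s_xy / (s_x · s_y)

i.e., the covariance of x and y divided by the product of their standard deviations, which cancels the units and bounds the result to [−1, +1].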

Covariance Matrix
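The slides here are figures; the standard definition they cover: for variables X1, …, Xd, the covariance matrix Σ is the d×d matrix with entries

Σij = Cov(Xi, Xj)

so the diagonal holds the variances Var(Xi), and the matrix is symmetric.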
Correlation Matrix
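Correspondingly (standard definition; the slide is a figure), the correlation matrix R has entries

Rij = Cov(Xi, Xj) / (σi · σj)

with ones on the diagonal.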

Gradient Descent for Linear Regression

Mean Square Error (MSE) Function
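The formula is an image on the slide; for the line y = mx + c over n points it is:

E(m, c) = (1/n) · Σᵢ (yᵢ − (m·xᵢ + c))²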

Gradient Descent Minimization
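These slides are figures; the iteration they depict is the standard gradient descent update on E(m, c), with learning rate α:

m ← m − α · ∂E/∂m,   where ∂E/∂m = −(2/n) · Σᵢ xᵢ · (yᵢ − (m·xᵢ + c))
c ← c − α · ∂E/∂c,   where ∂E/∂c = −(2/n) · Σᵢ (yᵢ − (m·xᵢ + c))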
Gradient Descent to Solve a Linear
Regression Problem using Matlab
• The best way of learning how linear regression
works is using an example:
• First let's visualize our data set:

Linear Regression
• Now what we want to do is to find a straight
line that is the best fit to this data.
• This line will be our hypothesis; let's define its function.
• The hypothesis doesn't have to be a straight line, but because a straight line is the simplest case, we're going to go with it.

Cost Function
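The slides show the cost function as an image; the standard squared-error cost, which matches the gradient updates used later, is:

J(θ) = (1/2m) · Σᵢ (hθ(x(i)) − y(i))²

where m is the number of training examples.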

Gradient Descent
• To explain what gradient descent is and how it works, it helps to plot our cost function, which will look something like the blue line in the figure.

Gradient Descent
• Now that red circle is where you could end up
depending on your data set and initial theta
(2D vector containing intercept and slope).
• Green circle is where you want to end up,
because if you remember our cost function's
job is to calculate how off our prediction is.
• So the minimum value for this function means
getting the most accurate predictions.

Gradient Descent Convergence Criteria

• First of all, one should know that when we say parameters we mean our theta vector: a 2-D vector that contains the intercept and slope of our hypothesis line.

Gradient Descent Convergence Criteria
• Now let's start with the partial derivative of
our cost function of theta (with respect to θj),
• The partial derivative is actually the slope of
the line tangent to our cost function of theta.
• It looks something like this:

Gradient Descent Convergence Criteria
• In the picture above, that red circle is the value of
cost function for a specific vector of theta.
• The bold blue line is the tangent line of that circle
and the slope of that line is the partial derivative
of our cost function for a specific vector of theta.
• Now this is the smart thing about this algorithm
that makes it useful for minimizing functions.
• Using the sign of the slope of this tangent line, it
can decide whether to increase or decrease the
value of our parameters (intercept and slope).
Implementing Gradient Descent
• main.m is the file that prepares all the data required for our algorithm, feeds this data to the function that actually implements gradient descent, and then shows us the results.
• gradient.m is the file that has the gradient
function and the implementation of gradient
descent in it.
• cost.m is a short and simple file that has a
function that calculates the value of cost function
with respect to its arguments.
main.m
% Loading the dataset
dataSet = load('DataSet.txt');

% Storing the values in separate matrices
x = dataSet(:, 1);
y = dataSet(:, 2);
• This block of code loads the comma-separated data stored in DataSet.txt into a variable called dataSet.
• dataSet is a matrix with 2 columns: the first column is sizes and the second column is prices.
• We're populating the variable x with the first column and the variable y with the second column.

Feature Normalization
• In this part, we're doing something called feature normalization.
• This is a trick that is used a lot in machine learning, and it turns out it really helps gradient descent reach convergence. What does normalizing the data mean?
• You see, often in machine learning our data sets contain very large numbers, and this can cause problems in many cases.
• Even worse is when one set of features in the data set lies in the range of 1 to 5 while another lies in the range of, say, 1,000 to 200,000. So what do we do?

Feature Normalization
• One of the ways of dealing with such problems is feature scaling; the variant used here is min-max normalization (the original text calls it mean normalization, but the formula and code below subtract the minimum, not the mean).
• To apply min-max normalization, you simply apply the formula below to every sample in your data set:
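The formula image is missing; reconstructed to match the code below:

x' = (x − min(x)) / (max(x) − min(x))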

Feature Normalization

% Do you want feature normalization?
normalization = true;

% Applying min-max normalization to our dataset
if (normalization)
    maxX = max(x);
    minX = min(x);
    x = (x - minX) / (maxX - minX);
end
Feature Normalization
• Now what we want to do is apply this formula to every single one of the values in our data set.
• So the code is:

x = (x - minX) / (maxX - minX);

Feature Normalization
• The variable x in the code above is an n×1 matrix (a column vector) that contains all of our house sizes.
• The min() function simply finds the smallest value in that matrix.
• We then subtract that number from every element of the matrix, which yields another matrix of shifted values.

Feature Normalization
• Then we divide this matrix by another number: the biggest value in our original vector, max(x), minus the smallest value in our original vector, min(x).
• There is no need to normalize the dependent variable y (except maybe in a few cases).
• So now we have a variable x that contains the normalized values of our data set and a variable y that contains the prices of our houses.

Scatter Plot

% Adding a column of ones to the beginning of the 'x' matrix
x = [ones(length(x), 1) x];

% Plotting the dataset
figure;
plot(x(:, 2), y, 'rx', 'MarkerSize', 10);
xlabel('Size (squared meters)');
ylabel('Price');
title('Housing Prices');

Calling Gradient Descent Algorithm
% Running gradient descent on the data
%   'x' is our input matrix
%   'y' is our output matrix
%   'parameters' is a matrix containing our initial intercept and slope
parameters = [0; 0];
learningRate = 0.1;
repetition = 1500;

[parameters, costHistory] = gradient(x, y, parameters, learningRate, repetition);

Return Parameters
• Now this is where it all happens: we are calling a function called gradient that runs gradient descent on our data based on the arguments we send it.
• It returns two things. The first, parameters, is a matrix that contains the intercept and slope of the line that best fits our data set.
• The second, costHistory, is another matrix containing the value of our cost function on each iteration of gradient descent, so we can plot the cost function later.
Learning Rate
• There is no definitive way to come up with the
best learning rate and repetition values.
• You just have to tinker with it to find the best
values for your data set.
• Although plotting the cost function can help
you with this because it lets you see how good
(or bad) the algorithm did its job.

Plotting Hypothesis
% Plotting our final hypothesis
figure;
plot(min(x(:, 2)):max(x(:, 2)), parameters(1) + parameters(2) * (min(x(:, 2)):max(x(:, 2))));
hold on;

• At this point, the intercept and slope of the hypothesis line have been modified by the algorithm, and the resulting line is the best fit to our data.
• parameters(1) is the intercept of our hypothesis line, θ1.
• parameters(2) is the slope of our hypothesis line, θ2.

gradient.m
% Getting the length of our dataset
m = length(y);

% Creating a vector of zeros for storing our cost function history
costHistory = zeros(repetition, 1);

• In the above code, we are storing the number of training examples in a variable called m.
• We are also pre-allocating a repetition×1 vector called costHistory.

gradient.m
• Notice that we are using our repetition variable
to determine how many values are going to be
stored in this vector.
• Now the reason for that is because on every
iteration of gradient descent we are going to have
different values for intercept and slope of our
hypothesis line.
• Therefore on every iteration we are going to have
a different value for our cost function.
Convergence
• Inside the for loop is where it all happens. First let me explain what formulas we're using. We said that the formula for gradient descent is this:
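The equation is an image on the slide; the standard update rule, with learning rate α, is:

θj := θj − α · ∂J(θ)/∂θj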

Convergence
• If we plug in the definition of our cost function
and take into consideration that we have two
parameters (intercept and slope), we end up
with an algorithm like this:
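Reconstructed (the slide shows it as an image); plugging in the squared-error cost gives, for our two parameters:

θ1 := θ1 − α · (1/m) · Σᵢ (hθ(x(i)) − y(i))
θ2 := θ2 − α · (1/m) · Σᵢ (hθ(x(i)) − y(i)) · x(i)

which is exactly what the two parameters(...) update lines in gradient.m compute, with the column of ones standing in for the constant factor in the first update.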

h Calculation
h = (x * parameters - y)';

• The whole point of this line of code is to calculate the summation that appears in both of our formulas.
• The best way to explain it is with the matrices themselves: x * parameters is an m×1 vector containing one prediction per training example:

Why Column Added?
• So after multiplying these matrices, we are left with another matrix that contains our predicted price, h(x), for each one of the sizes in our training set.
• This is why we added that column of ones to this matrix in main.m. Now we are going to have to subtract the real prices, y, from these predictions:

Transpose the Column Vector
• Now we almost have the entire summation part of our algorithm in a matrix.
• But in order to proceed and do the rest of our calculation, we have to transpose this matrix, which in this case means converting the m×1 column vector into a 1×m row vector, so that it can multiply the m×1 feature columns of x.

Running Gradient Descent
% Running gradient descent
for i = 1:repetition

    % Calculating the prediction errors, transposed into a row vector
    h = (x * parameters - y)';

    % Updating the parameters
    parameters(1) = parameters(1) - learningRate * (1/m) * h * x(:, 1);
    parameters(2) = parameters(2) - learningRate * (1/m) * h * x(:, 2);

    % Keeping track of the cost function
    costHistory(i) = cost(x, y, parameters);

end
Complete Code
• Now with this vector saved in a variable
called h, we are ready to move on to the next
part of the code:

parameters(1) = parameters(1) - learningRate * (1/m) * h * x(:, 1);
parameters(2) = parameters(2) - learningRate * (1/m) * h * x(:, 2);

Logistic Function
• To model population growth using
a differential equation, we first need to
introduce some variables and relevant terms.
• The variable t will represent time.
• The units of time can be hours, days, weeks,
months, or even years.
• Any given problem must specify the units used
in that particular problem.

Logistic Function
• The variable n will represent population.
• Since the population varies over time, it is
understood to be a function of time.
• Therefore we use the notation n(t) for the
population as a function of time.
• If n(t) is a differentiable function, then the first
derivative dn/dt represents the instantaneous
rate of change of the population as a function
of time.
Logistic Function
• In Exponential Growth and Decay, we studied the exponential growth and
decay of populations and radioactive substances.
• You have likely studied exponential growth and even modeled populations
using exponential functions.
• In this section we'll look at a special kind of exponential function called
the logistic function.
• The logistic function models the exponential growth of a population, but also
considers factors like the carrying capacity of land.
• The carrying capacity of an environment is the maximum population size of a
biological species that can be sustained by that specific environment, given
the food, habitat, water, and other resources available.
• A certain region simply won't support unlimited growth because as one
population grows, its resources diminish.
• So a logistic function puts a limit on growth.

Exponential v/s Logistic

The logistic function is exponential for early times, but the growth slows as it
reaches some limit. In this hypothetical case, the limit seems to be about 85
individuals; the function approaches a horizontal asymptote at P(t) = 85.

Exponential v/s Logistic
• Exponential growth is unchecked growth: growth without limits.
• Exponential functions aren't realistic models of population growth, or of other phenomena, except for the early stages of growth where space, nutrients, and other necessities are effectively unlimited.

The Logistic Function
• The logistic function can be written in a number of ways that are all
only subtly different.
• In this version, n(t) is the population ("number") as a function of time t; t0 is the initial time, and the term (t − t0) is just a flexible horizontal translation of the logistic function.
• k is a parameter that affects the rate of exponential growth.
• L is the horizontal asymptote, the limit on the size of a population.
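The formula itself is an image on the slide; with the symbols defined above, it is:

n(t) = L / (1 + e^(−k·(t − t0)))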

Limits of Growth
• One important feature of the logistic function is its behavior at large values of the independent variable.
• Here we'll define a population function n(t) as a logistic function.
• There are two adjustable parameters in this function, L and k.
• These are a vertical scaling parameter (L) and a horizontal scaling parameter (k) that allow us to stretch or compress such a function to fit our data.

Varying L and k
Limit Analysis

• where (t − t0) has been replaced with t.
• It's the same thing; t − t0 just measures the change in time from some initial time, t0.
• Here, just to simplify a bit, we'll assume that t0 = 0.

Limit Analysis
• Now lets look at how this function behaves in
the limit as t → ∞:
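The worked limit is an image on the slide; for k > 0 it goes:

as t → ∞, e^(−k·t) → 0, so n(t) = L / (1 + e^(−k·t)) → L / (1 + 0) = L

i.e., the population approaches the carrying capacity L.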

Inflection Point
• One important point on the logistic curve is
the inflection point, the point where the
curvature of the graph changes from concave-
upward to concave-downward.

Logistic Regression Model

Logistic Regression Model
• Logistic regression was used in the biological sciences in the early twentieth century.
• It was then used in many social science
applications.
• Logistic Regression is used when the
dependent variable (target) is categorical.

Examples
• To predict whether an email is spam (1) or not (0)
• If we use linear regression for this problem, there
is a need for setting up a threshold based on
which classification can be done.
• Whether the tumor is malignant (1) or not (0)
• Say the actual class is malignant, the predicted continuous value is 0.4, and the threshold is 0.5; the data point will then be classified as not malignant, which can lead to serious consequences in a real application.
Linear Regression v/s Logistic
Regression
• From this example, it can be inferred that
linear regression is not suitable for
classification problem.
• Linear regression is unbounded, and this brings logistic regression into the picture.
• Logistic regression outputs strictly range from 0 to 1.

Types of Logistic Regression
• Binary Logistic Regression
• Multinomial Logistic Regression
• Ordinal Logistic Regression

Types of Logistic Regression
• Binary Logistic Regression: The categorical response has only two possible outcomes. Example: Spam or Not
• Multinomial Logistic Regression: Three or more
categories without ordering. Example: Predicting
which food is preferred more (Veg, Non-Veg,
Vegan)
• Ordinal Logistic Regression: Three or more
categories with ordering. Example: Movie rating
from 1 to 5
Model
• Output = 0 or 1
• Hypothesis: Z = WX + B
• hΘ(x) = sigmoid(Z)
• If Z goes to infinity, Y(predicted) will become 1, and if Z goes to negative infinity, Y(predicted) will become 0.
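The sigmoid itself appears as a figure on the slide; its standard definition is sigmoid(Z) = 1 / (1 + e^(−Z)), which squashes any real Z into the interval (0, 1).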
Analysis of the Hypothesis
• The output from the hypothesis is the
estimated probability.
• This is used to infer how confident we can be that the predicted value is the actual value, given an input X.
• Based on the x1 value, let’s say we obtained
the estimated probability to be 0.8.
• This tells that there is 80% chance that an
email will be spam.
Analysis of the Hypothesis

• This justifies the name 'logistic regression': data is first fit into a linear regression model, and the result is then acted upon by a logistic function that predicts the target categorical dependent variable.
Vectorized Representation
• The decision boundary can be described by an
equation.
• As in linear regression, the logistic regression algorithm will be able to find the best θ parameters, i.e., θs that make the decision boundary actually separate the data points correctly.
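As an illustrative example (ours, not the slide's): with two features, a linear decision boundary is the line θ0 + θ1·x1 + θ2·x2 = 0, with y = 1 predicted on one side and y = 0 on the other.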

m Training Vectors
• Suppose we have a generic training set made of m training examples, where (x(1), y(1)) is the 1st example and so on.
• More specifically, x(m) is the input variable of the mth example, while y(m) is its output variable.
• This being a classification problem, each example of course has its output y bound between 0 and 1.

Input Vector and Hypothesis Function
• In other words, y ∈ {0, 1}. Each example is represented as usual by its feature vector.
• Finally, we have the hypothesis function for logistic regression, as seen earlier.

Minimize Cost Function
• Our task now is to choose the best
parameters θs in the equation above, given
the current training set, in order to minimize
errors.
• Remember that θ is not a single parameter: it
expands to the equation of the decision
boundary which can be a line or a more
complex formula (with more θs to guess).

Minimize Cost Function
• The procedure is similar to what we did for
linear regression.
• Define a cost function and try to find the best
possible values of each θ by minimizing the
cost function output.
• The minimization will be performed by a
gradient descent algorithm, whose task is to
parse the cost function output until it finds
the lowest minimum point.

Logistic Loss Function
• We can make it more compact into a one-line
expression:
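The expression is an image on the slide; the standard one-line logistic loss is:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))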

• With the optimization in place, the logistic regression cost function can be rewritten as:
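Again reconstructed (the slide shows it as an image):

J(θ) = −(1/m) · Σᵢ [ y(i)·log(hθ(x(i))) + (1 − y(i))·log(1 − hθ(x(i))) ]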

How to minimize Cost Function?
• We have the hypothesis function and the cost
function
• It's now time to find the best values for θs
parameters in the cost function.
• Minimize the cost function by running the
gradient descent algorithm.
• The procedure is identical to what we did for
linear regression.

How to minimize Cost Function?
• More formally, we want to minimize the cost function: min over θ of J(θ).
• This will output a set of parameters θ, the best ones (i.e., with the least error).
• Once done, we will be ready to make predictions on new input examples, with their features x, by using the new θs in the hypothesis function, where hθ(x) is the output, the prediction: the probability that y = 1.

Gradient Descent
• The way we are going to minimize the cost
function is by using the gradient descent.
• To minimize the cost function we have to run
the gradient descent function on each
parameter:

Gradient Descent
• Remember to simultaneously update all θj as we did in the linear regression counterpart.
• If you have n features, that is a parameter vector θ = [θ0, θ1, …, θn], and all those parameters have to be updated simultaneously on each iteration:

Gradient Descent
• Back to the algorithm, the computation of the
daunting derivative ∂J(θ)/∂θj, which becomes:
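Reconstructed (the slide's equation is an image; this is the standard result of differentiating the logistic cost):

∂J(θ)/∂θj = (1/m) · Σᵢ (hθ(x(i)) − y(i)) · xj(i)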

• So the loop also can be rewritten as
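That is (standard form; repeat until convergence, updating all θj simultaneously):

θj := θj − (α/m) · Σᵢ (hθ(x(i)) − y(i)) · xj(i)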

Gradient Descent
• What's changed however is the definition of
the hypothesis hθ(x): for linear regression we
had hθ(x)=θ x, whereas for logistic regression
we have hθ(x)=1/(1+eθ x).

Cost Function

• Linear regression uses mean squared error as its cost function.
• If this is used for logistic regression, it will be a non-convex function of the parameters (theta).
• Gradient descent will converge to the global minimum only if the function is convex.

End to End Machine Learning Project
• Look at the big picture.
• Get the data.
• Discover and visualize the data to gain insights.
• Prepare the data for Machine Learning
algorithms.
• Select a model and train it.
• Fine-tune your model.
• Present your solution.
• Launch, monitor, and maintain your system.
Popular Real Open Datasets
• Popular open data repositories:
—UC Irvine Machine Learning Repository
—Kaggle datasets
—Amazon’s AWS datasets
• Meta portals (they list open data repositories):
—http://dataportals.org/
—http://opendatamonitor.eu/
—http://quandl.com/
• Other pages listing many popular open data repositories:
—Wikipedia’s list of Machine Learning datasets
—Quora.com question
—Datasets subreddit

California Housing Prices

California Housing Prices
• The first task is to build a model of housing prices in California
using the California census data.
• This data has metrics such as the population, median income,
median housing price, and so on for each block group in
California.
• Block groups are the smallest geographical unit for which the
US Census Bureau publishes sample data (a block group
typically has a population of 600 to 3,000 people).
• We will just call them “districts” for short.
• Your model should learn from this data and be able to predict
the median housing price in any district, given all the other
metrics.
California Housing Prices
• The first question is what exactly is the
business objective; building a model is
probably not the end goal.
• How does the company expect to use and
benefit from this model?
• This is important because it will determine how you frame the problem, which algorithms you select, and which performance measure you use to evaluate your model.

Task at Hand – Get a Bigger Picture
• A prediction of a district’s median housing
price.
• This will be fed as input to another Machine
Learning system, along with many other
signals.
• This downstream system will determine
whether it is worth investing in a given area or
not.

Machine Learning Pipeline

Pipelines
• A sequence of data processing components is
called a data pipeline.
• Pipelines are very common in Machine
Learning systems, since there is a lot of data to
manipulate and many data transformations to
apply.
• Components typically run asynchronously.

Pipelines
• Each component pulls in a large amount of
data, processes it, and spits out the result in
another data store.
• Then some time later the next component in
the pipeline pulls this data and spits out its
own output, and so on.
• Each component is fairly self-contained: the
interface between components is simply the
data store.

Get the Nature of the Model
• First, one needs to frame the problem: is it
supervised, unsupervised, or Reinforcement
Learning?
• Is it a classification task, a regression task, or
something else?
• Should you use batch learning or online
learning techniques?

Get the Nature of the Data
• Let's see: it is clearly a typical supervised learning task, since you are given labeled training examples.
• Each instance comes with the expected output,
i.e., the district’s median housing price.
• Moreover, it is also a typical regression task, since
you are asked to predict a value.
• More specifically, this is a multiple regression
problem since the system will use multiple
features to make a prediction (it will use the
district’s population, the median income, etc.).
Get the Nature of the Data
• It is also a univariate regression problem since we are only
trying to predict a single value for each district.
• If we were trying to predict multiple values per district, it
would be a multivariate regression problem.
• Finally, there is no continuous flow of data coming in the
system.
• There is no particular need to adjust to changing data
rapidly, and the data is small enough to fit in memory, so
plain batch learning should do just fine.
• If the data were huge, you could either split your batch learning work across multiple servers (e.g., using MapReduce) or use online learning techniques instead.

Select a Performance Measure
• Next step is to select a performance measure.
• A typical performance measure for regression
problems is the Root Mean Square Error
(RMSE).
• It gives an idea of how much error the system
typically makes in its predictions, with a higher
weight for large errors.
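The slide shows the formula as an image; the standard definition, where m is the number of districts, x(i) the feature vector of the ith district, y(i) its label, and h the prediction function, is:

RMSE(X, h) = √( (1/m) · Σᵢ (h(x(i)) − y(i))² )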

Mean Absolute Error
• The RMSE is generally the preferred performance
measure for regression tasks.
• In some contexts one may prefer to use another
function.
• For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation).
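Reconstructed likewise (the slide's formula is an image):

MAE(X, h) = (1/m) · Σᵢ | h(x(i)) − y(i) |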

Download the Data
• You can just download a single compressed
file, housing.tgz, which contains a comma-
separated value (CSV) file called housing.csv
with all the data.
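A sketch of a small download helper (ours; the URL below is a placeholder, so substitute wherever your course hosts housing.tgz):

import os
import tarfile
import urllib.request

HOUSING_URL = "https://example.com/datasets/housing/housing.tgz"   # placeholder URL
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Download housing.tgz and extract housing.csv into housing_path
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)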

Take a look at the Data Structure

Load Pandas and Read the Data File
• Now let’s load the data using Pandas. Once
again you should write a small function to
load the data:
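A minimal version of such a function might look like this (assuming the HOUSING_PATH set up in the earlier sketch):

import os
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    # Read housing.csv into a pandas DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()   # shows the first five rows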

Take a look at the Data Structure
• Each row represents one district. There are 10
attributes (you can see the first 6 in the
screenshot):
• longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

Housing Info
• The info() method is useful to get a quick
description of the data, in particular the total
number of rows, and each attribute’s type and
number of non-null values.
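For example (housing being the DataFrame loaded above):

housing.info()   # row count, each attribute's dtype, and non-null counts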

Housing Info
• There are 20,640 instances in the dataset,
which means that it is fairly small by Machine
Learning standards.
• But it’s perfect to get started.
• Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature.
• We will need to take care of this later.

Attributes and their Types
• All attributes are numerical, except the
ocean_proximity field.
• Its type is object, so it could hold any kind of
Python object.
• The values in the ocean_proximity column
were repetitive, which means that it is
probably a categorical attribute.

Attributes and their Types
• We can find out what categories exist and how
many districts belong to each category by
using the value_counts() method:
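For example:

housing["ocean_proximity"].value_counts()   # category -> number of districts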

describe()
• The describe() method shows a summary of
the numerical attributes
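For example:

housing.describe()   # count, mean, std, min, 25%/50%/75% percentiles, max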

describe()
• The count, mean, min, and max rows are self-
explanatory.
• Note that the null values are ignored (so, for example,
count of total_bedrooms is 20,433, not 20,640).
• The std row shows the standard deviation, which
measures how dispersed the values are.
• When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the "68-95-99.7" rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

describe()
• The 25%, 50%, and 75% rows show the
corresponding percentiles.
• A percentile indicates the value below which a
given percentage of observations in a group of
observations falls.
• For example, 25% of the districts have a
housing_median_age lower than 18, while 50%
are lower than 29 and 75% are lower than 37.
• These are often called the 25th percentile (or 1st
quartile), the median, and the 75th percentile (or
3rd quartile).
Histogram
• Another quick way to get a feel of the type of
data you are dealing with is to plot a histogram
for each numerical attribute.
• A histogram shows the number of instances (on
the vertical axis) that have a given value range (on
the horizontal axis).
• You can either plot this one attribute at a time, or
you can call the hist() method on the whole
dataset, and it will plot a histogram for each
numerical attribute
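For example (matplotlib assumed available):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))   # one histogram per numerical attribute
plt.show()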
Histogram
• For example, you can see that slightly over
800 districts have a median_house_value
equal to about $100,000.

Overview of Probability Theory
• Any realistic model of a real-world phenomenon must
take into account the possibility of randomness.
• That is, more often than not, the quantities we are
interested in will not be predictable in advance but,
rather, will exhibit an inherent variation that should be
taken into account by the model.
• This is usually accomplished by allowing the model to
be probabilistic in nature.
• Such a model is, naturally enough, referred to as a
probability model.

Sample Space
• Suppose that we are about to perform an
experiment whose outcome is not predictable in
advance.
• However, while the outcome of the experiment
will not be known in advance, let us suppose that
the set of all possible outcomes is known.
• This set of all possible outcomes of an
experiment is known as the sample space of the
experiment and is denoted by S.
Event

Probabilities Defined on Events
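These slides are figures; the standard axioms of probability they cover are:

(1) 0 ≤ P(E) ≤ 1 for every event E
(2) P(S) = 1
(3) For mutually exclusive events E1, E2, …:  P(E1 ∪ E2 ∪ …) = P(E1) + P(E2) + …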
Conditional Probability
• Independent Events
• Events can be "Independent", meaning each
event is not affected by any other events.
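In symbols (standard definition, not in the extracted text): A and B are independent when P(A and B) = P(A) · P(B).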

Conditional Probability
• Dependent Events
• But events can also be "dependent" ... which
means they can be affected by previous
events ...
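For such events, the conditional probability (standard definition) is P(A | B) = P(A and B) / P(B).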

Tree Diagram
Bayes' Theorem
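The slides are figures; the standard statement is

P(A | B) = P(B | A) · P(A) / P(B)

which re-expresses a conditional probability in terms of the reverse conditional.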
Random Variable
• It frequently occurs that in performing an experiment we
are mainly interested in some functions of the outcome as
opposed to the outcome itself.
• For instance, in tossing dice we are often interested in the
sum of the two dice and are not really concerned about the
actual outcome.
• That is, we may be interested in knowing that the sum is
seven and not be concerned over whether the actual
outcome was (1, 6) or (2, 5) or (3, 4) or (4, 3) or (5, 2) or (6,
1).
• These quantities of interest, or more formally, these real-
valued functions defined on the sample space, are known
as random variables.

Discrete and Continuous RV

A random variable is called discrete if it has either a finite or a countable number of possible values. A random variable is called continuous if its possible values contain a whole interval of numbers.

Probability Distributions of Discrete
Random Variable
• Associated to each possible value x of a
discrete random variable X is the
probability P(x) that X will take the value x in
one trial of the experiment.
• The probability distribution of a
discrete random variable X is a list of each
possible value of X together with the
probability that X takes that value in one trial
of the experiment.
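A tiny example (ours, not from the slides): for one roll of a fair die, P(X = x) = 1/6 for each x in {1, 2, 3, 4, 5, 6}, and these six probabilities sum to 1.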

3/1/2024 183
Probability Distributions of Discrete
Random Variable

3/1/2024 184
Probability Distributions of Discrete
Random Variable

3/1/2024 185
Probability Distributions of Discrete
Random Variable

3/1/2024 186
Mean of a Discrete Random Variable

Variance and Standard Deviation of a
Discrete Random Variable
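The formulas are images on the slides; the standard definitions for a discrete random variable are:

μ = E[X] = Σ x·P(x)
σ² = Var(X) = Σ (x − μ)²·P(x),   σ = √σ²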

3/1/2024 188
Probability Mass Function
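The slide's content is an image; the standard definition: for a discrete random variable X, the probability mass function is p(a) = P{X = a}.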

Cumulative Distribution Function
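Likewise (standard definition): the cumulative distribution function is F(a) = P{X ≤ a} = Σ over x ≤ a of p(x), a non-decreasing function of a.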

The Bernoulli Random Variable
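The slide is an image; the standard definition: X is Bernoulli with parameter p if P(X = 1) = p and P(X = 0) = 1 − p, so E[X] = p and Var(X) = p(1 − p).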

Continuous Random Variable – Probability Density Function
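These slides are figures; the standard definition they cover: X is continuous with probability density function f if

P{a ≤ X ≤ b} = ∫ from a to b of f(x) dx,   with ∫ over all x of f(x) dx = 1.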
Exponential Random Variable
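The slide is an image; the standard density with rate λ > 0 is f(x) = λ·e^(−λx) for x ≥ 0, and f(x) = 0 for x < 0.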

Normal Random Variable
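The slide is an image; the standard density with mean μ and standard deviation σ is

f(x) = (1 / (σ·√(2π))) · e^(−(x − μ)² / (2σ²))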

Maximum Likelihood Estimator
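The slide's notes are an image; the standard setup: given i.i.d. observations x1, …, xn from a density f(x; θ), the likelihood is L(θ) = Πᵢ f(xᵢ; θ), and the maximum likelihood estimator is θ̂ = argmax over θ of log L(θ).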

Bivariate Gaussian Distribution

Multivariate Gaussian Distribution
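Both Gaussian slides are figures; the standard d-dimensional density (the bivariate case is d = 2), with mean vector μ and covariance matrix Σ, is

f(x) = (1 / ((2π)^(d/2) · |Σ|^(1/2))) · exp( −(1/2)·(x − μ)ᵀ Σ⁻¹ (x − μ) )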

Maximum Likelihood Estimation –
Random Experiment
• In a random experiment, the outcome cannot
be predicted with certainty, before the
experiment is run.
• We assume that we can identify a fixed
set S that includes all possible outcomes of a
random experiment.
• This set plays the role of the universal set
when modeling the experiment.

• For simple experiments, S may be precisely the
set of possible outcomes.
• More often, for complex experiments, S is a
mathematically convenient set that includes the
possible outcomes and perhaps other elements
as well.
• For example, if the experiment is to throw a
standard die and record the score that occurs, we
would let S={1,2,3,4,5,6}, the set of possible
outcomes.