
Regression

Hello and welcome! In this video we'll be giving a brief introduction to


regression. So let's get started. Look at this data set. It's related to co2
emissions from different cars. It includes engine size, number of cylinders, fuel
consumption, and co2 emission from various automobile models. The question is:
given this data set can we predict the co2 emission of a car using other fields
such as engine size or cylinders? Let's assume we have some historical data from
different cars and assume that a car such as in row 9 has not been manufactured
yet, but we're interested in estimating its approximate co2 emission after
production. Is it possible? We can use regression methods to predict a continuous
value such as co2 emission using some other variables. Indeed regression is the
process of predicting a continuous value. In regression there are two types of
variables: a dependent variable and one or more independent variables. The
dependent variable can be seen as the state, target, or final goal we study and try
to predict. And the independent variables, also known as explanatory variables, can
be seen as the causes of those states. The independent variables are shown
conventionally by X and the dependent variable is denoted by Y. A regression model
relates Y, or the dependent variable, to a function of X, i.e., the independent
variables. The key point in regression is that our dependent value should be
continuous and cannot be a discrete value. However, the independent variable, or
variables, can be measured on either a categorical or continuous measurement scale.
So, what we want to do here is to use the historical data of some cars using one or
more of their features and from that data make a model. We use regression to build
such a regression estimation model; then the model is used to predict the expected
co2 emission for a new or unknown car. Basically, there are two types of regression
models: simple regression and multiple regression. Simple regression is when one
independent variable is used to estimate a dependent variable. It can be either
linear or non-linear. For example, predicting co2 emission using the variable of
engine size. Linearity of regression is based on the nature of the relationship
between the independent and dependent variables. When more than one independent variable is
present the process is called multiple linear regression. For example, predicting
co2 emission using engine size and the number of cylinders in any given car. Again,
depending on the relation between dependent and independent variables it can be
either linear or non-linear regression. Let's examine some sample applications of
regression. Essentially we use regression when we want to estimate a continuous
value. For instance, one of the applications of regression analysis could be in the
area of sales forecasting. You can try to predict a sales person's total yearly
sales from independent variables such as age, education, and years of experience.
It can also be used in the field of psychology, for example, to determine
individual satisfaction, based on demographic and psychological factors. We can use
regression analysis to predict the price of a house in an area, based on its size,
number of bedrooms, and so on. We can even use it to predict employment income from
independent variables such as hours of work, education, occupation, sex, age, years
of experience, and so on. Indeed, you can find many examples of the usefulness of
regression analysis in these and many other fields or domains, such as finance,
healthcare, retail, and more. There are many regression algorithms; each has its own
importance and specific conditions to which its application is best suited. And while
we cover just a few of them in this course, they give you enough of a base to explore
different regression techniques.
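To make the setup concrete, here is a minimal Python sketch of how such a dataset might be framed for regression; the file name and column names are hypothetical stand-ins, not the actual course data.

import pandas as pd

# Load a fuel-consumption dataset (hypothetical file and column names,
# mirroring the fields described above).
df = pd.read_csv("FuelConsumption.csv")

# Independent variables (X) and the continuous dependent variable (y).
X = df[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
y = df["CO2EMISSIONS"]

print(X.head())
print(y.head())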

Simple Linear Regression


Hello and welcome. In this video, we'll be covering linear regression. You don't
need to know any linear algebra to understand topics in linear regression. This
high-level introduction will give you enough background information on linear
regression to be able to use it effectively on your own problems. So let's get
started.
Let's take a look at this data set. It's related to the Co2 emission of different
cars. It includes engine size, cylinders, fuel consumption and Co2 emissions for
various car models. The question is, given this data set, can we predict the Co2
emission of a car using another field such as engine size? Quite simply, yes. We
can use linear regression to predict a continuous value such as Co2 emission by
using other variables. Linear regression is the approximation of a linear model
used to describe the relationship between two or more variables. In simple linear
regression, there are two variables, a dependent variable and an independent
variable. The key point in the linear regression is that our dependent value should
be continuous and cannot be a discrete value. However, the independent variables
can be measured on either a categorical or continuous measurement scale. There are
two types of linear regression models. They are simple regression and multiple
regression. Simple linear regression is when one independent variable is used to
estimate a dependent variable. For example, predicting Co2 emission using the
engine size variable. When more than one independent variable is present the
process is called multiple linear regression, for example, predicting Co2 emission
using engine size and cylinders of cars. Our focus in this video is on simple
linear regression. Now let's see how linear regression works. Okay, so let's look
at our data set again. To understand linear regression, we can plot our variables
here. We show engine size as an independent variable and emission as the target
value that we would like to predict. A scatter plot clearly shows the relation
between variables where changes in one variable explain or possibly cause changes
in the other variable. Also, it indicates that these variables are linearly
related. With linear regression you can fit a line through the data. For instance,
as the engine size increases, so do the emissions. With linear regression you can
model the relationship of these variables. A good model can be used to predict what
the approximate emission of each car is.
How do we use this line for prediction now?
Let us assume for a moment that the line is a good fit of the data. We can use it
to predict the emission of an unknown car. For example, for a sample car with
engine size 2.4, you can find the emission is 214.
Now, let's talk about what the fitting line actually is.
We're going to predict the target value y, in our case using the independent
variable engine size, represented by x1. The fit line is traditionally shown as a
polynomial. In a simple regression problem with a single x, the form of the model would
be y hat = theta 0 plus theta 1 x1. In this equation, y hat is the dependent variable,
or the predicted value, and x1 is the independent variable.
Theta 0 and theta 1 are the parameters of the line that we must adjust. Theta 1 is
known as the slope or gradient of the fitting line and theta 0 is known as the
intercept.
Theta 0 and theta 1 are also called the coefficients of the linear equation.
You can interpret this equation as y hat being a function of x1, or y hat being
dependent on x1. How would you draw a line through the points? And how do you
determine which line fits best?
Linear regression estimates the coefficients of the line. This means we must
calculate theta 0 and theta 1 to find the best line to fit the data. This line
would best estimate the emission of the unknown data points. Let's see how we can
find this line or, to be more precise, how we can adjust the parameters to make the
line the best fit for the data. For a moment, let's assume we've already found the
best fit line for our data. Now, let's go through all the points and check how well
they align with this line. Best fit here means that if we have, for instance, a car
with engine size x1 = 5.4 and actual Co2 = 250, its Co2 should be predicted very
close to the actual value, which is y = 250 based on historical data. But if we use
the fit line, or more precisely our polynomial with known parameters, to
predict the Co2 emission, it will return y hat = 340. Now if you compare the actual
value of the emission of the car with what we've predicted using our model, you
will find out that we have a 90 unit error. This means our prediction line is not
accurate. This error is also called the residual error. So we can say the error is
the distance from the data point to the fitted regression line.
The mean of all residual errors shows how poorly the line fits the whole data
set. Mathematically, it can be expressed by the Mean Squared Error equation,
MSE = (1/n) Σ(y_i - y_hat_i)^2. Our objective is to find a line where the mean of all these errors is
minimized. In other words, the mean error of the prediction using the fit line
should be minimized. Let's reword it more technically. The objective of linear
regression, is to minimize this MSE equation and to minimize it, we should find the
best parameters theta 0 and theta 1. Now the question is how to find theta 0 and
theta 1 in such a way that it minimizes this error?
How can we find such a perfect line? Or said another way, how should we find the
best parameters for our line? Should we move the line a lot randomly and calculate
the MSE value every time and choose the minimum one? Not really. Actually, we have
two options here. Option one, we can use a mathematical approach, or option two, we
can use an optimization approach. Let's see how we can easily use a mathematical
formula to find theta 0 and theta 1.
As mentioned before, theta 0 and theta 1 in the simple linear regression are the
coefficients of the fit line. We can use a simple equation to estimate these
coefficients. That is, given that it's a simple linear regression with only two
parameters, and knowing that theta 0 and theta 1 are the intercept and slope of the
line, we can estimate them directly from our data. It requires that we calculate
the mean of the independent and dependent or target columns from the data set.
Notice that all of the data must be available to traverse and calculate the
parameters. It can be shown that the intercept and slope can be calculated using
these equations.
We can start off by estimating the value for theta 1. This is how you can find the
slope of a line based on the data. X bar is the average value for the engine size
in our data set. Please consider that we have nine rows here, rows 0 to 8. First we
calculate the average of x1 and of y, then we plug it into the slope equation to
find theta 1.
The xi and yi in the equation refer to the fact that we need to repeat these
calculations across all values in our data set, and i refers to the ith value of x
or y. Applying all values, we find theta 1 equals 43.98. This is our second parameter,
and it is used to calculate the first parameter, which is the intercept of the line.
Now we can plug theta 1 into the line equation to find theta 0. It is easily
calculated that theta 0 equals 92.94. So these are the two parameters for the line,
where theta 0 is also called the bias coefficient, and theta 1 is the coefficient
for the engine size column.
As a side note, you really don't need to remember the formula for calculating these
parameters, as most of the libraries used for machine learning in Python, R and
Scala can easily find these parameters for you. But it's always good to understand
how it works. Now, we can write down the polynomial of the line.
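As an illustration of these closed-form calculations, here is a minimal NumPy sketch; the engine-size and emission values below are made-up stand-ins for the nine rows discussed, so the resulting numbers will differ from those in the video.

import numpy as np

# Made-up engine sizes (x) and CO2 emissions (y) for nine cars.
x = np.array([2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7])
y = np.array([196.0, 221.0, 136.0, 255.0, 244.0, 230.0, 232.0, 255.0, 267.0])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares slope and intercept:
#   theta1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
#   theta0 = y_bar - theta1 * x_bar
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar

print("theta0 (intercept):", theta0)
print("theta1 (slope):", theta1)
print("Prediction for engine size 2.4:", theta0 + theta1 * 2.4)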
So, we know how to find the best fit for our data and its equation. Now the
question is how can we use it to predict the emission of a new car based on its
engine size?
After we found the parameters of the linear equation, making predictions is as
simple as solving the equation for a specific set of inputs.
Imagine we are predicting Co2 emission, or y, from engine size, or x for the
automobile in record number 9. Our linear regression model representation for this
problem would be y hat = theta 0 + theta 1 x1. Or if we map it to our data set, it
would be Co2Emission = theta 0 + theta 1 EngineSize.
As we saw, we can find theta 0 and theta 1 using the equations that we just talked
about. Once found, we can plug them into the equation of the linear model. For example,
let's use theta 0 = 92.94 and theta 1 = 43.98. So, we can rewrite the linear model as
Co2Emission equals 92.94 plus 43.98 times EngineSize. Now let's plug in the 9th row of
our data set and calculate the Co2 emission for a car with an engine size of 2.4. So
Co2Emission = 92.94 + 43.98 x 2.4. Therefore, we can predict that the Co2Emission for
this specific car would be 198.492. Let's talk a bit about why linear
regression is so useful. Quite simply, it is the most basic regression to use and
understand. In fact, one reason why linear regression is so useful is that it's
fast. It also doesn't require tuning of parameters. So, something like tuning the K
parameter in K-Nearest Neighbors, or the learning rate in neural networks, isn't
something to worry about. Linear regression is also easy to understand, and highly
interpretable.
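For reference, here is a minimal scikit-learn sketch of simple linear regression; the CSV file and column names are hypothetical, and the fitted intercept_ and coef_ attributes play the role of theta 0 and theta 1 described above.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical fuel-consumption data with an engine-size feature and a
# CO2-emission target.
df = pd.read_csv("FuelConsumption.csv")
X = df[["ENGINESIZE"]].values   # independent variable (2D array)
y = df["CO2EMISSIONS"].values   # continuous dependent variable

model = LinearRegression()
model.fit(X, y)

print("theta0 (intercept):", model.intercept_)
print("theta1 (slope):", model.coef_[0])

# Predict the CO2 emission for a car with engine size 2.4.
print("Prediction for 2.4:", model.predict([[2.4]])[0])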
Thanks for watching this video. (Music)

Train datasets
Hello and welcome. In this video, we'll be covering model evaluation. So let's get
started. The goal of regression is to build a model to accurately predict an
unknown case. To this end, we have to perform regression evaluation after building
the model. In this video, we'll introduce and discuss two types of evaluation
approaches that can be used to achieve this goal. These approaches are train and
test on the same dataset and train/test split. We'll talk about what each of these
are, as well as the pros and cons of using each of these models. Also, we'll
introduce some metrics for accuracy of regression models. Let's look at the first
approach. When considering evaluation models, we clearly want to choose the one
that will give us the most accurate results. So, the question is, how can we
calculate the accuracy of our model? In other words, how much can we trust this
model for prediction of an unknown sample using a given dataset and having built a
model such as linear regression? One of the solutions is to select a portion of our
dataset for testing. For instance, assume that we have 10 records in our dataset.
We use the entire dataset for training, and we build a model using this training
set. Now, we select a small portion of the dataset, such as rows six to nine,
but without the labels. This set is called a test set; it has the labels, but
the labels are not used for prediction and serve only as ground truth. The labels
are called the actual values of the test set. Now we pass the feature set of the
testing portion to our built model and predict the target values. Finally, we
compare the predicted values by our model with the actual values in the test set.
This indicates how accurate our model actually is. There are different metrics to
report the accuracy of the model, but most of them work generally based on the
similarity of the predicted and actual values. Let's look at one of the simplest
metrics to calculate the accuracy of our regression model. As mentioned, we just
compare the actual values y with the predicted values, which is noted as y hat for
the testing set. The error of the model is calculated as the average difference
between the predicted and actual values for all the rows. We can write this error
as an equation. So, the first evaluation approach we just talked about is the
simplest one, train and test on the same dataset. Essentially, the name of this
approach says it all. You train the model on the entire dataset, then you test it
using a portion of the same dataset. In a general sense, when you test with a
dataset in which you know the target value for each data point, you're able to
obtain a percentage of accurate predictions for the model. This evaluation approach
would most likely have a high training accuracy and a low out-of-sample accuracy
since the model knows all of the testing data points from the training. What is
training accuracy and out-of-sample accuracy? We said that training and testing on
the same dataset produces a high training accuracy, but what exactly is training
accuracy? Training accuracy is the percentage of correct predictions that the model
makes when using the test dataset. However, a high training accuracy isn't
necessarily a good thing. For instance, having a high training accuracy may result
in overfitting the data. This means that the model is overly trained to the
dataset, which may capture noise and produce a non-generalized model. Out-of-sample
accuracy is the percentage of correct predictions that the model makes on data that
the model has not been trained on. Doing a train and test on the same dataset will
most likely have low out-of-sample accuracy due to the likelihood of being over-
fit. It's important that our models have high out-of-sample accuracy because the
purpose of our model is, of course, to make correct predictions on unknown data.
So, how can we improve out-of-sample accuracy? One way is to use another evaluation
approach called train/test split. In this approach, we select a portion of our
dataset for training, for example, row zero to five, and the rest is used for
testing, for example, row six to nine. The model is built on the training set.
Then, the test feature set is passed to the model for prediction. Finally, the
predicted values for the test set are compared with the actual values of the
testing set. The second evaluation approach is called train/test split. Train/test
split involves splitting the dataset into training and testing sets respectively,
which are mutually exclusive. After which, you train with the training set and test
with the testing set. This will provide a more accurate evaluation on out-of-sample
accuracy because the testing dataset is not part of the dataset that has been used
to train the data. It is more realistic for real-world problems. This means that we
know the outcome of each data point in the dataset, making it great to test with.
Since this data has not been used to train the model, the model has no knowledge of
the outcome of these data points. So, in essence, it's truly out-of-sample testing.
However, please ensure that you train your model with the testing set afterwards,
as you don't want to lose potentially valuable data. The issue with train/test
split is that the result is highly dependent on which rows end up in the training
and testing sets. Train/test split still gives a better estimate of out-of-sample
prediction than training and testing on the same dataset, but it has some problems
due to this dependency. Another evaluation model, called K-fold
cross-validation, resolves most of these issues. How do you fix a high variation
that results from a dependency? Well, you average it. Let me explain the basic
concept of K-fold cross-validation to see how we can solve this problem. The entire
dataset is represented by the points in the image at the top left. If we have K
equals four folds, then we split up this dataset as shown here. In the first fold
for example, we use the first 25 percent of the dataset for testing and the rest
for training. The model is built using the training set and is evaluated using the
test set. Then, in the next round or in the second fold, the second 25 percent of
the dataset is used for testing and the rest for training the model. Again, the
accuracy of the model is calculated. We continue for all folds. Finally, the result
of all four evaluations are averaged. That is, the accuracy of each fold is then
averaged, keeping in mind that each fold is distinct, where no training data in one
fold is used in another. K-fold cross-validation in its simplest form performs
multiple train/test splits, using the same dataset where each split is different.
Then, the results are averaged to produce a more consistent out-of-sample accuracy. We
wanted to show you an evaluation model that addressed some of the issues we've
described in the previous approaches. However, going in-depth with K-fold cross-
validation model is out of the scope for this course. Thanks for watching. (Music)
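As a minimal sketch of these two evaluation approaches, assuming synthetic stand-in data in place of the course dataset:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data: one feature (e.g. engine size) and a
# continuous target (e.g. CO2 emission).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(40, 1))
y = 90 + 45 * X[:, 0] + rng.normal(0, 10, size=40)

# Train/test split: mutually exclusive training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Out-of-sample R^2:", model.score(X_test, y_test))

# K-fold cross-validation (K = 4): four different train/test splits,
# with the score of each fold averaged at the end.
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print("4-fold scores:", scores, "mean:", scores.mean())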

Evaluation Metrics in Regression Models


Hello and welcome. In this video, we'll be covering accuracy metrics for model
evaluation. So let's get started. Evaluation metrics are used to explain the
performance of a model. Let's talk more about the model evaluation metrics that are
used for regression. As mentioned, basically, we can compare the actual values and
predicted values to calculate the accuracy of our regression model. Evaluation
metrics play a key role in the development of a model, as they provide insight into
areas that require improvement. We'll be reviewing a number of model evaluation
metrics, including Mean Absolute Error, Mean Squared Error, and Root Mean Squared
Error, but before we get into defining these, we need to define what an error
actually is. In the context of regression, the error of the model is the difference
between the data points and the trend line generated by the algorithm. Since there
are multiple data points, an error can be determined in multiple ways. Mean
Absolute Error is the mean of the absolute value of the errors. This is the easiest
of the metrics to understand, since it's just the average error. Mean Squared Error
is the mean of the squared error. It's more popular than Mean Absolute Error
because the focus is geared more towards large errors. This is due to the squared
term, which amplifies larger errors far more than smaller ones. Root
Mean Squared Error is the square root of the mean squared error. This is one of the
most popular of the evaluation metrics because Root Mean Squared Error is
interpretable in the same units as the response vector or Y units, making it easy
to relate its information. Relative absolute error, where y bar is the mean value of
y, takes the total absolute error and normalizes it by dividing by the total absolute
error of the simple predictor that always forecasts the mean. Relative squared error
is very similar to relative absolute error, but is widely adopted by the data science
community as it is used for calculating R-squared. R-squared is not an error per se
but is a popular metric for the accuracy of your model. It represents how close the
data values are to the fitted regression line. The higher the R-squared, the better
the model fits your data. Each of these metrics can be used to quantify the accuracy
of your predictions. The choice of metric depends entirely on the type of model, your
data type, and your domain knowledge. Unfortunately, further review is out of scope
of this course. Thanks for watching.

At timestamps 2:01-2:05, an error was made regarding the terminology used. It was
stated that "relative absolute error is also known as the residual sum of square."

The correct information is as follows:

Relative Absolute Error (RAE) is a metric expressed as a ratio normalizing the
absolute error. It measures the average absolute difference between the actual and
predicted values relative to the average absolute difference between the actual
values and their mean.

The formula for RAE:

RAE = Σ|actual - predicted| / Σ|actual - mean|

It's important to note the distinction between Relative Absolute Error (RAE) and
Residual Sum of Squares (RSS):

Relative Absolute Error (RAE): Measures the average absolute difference between
actual and predicted values relative to the average absolute difference between
actual values and their mean.

Residual Sum of Squares (RSS): Calculates the sum of the squared differences
between actual and predicted values.

The formula for RSS:


RSS = Σ(actual - predicted)^2
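For completeness, here is a minimal NumPy sketch that computes the metrics discussed above on a small set of made-up actual and predicted values:

import numpy as np

# Made-up actual and predicted values for a small test set.
y_true = np.array([196.0, 221.0, 136.0, 255.0, 244.0])
y_pred = np.array([200.0, 214.0, 150.0, 248.0, 236.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                  # Mean Absolute Error
mse = np.mean(errors ** 2)                     # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
rae = np.sum(np.abs(errors)) / np.sum(np.abs(y_true - y_true.mean()))
rse = np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - rse                                   # R-squared

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("RAE:", rae, "RSE:", rse, "R^2:", r2)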

Multiple Linear Regression


Hello, and welcome. In this video, we'll be covering multiple linear regression. As
you know, there are two types of linear regression models, simple regression and
multiple regression. Simple linear regression is when one independent variable is
used to estimate a dependent variable. For example, predicting CO_2 emission using
the variable of engine size. In reality, there are multiple variables that predict
the CO_2 emission. When multiple independent variables are present, the process is
called multiple linear regression. For example, predicting CO_2 emission using
engine size and the number of cylinders in the car's engine. Our focus in this
video is on multiple linear regression. The good thing is that multiple linear
regression is the extension of the simple linear regression model. So, I suggest
you go through the simple linear regression video first if you haven't watched it
already. Before we dive into a sample dataset and see how multiple linear
regression works, I want to tell you what kind of problems it can solve, when we
should use it, and specifically, what kind of questions we can answer using it.
Basically, there are two applications for multiple linear regression. First, it can
be used when we would like to identify the strength of the effect that the
independent variables have on the dependent variable. For example, do revision
time, test anxiety, lecture attendance and gender have any effect on exam
performance of students? Second, it can be used to predict the impact of changes,
that is, to understand how the dependent variable changes when we change the
independent variables. For example, if we were reviewing a person's health data, a
multiple linear regression can tell you how much that person's blood pressure goes
up or down for every unit increase or decrease in a patient's body mass index
holding other factors constant. As is the case with simple linear regression,
multiple linear regression is a method of predicting a continuous variable. It uses
multiple variables called independent variables or predictors that best predict the
value of the target variable which is also called the dependent variable. In
multiple linear regression, the target value, Y, is a linear combination of
independent variables, X. For example, you can predict how much CO_2 a car might
emit due to independent variables such as the car's engine size, number of
cylinders, and fuel consumption. Multiple linear regression is very useful because
you can examine which variables are significant predictors of the outcome variable.
Also, you can find out how each feature impacts the outcome variable. Again, as is
the case in simple linear regression, if you manage to build such a regression
model, you can use it to predict the emission amount of an unknown case such as
record number nine. Generally, the model is of the form y hat equals theta zero,
plus theta one x_1, plus theta two x_2 and so on, up to theta n x_n.
Mathematically, we can show it as a vector form as well. This means it can be shown
as a dot product of two vectors; the parameters vector and the feature set vector.
Generally, we can show the equation for a multidimensional space as theta transpose
x, where theta is an n by one vector of unknown parameters in a multi-dimensional
space, and x is the vector of the feature set. Since theta is a vector of
coefficients that is multiplied by x, it is conventionally written as theta
transpose x. Theta is also called the parameter or weight vector of the
regression equation. Both these terms can be used interchangeably, and x is the
feature set which represents a car. For example, x_1 for engine size or x_2 for
cylinders, and so on. The first element of the feature set would be set to one,
because it turns theta zero into the intercept or bias parameter when the
vector is multiplied by the parameter vector. Please notice that theta transpose x
in a one-dimensional space is the equation of a line, it is what we use in simple
linear regression. In higher dimensions when we have more than one input or x the
line is called a plane or a hyperplane, and this is what we use for multiple linear
regression. So, the whole idea is to find the best fit hyperplane for our data. To
this end and as is the case in linear regression, we should estimate the values for
theta vector that best predict the value of the target field in each row. To
achieve this goal, we have to minimize the error of the prediction. Now, the
question is, how do we find the optimized parameters? To find the optimized
parameters for our model, we should first understand what the optimized parameters
are, then we will find a way to optimize the parameters. In short, optimized
parameters are the ones which lead to a model with the fewest errors. Let's assume
for a moment that we have already found the parameter vector of our model, it means
we already know the values of theta vector. Now we can use the model and the
feature set of the first row of our dataset to predict the CO_2 emission for the
first car, correct? If we plug the feature set values into the model equation, we
find y hat. Let's say for example, it returns 140 as the predicted value for this
specific row, what is the actual value? Y equals 196. How different is the
predicted value from the actual value of 196? Well, we can calculate it quite
simply as 196 subtract 140, which of course equals 56. This is the error of our
model only for one row or one car in our case. As is the case in linear regression,
we can say the error here is the distance from the data point to the fitted
regression model. The mean of all residual errors shows how poorly the model
represents the data set; it is called the mean squared error, or MSE.
Mathematically, MSE can be written as (1/n) Σ(y_i - y_hat_i)^2. While this is not the only way to
express the error of a multiple linear regression model, it is one of the most
popular ways to do so. The best model for our data set is the one with minimum
error for all prediction values. So, the objective of multiple linear regression is
to minimize the MSE equation. To minimize it, we should find the best parameters
theta, but how? Okay, how do we find the parameter or coefficients for multiple
linear regression? There are many ways to estimate the value of these coefficients.
However, the most common methods are the ordinary least squares and optimization
approach. Ordinary least squares tries to estimate the values of the coefficients
by minimizing the mean square error. This approach uses the data as a matrix and
uses linear algebra operations to estimate the optimal values for the theta. The
problem with this technique is the time complexity of calculating matrix operations
as it can take a very long time to finish. When the number of rows in your data set
is less than 10,000, you can think of this technique as an option. However, for
greater values, you should try other faster approaches. The second option is to use
an optimization algorithm to find the best parameters. That is, you can use a
process of optimizing the values of the coefficients by iteratively minimizing the
error of the model on your training data. For example, you can use gradient descent
which starts optimization with random values for each coefficient, then calculates
the errors and tries to minimize them by wisely changing the coefficients over
multiple iterations. Gradient descent is a proper approach if you have a large data
set. Please understand however, that there are other approaches to estimate the
parameters of the multiple linear regression that you can explore on your own.
After you find the best parameters for your model, you can go to the prediction
phase. After we found the parameters of the linear equation, making predictions is
as simple as solving the equation for a specific set of inputs. Imagine we are
predicting CO_2 emission or Y from other variables for the automobile in record
number nine. Our linear regression model representation for this problem would be y
hat equals theta transpose x. Once we find the parameters, we can plug them into
the equation of the linear model. For example, let's use theta zero equals 125,
theta one equals 6.2, theta two equals 14, and so on. If we map it to our data set,
we can rewrite the linear model as CO_2 emissions equals 125 plus 6.2 multiplied by
engine size, plus 14 multiplied by cylinder, and so on. As you can see, multiple
linear regression estimates the relative importance of predictors. For example, it
shows cylinder has higher impact on CO_2 emission amounts in comparison with engine
size. Now, let's plug in the ninth row of our data set and calculate the CO_2
emission for a car with the engine size of 2.4. So, CO_2 emission equals 125 plus
6.2 times 2.4, plus 14 times four, and so on. We can predict the CO_2 emission for
this specific car would be 214.1. Now, let me address some concerns that you might
already be having regarding multiple linear regression. As you saw, you can use
multiple independent variables to predict a target value in multiple linear
regression. It sometimes results in a better model compared to using a simple
linear regression which uses only one independent variable to predict the dependent
variable. Now the question is, how many independent variables should we use for the
prediction? Should we use all the fields in our data set? Does adding independent
variables to a multiple linear regression model always increase the accuracy of the
model? Basically, adding too many independent variables without any theoretical
justification may result in an overfit model. An overfit model is a real problem
because it is too complicated for your data set and not general enough to be used
for prediction. So, it is recommended to avoid using many variables for prediction.
There are different ways to avoid overfitting a model
in regression, however that is outside the scope of this video. The next question
is, should independent variables be continuous? Basically, categorical independent
variables can be incorporated into a regression model by converting them into
numerical variables. For example, given a binary variable such as car type, we can
code a dummy variable as zero for manual cars and one for automatic cars. As a last point, remember
that multiple linear regression is a specific type of linear regression. So, there
needs to be a linear relationship between the dependent variable and each of your
independent variables. There are a number of ways to check for linear relationship.
For example, you can use scatter plots and then visually check for linearity. If
the relationship displayed in your scatter plot is not linear, then you need to use
non-linear regression. This concludes our video. Thanks for watching. (Music)
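To make the ordinary least squares approach concrete, here is a minimal NumPy sketch using made-up data; a gradient descent implementation would instead start from random coefficients and update them iteratively.

import numpy as np

# Made-up feature matrix: engine size, cylinders, fuel consumption.
X = np.array([[2.0, 4, 8.5],
              [2.4, 4, 9.6],
              [1.5, 4, 5.9],
              [3.5, 6, 11.1],
              [3.5, 6, 10.6]])
y = np.array([196.0, 221.0, 136.0, 255.0, 244.0])

# Prepend a column of ones so that theta[0] acts as the intercept (bias).
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares estimate of theta, solved as a linear
# least-squares problem (equivalent to the normal equation).
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("theta:", theta)

# A prediction is the dot product theta transpose x.
x_new = np.array([1.0, 2.4, 4, 9.2])
print("Predicted CO2 emission:", x_new @ theta)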

Multiple Linear Regression Video


In the prior video "Multiple Linear Regression", an error occurred at the timestamp
10:00 in calculating Theta T, Predicting with Multiple Linear Regression. Please
see correction below.

The accurate value of Theta T should be as follows:

Theta T value: [62.43, 9.19, 8.70, 9.68]

The corresponding regression equation is:

y_hat = 62.43 + 9.19 * x1 + 8.70 * x2 + 9.68 * x3

Additionally, at the timestamp 10:20,

the correct predicted CO2 emissions (Co2Em) calculation is:

Co2Em = 62.43 + 9.19 * 2.4 + 8.70 * 4 + 9.68 * 9.2

Co2Em = 208.34
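As a quick check of the corrected values above, the prediction is just the dot product of the theta vector with the feature vector (with a leading 1 for the intercept term):

import numpy as np

theta = np.array([62.43, 9.19, 8.70, 9.68])  # corrected Theta T
x = np.array([1.0, 2.4, 4.0, 9.2])           # 1, x1, x2, x3
print(theta @ x)                              # approximately 208.34
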
Introduction to Classification
Hello, in this video, we'll give you an introduction to classification. So let's
get started. In machine learning classification is a supervised learning approach
which can be thought of as a means of categorizing or classifying some unknown
items into a discrete set of classes. Classification attempts to learn the
relationship between a set of feature variables and a target variable of interest.
The target attribute in classification is a categorical variable with discrete
values. So, how does classification and classifiers work? Given a set of training
data points along with the target labels, classification determines the class label
for an unlabeled test case. Let's explain this with an example. A good sample of
classification is the loan default prediction. Suppose a bank is concerned about
the potential for loans not to be repaid. If previous loan default data can be used
to predict which customers are likely to have problems repaying loans, these bad
risk customers can either have their loan application declined or offered
alternative products. The goal of a loan default predictor is to use existing loan
default data which has information about the customers such as age, income,
education, et cetera, to build a classifier, pass a new customer or potential future
defaulter to the model, and then label it, i.e., label the data point as defaulter or
not defaulter, for example, zero or one. This is how a classifier predicts an
unlabeled test case. Please notice that this specific example was about a binary
classifier with two values. We can also build classifier models for both binary
classification and multi-class classification. For example, imagine that you've
collected data about a set of patients, all of whom suffered from the same illness.
During their course of treatment, each patient responded to one of three
medications. You can use this labeled dataset with a classification algorithm to
build a classification model. Then you can use it to find out which drug might be
appropriate for a future patient with the same illness. As you can see, it is a
sample of multi-class classification. Classification has different business use
cases as well. For example, to predict the category to which a customer belongs,
for churn detection where we predict whether a customer switches to another
provider or brand, or to predict whether or not a customer responds to a particular
advertising campaign. Data classification has several applications in a wide
variety of industries. Essentially, many problems can be expressed as associations
between feature and target variables, especially when labelled data is available.
This provides a broad range of applicability for classification. For example,
classification can be used for email filtering, speech recognition, handwriting
recognition, biometric identification, document classification and much more. Here
we have the types of classification algorithms in machine learning. They include
decision trees, Naive Bayes, linear discriminant analysis, K-Nearest Neighbors,
logistic regression, neural networks, and support vector machines. There are many
types of classification algorithms. We will only cover a few in this course. Thanks
for watching. (Music)

K-Nearest Neighbours
Hello and welcome. In this video, we'll be covering the K-Nearest Neighbors
algorithm. So, let's get started. Imagine that a telecommunications provider has
segmented its customer base by service usage patterns, categorizing the customers
into four groups. If demographic data can be used to predict group membership, the
company can customize offers for individual prospective customers. This is a
classification problem. That is, given the dataset with predefined labels, we need
to build a model to be used to predict the class of a new or unknown case. The
example focuses on using demographic data, such as region, age, and marital status
to predict usage patterns. The target field called custcat has four possible values
that correspond to the four customer groups as follows: Basic Service, E Service,
Plus Service, and Total Service. Our objective is to build a classifier. For
example, using rows zero to seven to predict the class of row eight. We will use
a specific type of classification called K-Nearest Neighbor. Just for sake of
demonstration, let's use only two fields as predictors specifically, age and
income, and then plot the customers based on their group membership. Now, let's say
that we have a new customer. For example, record number eight, with a known age and
income. How can we find the class of this customer? Can we find one of the closest
cases and assign the same class label to our new customer? Can we also say that the
class of our new customer is most probably group four, i.e., Total Service, because
its nearest neighbor is also of class four? Yes, we can. In fact, it is the first
nearest neighbor. Now, the question is, to what extent can we trust our judgment
which is based on the first nearest neighbor? It might be a poor judgment
especially if the first nearest neighbor is a very specific case or an outlier,
correct? Now, let's look at our scatter plot again. Rather than choose the first
nearest neighbor, what if we chose the five nearest neighbors and did a majority
vote among them to define the class of our new customer? In this case, we'd see
that three out of five nearest neighbors tell us to go for class three, which is
Plus Service. Doesn't this make more sense? Yes. In fact, it does. In this case,
the value of K in the K-Nearest Neighbors algorithm is five. This example
highlights the intuition behind the K-Nearest Neighbors algorithm. Now, let's
define the K Nearest Neighbors. The K-Nearest Neighbors algorithm is a
classification algorithm that takes a bunch of labeled points and uses them to
learn how to label other points. This algorithm classifies cases based on their
similarity to other cases. In K-Nearest Neighbors, data points that are near each
other are said to be neighbors. K-Nearest Neighbors is based on this paradigm.
Similar cases with the same class labels are near each other. Thus, the distance
between two cases is a measure of their dissimilarity. There are different ways to
calculate the similarity or conversely, the distance or dissimilarity of two data
points. For example, this can be done using Euclidean distance. Now, let's see how
the K-Nearest Neighbors algorithm actually works. In a classification problem, the
K-Nearest Neighbors algorithm works as follows. One, pick a value for K. Two,
calculate the distance from the new case (the holdout) to each of the cases in the
dataset. Three, search for the K observations in the training data that are nearest
to the measurements of the unknown data point. And four, predict the response of
the unknown data point using the most popular response value from the K-Nearest
Neighbors. There are two parts in this algorithm that might be a bit confusing.
First, how to select the correct K and second, how to compute the similarity
between cases, for example, among customers. Let's first start with the second
concern. That is, how can we calculate the similarity between two data points?
Assume that we have two customers, customer one and customer two, and for a moment,
assume that these two customers have only one feature, age. We can easily use a
specific type of Minkowski distance to calculate the distance between these two
customers; it is indeed the Euclidean distance. The distance of X_1 from X_2 is the
square root of (34 minus 30) to the power of two, which is four. What about if we have more than one
feature? For example, age and income. If we have income and age for each customer,
we can still use the same formula but this time, we're using it in a two
dimensional space. We can also use the same distance metric for multidimensional
vectors. Of course, we have to normalize our feature set to get the accurate
dissimilarity measure. There are other dissimilarity measures as well that can be
used for this purpose but, as mentioned, it is highly dependent on data type and also
the domain for which the classification is done. As mentioned, K in K-Nearest
Neighbors is the number of nearest neighbors to examine. It is supposed to be
specified by the user. So, how do we choose the right K? Assume that we want to
find the class of the customer noted as question mark on the chart. What happens if
we choose a very low value of K? Let's say, K equals one. The first nearest point
would be blue, which is class one. This would be a bad prediction, since more of
the points around it are magenta or class four. In fact, since its nearest neighbor
is blue we can say that we capture the noise in the data or we chose one of the
points that was an anomaly in the data. A low value of K causes a highly complex
model as well, which might result in overfitting of the model. It means the
prediction process is not generalized enough to be used for out-of-sample cases.
Out-of-sample data is data that is outside of the data set used to train the model.
In other words, it cannot be trusted to be used for prediction of unknown samples.
It's important to remember that overfitting is bad, as we want a general model that
works for any data, not just the data used for training. Now, on the opposite side
of the spectrum, if we choose a very high value of K such as K equals 20, then the
model becomes overly generalized. So, how can we find the best value for K? The
general solution is to reserve a part of your data for testing the accuracy of the
model. Once you've done so, choose K equals one and then use the training part for
modeling and calculate the accuracy of prediction using all samples in your test
set. Repeat this process increasing the K and see which K is best for your model.
For example, in our case, K equals four will give us the best accuracy. Nearest
neighbors analysis can also be used to compute values for a continuous target. In
this situation, the average or median target value of the nearest neighbors is used
to obtain the predicted value for the new case. For example, assume that you are
predicting the price of a home based on its feature set, such as number of rooms,
square footage, the year it was built, and so on. You can easily find the three
nearest neighbor houses, of course not only based on distance but also based on all
the attributes, and then predict the price of the house as the median of the neighbors' prices.
This concludes this video. Thanks for watching. (Music)
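For illustration, here is a minimal scikit-learn sketch of K-Nearest Neighbors on synthetic stand-in data (random demographic-style features and customer categories), including the feature normalization and the loop over K described above:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features (e.g. age, income) and a
# customer category label in {1, 2, 3, 4}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * [10, 20000] + [40, 60000]
y = rng.integers(1, 5, size=200)

# Normalize the features so that age and income contribute comparably
# to the Euclidean distance.
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Try several values of K and keep the one with the best test accuracy.
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("K =", k, "accuracy =", knn.score(X_test, y_test))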
Evaluation Metrics in Classification
Hello and welcome. In this video, we'll be covering evaluation metrics for
classifiers. Let's get started. Evaluation metrics explain the performance of a
model. Let's talk more about the model evaluation metrics that are used for
classification. Imagine that we have an historical dataset which shows the customer
churn for a telecommunication company. We have trained the model, and now we want
to calculate its accuracy using the test set. We pass the test set to our model,
and we find the predicted labels. Now the question is, how accurate is this model?
Basically, we compare the actual values in the test set with the values predicted
by the model to calculate the accuracy of the model. Evaluation metrics provide a
key role in the development of a model as they provide insight to areas that might
require improvement. There are different model evaluation metrics, but we will just talk
about three of them here, specifically, Jaccard index, F1 score, and log loss.
Let's first look at one of the simplest accuracy measurements, the Jaccard index,
also known as the Jaccard similarity coefficient. Let's say y shows the true labels
of the churn dataset, and y-hat shows the predicted values by our classifier. Then
we can define Jaccard as the size of the intersection divided by the size of the
union of two label sets. For example, for a test set of size 10 with eight correct
predictions or eight intersections, the accuracy by the Jaccard index would be
0.66. If the entire set of predicted labels for a sample strictly matches with the
true set of labels, then the subset accuracy is 1.0, otherwise, it is 0.0. Another
way of looking at accuracy of classifiers is to look at a confusion matrix. For
example, let's assume that our test set has only 40 rows. This matrix shows the
correct and wrong predictions in comparison with the actual labels. Each
confusion matrix row shows the actual true labels in the test set, and the columns
show the predicted labels by classifier. Let's look at the first row. The first row
is for customers whose actual churn value in the test set is one. As you can
calculate, out of 40 customers, the churn value of 15 of them is one, and out of
these 15, the classifier correctly predicted six of them as one, and nine of them
as zero. This means that for six customers, the actual churn value was one in the
test set, and the classifier also correctly predicted those as one. However, while
the actual label of nine customers was one, the classifier predicted those as zero,
which is not very good. We can consider this as an error of the model for the first
row. What about the customers with a churn value 0? Let's look at the second row.
It looks like there were 25 customers whose churn value was zero. The classifier
correctly predicted 24 of them as zero and wrongly predicted one of them as one, so
it has done a good job in predicting the customers with a churn value of zero. A
good thing about the confusion matrix is that it shows the model's ability to
correctly predict or separate the classes. In the specific case of a binary
classifier such as this example, we can interpret these numbers as the count of
true positives, false negatives, true negatives, and false positives. Based on the
count of each section, we can calculate the precision and recall of each label.
Precision is a measure of the accuracy provided that a class label has been
predicted. It is defined by precision equals true positive divided by true positive
plus false positive. Recall is the true positive rate. It is defined as recall
equals true positive divided by true positive plus false negative. We can calculate
the precision and recall of each class. Now we're in the position to calculate the
F1 scores for each label based on the precision and recall of that label. The F1
score is the harmonic average of the precision and recall, where an F1 score
reaches its best value at one, which represents perfect precision and recall, and
its worst at zero. It is a good way to show that a classifier has a good value for
both recall and precision. It is defined using the F1 score equation. For example,
the F1 score for Class 0, i.e., churn equals zero, is 0.83, and the F1 score for Class
1, i.e., churn equals one, is 0.55. Finally, we can say the average F1 score for this
classifier is the average of the F1 scores for both labels, which is 0.69 in our
case. Please notice that both Jaccard and F1 score can be used for multiclass
classifiers as well, which is out of scope for this course. Now, let's look at
another accuracy metric for classifiers. Sometimes the output of a classifier is
the probability of a class label instead of the label. For example, in logistic
regression, the output can be the probability of customer churn, i.e., yes, or equal
to one. This probability is a value between zero and one. Logarithmic loss, also
known as log loss, measures the performance of a classifier where the predicted
output is a probability value between zero and one. For example, predicting a
probability of 0.13 when the actual label is one would be bad, and would result in
a high log loss. We can calculate the log loss for each row using the log loss
equation, which measures how far each prediction is from the actual label. Then we
calculate the average log loss across all rows of the test set. It is obvious that
ideal classifiers have progressively smaller values of log loss, so the classifier
with the lower log loss has better accuracy. Thanks for watching.
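To tie these metrics together, here is a minimal scikit-learn sketch using made-up labels arranged to match the confusion-matrix counts discussed above (15 actual ones with 6 predicted correctly, 25 actual zeros with 24 predicted correctly):

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, log_loss

y_true = np.array([1] * 15 + [0] * 25)
y_pred = np.array([1] * 6 + [0] * 9 + [0] * 24 + [1] * 1)

# Rows: actual labels (1 then 0); columns: predicted labels (1 then 0).
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))

# Jaccard index as defined above: size of the intersection of the label
# sets divided by the size of their union.
correct = np.sum(y_true == y_pred)
print("Jaccard:", correct / (2 * len(y_true) - correct))

# Per-class F1 scores (class 0 first, then class 1).
print("F1 per class:", f1_score(y_true, y_pred, average=None))

# Log loss works on predicted probabilities rather than hard labels;
# these toy probabilities are only for illustration.
y_prob = np.where(y_pred == 1, 0.9, 0.1)
print("Log loss:", log_loss(y_true, y_prob))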

Introduction to Decision Trees


Hello and welcome. In this video, we're going to introduce and examine decision
trees. So let's get started. What exactly is a decision tree? How do we use them to
help us classify? How can I grow my own decision tree? These may be some of the
questions that you have in mind from hearing the term decision tree. Hopefully,
you'll soon be able to answer these questions and many more by watching this video.
Imagine that you're a medical researcher compiling data for a study. You've already
collected data about a set of patients all of whom suffered from the same illness.
During their course of treatment, each patient responded to one of two medications.
We call them drug A and drug B. Part of your job is to build a model to find out
which drug might be appropriate for a future patient with the same illness. The
feature sets of this dataset are age, gender, blood pressure, and cholesterol of
our group of patients and the target is the drug that each patient responded to. It
is a sample binary classification problem, and you can use the training part of the data
set to build a decision tree and then use it to predict the class of an unknown
patient. In essence, to come up with a decision on which drug to prescribe to a new
patient. Let's see how a decision tree is built for this dataset. Decision trees
are built by splitting the training set into distinct nodes, where one node
contains all of or most of one category of the data. If we look at the diagram
here, we can see that it's a patient classifier. So as mentioned, we want to
prescribe a drug to a new patient, but the decision to choose drug A or B will be
influenced by the patient's situation. We start with age, which can be young,
middle aged or senior. If the patient is middle aged, then we'll definitely go for
drug B. On the other hand, if we have a young or a senior patient, we'll need more
details to help us determine which drug to prescribe. The additional decision
variables can be things such as cholesterol levels, gender or blood pressure. For
example, if the patient is female, then we will recommend drug A, but if the
patient is male, then we will go for drug B. As you can see, decision trees are about
testing an attribute and branching the cases based on the result of the test. Each
internal node corresponds to a test, and each branch corresponds to a result of the
test, and each leaf node assigns a patient to a class. Now the question is, how can
we build such a decision tree? Here is the way that a decision tree is built. A
decision tree can be constructed by considering the attributes one by one. First,
choose an attribute from our dataset. Calculate the significance of the attribute
in the splitting of the data. In the next video, we will explain how to calculate
the significance of an attribute to see if it's an effective attribute or not.
Next, split the data based on the value of the best attribute, then go to each
branch and repeat it for the rest of the attributes. After building this tree, you
can use it to predict the class of unknown cases; or in our case, the proper drug
for a new patient based on his or her characteristics. This concludes this video.
Thanks for watching. (Music)
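For illustration, here is a minimal scikit-learn sketch of this procedure on a tiny made-up drug dataset; the feature values, encodings, and labels are invented purely to show the mechanics:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny, made-up version of the drug dataset described above.
df = pd.DataFrame({
    "Age":         ["young", "middle", "senior", "young", "senior", "middle"],
    "Sex":         ["F", "M", "F", "M", "M", "F"],
    "BP":          ["high", "normal", "high", "normal", "high", "normal"],
    "Cholesterol": ["high", "normal", "normal", "high", "high", "normal"],
    "Drug":        ["A", "B", "A", "B", "A", "B"],
})

# Encode the categorical features as integers for scikit-learn.
X = df[["Age", "Sex", "BP", "Cholesterol"]].apply(LabelEncoder().fit_transform)
y = df["Drug"]

# Grow a tree that, at each split, picks the attribute giving the
# largest decrease in entropy (covered in the next video).
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Predict the drug for a patient (here, the first encoded row).
print(tree.predict(X.iloc[[0]]))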

Building Decision Trees


Hello and welcome. In this video, we'll be covering the process of building
decision trees. So, let's get started. Consider the drug data set again. The
question is, how do we build a decision tree based on that data set? Decision trees
are built using recursive partitioning to classify the data. Let's say we have 14
patients in our data set; the algorithm chooses the most predictive feature to
split the data on. What is important in making a decision tree is determining
which attribute is the best, that is, the most predictive one to split on.
Let's say we pick cholesterol as the first attribute; it will split
our data into two branches. As you can see, if the patient has high cholesterol we
cannot say with high confidence that drug B might be suitable for him. Also, if the
patient's cholesterol is normal, we still don't have sufficient evidence or
information to determine if either drug A or drug B is in fact suitable. This is an
example of a poor attribute choice for splitting the data. So, let's try another
attribute. Again, we have our 14 cases; this time we pick the sex attribute of
patients. It will split our data into two branches, male and female. As you can
see, if the patient is female, we can say drug B might be suitable for her with
high certainty. But if the patient is male, we don't have sufficient evidence or
information to determine if drug A or drug B is suitable. However, it is still a
better choice in comparison with the cholesterol attribute because the resulting
nodes are more pure, that is, nodes where the patients are mostly drug A or mostly drug B.
So, we can say the sex attribute is more significant than cholesterol, or in other
words it's more predictive than the other attributes. Indeed, predictiveness is
based on decrease in impurity of nodes. We're looking for the best feature to
decrease the impurity of patients in the leaves, after splitting them up based on
that feature. So, the sex feature is a good candidate here because it produces
nearly pure nodes. Let's go one step further. For the male patient
branch, we again test other attributes to split the sub-tree. We test cholesterol
again here, as you can see it results in even more pure leaves. So we can easily
make a decision here. For example, if a patient is male and his cholesterol is
high, we can certainly prescribe drug A, but if it is normal, we can prescribe drug
B with high confidence. As you might notice, the choice of attribute to split data
is very important and it is all about purity of the leaves after the split. A node
in the tree is considered pure if 100 percent of the cases in it fall into
a specific category of the target field. In fact, the method uses recursive
partitioning to split the training records into segments by minimizing the impurity
at each step. Impurity of nodes is calculated by entropy of data in the node. So,
what is entropy? Entropy is the amount of information disorder or the amount of
randomness in the data. The entropy in the node depends on how much random data is
in that node and is calculated for each node. In decision trees, we're looking for
trees that have the smallest entropy in their nodes. The entropy is used to
calculate the homogeneity of the samples in that node. If the samples are
completely homogeneous, the entropy is zero and if the samples are equally divided
it has an entropy of one. This means if all the data in a node are either drug A or
drug B, then the entropy is zero, but if half of the data are drug A and the other half
are B then the entropy is one. You can easily calculate the entropy of a node using
the frequency table of the attribute through the entropy formula, entropy = -p(A) log2 p(A) - p(B) log2 p(B),
where p is the proportion or ratio of a category, such as drug A or B. Please remember though that
you don't have to calculate these as it's easily calculated by the libraries or
packages that you use. As an example, let's calculate the entropy of the data set
before splitting it. We have nine occurrences of drug B and five of drug A. You can
embed these numbers into the entropy formula to calculate the impurity of the
target attribute before splitting it. In this case, it is 0.94. So, what is entropy
after splitting? Now, we can test different attributes to find the one with the
most predictiveness, which results in two more pure branches. Let's first select
the cholesterol of the patient and see how the data gets split based on its values.
For example, when it is normal we have six for drug B, and two for drug A. We can
calculate the entropy of this node based on the distribution of drug A and B which
is 0.81 in this case. But, when cholesterol is high, the data is split into three
for drug B and three for drug A. Calculating its entropy, we can see it would be
1.0. We should go through all the attributes and calculate the entropy after the
split and then choose the best attribute. Okay. Let's try another field. Let's
choose the sex attribute for the next check. As you can see, when we use the sex
attribute to split the data, when its value is female, we have three patients that
responded to drug B and four patients that responded to drug A. The entropy for
this node is 0.98 which is not very promising. However, on the other side of the
branch, when the value of the sex attribute is male, the result is more pure with
six for drug B and only one for drug A. The entropy for this group is 0.59. Now,
the question is between the cholesterol and sex attributes which one is a better
choice? Which one is better as the first attribute to divide the dataset into two
branches? Or in other words, which attribute results in more pure nodes for our
drugs? Or in which tree do we have less entropy after splitting rather than before
splitting? The sex attribute with entropy of 0.98 and 0.59 or the cholesterol
attribute with entropy of 0.81 and 1.0 in its branches. The answer is the tree
with the higher information gain after splitting. So, what is information gain?
Information gain is the information that can increase the level of certainty after
splitting. It is the entropy of a tree before the split minus the weighted entropy
after the split by an attribute. We can think of information gain and entropy as
opposites. As entropy or the amount of randomness decreases, the information gain
or amount of certainty increases and vice versa. So, constructing a decision tree
is all about finding attributes that return the highest information gain. Let's see
how information gain is calculated for the sex attribute. As mentioned, the
information gained is the entropy of the tree before the split minus the weighted
entropy after the split. The entropy of the tree before the split is 0.94, the
portion of female patients is seven out of 14 and its entropy is 0.985. Also, the
portion of men is seven out of 14 and the entropy of the male node is 0.592. The
result inside the square brackets is the weighted entropy after the split. So, the
information gain of the tree if we use the sex attribute to split the data set is
0.151. As you can see, we consider the entropy of the distribution of samples
falling under each leaf node and take an average of those entropies, weighted by
the proportion of samples falling under each leaf. We can
calculate the information gain of the tree if we use cholesterol as well. It is
about 0.048. Now, the question is, which attribute is more suitable? Well, as mentioned,
it is the tree with the higher information gain after splitting, which means the sex
attribute. So, we select the sex attribute as the first splitter. Now, what is the
next attribute after branching by the sex attribute? Well, as you can guess, we
should repeat the process for each branch and test each of the other attributes to
continue to reach the most pure leaves. This is the way you build a decision tree.
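To make the arithmetic above concrete, here is a small Python sketch that reproduces the numbers from this example: an entropy of about 0.94 before the split, branch entropies of about 0.985 and 0.592 for the sex attribute, an information gain of about 0.15 for sex, and about 0.048 for cholesterol. The entropy helper is written by hand purely for illustration; as noted earlier, the libraries compute this for you.

import math

def entropy(counts):
    # Entropy of a node from its class counts, e.g. [9, 5] for nine drug B and five drug A.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])                          # entropy before any split, ~0.940

# Split on sex: the female branch holds 3 B / 4 A, the male branch holds 6 B / 1 A.
female, male = entropy([3, 4]), entropy([6, 1])   # ~0.985 and ~0.592
gain_sex = before - ((7 / 14) * female + (7 / 14) * male)    # ~0.15

# Split on cholesterol: the normal branch holds 6 B / 2 A, the high branch 3 B / 3 A.
normal, high = entropy([6, 2]), entropy([3, 3])   # ~0.811 and 1.0
gain_chol = before - ((8 / 14) * normal + (6 / 14) * high)   # ~0.048

print(round(gain_sex, 3), round(gain_chol, 3))    # the sex attribute has the higher gain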
Thanks for watching. (Music)

Intro to Logistic Regression


Hello and welcome. In this video, we'll learn a machine learning method called
Logistic Regression which is used for classification. In examining this method,
we'll specifically answer these three questions. What is logistic regression? What
kind of problems can be solved by logistic regression? In which situations do we
use logistic regression? So let's get started. Logistic regression is a statistical
and machine learning technique for classifying records of a dataset based on the
values of the input fields. Let's say we have a telecommunication dataset that we'd
like to analyze in order to understand which customers might leave us next month.
This is historical customer data where each row represents one customer. Imagine
that you're an analyst at this company and you have to find out who is leaving and
why? You'll use the dataset to build a model based on historical records and use it
to predict the future churn within the customer group. The dataset includes
information about services that each customer has signed up for, customer account
information, demographic information about customers like gender and age range,
and also whether each customer left the company within the last month; this column
is called churn. We can use logistic regression to build a model for predicting customer
churn using the given features. In logistic regression, we use one or more
independent variables such as tenure, age, and income to predict an outcome, such
as churn, which we call the dependent variable representing whether or not
customers will stop using the service. Logistic regression is analogous to linear
regression but tries to predict a categorical or discrete target field instead of a
numeric one. In linear regression, we might try to predict a continuous value of
variables such as the price of a house, blood pressure of a patient, or fuel
consumption of a car. But in logistic regression, we predict a variable which is
binary such as yes/no, true/false, successful or not successful, pregnant/not
pregnant, and so on, all of which can be coded as zero or one. In logistic
regression independent variables should be continuous. If categorical, they should
be dummy or indicator coded. This means we have to transform them to some
continuous value. Please note that logistic regression can be used for both binary
classification and multi-class classification. But for simplicity in this video,
we'll focus on binary classification. Let's examine some applications of logistic
regression before we explain how they work. As mentioned, logistic regression is a
type of classification algorithm, so it can be used in different situations. For
example, to predict the probability of a person having a heart attack within a
specified time period, based on our knowledge of the person's age, sex, and body
mass index. Or to predict the chance of mortality in an injured patient or to
predict whether a patient has a given disease such as diabetes based on observed
characteristics of that patient such as weight, height, blood pressure, and results
of various blood tests and so on. In a marketing context, we can use it to predict
the likelihood of a customer purchasing a product or halting a subscription as
we've done in our churn example. We can also use logistic regression to predict the
probability of failure of a given process, system or product. We can even use it to
predict the likelihood of a homeowner defaulting on a mortgage. These are all good
examples of problems that can be solved using logistic regression. Notice that in
all these examples not only do we predict the class of each case, we also measure
the probability of a case belonging to a specific class. There are different
machine learning algorithms which can classify or estimate a variable. The question is, when
should we use logistic regression? Here are four situations in which logistic
regression is a good candidate. First, when the target field in your data is
categorical or specifically is binary. Such as zero/one, yes/no, churn or no churn,
positive/negative and so on. Second, you need the probability of your prediction.
For example, if you want to know what the probability is of a customer buying a
product. Logistic regression returns a probability score between zero and one for a
given sample of data. In fact, logistic regression predicts the probability of that
sample and we map the cases to a discrete class based on that probability. Third,
if your data is linearly separable. The decision boundary of logistic regression is
a line or a plane or a hyper plane. A classifier will classify all the points on
one side of the decision boundary as belonging to one class and all those on the
other side as belonging to the other class. For example, if we have just two
features and are not applying any polynomial processing we can obtain an inequality
like Theta_0 + Theta_1x_1 + Theta_2x_2 > 0, which is a
half-plane, easily plottable. Please note that in using logistic regression, we can
also achieve a complex decision boundary using polynomial processing as well, which
is out of scope here. You'll get more insight from decision boundaries when you
understand how logistic regression works. Fourth, you need to understand the impact
of a feature. You can select the best features based on the statistical
significance of the logistic regression model coefficients or parameters. That is,
after finding the optimum parameters, a feature x_1 whose weight Theta_1 is close
to zero has a smaller effect on the prediction than a feature whose weight has a
large absolute value. Indeed, it allows us to understand the impact an independent
variable has on the dependent variable while controlling other independent
variables. Let's look at our dataset again. We defined the independent variables as
X and dependent variable as Y. Notice, that for the sake of simplicity we can code
the target or dependent values to zero or one. The goal of logistic regression is
to build a model to predict the class of each sample which in this case is a
customer, as well as the probability of each sample belonging to a class. Given
that, let's start to formalize the problem. X is our dataset in the space of real
numbers of m by n. That is, of m dimensions or features and n records, and Y is the
class that we want to predict, which can be either zero or one. Ideally, a logistic
regression model, denoted Y hat, can predict the probability that the class of the customer is
one, given its features X. It can also be shown quite easily that the probability
of a customer being in class zero can be calculated as one minus the probability
that the class of the customer is one. Thanks for watching this video. (Music)
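As a minimal, illustrative sketch of this idea in code, the snippet below fits scikit-learn's LogisticRegression on a handful of made-up customers. The feature names (tenure, age, income in thousands of dollars) and the values are assumptions for demonstration only, not the course's churn dataset; the key point is that predict gives the class, while predict_proba gives the probability of belonging to each class.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one customer: [tenure in months, age, income in thousands].
X = np.array([[ 2, 25, 30],
              [48, 52, 80],
              [ 5, 31, 42],
              [36, 45, 70],
              [ 1, 23, 28],
              [60, 58, 90]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = churned, 0 = stayed

model = LogisticRegression()
model.fit(X, y)

new_customer = np.array([[10, 35, 50]])
print(model.predict(new_customer))         # the predicted class, e.g. [1]
print(model.predict_proba(new_customer))   # [[P(churn = 0), P(churn = 1)]]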

Logistic regression vs Linear regression


Hello, and welcome. In this video, we will learn the difference between linear
regression and logistic regression. We go over linear regression and see why it
cannot be used properly for some binary classification problems. We also look at
the sigmoid function, which is the main part of logistic regression. Let's start.
Let's look at the telecommunication dataset again. The goal of logistic regression
is to build a model to predict the class of each customer and also the probability
of each sample belonging to a class. Ideally, we want to build a model y hat that
can estimate the probability that the class of a customer is one, given its features x. I want to
emphasize that y is the labels vector, also called actual values that we would like
to predict, and y hat is the vector of the predicted values by our model. Mapping
the class labels to integer numbers, can we use linear regression to solve this
problem? First, let's recall how linear regression works to better understand
logistic regression. Forget about the churn prediction for a minute and assume our
goal is to predict the income of customers in the dataset. This means that instead
of predicting churn, which is a categorical value, let's predict income, which is a
continuous value. How can we do this? Let's select an independent variable, such as
customer age and predict a dependent variable, such as income. Of course, we can
have more features, but for the sake of simplicity, let's just take one feature
here. We can plot it and show age as an independent variable, and income as the
target value we would like to predict. With linear regression, you can fit a line
or polynomial through the data. We can find this line through training our model or
calculating it mathematically, based on the sample sets. We'll say this is a
straight line through the sample set. This line has an equation shown as a+bx_1.
Now, use this line to predict the continuous value y. That is, use this line to
predict the income of an unknown customer based on his or her age, and it is done.
What if we want to predict churn? Can we use the same technique to predict a
categorical field such as churn? Let's see. Say we're given data on customer churn,
and our goal this time is to predict the churn of customers based on their age. We
have a feature, age denoted as x_1, and a categorical feature, churn with two
classes, churn is yes, and churn is no. As mentioned, we can map yes and no to
integer values zero and one. How can we model it now? Well, graphically, we could
represent our data with a scatter plot, but this time, we have only two values for
the y-axis. In this plot, Class 0 is denoted in red, and Class 1 is denoted in
blue. Our goal here is to make a model based on existing data to predict if a new
customer is red or blue. Let's do the same technique that we used for linear
regression here to see if we can solve the problem for a categorical attribute such
as churn. With linear regression, you, again, can fit a polynomial through the
data, which is shown traditionally as a+bx. This polynomial can also be written
as Theta_0+Theta_1x_1. This line has two parameters which are shown
with vector Theta, where the values of the vector are Theta_0 and Theta_1. We can
also show the equation of this line formally as Theta transpose X. Generally, we
can show the equation for a multidimensional space, as Theta transpose X, where
Theta is the parameters of the line in two-dimensional space or parameters of a
plane in three-dimensional space, and so on. As Theta is a vector of parameters and
is supposed to be multiplied by x, it is shown conventionally as Theta transpose.
Theta is also called the weight vector or the confidences of the equation, with both
these terms used interchangeably, and x is the feature set, which represents a
customer. Anyway, given a dataset, all the feature sets, x, Theta parameters can be
calculated through an optimization algorithm or mathematically, which results in
the equation of the fitting line. For example, the parameters of this line are -1
and 0.1, and the equation for the line is -1+0.1x_1.
Now, we can use this regression line to predict the churn of the new customer. For
example, for our customer, or let's say a data point with x value of age equals 13,
we can plug the value into the line formula and the y value is calculated and
returns a number. For instance, for the point p_1, we have Theta transpose x =
-1+0.1*x_1 = -1+0.1*13 = 0.3. We can show it on our graph. Now we can define a
threshold here. For example, at 0.5 to define the class. We write a rule here for
our model, y hat, which allows us to separate class 0 from class 1. If the value of
Theta transpose x is less than 0.5, then the class is zero. Otherwise, if the value
of Theta transpose x is more than 0.5, then the class is one. Because our
customer's y value is less than the threshold, we can say it belongs to class 0
based on our model. But there is one problem here. What is the probability that
this customer belongs to class 0? As you can see, it's not the best model to solve
this problem. Also, there are some other issues which verify that linear regression
is not the proper method for classification problems. As mentioned, if we use the
regression line to calculate the class of a point, it always returns a number such
as 3 or -2, and so on. Then we should use a threshold. For example, 0.5 to assign
that point to either class of zero or one. This threshold works as a step function
that outputs zero or one regardless of how big or small, positive or negative the
input is. Using the threshold, we can find the class of a record. Notice that in
the step function, no matter how big the value is as long as it's greater than 0.5,
it simply equals one and vice versa, regardless of how small the value y is, the
output would be zero if it is less than 0.5. In other words, there is no difference
between a customer who has a value of one or 1,000. The outcome would be one.
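Here is a tiny sketch of the thresholding just described, assuming the example parameters -1 and 0.1. It shows that the raw line output is an unbounded number and that the hard 0.5 threshold throws away any sense of probability: outputs of 3 and 19 both collapse to class one.

import numpy as np

theta = np.array([-1.0, 0.1])            # [Theta_0, Theta_1] from the example line

def step_predict(age, threshold=0.5):
    # Linear-regression output pushed through a hard 0.5 threshold.
    y = theta[0] + theta[1] * age        # Theta transpose x
    return y, int(y >= threshold)

for age in [13, 16, 40, 200]:
    y, cls = step_predict(age)
    print(f"age={age:>3}  output={y:5.1f}  class={cls}")
# Age 13 gives 0.3 and class 0, exactly as in the example; ages 40 and 200 both
# map to class 1 even though their raw outputs differ wildly.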
Instead of having this step function, wouldn't it be nice if we had a smoother line,
one that would project these values between 0 and 1? Indeed, the existing method
does not really give us the probability of a customer belonging to a class, which
is very desirable. We need a method that can give us the probability of falling in
a class as well. What is the scientific solution here? Well, if instead of using
Theta transpose x, we use a specific function called sigmoid. Then sigmoid of Theta
transpose x gives us the probability of a point belonging to a class instead of the
value of y directly. I'll explain this sigmoid function in a second. But for now,
please accept that it will do the trick. Instead of calculating the value of Theta
transpose x directly, it returns the probability that a Theta transpose x is very
big or very small. It always returns a value between 0 and 1 depending on how large
the Theta transpose x actually is. Now, our model is sigmoid of Theta transpose x,
which represents the probability that the output is one given x. Now, the question
is, what is the sigmoid function? Let me explain in detail what sigmoid really is.
The sigmoid function, also called the logistic function resembles the step function
and is used by the following expression in the logistic regression. The sigmoid
function looks a bit complicated at first, but don't worry about remembering this
equation. It'll make sense to you after working with it. Notice that in the sigmoid
equation, when Theta transpose x gets very big, the e power minus theta transpose x
in the denominator of the fraction becomes almost zero and the value of the sigmoid
function gets closer to one. If Theta transpose x is very small, the sigmoid
function gets closer to zero. Depicting on the sigmoid plot, when Theta transpose x
gets bigger, the value of the sigmoid function gets closer to one. Also, if the
Theta transpose x is very small, the sigmoid function gets closer to zero. The
sigmoid function's output is always between 0 and 1, which makes it proper to
interpret the results as probabilities. It is obvious that when the outcome of the
sigmoid function gets closer to 1, the probability that y = 1 given x goes up. In
contrast, when the sigmoid value is closer to zero, the probability that y = 1 given x
is very small. What is the output of our model when we use the sigmoid function? In
logistic regression, we model the probability that an input x belongs to the
default class, y=1, and we can write this formally as the probability that y = 1, given x. We
can also write the probability that y belongs to class 0, given x, as 1 minus the probability that y = 1,
given x. For example, the probability of a customer churning (that is, leaving the company) can be
shown as probability of churn equals 1, given a customer's income and age, which
can be, for instance, 0.8. The probability that churn is zero for the same customer,
given the customer's income and age, can then be calculated as 1-0.8 = 0.2.
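A short numerical sketch of the sigmoid may help here. The function below is simply the logistic function written in NumPy; plugging in a few values of Theta transpose x shows how large positive inputs are squashed towards one, large negative inputs towards zero, and an input of zero gives exactly 0.5, which is what lets us read the output as a probability.

import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) function: 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:            # a few sample values of Theta transpose x
    print(z, round(float(sigmoid(z)), 4))
# -10 -> ~0.0,  -2 -> ~0.12,  0 -> 0.5,  2 -> ~0.88,  10 -> ~1.0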
Now our job is to train the model to set its parameter values in such a way that
our model is a good estimate of the probability that y = 1, given x. In fact, this is what a
good classifier model built by logistic regression is supposed to do for us. Also,
it should be a good estimate of the probability that y belongs to class 0, given x, which can
be shown as 1 - sigmoid(Theta^T x). Now, the question is, how can we achieve this?
We can find Theta through the training process. Let's see what the training process
is. Step 1, initialize Theta vector with random values as with most machine
learning algorithms. For example, -1 or 2. Step 2, calculate the model output,
which is sigmoid of Theta^T x for a sample customer in your training set. Here, x in
Theta^T x is the customer's feature vector values, for example, the age and income of the
customer, for instance, two and five, and Theta is the confidence or weight vector that you've
set in the previous step. The output of this equation is the prediction value. In
other words, the probability that the customer belongs to class 1. Step 3, compare
the output of our model, y hat, which could be a value of, let's say, 0.9997 with
the actual label of the customer, which is, for example, one for churn. Then record
the difference as our model's error for this customer, which would be 1-0.9997,
which, of course, equals 0.0003. This is the error for only one customer out of all
the customers in the training set. Step 4. Calculate the error for all customers as
we did in the previous steps and add up these errors. The total error is the cost
of your model and is calculated by the model's cost function. The cost function, by
the way, basically represents how to calculate the error of the model, which is the
difference between the actual and the model's predicted values. The cost shows how
poorly the model is estimating the customer's labels. Therefore, the lower the
cost, the better the model is at estimating the customer's labels correctly. What
we want to do is to try to minimize this cost. Step 5. But because the initial
values for Theta were chosen randomly, it's very likely that the cost function is
very high. We change Theta in such a way as to hopefully reduce the total cost.
Step 6. After changing the values of Theta, we go back to step 2. Then we start
another iteration and calculate the cost of the model again, and we keep doing
those steps over and over, changing the values of Theta each time until the cost is
low enough. This brings up two questions. First, how can we change the values of
Theta so that the cost is reduced across iterations? Second, when should we stop
the iterations? There are different ways to change the values of Theta, but one of
the most popular ways is gradient descent. Also, there are various ways to stop
iterations. But essentially, you stop training by calculating the accuracy of your
model at each iteration and stopping when it is satisfactory. Thanks for watching this video.
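To tie the six steps together, here is a minimal NumPy sketch of this training loop using gradient descent and the log loss cost. The toy customers, the learning rate, the iteration count, and the stopping threshold are all illustrative assumptions; in practice a library such as scikit-learn performs this optimization for you.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: two scaled features per customer (say, age and income) and a churn label.
X = np.array([[0.2, 0.5], [0.9, 0.8], [0.3, 0.4], [0.8, 0.9], [0.1, 0.3], [0.7, 0.7]])
y = np.array([1, 0, 1, 0, 1, 0])
X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend a column of 1s for Theta_0

# Step 1: initialize Theta with random values.
rng = np.random.default_rng(0)
theta = rng.normal(size=X.shape[1])

learning_rate = 0.5
for iteration in range(1000):
    # Step 2: model output sigmoid(Theta^T x) for every customer.
    y_hat = sigmoid(X @ theta)

    # Steps 3 and 4: compare predictions with the actual labels and add the errors
    # up with the log loss (cross-entropy) cost function.
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Step 5: change Theta in the direction that reduces the cost (gradient descent).
    gradient = X.T @ (y_hat - y) / len(y)
    theta -= learning_rate * gradient

    # Step 6: go back to step 2, and stop once the cost is low enough
    # (an illustrative threshold) or the iteration budget runs out.
    if cost < 0.05:
        break

print("final cost:", round(float(cost), 4), "Theta:", np.round(theta, 3))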
