Model Evaluation
Hello and welcome. In this video, we'll be covering model evaluation. So let's get
started. The goal of regression is to build a model to accurately predict an
unknown case. To this end, we have to perform regression evaluation after building
the model. In this video, we'll introduce and discuss two types of evaluation
approaches that can be used to achieve this goal. These approaches are train and
test on the same dataset and train/test split. We'll talk about what each of these
are, as well as the pros and cons of using each of these models. Also, we'll
introduce some metrics for accuracy of regression models. Let's look at the first
approach. When considering evaluation models, we clearly want to choose the one
that will give us the most accurate results. So, the question is, how can we
calculate the accuracy of our model? In other words, how much can we trust this
model for prediction of an unknown sample using a given dataset and having built a
model such as linear regression? One of the solutions is to select a portion of our
dataset for testing. For instance, assume that we have 10 records in our dataset.
We use the entire dataset for training, and we build a model using this training
set. Now, we select a small portion of the dataset, such as rows six to nine, and set their labels aside. This portion is called the test set; it does have labels, but the labels are not given to the model for prediction and are used only as ground truth. The labels
are called actual values of the test set. Now we pass the feature set of the
testing portion to our built model and predict the target values. Finally, we
compare the predicted values by our model with the actual values in the test set.
This indicates how accurate our model actually is. There are different metrics to
report the accuracy of the model, but most of them work generally based on the
similarity of the predicted and actual values. Let's look at one of the simplest
metrics to calculate the accuracy of our regression model. As mentioned, we just
compare the actual values, y, with the predicted values, which are denoted y-hat, for
the testing set. The error of the model is calculated as the average difference between the predicted and actual values across all rows. We can write this error as an equation; one common form is the mean absolute error sketched below.
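Here, y_j is the actual value and y-hat_j the predicted value for row j of the n rows in the test set; treating the "average difference" as the mean absolute error is a representative choice rather than something the video pins down:

```latex
% Mean absolute error over the n rows of the test set.
\[
  \text{Error} = \frac{1}{n} \sum_{j=1}^{n} \left| y_j - \hat{y}_j \right|
\]
```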
So, the first evaluation approach we just talked about is the
simplest one, train and test on the same dataset. Essentially, the name of this
approach says it all. You train the model on the entire dataset, then you test it
using a portion of the same dataset. In a general sense, when you test with a
dataset in which you know the target value for each data point, you're able to
obtain a percentage of accurate predictions for the model. This evaluation approach
would most likely have a high training accuracy and a low out-of-sample accuracy
since the model knows all of the testing data points from the training. What is
training accuracy and out-of-sample accuracy? We said that training and testing on
the same dataset produces a high training accuracy, but what exactly is training
accuracy? Training accuracy is the percentage of correct predictions that the model makes when tested on the same data it was trained on. However, a high training accuracy isn't
necessarily a good thing. For instance, having a high training accuracy may result
in overfitting the data. This means that the model is overly trained to the
dataset, which may capture noise and produce a non-generalized model. Out-of-sample
accuracy is the percentage of correct predictions that the model makes on data that
the model has not been trained on. Doing a train and test on the same dataset will
most likely have low out-of-sample accuracy due to the likelihood of being over-
fit. It's important that our models have high out-of-sample accuracy because the
purpose of our model is, of course, to make correct predictions on unknown data.
So, how can we improve out-of-sample accuracy? One way is to use another evaluation
approach called train/test split. In this approach, we select a portion of our
dataset for training, for example, row zero to five, and the rest is used for
testing, for example, row six to nine. The model is built on the training set.
Then, the test feature set is passed to the model for prediction. Finally, the
predicted values for the test set are compared with the actual values of the
testing set. The second evaluation approach is called train/test split. Train/test
split involves splitting the dataset into training and testing sets respectively,
which are mutually exclusive. After which, you train with the training set and test
with the testing set. This will provide a more accurate evaluation on out-of-sample
accuracy because the testing dataset is not part of the dataset that has been used
to train the model. It is more realistic for real-world problems. Because the testing set is labeled, we know the outcome of each of its data points, making it great to test with.
Since this data has not been used to train the model, the model has no knowledge of
the outcome of these data points. So, in essence, it's truly out-of-sample testing.
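As an illustration of this approach, here is a minimal sketch using scikit-learn. The synthetic single-feature data, the choice of linear regression, and the 80/20 split ratio are assumptions made for the example, not details given in the video:

```python
# Minimal sketch of the train/test split approach (assumed synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical regression data: one feature and a noisy linear target.
rng = np.random.default_rng(42)
X = rng.uniform(1.0, 5.0, size=(100, 1))
y = 50 * X[:, 0] + rng.normal(0, 10, size=100)

# Split into mutually exclusive training and testing sets (80/20 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build the model on the training set only.
model = LinearRegression().fit(X_train, y_train)

# Pass the test feature set to the model and compare predictions with the actual values.
y_hat = model.predict(X_test)
print("Out-of-sample MAE:", mean_absolute_error(y_test, y_hat))
```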
However, please ensure that you retrain your model on the full dataset, including the testing portion, once evaluation is done, as you don't want to lose potentially valuable data. The issue with train/test
split is that the result is highly dependent on which rows happen to end up in the training set and which in the testing set. Train/test split still gives a better estimate of out-of-sample accuracy than training and testing on the same dataset, but the estimate can vary considerably from one split to another. Another evaluation approach, called K-fold
cross-validation, resolves most of these issues. How do you fix a high variation
that results from a dependency? Well, you average it. Let me explain the basic
concept of K-fold cross-validation to see how we can solve this problem. The entire
dataset is represented by the points in the image at the top left. If we have K
equals four folds, then we split up this dataset as shown here. In the first fold
for example, we use the first 25 percent of the dataset for testing and the rest
for training. The model is built using the training set and is evaluated using the
test set. Then, in the next round or in the second fold, the second 25 percent of
the dataset is used for testing and the rest for training the model. Again, the
accuracy of the model is calculated. We continue for all folds. Finally, the results of all four evaluations are averaged. That is, the accuracy of each fold is averaged, keeping in mind that the folds are distinct: no data point appears in the test portion of more than one fold. K-fold cross-validation in its simplest form performs
multiple train/test splits, using the same dataset where each split is different.
Then, the results are averaged to produce a more consistent out-of-sample accuracy; a minimal sketch is given below.
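Here is a sketch of 4-fold cross-validation with scikit-learn; the synthetic data, the linear regression model, and the R-squared scoring are assumptions for illustration:

```python
# Minimal sketch of K-fold cross-validation with K = 4 (assumed synthetic data).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(100, 1))
y = 50 * X[:, 0] + rng.normal(0, 10, size=100)

# Each fold holds out a distinct 25% of the data for testing and trains on the rest.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

print("R^2 per fold:", scores)
print("Average R^2 :", scores.mean())  # the four results are averaged
```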
We wanted to show you an evaluation model that addressed some of the issues we've
described in the previous approaches. However, going in depth with the K-fold cross-validation model is out of the scope of this course. Thanks for watching. (Music)
At timestamps 2:01-2:05, an error was made regarding the terminology used. It was
stated that "relative absolute error is also known as the residual sum of square."
It's important to note the distinction between Relative Absolute Error (RAE) and
Residual Sum of Squares (RSS):
Relative Absolute Error (RAE): Measures the average absolute difference between
actual and predicted values relative to the average absolute difference between
actual values and their mean.
Residual Sum of Squares (RSS): Calculates the sum of the squared differences
between actual and predicted values.
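For reference, with y_i the actual values, y-hat_i the predicted values, and y-bar the mean of the actual values, the standard forms of the two metrics are:

```latex
\[
  \text{RAE} = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}
                    {\sum_{i=1}^{n} \left| y_i - \bar{y} \right|}
  \qquad
  \text{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]
```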
Introduction to Classification
Hello, in this video, we'll give you an introduction to classification. So let's
get started. In machine learning classification is a supervised learning approach
which can be thought of as a means of categorizing or classifying some unknown
items into a discrete set of classes. Classification attempts to learn the
relationship between a set of feature variables and a target variable of interest.
The target attribute in classification is a categorical variable with discrete
values. So, how do classification and classifiers work? Given a set of training
data points along with the target labels, classification determines the class label
for an unlabeled test case. Let's explain this with an example. A good example of classification is loan default prediction. Suppose a bank is concerned about the potential for loans not to be repaid. If previous loan default data can be used
to predict which customers are likely to have problems repaying loans, these bad
risk customers can either have their loan application declined or be offered
alternative products. The goal of a loan default predictor is to use existing loan
default data which has information about the customers such as age, income,
education et cetera, to build a classifier, pass a new customer or potential future
default to the model, and then label it, i.e., label the data point as defaulter or not defaulter (for example, as one or zero). This is how a classifier predicts an
unlabeled test case. Please notice that this specific example was about a binary
classifier with two values. We can also build classifier models for both binary
classification and multi-class classification. For example, imagine that you've
collected data about a set of patients, all of whom suffered from the same illness.
During their course of treatment, each patient responded to one of three
medications. You can use this labeled dataset with a classification algorithm to
build a classification model. Then you can use it to find out which drug might be
appropriate for a future patient with the same illness. As you can see, it is a
sample of multi-class classification. Classification has different business use
cases as well. For example, to predict the category to which a customer belongs,
for churn detection where we predict whether a customer switches to another
provider or brand, or to predict whether or not a customer responds to a particular
advertising campaign. Data classification has several applications in a wide
variety of industries. Essentially, many problems can be expressed as associations
between feature and target variables, especially when labelled data is available.
This provides a broad range of applicability for classification. For example,
classification can be used for email filtering, speech recognition, handwriting
recognition, biometric identification, document classification and much more. Here
we have some of the types of classification algorithms in machine learning. They include decision trees, Naive Bayes, linear discriminant analysis, K-Nearest Neighbors,
logistic regression, neural networks, and support vector machines. There are many
types of classification algorithms. We will only cover a few in this course. Thanks
for watching. (Music)
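To make the loan-default example from this video concrete, here is a minimal sketch of training a binary classifier and labeling a new customer. The tiny hand-made dataset, the two features (age and income), and the choice of a decision tree are all assumptions for illustration, not details from the video:

```python
# Minimal sketch: a binary classifier for a hypothetical loan-default problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled data: [age, income], with 1 = defaulter, 0 = not defaulter.
X_train = np.array([[25, 30000], [40, 80000], [35, 45000], [50, 90000],
                    [23, 25000], [45, 70000], [30, 32000], [60, 100000]])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Learn the relationship between the features and the discrete target.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pass a new, unlabeled customer to the model and predict its class label.
new_customer = np.array([[28, 31000]])
print("Predicted label:", clf.predict(new_customer)[0])  # 1 = defaulter, 0 = not defaulter
```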
K-Nearest Neighbours
Hello and welcome. In this video, we'll be covering the K-Nearest Neighbors
algorithm. So, let's get started. Imagine that a telecommunications provider has
segmented its customer base by service usage patterns, categorizing the customers
into four groups. If demographic data can be used to predict group membership, the
company can customize offers for individual prospective customers. This is a
classification problem. That is, given the dataset with predefined labels, we need
to build a model to be used to predict the class of a new or unknown case. The
example focuses on using demographic data, such as region, age, and marital status
to predict usage patterns. The target field called custcat has four possible values
that correspond to the four customer groups as follows: Basic Service, E Service,
Plus Service, and Total Service. Our objective is to build a classifier. For
example, using rows zero to seven to predict the class of row eight. We will use
a specific type of classification called K-Nearest Neighbor. Just for sake of
demonstration, let's use only two fields as predictors specifically, age and
income, and then plot the customers based on their group membership. Now, let's say
that we have a new customer. For example, record number eight, with a known age and
income. How can we find the class of this customer? Can we find one of the closest
cases and assign the same class label to our new customer? Can we also say that the
class of our new customer is most probably group four, i.e., Total Service, because its nearest neighbor is also of class four? Yes, we can. In fact, it is the first
nearest neighbor. Now, the question is, to what extent can we trust our judgment
which is based on the first nearest neighbor? It might be a poor judgment
especially if the first nearest neighbor is a very specific case or an outlier,
correct? Now, let's look at our scatter plot again. Rather than choose the first
nearest neighbor, what if we chose the five nearest neighbors and did a majority
vote among them to define the class of our new customer? In this case, we'd see
that three out of five nearest neighbors tell us to go for class three, which is
Plus Service. Doesn't this make more sense? Yes. In fact, it does. In this case,
the value of K in the K-Nearest Neighbors algorithm is five. This example
highlights the intuition behind the K-Nearest Neighbors algorithm. Now, let's
define the K Nearest Neighbors. The K-Nearest Neighbors algorithm is a
classification algorithm that takes a bunch of labeled points and uses them to
learn how to label other points. This algorithm classifies cases based on their
similarity to other cases. In K-Nearest Neighbors, data points that are near each
other are said to be neighbors. K-Nearest Neighbors is based on this paradigm.
Similar cases with the same class labels are near each other. Thus, the distance
between two cases is a measure of their dissimilarity. There are different ways to
calculate the similarity or conversely, the distance or dissimilarity of two data
points. For example, this can be done using Euclidean distance. Now, let's see how
the K-Nearest Neighbors algorithm actually works. In a classification problem, the
K-Nearest Neighbors algorithm works as follows. One, pick a value for K. Two,
calculate the distance from the new case (the holdout) to each of the cases in the dataset. Three, search for the K observations in the training data that are nearest
to the measurements of the unknown data point. And four, predict the response of
the unknown data point using the most popular response value from the K-Nearest
Neighbors.
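Here is a minimal sketch of these four steps using scikit-learn's KNeighborsClassifier. The small hand-made dataset, the two features (age and income), and K equals five are assumptions for illustration:

```python
# Minimal sketch of the four KNN steps (assumed synthetic data, K = 5).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical customers: [age, income], labeled 1-4 for the four service groups.
X = np.array([[30, 30000], [25, 28000], [45, 90000], [50, 95000],
              [35, 40000], [40, 42000], [28, 31000], [55, 99000]])
y = np.array([1, 1, 4, 4, 2, 2, 1, 4])

# Normalize the features so age and income contribute comparably to the distance.
scaler = StandardScaler().fit(X)
X_norm = scaler.transform(X)

# Step 1: pick a value for K.
k = 5
# Steps 2-4 are handled internally: distances from the new case to all cases,
# the K nearest neighbors, and a majority vote among their labels.
knn = KNeighborsClassifier(n_neighbors=k).fit(X_norm, y)

new_customer = scaler.transform(np.array([[33, 35000]]))
print("Predicted group:", knn.predict(new_customer)[0])
```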
There are two parts in this algorithm that might be a bit confusing. First, how to select the correct K and second, how to compute the similarity
between cases, for example, among customers. Let's first start with the second
concern. That is, how can we calculate the similarity between two data points?
Assume that we have two customers, customer one and customer two, and for a moment,
assume that these two customers have only one feature, age. We can easily use a
specific type of Minkowski distance to calculate the distance of these two
customers; it is indeed the Euclidean distance. The distance of x1 from x2 is the square root of (34 minus 30) squared, which is four. What about if we have more than one
feature? For example, age and income. If we have income and age for each customer,
we can still use the same formula but this time, we're using it in a two
dimensional space; the general formula is written out below.
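For two cases x1 and x2 described by n features, the Euclidean distance is:

```latex
\[
  \operatorname{dist}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} \left( x_{1i} - x_{2i} \right)^2}
\]
% With a single feature (age): dist(x_1, x_2) = \sqrt{(34 - 30)^2} = 4.
% With two features (age, income), the same formula simply sums over both features.
```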
We can also use the same distance measure for multidimensional vectors. Of course, we have to normalize our feature set to get an accurate
dissimilarity measure. There are other dissimilarity measures as well that can be
used for this purpose but, as mentioned, the choice is highly dependent on the data type and also on the domain for which the classification is done. As mentioned, K in K-Nearest
Neighbors is the number of nearest neighbors to examine. It is supposed to be
specified by the user. So, how do we choose the right K? Assume that we want to
find the class of the customer noted as question mark on the chart. What happens if
we choose a very low value of K? Let's say, K equals one. The first nearest point
would be blue, which is class one. This would be a bad prediction, since more of
the points around it are magenta or class four. In fact, since its nearest neighbor
is blue, we can say that we captured the noise in the data or chose one of the
points that was an anomaly in the data. A low value of K causes a highly complex
model as well, which might result in overfitting of the model. It means the
prediction process is not generalized enough to be used for out-of-sample cases.
Out-of-sample data is data that is outside of the data set used to train the model.
In other words, an overfit model cannot be trusted to make predictions on unknown samples.
It's important to remember that overfitting is bad, as we want a general model that
works for any data, not just the data used for training. Now, on the opposite side
of the spectrum, if we choose a very high value of K such as K equals 20, then the
model becomes overly generalized. So, how can we find the best value for K? The
general solution is to reserve a part of your data for testing the accuracy of the
model. Once you've done so, choose K equals one and then use the training part for
modeling and calculate the accuracy of prediction using all samples in your test
set. Repeat this process, increasing K each time, and see which K is best for your model; a minimal sketch of this loop is given below.
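In the sketch, the synthetic four-class data, the 70/30 split, and the range of K values tried are all assumptions for illustration:

```python
# Minimal sketch of choosing K by measuring test-set accuracy for each candidate K.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumed synthetic four-class dataset, normalized, then split 70/30.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=4, random_state=0)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_k, best_acc = None, 0.0
for k in range(1, 16):                       # try K = 1, 2, ..., 15
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

print("Best K:", best_k, "with accuracy", round(best_acc, 3))
```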
For example, in our case, K equals four will give us the best accuracy. Nearest
neighbors analysis can also be used to compute values for a continuous target. In
this situation, the average or median target value of the nearest neighbors is used
to obtain the predicted value for the new case. For example, assume that you are
predicting the price of a home based on its feature set, such as number of rooms,
square footage, the year it was built, and so on. You can easily find the three
nearest neighbor houses, of course not only based on distance but also based on all the attributes, and then predict the price of the house as the median of the neighbors' prices; a brief sketch is given below.
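The toy house data and the use of three neighbors in the sketch are assumptions; note that scikit-learn's KNeighborsRegressor averages the neighbors' values, so the median variant is computed by hand:

```python
# Minimal sketch of K-nearest-neighbors for a continuous target (assumed toy data).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical houses: [rooms, square footage, year built] and their prices.
# (Feature normalization is omitted here for brevity.)
X = np.array([[3, 1200, 1990], [4, 1800, 2005], [2, 900, 1975],
              [5, 2400, 2015], [3, 1400, 2000], [4, 2000, 2010]])
y = np.array([200_000, 320_000, 150_000, 450_000, 240_000, 360_000])

new_house = np.array([[3, 1500, 1998]])

# Average of the 3 nearest neighbors (KNeighborsRegressor uses the mean).
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print("Predicted price (mean of neighbors):", knn.predict(new_house)[0])

# Median of the same 3 neighbors, computed by hand.
_, idx = knn.kneighbors(new_house)
print("Predicted price (median of neighbors):", np.median(y[idx[0]]))
```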
This concludes this video. Thanks for watching. (Music)
Evaluation Metrics in Classification
Hello and welcome. In this video, we'll be covering evaluation metrics for
classifiers. Let's get started. Evaluation metrics explain the performance of a
model. Let's talk more about the model evaluation metrics that are used for
classification. Imagine that we have an historical dataset which shows the customer
churn for a telecommunication company. We have trained the model, and now we want
to calculate its accuracy using the test set. We pass the test set to our model,
and we find the predicted labels. Now the question is, how accurate is this model?
Basically, we compare the actual values in the test set with the values predicted
by the model to calculate the accuracy of the model. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that might require improvement. There are different model evaluation metrics, but we will just talk
about three of them here, specifically, Jaccard index, F1 score, and log loss.
Let's first look at one of the simplest accuracy measurements, the Jaccard index,
also known as the Jaccard similarity coefficient. Let's say y shows the true labels
of the churn dataset, and y-hat shows the predicted values by our classifier. Then
we can define Jaccard as the size of the intersection divided by the size of the
union of two label sets. For example, for a test set of size 10 with eight correct
predictions or eight intersections, the accuracy by the Jaccard index would be
0.66, as worked out below.
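With 8 correct predictions (the intersection) and 10 labels in each of the two sets:

```latex
\[
  J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}
                = \frac{8}{10 + 10 - 8} = \frac{8}{12} \approx 0.66
\]
```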
If the entire set of predicted labels for a sample strictly matches with the true set of labels, then the subset accuracy is 1.0; otherwise, it is 0.0. Another
way of looking at accuracy of classifiers is to look at a confusion matrix. For
example, let's assume that our test set has only 40 rows. This matrix shows the
correct and wrong predictions in comparison with the actual labels. Each
confusion matrix row shows the actual true labels in the test set, and the columns
show the predicted labels by classifier. Let's look at the first row. The first row
is for customers whose actual churn value in the test set is one. As you can
calculate, out of 40 customers, the churn value of 15 of them is one, and out of
these 15, the classifier correctly predicted six of them as one, and nine of them
as zero. This means that for six customers, the actual churn value was one in the
test set, and the classifier also correctly predicted those as one. However, while
the actual label of nine customers was one, the classifier predicted those as zero,
which is not very good. We can consider this as an error of the model for the first
row. What about the customers with a churn value 0? Let's look at the second row.
It looks like there were 25 customers whose churn value was zero. The classifier
correctly predicted 24 of them as zero and wrongly predicted one of them as one, so
it has done a good job in predicting the customers with a churn value of zero. A
good thing about the confusion matrix is that it shows the model's ability to
correctly predict or separate the classes. In the specific case of a binary
classifier such as this example, we can interpret these numbers as the count of
true positives, false negatives, true negatives, and false positives. Based on the
count of each section, we can calculate the precision and recall of each label.
Precision is a measure of the accuracy provided that a class label has been
predicted. It is defined by precision equals true positive divided by true positive
plus false positive. Recall is the true positive rate. It is defined as recall
equals true positive divided by true positive plus false negative. We can calculate
the precision and recall of each class.
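Written in symbols, with TP, FP, and FN denoting the counts of true positives, false positives, and false negatives for the class in question (the last formula is the F1 score equation referred to below):

```latex
\[
  \text{Precision} = \frac{TP}{TP + FP}, \qquad
  \text{Recall} = \frac{TP}{TP + FN}, \qquad
  F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```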
Now we're in a position to calculate the F1 score for each label based on the precision and recall of that label. The F1
score is the harmonic average of the precision and recall, where an F1 score
reaches its best value at one, which represents perfect precision and recall, and
its worst at zero. It is a good way to show that a classifier has a good value for
both recall and precision. It is defined using the F1 score equation. For example,
the F1 score for Class 0, i.e., churn equals zero, is 0.83, and the F1 score for Class 1, i.e., churn equals one, is 0.55. Finally, we can say the average accuracy for this
classifier is the average of the F1 score for both labels, which is 0.69 in our
case; a minimal sketch reproducing these numbers is given below.
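The sketch builds label arrays that match the counts in the example (15 actual ones, of which 6 are predicted correctly, and 25 actual zeros, of which 24 are predicted correctly); the arrays themselves are constructed purely for illustration:

```python
# Minimal sketch reproducing the churn example's confusion matrix and F1 scores.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Actual churn = 1 for 15 customers (6 predicted 1, 9 predicted 0);
# actual churn = 0 for 25 customers (24 predicted 0, 1 predicted 1).
y_true = np.array([1] * 15 + [0] * 25)
y_pred = np.array([1] * 6 + [0] * 9 + [0] * 24 + [1] * 1)

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[ 6  9]
#  [ 1 24]]

print(classification_report(y_true, y_pred))
# F1 is about 0.83 for class 0 and 0.55 for class 1, averaging about 0.69.
```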
Please notice that both Jaccard and F1 score can be used for multiclass classifiers as well, which is out of scope for this course. Now, let's look at
another accuracy metric for classifiers. Sometimes the output of a classifier is
the probability of a class label instead of the label. For example, in logistic
regression, the output can be the probability of customer churn, i.e., yes, or equal
to one. This probability is a value between zero and one. Logarithmic loss, also
known as log loss, measures the performance of a classifier where the predicted
output is a probability value between zero and one. For example, predicting a
probability of 0.13 when the actual label is one would be bad, and would result in
a high log loss. We can calculate the log loss for each row using the log loss
equation, which measures how far each prediction is from the actual label.
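One common form of the equation, for a single row with actual label y and predicted probability y-hat, together with its average over the n rows of the test set, is:

```latex
\[
  \text{LogLoss}_{\text{row}}
    = -\bigl( y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \bigr)
\]
\[
  \text{LogLoss}
    = -\frac{1}{n} \sum_{i=1}^{n}
      \bigl( y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \bigr)
\]
```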
Then we calculate the average log loss across all rows of the test set. It is obvious that
ideal classifiers have progressively smaller values of log loss, so the classifier
with the lower log loss has better accuracy. Thanks for watching.