BSC ML CH1

Supervised learning is a machine learning technique where models are trained using labeled data to classify new observations into predefined categories, such as binary or multiclass classification. Key algorithms for classification include Logistic Regression, k-Nearest Neighbors, and Decision Trees, with evaluation metrics like accuracy, precision, recall, and F1 score used to assess model performance. The document also discusses the K-Nearest Neighbor algorithm, explaining its operation and application in classification tasks.


Unit 1: What is supervised learning? Binary and multiclass classification, Evaluation measures for supervised learning, k-Nearest Neighbor algorithm
Machine learning is the field of study that gives computers the capability to learn without being explicitly programmed.

[Diagram: a traditional program combines input data with a hand-coded algorithm to produce output, whereas a machine learning program learns from input data to produce output.]
Relationship Between AI, ML, DL and DS

[Diagram omitted: relationship between Artificial Intelligence, Machine Learning, Deep Learning and Data Science.]
Types
Supervised Learning
• Supervised learning is when we train the machine using data that is well labeled.
• The machine is then given a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
Classification
• The Classification algorithm is a Supervised Learning technique used to identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called targets, labels or categories.
• Types:
➢ Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
➢ Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.
Binary Classification
• Binary classification is the task of classifying given data into one of two classes. It is essentially a prediction of which of two groups a thing belongs to.
• Because it categorizes data into two distinct classes, it provides a clear decision boundary. This method is essential for tasks like email spam detection and medical diagnostics.
• The most popular algorithms used for binary classification are:

• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
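
A minimal sketch of binary classification, not from the slides: it assumes scikit-learn and uses an illustrative synthetic dataset with Logistic Regression.

# Binary classification sketch: two classes, e.g. Spam (1) vs. Not Spam (0)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)           # learn a decision boundary from labeled data
print(clf.predict(X_test[:5]))      # predicted class (0 or 1) for new observations
print(clf.score(X_test, y_test))    # accuracy on the held-out test data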
Multiclass Classification
Multi-class classification is the task of classifying elements into one of several classes. Unlike binary classification, the number of classes is not restricted to two.

Examples of multi-class classification are


• classification of news in different categories,
• classifying books according to the subject,
• classifying students according to their streams etc.

Popular algorithms that can be used for multi-class classification include:


• k-Nearest Neighbors
• Decision Trees
• Naive Bayes
• Random Forest
• Gradient Boosting
• There are several methods for training multiclass models:
• One-vs-rest strategy: Trains a separate classifier for each class against all
others
• One-vs-one approach: Creates binary classifiers for every pair of classes
• Softmax activation: Often used in neural networks to output probability
distributions across classes
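
A hedged sketch of the first two strategies (assuming scikit-learn; the 3-class dataset is synthetic and illustrative):

# One-vs-rest vs. one-vs-one multiclass training strategies
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=600, n_classes=3, n_informative=4, random_state=1)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # one binary classifier per class (3 here)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # one binary classifier per pair (3 pairs here)

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))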
Classification Type   Output Structure            Example
Binary                Single probability          Spam (1) or Not Spam (0)
Multiclass            Probability distribution    Fruit: Apple (0.7), Orange (0.2), Pear (0.1)
Multi-label           Set of binary indicators    Emotions: Happy (1), Sad (0), Excited (1)
Types of ML Classification Algorithms:
• Linear Models
  • Logistic Regression
  • Support Vector Machines
• Non-linear Models
  • K-Nearest Neighbours
  • Kernel SVM
  • Naïve Bayes
  • Decision Tree Classification
  • Random Forest Classification
Supervised Learning
Examples
• Predicting House Prices:
• Given features like area, number of rooms, location, etc.,
predict the price of a house
• The labeled data would consist of houses with their
corresponding prices

• Email Spam Classification:
• Given the content and features of an email, classify it as
either spam or non-spam
• The labeled data would consist of emails marked as
spam or non-spam
Unsupervised Learning
• Unsupervised learning is the training of a machine using information that is neither
classified nor labeled and allowing the algorithm to act on that information without
guidance.
• Here the task of the machine is to group unsorted information according to similarities,
patterns, and differences without any prior training on labeled data.

Comparison

[Diagram: Machine Learning Models]
• Supervised Learning — task driven, uses pre-categorized (labeled) data; produces predictions and predictive models.
  • Regression (e.g., divide the ties by length): Linear Regression, Decision Tree, Random Forest, Neural Networks
  • Classification (e.g., divide the socks by color): Logistic Regression, Support Vector Machine, Naïve Bayes
• Unsupervised Learning — data driven, uses unlabeled data; performs pattern/structure recognition.
  • Clustering: divide by similarity
  • Association: identify sequences
  • Dimensionality Reduction: compress data based on features
Model Evaluation
⮚ Train/Test is a method to measure the accuracy of your model.
⮚ It is called Train/Test because you split the data set into two sets: a training set and a testing
set.
⮚ Example: 80% for training, and 20% for testing.
⮚ You train the model using the training set.
⮚ You test the model using the testing set.
⮚ Train the model means create the model.
⮚ Test the model means test the accuracy of the model.
⮚ We can measure model performance by two methods. Accuracy simply means the proportion of values correctly predicted.
1. Confusion Matrix
2. Classification Measure
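
A minimal sketch of the Train/Test workflow, assuming scikit-learn; the Iris dataset and the Decision Tree model are illustrative choices, not from the slides.

# Split the data 80/20, train on one part, test on the other
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)  # "train the model" = create the model
print(model.score(X_test, y_test))                      # "test the model" = measure its accuracy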
Confusion Matrix
• The confusion matrix, also known as the error matrix, is a table that describes the performance of a classification model on a set of test data.
• It is a two-dimensional matrix where each row represents the instances in a predicted class and each column represents the instances in an actual class (or the other way round, depending on convention):

                     Actual Positive   Actual Negative
Predicted Positive         TP                FP
Predicted Negative         FN                TN

Here, TP (True Positive) means the observation is positive and is predicted as positive,
FP (False Positive) means the observation is negative but is predicted as positive,
TN (True Negative) means the observation is negative and is predicted as negative,
and FN (False Negative) means the observation is positive but is predicted as negative.
• Actual values =
['dog', 'cat', 'dog', 'cat', 'dog',
 'dog', 'cat', 'dog', 'cat', 'dog',
 'dog', 'dog', 'dog', 'cat', 'dog',
 'dog', 'cat', 'dog', 'dog', 'cat']

• Predicted values =
['dog', 'dog', 'dog', 'cat', 'dog',
 'dog', 'cat', 'cat', 'cat', 'cat',
 'dog', 'dog', 'dog', 'cat', 'dog',
 'dog', 'cat', 'dog', 'dog', 'cat']
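
Computing the confusion matrix for these lists, in a sketch that assumes scikit-learn and treats 'dog' as the positive class:

from sklearn.metrics import confusion_matrix

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# rows = actual class, columns = predicted class, in the order given by `labels`
print(confusion_matrix(actual, predicted, labels=['dog', 'cat']))
# [[11  2]   -> TP = 11, FN = 2 (with 'dog' as the positive class)
#  [ 1  6]]  -> FP = 1,  TN = 6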
• A good model is one which has high TP and TN rates and low FP and FN rates.
• If you have an imbalanced dataset to work with, it is better to use the confusion matrix as the evaluation criterion for your machine learning model.
2. Classification Measure
• Basically, it is an extended version of the confusion matrix.
• There are measures other than the confusion matrix which can
help achieve better understanding and analysis of our model and
its performance.
a. Accuracy
b. Precision
c. Recall (TPR, Sensitivity)
d. F1-Score
e. FPR (Type I Error)
f. FNR (Type II Error)
Accuracy

➢ Accuracy is the ratio of the total number of correct predictions to the total number of predictions.
➢ Accuracy is, simply put, the total proportion of observations that have been correctly predicted.
➢ We can use accuracy when we are interested in predicting both 0 and 1 correctly and our dataset is balanced enough.
➢ The formula for calculating accuracy is:

   Accuracy = (TP + TN) / (TP + TN + FP + FN)
A common complaint about accuracy is that it fails when the classes are imbalanced.

For example, if the data contains only 10% positive instances, a majority-baseline classifier that always assigns the negative label would reach 90% accuracy, since it would correctly predict 90% of instances. But of course such a classifier is useless: it never detects a single positive instance.
Precision
• Precision is the ratio between the True Positives and all predicted Positives:

   Precision = TP / (TP + FP)

• Precision measures how many of the positive predictions made are correct (true positives).
• Precision is a good measure to use when the cost of a False Positive is high.
Recall
• Recall measures how well our model identifies True Positives:

   Recall = TP / (TP + FN)

• Thus, for all the patients who actually have heart disease, recall tells us how many we correctly identified as having heart disease.
• Recall measures how completely our model identifies the relevant data. It is also called Sensitivity or the True Positive Rate.
• In most cases we want both precision and recall to be high, but there is a trade-off: raising one usually lowers the other.
• To balance the two, we have another metric called the F1 Score.
F1 Score
The F1 score is an evaluation metric that measures a model's accuracy by combining its precision and recall scores:

   F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score is a popular performance measure for classification and is often preferred over accuracy when the data is unbalanced, such as when the number of examples belonging to one class significantly outnumbers the other.
The F1 Score may be the better measure when we need a balance between Precision and Recall.
Advantages:
• A very small precision or recall will result in a low overall score, so the F1 score helps balance the two metrics.
• If you choose the class with fewer samples as the positive class, the F1 score can help balance the metric across positive/negative samples.
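
Continuing the dog/cat example above, a sketch (assuming scikit-learn) that computes all three metrics for the positive class 'dog':

from sklearn.metrics import precision_score, recall_score, f1_score

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

p  = precision_score(actual, predicted, pos_label='dog')  # TP/(TP+FP) = 11/12
r  = recall_score(actual, predicted, pos_label='dog')     # TP/(TP+FN) = 11/13
f1 = f1_score(actual, predicted, pos_label='dog')         # 2pr/(p+r)
print('Precision: %.3f  Recall: %.3f  F1: %.3f' % (p, r, f1))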

AUC-ROC

• ROC: Receiver Operating Characteristic
• AUC: Area Under the Curve
• The AUC-ROC curve helps us visualize how well our machine learning classifier performs.
• The ROC curve is the graphical representation of the effectiveness of a binary classification model: it plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds.
• The area under the ROC curve is used as a measure of the quality of a classification model. Hence, the AUC-ROC curve is a performance measurement for the classification problem at various threshold settings.

• The True Positive Rate (sensitivity), or Recall, is defined as

   TPR = TP / (TP + FN)

  It represents the benefit of using the model.

• The False Positive Rate (1 − specificity) is defined as

   FPR = FP / (FP + TN)

  It represents the loss due to the model.
• AUC measures the overall performance of the binary classification model.
• Since both TPR and FPR range between 0 and 1, the area always lies between 0 and 1, and a greater AUC denotes better model performance.
• Our main goal is to maximize this area in order to have the highest TPR and lowest FPR at the given threshold.
• AUC represents the probability that the model can distinguish between the two classes present in our target.
• A higher X-axis value indicates a higher number of False Positives than True Negatives.
• A higher Y-axis value indicates a higher number of True Positives than False Negatives.
• An excellent model has an AUC near 1, which means it has a good measure of separability.
• A poor model has an AUC near 0, which means it has the worst measure of separability.
• When the AUC is 0.5, the model has no class-separation capacity whatsoever.
AUC value (x)        Interpretation
x = 0.5              The ROC is random; the classifier cannot differentiate the positive and negative classes.
0.5 < x <= 0.7       The classifier's performance is poor and limited, but better than random.
0.7 < x <= 0.8       The classifier's performance is decent, but there is still room for improvement.
0.8 < x <= 0.9       The classifier is significantly good and can visibly differentiate the two classes to provide reliable results.
x = 1.0              The ROC is perfect; the classifier provides highly accurate results with reliable performance.
AUC-ROC
• The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two things (e.g., whether a patient has a disease or not).
• Better models can accurately distinguish between the two classes, whereas a poor model has difficulty distinguishing between them.
• ROC Curves and AUC in Python

# calculate the ROC curve: the function returns the false positive rate and
# true positive rate for each threshold, plus the thresholds themselves
fpr, tpr, thresholds = roc_curve(y, probs)

# calculate the AUC
auc = roc_auc_score(y, probs)
print('AUC: %.3f' % auc)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import plotly.express as px
import pandas as pd

# Random classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)

model = LogisticRegression()
model.fit(X, y)

Now we want to evaluate how good our model is using ROC curves. To do this, we need to find the FPR and TPR for various threshold values:

# scores for the positive class (predicted probabilities), required by roc_curve
preds = model.predict_proba(X)[:, 1]

fpr, tpr, thresh = roc_curve(y, preds)

roc_df = pd.DataFrame(zip(fpr, tpr, thresh),
                      columns=["FPR", "TPR", "Threshold"])

# plot the ROC curve from the dataframe
px.line(roc_df, x="FPR", y="TPR").show()
K-Nearest Neighbor (KNN) Algorithm
Introduction
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset, and at the time of classification it performs an action on the dataset.
• At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. A KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need the K-NN Algorithm?

• Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point lie in?
• To solve this type of problem, we need the K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?

• Step 1: Select the number K of neighbors.
• Step 2: Calculate the Euclidean distance from the new point to the data points.
• Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step 4: Among these K neighbors, count the number of data points in each category.
• Step 5: Assign the new data point to the category for which the number of neighbors is maximum.

(The Euclidean distance between points p and q is d(p, q) = √Σᵢ (pᵢ − qᵢ)².)
• In the illustrated example (with K = 3), all 3 nearest neighbors are from Category A, hence the new data point must belong to Category A.
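
A minimal KNN classification sketch (assuming scikit-learn; the Iris dataset is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K = 5: each new point gets the majority class of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)         # a lazy learner: "training" just stores the data
print(knn.predict(X_test[:3]))    # classify new observations by neighbor vote
print(knn.score(X_test, y_test))  # accuracy on the held-out test data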
How to select the value of K in the K-NN Algorithm?
• There is no particular way to determine the best value for K, so we need to try several values to find the best of them. The most commonly used value for K is 5.
• A very low value for K, such as K = 1 or K = 2, can be noisy and subject to the effects of outliers.
• Larger values for K smooth out noise, but a K that is too large may cause difficulties (the neighborhood starts to include points from other classes).
• Optimal K: usually determined using cross-validation to balance bias and variance.
• For a very low value of k (say k = 1), the model overfits the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of k, the model performs poorly on both the training and validation sets. In the example plot, the validation error curve reaches a minimum at k = 9; this is the optimum value for that model (it will vary for different datasets). Researchers typically use the elbow curve, named for its resemblance to an elbow, to determine the k value.
• In a KNN regression example, taking k = 1 gives a very high RMSE value. The RMSE decreases as we increase k: at k = 7 the RMSE is approximately 1219.06, and it shoots up on further increasing k. We can safely say that k = 7 gives the best result in that case.
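
A sketch of selecting K by cross-validation (assuming scikit-learn; the dataset is an illustrative choice, and the best K will differ per dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Score each candidate K with 5-fold cross-validation, then pick the elbow/best value
for k in range(1, 12, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    print('k=%2d  mean CV accuracy=%.3f' % (k, cross_val_score(knn, X, y, cv=5).mean()))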
Advantages/Disadvantages
Advantages of the KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
Disadvantages of the KNN Algorithm:
• It always needs a value of K, which may be complex to determine at times.
• The computation cost is high, because the distance to all the training samples must be calculated for each prediction.
• Preprocessing Steps
• Scaling Features: Distance-based algorithms like KNN are sensitive to varying ranges in feature values. Standardize or normalize your features to ensure fair comparisons (see the sketch after this list).
• Handling Missing Data: Impute or remove missing values, as KNN relies heavily on complete data for distance calculations.
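
A sketch of the effect of feature scaling on KNN (assuming scikit-learn; the Wine dataset is an illustrative choice whose features have very different ranges):

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw    = KNeighborsClassifier(n_neighbors=5)  # distances dominated by large-range features
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print('unscaled: %.3f' % cross_val_score(raw, X, y, cv=5).mean())     # typically noticeably lower
print('scaled:   %.3f' % cross_val_score(scaled, X, y, cv=5).mean())  # typically higher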
• Real-World Applications
• Recommendation Systems: Matching users with similar preferences.
• Image Recognition: Identifying objects by comparing pixel patterns.
• Medical Diagnostics: Classifying diseases based on patient records.
• Customer Segmentation: Grouping customers based on purchasing
behavior.
Example KNN
Find the class label for a given instance using KNN with K = 5:
Step 1: Find the distance from the instance to every training point.
Step 2: Rank the training points by distance.
Step 3: Take the K nearest neighbours and assign the majority class.
[Worked tables of distances and ranks omitted.]
• https://github.com/codebasics/py/blob/master/ML/17_knn_classification/knn_classification_tutorial.ipynb