
UNIT 5: TYPES OF LEARNING

TYPES OF LEARNING

• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
SUPERVISED LEARNING
• Supervised learning uses labeled datasets to train algorithms to classify data or predict outcomes accurately.
• It relies on guidance and supervision.
• Example: an exit poll.
• Supervised learning involves training a machine
from labeled data.
• Labeled data consists of examples with the
correct answer or classification.
• The machine learns the relationship between
inputs (fruit images) and outputs (fruit labels).
• The trained machine can then make predictions
on new, unlabeled data.
SUPERVISED LEARNING ALGORITHMS
1. Linear Regression: Used for regression tasks; it models the relationship between a dependent variable and one or more independent variables.

2. Decision Tree: This algorithm partitions the dataset into smaller subsets based on features.

3. Random Forest: This method combines multiple decision trees to improve accuracy.

4. Naïve Bayes: This algorithm works well for text classification and spam filtering.

5. K-Nearest Neighbor: A simple classification algorithm that classifies data points based on the majority class among their K nearest neighbors in the feature space.
CATEGORIES/TYPES OF SUPERVISED
MACHINE LEARNING
REGRESSION
• Regression algorithms are used when there is a relationship between the input variable and the output variable.

• They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc.

• Examples:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
CLASSIFICATION
• Classification algorithms are used when the output variable is categorical.

• This means the output falls into discrete classes, such as Yes-No, Male-Female, True-False, etc.

• Examples:
1. Spam Filtering
2. Random Forest
3. Decision Trees
4. Logistic Regression
5. Support vector Machines
DIFFERENCE BETWEEN
CLASSIFICATION AND REGRESSION

Parameter | Classification | Regression
Basic | Mapping function is used for mapping values to predefined classes | Mapping function is used for mapping values to continuous output
Involves prediction of | Discrete values | Continuous values
Nature of the predicted data | Unordered | Ordered
Method of calculation | By measuring accuracy | By measuring root mean square error
Example algorithms | Decision tree, Logistic regression | Linear regression, Random forest
ADVANTAGES OF SUPERVISED
LEARNING:

• With the help of supervised learning, the model can predict the output on the basis of prior experience.

• In supervised learning, we can have an exact idea about the classes of objects.

• Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.
DISADVANTAGES OF SUPERVISED LEARNING:

• Supervised learning models are not suitable for handling complex tasks.

• Supervised learning cannot predict the correct output if the test data is different from the training dataset.

• Training requires a lot of computation time.

• In supervised learning, we need enough knowledge about the classes of objects.
APPLICATIONS OF SUPERVISED
LEARNING:
• Fraud detection: Helps identify fraudulent
transactions in banking and finance
• Spam detection: Uses keywords and content
to identify spam emails
• Translation: Uses large amounts of digital
written material to create models that can
translate text from one language to another
• Image recognition: A computer identifies an
object in an image by looking for patterns that
match what it has seen before
• Product recommendations: A popular
feature on e-commerce websites
• Social media features: Facebook uses
machine learning to automatically suggest
friend tags by identifying faces in a user's photo
UNSUPERVISED LEARNING
• Unsupervised machine learning uses machine learning algorithms to analyze and cluster unlabeled datasets.

• These algorithms discover hidden patterns or data groupings without the need for human intervention.
TYPES OF UNSUPERVISED LEARNING
ALGORITHM:
CLUSTERING:
Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.

It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence or absence of those patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with unlabeled datasets.

Example: grouping customers in a mall or supermarket.
ASSOCIATION
• An association rule is an unsupervised learning method used for finding relationships between variables in a large database.

• It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective.

• For example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
APPLICATION OF ASSOCIATION
• Retail: For market basket analysis to understand customer buying habits and to drive sales through promotions and store layout optimizations.
• Healthcare: For identifying combinations of symptoms and diagnoses that frequently occur together, which can help in the diagnosis of new patients.
• Web Usage Mining: For analyzing patterns in web usage data to improve website design and personalized content delivery.
• Finance: For fraud detection by identifying unusual patterns of transactions.
UNSUPERVISED LEARNING
ALGORITHMS
1. K-Means Clustering: The K-means clustering algorithm is one of the most popular unsupervised machine learning algorithms, used for data segmentation. It works by partitioning a data set into k clusters, where each cluster has a mean computed from the training data.

2. Principal Component Analysis (PCA): The PCA algorithm is used for dimensionality reduction of datasets.

3. Convolutional Neural Networks (CNNs): They work by taking an input image and splitting it into small square tiles called "windows." Each window is then passed through a neuron in the first layer.
ADVANTAGES OF UNSUPERVISED
MACHINE LEARNING
• Uncovering hidden patterns and structures
in data without needing labeled examples.
• Ability to explore and discover insights from
large and complex datasets.
• Flexibility in handling diverse data types
and domains.
• Useful for exploratory data analysis and
feature engineering.
DISADVANTAGES OF UNSUPERVISED
MACHINE LEARNING
• Results may be unpredictable or difficult to understand.

• Difficult to measure accuracy or effectiveness due to lack of predefined answers during training.
APPLICATION OF UNSUPERVISED
LEARNING

• Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior and preferences to recommend products, movies, or music that align with their interests.
• Customer segmentation: Unsupervised
learning can identify groups of customers with
similar characteristics, allowing businesses to
target marketing campaigns and improve
customer service more effectively.
• Image analysis: Unsupervised learning can
group images based on their content, facilitating
tasks such as image classification, object
detection, and image retrieval.
DIFFERENCE BETWEEN SUPERVISED
AND UNSUPERVISED MACHINE
LEARNING

Parameters | Supervised machine learning | Unsupervised machine learning
Input Data | Algorithms are trained using labeled data | Algorithms are used against data that is not labeled
Computational Complexity | Simpler method | Computationally complex
Accuracy | Highly accurate | Less accurate
No. of classes | No. of classes is known | No. of classes is not known
Data Analysis | Uses offline analysis | Uses real-time analysis of data
Output | Desired output is given | Desired output is not given
DIFFERENCE
Parameters | Supervised machine learning | Unsupervised machine learning
Complex models | It is not possible to learn larger and more complex models with supervised learning | It is possible to learn larger and more complex models with unsupervised learning
Model | We can test our model | We cannot test our model
Also called | Supervised learning is also called classification | Unsupervised learning is also called clustering
Example | Optical character recognition | Finding a face in an image
Training data | Uses training data to infer the model | No training data is used
SEMI SUPERVISED MACHINE
LEARNING
• A semi-supervised learning approach uses small amounts of labeled data together with large amounts of unlabeled data.

• With semi-supervised learning, you train an initial model on a few labeled samples and then iteratively apply it to the larger unlabeled dataset.
STEPS:
• We train a model with labeled data.

• We use the trained model to predict labels for the unlabeled data, which creates pseudo-labeled data.

• We retrain the model with the pseudo-labeled and labeled data together.

• This process repeats iteratively as the model improves and is able to perform with a greater degree of accuracy.
MODEL EVALUATION
• Model evaluation is the process of using metrics to analyze the performance of a model.

• Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions.

• Therefore, evaluating a model plays a vital role so that we can judge its performance.

• Evaluation also helps to analyze a model's key weaknesses.
Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. This is the most fundamental metric used to evaluate a model.

Precision and Recall
Precision is the ratio of true positives to the sum of true positives and false positives. It analyzes the model's positive predictions.
Recall measures how many of the actual positive samples in the dataset the model identified as positive (true positives).

F1 score
The F1 score is the harmonic mean of precision and recall. In the precision-recall trade-off, increasing precision tends to decrease recall and vice versa. The F1 score combines precision and recall into a single measure.
TRAINING AND TESTING
The training data is used to train the machine learning algorithm. Once you have trained your machine learning model on a dataset, you must test it on unseen data to evaluate its performance.

This unseen data is called the testing data.

This is similar to the test data used in software testing; only the context differs. In software testing, we use test data to ensure the software works well for the given inputs.

In machine learning, we use testing data to ensure the model works well on data it was not trained on.
NEED OF DATA SET SPLITTING.
The train-test split is a technique for evaluating the performance of a machine learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset.

The second subset is not used to train the model; instead, its input elements are provided to the model, predictions are made, and these are compared to the expected values. This second dataset is referred to as the test dataset.
Dataset Splitting:
scikit-learn (imported as sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides the model_selection module, which includes the splitter function train_test_split().
OVERFITTING AND UNDERFITTING IN ML
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset.
Because of this, the model starts capturing noise and inaccurate values present in the dataset, and these factors reduce the efficiency and accuracy of the model.
UNDERFITTING
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data.
To avoid overfitting, feeding of training data can be stopped at an early stage, but then the model may not learn enough from the training data and may fail to find the best fit.

How to avoid underfitting:
• By increasing the training time of the model.
• By increasing the number of features.
PERFORMANCE METRICS IN ML
Confusion Matrix:
A table with two rows and two columns that reports the number of
true positives, false negatives, false positives, and true negatives.

• True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
• True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
• False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). Also known as a Type I error.
• False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). Also known as a Type II error.

This matrix is especially helpful in evaluating a model's performance beyond basic accuracy metrics.
METRICS BASED ON CONFUSION MATRIX
DATA

1. Accuracy
Accuracy is used to measure the performance of the model. It is the ratio of total correct instances to total instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Precision
Precision is a measure of how accurate a model's positive predictions are. It is defined as the ratio of true positive predictions to the total number of positive predictions made by the model.

Precision = TP / (TP + FP)

3. Recall
Recall is the ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances: out of all actual positive classes, how many did the model predict correctly?

Recall = TP / (TP + FN)

Recall should be as high as possible.

4. F1-Score
The F1-score is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall:

F1-Score = (2 × Precision × Recall) / (Precision + Recall)
AUC-ROC CURVE
The AUC-ROC curve, or Area Under the
Receiver Operating Characteristic curve, is a
graphical representation of the performance
of a binary classification model at various
classification thresholds.
It is commonly used in machine learning to
assess the ability of a model to distinguish
between two classes, typically the positive
class (e.g., presence of a disease) and the
negative class (e.g., absence of a disease).
RECEIVER OPERATING CHARACTERISTICS
(ROC) CURVE
ROC stands for Receiver Operating
Characteristics, and the ROC curve is the
graphical representation of the effectiveness of
the binary classification model. It plots the true
positive rate (TPR) vs the false positive rate
(FPR) at different classification thresholds.
Area Under the Curve (AUC):
AUC stands for Area Under the Curve; here it is the area under the ROC curve. It measures the overall performance of the binary classification model. Since both TPR and FPR range between 0 and 1, the area always lies between 0 and 1, and a greater AUC denotes better model performance.
LOG LOSS
Logarithmic Loss, commonly known as Log
Loss or Cross-Entropy Loss, is a crucial metric
in machine learning, particularly in
classification problems. It quantifies the
performance of a classification model by
measuring the difference between predicted
probabilities and actual outcomes.
CROSS VALIDATION
Cross validation is a technique used in
machine learning to evaluate the performance
of a model on unseen data. It involves
dividing the available data into multiple folds
or subsets, using one of these folds as a
validation set, and training the model on the
remaining folds. This process is repeated
multiple times, each time using a different
fold as the validation set.
Finally, the results from each validation step
are averaged to produce a more robust
estimate of the model’s performance.
The main purpose of cross validation is to
prevent overfitting, which occurs when a
model is trained too well on the training data
and performs poorly on new, unseen data.

Methods of Cross Validation

1. Validation
In this method, we divide the input dataset into a training set and a test (validation) set.

Both subsets are given 50% of the dataset.

A big disadvantage is that we use only 50% of the dataset to train the model, so it may miss important information in the data. It also tends to give an underfitted model.
2. LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole dataset but leave out a single data point, iterating over each data point in turn.
In LOOCV, the model is trained on n−1 samples and tested on the one omitted sample, repeating this process for each data point in the dataset.

An advantage of this method is that we make use of all data points, hence it has low bias.

The major drawback is that it leads to higher variance in the testing estimate, since we test against a single data point each time. If that data point is an outlier, variance increases further.

Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points.
K FOLD
The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These samples are called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used as the test set.

Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used for training. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
ADVANTAGES AND DISADVANTAGES OF
CROSS VALIDATION
Advantages:
Overcoming Overfitting: Cross validation helps to prevent overfitting
by providing a more robust estimate of the model’s performance on
unseen data.
Model Selection: Cross validation can be used to compare different
models and select the one that performs the best on average.

Data Efficient: Cross validation allows the use of all the available data
for both training and validation, making it a more data-efficient
method compared to traditional validation techniques.
Disadvantages:
Computationally Expensive: Cross validation can be computationally
expensive, especially when the number of folds is large or when the
model is complex and requires a long time to train.
Time-Consuming: Cross validation can be time-consuming, especially
when there are many hyperparameters to tune or when multiple
models need to be compared.
