Machine Learning BE Merged Modules

CHAPTER 1

INTRODUCTION TO MACHINE LEARNING


What is Machine Learning
Machine Learning is a subset of Artificial Intelligence (AI).
It is mainly concerned with the development of algorithms that allow a
computer to learn from data and past experience on its own.
It improves its performance with experience and makes predictions without being
explicitly programmed.
Using sample historical data (training data), ML algorithms build
a mathematical model that helps in making predictions.
It brings computer science and statistics together to create predictive
models.
The more information we provide, the better the performance.
Working of Machine Learning
An ML system learns from historical data, builds prediction models, and,
whenever it receives new data, predicts the output for it.
The accuracy of the predicted output depends on the amount of data.
Need of Machine Learning
It is capable of doing tasks that are too complex for a person.
Using Machine Learning, we can save both time and money.
Use cases: self-driving cars, cyber fraud detection, face recognition, friend
suggestions on Facebook, etc.
Importance of Machine Learning:
Rapid increase in the production of data
Solving complex problems that are difficult for a human
Decision making in various sectors, including finance
Finding hidden patterns and extracting useful information from data
Classification of Machine Learning
1) Supervised Learning
It is a type of Machine Learning method.
The system creates a model using labelled data and learns from each example.
Once training and processing are done, we test the model by providing
sample data.
We then check whether it predicts the correct output or not.
An example of supervised learning is spam filtering.
Supervised learning can be grouped further into two categories of algorithms:
1. Classification 2. Regression
[Slides: "Supervised Learning – Possible Classifiers" and "The Process" – illustrative figures omitted]
Classification Types
1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification
Binary Classification:

• Binary classification refers to those
classification tasks that have two class labels.
• Examples include:
  • Email spam detection (spam or not).
  • Conversion prediction (buy or not).

Popular algorithms (see the sketch below):

• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
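A minimal binary-classification sketch (not from the original slides) using scikit-learn's LogisticRegression; the synthetic dataset and parameters are illustrative assumptions only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for e.g. spam / not-spam features
X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))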
Multi-Class Classification:

• Multi-class classification refers to those
classification tasks that have more than two class
labels.
• Examples include:
  • Face classification.
  • Plant species classification.
  • Optical character recognition.

Popular algorithms:

• k-Nearest Neighbors
• Decision Trees
• Naive Bayes
• Random Forest
• Gradient Boosting
Multi-Label Classification:

• Multi-label classification refers to those
classification tasks that have two or more class
labels, where one or more class labels may be
predicted for each example.
• Example: photo classification, where a given photo may
contain multiple objects in the scene and a model
may predict the presence of multiple known
objects in the photo, such as "bicycle," "apple,"
"person," etc.

Popular algorithms:

• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting
Imbalanced Classification:

• Imbalanced classification refers to classification
tasks where the number of examples in each
class is unequally distributed.
• Examples include:
  • Fraud detection.
  • Outlier detection.
  • Medical diagnostic tests.

Popular algorithms (a cost-sensitive sketch follows below):

• Cost-sensitive Logistic Regression
• Cost-sensitive Decision Trees
• Cost-sensitive Support Vector Machines
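A small cost-sensitive sketch (an illustration under assumed data, not from the slides): scikit-learn's class_weight="balanced" option weights errors on the rare class more heavily, one simple way to get cost-sensitive logistic regression:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic skewed data: roughly 95% negatives, 5% positives (e.g. fraud)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' penalizes mistakes on the minority class more
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))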
Classification of Machine Learning

2) Unsupervised Learning
It is a learning method in which a machine learns without any supervision.
The training is provided to the machine with a set of data that has not been
labelled, classified, or categorized.
The algorithm has to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new
features or groups of objects with similar patterns.
It can be further classified into two categories of algorithms:
1. Clustering 2. Association
Clustering Techniques

Density-Based Methods: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.

Hierarchical Methods: Agglomerative (bottom-up approach), Divisive (top-down approach)

Partitioning Methods (see the k-means sketch below): K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.

Grid-based Methods: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
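To make the partitioning idea concrete, here is a short k-means sketch with scikit-learn (illustrative only; the two-blob data is an assumption):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled blobs; no class labels are given to the algorithm
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster centers:\n", km.cluster_centers_)
print("First ten assignments:", km.labels_[:10])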
Classification of Machine Learning
3) Reinforcement Learning
It is a feedback-based learning method, in which a learning agent gets a reward
for each right action and a penalty for each wrong action.
In RL, the agent learns automatically from feedback, without any labelled data.
Since there is no labelled data, the agent is bound to learn from its experience
only.
RL solves a specific type of problem where decision making is sequential and
the goal is long-term, such as game playing, robotics, etc.
The agent learns by trial and error, and based on this experience,
it learns to perform the task in a better way.
Reinforcement Learning
"Reinforcement learning is a type of machine learning method where an
intelligent agent (computer program) interacts with the environment and
learns to act within it."
It is a core part of Artificial Intelligence, and all AI agents work on the concept
of reinforcement learning.
The agent learns which actions lead to positive feedback (rewards) and which
actions lead to negative feedback (penalties).
As a reward, the agent gets a positive point,
and as a penalty, it gets a negative point.
Applications of Machine Learning

1. Image Recognition
It is used to identify objects, persons, places, digital images, etc.
Use case: automatic friend tagging suggestions.
2. Speech Recognition
Speech recognition is the process of converting voice instructions
into text; it is also known as "speech to text" or
"computer speech recognition".
Use case: Google Assistant, Siri, Cortana, and Alexa.

3. Traffic Prediction
Google Maps shows us the correct path with the
shortest route and predicts traffic conditions, such as
whether traffic is clear, slow-moving, or heavily congested.
Applications of Machine Learning

4. Product Recommendations
Google understands user interest using various ML algorithms and suggests products according to
customer interest.
Use case: when we use Netflix, we see recommendations for series and movies.
5. Email Spam and Malware Filtering
Important mail arrives in our inbox marked with the important symbol, while spam emails land in our spam
box.
Below are some spam filters used by Gmail:
o Content filters
o Header filters
o General blacklist filters
o Rules-based filters
o Permission filters
Applications of Machine Learning
6. Virtual Personal Assistants
Virtual personal assistants include Google Assistant, Alexa, Cortana, and Siri.
These assistants record our voice instructions, send them to a server in the cloud, decode them using ML
algorithms, and act accordingly.

7. Automatic Language Translation

Google's GNMT (Google Neural Machine Translation) provides this feature.
It is a neural machine learning model that translates text into our familiar language; this is called automatic
translation.
Challenges

How good is a model?

How do I choose the model?

Do I have enough data?

Is the data of sufficient quality?
• Errors in data
• Missing values

How confident can I be of the results?

Am I describing the data correctly?
• Are age and income enough? Should I take gender also?
• How should I represent age? As a number, or as young, middle-aged, old?
Issues in Machine Learning
1. Poor Quality of Data
Noisy, incomplete, inaccurate, and unclean data lead to lower classification accuracy
and low-quality results.
2. Underfitting of Training Data
Whenever a machine learning model is trained with too little data, the resulting model is
incomplete and inaccurate, which destroys the accuracy of the machine learning model.
This generally happens when we have limited data in the data set and we try to build a linear
model with non-linear data.
Methods to reduce underfitting:
By increasing the training time of the model.
By increasing the number of features.
Issues in Machine Learning contd..
3. Overfitting of Training Data
Whenever a machine learning model is trained with a huge amount of data, it starts capturing the noise and
inaccuracies in the training data set. This negatively affects the performance of the model.
A common reason behind overfitting is that flexible non-linear methods can fit unrealistic data models.
Methods to reduce overfitting:
Cross-validation
Training with more data
Removing features
Early stopping of training
Regularization
Ensembling
Overfitting & Underfitting

Overfitting: the model fits the training set well, but may go wrong on the test set.
Underfitting: the model is not good for the training set or the test set.
Real-life Example of Overfitting and Underfitting
Task: identify whether an object is a ball or not.
Parameters:
Sphere - this feature checks whether the object has a spherical shape.
Play - this feature checks whether one can play with it.
Eat - this feature checks whether one can eat it.
Radius = 5 cm - this feature checks whether the object's size is 5 cm or less.
Overfitting case:
If an object (a ball) with a 10 cm radius is passed to the classifier, it will classify it as "not a ball", because the
classifier is too specific about the feature values.
Underfitting case:
If an object (an orange) is passed to the classifier, it will classify it as a ball, because the model is too
generalized, with too few parameters: if the object is spherical in shape, it is a ball.
Example:
Sphere   Play   Eat   Radius   Class
yes      yes    no    5        Ball
yes      yes    no    3        Ball
yes      no     yes   5        Fruit
yes      no     yes   10       Fruit
yes      yes    no    10       Ball

Underfitting: the model is built on two parameters only (Sphere & Radius), so when a fruit is passed to the
model it may classify it as a ball.
Overfitting: the model is specific about all parameter values (yes, yes, no, 5), so when a ball with radius 10 is
passed it may classify it as a fruit.
Bias / Variance
Bias: the difference between predicted value and actual value.
High bias: the difference is large.
Variance: how the predicted values are scattered with
respect to each other.
Low variance: the values are not much scattered; they are
grouped together.
Issues in Machine Learning contd..
4. Lack of Training Data
We need to ensure that machine learning algorithms are trained with sufficient amounts of data.
5. Imperfections in the Algorithm When Data Grows
As data grows, a model that worked earlier may degrade, so regular monitoring and maintenance are
needed to keep the algorithm working. This is one of the most exhausting issues faced by machine learning
professionals.
Steps in developing an ML application:
1. Collect data
2. Prepare the input data
3. Analyze the input data
4. Train the algorithm
5. Test the algorithm
6. Use it
7. Evaluate the algorithm (accuracy testing)
8. Deploy the model into applications

How to choose the right algorithm
Size of the training data

Accuracy and/or Interpretability of the output

Speed or Training time

Linearity

Number of features
The main points to consider when trying to solve a
new problem are:
Define the problem. What is the objective of the
problem?

Explore the data and familiarise yourself with it.

Start with basic models to build a baseline, and
then try more complicated methods.
Chapter 2: Data Preprocessing



Why Data Preprocessing?
• Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain attributes of interest, or


containing only aggregate data

• e.g., occupation=“ ”

• noisy: containing errors or outliers

• e.g., Salary=“-10”

• inconsistent: containing discrepancies in codes or names

• e.g., Age=“30” Birthday=“03/07/1980”


• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records



Why Is Data Dirty?

• Incomplete data may come from


• “Not applicable” data value when collected
• Different considerations between the time when the data was collected
and when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning



Why Is Data Preprocessing Important?

• No quality data, no quality mining results!


• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the
majority of the work of building a data warehouse



Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data



Forms of Data Preprocessing
[Figure: forms of data preprocessing – omitted]


Data Cleaning

• Importance
• “Data cleaning is one of the three biggest problems in data
warehousing”
• Data Cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration



How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing

• Fill in the missing value manually: tedious + infeasible?

• Fill in it automatically with


• a global constant : e.g., “unknown”
• the attribute mean
• the attribute mean for all samples belonging to the same
class: smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
Noisy Data

• Noise: random error or variance in a measured variable

• Incorrect attribute values may be due to:

• faulty data collection instruments


• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention



How to Handle Noisy Data?

• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.

• Regression
• smooth by fitting the data into regression functions

• Clustering
• detect and remove outliers



Binning method
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well



Binning Methods for Data Smoothing (Example)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Q) Suppose a group of 12 sales price records has been
sorted as follows:

5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215;

Partition them into three bins by each of the following


methods.

(a) equal-frequency partitioning


(b) equal-width partitioning
(c) clustering



(a) equal-frequency partitioning

bin 1 5,10,11,13
bin 2 15,35,50,55
bin 3 72,92,204,215

(b) equal-width partitioning

The width of each interval is (215 - 5)/3 = 70.


bin 1 5,10,11,13,15,35,50,55,72 (5 to 75)
bin 2 92 (76 to 146)
bin 3 204,215 (147 to 217)
(c) clustering
We will use a simple clustering technique:
Partition the data along the 2 biggest gaps in the data.

bin 1 5,10,11,13,15
bin 2 35,50,55,72,92
bin 3 204,215

( ex. 5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215;)
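For reference, the equal-frequency and equal-width partitions above can be reproduced with pandas (a sketch; qcut and cut choose numerically equivalent boundaries):

import pandas as pd

prices = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-frequency (equal-depth): three bins of four values each
print(pd.qcut(prices, q=3).value_counts().sort_index())

# Equal-width: each bin spans (215 - 5) / 3 = 70 units
print(pd.cut(prices, bins=3).value_counts().sort_index())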



Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales,
e.g., metric vs. British units



Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple


databases
• Object identification: The same attribute or object may have
different names in different databases
• Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality



Redundancy and Correlation Analysis

• Some redundancy can be detected by correlation analysis.

• Such analysis can measure how strongly one attribute implies the
other, based on the available data.

• For nominal data: the χ² (chi-square) test

• For numeric data: correlation coefficient and covariance



Correlation Analysis (Categorical Data)
• χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\text{Expected} = \frac{\text{count}(A) \times \text{count}(B)}{n}$$

• The larger the χ² value, the more likely the variables are
related.



Chi-Square Calculation: An Example

                            Play chess       Not play chess     Sum (row): A
Like science fiction        250 (exp: 90)    200 (exp: 360)     450
Not like science fiction    50 (exp: 210)    1000 (exp: 840)    1050
Sum (col.): B               300              1200               1500 (n)

$$\text{expected} = \frac{\text{count}(A) \times \text{count}(B)}{n} = \frac{300 \times 450}{1500} = 90$$

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in
the group.
507.93 > 10.828

The calculated value is more than the tabulated (critical) value of χ².

So the null hypothesis — that "play chess" and "preferred reading" are
independent (not related) — is rejected, and we conclude that the two
attributes are strongly correlated for the given group of people.
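The same χ² value can be checked with SciPy (a sketch; correction=False reproduces the plain formula above, without Yates' continuity correction):

from scipy.stats import chi2_contingency

table = [[250, 200],   # like science fiction: play chess / not
         [50, 1000]]   # not like science fiction

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2)      # ~507.93
print(expected)  # [[90, 360], [210, 840]]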



Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment
coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}$$

• If $r_{A,B} > 0$, A and B are positively correlated (A's values
increase as B's do). The higher the value, the stronger the
correlation.

• $r_{A,B} = 0$: independent;

• $r_{A,B} < 0$: negatively correlated



Covariance (Numeric Data)

$$\mathrm{Cov}(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$$

where $\bar{A}$ and $\bar{B}$ are the respective means (expected values) of A
and B.

Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than
their expected values together.

Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected
value, B is likely to be smaller than its expected value.

Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true:
some pairs of random variables may have a covariance of 0 without being
independent. Only under additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence.



Covariance: An Example
• The computation can be simplified as: Cov(A,B) = E(A·B) − Ā·B̄

Example:

• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

• Question: if the stocks are affected by the same industry trends, will their
prices rise or fall together?

Day         Stock A   Stock B
Monday      2         5
Tuesday     3         8
Wednesday   5         10
Thursday    4         11
Friday      6         14

• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A,B) > 0.
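Checking the result with NumPy (a sketch; bias=True makes np.cov divide by n, matching the population covariance used above):

import numpy as np

A = np.array([2, 3, 5, 4, 6])      # stock A, Monday..Friday
B = np.array([5, 8, 10, 11, 14])   # stock B

# E(A·B) - E(A)·E(B), the simplified computational form used above
print(np.mean(A * B) - A.mean() * B.mean())   # 4.0

# np.cov with bias=True divides by n (population covariance)
print(np.cov(A, B, bias=True)[0, 1])          # 4.0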
Data Transformation

• A function that maps the entire set of values of a given attribute to a new set
of replacement values so that each old value can be identified with one of
the new values
• Methods
• Smoothing: Remove noise from data
• Techniques include binning, regression, clustering

• Attribute/feature construction
• New attributes constructed from the given ones

• Aggregation: Summarization, data cube construction


• e.g., daily sales data aggregated into monthly and annual totals
Data Transformation

• Normalization: attribute data are scaled to fall within a smaller range


such as (-1.0 to 1.0) or (0.0 to 1.0)
• min-max normalization
• z-score normalization
• normalization by decimal scaling

• Discretization: raw values of numeric attributes (e.g. age) are replaced


by interval labels (0-10, 11-20) or conceptual labels( youth, adult, senior)
resulting in Concept hierarchy for the numeric attributes

• Concept hierarchy generation for nominal data:


• Attribute such as street can be generalized to higher level concept, like
city or country

Normalization

• Min-max normalization: maps a value v of attribute A to a new
value v' in the range [new_min_A, new_max_A] by computing:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

• Ex. Let the min and max values for the attribute income be
$12,000 and $98,000 respectively, and map income to the range
[0.0, 1.0]. Then the value $73,600 for income is transformed as:

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
Normalization

• Z-score normalization: the values of attribute A are normalized
based on the mean (μ) and standard deviation (σ) of attribute A.
The formula is:

$$v' = \frac{v - \mu_A}{\sigma_A}$$

• Ex. Let μ = 54,000 and σ = 16,000 for attribute income. With
z-score normalization, the value $73,600 for income is
transformed to:

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
Normalization

• Normalization by decimal scaling: transforms a value by moving
the decimal point of the value of attribute A. The number of
decimal places moved depends on the maximum absolute value
of A. The formula is:

$$v' = \frac{v}{10^j}$$  where j is the smallest integer such that Max(|v'|) < 1

• Eg. Let the range of attribute A be -986 to 917. To normalize by
decimal scaling, divide each value by 1000 (i.e., j = 3).
Therefore -986 is normalized to -0.986 and 917 is normalized to
0.917.
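The three methods applied to the income example above, as a small NumPy sketch (illustrative only):

import numpy as np

v = 73600.0
min_a, max_a = 12000.0, 98000.0
mu, sigma = 54000.0, 16000.0

# Min-max normalization to [0.0, 1.0]
print((v - min_a) / (max_a - min_a))   # ~0.716

# Z-score normalization
print((v - mu) / sigma)                # 1.225

# Decimal scaling: divide by 10^j so the max absolute value is < 1
values = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(values).max())))
print(values / 10**j)                  # [-0.986  0.917]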
Example
Use the two methods below to normalize the following
group of data:

200; 300; 400; 600; 1000

(a) min-max normalization by setting min = 0 and max = 1

(b) z-score normalization



Using the data for age given in Exercise 2.4, answer the following:

(a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0]

(b) Use z-score normalization to transform the value 35 for age, where the
standard deviation of age is 12.94 years.

(c) Use normalization by decimal scaling to transform the value 35 for age.



Concept hierarchy for numeric attributes (Discretization)

Concept hierarchy for attribute price (interval labels):
$0…$1000 splits into $0…$200, $200…$400, …, $800…$1000,
and each of those splits further into $100 intervals, e.g. $0…$100, $100…$200, …, $800…$900, $900…$1000.

Concept hierarchy for attribute age (conceptual labels):
youth → adult → senior
Concept Hierarchy Generation for Nominal Data
1. Specification of a partial/total ordering of attributes explicitly
at the schema level by users or experts.
• street < city < state < country

2. Specification of a hierarchy for a set of values by explicit data


grouping
• {Punjab, Haryana, Delhi} < North_India
• {Karnataka, Tamilnadu, kerala} < South_India

3. Automatic generation of hierarchies (or attribute levels) by the


analysis of the number of distinct values

Automatic Concept Hierarchy Generation

• Some hierarchies can be automatically generated based on


the analysis of the number of distinct values per attribute in
the data set

• The attribute with the most distinct values is placed at the


lowest level of the hierarchy

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values


Data Reduction
• Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on
the complete data set

• Data reduction
• Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same)
analytical results

Data reduction strategies

• Data cube aggregation:
  • e.g., total sales per year instead of per quarter

• Dimensionality reduction: the original data is transformed into a
smaller space; encoding methods are used.
  • e.g., wavelet transform and principal component analysis

• Attribute subset selection: remove unimportant (irrelevant,
redundant, weakly relevant) attributes.
  • e.g., for a new CD purchase, the customer's phone number is
irrelevant.


Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of


data representation
• Parametric methods
• Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
• Example: Regression and Log-linear models—obtain value at
a point in m-D space as the product on appropriate marginal
subspaces
• Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering, sampling



Data Reduction Method (1): Regression and
Log-Linear Models
• Linear regression: data are modeled to fit a straight line

y = wx + b

where y and x are numerical database attributes, and
w and b are regression coefficients.

• Multiple regression: an extension of linear regression where y is
modeled as a linear function of two or more predictor variables.

• Log-linear models: estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on
smaller subsets of dimensional combinations.


Data Reduction Method (2): Histograms

• A histogram for an attribute A partitions the data distribution of A into
disjoint subsets, or buckets.
• Partitioning rules:
  • Equal-width: equal bucket range
  • Equal-frequency (or equal-depth)
  • V-optimal: the histogram with the least variance (histogram variance
is a weighted sum of the original values that each bucket represents)
  • MaxDiff: set bucket boundaries between each pair of adjacent values
among the β−1 largest differences

[Figure: equal-width histogram of prices, x-axis $10,000–$90,000 – omitted]
Data Reduction Method (3): Clustering

• Partition the data set into clusters based on similarity, and store only the
cluster representations (e.g., centroid and diameter)

• Can be very effective if the data is clustered, but not if the data is "smeared"

• Can use hierarchical clustering, stored in multi-dimensional index
tree structures

• There are many choices of clustering definitions; clustering algorithms
will be studied later on.


Data Reduction Method (4): Sampling

• Sampling: obtaining a small sample s to represent the whole


data set N
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Choose a representative subset of the data
• Simple random sampling may have very poor performance in
the presence of skew
• Develop adaptive sampling methods
• Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the overall
database
• Used in conjunction with skewed data

• Note: Sampling may not reduce database I/Os (page at a time)
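A brief sampling sketch (not from the slides; the arrays are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = np.arange(100)

print(rng.choice(data, size=10, replace=False))  # simple random sample without replacement
print(rng.choice(data, size=10, replace=True))   # with replacement

# Stratified sampling: preserve class proportions, useful for skewed data
y = np.array([0] * 90 + [1] * 10)
_, sample_idx = train_test_split(np.arange(100), test_size=0.2, stratify=y, random_state=0)
print(y[sample_idx].mean())  # ~0.1, matching the population proportion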



Sampling: with or without Replacement
[Figure: raw data sampled with and without replacement – omitted]

Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. cluster/stratified sample – omitted]


Outlier Analysis
What Are Outliers?
• Outlier: A data object that deviates significantly from the normal objects
• Ex.: Unusual credit card purchase

• Outliers are different from the noise data


• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection

• Outliers are interesting: they violate the mechanism that generates the normal data

• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Medical analysis

Types of Outliers

Three kinds: global, contextual and collective outliers

• Global outlier (or point anomaly)
  • An object O_g is a global outlier if it significantly deviates from the rest of the data set

• Contextual outlier (or conditional outlier)
  • An object O_c is a contextual outlier if it deviates significantly with respect to a selected context

• Collective Outliers
• A subset of data objects collectively deviate significantly from the
whole data set, even if the individual data objects may not be outliers
• Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service
packages to each other

Outlier Detection Methods

• Two ways to categorize outlier detection methods:


• Based on whether expert-labeled examples of
outliers can be obtained:
• Supervised
• semi-supervised
• unsupervised methods

• Based on assumptions about normal data and


outliers:
• Statistical
• proximity-based
• clustering-based methods
Outlier Detection Methods
Supervised Methods
• Experts label the normal objects, and any object not matching
the model of normal objects is reported as an outlier
• A classifier model is used

Unsupervised Methods
• Objects labeled as normal or outlier are not available
• Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
• An outlier is expected to be far away from any group of normal
objects
Outlier Detection Methods

Semi-Supervised Methods
• Small set of data (normal objects/outliers)are labeled, but most of
the data are unlabeled

• Labeled normal data are used together with nearby unlabeled objects to build
a model for normal objects

• The model then can be used to detect the outlier ( objects not fitting
the model of normal objects are classified as outliers)
Outlier Detection Methods

• Statistical Methods
• Proximity-Based Methods
• Clustering-Based Methods

Outlier Detection: Statistical Methods
• Statistical methods (also known as model-based methods) assume that
the normal data follow some statistical model (a stochastic model)
• Data not following the model are outliers.

• Example: first use a Gaussian distribution to model the normal data
  • For each object y in region R, estimate g_D(y), the probability
that y fits the Gaussian distribution
  • If g_D(y) is very low, y is unlikely to be generated by the
Gaussian model, and is thus an outlier
Outlier Detection: Proximity-Based Methods
• An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object deviates significantly from
the proximity of most of the other objects in the same data set

• Example: model the proximity of an object using its 3 nearest neighbors
  • Objects in region R are substantially different from other objects in the data set.
  • Thus the objects in R are outliers
Proximity-Based Approaches: Distance-Based vs. Density-
Based Outlier Detection

• Two types of proximity-based outlier detection methods

• Distance-based outlier detection: An object O is an outlier


if its neighborhood does not have enough other points

• Density-based outlier detection: An object O is an outlier


if its density is relatively much lower than that of its
neighbors

Outlier Detection: Clustering-Based Methods

• Normal data belong to large and dense clusters, whereas outliers
belong to small or sparse clusters, or do not belong to any cluster

• Example: two clusters
  • All points not in R form a large cluster
  • The two points in R form a tiny cluster, and are thus outliers
Outlier detection and visualization: Scatter plot
[Figure: scatter plot with outlying points highlighted – omitted]
Outlier detection and visualization: Box Plot

• Box plots: the box plot is another very simple visualization tool for detecting
outliers; it uses the interquartile range (IQR) technique.


Quartiles
• Example 1: Find the first and third quartiles of the data
set {3, 7, 8, 5, 12, 14, 21, 13, 18}.
• Total numbers in set=9
• First, write data in increasing order: 3, 5, 7, 8, 12, 13, 14,
18, 21.
• The median is 12.
• The first quartile, Q1, is the median of {3, 5, 7, 8}=6
• The third quartile, Q3, is the median of {13,14,18,21}=16
Quartiles
• Quartiles are the values (Q1, Q2, Q3) that divide a list
of numbers into quarters, or four parts.
Quartiles

• Example 2: Find the first and third quartiles of the set {3, 7, 8, 5, 12,
14, 21, 15, 18, 14}.

• Sorted: 3, 5, 7, 8, 12, 14, 14, 15, 18, 21.
• Median (Q2) is 13 (the mean of 12 and 14)
• Q1 = 7 (median of the lower half {3, 5, 7, 8, 12})
• Q3 = 15 (median of the upper half {14, 14, 15, 18, 21})
Boxplot Example 1

Example: A sample of 10 boxes of raisins has these weights (in


grams):

25, 28, 29, 29, 30, 34, 35, 35, 37, 38


Make a box plot of the data
Solution:
Step 1: Order the data from smallest to largest.
Step 2: find the 5 number summary
(minimum, first quartile, median, third quartile, and
maximum)
( min, Q1,Q2,Q3,max)
(25,29,32,35,38)
Five number summary:
(25,29,32,35,38)
Outliers

• If a data value is very far away from the quartiles (either


much less than Q1 or much greater than Q3), it is sometimes
designated an outlier .
• The standard definition for an outlier is a number which is
less than Q1 or greater than Q3 by more than 1.5 times
the interquartile range
• IQR=Q3−Q1
• That is, an outlier is any number
less than Q1−(1.5×IQR) or
greater than Q3+(1.5×IQR)
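The 1.5 × IQR rule as a short sketch (the data, with 70 appended as an artificial outlier, is an assumption for illustration):

import numpy as np

data = np.array([25, 28, 29, 29, 30, 34, 35, 35, 37, 38, 70])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [70]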
Feature Selection Techniques in Machine
Learning
• While building a machine learning model for a real-life dataset, we
come across a lot of features, and not all of these features
are important every time.
• Adding unnecessary features while training the model can
reduce the overall accuracy of the model, increase the complexity of
the model, decrease the generalization capability of the model,
and make the model biased.
• The saying "sometimes less is better" applies to
machine learning models as well. Hence, feature selection is one of the
important steps in building a machine learning model.
• Its goal is to find the best possible set of features for building a
machine learning model.


Feature Selection Techniques in Machine Learning

• Feature selection is the process of selecting the subset of the


relevant features and leaving out the irrelevant features present in a
dataset to build a model of high accuracy.

• In other words, it is a way of selecting the optimal features from the


input dataset.

• Three methods are used for feature selection:

1. Filter methods
2. Wrapper methods
3. Embedded methods



1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only
the relevant features is taken. Some common techniques of filters
method are:

• Correlation:
Pearson’s Correlation Coefficient is a measure of quantifying the
association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Chi-Square Test:
Chi-square method (X2) is generally used to test the relationship
between categorical variables. It compares the observed values from
different attributes of the dataset to its expected value.



1. Filter Methods
• Variance Threshold – It is an approach where all features are
removed whose variance doesn’t meet the specific threshold. By
default, this method removes features having zero variance. The
assumption made using this method is higher variance features are
likely to contain more information.

• Mean Absolute Difference (MAD) – This method is similar to


variance threshold method but the difference is there is no square
in MAD. This method calculates the mean absolute difference from
the mean value.
• Information Gain: It is defined as the amount of information
provided by the feature for identifying the target value and
measures reduction in the entropy values. Information gain of
each attribute is calculated considering the target values for
feature selection.
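A minimal filter-method sketch using scikit-learn's VarianceThreshold (the toy matrix is an assumption):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1, 10],
              [0, 2, 20],
              [0, 3, 30]])   # first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features (the default)
print(selector.fit_transform(X))             # the constant column is removed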



2. Wrapper Methods
• The wrapper method has the same goal as the filter method, but it
uses a machine learning model for its evaluation.
• In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or
remove those features to increase the accuracy of the model.
• This method is more accurate than the filter method but more complex to
work with.
Some common techniques of wrapper methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination


• Forward selection – an iterative approach where we
initially start with an empty set of features and keep adding the feature
that best improves the model after each iteration. The stopping
criterion is when the addition of a new variable no longer improves the
performance of the model (a sketch follows below).
• Backward elimination – also an iterative approach, where we
initially start with all features and, after each iteration, we
remove the least significant feature. The stopping criterion is when no
improvement in the performance of the model is observed after a
feature is removed.
• Bi-directional elimination – uses the forward selection
and backward elimination techniques simultaneously to reach one
unique solution.
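A forward-selection sketch using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+; the estimator and dataset are illustrative choices):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward")  # use "backward" for elimination
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features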



3. Embedded Methods:
• Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each
feature.
Some common techniques of Embedded methods are:
• LASSO
• Elastic Net
• Ridge Regression, etc.



• Regularization – this method adds a penalty on the parameters
of the machine learning model to avoid over-fitting.
This approach to feature selection uses Lasso (L1 regularization) and
Elastic Net (L1 and L2 regularization). The penalty is applied to the
coefficients, driving some coefficients to zero; the
features with zero coefficients can be removed from the dataset.
• Tree-based methods – methods such as Random Forest and
Gradient Boosting provide feature importances that can be used to select
features as well. Feature importance tells us which features have more
impact on the target feature.
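A small embedded-method sketch: Lasso's L1 penalty drives some coefficients to exactly zero, and those features can be dropped (dataset and alpha are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
print("Coefficients:", np.round(lasso.coef_, 2))
print("Kept features:", np.flatnonzero(lasso.coef_))  # zero-coefficient features can be removed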



Feature Extraction techniques

• Feature Extraction aims to reduce the number of features in a


dataset by creating new features from the existing ones (and then
discarding the original features).
• These new reduced set of features should then be able to summarize
most of the information contained in the original set of features.
• In this way, a summarised version of the original features can be
created from a combination of the original set.
Techniques:
• PCA (Principle Components Analysis)
• ICA (Independent Component Analysis)
• LDA (Linear Discriminant Analysis)
• Autoencoders



Autoencoder
• Autoencoders are a family of Machine Learning algorithms which can
be used as a dimensionality reduction technique.
• The main difference between Autoencoders and other dimensionality
reduction techniques is that Autoencoders use non-linear
transformations to project data from a high dimension to a lower
one.
• There exist different types of Autoencoders, such as:
1. Denoising Autoencoder
2. Variational Autoencoder
3. Convolutional Autoencoder
4. Sparse Autoencoder



Autoencoder

1. Encoder: takes the input data and
compresses it, removing all the
possible noise and unhelpful
information. The output of the encoder
stage is usually called the bottleneck or
latent space.

2. Decoder: takes the encoded
latent space as input and tries to reproduce the
original autoencoder input using just its
compressed form (the encoded latent
space).



Principal Components Analysis (PCA)

• PCA is one of the most widely used linear dimensionality reduction
techniques.

• When using PCA, we take our original data as input and try to find a
combination of the input features which best summarizes the
original data distribution, so as to reduce its original dimensions.

• PCA does this by maximizing variances and minimizing the
reconstruction error, by looking at pairwise distances.
• In PCA, the original data is projected onto a set of orthogonal axes,
and each of the axes is ranked in order of importance.
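A short PCA sketch with scikit-learn (the Iris data and two components are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 4 features -> 2 principal components
print(pca.explained_variance_ratio_)    # variance captured by each axis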



Independent Component Analysis (ICA)

• ICA is a linear dimensionality reduction method which takes as input a
mixture of independent components and aims to correctly identify each of
them (deleting all the unnecessary noise).
• Two input features can be considered independent if both their linear and
nonlinear dependence is equal to zero.
• Independent Component Analysis is commonly used in medical applications
such as EEG and fMRI analysis to separate useful signals from unhelpful
ones.
• As a simple example of an ICA application, consider an
audio recording in which two different people are talking.
• Using ICA we could, for example, try to identify the two
independent components in the recording (the two different speakers).
• In this way, an unsupervised learning algorithm could distinguish
between the different speakers in the conversation.


Linear Discriminant Analysis (LDA)

• LDA aims to maximize the distance between the means of the classes
and minimize the spread within each class itself.
• LDA therefore uses within-class and between-class scatter as measures.
This is a good choice because maximizing the distance between the
means of the classes when projecting the data into a lower-dimensional
space can lead to better classification results.


Chapter 3
Supervised Learning with Regression
Regression

• Is Supervised or Unsupervised?

• What is the basic requirement of Supervised learning?

• What is Regression?

• Regression is a supervised machine learning technique


which is used to predict continuous values.
Which of the following is a regression task?

1. Predicting the age of a person
2. Predicting the nationality of a person
3. Predicting whether the stock price of a company will increase tomorrow
4. Predicting whether a document is related to sightings of UFOs
Linear Regression

• One of the most basic types


of regression in machine
learning.
• It consists of a
predictor/independent
variable and a dependent
variable related linearly to
each other.
Linear Regression

SUBJECT AGE X GLUCOSE LEVEL Y

1 43 99

2 21 65

3 25 79

4 42 75

5 57 87

6 59 81

7 55 ?
Linear Regression-Step I

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY   X²   Y²

1 43 99 4257 1849 9801

2 21 65 1365 441 4225

3 25 79 1975 625 6241

4 42 75 3150 1764 5625

5 57 87 4959 3249 7569

6 59 81 4779 3481 6561

Σ 247 486 20485 11409 40022


Linear Regression - Step II
Find the slope b1:

b1 = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
   = (6 × 20485 − 247 × 486) / (6 × 11409 − 247²)
   = 2868 / 7445
   = 0.385225

Linear Regression - Step III
Find the intercept b0:

b0 = (Σy − b1·Σx) / n
   = (486 − 0.385225 × 247) / 6
   = 65.14


Linear Regression - Step IV
Insert the values into the equation:

y' = b0 + b1 · x

y' = 65.14 + (0.385225 · x)


Linear Regression - Step V

Prediction – the value of y for the given value of x = 55:

y' = 65.14 + (0.385225 · x)
y' = 65.14 + (0.385225 × 55)
y' = 86.327

SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
7         55      86.327
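The same fit can be reproduced with NumPy's least-squares polynomial fit (a sketch, using the table above):

import numpy as np

x = np.array([43, 21, 25, 42, 57, 59])   # age
y = np.array([99, 65, 79, 75, 87, 81])   # glucose level

b1, b0 = np.polyfit(x, y, deg=1)          # slope, intercept
print(b0, b1)                             # ~65.14, ~0.3852
print(b0 + b1 * 55)                       # ~86.33, predicted level at age 55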
Important points about LR
1. It is more susceptible to outliers; hence, it should not be used for big-size data.
2. There should be a linear relationship between the independent and
dependent variables.
3. There is only one independent variable and one dependent variable.
4. The type of regression line: a best-fit straight line.
Advantages and Disadvantages of LR

Advantages:
• Linear regression performs exceptionally well for linearly separable data
• Easier to implement and interpret, and efficient to train
• It handles overfitting pretty well using dimensionality reduction techniques, regularization, and cross-validation

Disadvantages:
• The assumption of linearity between dependent and independent variables
• It is often quite prone to noise and overfitting
• Linear regression is quite sensitive to outliers; hence, it should not be used for big-size data
Use Case – Implementing Linear Regression
1.Loading the Data
2.Exploring the Data
3.Slicing The Data
4.Train and Split Data
5.Generate The Model
6.Evaluate The accuracy
Multiple linear regression
• is used to estimate the relationship between two or more
independent variables and one dependent variable
• Example:
• 1 The selling price of a house can depend on the desirability of the
location, the number of bedrooms, the number of bathrooms, the
year the house was built, the square footage of the lot and a number
of other factors
• 2 The height of a child can depend on the height of the mother, the
height of the father, nutrition, and environmental factors.
Multiple linear regression
The simplest multiple regression model, for two predictor variables, is
y = β0 + β1x1 + β2x2 + ε
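A two-predictor sketch with scikit-learn (the house-price numbers are made-up illustrative values):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: [square footage, number of bedrooms]
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 405000])  # made-up prices

model = LinearRegression().fit(X, y)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)
print("Prediction for 2000 sq ft, 4 bedrooms:", model.predict([[2000, 4]]))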
Linear Regression

• Linear Regression is a supervised machine learning algorithm.

A) TRUE
B) FALSE
Linear Regression

• Linear Regression is mainly used for Regression.

• A) TRUE
• B) FALSE
Linear Regression

• Is it possible to design a linear regression algorithm using a neural
network?
• A) TRUE
• B) FALSE
• A) TRUE
• B) FALSE
Polynomial Regression

• It is also called the special case of Multiple Linear Regression.


• Because we add some polynomial terms to the Multiple Linear
regression equation to convert it into Polynomial Regression.
• It is a linear model with some modification in order to increase
the accuracy.
• The dataset used in Polynomial regression for training is of non-
linear nature.
• It makes use of a linear regression model to fit the complicated
and non-linear functions and datasets.
Need for Polynomial Regression

• If we apply a linear model to
a linear dataset, it gives us
a good result, as we have seen in
Simple Linear Regression,
• but if we apply the same model,
without any modification, to a non-
linear dataset, it produces a
drastically worse output.
• The loss function will
increase, the error rate will be high,
and accuracy will decrease.
• So for such cases, where data
points are arranged in a non-
linear fashion, we need the
Polynomial Regression model.
Equation of the Polynomial
Regression Model

• Simple Linear Regression equation:
  • y = b0 + b1x
• Multiple Linear Regression equation:
  • y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn
• Polynomial Regression equation:
  • y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ

• The Simple and Multiple Linear equations are
also polynomial equations of degree one, and the
Polynomial Regression equation is a linear
equation in the coefficients, with terms up to the
nth degree.
• If your data points clearly will not fit a linear regression
(a straight line through all data points), it might be
ideal for polynomial regression.
Example
• we have registered 18 cars as they were passing a certain tollbooth.
• We have registered the car's speed, and the time of day (hour) the passing
occurred.
• The x-axis represents the hours of the day and the y-axis represents the
speed:

• import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

plt.scatter(x, y)
plt.show()
Example
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))


myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
R-Squared
• It is important to know how strong the relationship between the
values on the x- and y-axes is;
• if there is no relationship, polynomial regression cannot be
used to predict anything.
• The relationship is measured with a value called r-squared.
• The r-squared value ranges from 0 to 1, where 0 means no
relationship and 1 means 100% related.
• Python and the sklearn module will compute this value for you;
all you have to do is feed it the x and y arrays:
R-Squared-Example
• How well does my data fit in a polynomial regression?

• import numpy
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x)))

• The result 0.94 shows that there is a very good relationship, and we can use polynomial
regression in future predictions.
Predict Future Values

• Now we can use the information we have gathered to predict future values.
• Example: Let us try to predict the speed of a car that passes the tollbooth at
around 17:00 (5 P.M.):

• import numpy
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

speed = mymodel(17)
print(speed)
Bad Fit? Example
• These values for the x- and y-axis should result in a very bad fit for polynomial
regression:
• import numpy
import matplotlib.pyplot as plt

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

myline = numpy.linspace(2, 95, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Bad Fit? Example, the r-squared value?
• These values for the x- and y-axis should result in a very bad fit for polynomial
regression:
• import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

myline = numpy.linspace(2, 95, 100)

print(r2_score(y, mymodel(x)))

• The result: 0.00995 indicates a very bad relationship, and tells us that this data set is not suitable
for polynomial regression.
Problem Description
• There is a Human Resource company, which is going to hire a new candidate. The candidate claims that his previous salary was 160K per annum, and HR has to check whether he is telling the truth or bluffing.
• So to identify this, they only have a dataset of his
previous company in which the salaries of the top 10
positions are mentioned with their levels.
• By checking the dataset available, we have found that
there is a non-linear relationship between the
Position levels and the salaries.
• Our goal is to build a Bluffing detector
regression model, so HR can hire an honest candidate.
Below are the steps to build such a model.
Multiple Regression
• Python Machine Learning Multiple Regression (w3schools.com)
Linear Regression Use Cases
• Sales of a product; pricing, performance, and risk parameters
• Generating insights on consumer behavior, profitability, and other business
factors
• Evaluation of trends; making estimates, and forecasts
• Determining marketing effectiveness, pricing, and promotions on sales of a
product
• Assessment of risk in financial services and insurance domain
• Studying engine performance from test data in automobiles
• Calculating causal relationships between parameters in biological systems
• Conducting market research studies and customer survey results analysis
• Astronomical data analysis
• Predicting house prices with the increase in sizes of houses
Regularization in Machine Learning
Regularization in Machine Learning
• Regularization is a technique used to reduce the errors by fitting
the function appropriately on the given training set and avoid
overfitting.
• It mainly regularizes or reduces the coefficient of features
toward zero.
• In simple words, "In regularization technique, we reduce the magnitude of the features by keeping the same number of features."
• Hence, it maintains accuracy as well as a generalization of the
model.
How does Regularization Work?
• Regularization works by adding a penalty or complexity term or shrinkage term with
Residual Sum of Squares (RSS) to the complex model.
• Let’s consider the Simple linear regression equation:
• Here Y represents the dependent feature or response which is the learned relation.
Then,
• Y is approximated to β0 + β1X1 + β2X2 + …+ βpXp
• Here, X1, X2, … Xp are the independent features or predictors for Y, and
• β0, β1, …, βp represent the coefficient estimates for the different variables or predictors (X), which describe the weights or magnitudes attached to the features, respectively.
• In simple linear regression, our optimization function or loss function is known as
the residual sum of squares (RSS).
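For reference (a standard definition, not shown explicitly on the slide), the RSS for this model can be written as:

RSS = Σ (yi - ŷi)^2 = Σ (yi - β0 - β1xi1 - … - βpxip)^2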
Techniques of Regularization
• Mainly, there are three types of regularization techniques, which are given below:
1. Ridge Regression
2. Lasso Regression
3. Dropout
• Ridge Regression
• Ridge regression is one of the types of linear regression in which we introduce a small amount of bias,
known as Ridge regression penalty so that we can get better long-term predictions.
• In Statistics, it is known as the L-2 norm.
• In this technique, the cost function is altered by adding the penalty term (shrinkage term), which multiplies the lambda with the squared weight of each individual feature. Therefore, the optimization function (cost function) becomes:

• Cost = Σ (yi - ŷi)^2 + λ Σ βj^2
• In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge
regression reduces the magnitudes of the coefficients that help to decrease the complexity of the
model.
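A minimal scikit-learn sketch of ridge regression (the synthetic data and the alpha value are illustrative assumptions; alpha plays the role of λ in the cost function above):

from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Illustrative synthetic regression data: 100 samples, 5 features
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# alpha is the penalty strength (the lambda in the cost function above)
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)   # coefficients are shrunken, but none is exactly zero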
Techniques of Regularization
• Lasso Regression
• Lasso regression is another variant of the regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
• It is similar to Ridge Regression except that the penalty term includes the absolute weights instead of the square of weights. Therefore, the optimization function becomes:
• Cost Function for Lasso Regression is

• Cost = Σ (yi - ŷi)^2 + λ Σ |βj|

• In statistics, it is known as the L-1 norm.


• In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly
equal to zero which means there is a complete removal of some of the features for model evaluation when
the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs Feature selection and
is said to yield sparse models.
Key Differences between Ridge and Lasso
Regression
• Ridge regression helps us to reduce only the overfitting in the model while
keeping all the features present in the model.
• It reduces the complexity of the model by shrinking the coefficients
whereas Lasso regression helps in reducing the problem of overfitting in
the model as well as automatic feature selection.
• Lasso Regression tends to shrink some coefficients exactly to zero, whereas Ridge regression never sets the value of a coefficient to exactly zero.
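A small sketch contrasting the two penalties (the synthetic data and alpha values are illustrative assumptions); Lasso should drive the coefficients of the uninformative features to exactly zero, while Ridge only shrinks them:

from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Only 2 of the 10 features are informative (an illustrative assumption)
X, y = make_regression(n_samples=100, n_features=10, n_informative=2,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print(ridge.coef_)   # all coefficients shrunken but non-zero
print(lasso.coef_)   # several coefficients exactly 0: built-in feature selection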
Dropout
• Dropout is a regularization
technique used in neural
networks.
• It prevents complex co-adaptations between neurons.
• In neural nets, fully connected layers are more prone to overfitting on training data.
• Using dropout, connections are dropped with probability 1 - p for each of the specified layers, where p is the keep-probability parameter, which needs to be tuned.
Dropout
• With dropout, you are left with a reduced
network as dropped out neurons are left out
during that training iteration.
• Dropout decreases overfitting by avoiding
training all the neurons on the complete
training data in one go.
• It also improves training speed and learns
more robust internal functions that generalize
better on unseen data.
• However, it is important to note that Dropout
takes more epochs to train compared to
training without Dropout (If you have 10000
observations in your training data, then using
10000 examples for training is considered as 1
epoch).
• Along with Dropout, neural networks can be
regularized also using L1 and L2 norms.
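A minimal numpy sketch of (inverted) dropout, assuming the common convention of scaling the surviving activations by 1/p so their expected value is unchanged at test time:

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.8, training=True):
    # During training, zero each unit with probability 1 - p_keep and
    # scale the survivors by 1/p_keep; at inference, pass through unchanged.
    if not training:
        return activations
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

layer_out = np.array([0.5, 1.2, -0.3, 0.8, 2.0])
print(dropout(layer_out))                   # some entries zeroed (training)
print(dropout(layer_out, training=False))   # unchanged (inference)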
Evaluation Metrics for Regression Problems
Linear Regression

How accurate is our model?

There are 6 evaluation techniques:

1. M.A.E (Mean Absolute Error)


2. M.S.E (Mean Squared Error)
3. R.M.S.E (Root Mean Squared Error)
4. R.M.S.L.E (Root Mean Squared Log
Error)
5. R-Squared
6. Adjusted R-Squared
Evaluation Metrics for Regression Problems
M.A.E (Mean Absolute Error)
It is the simplest & most widely used evaluation technique.
It is simply the mean of the absolute differences between actual & predicted values.
Below is the mathematical formula of the Mean Absolute Error:

MAE = (1/n) Σ |yi - ŷi|
The Scikit-Learn is a great library, as it has almost all the inbuilt functions
Below is the code to implement Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)

Here, ‘y_true’ is the true target values & ‘y_pred’ is the predicted target values.
Evaluation Metrics for Regression Problems
M.S.E (Mean Squared Error)
It takes the average of the square of the error. Here, the error is the difference between actual & predicted values.
Below is the mathematical formula of the Mean Squared Error:

MSE = (1/n) Σ (yi - ŷi)^2
The Scikit-Learn is a great library, as it has almost all the inbuilt functions
Below is the code to implement Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

Here, ‘y_true’ is the true target values & ‘y_pred’ is the predicted target values.
Root Mean Squared Error (RMSE)

RMSE is the square root of the Mean Squared Error:
RMSE = √MSE = √( (1/n) Σ (yi - ŷi)^2 )
Root Mean Squared Log Error (RMSLE)

Taking the log of the values slows down the scale of the error:
we compute the RMSE on the log-transformed values, and the result is the RMSLE,
RMSLE = √( (1/n) Σ (log(ŷi + 1) - log(yi + 1))^2 )
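A short sketch computing both metrics (the y values are illustrative; sklearn's mean_squared_log_error uses log(1 + y) internally):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y_true = [3.0, 5.0, 7.5, 10.0]   # illustrative target values
y_pred = [2.5, 5.0, 8.0, 9.0]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(rmse, rmsle)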
R Squared (R2)
R squared, also known as the Coefficient of Determination (or sometimes Goodness of Fit), measures how much of the variance in the data the regression line explains:
R2 = 1 - (SS_res / SS_tot)
where SS_res is the sum of squared errors of the regression line and SS_tot is the sum of squared deviations from the mean line.
Now, how will you interpret the R2 score? Suppose the R2 score is zero. That happens when the regression line performs no better than the mean line: SS_res / SS_tot = 1, so 1 - 1 = 0. In this case the two lines overlap, the model performance is worst, and the model is not able to take advantage of the output column.
The second case is when the R2 score is 1, which happens when the division term is zero, i.e. the regression line makes no mistake at all; it is perfect. In the real world this is not possible.
So we can conclude that as our regression line moves towards perfection, the R2 score moves towards one, and the model performance improves.
The normal case is when the R2 score is between zero and one, like 0.8, which means your model is capable of explaining 80 per cent of the variance of the data.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(r2)
Adjusted R Squared

Adjusted R2 = 1 - ((1 - R2)(n - 1) / (n - k - 1))
where n is the number of observations and k is the number of features.

• The disadvantage of the R2 score is that when new features are added to the data, the R2 score increases or remains constant; it never decreases, because it assumes that every added feature explains at least as much of the variance of the data.
• But the problem is that when we add an irrelevant feature to the dataset, R2 sometimes still increases, which is incorrect.
• Hence, to control this situation, Adjusted R Squared came into existence.
• Now, as k increases by adding some features, the denominator (n - k - 1) decreases while n - 1 remains constant. The R2 score remains constant or increases only slightly, so the complete fraction increases, and when we subtract it from one, the resulting score decreases. This is the case when we add an irrelevant feature to the dataset.
• And if we add a relevant feature, the R2 score increases and (1 - R2) decreases heavily; the denominator also decreases, so the complete term decreases, and on subtracting it from one, the score increases.
Example:
n = 40
k = 2
# r2 is the R2 score computed earlier with r2_score
adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))
print(adj_r2_score)
Regularization
• Regularization: Make your Machine Learning Algorithms “Learn”, not
“Memorize” (einfochips.com)
• Regularization Techniques | Regularization In Deep Learning
(analyticsvidhya.com)
• Types of Regularization in Machine Learning | by Aqeel Anwar |
Towards Data Science
Gradient Descent in Linear Regression
Gradient Descent in Linear Regression contd..
• A linear regression model attempts to explain the relationship between a
dependent (output variables) variable and one or more independent
(predictor variable) variables using a straight line.
• This straight line is represented using the following formula-
y = mx +c
Where, y: dependent variable
x: independent variable
m: Slope of the line (for a unit increase in X, Y increases by m units)
c: y-intercept (the value of Y is c when the value of X is 0)
Gradient Descent in Linear Regression contd..

From the scatter plot we can see there is a linear relationship between Sales and marketing spend.

The next step is to find a straight line between Sales and Marketing that explains the relationship between them. But there can be multiple lines that can pass through these points.
Gradient Descent in Linear Regression contd..
So how do we know which of these lines is the best-fit line?
Cost Function:
The cost is the error in our predicted value. We will use the Mean Squared Error function to calculate the cost:

J(m, c) = (1/n) Σ (yi - (m*xi + c))^2
Gradient Descent in Linear Regression contd..
• Our goal is to minimize the cost as much as possible in order to find
the best fit line.
• We are not going to try all the permutation and combination of m
and c (inefficient way) to find the best-fit line.
• For that, we will use Gradient Descent Algorithm.
• Gradient Descent is an algorithm that finds the best-fit line for a given
training dataset in a smaller number of iterations.
Gradient Descent in Linear Regression contd..

If we plot m and c against MSE, it will acquire a bowl


shape.
For some combination of m and c, we will get the least Error
(MSE). That combination of m and c will give us our best fit
line.

The algorithm starts with some values of m and c (usually m=0, c=0). We calculate the MSE (cost) at the point m=0, c=0. Let's say the MSE (cost) at m=0, c=0 is 100.
Then we adjust the values of m and c by a small amount (the learning step) and notice a decrease in the MSE (cost).
We continue doing the same until our loss function is a very small value or ideally 0 (which means 0 error or 100% accuracy).
Step by Step Algorithm:

1. Let m = 0 and c = 0. Let L be our learning rate. It could be a small


value like 0.01 for good accuracy.
• The learning rate controls how far the parameters move during each step of gradient descent. Setting it too high would make your path unstable; too low would make convergence slow. Setting it to zero means your model isn't learning anything from the gradients.
2. Calculate the partial derivative of the Cost function with respect to
m. Let partial derivative of the Cost function with respect to m be
Dm (With little change in m how much Cost function changes).
Gradient Descent in Linear Regression contd..

Dm = (-2/n) Σ xi (yi - ŷi), where ŷi = m*xi + c is the predicted value

Similarly, let's find the partial derivative with respect to c. Let the partial derivative of the Cost function with respect to c be Dc (with a little change in c, how much the Cost function changes):

Dc = (-2/n) Σ (yi - ŷi)
Gradient Descent in Linear Regression contd..
3. Now update the current values of m and c using the following equations:

m = m - L * Dm
c = c - L * Dc
4. We will repeat this process until our Cost function is very small
(ideally 0)
Gradient Descent Algorithm gives optimum values of m and c of the
linear regression equation. With these values of m and c, we will get
the equation of the best-fit line and ready to make predictions.
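Putting the steps together, a minimal numpy implementation of these updates (the data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

# Illustrative data: y is roughly 2x + 3 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

m, c = 0.0, 0.0   # step 1: start with m = 0, c = 0
L = 0.01          # learning rate
n = len(x)

for _ in range(10000):
    y_pred = m * x + c
    Dm = (-2 / n) * np.sum(x * (y - y_pred))   # step 2: dCost/dm
    Dc = (-2 / n) * np.sum(y - y_pred)         # dCost/dc
    m = m - L * Dm                             # step 3: update m and c
    c = c - L * Dc

print(m, c)   # should approach roughly 2 and 3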
Supervised Learning with
Classification
Decision Tree - Classification
• Decision tree builds classification models in the form of a tree
structure.
• It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is
incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches
• Leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
• Decision trees can handle both categorical and numerical
data.
Classification Model
What is node impurity/purity in decision trees?
• The decision tree is a greedy algorithm that performs a recursive binary partitioning of the
feature space.
• The tree predicts the same label for each bottommost (leaf) partition.
• Each partition is chosen greedily by selecting the best split from a set of possible splits.

Consider an example as the set of atoms in a metallic ball


• If all of the ball's atoms were gold - you would say that the ball is purely gold, and that its
purity level is highest (and its impurity level is lowest).
• Similarly, if all of the examples in the set were of the same class, then the set's purity
would be highest.
• If 1/3 of the atoms were gold, 1/3 silver, and 1/3 iron - you would say that for a ball made
of 3 kinds of atoms, its purity is lowest.
• Similarly, if the examples are split evenly between all of the classes, then the set's purity is
lowest.
• So the purity of a set of examples is the homogeneity of its examples - with regard to their
classes.
• Gini index is one of the popular measures of impurity
CART Algorithm
• CART Algorithm is an abbreviation of Classification And Regression Trees.
• Rather than general trees that can have multiple branches, CART makes use of a binary tree, which has only two branches from each node.
• CART uses Gini Impurity as the criterion to split a node, not Information Gain.
• CART supports numerical target variables, which enables it to become a Regression Tree that predicts continuous values.
• Unlike the ID3 and C4.5 algorithms that rely on Information Gain as the criterion to split nodes, the CART algorithm makes use of another criterion called Gini to split the nodes.
CART Algorithm
• The CART algorithm intuitively uses the Gini coefficient for a similar purpose: a larger Gini coefficient means larger impurity of the node.
• Similar to ID3 and C4.5 using Information Gain to select the node with more uncertainty, the Gini coefficient guides the CART algorithm to find the node with larger uncertainty (i.e. impurity) and then split it.
• Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified. For a node with class proportions p1, …, pk, Gini = 1 - Σ pi^2.
• This means an attribute with a lower Gini index should be preferred.
• Sklearn supports the "gini" criterion for the Gini Index and takes the "gini" value by default.
Example 1

Here we have data on sports, in which age and gender take part in the decision on 'what kind of person would play a ground game?'.

We will divide the data into binary groups: F or M, and age < 25 or age >= 25.
Solution
Gender    Sportive-yes   Sportive-no   Total
Female    5              2             7
Male      2              4             6

Age       Sportive-yes   Sportive-no   Total
< 25      5              1             6
>= 25     2              5             7
Find Gini index of attributes
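A small Python sketch of the calculation being asked for (the counts come from the two tables above; a lower weighted Gini index is better):

def gini(yes, no):
    # Gini impurity of one node: 1 - sum of squared class proportions
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(groups):
    # Weighted average of the node impurities, weighted by node size
    total = sum(yes + no for yes, no in groups)
    return sum((yes + no) / total * gini(yes, no) for yes, no in groups)

# Gender: Female (5 yes, 2 no), Male (2 yes, 4 no)
print(gini_split([(5, 2), (2, 4)]))   # about 0.43
# Age: < 25 (5 yes, 1 no), >= 25 (2 yes, 5 no)
print(gini_split([(5, 1), (2, 5)]))   # about 0.35

With these counts the Age split gives the lower weighted Gini index, so it would be chosen as the first split.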
Let's consider the dataset in the image below and draw a decision tree using the Gini index (Example 2).
• In the dataset above there are 5 attributes, of which attribute E is the target (predicted) feature, containing 2 classes (Positive & Negative). We have an equal proportion of both classes.
For the Gini Index, we have to choose some threshold values to categorize each attribute. These values for this dataset are:
Using the same approach we can calculate the Gini index for C and D attributes.
Example 3
(Target attribute)

Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional
training dataset.
• Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object.
• Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles
Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two words Naïve and
Bayes, Which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features.
Such as if the fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
• P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
• P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
• P(B) is Marginal Probability: Probability of Evidence.
Advantages/ Disadvantages of Naïve Bayes Classifier
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other
Algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


• Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Example 1: Naïve Bayes Classifier Example

Predict the class label for an unknown sample “X” using


Naïve Bayesian classification for a given dataset.

‘X’= (Outlook=Sunny, Temperature=Cool, Humidity=High,


Wind=Strong)
(Target attribute)

Learning phase:

P(Play=Yes) = 9/14      P(Play=No) = 5/14

Outlook     Play=Yes   Play=No
Sunny       2/9        3/5
Overcast    4/9        0/5
Rain        3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity   Play=Yes   Play=No
High       3/9        4/5
Normal     6/9        1/5

Wind     Play=Yes   Play=No
Strong   3/9        3/5
Weak     6/9        2/5
• Test Phase
– Given a new instance,
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
• MAP rule
P(x'|Yes) * P(Yes) = P(Outlook=Sunny | Yes) *
P(Temperature=Cool | Yes) *
P(Humidity=High | Yes) *
P(Wind=Strong | Yes) *
P(Yes)
= 2/9 * 3/9 * 3/9 * 3/9 * 9/14
= 0.0053
MAP rule:
P(x'|No) * P(No) = P(Outlook=Sunny | No) *
P(Temperature=Cool | No) *
P(Humidity=High | No) *
P(Wind=Strong | No) *
P(No)
= 3/5 * 1/5 * 4/5 * 3/5 * 5/14
= 0.0206

Given the fact that P(x'|Yes) * P(Yes) < P(x'|No) * P(No), we label X as "Play tennis = No".
- MAP rule
• P(x'|Yes) * P(Yes) = [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
• P(x'|No) * P(No) = [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given the fact that P(x'|Yes) * P(Yes) < P(x'|No) * P(No), we label x' as "No".
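As a quick check (not from the original slides), the same two products can be reproduced in Python:

# Reproducing the hand calculation above for the play-tennis example
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # P(x'|No) * P(No)

print(round(p_yes, 4))   # 0.0053
print(round(p_no, 4))    # 0.0206
print("Play =", "Yes" if p_yes > p_no else "No")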
Example 2: Naïve Bayesian classification Example

• Predict a class label of an unknown sample using Naïve Bayesian


classification on the following training dataset from all electronics
customer database.
• The unknown sample is
X' = {age = "<=30", Income = "medium", Student = "yes", Credit_rating = "fair"}
Age Income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
• P(x'|Yes) * P(Yes) = 0.028
• P(x'|No) * P(No) = 0.007
• Since 0.028 > 0.007, the naïve Bayesian classifier predicts buys_computer = "yes" for sample X'.
Surprise Test (20 marks)
ID Homeowner Status Income Defaulted
1 YES Employed High No
2 NO Business Average NO
3 NO Employed Low NO
4 YES Business High NO
5 NO Unemployed Average Yes
6 NO Business Low No
7 YES Unemployed High NO
8 NO Employed Average Yes
9 NO Business Low No
10 NO Employed Average Yes
Illustrate Decision tree and Naive Bayesian Classification techniques for the above
data set.
Show how we can classify a new tuple, with (Homeowner=yes; Status=Employed;
Income= Average)
K-Nearest Neighbor(KNN)
Algorithm for Machine Learning
Introduction
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar
to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
Introduction
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new
data.
• Example:
Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. The KNN model will find features of the new data set similar to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

• Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie?
• To solve this type of problem, we need a K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular
dataset.
How does K-NN work?

• Step-1: Select the number K of the neighbors


• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these k neighbors, count the number of the data
points in each category.
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
How to select the value of K in the K-NN
Algorithm?
• There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them. The most
preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to
the effects of outliers in the model.
• Large values for K are good, but a too-large K may include points from other classes and blur local patterns.
Advantages/ Disadvantages
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
• We always need to determine the value of K, which may sometimes be complex.
• The computation cost is high because of calculating the distance
between the data points for all the training samples.
Example KNN
Find the class label for given instance using KNN with K=5
Step 1: Find the distances
Step 2: Find the ranks
Step 3: Find the nearest neighbours to assign the class
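A minimal scikit-learn version of the same procedure (the training points are illustrative assumptions; the default metric is Euclidean distance):

from sklearn.neighbors import KNeighborsClassifier

# Illustrative 2-feature training data with two class labels
X_train = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5
knn.fit(X_train, y_train)

# The prediction is the majority class among the 5 nearest neighbours
print(knn.predict([[3, 4]]))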
What is the Support Vector Machine?
• “Support Vector Machine” (SVM) is a supervised machine
learning algorithm that can be used for both classification or
regression challenges.
• However, it is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in n-
dimensional space with the value of each feature being the
value of a particular coordinate.
• Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well.
Support Vector Machine Algorithm
• The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional space into
classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a
hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called as support vectors,
and hence algorithm is termed as Support Vector Machine.
Hyperplane and Support Vectors in the SVM
algorithm:
• Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out
the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
• The dimensions of the hyperplane depend on the features present in
the dataset, which means if there are 2 features, then hyperplane will
be a straight line. And if there are 3 features, then hyperplane will be
a 2-dimension plane.
• We always create a hyperplane that has a maximum margin, which
means the maximum distance between the data points.
Hyperplane and Support Vectors in the SVM
algorithm:
• Support Vectors:
The data points or vectors that are the closest to the hyperplane
and which affect the position of the hyperplane are termed as
Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
How does SVM works?
• The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and
blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the new pair(x1, x2) of coordinates in either
green or blue
How does SVM works?

As it is a 2-D space, just by using a straight line we can easily separate these two classes.
But there can be multiple lines that can separate these classes.

The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
How does SVM works?

• Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary or region is called as a hyperplane.
• SVM algorithm finds the closest point of the lines from both the
classes. These points are called support vectors.
• The distance between the vectors and the hyperplane is called
as margin. And the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal
hyperplane.
How does SVM works?
The SVM is unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier. The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say SVM classification is robust to outliers.
How does SVM works?

• In the scenario below, we can’t have linear hyper-plane between the two classes, so how does SVM
classify these two classes? SVM can solve this problem by introducing additional feature. Here, we will
add a new feature z=x^2+y^2. Now, let’s plot the data points on axis x and z:

In the above plot, points to consider are:
• All values for z will always be positive because z is the squared sum of both x and y.
• In the original plot, red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
SVM Kernel
• The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem.
• It is mostly useful in non-linear separation problems. Simply put, it does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you've defined.
Support Vector Machines (Kernels)
• Kernel Function is a method used to take data as input and transform into the
required form of processing data.
• “Kernel” is used due to set of mathematical functions used in Support Vector
Machine provides the window to manipulate the data.
• So, the Kernel Function generally transforms the training data so that a non-linear decision surface can be transformed into a linear equation in a higher-dimensional space.
• Basically, it returns the inner product between two points in a suitable feature space.

Types of SVM kernels:


• Polynomial Kernel
• Sigmoid Kernel
• Gaussian Kernel Radial Basis Function (RBF)
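A minimal sketch of a non-linear SVM with the Gaussian (RBF) kernel on data that a straight line cannot separate (the dataset and parameter values are illustrative assumptions):

from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # Gaussian (RBF) kernel
clf.fit(X, y)
print(clf.score(X, y))   # near-perfect accuracy after the kernel mapping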
Chapter 5
Optimization Technique
• https://neptune.ai/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning

• Link for model selection and evaluation

Introduction
• Machine Learning is the union of statistics and computation.

• However, any given model has several limitations depending on the data distribution. None of them can be entirely accurate, since they are just estimations.

• These limitations are popularly known by the name


of bias and variance.

• This is where model selection and model evaluation come into play.
What Is Model Selection
• Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.
• Model selection is a process that can be applied both across different types of
models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
• For example, we may have a dataset for which we are interested in developing a
classification or regression predictive model. We do not know beforehand as to
which model will perform best on this problem, as it is unknowable. Therefore, we fit
and evaluate a suite of different models on the problem.
• Model selection is the process of choosing one of the models as the final model that
addresses the problem.
Model assessment: the process of evaluating a model's performance.

Model selection: the process of selecting the proper level of flexibility for a model.
Model Selection Techniques
• If we are in a data-rich situation, the best approach is to
randomly divide the dataset into three parts: a training set, a
validation set, and a test set.
• The training set is used to fit the models;
• the validation set is used to estimate prediction error for model
selection;
• the test set is used for assessment of the generalization error of
the final chosen model.
What Is Model Selection contd..
There are two main classes of techniques to approximate the ideal case
of model selection
• Probabilistic Measures: Choose a model via in-sample error and
complexity.
• Resampling Methods: Choose a model via estimated out-of-sample
error.
Three common resampling model selection methods include:
• Random train/test splits.
• Cross-Validation (k-fold, LOOCV, etc)
• Bootstrap.
Random Split
• Random Splits are used to randomly sample a percentage of data into training,
testing, and preferably validation sets.
• The advantage of this method is that there is a good chance that the original
population is well represented in all the three sets. In more formal terms, random
splitting will prevent a biased sampling of data.
• It is very important to note the use of the validation set in model selection. The
validation set is the second test set and one might ask, why have two test sets?
• In the process of feature selection and model tuning, the test set is used for
model evaluation. This means that the model parameters and the feature set are
selected such that they give an optimal result on the test set. Thus, the validation
set which has completely unseen data points (not been used in the tuning and
feature selection modules) is used for the final evaluation.
About Train, Validation and Test Sets in Machine Learning

• Training Dataset:
• The sample of actual data used to fit the model.
• Validation Dataset:
• The sample of data used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyperparameters.
• The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration.
• This dataset helps during the “development” stage of the model.
• Test Dataset:
• The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset. It is only used once a model is completely trained(using the
train and validation sets).
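A common way to carve out the three sets with scikit-learn (the 60/20/20 proportions and synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # illustrative data

# First hold out 20% as the test set, then split the rest 75/25
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Result: 60% train, 20% validation, 20% test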
A visualization of the splits
Time-Based Split
• There are some types of data where random splits are not possible. For
example, if we have to train a model for weather forecasting, we cannot
randomly divide the data into training and testing sets. This will jumble up the
seasonal pattern! Such data is often referred to by the term – Time Series.

• In such cases, a time-wise split is used. The training set can have data for the last
three years and 10 months of the present year. The last two months can be
reserved for the testing or validation set.

• There is also a concept of window sets, where the model is trained up to a particular date and tested on the future dates iteratively, such that the training window keeps growing by one day (and, consequently, the test set shrinks by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small (say, 3 to 7 days).
Time-Based Split
• However, the drawback of time-series data is that the events or data
points are mutually dependent. One event might affect every data
input that follows after.

• For instance, a change in the governing party might considerably


change the population statistics for the years to follow. Or the
infamous coronavirus pandemic is going to have a massive impact
on economic data for the next few years.

• No machine learning model can learn from past data in such a case
because the data points before and after the event have major
differences.
Bootstrap
• The first step is to select a sample size (which is usually equal to the size of the
original dataset). Thereafter, a sample data point must be randomly selected
from the original dataset and added to the bootstrap sample. After the addition,
the sample needs to be put back into the original sample. This process needs to
be repeated for N times, where N is the sample size.

• Therefore, it is a resampling technique that creates the bootstrap sample by


sampling data points from the original dataset with replacement. This means
that the bootstrap sample can contain multiple instances of the same data point.

• The model is trained on the bootstrap sample and then evaluated on all those
data points that did not make it to the bootstrapped sample. These are called
the out-of-bag samples.
Bootstrapping
The bootstrap method involves iteratively resampling a dataset with
replacement.
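A minimal numpy sketch of one bootstrap iteration, including the out-of-bag points (the toy data is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)   # stand-in indices for the original dataset

# Sample with replacement, same size as the original dataset
boot = rng.choice(data, size=len(data), replace=True)

# Out-of-bag samples: the points that never made it into the bootstrap sample
oob = np.setdiff1d(data, boot)

print(boot)   # may contain repeated indices
print(oob)    # used to evaluate the model trained on `boot`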
Holdout Method
• Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set.
The training set is what the model is trained on, and the test set is
used to see how well that model performs on unseen data.
• A common split when using the hold-out method is using 80% of data
for training and the remaining 20% of the data for testing.
• Hold-out, is dependent on just one train-test split. That makes the
hold-out method score dependent on how the data is split into train
and test sets.
• Although this approach is simple to perform, it still faces the issue of
high variance, and it also produces misleading results sometimes.
Cross-Validation
• Cross-validation is a technique for validating the model efficiency by
training it on the subset of input data and testing on previously unseen
subset of the input data.
• We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.
• In machine learning, there is always the need to test the stability of the
model. It means based only on the training dataset; we can't fit our
model on the training dataset.
• For this purpose, we reserve a particular sample of the dataset, which
was not part of the training dataset.
• After that, we test our model on that sample before deployment, and this
complete process comes under cross-validation.
Cross-Validation
Hence the basic steps of cross-validations are:
1. Reserve a subset of the dataset as a validation set.
2. Provide the training to the model using the training dataset.
3. Now, evaluate model performance using the validation set. If the
model performs well with the validation set, perform the further
step, else check for the issues.
Methods used for Cross-Validation

• Validation Set Approach


• Leave-P-out cross-validation
• Leave one out cross-validation
• K-fold cross-validation
• Stratified k-fold cross-validation
Validation Set Approach
• We divide our input dataset into a training set and test or
validation set in the validation set approach. Both the
subsets are given 50% of the dataset.

• But it has one big disadvantage: we are using only a 50% share of the dataset to train our model, so the model may fail to capture important information in the dataset. It also tends to give an underfitted model.
Leave-P-out cross-validation
• In this approach, p data points are left out of the training data.
• It means that if there are a total of n data points in the original input dataset, then n - p data points are used as the training set and the p data points as the validation set.
• This complete process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.
• The disadvantage of this technique is that it can be computationally expensive for large p.
Leave one out cross-validation(LOOCV)
• This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of the training data.
• It means, in this approach, for each learning set, only one datapoint is
reserved, and the remaining dataset is used to train the model.
• This process repeats for each datapoint. Hence for n samples, we get
n different training set and n test set.
• It has the following features:
1. In this approach, the bias is minimum as all the data points are
used.
2. The process is executed for n times; hence execution time is high.
3. This approach leads to high variation in testing the effectiveness of
the model as we iteratively check against one data point.
K-Fold Cross-Validation
• K-fold cross-validation approach divides the input dataset into K groups of samples
of equal sizes. These samples are called folds. The steps for k-fold cross-validation
are:
1. Split the input dataset into K groups
2. For each group:
• Take one group as the reserve or test data set.
• Use remaining groups as the training dataset
• Fit the model on the training set and evaluate the performance of the model using the test
set.
• Example:
• Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
5 fold cross validation
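A minimal scikit-learn sketch of 5-fold cross-validation (the model and synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)   # illustrative data

# cv=5 gives 5-fold cross-validation: each fold is the test set exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # average performance across the 5 folds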
Stratified K-Fold
• This technique is similar to k-fold cross-validation with some little
changes. This approach works on stratification concept, it is a process of
rearranging the data to ensure that each fold or group is a good
representative of the complete dataset.
• Stratified k-fold cross-validation is same as just k-fold cross-validation,
but in Stratified k-fold cross-validation, it does stratified sampling
instead of random sampling.
• If for instance, the target variable is a categorical variable with 2 classes,
then stratified k-fold ensures that each test fold gets an equal ratio of
the two classes when compared to the training set.
• This makes the model evaluation more accurate and the model training
less biased.
What is random sampling and Stratified sampling ?
• Random sampling
• Suppose you want to take a survey and decided to call 1000 people from a
particular state, If you pick either 1000 male completely or 1000 female
completely or 900 female and 100 male (randomly) to ask their opinion on
a particular product.
• Then based on these 1000 opinion you can’t decide the opinion of that
entire state on your product. This is random sampling.
• Stratified sampling
In Stratified Sampling, Let the population for that state be 60% male and
40% female, Then for choosing 1000 people from that state if you pick 600
male ( 60% of 1000 ) and 400 female ( 40% for 1000 ) i.e 600 male + 400
female (Total=1000 people) to ask their opinion.
• Then these groups of people represent the entire state. This is called as
Stratified Sampling.
Comparison of Cross-validation to train/test split in Machine Learning

• Train/test split: The input data is divided into two parts, that are
training set and test set on a ratio of 70:30, 80:20, etc. It provides a high
variance, which is one of the biggest disadvantages.
• Training Data: The training data is used to train the model, and the dependent
variable is known.
• Test Data: The test data is used to make the predictions from the model that is
already trained on the training data. This has the same features as training data
but not the part of that.
• Cross-Validation dataset: It is used to overcome the disadvantage of
train/test split by splitting the dataset into groups of train/test splits,
and averaging the result. It can be used if we want to optimize our
model that has been trained on the training dataset for the best
performance. It is more efficient as compared to train/test split as every
observation is used for the training and testing both.
Limitations of Cross-Validation
• For the ideal conditions, it provides the optimum output. But for
the inconsistent data, it may produce a drastic result. So, it is one
of the big disadvantages of cross-validation, as there is no
certainty of the type of data in machine learning.
• In predictive modeling, the data evolves over a period of time, due to which it may face differences between the training and validation sets. For example, if we create a model for the prediction of stock market values and the data is trained on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.
Applications of Cross-Validation

• This technique can be used to compare the performance of different


predictive modeling methods.
• It has great scope in the medical research field.
• It can also be used for the meta-analysis, as it is already being used by
the data scientists in the field of medical statistics.
Grid Searching
• Grid searching is a method to find the best possible combination of
hyper-parameters at which the model achieves the highest accuracy.
• Before applying grid searching to any algorithm, the data is divided into training and validation sets; the validation set is used to validate the models. A model with every possible combination of hyperparameters is tested on the validation set to choose the best combination.
• Grid searching can be applied to any algorithm whose performance can be improved by tuning its hyperparameters.
• For example, we can apply grid searching on K-Nearest Neighbors by
validating its performance on a set of values of K in it. Same thing we
can do with Logistic Regression by using a set of values of learning rate
to find the best learning rate at which Logistic Regression achieves the
best accuracy.
Hyperparameter tuning

• A Machine Learning model is defined as a mathematical model with a


number of parameters that need to be learned from the data. By training a
model with existing data, we are able to fit the model parameters.
• However, there is another kind of parameters, known as Hyperparameters,
that cannot be directly learned from the regular training process.
• These parameters express important properties of the model, such as its complexity or how fast it should learn.
Some examples of model hyperparameters include:
• The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.
• Models can have many hyperparameters and finding the best
combination of parameters can be treated as a search problem. Two
best strategies for Hyperparameter tuning are:
• GridSearchCV
• RandomizedSearchCV
GridSearchCV
• In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values.
• This approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values.
• For example, suppose we want to set the two hyperparameters C and Alpha of a Logistic Regression Classifier model, each with a different set of values.
• The grid search technique will construct many versions of the model with all possible combinations of hyperparameters, and will return the best one.
• As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2,
0.3, 0.4].
For a combination C=0.3 and Alpha=0.2, performance score comes
out to be 0.726(Highest), therefore it is selected.
Drawback of GridSearchCV :
• GridSearchCV will go through all the intermediate combinations
of hyperparameters which makes grid search computationally
very expensive.
RandomizedSearchCV
• RandomizedSearchCV solves the drawback of GridSearchCV, as it goes through only a fixed number of hyperparameter settings.
• It moves within the grid in a random fashion to find the best set of hyperparameters. This approach reduces unnecessary computation.
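A minimal sketch of both strategies using K-Nearest Neighbors (the grid values and synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)   # illustrative data

param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance']}

# GridSearchCV tries every combination (10 here), each evaluated with 5-fold CV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# RandomizedSearchCV samples only n_iter of the combinations at random
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, n_iter=4,
                          cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)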
Gradient Descent algorithm
• Gradient Descent is an optimization algorithm used for minimizing
the cost function in various machine learning algorithms.
• It is basically used for updating the parameters of the learning
model.
What is a Cost Function?
• It is a function that measures the performance of a model for any
given data.
• Cost Function quantifies the error between predicted values and
expected values and presents it in the form of a single real
number.
Gradient Descent algorithm contd..
• Gradient descent is an iterative optimization algorithm for finding the local
minimum of a function.
• To find the local minimum of a function using gradient descent, we must take
steps proportional to the negative of the gradient (move away from the
gradient) of the function at the current point.
• If we take steps proportional to the positive of the gradient (moving towards the
gradient), we will approach a local maximum of the function, and the procedure
is called Gradient Ascent.
• The gradient is the vector containing all partial derivatives of a function at a point.
• We can apply gradient descent on a convex function, and gradient ascent on a
concave function
• Gradient descent finds the nearest minimum of a function, gradient ascent the
nearest maximum
The goal of the gradient descent
algorithm is to minimize the given
function (say cost function). To
achieve this goal, it performs two
steps iteratively:
1.Compute the gradient (slope), the
first order derivative of the function at
that point
2. Make a step (move) in the direction opposite to the gradient, i.e. opposite to the direction of slope increase, from the current point by alpha times the gradient at that point.
Alpha is called Learning rate – a tuning parameter in the
optimization process. It decides the length of the steps
Alpha – The Learning Rate
We have the direction we want to move in; now we must decide the size of the step we must take.
• It must be chosen carefully to end up at the local minimum.
• If the learning rate is too high, we might OVERSHOOT the minimum and keep bouncing without reaching it.
• If the learning rate is too small, the training might turn out to be too long.

a) Learning rate is optimal, model converges to the


minimum
b) Learning rate is too small, it takes more time but
converges to the minimum
c) Learning rate is higher than the optimal value, it
overshoots but converges ( 1/C < η <2/C)

d) Learning rate is very large, it overshoots and diverges,


moves away from the minima, performance decreases
on learning
What is the equation of the Gradient Descent Algorithm?

θ = θ - α * (dJ/dθ)

Here,
• θ is the parameter we wish to update,
• dJ/dθ is the partial derivative which tells us the rate of change of error on the cost function
with respect to the parameter θ,
• α here is the Learning Rate.
• So, this J here represents the cost function and there are multiple ways to calculate this
cost. Based on the way we are calculating this cost function there are different variants of
Gradient Descent.
Types of gradient Descent:
1. Batch Gradient Descent:
• This is a type of gradient descent which processes all the training
examples for each iteration of gradient descent.
• But if the number of training examples is large, then batch
gradient descent is computationally very expensive.
• Hence if the number of training examples is large, then batch
gradient descent is not preferred.
• Instead, we prefer to use stochastic gradient descent or mini-
batch gradient descent.
Batch Gradient Descent contd..
• If there are a total of ‘m’ observations in a data set then we use
all these observations to calculate the cost function J, then this
is known as Batch Gradient Descent.
• So for the entire training set, we calculate the cost function.
And then we update the parameters using the rate of change of
this cost function with respect to the parameters.
• An epoch is when the entire training set is passed through the
model.
• In batch Gradient Descent since we are using the entire training
set, the parameters will be updated only once per epoch.
2. Stochastic Gradient Descent
• If you use a single observation to calculate the cost function it is known
as Stochastic Gradient Descent, commonly abbreviated as SGD. We
pass a single observation at a time, calculate the cost and update the
parameters.
• This is a type of gradient descent which processes 1 training example
per iteration.
• Hence, the parameters are being updated even after one iteration in
which only a single example has been processed.
• Hence this is quite faster than batch gradient descent.
• But again, when the number of training examples is large, even then it
processes only one example which can be additional overhead for the
system as the number of iterations will be quite large.
• If we use SGD, we will take the first observation, pass it through the neural network, calculate the error, and then update the parameters.
• Then we will take the second observation and perform similar steps with it.
• This step will be repeated until all observations have been passed through the network and the parameters have been updated.
• Each time the parameter is updated, it is known as an Iteration.
• Here since we have 5 observations, the parameters will be updated 5 times or we can say that there will be
5 iterations.

Stochastic Gradient Descent


3. Mini Batch gradient descent
• It takes a subset of the entire dataset to calculate the cost function.
So if there are ‘m’ observations then the number of observations in
each subset or mini-batches will be more than 1 and less than ‘m’.
• This is a type of gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
• Here b examples where b<m are processed per iteration.
• So even if the number of training examples is large, it is processed in
batches of b training examples in one go.
• Thus, it works for larger training examples and that too with lesser
number of iterations.
• Assume that the batch size is 2. So we'll take the first two observations, pass them through the linear regression model, calculate the error, and then update the parameters.
• Then we will take the next two observations and perform similar steps, i.e. pass them through the network, calculate the error, and update the parameters.

Mini Batch gradient descent


• In Batch Gradient Descent, as we have seen earlier, we take the entire dataset > calculate the cost function > update the parameters.
• In the case of Stochastic Gradient Descent, we update the parameters after every single observation, and we know that every time the weights are updated it is known as an iteration.
• In the case of Mini-batch Gradient Descent, we take a subset of the data and update the parameters based on every subset.
• Now, since we update the parameters using the entire data set in the case of Batch GD, the cost function in this case reduces smoothly.
• On the other hand, this updating in the case of SGD is not that smooth. Since we're updating the parameters based on a single observation, there are a lot of iterations. It might also be possible that the model starts learning noise as well.
• The updating of the cost function in the case of Mini-batch Gradient Descent is smoother as compared to that of the cost function in SGD, since we're not updating the parameters after every single observation but after every subset of the data.
• Since we have to load the entire data set at a time, perform the forward propagation on it, calculate the error, and then update the parameters, the computation cost in the case of Batch Gradient Descent is very high.
• Computation cost in the case of SGD is less as compared to Batch Gradient Descent, since we load only a single observation at a time, but the computation time increases as there will be a larger number of updates, resulting in a larger number of iterations.
• In the case of Mini-batch Gradient Descent, taking a subset of the data, there is a smaller number of iterations or updates, and hence the computation time is less than for SGD. Also, since we're not loading the entire dataset at a time but a subset of the data, the computation cost is also less as compared to Batch Gradient Descent.
Comparison between Batch GD, SGD, and Mini-batch GD:
What Is a Sample?
• A sample is a single row of data.
• It contains inputs that are fed into the algorithm and an output that is
used to compare to the prediction and calculate an error.
• A training dataset is comprised of many rows of data, e.g. many
samples. A sample may also be called an instance, an observation, an
input vector, or a feature vector.
What Is a Batch?
• The batch size is a hyperparameter that defines the number of samples
to work through before updating the internal model parameters.
• A training dataset can be divided into one or more batches.
• When all training samples are used to create one batch, the learning
algorithm is called batch gradient descent. When the batch is the size of
one sample, the learning algorithm is called stochastic gradient descent.
When the batch size is more than one sample and less than the size of
the training dataset, the learning algorithm is called mini-batch gradient
descent.
• Batch Gradient Descent. Batch Size = Size of Training Set
• Stochastic Gradient Descent. Batch Size = 1
• Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
• In the case of mini-batch gradient descent, popular batch sizes include
32, 64, and 128 samples.
What Is an Epoch?
• The number of epochs is a hyperparameter that defines the number of
times that the learning algorithm will work through the entire training
dataset.
• One epoch means that each sample in the training dataset has had an
opportunity to update the internal model parameters. An epoch is
comprised of one or more batches. For example, as above, an epoch
that has one batch is called the batch gradient descent learning
algorithm.
What Is the Difference Between Batch and Epoch?
• The batch size is a number of samples processed before the model is
updated.
• The number of epochs is the number of complete passes through the
training dataset.
• The size of a batch must be more than or equal to one and less than
or equal to the number of samples in the training dataset.
Example
• Assume you have a dataset with 200 samples (rows of data) and you
choose a batch size of 5 and 1,000 epochs.
• This means that the dataset will be divided into 40 batches, each with
five samples. The model weights will be updated after each batch of
five samples.
• This also means that one epoch will involve 40 batches or 40 updates
to the model.
• With 1,000 epochs, the model will be exposed to or pass through the
whole dataset 1,000 times. That is a total of 40,000 batches during
the entire training process.
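This arithmetic can be checked with a few lines of Python:

samples, batch_size, epochs = 200, 5, 1000
batches_per_epoch = samples // batch_size   # 40 batches per epoch
updates_per_epoch = batches_per_epoch       # one weight update per batch
total_batches = batches_per_epoch * epochs  # 40,000 batches over all of training
print(batches_per_epoch, updates_per_epoch, total_batches)  # 40 40 40000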
• Gradient Descent algorithm

https://www.youtube.com/watch?v=vsWrXfO3wWw
Class Imbalance Problem in
Machine Learning
Class Imbalance Problem in Machine Learning

• Class imbalance is the problem when the number of examples available for one or more classes is
far less than other classes.
• In short, the distribution of examples across the known classes is biased.
• For Example: To detect fraud credit card transactions.
Class Imbalance Problem in Machine Learning
For Example: To detect fraud credit card transactions.
Ex: In a fraud detection data set you have the following data:
• Total Observations = 1000
• Fraud Observations = 20
• Non Fraud Observations = 980
• Event Rate = 2 %
[Bar chart comparing observation counts: No Fraud (980), Fraud (20), Total Observations (1000)]
The main question faced during data analysis is –
How to get a balanced dataset by getting a decent number of samples for these
anomalies, given the rare occurrence of some of them?
• Challenges with standard Machine learning techniques
• The conventional model evaluation methods do not accurately measure model performance when
faced with imbalanced datasets.
• Standard classifier algorithms like Decision Tree and Logistic Regression have a bias towards the classes
which have a larger number of instances.
• They tend to only predict the majority class data.
• The features of the minority class are treated as noise and are often ignored.
• Thus, there is a high probability of misclassification of the minority class as compared to the
majority class.
• Evaluation of a classification algorithm performance is measured by the Confusion Matrix which
contains information about the actual and the predicted class.
• Challenges with standard Machine learning techniques

• Accuracy of a model = (TP+TN) / (TP+FN+FP+TN)


• while working in an imbalanced domain accuracy is not an appropriate measure to evaluate model
performance.
• For eg: A classifier which achieves an accuracy of 98 % with an event rate of 2 % is not accurate if it
classifies all instances as the majority class and eliminates the 2 % minority class observations as
noise.
Class Imbalance Problem in Machine Learning

• Examples:
• Datasets to identify customer churn where a vast majority of
customers will continue using the service. Specifically,
Telecommunication companies where Churn Rate is lower than 2 %.
• Data sets to identify rare diseases in medical diagnostics etc.
• Natural Disaster like Earthquakes
Class Imbalance Problem in Machine Learning
• The classes which have a large number of samples are called the majority
classes
• while the classes which have very few samples are called the minority
classes.
• Techniques for handling Class-Imbalance Problem:
• Data Level Methods
• Algorithm/Classifier Level Methods
Class Imbalance Problem in Machine Learning
Data Level Methods (Resampling Techniques):
• Data Level methods are those where we make changes to the distribution of the
training set while keeping the algorithm and its subparts such as loss function,
optimizer constant.
• The data level methods aim to vary the dataset in a way to make standard
algorithms work.
• There are two famous data-level methods readily applied in the machine learning
domain.
• 1. Oversampling:
• 2. Undersampling:
Class Imbalance Problem in Machine Learning
• Data Level Methods:1 Oversampling:
• Upsampling Minority Class
• It is a very simple and widely known technique used to solve the problem of Class Imbalance.
• In this technique, we try to make the distribution of all the classes equal in a mini-batch by
sampling an equal number of samples from all the classes thereby sampling more examples from
minority classes as compared to majority classes.

Class Imbalance Problem in Machine Learning
• Oversampling:
Over-Sampling increases the number of instances in the minority class by randomly
replicating them in order to present a higher representation of the minority class in
the sample.
Total Observations = 1000
Fraud Observations =20
Non Fraud Observations = 980
Event Rate= 2 %
In this case we are replicating 20 fraud observations 20 times.
Non Fraud Observations =980
Fraud Observations after replicating the minority class observations= 400
Total Observations in the new data set after oversampling=980+400=1380
Event Rate for the new data set after oversampling = 400/1380 = 29 %
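A minimal sketch of random oversampling using scikit-learn's resample utility, mirroring the numbers above (the toy DataFrame is an assumption for illustration):

import pandas as pd
from sklearn.utils import resample

# Toy frame mirroring the slide: 980 non-fraud (0) rows and 20 fraud (1) rows
df = pd.DataFrame({'fraud': [0] * 980 + [1] * 20})

majority = df[df['fraud'] == 0]
minority = df[df['fraud'] == 1]

# Randomly replicate minority rows (sampling with replacement) up to 400 observations
minority_upsampled = resample(minority, replace=True, n_samples=400,
                              random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced['fraud'].value_counts())  # 0: 980, 1: 400 -> event rate = 400/1380, about 29 %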
Class Imbalance Problem in Machine Learning
• Data Level Methods: 1 Oversampling :
• Advantages
• This method leads to no information loss.
• Outperforms under sampling.
• Disadvantages
• It increases the likelihood of overfitting since it replicates the minority class events.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 2 Undersampling:
• Downsampling Majority Class
• It is just the opposite of Oversampling.
• In this, we randomly remove samples from the majority class until all the classes have the same
number of samples.
• This technique has a significant disadvantage in that it discards data which might lead to a
reduction in the number of representative samples in the dataset.
• To fix this shortcoming various methods are used which carefully remove redundant samples
thereby preserving the variability of the dataset.
Data Level Methods: 2 Undersampling:
• Undersampling aims to balance class distribution by randomly eliminating
majority class examples.
• This is done until the majority and minority class instances are balanced out.
• Total Observations = 1000
• Fraudulent Observations =20
• Non Fraudulent Observations = 980
• Event Rate= 2 %
• In this case we are taking 10 % samples without replacement from Non Fraud
instances. And combining them with Fraud instances.
• Non Fraudulent Observations after random under sampling = 10 % of 980 =98
• Total Observations after combining them with Fraudulent observations =
20+98=118
• Event Rate for the new dataset after under sampling = 20/118 = 17%
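A matching sketch of random undersampling with the same toy data (again, illustrative only):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'fraud': [0] * 980 + [1] * 20})  # same toy data as above

majority = df[df['fraud'] == 0]
minority = df[df['fraud'] == 1]

# Keep 10 % of the majority class, sampled without replacement
majority_downsampled = resample(majority, replace=False, n_samples=98,
                                random_state=42)
balanced = pd.concat([majority_downsampled, minority])
print(len(balanced))        # 118 observations in total
print(20 / len(balanced))   # event rate = 20/118, about 17 %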
Class Imbalance Problem in Machine Learning
• Data Level Methods: 2 Undersampling:
• Advantages
• It can help improve run time and storage problems by reducing the number of
training data samples when the training data set is huge.
• Disadvantages
• It can discard potentially useful information which could be important for building
rule classifiers.
• The sample chosen by random under-sampling may be a biased sample.
• And it will not be an accurate representation of the population.
• Thereby, resulting in inaccurate results with the actual test data set.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
• In this case, the K-means clustering algorithm is independently applied to minority
and majority class instances.
• This is to identify clusters in the dataset.
• Subsequently, each cluster is oversampled such that all clusters of the same class
have an equal number of instances and all classes have the same size.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
Total Observations = 1000
Fraudulent Observations =20
Non Fraudulent Observations = 980
Event Rate= 2 %
Majority Class Clusters
Cluster 1: 150 Observations
Cluster 2: 120 Observations
Cluster 3: 230 observations
Cluster 4: 200 observations
Cluster 5: 150 observations
Cluster 6: 130 observations
Minority Class Clusters
Cluster 1: 8 Observations
Cluster 2: 12 Observations
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
After oversampling of each cluster, all clusters of the same class contain the same
number of observations.
Majority Class Clusters
Cluster 1: 170 Observations
Cluster 2: 170 Observations
Cluster 3: 170 observations
Cluster 4: 170 observations
Cluster 5: 170 observations
Cluster 6: 170 observations
Minority Class Clusters
Cluster 1: 250 Observations
Cluster 2: 250 Observations
Event Rate post cluster based oversampling sampling = 500/ (1020+500) = 33 %
Data Level Methods: 3 Cluster-Based Over Sampling:

• Advantages
• This clustering technique helps overcome the challenge of between-class
imbalance, where the number of examples representing the positive class differs
from the number of examples representing the negative class.
• It also overcomes challenges of within-class imbalance, where a class is composed
of different sub-clusters and each sub-cluster does not contain the same
number of examples.
• Disadvantages
• The main drawback of this algorithm, like most oversampling techniques is
the possibility of over-fitting the training data.
Data Level Methods: 4 Informed Over Sampling: Synthetic Minority Over-sampling
Technique for imbalanced data

• This technique is followed to avoid overfitting which occurs when


exact replicas of minority instances are added to the main dataset.
• It uses the K-Nearest Neighbour algorithm: a subset of data is
taken from the minority class as an example, and then new synthetic
similar instances are created.
• These synthetic instances are then added to the original dataset.
• The new dataset is used as a sample to train the classification
models.
Data Level Methods: 4 Informed Over Sampling: Synthetic Minority Over-sampling
Technique for imbalanced data

• Total Observations = 1000


• Fraudulent Observations = 20
• Non Fraudulent Observations = 980
• Event Rate = 2 %
• A sample of 15 instances is taken from the minority class and similar
synthetic instances are generated 20 times
• Post generation of synthetic instances, the following data set is created
• Minority Class (Fraudulent Observations) = 300
• Majority Class (Non-Fraudulent Observations) = 980
• Event rate= 300/1280 = 23.4 %
Data Level Methods: 4 Informed Over Sampling: Synthetic Minority Over-sampling
Technique for imbalanced data

• Advantages
• Mitigates the problem of overfitting caused by random oversampling as
synthetic examples are generated rather than replication of instances
• No loss of useful information
• Disadvantages
• While generating synthetic examples SMOTE does not take into consideration
neighboring examples from other classes. This can result in increase in
overlapping of classes and can introduce additional noise
• SMOTE is not very effective for high dimensional data
Synthetic Minority Oversampling Algorithm
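A minimal sketch using the SMOTE implementation from the imbalanced-learn package (assumes the imbalanced-learn library is installed; the synthetic data is illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic 2-class data with roughly a 2 % minority class, as in the slide's example
X, y = make_classification(n_samples=1000, weights=[0.98, 0.02],
                           random_state=42)
print(Counter(y))                        # roughly {0: 980, 1: 20}

# k_neighbors controls how many minority neighbours are used to
# interpolate each new synthetic example
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))                    # classes balanced after resampling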
Class Imbalance Problem in Machine Learning
Algorithm Level Methods:
• Here we keep the dataset constant but alter the training or inference algorithms.
1. Cost-Sensitive Learning(Penalize Algorithms):
• Here we assign different costs to classes according to their distribution.
• use a higher learning rate for examples belonging to the minority class as compared to
examples belonging to the majority class, or
• use class weighted loss functions which calculate loss by taking the class distribution into
account and hence penalize the classifier more for misclassifying examples from
minority class as compared to majority class.
• Mostly widely used class weighted loss functions are WeightedCrossEntropy and Focal
Loss.
Class Imbalance Problem in Machine Learning

Algorithm Level Methods:


1. Cost-Sensitive Learning(Penalize Algorithms):
• WeightedCrossEntropy.
• It uses the classical CrossEntropy loss and incorporates a weight term for giving more
weightage to a specific class which in class imbalance problems is the minority class.
• Hence, CrossEntropy loss penalizes more when the classifier misclassifies examples
belonging to the minority classes.
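A small NumPy sketch of a class-weighted binary cross-entropy, illustrating the idea (the weight values are assumptions; a 49:1 ratio roughly mirrors a 980:20 class split):

import numpy as np

def weighted_bce(y_true, p_pred, w_pos=49.0, w_neg=1.0):
    # Larger w_pos penalizes misclassifying the rare positive class more;
    # the 49:1 ratio here is an assumed choice mirroring a 980:20 split.
    p_pred = np.clip(p_pred, 1e-7, 1 - 1e-7)   # numerical safety
    loss = -(w_pos * y_true * np.log(p_pred)
             + w_neg * (1 - y_true) * np.log(1 - p_pred))
    return loss.mean()

y = np.array([1, 0, 0, 0])
p = np.array([0.1, 0.1, 0.1, 0.1])   # classifier is confident everything is negative
print(weighted_bce(y, p))            # large loss, driven by the misclassified minority example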
Class Imbalance Problem in Machine Learning

Algorithm Level Methods:


1. . Cost-Sensitive Learning(Penalize Algorithms):
. Focal loss
• It was first introduced by FAIR in their paper Focal Loss for Dense Object Detection, and
is designed in such a way that it performs two tasks for solving class imbalance.
• Firstly, it penalizes hard examples more as compared to easy examples and helps the
algorithm to perform better.
• Hard examples are those examples where the model is not confident and predicts the
ground truth with a low probability,
• whereas easy examples are those where the model is highly confident and predicts the
ground truth with high probability.
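A NumPy sketch of the α-balanced binary focal loss from the paper, FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t); the α and γ values below are the paper's common defaults, used here as assumptions:

import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0):
    p_pred = np.clip(p_pred, 1e-7, 1 - 1e-7)   # numerical safety
    # p_t: predicted probability of the true class; alpha_t: class-balance weight
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights easy (high-confidence) examples
    return (-alpha_t * (1 - p_t) ** gamma * np.log(p_t)).mean()

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.30, 0.05, 0.60])   # two easy and two hard examples
print(focal_loss(y, p))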
Class Imbalance Problem in Machine Learning

• Algorithm Level Methods:


• 2. One-Class Classification:
• One Class Classification is the technique of handling class imbalance by modelling the
distribution of only the minority class and treating all other classes as out-of-
distribution/anomaly classes.
• Using this technique, we aim to create a classifier that can detect examples belonging to
the minority class and separate them from the majority class.
• This is done in practice by training the model on only the instances belonging to the
minority class and during test time using examples belonging to all the classes to test the
ability of the classifier to correctly identify examples belonging to the minority class.
Class Imbalance Problem in Machine Learning

• Algorithm Level Methods:


• 2. One-Class Classification:
• One Class Classification technique is implemented in various ways.
• One widely used way in computer vision applications is using autoencoders where we
train the autoencoder on examples belonging to the minority class and make it
regenerate the input.
• Now at test time, we pass images belonging to all the classes and measure the
reconstruction error of the model using loss functions such as RMSE, MSE etc.
• If an image belongs to the minority class, the reconstruction error will be low as the
model is already familiar with its distribution and
• the reconstruction error would be high for examples belonging to major classes other
than the minority class.
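Autoencoders aside, a quick way to experiment with one-class classification is scikit-learn's OneClassSVM. A hedged sketch, trained here on the rare class to follow the slides' framing (the toy data and the nu value are assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(20, 2))   # examples of the modelled class
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),   # in-distribution points
                    rng.normal(6.0, 1.0, size=(5, 2))])  # out-of-distribution points

clf = OneClassSVM(nu=0.1, kernel='rbf').fit(X_train)
print(clf.predict(X_test))   # +1 = looks like the training class, -1 = out-of-distribution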
Algorithmic Ensemble Techniques

• Modifying existing classification algorithms to make them appropriate


for imbalanced data sets.
• The main objective of ensemble methodology is to improve the
performance of single classifiers.
• The approach involves constructing several two stage classifiers from
the original data and then aggregate their predictions.
Approach to Algorithmic Ensemble based Methodologies
Approach to Algorithmic Ensemble based Methodologies
• 1 Bagging(Bootstrap Aggregating) Based techniques for imbalanced data
• Bootstrapping is the method of randomly creating samples of data out of a
population with replacement to estimate a population parameter.

•Consider there are n observations and m features in the training


set. You need to select a random sample from the training dataset
with replacement (a bootstrap sample)
•A subset of m features is chosen randomly to create a model
using sample observations
•The feature offering the best split out of the lot is used to split the
nodes
•The tree is grown until you have the best possible nodes
•The above steps are repeated n times. It aggregates the output
of individual decision trees to give the best prediction
Approach to Algorithmic Ensemble based Methodologies
• 1 Bagging(Bootstrap Aggregating)
Based techniques for imbalanced data
• The conventional bagging algorithm involves
generating ‘n’ different bootstrap training
samples with replacement.
• And training the algorithm on each
bootstrapped algorithm separately and then
aggregating the predictions at the end.
• Bagging is used for reducing Overfitting in
order to create strong learners for generating
accurate predictions.
1 Bagging(Bootstrap Aggregating) Based techniques
• Total Observations = 1000
• Fraud Observations =20
• Non Fraud Observations = 980
• Event Rate= 2 %
• There are 10 bootstrapped samples chosen from the population with
replacement.
• Each sample contains 200 observations.
• And each sample is different from the original dataset but resembles the dataset
in distribution & variability.
• The machine learning algorithms like logistic regression, neural networks,
decision tree are fitted to each bootstrapped sample of 200 observations.
• And the Classifiers c1, c2…c10 are aggregated to produce a compound classifier.
• This ensemble methodology produces a stronger compound classifier since it
combines the results of individual classifiers to come up with an improved one.
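A minimal scikit-learn sketch of this setup, 10 classifiers each fit on a bootstrap sample of 200 rows (toy data; the parameter values are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02],
                           random_state=42)

# 10 classifiers (c1..c10), each fit on a bootstrap sample of 200 rows;
# their predictions are aggregated into a compound classifier.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        max_samples=200, bootstrap=True, random_state=42)
bag.fit(X, y)
print(bag.predict(X[:5]))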
1 Bagging(Bootstrap Aggregating) Based techniques
• Advantages
• Improves stability & accuracy of machine learning algorithms
• Reduces variance
• Overcomes overfitting
• Improved misclassification rate of the bagged classifier
• In noisy data environments bagging outperforms boosting
• Disadvantages
• Bagging works only if the base classifiers are not bad to begin with. Bagging
bad classifiers can further degrade performance
2 Boosting-Based techniques for imbalanced data
• Boosting is an ensemble technique to combine weak learners to
create a strong learner that can make accurate predictions.
• Boosting starts out with a base classifier / weak classifier that is
prepared on the training data.
• What are base learners / weak classifiers?
• The base learners / Classifiers are weak learners i.e. the prediction
accuracy is only slightly better than average.
• A classifier learning algorithm is said to be weak when small changes
in data induce big changes in the classification model.
• In the next iteration, the new classifier focuses on or places more
weight to those cases which were incorrectly classified in the last
round.
2 Boosting-Based techniques for imbalanced data
Adaptive Boosting- Ada Boost techniques for imbalanced data

• Ada Boost is the first original boosting technique which creates a highly accurate
prediction rule by combining many weak and inaccurate rules.
• Each classifier is serially trained with the goal of correctly classifying examples in
every round that were incorrectly classified in the previous round.
• For a learned classifier to make strong predictions it should follow the following
three conditions:
• The rules should be simple
• Classifier should have been trained on sufficient number of training examples
• The Classifier should have low training error for the training instances
Adaptive Boosting- Ada Boost techniques for imbalanced data

• Each of the weak hypotheses has an accuracy slightly better than random guessing,
i.e. the error term ε(t) should be at most ½ − β, where β > 0.
• This is the fundamental assumption of this boosting algorithm which can produce
a final hypothesis with a small error
• After each round, it gives more focus to examples that are harder to classify.
• The quantity of focus is measured by a weight, which initially is equal for all
instances.
• After each iteration, the weights of misclassified instances are increased and the
weights of correctly classified instances are decreased.
Adaptive Boosting- Ada Boost techniques for imbalanced data
Adaptive Boosting- Ada Boost techniques for imbalanced data

• For example in a data set containing 1000 observations out of which 20 are
labelled fraudulent.
• Equal weights W1 are assigned to all observations and the base classifier
accurately classifies 400 observations.
• Weight of each of the 600 misclassified observations is increased to w2 and
weight of each of the correctly classified observations is reduced to w3.
• In each iteration, these updated weighted observations are fed to the weak
classifier to improve its performance.
• This process continues till the misclassification rate significantly decreases
thereby resulting in a strong classifier.
Evaluation Metrics For Classification Model

Evaluation Metrics For Classification Model | Classification Model Metrics (analyticsvidhya.com)
• The most important task in building any machine learning model is to
evaluate its performance.
• How would one measure the success of a machine learning model?
• How would we know that when to stop the training and evaluation
and when to call it good?
• There are different metrics for the tasks of classification and
regression.
• Some metrics, like precision-recall, are useful for multiple tasks.
• Using different metrics for performance evaluation, we should be
able to improve our model’s overall predictive power before we roll it
out for production on unseen data.
• Without doing a proper evaluation of the Machine Learning model
by using different evaluation metrics, and only depending on
accuracy,
• can lead to a problem when the respective model is deployed on
unseen data and may end in poor predictions.
Classification Metrics

• Classification is about predicting the class labels given input data.


• In binary classification, there are only two possible output classes.
• In multiclass classification, more than two possible classes can be
present.
Accuracy
• When any model gives an accuracy rate of 99%, you might think that model is
performing very good but this is not always true and can be misleading in some
situations.
Accuracy
• Accuracy simply measures how often the classifier correctly predicts.
• We can define accuracy as the ratio of the number of correct
predictions and the total number of predictions.
Accuracy
• Accuracy is useful when the target class is well balanced but is not a good choice
for the unbalanced classes.
• Imagine the scenario where we had 99 images of the dog and only 1 image of a
cat present in our training data.
• Then our model would always predict the dog, and therefore we got 99%
accuracy.
• In reality, Data is always imbalanced for example Spam email, credit card fraud,
and medical diagnosis.
• Hence, if we want to do a better model evaluation and have a full picture of the
model evaluation, other metrics such as recall and precision should also be
considered.
Confusion Matrix
• A confusion matrix is an N × N square matrix, where N represents the total number of
target classes or categories.
• Confusion matrix can be used to evaluate a classifier whenever the data set is imbalanced.
Let us consider a binary classification problem i.e. the number of target classes are 2.
• A typical confusion matrix with two target classes (say “Yes” and “No”) looks like:
• The accuracy of the classifier can be calculated from the
confusion matrix using the below formula:
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• The accuracy of our classifier is: (69+39) / (69+39+2+4) = 0.947 =
94.7%
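A short scikit-learn sketch of computing the confusion matrix and accuracy (toy labels assumed for illustration):

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total predictions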
Precision
• Precision explains how many of the correctly predicted cases actually turned out
to be positive.
• Precision is useful in the cases where False Positive is a higher concern than False
Negatives.
• The importance of Precision is in music or video recommendation systems, e-
commerce websites, etc. where wrong results could lead to customer churn and
this could be harmful to the business.
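For reference, the standard formula is: Precision = TP / (TP + FP)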
Recall (Sensitivity)
• It is a useful metric in cases where False Negative is of higher concern than False
Positive.
• It is important in medical cases where it doesn’t matter whether we raise a false
alarm but the actual positive cases should not go undetected!
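For reference, the standard formula is: Recall = TP / (TP + FN)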
F1 Score
• It gives a combined idea about Precision and Recall metrics. It is maximum when
Precision is equal to Recall.

• F1 Score is the harmonic mean of precision and recall.


• The F1 score punishes extreme values more.
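For reference: F1 Score = 2 × (Precision × Recall) / (Precision + Recall)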
• F1 Score could be an effective evaluation metric in the following cases:
• When FP and FN are equally costly.
• Adding more data doesn’t effectively change the outcome
• True Negative is high
• When to use the F1 Score?
• The F-score is often used in the field of information retrieval for measuring
search, document classification, and query classification performance.
• The F-score has been widely used in the natural language processing literature,
such as the evaluation of named entity recognition and word segmentation.
ROC curve (Receiver Operating Characteristic
curve)
• A ROC curve (Receiver Operating Characteristic curve) is a
graph showing the performance of a classification model.
• It is a way to visualize the tradeoff between the True Positive
Rate (TPR) and False Positive Rate(FPR) using different
decision thresholds (the threshold for deciding whether a
prediction is labeled “true” or “false”) for our predictive model.
• This threshold is used to control the tradeoff between TPR and
FPR.
• Increasing the threshold will generally increase the precision,
but decrease the recall.
ROC curve (Receiver Operating Characteristic
curve)
• True Positive Rate (TPR / Sensitivity / Recall): True Positive Rate corresponds to
the proportion of positive data points that are correctly considered as positive,
for all positive data points.

• False Positive Rate (FPR): False Positive Rate corresponds to the proportion of
negative data points that are mistakenly considered as positive, for all negative
data points.
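For reference, in terms of confusion-matrix counts:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)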
ROC curve (Receiver Operating Characteristic
curve)

They both have values in the range of [0,1] which are computed at varying
threshold values.

The perfect classifier will have high value of true positive rate and low value of
false positive rate.
ROC curve (Receiver Operating Characteristic curve)

•Any model with a ROC curve above the random


guessing classifier line can be considered as a
better model.
•Any model with a ROC curve below the random
guessing classifier line can outrightly be rejected.
•This curve plots TPR and FPR at different
classification thresholds but this is inefficient
because we have to evaluate our model at various
thresholds.
•There’s an efficient, sorting-based algorithm that
can provide us this information which is AUC.
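A hedged scikit-learn sketch of obtaining ROC points and the AUC (the toy scores are assumptions for illustration):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one (FPR, TPR) point per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))                 # area under the ROC curve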
When to use ROC?
• ROC curves are widely used to compare and evaluate different
classification algorithms.
• ROC curve is widely used when the dataset is imbalanced.
• ROC curves are also used in verification of forecasts in meteorology
Logarithmic Loss or Log Loss
• Log Loss can be used when the output of the classifier is a numeric probability instead of a
class label.
• Log loss measures the unpredictability of the extra noise that comes from using a predictor
as opposed to the true labels.
• Log loss for a binary classifier:
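The standard formula is:
LogLoss = −(1/N) · Σ (i = 1 to N) [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]
where yᵢ ∈ {0, 1} is the true label and pᵢ is the predicted probability that sample i belongs to class 1.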
Logarithmic Loss or Log Loss

• Log loss for multi-class classification:

Consider, N samples belong to the M class. where,


y_ij indicates whether sample i belongs to class j or not
p_ij indicates the probability of sample i belonging to class j
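The corresponding formula is:
LogLoss = −(1/N) · Σ (i = 1 to N) Σ (j = 1 to M) y_ij · log(p_ij)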
Logarithmic Loss or Log Loss
• The image is of actual(Target) and Predicted
probabilities.
• The top image depicts a poor prediction
because of the large difference between the
actual and predicted probability which gives
us a large log loss.
• Here, the function penalizes the wrong
answer that the model is confident about.
• The bottom image depicts a good prediction
because the predicted probability is close to
the actual probability which gives us a small
log loss.
• Here, the function is rewarding a correct
answer that the model is confident about.
• Log loss doesn’t have an upper bound and it
exists on the range [0, ∞). Minimizing log loss
gives greater accuracy for the classifier.
• Performance Metrics for Classification Machine Learning Problems | by Ramya Vidiyala | Towards Data Science
• More Performance Evaluation Metrics for Classification Problems You Should Know – KDnuggets
• Evaluation Metrics for Classification Problems with Implementation in Python | by Venu Gopal Kadamba | Analytics Vidhya | Medium
• Performance metrics for classification and regression problems · Understanding Unexplained (shankarchavan.github.io)
Feature Selection Techniques in Machine Learning

Feature Selection Techniques in Machine Learning

• While building a machine learning model for real-life dataset, we


come across a lot of features in the dataset and not all these features
are important every time.
• Adding unnecessary features while training the model leads to
• reduced overall accuracy of the model,
• increased complexity of the model, and
• decreased generalization capability, making the model biased.
• The saying “Sometimes less is better” applies to machine learning models as well.
• Hence, feature selection is one of the important steps while building
a machine learning model.
• Its goal is to find the best possible set of features for building a
machine learning model.
Feature Selection Techniques in
Machine Learning
• Feature selection is the process of selecting the subset of the relevant features and leaving out
the irrelevant features present in a dataset to build a model of high accuracy.

• In other words, it is a way of selecting the optimal features from the input dataset.

• Three methods used for feature selection are:


1. Filter methods
2. Wrapper methods
3. Embedded methods

1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken.

Some common techniques of filters method are:


• Correlation
• Variance Threshold
• Chi-Square
• Anova test
• Information Gain

1. Filter Methods

• Correlation:
• Correlation explains how one or more variables are related to each other.
These variables can be input data features which have been used to forecast
our target variable.
• Pearson’s Correlation Coefficient is a measure of quantifying the association
between the two continuous variables and the direction of the relationship
with its values ranging from -1 to 1.

1. Filter Methods
Positive Correlation:
• Two features (variables) can be positively correlated with each other.
• It means that when the value of one variable increases, the value of the
other variable(s) also increases.

1. Filter Methods
Negative Correlation:
• Two features (variables) can be negatively correlated with each other.
• It means that when the value of one variable increases, the value of the other
variable(s) decreases.

1. Filter Methods
No Correlation:
• Two features (variables) are not correlated with each other.
• It means that when the value of one variable increases or decreases, the value of the
other variable(s) doesn’t systematically increase or decrease.

Variance Threshold
• Variance Threshold is a feature selector that removes all the
low variance features from the dataset that are of no great use
in modeling.
• It looks only at the features (x), not the desired outputs (y), and
can thus be used for unsupervised learning.
• Default Value of Threshold is 0
• If Variance Threshold = 0 (Remove Constant Features )
• If Variance Threshold > 0 (Remove Quasi-Constant Features )
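A minimal scikit-learn sketch of the Variance Threshold selector (the toy matrix is an assumption for illustration):

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2.0, 0.5],
     [0, 1.0, 0.8],
     [0, 3.0, 0.7]]   # first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)   # default: drop constant features
X_reduced = selector.fit_transform(X)
print(X_reduced)                # constant column removed
print(selector.get_support())   # mask of kept features: [False  True  True]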
1. Filter Methods
• Chi-Square Test:
Chi-square method (X2) is generally used to test the relationship between
categorical variables.
It compares the observed values from different attributes of the dataset to
its expected value.
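A minimal scikit-learn sketch of chi-square feature selection (the Iris data and the k value are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # chi2 requires non-negative feature values

# Keep the 2 features with the strongest chi-square relationship to the target
X_new = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X.shape, '->', X_new.shape)   # (150, 4) -> (150, 2)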

1. Filter Methods
• Variance Threshold – It is an approach where all features are
removed whose variance doesn’t meet the specific
threshold. By default, this method removes features having
zero variance. The assumption made using this method is
higher variance features are likely to contain more
information.
• Mean Absolute Difference (MAD) – This method is similar to
variance threshold method but the difference is there is no
square in MAD. This method calculates the mean absolute
difference from the mean value.
• Information Gain: It is defined as the amount of information
provided by the feature for identifying the target value and
measures reduction in the entropy values. Information gain
of each attribute is calculated considering the target values
for feature selection.

2. Wrappers Methods
• The wrapper method has the same goal as the filter method, but it
takes a machine learning model for its evaluation.
• In this method, some features are fed to the ML model and its
performance is evaluated. The performance decides whether to
add or remove those features to increase the accuracy of the
model.
• This method is more accurate than the filtering method but
more complex to work with.
Some common techniques of wrapper methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
• Forward selection – This method is an iterative approach where
we initially start with an empty set of features and keep adding
the feature which best improves our model after each iteration
(see the sketch after this list). The stopping criterion is reached
when the addition of a new variable no longer improves the
performance of the model.
• Backward elimination – This method is also an iterative
approach where we initially start with all features and, after
each iteration, remove the least significant feature. The
stopping criterion is reached when no improvement in the
performance of the model is observed after a feature is removed.
• Bi-directional elimination – This method uses both the forward
selection and backward elimination techniques simultaneously
to reach one unique solution.
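A hedged sketch of forward selection using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+; the estimator and the number of features to keep are assumptions; backward elimination only needs direction='backward'):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the feature that most
# improves cross-validated accuracy, stopping at 2 features.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction='forward')
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features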



3. Embedded Methods:
• Embedded methods check the different training
iterations of the machine learning model and
evaluate the importance of each feature.
Some common techniques of Embedded methods
are:
• LASSO
• Elastic Net
• Ridge Regression, etc.

• Regularization – This method adds a penalty to
different parameters of the machine learning model to
avoid over-fitting of the model. This approach of feature
selection uses Lasso (L1 regularization) and Elastic nets
(L1 and L2 regularization). The penalty is applied over
the coefficients, thus bringing down some coefficients
to zero. The features having zero coefficient can be
removed from the dataset.
• Tree-based methods – These methods such as Random
Forest, Gradient Boosting provides us feature
importance as a way to select features as well. Feature
importance tells us which features are more important
in making an impact on the target feature.
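A minimal sketch of the Lasso-based selection described in the regularization bullet above (the dataset and the alpha value are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives some coefficients to exactly zero;
# SelectFromModel then keeps only the features with non-zero coefficients.
lasso = Lasso(alpha=0.5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
print(selector.get_support())              # mask of surviving features
print(X.shape, '->', selector.transform(X).shape)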

Issues in Decision Tree Learning
and How To solve them
Decision tree
• A decision tree is an algorithm for supervised learning.
• It uses a tree structure, in which there are two types of nodes:
decision node and leaf node.
• A decision node splits the data into two branches by asking a Boolean
question on a feature.
• A leaf node represents a class.
• The training process is about finding the “best” split at a certain
feature with a certain value.
• And the predicting process is to reach the leaf node from root by
answering the question at each decision node along the path.
Types of Decision Trees
• Types of decision trees are based on the type of target variable we
have.
• Categorical Variable Decision Tree:
• A decision tree that has a categorical target variable is called
a categorical variable decision tree.
• Continuous Variable Decision Tree:
• A decision tree that has a continuous target variable is called a continuous
variable decision tree.
Important Terminology related to Decision Trees
Decision Tree Example
Decision tree algorithms
• ID3
• C4.5
• CART
Issue 1 Overfitting the Data
• Over-fitting occurs when the model fits the given training data so closely
that it becomes inaccurate in predicting the outcomes of unseen data.
• In decision trees, over-fitting occurs when the tree is designed so as to
perfectly fit all samples in the training data set.
• Thus it ends up with branches with strict rules for sparse data.
• This affects the accuracy when predicting samples that are not
part of the training set.
Prevent overfitting/Determine how deeply to grow
the decision tree
• 1. we stop splitting the tree at some point;
• we need to introduce two hyperparameters for training like maximum depth of
the tree and minimum size of a leaf.
• 2. we generate a complete tree first, and then get rid of some
branches called as pruning.
• In pruning, you trim off the branches of the tree, i.e., remove the decision
nodes starting from the leaf node such that the overall accuracy is not
disturbed.
• This is done by segregating the actual training set into two sets: training data
set, D and validation data set, V.
• Prepare the decision tree using the segregated training data set, D.
• Then continue trimming the tree accordingly to optimize the accuracy of the
validation data set, V.
Overfitting the Data
• Unlike other regression models, decision tree doesn’t use
regularization to fight against overfitting.
• Instead, it employs tree pruning.
• Selecting the right hyperparameters (tree depth and leaf size) also
requires experimentation, e.g. doing cross-validation with a
hyperparameter matrix.
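A minimal scikit-learn sketch of controlling these two hyperparameters and comparing cross-validated accuracy (toy data; the values tried are assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: cap the tree depth and require a minimum leaf size
for depth in (2, 3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5,
                                  random_state=42)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: cv accuracy={score:.3f}")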
Issue 2 Continuous Valued Attributes
• Define new discrete valued attributes that partition the continuous attribute
value into a discrete set of intervals

• Find a set of thresholds midway between different target values of the attribute:
Temperature>54 and Temperature>85

• Pick a threshold, c, that produces the greatest information gain : temperature>54


Issue 3 Unknown/Missing Attribute Values
• What if some examples missing values of A?
• Use training example anyway, sort through tree
• If node n tests A, assign most common value of A among other examples
sorted to node n
• Assign most common value of A among other examples with same target
value
• Assign probability pi to each possible value vi of A
• Assign fraction pi of the example to each descendant in the tree
• Classify new examples in same fashion
Issue 4 Attributes with Many Values
• Problem
• If attribute has many values, Gain will select it
• Imagine using Date = Oct_13_2004 as attribute

• One approach: use GainRatio instead


GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = − Σ (i = 1 to c) (|Si| / |S|) · log₂(|Si| / |S|)

where Si is the subset of S for which A has value vi
Issue 5 Attributes with Costs
• Use low-cost attributes where possible, relying on high-cost attributes only when
needed to produce reliable classficiations
• Tan and Schlimmer (1990): select the attribute maximizing
  Gain²(S, A) / Cost(A)
• Nunez (1988): select the attribute maximizing
  (2^Gain(S, A) − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost


Issue 5 Alternative Measures for selecting attributes
• There are a lot of alternatives to entropy and information gain.
• Two of them are Gain Ratio and Gini Index.
• Gain ratio is a modification to information gain.
• To compute Gain ratio we use two parameters:
• number of branches created and size of the branch.
• First calculate the information gain as shown earlier, then
compute the intrinsic information, which is the SplitInformation
term defined above.

Now we can calculate the gain ratio as GainRatio = Gain / SplitInformation.
Issue 5 Alternative Measures for selecting attributes
• The Gini Index is more concerned with the impurity of the
attribute.
• Gini Index is one of the most popular alternative to Entropy as
well.
• It is widely used in Classification and Regression Trees (CART).
• To calculate impurity: Gini(S) = 1 − Σ (i = 1 to k) pi², where pi is the
fraction of examples in S belonging to class i.
• From this one can calculate the average Gini Index of a split as the
weighted average over the child nodes: Gini_split = Σ (|Sj| / |S|) · Gini(Sj)
• An advantage of the Gini Index is that it does not require computing
logarithmic functions, which are computationally intensive.
• Gini Index is Minimized instead of Maximized.
Ensemble model
• Organizations use these supervised machine learning
techniques like Decision trees to make a better decision and to
generate more surplus and profit.
• Ensemble methods combine different decision trees to deliver
better predictive results, afterward utilizing a single decision
tree.
• The primary principle behind the ensemble model is that a
group of weak learners come together to form a strong learner.
• There are two techniques given below that are used to perform
ensemble decision tree.
Ensemble model

• There are two techniques given below that are used to perform
ensemble decision tree.
1. Bagging
2. Boosting
Bagging

• Bagging is used when our objective is to reduce the variance of


a decision tree.
• Here the concept is to create a few subsets of data from the
training sample, which is chosen randomly with replacement.
• Now each collection of subset data is used to prepare their
decision trees thus, we end up with an ensemble of various
models.
• The average of all the assumptions from numerous trees is
used, which is more powerful than a single decision tree.
Random Forest Algorithm
• Random Forest is the supervised learning technique.
• It can be used for both Classification and Regression problems
• Random Forest is an expansion over bagging, based on the concept
of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.
• It takes one additional step: besides predicting from a random subset of data, it also makes
a random selection of features rather than using all features to develop
the trees.
• When we have numerous random trees, it is called the Random Forest.
Random Forest Algorithm
• As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset."
• Instead of relying on one decision tree, the random forest takes the prediction
from each tree and based on the majority votes of predictions, and it predicts the
final output.
• The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
Random Forest Algorithm
• The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
• Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct
output, while others may not.
• But together, all the trees predict the correct output.
• Therefore, below are two assumptions for a better Random forest
classifier:
• There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
• The predictions from each tree must have very low correlations.
Why use Random Forest?
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy, even for the large dataset it runs
efficiently.
• It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm
work?
• Random Forest works in two-phases
• Phase I:Create the random forest by combining N decision tree,
• Phase II :Make predictions for each tree created in the first phase.
• The Working process can be explained in the below steps and
diagram:
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data
points (Subsets).
• Step-3: Choose the number N for decision trees that you want to
build.
• Step-4: Repeat Step 1 & 2.
• Step-5: For new data points, find the predictions of each decision tree, and assign the new data point to the category that wins the majority vote.
How does Random Forest algorithm
work?
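A minimal scikit-learn sketch of the two phases described above (toy data; the parameter values are assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Phase I: build N trees, each on a bootstrap sample with random feature subsets.
# Phase II: predict by majority vote of the individual trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))        # test accuracy
print(rf.feature_importances_)     # per-feature importance scores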
Applications of Random Forest

1.Banking: Banking sector mostly uses this algorithm for the


identification of loan risk.
2.Medicine: With the help of this algorithm, disease trends and
risks of the disease can be identified.
3.Land Use: We can identify the areas of similar land use by this
algorithm.
4.Marketing: Marketing trends can be identified using this
algorithm.
• Advantages of Random Forest:
• Random Forest is capable of performing both Classification and
Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the
overfitting issue.
• Disadvantages of Random Forest:
• Although random forest can be used for both classification and
regression tasks, it is less suitable for regression tasks.
Boosting
• Boosting is an ensemble modelling technique that was first
presented by Freund and Schapire in the year 1997, since then,
Boosting has been a prevalent technique for tackling binary
classification problems.
• These algorithms improve the prediction power by converting a
number of weak learners to strong learners.
• The principle behind boosting algorithms is first we built a model
on the training dataset, then a second model is built to rectify
the errors present in the first model.
• This procedure is continued until and unless the errors are
minimized, and the dataset is predicted correctly.
Boosting Example
• suppose you built a decision tree algorithm on the Titanic dataset and from
there you get an accuracy of 80%.
• After this, you apply a different algorithm and check the accuracy and it
comes out to be 75% for KNN and 70% for Linear Regression.
• We see the accuracy differs when we built a different model on the same
dataset.
• But what if we use combinations of all these algorithms for making the
final prediction?
• We’ll get more accurate results by taking the average of results from these
models.
• We can increase the prediction power in this way.
• Boosting algorithms works in a similar way, it combines multiple models
(weak learners) to reach the final output (strong learners).
Types of boosting algorithms.
• AdaBoost (Adaptive Boosting) algorithm
• Gradient boosting algorithm
• XGBoost (Extreme Gradient Boosting) algorithm
AdaBoost algorithm
• AdaBoost also called Adaptive Boosting is a technique in Machine
Learning used as an Ensemble Method.
• The most common algorithm used with AdaBoost is decision trees with
one level that means with Decision trees with only 1 split.
• These trees are also called Decision Stumps.
AdaBoost algorithm
• It builds a model and gives equal weights to all the data points.
• It then assigns higher weights to points that are wrongly classified.
• Now all the points which have higher weights are given more importance
in the next model.
• It will keep training models until and unless a lower error is received.
working of AdaBoost Algorithm
• Step 1 – The image shown below is the actual representation of our dataset. Since the
target column is binary, it is a classification problem. First of all, these data points will be
assigned some weights. Initially, all the weights will be equal.

• The formula to calculate the sample weights is: w(i) = 1/N

Where N is the total number of datapoints

Here, since we have 5 data points, the sample weight assigned to each will be 1/5.
working of AdaBoost Algorithm
• Step 2 – We start by seeing how well “Gender” classifies the samples, and then see how the
other variables (Age, Income) classify the samples.
• We’ll create a decision stump for each of the features and then calculate the Gini Index of
each tree. The tree with the lowest Gini Index will be our first stump.
• Here in our dataset let’s say Gender has the lowest gini index so it will be our first stump.
working of AdaBoost Algorithm
• Step 3 – We’ll now calculate the “Amount of Say” or “Importance” or “Influence” for this
classifier in classifying the datapoints using this formula:

Amount of Say (alpha) = (1/2) · ln((1 − Total Error) / Total Error)

• The total error is nothing but the summation of the sample weights of all misclassified
data points. Here in our dataset let’s assume there is 1 wrong output, so our total error will
be 1/5, and alpha (the performance of the stump) will be (1/2) · ln(4) ≈ 0.69.
working of AdaBoost Algorithm
• Note: Total error will always be between 0 and 1.
• 0 Indicates perfect stump and 1 indicates horrible stump.

• From the graph above we can see that when there


is no misclassification then we have no error
(Total Error = 0),
• so the “amount of say (alpha)” will be a large
number.
• When the classifier predicts half right and half
wrong then the Total Error = 0.5 and the
importance (amount of say) of the classifier will
be 0.
• If all the samples have been incorrectly classified
then the error will be very high (approx. 1) and
hence our alpha value will be a large negative number.
working of AdaBoost Algorithm
• Step 4 – You must be wondering why is it necessary to calculate the Total Error and
performance of a stump?
• Well, the answer is very simple, we need to update the weights because if the same
weights are applied to the next model, then the output received will be the same as what
was received in the first model.
• The wrong predictions will be given more weight whereas the correct predictions weights
will be decreased.
• Now when we build our next model after updating the weights, more preference will be
given to the points with higher weights.
• After finding the importance of the classifier and total error we need to finally update the
weights and for this, we use the following formula:
working of AdaBoost Algorithm
• Step 4 – After finding the importance of the classifier and the total error, we finally
update the weights, and for this we use the following formula:

New sample weight = old weight × e^(±alpha)

• The amount of say (alpha) takes a negative sign when the sample is correctly classified.
• The amount of say (alpha) takes a positive sign when the sample is misclassified.
• There are four correctly classified samples and 1 wrong one; here the sample weight of each
datapoint is 1/5 and the amount of say/performance of the Gender stump is 0.69.
• New weight for each correctly classified sample: 0.2 × e^(−0.69) ≈ 0.1004. Updated weight
for the wrongly classified sample: 0.2 × e^(0.69) ≈ 0.3988.
working of AdaBoost Algorithm
• Note: See the sign of alpha when I am putting the values, the alpha is negative when the
data point is correctly classified, and this decreases the sample weight from 0.2 to 0.1004.
• It is positive when there is misclassification, and this will increase the sample weight from
0.2 to 0.3988
working of AdaBoost Algorithm
• We know that the total sum of the sample weights must be equal to 1 but here if we sum
up all the new sample weights, we will get 0.8004.
• To bring this sum equal to 1 we will normalize these weights by dividing all the weights
by the total sum of updated weights that is 0.8004. So, after normalizing the sample
weights we get this dataset and now the sum is equal to 1.
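These numbers can be reproduced with a few lines of Python (a sanity-check sketch, using the slide's rounded alpha of 0.69):

import numpy as np

total_error = 1 / 5                              # 1 of 5 samples misclassified
alpha = round(0.5 * np.log((1 - total_error) / total_error), 2)
print(alpha)                                     # 0.69 -- the "amount of say"

w = 0.2                                          # initial sample weight (1/N = 1/5)
w_correct = w * np.exp(-alpha)                   # ~0.1003 (the slide rounds to 0.1004)
w_wrong = w * np.exp(alpha)                      # ~0.3987 (the slide rounds to 0.3988)

total = 4 * w_correct + w_wrong                  # ~0.80 (the slide, rounding per sample, gets 0.8004)
print(w_correct / total, w_wrong / total)        # normalized weights, which sum to 1 across all samples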
working of AdaBoost Algorithm
• Step 5 – Now we need to make a new dataset to see if the errors decreased or not. For this
we will remove the “sample weights” and “new sample weights” column and then based
on the “new sample weights” we will divide our data points into buckets.
working of AdaBoost Algorithm
• Step 6 – We are almost done. Now what the algorithm does is select random numbers
from 0-1. Since incorrectly classified records have higher sample weights, the probability
of selecting those records is very high.
• Suppose the 5 random numbers our algorithm takes are 0.38, 0.26, 0.98, 0.40, 0.55.
• Now we will see where these random numbers fall in the bucket and, according to that, we’ll
make our new dataset shown below.
working of AdaBoost Algorithm
• This comes out to be our new dataset, and we see the datapoint which was wrongly classified
has been selected 3 times because it has a higher weight.
• Step 9 – Now this acts as our new dataset and we need to repeat all the above steps i.e.
1. Assign equal weights to all the datapoints
2. Find the stump that does the best job classifying the new collection of samples by finding their
Gini Index and selecting the one with the lowest Gini index
3. Calculate the “Amount of Say” and “Total error” to update the previous sample weights.
4. Normalize the new sample weights.
5. Iterate through these steps until and unless a low training error is achieved.
6. Suppose with respect to our dataset we have constructed 3 decision trees (DT1, DT2, DT3) in a
sequential manner. If we send our test data now it will pass through all the decision trees and
finally, we will see which class has the majority, and based on that we will do predictions for
our test dataset.
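A minimal scikit-learn sketch of AdaBoost with decision stumps (toy data; n_estimators is an assumption — scikit-learn's default weak learner is a one-level tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Each weak learner defaults to a one-level decision tree (a decision stump);
# n_estimators controls how many boosting rounds are performed.
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_tr, y_tr)
print(ada.score(X_te, y_te))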
• AdaBoost Algorithm - A Complete Guide for Beginners - Analytics
Vidhya
Difference between Bagging and Boosting
Bagging vs Boosting:
• Bagging: Various training data subsets are randomly drawn with replacement from the whole training dataset. Boosting: Each new subset contains the components that were misclassified by previous models.
• Bagging attempts to tackle the over-fitting issue. Boosting tries to reduce bias.
• If the classifier is unstable (high variance), then we need to apply bagging. If the classifier is steady and straightforward (high bias), then we need to apply boosting.
• In bagging, every model receives an equal weight. In boosting, models are weighted by their performance.
• Bagging's objective is to decrease variance, not bias. Boosting's objective is to decrease bias, not variance.
• Bagging is the easiest way of connecting predictions that belong to the same type. Boosting is a way of connecting predictions that belong to different types.
• In bagging, every model is constructed independently. In boosting, new models are affected by the performance of the previously developed model.
Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: Rare positive example but numerous


negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalance data in 2-class classification:
• Oversampling: re-sampling of data from positive class
• Under-sampling: randomly eliminate tuples from negative
class
• Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance
of costly false negative errors
• Ensemble techniques: Ensemble multiple classifiers introduced
above
• Still difficult for class imbalance problem on multiclass tasks

Reinforcement Learning
Types of Machine Learning
Types of Machine Learning
What is Reinforcement Learning?

• Reinforcement Learning is a feedback-based


Machine learning technique in which an
agent learns to behave in an environment by
performing the actions and seeing the results
of actions.
• For each good action, the agent gets positive
feedback, and for each bad action, the agent
gets negative feedback or penalty.

• Reinforcement learning is a type of machine
learning method where an intelligent agent
(computer program) interacts with the
environment and learns to act within that
environment.
What is Reinforcement Learning?

• In Reinforcement Learning, the agent learns automatically
using feedback without any labeled data, unlike supervised
learning.
• Since there is no labeled data, so the agent is bound to learn
by its experience only.
• RL solves a specific type of problem where decision making
is sequential, and the goal is long-term, such as game-
playing, robotics, etc.
• The agent interacts with the environment and explores it by
itself. The primary goal of an agent in reinforcement learning
is to improve the performance by getting the maximum
positive rewards.
• The agent learns through the process of trial and error, and based
on this experience, it learns to perform the task in a better
way.
Fundamentals of Reinforcement Learning

• Reinforcement Learning (RL) is one of the areas of Machine Learning (ML).


Unlike other ML paradigms, such as supervised and unsupervised learning, RL
works in a trial and error fashion by interacting with its environment.
• RL is one of the most active areas of research in artificial intelligence, and it is
believed that RL will take us a step closer towards achieving artificial general
intelligence.
• RL has evolved rapidly in the past few years with a wide variety of applications
ranging from building a recommendation system to self-driving cars.
• The major reason for this evolution is the advent of deep reinforcement learning,
which is a combination of deep learning and RL.
• With the emergence of new RL algorithms and libraries, RL is clearly one of the
most promising areas of ML.
The basic idea of RL
• Picture a dog and its master: imagine you are training your dog to fetch a stick.
• Each time the dog fetches the stick successfully, you offer it a treat (say, a bone).
• Eventually, the dog understands the pattern: whenever the master throws a stick, it should fetch it as quickly as it can to earn the reward (a bone) from the master in less time.
Applications of RL
• A large amount of data is required for reinforcement learning models.
• This means RL is not applied in areas with limited data, but it can be ideal for robotics, industrial automation, and building computer games.
• Reinforcement learning algorithms have the ability to make sequential decisions and learn from their experience.
• That is their distinguishing feature from traditional machine learning models.
Applications of RL
• Computer Games: Pac-Man is a well-known and simple example. Pac-Man's (the agent of the model) goal is to eat the food in the grid (the environment of the model) without getting killed by the ghosts. Pac-Man is rewarded when it eats food and loses the game when it is killed.
• Industrial Automation and Robotics: Reinforcement learning helps industrial applications and robots learn by themselves the skills needed to perform their tasks.
• Traffic Control Systems: Reinforcement learning is used for real-time decision-making and optimisation in traffic control activities. There are existing projects, such as one supporting air traffic control systems.
Applications of RL
• Resource Management Systems: Reinforcement learning is used to distribute limited resources across activities and to meet resource-usage goals.
• Advertising: Reinforcement learning supports businesses and marketers in creating personalized content and recommendations.
• Other: Reinforcement learning models are also used in other machine learning fields like text summarization, chatbots, self-driving cars, online stock trading, and auctions and bidding.
Applications of RL
• 10 Real-Life Applications of Reinforcement Learning - neptune.ai
Reinforcement Learning Applications
1. Robotics: RL is used in robot navigation, Robo-soccer, walking, juggling, etc.
2. Control: RL can be used for adaptive control, such as factory processes and admission control in telecommunication; a helicopter pilot is another example of reinforcement learning.
3. Game Playing: RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing chemical reactions.
5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put them in containers.
7. Finance Sector: RL is currently used in the finance sector for evaluating trading strategies.
The basic idea of RL
• Consider an example of a child trying to take his/her first steps. What steps does he/she follow to start walking?
1. Observing others walking and trying to replicate the same
2. Standing still
3. Trying to balance the body weight, and deciding which foot to advance first to start walking
• It sounds like a difficult and challenging task for a child to get up and walk, right? But for us it is easy, since we have become used to it over time.
• Now, putting it together: the child is an agent who tries to manipulate the environment (the surface or floor) by trying to walk, going from one state to another (taking a step). The child gets a reward (appreciation) when he/she takes a few steps, but receives no reward or appreciation if he/she is unable to walk. This is a simplified description of a reinforcement learning problem.
Key elements of RL
• Agent
• An agent is a software program that learns to make intelligent decisions.
• We can say that an agent is a learner in the RL setting.
• For instance, a chess player can be considered an agent since the player learns to make the best
moves (decisions) to win the game.
• Similarly, Mario in a Super Mario Bros video game can be considered an agent since Mario
explores the game and learns to make the best moves in the game.
• Environment
• The environment is the world of the agent.
• The agent stays within the environment.
• For instance, coming back to our chess game, a chessboard is called the environment since the
chess player (agent) learns to play the game of chess within the chessboard (environment).
Similarly, in Super Mario Bros, the world of Mario is called the environment.
Key elements of RL
• State and action
• A state is a position or a moment in the environment that the agent can be in.
• We learned that the agent stays within the environment, and there can be many positions in the environment
that the agent can stay in, and those positions are called states.
• For instance, in our chess game example, each position on the chessboard is called the state.
• The state is usually denoted by s.
• The agent interacts with the environment and moves from one state to another by performing an action. In the
chess game environment, the action is the move performed by the player (agent). The action is usually denoted
by a.
• Reward
• We learned that the agent interacts with an environment by performing an action and moves from one state to
another.
• Based on the action, the agent receives a reward.
• A reward is nothing but a numerical value, say, +1 for a good action and -1 for a bad action.
• How do we decide if an action is good or bad?
• In our chess game example, if the agent makes a move in which it takes one of the opponent's
chess pieces, then it is considered a good action and the agent receives a positive reward. Similarly,
if the agent makes a move that leads to the opponent taking the agent's chess piece, then it is
considered a bad action and the agent receives a negative reward. The reward is denoted by r.
Key elements of RL
• Policy – the strategy the agent prepares (its decision-making) to map situations to actions.
• Value Function – the value of a state is the reward achieved starting from that state until the policy is executed.
• Model – not every RL agent uses a model of its environment. A model is the agent's view of the environment: it maps state-action pairs to probability distributions over states.
Reinforcement Learning Workflow
– Create the environment
– Define the reward
– Create the agent
– Train and validate the agent
– Deploy the policy
The RL algorithm
The steps involved in a typical RL algorithm are as follows:
1. First, the agent interacts with the environment by performing an action.
2. By performing an action, the agent moves from one state to another.
3. The agent then receives a reward based on the action it performed.
4. Based on the reward, the agent understands whether the action was good or bad.
5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action; otherwise, the agent will try other actions in search of a positive reward.
• RL is basically a trial and error learning process.
• Now, let's revisit our chess game example.
• The agent (software program) is the chess player.
• So, the agent interacts with the environment (chessboard) by performing an action
(moves).
• If the agent gets a positive reward for an action, then it will prefer performing that
action;
• else it will find a different action that gives a positive reward.
• Ultimately, the goal of the agent is to maximize the reward it gets.
• If the agent receives a good reward, then it means it has performed a good action.
If the agent performs a good action, then it implies that it can win the game.
• Thus, the agent learns to win the game by maximizing the reward.
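The loop below is a minimal Python sketch of these steps, assuming a hypothetical environment object with reset() and step() methods (illustrative names, not from the slides) and an agent that simply remembers, per state, the last action that earned a positive reward:

```python
import random

def run_episode(env, actions, preferred):
    """One episode of trial-and-error learning.

    `preferred` maps state -> an action that previously earned a positive
    reward in that state; it persists across episodes.
    """
    state = env.reset()
    done, total = False, 0
    while not done:
        # Pick an action: a remembered good one if available, else a random one.
        action = preferred.get(state, random.choice(actions))
        # Act, move to the next state, and receive a reward.
        next_state, reward, done = env.step(action)
        # A positive reward makes the action preferred; a negative one makes
        # the agent forget the preference and explore next time.
        if reward > 0:
            preferred[state] = action
        else:
            preferred.pop(state, None)
        total += reward
        state = next_state
    return total
```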
RL agent in the grid world
• The positions A to I in the environment are called the states of the environment.
• The goal of the agent is to reach state I by starting from state A
without visiting the shaded states (B, C, G, and H).
• Thus, in order to achieve the goal, whenever our agent visits a
shaded state, we will give a negative reward (say -1) and when
it visits an unshaded state, we will give a positive reward (say
+1).
• The actions in the environment are moving up, down, right and
left. The agent can perform any of these four actions to reach
state I from state A.
RL agent in the grid world
Iteration 1:
• the agent performs a random action in each state. For instance,
look at the following figure. In the first iteration, the agent
moves right from state A and reaches the new state B.
• But since B is the shaded state, the agent will receive a negative
reward and so the agent will understand that moving right is
not a good action in state A.
• When it visits state A next time, it will try out a different action
instead of moving right:
RL agent in the grid world
Iteration 1:
• from state B, the agent moves down and reaches the new
state E. Since E is an unshaded state, the agent will receive a
positive reward, so the agent will understand that
moving down from state B is a good action.
• From state E, the agent moves right and reaches state F.
Since F is an unshaded state, the agent receives a positive
reward, and it will understand that moving right from state E is
a good action.
• From state F, the agent moves down and reaches the goal
state I and receives a positive reward, so the agent will
understand that moving down from state F is a good action
RL agent in the grid world
Iteration 2:
In the second iteration, from state A, instead of
moving right, the agent tries out a different action as the
agent learned in the previous iteration that moving right is
not a good action in state A.
Thus, as Figure shows, in this iteration the agent
moves down from state A and reaches state D. Since D is
an unshaded state, the agent receives a positive reward
and now the agent will understand that moving down is a
good action in state A:
RL agent in the grid world
Iteration 2:
• from state D, the agent moves down and reaches state G. But
since G is a shaded state, the agent will receive a negative
reward and so the agent will understand that moving down is
not a good action in state D, and when it visits state D next
time, it will try out a different action instead of moving down.
• From G, the agent moves right and reaches state H. Since H is
a shaded state, it will receive a negative reward and understand
that moving right is not a good action in state G.
• From H it moves right and reaches the goal state I and receives
a positive reward, so the agent will understand that
moving right from state H is a good action.
RL agent in the grid world
Iteration 3:
• The agent moves down from state A since, in the second iteration, our agent
learned that moving down is a good action in state A.
So, the agent moves down from state A and reaches the next
state, D, as Figure shows.
• Now, from state D, the agent tries a different action instead of moving down
since in the second iteration our agent learned that moving down is not a
good action in state D. So, in this iteration, the agent moves right from state
D and reaches state E.
• From state E, the agent moves right as the agent already learned in the first
iteration that moving right from state E is a good action and reaches state F.
• Now, from state F, the agent moves down since the agent learned in the first
iteration that moving down is a good action in state F, and reaches the goal
state I.
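These iterations can be reproduced with a tiny environment. Below is one possible encoding of the 3x3 grid world (states A to I laid out row by row, shaded states B, C, G, and H, goal I, +1/-1 rewards as above); the slides don't say what happens on an off-grid move, so here the agent stays put with a -1 penalty, which is an assumption. The class can be driven by the run_episode loop sketched earlier.

```python
class GridWorld:
    """3x3 grid: A B C / D E F / G H I. Start at A; the goal is I."""
    STATES = "ABCDEFGHI"
    SHADED = set("BCGH")
    MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

    def reset(self):
        self.state = "A"
        return self.state

    def step(self, action):
        idx = self.STATES.index(self.state)
        row, col = divmod(idx, 3)
        blocked = ((action == "up" and row == 0) or
                   (action == "down" and row == 2) or
                   (action == "left" and col == 0) or
                   (action == "right" and col == 2))
        if blocked:
            return self.state, -1, False  # assumption: off-grid moves just penalize
        self.state = self.STATES[idx + self.MOVES[action]]
        reward = -1 if self.state in self.SHADED else +1  # the slides' reward scheme
        done = (self.state == "I")
        return self.state, reward, done
```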
Types of Reinforcement Learning
• 1. Positive Reinforcement Learning:
• In this type of RL, the algorithm receives a reward for a certain result. In other words, here we add a reward for every good result in order to increase the likelihood of a good result.
• We can understand this easily with the help of an example.
• In order to make a child do a certain task, like cleaning their room or studying hard for marks, some parents often promise them a reward at the end of the task.
• For example, the parents promise to give the child something he or she loves, like chocolate. This has a good impact, as it automatically makes the child work while thinking of the reward. In this learning, we are adding a good reward to increase the likelihood of task completion.
• This can have good impacts, like improvement in performance and sustaining the change for a longer duration, but the negative side is that too much reinforcement could cause an overloading of states, which could impact the results.
Types of Reinforcement Learning
• 2. Negative Reinforcement Learning:
• This type is a bit different from positive RL. Here, we try to remove something negative in order to improve performance.
• We can take the same child-parent example here as well. Some parents punish kids for not cleaning their rooms.
• The punishment can be no video games for a week, or sometimes a month. To avoid the punishment, the kids often work harder or complete the job assigned to them.
• We can also take the example of getting late for the office. People often sleep late and get up late. To avoid being late for the office, they try to change their sleeping habits.
• From these examples, we understand that the algorithm in this case receives negative feedback, so it will avoid the process that resulted in that feedback. This also has its good impacts: the drive toward performing the task increases, forcing you to produce better results.
• The negative impact is that it may only force you to meet the minimum requirement necessary to complete the job.
Supervised vs Unsupervised vs Reinforcement Learning

Criteria | Supervised ML | Unsupervised ML | Reinforcement ML
Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data | Labelled data | Unlabelled data | No predefined data
Type of problems | Regression and classification | Association and clustering | Exploitation or exploration
Supervision | Extra supervision | No supervision | No supervision
Algorithms | Linear Regression, Logistic Regression, SVM, KNN, etc. | K-Means, C-Means, Apriori | Q-Learning, SARSA
Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application | Risk evaluation, forecasting sales | Recommendation systems, anomaly detection | Self-driving cars, gaming, healthcare
Fundamental concepts of RL
• Math essentials
• Before going ahead, let's quickly recap expectation from our high school
days, as we will be dealing with expectation throughout the book.
• Expectation
• Let's say we have a variable X and it has the values 1, 2, 3, 4, 5, 6.
• To compute the average value of X, we can just sum all the values
of X divided by the number of values of X. Thus, the average of X is
(1+2+3+4+5+6)/6 = 3.5.
• Now, let's suppose X is a random variable.
• A random variable takes values based on a random experiment, such as throwing dice, tossing a coin, and so on, and it takes different values with some probabilities. Let's suppose we throw a fair die; then the possible outcomes (X) are 1, 2, 3, 4, 5, and 6, and the probability of occurrence of each of these outcomes is 1/6.
Fundamental concepts of RL
• How can we compute the average value of the random variable X? Since each value has a probability of occurrence, we can't just take the plain average.
• Instead, we compute the weighted average, that is, the sum of the values of X multiplied by their respective probabilities; this is called the expectation.
• The expectation of a random variable X can be defined as:
E(X) = Σ_i x_i p(x_i)
• Thus, the expectation of the random variable X is E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5.
Fundamental concepts of RL
• The expectation is also known as the expected value.
• Thus, the expected value of the random variable X is 3.5.
• So, when we talk about the expectation or the expected value of a random variable, it basically means the weighted average.
• Now, we will look into the expectation of a function of a random variable. Let f(X) = X²; then we can write:
E(f(X)) = Σ_i f(x_i) p(x_i)
• Thus, the expected value of f(X) is given as E(f(X)) = 1(1/6) + 4(1/6) + 9(1/6) + 16(1/6) + 25(1/6) + 36(1/6) = 91/6 ≈ 15.17.
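A quick check of both numbers in Python:

```python
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each outcome of a fair die is equally likely

E_X = sum(x * p for x in outcomes)          # E(X) = sum of x * p(x)
E_fX = sum(x ** 2 * p for x in outcomes)    # E(f(X)) with f(X) = X^2

print(E_X)   # 3.5
print(E_fX)  # 15.1666..., i.e. 91/6, approximately 15.17
```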
Fundamental concepts of RL - Action space

Action space:
• Consider the grid world environment shown in Figure
• In the preceding grid world environment, the goal of the agent is
to reach state I starting from state A without visiting the shaded
states. In each of the states, the agent can perform any of the four
actions—up, down, left, and right—to achieve the goal.
• The set of all possible actions in the environment is called the
action space. Thus, for this grid world environment, the action
space will be [up, down, left, right].
• We can categorize action spaces into two types:
1. Discrete action space
2. Continuous action space
Fundamental concepts of RL - Action space

Discrete action space:
When our action space consists of actions that are discrete, then it
is called a discrete action space.
For instance, in the grid world environment, our action space
consists of four discrete actions, which are up, down, left, right, and so it is
called a discrete action space.
Continuous action space:
When our action space consists of actions that are continuous,
then it is called a continuous action space.
For instance, let's suppose we are training an agent to drive a car,
then our action space will consist of several actions that have continuous
values, such as the speed at which we need to drive the car, the number of
degrees we need to rotate the wheel, and so on.
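As an illustration, this is how the two kinds of action space are commonly declared with the OpenAI Gym library's gym.spaces module (assuming gym and numpy are installed; the bounds for the driving example are invented for this sketch):

```python
import numpy as np
from gym import spaces

# Discrete action space: the grid world's four moves (up, down, left, right)
# mapped to the indices 0..3.
grid_actions = spaces.Discrete(4)

# Continuous action space: e.g. [steering angle in degrees, speed in km/h]
# for the car driving example; the bounds are illustrative assumptions.
car_actions = spaces.Box(
    low=np.array([-30.0, 0.0], dtype=np.float32),
    high=np.array([30.0, 120.0], dtype=np.float32),
)

print(grid_actions.sample())  # a random integer in {0, 1, 2, 3}
print(car_actions.sample())   # a random [angle, speed] pair within the bounds
```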
Fundamental concepts of RL - Policy
• A policy defines the agent's behavior in an environment. The policy tells the
agent what action to perform in each state. For instance, in the grid world
environment, we have states A to I and four possible actions. The policy may
tell the agent to move down in state A, move right in state D, and so on.
• To interact with the environment for the first time, we initialize a random policy,
that is, the random policy tells the agent to perform a random action in each
state.
• Thus, in an initial iteration, the agent performs a random action in each state and
tries to learn whether the action is good or bad based on the reward it obtains.
• Over a series of iterations, an agent will learn to perform good actions in each
state, which gives a positive reward.
• Thus, we can say that over a series of iterations, the agent will learn a good
policy that gives a positive reward.
Fundamental concepts of RL - Policy
• The optimal policy is the policy that gets the agent a good reward
and helps the agent to achieve the goal. For instance, in our grid
world environment, the optimal policy tells the agent to perform an
action in each state such that the agent can reach state I from state
A without visiting the shaded states.
• The optimal policy is shown in the figure.
• As we can observe, the agent selects the action in each state based
on the optimal policy and reaches the terminal state I from the
starting state A without visiting the shaded states
• the optimal policy tells the agent to perform the correct action in
each state so that the agent can receive a good reward.
Fundamental concepts of RL - Policy

A policy can be classified as one of the following:
• A deterministic policy: maps each state to a single action.
• A stochastic policy: maps each state to a probability distribution over actions.
Fundamental concepts of RL - Episode

• The agent interacts with the environment by performing some actions, starting from the initial state, and reaches the final state.
• This agent-environment interaction from the initial state until the final state is called an episode.
• For instance, in a car racing video game, the agent plays the game by starting from the initial state (the starting point of the race) and reaches the final state (the endpoint of the race). This is considered an episode.
• An episode is also often called a trajectory (the path taken by the agent) and is denoted by τ (tau).
Fundamental concepts of RL - Episode
• An agent can play the game for any number of episodes, and each episode is independent
of the others.
• What is the use of playing the game for multiple episodes? In order to learn the optimal
policy, that is, the policy that tells the agent to perform the correct action in each state, the
agent plays the game for many episodes.
• For example, let's say we are playing a car racing game; the first time, we may not win
the game, so we play the game several times to understand more about the game and
discover some good strategies for winning the game.
• Similarly, in the first episode, the agent may not win the game and it plays the game for
several episodes to understand more about the game environment and good strategies to
win the game.
• Say we begin the game from an initial state at a time step t = 0 and reach the final state at
a time step T, then the episode information consists of the agent-environment interaction,
such as state, action, and reward, starting from the initial state until the final state, that is,
(s0, a0, r0, s1, a1, r1,…,sT).
Fundamental concepts of RL - Episode
• The figure shows an example of an episode/trajectory.
Episode and the optimal policy with the grid world environment
• In the grid world environment, the goal of our agent is to reach the final state I starting from the
initial state A without visiting the shaded states. An agent receives a +1 reward when it visits the
unshaded states and a -1 reward when it visits the shaded states.
• When we say generate an episode, it means going from the initial state to the final state. The agent
generates the first episode using a random policy and explores the environment and over several
episodes, it will learn the optimal policy.
• Episode 1
Episode and the optimal policy with the grid world environment
• Episode 2
• In the second episode, the agent tries a different policy to avoid the negative
rewards it received in the previous episode.
• For instance, as we can observe in the previous episode, the agent selected the
action right in state A and received a negative reward, so in this episode, instead of
selecting the action right in state A, it tries a different action, say down, as shown
in figure
Episode and the optimal policy with the grid world environment
• Episode n
• Thus, over a series of episodes, the agent learns the optimal policy, that is, the
policy that takes the agent to the final state I from state A without visiting the
shaded states, as Figure shows:
Fundamental concepts of RL - The value function
• The value function, also called the state value function, denotes the value of a state. The value of a state is the return an agent would obtain starting from that state and following policy π.
• The value of a state, or the value function, is usually denoted by V(s) and can be expressed as:
V(s) = E[R(τ) | s0 = s]
• where s0 = s implies that the starting state is s, and R(τ) is the return of the trajectory. The value of a state is called the state value.
Fundamental concepts of RL - The value function
• Let's understand the value function with an example. Suppose we generate the following trajectory using some policy π in our grid world environment, as shown in the figure.
Fundamental concepts of RL - The value function
• The value of state A is the return of the trajectory starting from state A. Thus, V(A) = 1 + 1 - 1 + 1 = 2.
• The value of state D is the return of the trajectory starting from state D. Thus, V(D) = 1 - 1 + 1 = 1.
• The value of state E is the return of the trajectory starting from state E. Thus, V(E) = -1 + 1 = 0.
• The value of state H is the return of the trajectory starting from state H. Thus, V(H) = 1.
• Since I is the final state, we make no transition from it, so there is no reward and thus no value for the final state I.
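These returns are easy to verify in code. The sketch below assumes, from the figure just described, the trajectory A → D → E → H → I with per-transition rewards [+1, +1, -1, +1]:

```python
# The value of each state in the trajectory is the (undiscounted) sum of the
# rewards obtained from that state onward.
states = ["A", "D", "E", "H"]   # states the agent transitions out of
rewards = [+1, +1, -1, +1]      # reward received on each transition

for i, s in enumerate(states):
    V = sum(rewards[i:])        # return starting from state s
    print(f"V({s}) = {V}")
# prints V(A) = 2, V(D) = 1, V(E) = 0, V(H) = 1
```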
Reinforcement Learning Algorithms
• Value-Based – The main goal of this method is to maximize a value function. Here, an agent, through a policy, expects a long-term return from the current states.
• Policy-Based – In policy-based methods, you come up with a strategy that helps to gain maximum rewards in the future through the possible actions performed in each state. The two types of policy-based methods are deterministic and stochastic.
• Model-Based – In this method, we need to create a virtual model for the agent to help it learn to perform in each specific environment.
Reinforcement Learning Algorithms
• Model-Based – In the model-based approach, a system uses a predictive model of the world to ask questions of the form "what will happen if I do x?" in order to choose the best x.
• Markov Decision Process
Markov Decision Process (MDP)
• The Markov Decision Process (MDP) provides a mathematical framework for solving the RL
problem.
• Almost all RL problems can be modeled as an MDP.
• MDPs are widely used for solving various optimization problems.
• In this section, we will understand what an MDP is and how it is used in RL.
• To understand an MDP, first, we need to learn about the Markov property and Markov chain.
• The Markov property and Markov chain
• The Markov property states that the future depends only on the present and not on the past.
• The Markov chain, also known as the Markov process, consists of a sequence of states that strictly
obey the Markov property;
• that is, the Markov chain is the probabilistic model that solely depends on the current state to
predict the next state and not the previous states, that is, the future is conditionally independent of
the past.
Markov Decision Process (MDP)
• For example, if we want to predict the weather and we know that the current state is cloudy, we can
predict that the next state could be rainy.
• We concluded that the next state is likely to be rainy only by considering the current state (cloudy)
and not the previous states, which might have been sunny, windy, and so on.
• However, the Markov property does not hold for all processes. For instance, throwing a dice (the
next state) has no dependency on the previous number that showed up on the dice (the current
state)
• Moving from one state to another is called a transition, and its probability is called a transition probability.
• We denote the transition probability by P(s′|s). It indicates the probability of moving from state s to the next state s′.
Markov Decision Process (MDP)
• Say we have three states (cloudy, rainy, and windy)
in our Markov chain. Then we can represent the
probability of transitioning from one state to another
using a table called a Markov table, as shown in
Table.
• From the state cloudy, we transition to the state
rainy with 70% probability and to the state windy
with 30% probability.
• From the state rainy, we transition to the same state
rainy with 80% probability and to the state cloudy
with 20% probability.
• From the state windy, we transition to the state rainy
with 100% probability.
Markov Decision Process (MDP)
• We can also formulate the transition probabilities into a matrix called the transition matrix. From the probabilities above it is:

            cloudy   rainy   windy
  cloudy      0.0     0.7     0.3
  rainy       0.2     0.8     0.0
  windy       0.0     1.0     0.0

• Thus, we can say that the Markov chain or Markov process consists of a set of states along with their transition probabilities.
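A short simulation of this weather Markov chain; note that sampling the next state uses only the current state, which is exactly the Markov property:

```python
import random

transitions = {
    "cloudy": [("rainy", 0.7), ("windy", 0.3)],
    "rainy":  [("rainy", 0.8), ("cloudy", 0.2)],
    "windy":  [("rainy", 1.0)],
}

def next_state(state):
    # The distribution over next states depends only on the current state.
    choices = [s for s, _ in transitions[state]]
    probs = [p for _, p in transitions[state]]
    return random.choices(choices, weights=probs)[0]

state = "cloudy"
chain = [state]
for _ in range(10):
    state = next_state(state)
    chain.append(state)
print(" -> ".join(chain))
```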
The Markov Reward Process
• The Markov Reward Process (MRP) is an extension of the Markov chain with the reward
function.
• That is, we learned that the Markov chain consists of states and a transition probability.
• The MRP consists of states, a transition probability, and also a reward function.
• A reward function tells us the reward we obtain in each state.
• For instance, based on our previous weather example, the reward function tells us the
reward we obtain in the state cloudy, the reward we obtain in the state windy, and so on.
• The reward function is usually denoted by R(s).
• Thus, the MRP consists of states s, a transition probability P(s′|s), and a reward function R(s).
The Markov Decision Process
• The Markov Decision Process (MDP) is an extension of the MRP with actions.
• That is, we learned that the MRP consists of states, a transition probability, and a reward
function.
• The MDP consists of states, a transition probability, a reward function, and also actions.
• The Markov property states that the next state is dependent only on the current state and
is not based on the previous state.
• Is the Markov property applicable to the RL setting? Yes! In the RL environment, the
agent makes decisions only based on the current state and not based on the past states.
• So, we can model an RL environment as an MDP.
The Markov Decision Process
• For example, in our grid world environment, say the transition probability of moving from state A to state B while performing the action right is 100%.
• This can be expressed as P(B|A, right) = 1.0.
We can also view this in the state diagram, as
shown in Figure.
The Markov Decision Process
• Suppose our agent is in state C and the transition probability of moving from state C to state F while performing the action down is 90%; then it can be expressed as P(F|C, down) = 0.9.
• We can also view this in the state diagram, as shown in the figure.
The Markov Decision Process
• Reward function – The reward function is denoted by R(s, a, s′). It represents the reward our agent obtains while transitioning from state s to state s′ while performing an action a.
The Markov Decision Process
• Say the reward we obtain while transitioning from state A to state B while performing the action right is -1; then it can be expressed as R(A, right, B) = -1.
• We can also view this in the state diagram, as shown in the figure.
The Markov Decision Process
• Suppose our agent is in state C, and say the reward we obtain while transitioning from state C to state F while performing the action down is +1; then it can be expressed as R(C, down, F) = +1.
• We can also view this in the state diagram, as shown in the figure.
The Markov Decision Process
• Thus, an RL environment can be represented as an MDP with
states, actions, transition probability, and the reward function.
• But wait! What is the use of representing the RL environment
using the MDP? We can solve the RL problem easily once we
model our environment as the MDP.
• For instance, once we model our grid world environment using
the MDP, then we can easily find how to reach the goal
state I from state A without visiting the shaded states.
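One lightweight way to write such an MDP down is a table from (state, action) pairs to possible outcomes. The sketch below encodes just the two example transitions from the figures above; the 10% alternative outcome for (C, down) is an assumption, since the slides don't say where the remaining probability mass goes:

```python
# Each entry maps (state, action) -> a list of (next_state, probability, reward).
mdp = {
    ("A", "right"): [("B", 1.0, -1)],                 # P(B|A,right)=1.0, R(A,right,B)=-1
    ("C", "down"):  [("F", 0.9, +1), ("C", 0.1, 0)],  # P(F|C,down)=0.9; remainder assumed
}

for (s, a), outcomes in mdp.items():
    for s2, p, r in outcomes:
        print(f"P({s2} | {s}, {a}) = {p},  R({s}, {a}, {s2}) = {r}")
```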
Reinforcement Learning Algorithms
Q-learning is a model-free, value-based, off-policy learning algorithm.
• Model-free: the algorithm estimates its optimal policy without needing any transition or reward functions from the environment.
• Value-based: Q-learning updates its value function based on an equation (the Bellman equation) rather than estimating the value function with a greedy policy.
• Off-policy: the function learns from its own actions and doesn't depend on the current policy.
Reinforcement Learning Algorithms
• Scenario – Robots in a Warehouse
• A growing e-commerce company is building a new warehouse, and the company would like all of the
picking operations in the new warehouse to be performed by warehouse robots.
• In the context of e-commerce warehousing, “picking” is the task of gathering individual items from
various locations in the warehouse in order to fulfill customer orders.
• After picking items from the shelves, the robots must bring the items to a specific location within the
warehouse where the items can be packaged for shipping.
• In order to ensure maximum efficiency and productivity, the robots will need to learn the shortest path
between the item packaging area and all other locations within the warehouse where the robots are
allowed to travel.
Q-Learning
• Q-learning is an off-policy learner.
• This means it learns the value of the optimal policy independently of the agent's actions.
• On the other hand, an on-policy learner learns the value of the policy being
carried out by the agent, including the exploration steps and it will find a policy
that is optimal, taking into account the exploration inherent in the policy.
• What’s this ‘Q’?
• The ‘Q’ in Q-learning stands for quality. Quality here represents how useful a
given action is in gaining some future reward.
Q-learning
• Q-learning Definition
• Q*(s,a) is the expected value (cumulative discounted reward) of doing a in state s
and then following the optimal policy.
• Q-learning uses Temporal Differences(TD) to estimate the value of Q*(s,a).
Temporal difference is an agent learning from an environment through episodes
with no prior knowledge of the environment.
• The agent maintains a table of Q[S, A], where S is the set of states and A is the set
of actions.
• Q[s, a] represents its current estimate of Q*(s,a).
Q-learning Simple Example
• Let's say an agent has to move from a starting point to an ending point along a path that has obstacles.
• The agent needs to reach the target by the shortest path possible without hitting the obstacles, and it needs to respect the boundary formed by the obstacles.
• For our convenience, this is introduced in a customized grid environment as follows.
• Introducing the Q-Table
• The Q-table is the data structure used to calculate the maximum expected future rewards for each action at each state. Basically, this table will guide us to the best action at each state. To learn each value of the Q-table, the Q-learning algorithm is used.
Q-learning Simple Example
• Q-function
• The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). In its Q-learning form, the update reads:
Q(S, A) ← Q(S, A) + α[R + γ max_a′ Q(S′, a′) − Q(S, A)]
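In code, that update is a one-liner over the Q-table; this sketch assumes a dictionary-of-dictionaries table mapping each state to its per-action values:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```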
Q-learning Algorithm Process
Step 2 & 3: Choose and Perform an Action
• The combination of steps 2 and 3 is performed for an undefined amount of time. These steps run until training is stopped, or until the training loop terminates as defined in the code.
• First, an action (a) in the state (s) is chosen based on the Q-table. Note that, as mentioned earlier, when the episode initially starts, every Q-value should be 0.
• Then, we update the Q-values for being at the start and moving right using the Bellman equation stated above.
Step 4: Measure Reward
• Now we have taken an action and observed an outcome and a reward.
Step 5: Evaluate
• We need to update the function Q(s, a).
• This process is repeated again and again until learning is stopped. In this way the Q-table is updated and the value function Q is maximized. Here, Q(state, action) returns the expected future reward of that action at that state.
• Reward when a step closer to the goal is reached = +1
• Reward when an obstacle is hit = -1
• Reward when idle = 0
• Initially, we explore the agent's environment and update the Q-table. When the Q-table is ready, the agent starts to exploit the environment and takes better actions. The final Q-table may look like the following (for example).
• The outcomes below show the agent's shortest path towards the goal after training.
• Q-Learning Algorithm: From Explanation to Implementation | by Amrani Amine | Towards Data Science
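Putting the pieces together, here is a sketch of the full tabular Q-learning loop with ε-greedy exploration, reusing the GridWorld class sketched earlier in this chapter. One deliberate assumption: the +1 reward for ordinary unshaded steps is zeroed out, so only reaching the goal pays +1; with the slides' +1-per-step scheme, a discounted greedy agent could otherwise prefer wandering between unshaded states to finishing.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def train(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table: every (state, action) value starts at 0.
    Q = {s: {a: 0.0 for a in ACTIONS} for s in env.STATES}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else max(Q[s], key=Q[s].get)
            s_next, r, done = env.step(a)
            if r > 0 and not done:
                r = 0  # assumption noted above: only the goal pays +1
            # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
            Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
            s = s_next
    return Q

Q = train(GridWorld())
print({s: max(Q[s], key=Q[s].get) for s in "ADEF"})
# expected to learn the safe path, e.g. A: down, D: right, E: right, F: down
```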
Monte Carlo Reinforcement Learning
• The Monte Carlo method for reinforcement learning learns directly from episodes of experience, without any prior knowledge of MDP transitions.
• Here, the random component is the return or reward.
• Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
• Learning from actual experience is striking because it requires no prior knowledge of the environment's dynamics, yet it can still attain optimal behavior.
• What is the Monte Carlo method used for in RL?
• It is a method for estimating the action-value function (Value | State, Action) or the value function (Value | State) using sample runs from the environment whose value function we are estimating.
Monte Carlo Reinforcement Learning
• Consider the situation above, where we have a system of 3 states: A, B, and terminate.
• We are given two example episodes (which we can generate using a random walk in any environment).
• "A+3 → A+2" means a transition from state A to state A with reward = 3 for this transition.
There are 2 types of Monte Carlo learning, differing in how they average future rewards:
• First-Visit Monte Carlo: estimates (Value | State: S1) as the average of the returns following the first visit to the state S1.
• Every-Visit Monte Carlo: estimates (Value | State: S1) as the average of the returns following every visit to the state S1.
Monte Carlo Reinforcement Learning
First-Visit Monte Carlo: Calculating V(B)
Drawing on the above example:
• 1st episode: -4 + 4 - 3 = -3
• 2nd episode: -2 + 3 - 3 = -2
Averaging, V(B) = (-3 + -2)/2 = -2.5
Monte Carlo Reinforcement Learning
Every-Visit Monte Carlo: estimates (Value | State: S1) as the average of the returns following every visit to the state S1.
Calculating V(A)
Here, we create a new summation term for every occurrence of 'A', adding all rewards coming after that occurrence (including the reward attached to that A itself):
• From the 1st episode: (3 + 2 - 4 + 4 - 3) + (2 - 4 + 4 - 3) + (4 - 3) = 2 + (-1) + 1
• From the 2nd episode: (3 - 3) = 0
As we got 4 summation terms, we average using N = 4, i.e., V(A) = (2 - 1 + 1 + 0)/4 = 0.5
Monte Carlo Reinforcement Learning
Every-Visit Monte Carlo: Calculating V(B)
• From the 1st episode: (-4 + 4 - 3) + (-3) = -3 + (-3)
• From the 2nd episode: (-2 + 3 - 3) + (-3) = -2 + (-3)
As we have 4 summation terms, averaging using N = 4: V(B) = (-3 - 3 - 2 - 3)/4 = -2.75
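Both estimators are a few lines of Python. The two episodes are reconstructed from the calculations above as (state, reward) pairs: episode 1 is A+3 → A+2 → B-4 → A+4 → B-3 → terminate, and episode 2 is B-2 → A+3 → B-3 → terminate.

```python
episodes = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],  # episode 1
    [("B", -2), ("A", 3), ("B", -3)],                      # episode 2
]

def mc_value(episodes, state, first_visit):
    """Average return following visits to `state` (no discounting)."""
    returns = []
    for ep in episodes:
        rewards = [r for _, r in ep]
        for t, (s, _) in enumerate(ep):
            if s == state:
                returns.append(sum(rewards[t:]))  # return from time t onward
                if first_visit:
                    break  # only the first occurrence per episode counts
    return sum(returns) / len(returns)

print(mc_value(episodes, "B", first_visit=True))   # -2.5
print(mc_value(episodes, "A", first_visit=False))  # 0.5
print(mc_value(episodes, "B", first_visit=False))  # -2.75
```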
Dynamic Programming
• Planning by Dynamic Programming: Reinforcement Learning | by
Ryan Wong | Towards Data Science
Example Data
• Now let’s look at an example using random walk (Figure 1) as
our environment.

• The basic idea is that you always start in state ‘D’ and you move
randomly, with a 50% probability, to either the left or right until
you reach the terminal or ending states ‘A’ or ‘G’.
• If you end in state ‘A’ you get a reward of 0, but if you end in
state ‘G’ the reward is 1.
• There are no rewards for states ‘B’ through ‘F’.
Example Data
• The purpose of using reinforcement learning on this example is to see if we can accurately predict the values of each of these states through a model-free approach.
• The ground-truth values are simply the probabilities, for each state, of gaining the reward, i.e., of ending in state 'G'. For this symmetric walk they are B = 1/6, C = 2/6, D = 3/6, E = 4/6, and F = 5/6; these are the labels we'll measure our estimates against.
Temporal Difference Learning (TD Learning)
• One of the problems with the environment is that rewards usually are not
immediately observable.
• For example, in tic-tac-toe or others, we only know the reward(s) on the final
move (terminal state).
• All other moves will have 0 immediate rewards.
• TD learning is an unsupervised technique to predict a variable's expected value in
a sequence of states.
• TD uses a mathematical trick to replace complex reasoning about the future with a
simple learning procedure that can produce the same results.
• Instead of calculating the total future reward, TD tries to predict the combination
of immediate reward and its own reward prediction at the next moment in time.
Temporal-Difference Learning
• Temporal difference is an agent learning from an environment through episodes
with no prior knowledge of the environment.
• This means temporal difference takes a model-free or unsupervised learning
approach.
• You can consider it learning from trial and error.
• 3 algorithms: TD(0), TD(1) and TD(λ).
Temporal-Difference Learning-Basic Notations
• Gamma (γ): the discount rate. A value between 0 and 1. The higher the value the
less you are discounting.
• Lambda (λ): the credit assignment variable. A value between 0 and 1. The higher
the value the more credit you can assign to further back states and actions.
• Alpha (α): the learning rate. How much of the error should we accept and
therefore adjust our estimates towards. A value between 0 and 1. A higher value
adjusts aggressively, accepting more of the error while a smaller one adjusts
conservatively but may make more conservative moves towards the actual
values.
• Delta (δ): a change or difference in value.
TD(1) Algorithm
• TD(1) makes an update to our values in the same manner as Monte Carlo: at the end of an episode.
• So, back to our random walk: we go left or right randomly until landing in 'A' or 'G'.
• Once the episode ends, the update is made to the prior states.
• As mentioned above, the higher the lambda value, the further back the credit can be assigned, and this case is the extreme, with lambda equaling 1.
• This is an important distinction, because TD(1) and MC only work in episodic environments, meaning they need a 'finish line' to make an update.
TD(1) Algorithm
• Gt (Figure 2) is the discounted sum of all the rewards seen in our episode:
Gt = R(t+1) + γR(t+2) + γ²R(t+3) + … (up to the terminal time step T)
• So, as we travel through our environment, we keep track of all the rewards and sum them together with a discount (γ): the immediate reward (R) at a given point (time t+1), plus the discount (γ) of a future reward (R(t+2)), and so on.
• You can see that we discount (γ) more heavily further into the future, with γ^(T-1).
• So if γ = 0.2 and you're discounting the reward at time step 6, your discount value becomes γ^(6-1), which equals 0.00032, significantly smaller after just 6 time steps.
TD(1) Algorithm
• Now we'll make an update to our value estimates V(S).
• It's important to know that when starting out you really don't have a good initial estimate.
• You initialize using either random values or all zeros and then make updates to that estimate. For our random walk sequence we initialize the values of all states between 'B' and 'F' to zero: [0, 0, 0, 0, 0, 0, 1].
• We leave the terminal states alone, since those are known.
• Remember, we're only trying to predict the values of the non-terminal states, since we already know the values of the terminal states.
TD(1) Algorithm
Figure 3: TD(1) update of the value toward the actual return:
V(St) ← V(St) + α[Gt − V(St)]
• We take the sum of discounted rewards from above, Gt, that we saw in our episode, and subtract our prior estimate from it.
• This is called the TD error: our updated estimate minus the previous estimate.
• Then we multiply by an alpha (α) term to adjust how much of that error we want to update by.
• Lastly, we make the update by simply adding our previous estimate V(St) to the adjusted TD error (Figure 3).
TD(1) Algorithm
• So that's 1 episode: we did 1 random walk and accumulated rewards.
• We then took those rewards at each time step and compared them to our original estimate of the values (all zeros).
• We weighted the difference, adjusted our prior estimate, and then started over again.
TD(0) Algorithm
Figure 4: TD(0) update of the value toward the estimated return:
V(St) ← V(St) + α[R(t+1) + γV(S(t+1)) − V(St)]
• Instead of using the accumulated sum of discounted rewards (Gt), we only look at the immediate reward (R(t+1)) plus the discount of the estimated value of just 1 step ahead (V(S(t+1))) (Figure 4).
• This is the only difference between the TD(0) and TD(1) updates: we just swap out Gt from Figure 3 for the one-step-ahead estimate.
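A sketch of TD(0) on the random walk described earlier (states A..G, start at D, 50/50 left or right moves). Following the slides, the terminal values are held fixed at A = 0 and G = 1 and all per-step rewards are 0, so the reward is folded into the terminal values; with a small constant α the estimates settle near the true values 1/6 .. 5/6.

```python
import random

def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0):
    V = [0.0] * 7          # indices 0..6 are states A..G
    V[6] = 1.0             # terminals are known and never updated: A=0, G=1
    for _ in range(episodes):
        s = 3              # every episode starts in state D
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))   # 50/50 left or right
            r = 0.0                               # reward lives in the terminal values
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

V = td0_random_walk()
for name, v in zip("BCDEF", V[1:6]):
    print(f"V({name}) = {v:.2f}")   # roughly 0.17, 0.33, 0.50, 0.67, 0.83
```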
Temporal-Difference Learning
• TD learning is a combination of Monte Carlo and Dynamic Programming.
• One issue we can easily observe is that Monte Carlo always needs a termination state.
• If that's required, what happens with continuous RL problems that don't have a termination state?
• Also, why should we wait until the episode ends to update the Value-Action-Function for all states? Can it be done before the episode ends?
• Waiting can be painful when we have thousands of states.
Temporal-Difference Learning
• Here comes Temporal Difference Learning, which:
• Doesn't require any information about the environment (like Monte Carlo)
• Updates estimates based in part on other learned estimates, without waiting for a final outcome (it bootstraps, like DP)
• Hence: Temporal Difference = Monte Carlo + Dynamic Programming.
• In Temporal Difference, we also decide how many future rewards we need to reference when updating the current Value-Action-Function.
Temporal-Difference Learning
• This means we can update our present Value-Action-Function using as many future rewards as we want.
• It can be just one future reward, TD(0), from the immediate next state, or it can be 5 future rewards from the next 5 future states, i.e., TD(5). The choice is entirely ours.
• We will use TD(0) in the examples below.
Temporal-Difference TD(0) Learning
Going step by step:
• Input π, i.e., the policy (which can be ε-greedy, greedy, etc.)
• Initialize the Value-Action-Function for every state (s belonging to S) in the environment
• For each episode e in E (the episodes/epochs we want to train):
  1. Take the initial state of the system.
  2. For each step in the episode:
     A. Choose an action according to the policy π.
     B. Update the Value-Action-Function for the current step using the equation below, and move to the next state S′.
• It's time to demystify the update equation:
V(S) ← V(S) + α[R + γV(S′) − V(S)]
Here,
• V(S)/V(S, A) = Value-Action-Function for the current state
• α = a constant (the learning rate)
• R = reward for the present action
• γ = discount factor
• V(S′)/V(S′, A) = Value-Action-Function for the next state S′ when action A is taken in state S
Reinforcement Learning in Business, Marketing, and Advertising
• In money-oriented fields, technology can play a crucial role. Here, a company's RL models can analyze customer preferences and help advertise its products better.
• We know that business requires proper strategizing. The steps need careful planning for a product or a company to gain profit.
• RL helps to devise proper strategies by analyzing various possibilities, and thereby it tries to improve the profit margin in each result. Various multinational companies use these models, although the cost of these models is high.
Reinforcement Learning in Gaming
• Gaming is a booming industry and is gradually advancing with technology. Games are now becoming more realistic and much more detailed.
• We have environments like PSXLE, the PlayStation Reinforcement Learning Environment, which focus on providing better gaming environments by modifying the emulators.
• We have deep reinforcement learning systems like AlphaGo and AlphaZero, which play games like chess, shogi, and Go.
• With these platforms and algorithms, gaming is now more advanced, and they help in creating games with countless possibilities.
• They can also be helpful in making story-mode games for PlayStation.
Reinforcement Learning in Recommendation systems
• RL is now a big help in recommendation systems such as news and music apps, and web-series apps like Netflix. These apps work according to customer preferences.
• In the case of web-series apps like Netflix, the variety of shows that we watch becomes a list of preferences for the algorithm.
• Companies like these have sophisticated recommendation systems.
• They consider many things, like user preferences, trending shows, related genres, etc. According to these preferences, the model will then show you the latest trending shows.
• These models are largely cloud-based, so as users we encounter them in our daily lives through information and entertainment platforms.
Reinforcement Learning in Science
• AI and ML technologies have nowadays become an important part of research. There are various fields in science where reinforcement learning can come in handy.
• The most talked-about is atomic science, where both the physics of atoms and their chemical properties are researched.
• Reinforcement learning helps in understanding chemical reactions. We can try to obtain cleaner reactions that yield better products. There can be various combinations of reactions for any molecule or atom, and we can understand their bonding patterns with machine learning.
• In most of these cases, for better-quality results, we would require deep reinforcement learning. For that, we can use deep learning architectures such as LSTMs.