Machine Learning BE Merged Modules
Classification
Types of Classification
1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification
Binary Classification
Popular algorithms:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
Multi-Class Classification
Popular algorithms:
• k-Nearest Neighbors
• Decision Trees
• Naive Bayes
• Random Forest
• Gradient Boosting
Multi-Label Classification
Imbalanced Classification
Examples include:
• Fraud detection
• Outlier detection
• Medical diagnostic tests
2) Unsupervised Learning
It is a learning method in which a machine learns without any supervision
The training is provided to the machine with the set of data that has not been
labelled, classified, or categorized
The algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
It can be further classified into two categories of algorithms:
1.Clustering 2. Association
Labelled Training Data
Clustering Techniques
Applications of Machine Learning
1. Image Recognition
It is used to identify objects, persons, places, digital images, etc.
Use case: Automatic friend tagging suggestion
2. Speech Recognition
Speech recognition is a process of converting voice instructions
into text, and it is also known as "Speech to text" or
"Computer speech recognition."
Use Case: Google assistant, Siri, Cortana, and Alexa
3. Traffic prediction
Google Maps, which shows us the correct path with the
shortest route and predicts the traffic conditions such as
whether traffic is cleared, slow-moving, or heavily congested
4. Product recommendations
Google understands the user interest using various ML algorithms and suggests the product as per
customer interest.
Use Case: when we use Netflix, we find some recommendations for entertainment series, movies
5.Email Spam and Malware Filtering
We always receive an important mail in our inbox with the important symbol and spam emails in our spam
box.
Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
6. Virtual Personal Assistant
Virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri.
These assistants record our voice instructions, send them to a server on the cloud, decode them using ML
algorithms, and act accordingly.
Overfitting: the model fits the training set very well but may perform poorly on the testing set.
Underfitting: the model performs poorly on the training set as well as on the testing set.
Real-life Example of overfitting and underfitting
Task: To identify whether the object is ball or not
Parameters:
Sphere - This feature checks whether the object has a spherical shape.
Play - This feature checks whether one can play with it.
Eat - This feature checks whether one can eat it.
Radius = 5 cm - This feature checks whether the object's radius is 5 cm or less.
Overfitting case:
If an object (ball) with a 10 cm radius is passed to the classifier, it will classify it as not a ball, because the
classifier is very specific about the feature values.
Underfitting case:
If an object (orange) is passed to the classifier, it will classify it as a ball, because the model is very
generalized with a small number of parameters, i.e. if the object is spherical in shape it is a ball.
Example:
Sphere   Play   Eat   Radius   Class
yes      yes    no    5        Ball
yes      yes    no    3        Ball
yes      no     yes   5        Fruit
yes      no     yes   10       Fruit
yes      yes    no    10       Ball
Underfitting: the model is built on two parameters only (Sphere & Radius), so when a fruit is passed to the
model it may classify it as a ball.
Overfitting: the model is specific about all parameter values (yes, yes, no, 5), so when a ball with radius 10 is
passed it may classify it as a fruit.
Bias/ Variance
Bias: the difference between the predicted value and the actual value.
High bias: the difference is large.
Variance: how the predicted values are scattered with
respect to each other.
Low variance: the values are not much scattered; they lie
in groups.
Issues in Machine Learning contd..
4. Lack of Training Data
we need to ensure that Machine learning algorithms are trained with sufficient amounts of data.
5. Imperfections in the Algorithm When Data Grows
So you need regular monitoring and maintenance to keep the algorithm working. This is one of
the most exhausting issues faced by machine learning professionals.
How good is a model?
• Linearity
• Number of features
The main points to consider when trying to solve a
new problem are:
• Define the problem. What is the objective of the
problem?
• Missing data, e.g., occupation = " "
• Noisy data, e.g., Salary = "-10"
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
• Importance
• “Data cleaning is one of the three biggest problems in data
warehousing”
• Data Cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
Sorted data: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition into equal-frequency bins:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215
An alternative partition of the same data:
Bin 1: 5, 10, 11, 13, 15
Bin 2: 35, 50, 55, 72, 92
Bin 3: 204, 215
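Below is a minimal Python sketch (not from the slides) of equal-frequency binning and smoothing by bin means on the data above; the bin size of 4 matches the first partition, and all variable names are illustrative.
# Equal-frequency binning and smoothing by bin means (illustrative sketch)
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]  # already sorted

bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smooth by bin means: every value in a bin is replaced by the bin's mean
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]

for original, smooth in zip(bins, smoothed):
    print(original, "->", smooth)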
• Such analysis can measure how strongly one attribute implies the
other, based on the available data
χ² = Σ (Observed − Expected)² / Expected
Expected = (count(A) × count(B)) / n
• The larger the Χ2 value, the more likely the variables are
related
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
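As a hedged illustration, the same 2×2 example can be checked with SciPy; only the counts come from the slide, the row/column labels in the comments are assumptions.
from scipy.stats import chi2_contingency

observed = [[250, 200],   # e.g. group 1: category A / category B counts
            [50, 1000]]   # e.g. group 2: category A / category B counts

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)           # [[ 90. 360.] [210. 840.]]
print(round(chi2, 2))     # ~507.93, so the two attributes are strongly related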
• If r(A,B) > 0, A and B are positively correlated (A's values
increase as B's do). The higher the value, the stronger the
correlation.
• r(A,B) = 0: independent;
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values
Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Example:
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
Day         Stock A   Stock B
Monday      2         5
Tuesday     3         8
Wednesday   5         10
Thursday    4         11
Friday      6         14
• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
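A quick NumPy check of the stock example above (a sketch; the variable names are illustrative):
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Population covariance, matching the slide's formula E(A*B) - E(A)*E(B)
cov = np.mean(A * B) - np.mean(A) * np.mean(B)
print(cov)  # 4.0 -> positive, so A and B tend to rise together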
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set
of replacement values so that each old value can be identified with one of
the new values
• Methods
• Smoothing: Remove noise from data
• Techniques include binning, regression, clustering
• Attribute/feature construction
• New attributes constructed from the given ones
Data Transformation
Normalization
Min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• Ex. Let the min and max values for the attribute income be
$12,000 and $98,000 respectively. Now map income to the range
[0.0, 1.0]. Then the value $73,600 for income is transformed as:
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Normalization
Example
Use the two methods below to normalize the following
group of data:
(a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the
standard deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
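A hedged sketch of the three normalizations applied to a single value; since the age data group is not reproduced above, the min, max, and mean used here are placeholder assumptions (only the standard deviation 12.94 comes from the exercise).
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # divide by 10^j, where j is the number of digits of the largest absolute value
    j = len(str(int(abs(max_abs))))
    return v / (10 ** j)

age = 35
print(min_max(age, vmin=13, vmax=70))       # assumed min/max of the age data
print(z_score(age, mean=30.0, std=12.94))   # assumed mean; std taken from the exercise
print(decimal_scaling(age, max_abs=70))     # 35 / 100 = 0.35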
Automatic Concept Hierarchy Generation
• Data reduction
• Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same)
analytical results
Data reduction strategies
• Regression (parametric): fit the data to a model such as y = wx + b and store the
model parameters instead of the raw data
• Clustering: partition the data set into clusters based on similarity, and store the cluster
representation (e.g., centroid and diameter) only
Sampling: Cluster or Stratified Sampling
• Outliers are interesting: they violate the mechanism that generates the normal data
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Medical analysis
Types of Outliers
• Collective Outliers
• A subset of data objects collectively deviate significantly from the
whole data set, even if the individual data objects may not be outliers
• Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service
packages to each other
Outlier Detection Methods
Unsupervised Methods
• Objects labeled as normal or outlier are not available
• Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
• An outlier is expected to be far away from any groups of normal
objects
Outlier Detection Methods
Semi-Supervised Methods
• A small set of data (normal objects/outliers) is labeled, but most of
the data are unlabeled
• Labeled normal data are used with unlabeled objects close to them to build
a model for normal objects
• The model can then be used to detect outliers (objects not fitting
the model of normal objects are classified as outliers)
Outlier Detection Methods
• Statistical Methods
• Proximity-Based Methods
• Clustering-Based Methods
Outlier Detection : Statistical Methods
• Statistical methods (also known as model-based methods) assume that
the normal data follow some statistical model (a stochastic model)
• The data not following the model are outliers.
Outlier Detection : Proximity-Based Methods
• An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object significantly deviates from
the proximity of most of the other objects in the same data set
Proximity-Based Approaches: Distance-Based vs. Density-
Based Outlier Detection
Outlier Detection : Clustering-Based Methods
Outlier detection and visualization: Scatter plot
Outlier detection and visualization: Box Plot
• Box Plots: the box plot is another very simple visualization tool to detect
outliers; it uses the concept of the Interquartile Range (IQR).
• Example 2: Find the first and third quartiles of the set {3, 7, 8, 5, 12,
14, 21, 15, 18, 14}.
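A minimal sketch of IQR-based outlier detection on the example set above. Quartiles are computed here as the medians of the lower and upper halves, which is one common textbook convention (other quartile definitions give slightly different values).
import statistics

data = sorted([3, 7, 8, 5, 12, 14, 21, 15, 18, 14])
half = len(data) // 2
q1 = statistics.median(data[:half])    # median of lower half -> 7
q3 = statistics.median(data[-half:])   # median of upper half -> 15
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(q1, q3, iqr, outliers)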
• Correlation:
Pearson’s Correlation Coefficient is a measure of quantifying the
association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Chi-Square Test:
Chi-square method (X2) is generally used to test the relationship
between categorical variables. It compares the observed values from
different attributes of the dataset to its expected value.
• When using PCA, we take our original data as input and try to find a
combination of the input features which can best summarize the
original data distribution, so as to reduce its original dimensions.
• LDA aims to maximize the distance between the mean of each class
and minimize the spreading within the class itself.
• LDA therefore uses within-class and between-class scatter as measures.
This is a good choice because maximizing the distance between the
means of each class when projecting the data to a lower-dimensional
space can lead to better classification results.
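A hedged scikit-learn sketch contrasting PCA (unsupervised, ignores labels) with LDA (supervised, uses labels); the Iris dataset and the choice of 2 components are purely illustrative.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                              # no labels used
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # labels used

print(X_pca.shape, X_lda.shape)  # both reduced to (150, 2)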
• Is Supervised or Unsupervised?
• What is Regression?
1. Predicting the age of a person
2. Predicting the nationality of a person
3. Predicting whether the stock price of a company will increase tomorrow
4. Predicting whether a document is related to sighting of UFOs
Linear Regression
Subject   Age (X)   Glucose Level (Y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
7         55        ?
Linear Regression - Step I
For each subject, compute X, Y, XY, X² and Y², and use these sums to fit the least-squares line.
The fitted line predicts y' = 86.327 for subject 7 (age 55).
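A small NumPy check of the age/glucose example above (a sketch; np.polyfit is just one of several ways to obtain the least-squares line):
import numpy as np

age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

slope, intercept = np.polyfit(age, glucose, 1)  # least-squares fit of y = slope*x + intercept
print(intercept + slope * 55)                   # ~86.3 for the subject aged 55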
Important points about LR
1. Linear regression is more susceptible to outliers; hence it should not be used for big-size data.
2. There should be a linear relationship between the independent and dependent variables.
3. There is only one independent and one dependent variable.
4. The type of regression line: a best-fit straight line.
Advantages And Disadvantages of LR
Advantages: It handles overfitting pretty well using dimensionality reduction techniques,
regularization, and cross-validation.
Disadvantages: Linear regression is quite sensitive to outliers; hence, it should not be used for
big-size data.
Use Case – Implementing Linear Regression
1.Loading the Data
2.Exploring the Data
3.Slicing The Data
4.Train and Split Data
5.Generate The Model
6.Evaluate The accuracy
Multiple linear regression
• is used to estimate the relationship between two or more
independent variables and one dependent variable
• Example:
• 1 The selling price of a house can depend on the desirability of the
location, the number of bedrooms, the number of bathrooms, the
year the house was built, the square footage of the lot and a number
of other factors
• 2 The height of a child can depend on the height of the mother, the
height of the father, nutrition, and environmental factors.
The simplest multiple regression model for two predictor variables is
y = β0 + β1x1 + β2x2 + ε
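A hedged scikit-learn sketch of multiple linear regression with two predictors; the tiny house-price style data below is invented purely for illustration.
from sklearn.linear_model import LinearRegression

X = [[3, 1500], [2, 900], [4, 2100], [3, 1200], [5, 2600]]  # [bedrooms, square feet]
y = [300, 180, 420, 250, 520]                               # price in $1000s

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimates of beta0 and [beta1, beta2]
print(model.predict([[4, 1800]]))      # predicted price for a new house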
Linear Regression
A) TRUE
B) FALSE
Linear Regression
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
plt.scatter(x, y)
plt.show()
Example
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
# fit a degree-3 polynomial model and a smooth range of x-values to draw it
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
R-Squared
• It is important to know how well the relationship between the
values of the x- and y-axis is,
• if there are no relationship the polynomial regression can not be
used to predict anything.
• The relationship is measured with a value called the r-squared.
• The r-squared value ranges from 0 to 1, where 0 means no
relationship, and 1 means 100% related.
• Python and the Sklearn module will compute this value for you,
all you have to do is feed it with the x and y arrays:
R-Squared - Example
• How well does my data fit in a polynomial regression?
• import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))
• The result 0.94 shows that there is a very good relationship, and we can use polynomial
regression in future predictions.
Predict Future Values
• Now we can use the information we have gathered to predict future values.
• Example: Let us try to predict the speed of a car that passes the tollbooth at
around 17:00:
• import numpy
from sklearn.metrics import r2_score
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
speed = mymodel(17)
print(speed)
Bad Fit? Example
• These values for the x- and y-axis should result in a very bad fit for polynomial
regression:
• import numpy
import matplotlib.pyplot as plt
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(2, 95, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Bad Fit? Example: what is the r-squared value?
• These values for the x- and y-axis should result in a very bad fit for polynomial
regression:
• import numpy
from sklearn.metrics import r2_score
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))
• The result 0.00995 indicates a very bad relationship, and tells us that this data set is not suitable
for polynomial regression.
Problem Description
• There is a Human Resource company, which is going to
hire a new candidate. The candidate says his
previous salary was 160K per annum, and HR has to
check whether he is telling the truth or bluffing.
• So to identify this, they only have a dataset of his
previous company in which the salaries of the top 10
positions are mentioned with their levels.
• By checking the dataset available, we have found that
there is a non-linear relationship between the
Position levels and the salaries.
• Our goal is to build a Bluffing detector
regression model, so HR can hire an honest candidate.
Below are the steps to build such a model.
• Problem
Multiple Regression
• Python Machine Learning Multiple Regression (w3schools.com)
Linear Regression Use Cases
• Sales of a product; pricing, performance, and risk parameters
• Generating insights on consumer behavior, profitability, and other business
factors
• Evaluation of trends; making estimates, and forecasts
• Determining marketing effectiveness, pricing, and promotions on sales of a
product
• Assessment of risk in financial services and insurance domain
• Studying engine performance from test data in automobiles
• Calculating causal relationships between parameters in biological systems
• Conducting market research studies and customer survey results analysis
• Astronomical data analysis
• Predicting house prices with the increase in sizes of houses
Regularization in Machine Learning
Regularization in Machine Learning
• Regularization is a technique used to reduce the errors by fitting
the function appropriately on the given training set and avoid
overfitting.
• It mainly regularizes or reduces the coefficient of features
toward zero.
• In simple words, "In the regularization technique, we reduce the
magnitude of the features while keeping the same number of
features."
• Hence, it maintains accuracy as well as a generalization of the
model.
How does Regularization Work?
• Regularization works by adding a penalty or complexity term or shrinkage term with
Residual Sum of Squares (RSS) to the complex model.
• Let’s consider the Simple linear regression equation:
• Here Y represents the dependent feature or response which is the learned relation.
Then,
• Y is approximated to β0 + β1X1 + β2X2 + …+ βpXp
• Here, X1, X2, …Xp are the independent features or predictors for Y, and
• β0, β1,…..βn represents the coefficients estimates for different variables or predictors(X),
which describes the weights or magnitude attached to the features, respectively.
• In simple linear regression, our optimization function or loss function is known as
the residual sum of squares (RSS).
Techniques of Regularization
• Mainly, there are three types of regularization techniques, which are given below:
1. Ridge Regression
2. Lasso Regression
3. Dropout
• Ridge Regression
• Ridge regression is one of the types of linear regression in which we introduce a small amount of bias,
known as Ridge regression penalty so that we can get better long-term predictions.
• In Statistics, it is known as the L-2 norm.
• In this technique, the cost function is altered by adding a penalty term (shrinkage term) which multiplies
lambda with the squared weight of each individual feature. Therefore, the optimization function (cost
function) becomes: Cost = RSS + λ × Σ βj²
• In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge
regression reduces the magnitudes of the coefficients that help to decrease the complexity of the
model.
Techniques of Regularization
• Lasso Regression
• Lasso regression is another variant of the regularization technique used to reduce the complexity
of the model. It stands for Least Absolute and Selection Operator.
• It is similar to Ridge Regression except that the penalty term includes the absolute weights
instead of the square of weights. Therefore, the optimization function becomes:
• Cost Function for Lasso Regression: Cost = RSS + λ × Σ |βj|
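A hedged scikit-learn sketch of Ridge (L2) and Lasso (L1) regression; alpha plays the role of lambda in the cost functions above, and the synthetic dataset is purely illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficients towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # can drive some coefficients exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))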
Evaluation Metrics for Regression Problems
M.A.E (Mean Absolute Error)
It takes the average of the absolute difference between the actual and predicted values.
Scikit-Learn is a great library, as it has almost all the inbuilt functions.
Below is the code to implement Mean Absolute Error:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)
Here, ‘y_true’ is the true target values & ‘y_pred’ is the predicted target values.
Evaluation Metrics for Regression Problems
M.S.E (Mean Squared Error)
It takes the average of the square of the error. Here, the error is the difference
b/w actual & predicted values.
Below is the mathematical formula of the Mean Squared Error.
Scikit-Learn is a great library, as it has almost all the inbuilt functions.
Below is the code to implement Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)
Here, ‘y_true’ is the true target values & ‘y_pred’ is the predicted target values.
Root Mean Squared Error (RMSE) and Root Mean Squared Log Error (RMSLE)
RMSE is the square root of the Mean Squared Error. Taking the log scales down
the magnitude of the errors: if we take the log of the values before computing the
RMSE, the resulting metric is the RMSLE.
R Squared (R2)
R squared (R2), also known as the Coefficient of Determination or sometimes the Goodness of Fit,
compares the regression line with a simple mean line.
How do you interpret the R2 score?
• If the R2 score is zero, the regression line is no better than the mean line: the ratio of their errors is 1,
and 1 − 1 = 0. In this case both lines overlap, the model performance is worst, and the model is not
capable of taking advantage of the output column.
• If the R2 score is 1, the error ratio is zero, which happens when the regression line makes no mistakes;
it is perfect. In the real world this is not possible.
• So as the regression line moves towards perfection, the R2 score moves towards one and the model
performance improves.
• The normal case is an R2 score between zero and one, such as 0.8, which means the model is able to
explain 80 per cent of the variance of the data.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(r2)
Adjusted R Squared
4. We will repeat this process until our Cost function is very small
(ideally 0)
Gradient Descent Algorithm gives optimum values of m and c of the
linear regression equation. With these values of m and c, we will get
the equation of the best-fit line and ready to make predictions.
Supervised Learning with
Classification
Decision Tree - Classification
• Decision tree builds classification models in the form of a tree
structure.
• It breaks down a dataset into smaller and smaller subsets
while at the same time an associated decision tree is
incrementally developed.
• The final result is a tree with decision nodes and leaf nodes.
• A decision node has two or more branches
• Leaf node represents a classification or decision.
• The topmost decision node in a tree, which corresponds to the
best predictor, is called the root node.
• Decision trees can handle both categorical and numerical
data.
Classification Model
What is node impurity/purity in decision trees?
• The decision tree is a greedy algorithm that performs a recursive binary partitioning of the
feature space.
• The tree predicts the same label for each bottommost (leaf) partition.
• Each partition is chosen greedily by selecting the best split from a set of possible splits.
Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional
training dataset.
• Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object.
• Some popular examples of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two words Naïve and
Bayes, Which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features.
For example, if a fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple.
Hence each feature individually contributes to identifying it as an
apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for Bayes' theorem is: P(A|B) = P(B|A) × P(A) / P(B)
Where,
• P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
• P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
• P(B) is Marginal Probability: Probability of Evidence.
Advantages/ Disadvantages of Naïve Bayes Classifier
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other
Algorithms.
• It is the most popular choice for text classification problems.
Learning phase:
• Test Phase
– Given a new instance,
x’ = (Outlook=Sunny, Temperature=Cool,
Humidity=High, Wind=Strong)
• MAP rule
P(x'|Yes) P(Yes) = P(Outlook=Sunny | Yes) *
P(Temperature=Cool | Yes) *
P(Humidity=High | Yes) *
P(Wind=Strong | Yes) *
P(Yes)
= 2/9 * 3/9 * 3/9 * 3/9 * 9/14
= 0.0053
MAP rule:
P(x'|No) P(No) = P(Outlook=Sunny | No) *
P(Temperature=Cool | No) *
P(Humidity=High | No) *
P(Wind=Strong | No) *
P(No)
= 3/5 * 1/5 * 4/5 * 3/5 * 5/14
= 0.0206
• P(x’|No):
[P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No)
= 0.0206
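A hedged scikit-learn sketch of the same idea; the six encoded weather rows below are only a stand-in for the full 14-row play-tennis table used in the slides.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# columns: Outlook, Temperature, Humidity, Wind
X_raw = [["Sunny", "Hot", "High", "Weak"],
         ["Sunny", "Hot", "High", "Strong"],
         ["Overcast", "Hot", "High", "Weak"],
         ["Rain", "Mild", "High", "Weak"],
         ["Rain", "Cool", "Normal", "Weak"],
         ["Rain", "Cool", "Normal", "Strong"]]
y = ["No", "No", "Yes", "Yes", "Yes", "No"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

clf = CategoricalNB().fit(X, y)
x_new = enc.transform([["Sunny", "Cool", "High", "Strong"]])
print(clf.predict(x_new), clf.predict_proba(x_new))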
• Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1. In which of these categories will this data point lie?
• To solve this type of problem, we need a K-NN algorithm.
• With the help of K-NN, we can easily identify the category or class of a particular
dataset.
How does K-NN work?
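A minimal scikit-learn sketch of K-NN answering exactly this kind of question; the two small point clouds standing in for Category A and Category B are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 nearest neighbours
knn.fit(X, y)
print(knn.predict([[5, 5]]))                # category of the new data point x1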
• Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary or region is called as a hyperplane.
• SVM algorithm finds the closest point of the lines from both the
classes. These points are called support vectors.
• The distance between the vectors and the hyperplane is called
as margin. And the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal
hyperplane.
How does SVM works?
It is unable to segregate the two classes using a straight line, as one of the stars
lies in the territory of the other (circle) class as an outlier. The SVM algorithm has a
feature to ignore outliers and find the hyper-plane that has the maximum
margin. Hence, we can say SVM classification is robust to outliers.
How does SVM works?
• In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM
classify these two classes? SVM can solve this problem by introducing an additional feature. Here, we will
add a new feature z = x² + y². Now, let's plot the data points on the x and z axes:
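A hedged scikit-learn sketch: instead of adding z = x² + y² by hand, the RBF kernel lets the SVM separate data that no straight line can; the circles dataset is purely illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # struggles: classes are not linearly separable
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick adds non-linear features implicitly

print(linear_svm.score(X, y))   # low accuracy
print(rbf_svm.score(X, y))      # close to 1.0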
• This is where model selection and model evaluation come into play.
What Is Model Selection
• Model selection is the process of selecting one final machine learning model from
among a collection of candidate machine learning models for a training dataset.
• Model selection is a process that can be applied both across different types of
models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type
configured with different model hyperparameters (e.g. different kernels in an SVM).
• For example, we may have a dataset for which we are interested in developing a
classification or regression predictive model. We do not know beforehand as to
which model will perform best on this problem, as it is unknowable. Therefore, we fit
and evaluate a suite of different models on the problem.
• Model selection is the process of choosing one of the models as the final model that
addresses the problem.
Model assessment: the process of evaluating a model's performance.
• Training Dataset:
• The sample of actual data used to fit the model.
• Validation Dataset:
• The sample of data used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyperparameters.
• The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration.
• This dataset helps during the “development” stage of the model.
• Test Dataset:
• The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset. It is only used once a model is completely trained(using the
train and validation sets).
A visualization of the splits
Time-Based Split
• There are some types of data where random splits are not possible. For
example, if we have to train a model for weather forecasting, we cannot
randomly divide the data into training and testing sets. This will jumble up the
seasonal pattern! Such data is often referred to by the term – Time Series.
• In such cases, a time-wise split is used. The training set can have data for the last
three years and 10 months of the present year. The last two months can be
reserved for the testing or validation set.
• There is also a concept of window sets, where the model is trained up to a
particular date and tested on the future dates iteratively, such that the training
window keeps increasing by one day (and, consequently, the test set reduces by
a day). The advantage of this method is that it stabilizes the model
and prevents overfitting when the test set is very small (say, 3 to 7 days).
Time-Based Split
• However, the drawback of time-series data is that the events or data
points are mutually dependent. One event might affect every data
input that follows after.
• No machine learning model can learn from past data in such a case
because the data points before and after the event have major
differences.
Bootstrap
• The first step is to select a sample size (which is usually equal to the size of the
original dataset). Thereafter, a sample data point must be randomly selected
from the original dataset and added to the bootstrap sample. After the addition,
the sample needs to be put back into the original sample. This process needs to
be repeated for N times, where N is the sample size.
• The model is trained on the bootstrap sample and then evaluated on all those
data points that did not make it to the bootstrapped sample. These are called
the out-of-bag samples.
Bootstrapping
The bootstrap method involves iteratively resampling a dataset with
replacement.
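A minimal sketch of one bootstrap round (sample with replacement, train on the bootstrap sample, evaluate on the out-of-bag rows); the dataset and model are placeholder choices.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

idx = rng.integers(0, len(X), size=len(X))   # draw n indices with replacement
oob = np.setdiff1d(np.arange(len(X)), idx)   # rows never drawn = out-of-bag samples

model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
print(model.score(X[oob], y[oob]))           # out-of-bag accuracy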
Holdout Method
• Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set.
The training set is what the model is trained on, and the test set is
used to see how well that model performs on unseen data.
• A common split when using the hold-out method is using 80% of data
for training and the remaining 20% of the data for testing.
• Hold-out is dependent on just one train-test split, which makes the
hold-out score dependent on how the data is split into train
and test sets.
• Although this approach is simple to perform, it still faces the issue of
high variance, and it also produces misleading results sometimes.
Cross-Validation
• Cross-validation is a technique for validating the model efficiency by
training it on the subset of input data and testing on previously unseen
subset of the input data.
• We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.
• In machine learning, there is always the need to test the stability of the
model. This means we can't rely only on the training dataset: fitting and
evaluating the model on the training dataset alone tells us nothing about how it generalizes.
• For this purpose, we reserve a particular sample of the dataset, which
was not part of the training dataset.
• After that, we test our model on that sample before deployment, and this
complete process comes under cross-validation.
Cross-Validation
Hence the basic steps of cross-validations are:
1. Reserve a subset of the dataset as a validation set.
2. Provide the training to the model using the training dataset.
3. Now, evaluate model performance using the validation set. If the
model performs well with the validation set, perform the further
step, else check for the issues.
Methods used for Cross-Validation
• Train/test split: The input data is divided into two parts, the
training set and the test set, in a ratio of 70:30, 80:20, etc. It provides high
variance, which is one of its biggest disadvantages.
• Training Data: The training data is used to train the model, and the dependent
variable is known.
• Test Data: The test data is used to make the predictions from the model that is
already trained on the training data. This has the same features as training data
but not the part of that.
• Cross-Validation dataset: It is used to overcome the disadvantage of
train/test split by splitting the dataset into groups of train/test splits,
and averaging the result. It can be used if we want to optimize our
model that has been trained on the training dataset for the best
performance. It is more efficient as compared to train/test split as every
observation is used for the training and testing both.
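A hedged scikit-learn sketch comparing a single train/test split with 5-fold cross-validation; the model and dataset are placeholders for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single hold-out split: the score depends on this one particular split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: every observation is used for both training and testing
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())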
Limitations of Cross-Validation
• For the ideal conditions, it provides the optimum output. But for
the inconsistent data, it may produce a drastic result. So, it is one
of the big disadvantages of cross-validation, as there is no
certainty of the type of data in machine learning.
• In predictive modeling, the data evolves over a period, due to
which, it may face the differences between the training set and
validation sets. Such as if we create a model for the prediction of
stock market values, and the data is trained on the previous 5
years' stock values, but the realistic future values for the next 5
years may be drastically different, so it is difficult to expect the
correct output for such situations.
Applications of Cross-Validation
The gradient descent update rule is: θ = θ − α × (dJ/dθ)
Here,
• θ is the parameter we wish to update,
• dJ/dθ is the partial derivative which tells us the rate of change of error on the cost function
with respect to the parameter θ,
• α here is the Learning Rate.
• So, this J here represents the cost function and there are multiple ways to calculate this
cost. Based on the way we are calculating this cost function there are different variants of
Gradient Descent.
Types of gradient Descent:
1. Batch Gradient Descent:
• This is a type of gradient descent which processes all the training
examples for each iteration of gradient descent.
• But if the number of training examples is large, then batch
gradient descent is computationally very expensive.
• Hence if the number of training examples is large, then batch
gradient descent is not preferred.
• Instead, we prefer to use stochastic gradient descent or mini-
batch gradient descent.
Batch Gradient Descent contd..
• If there are a total of ‘m’ observations in a data set then we use
all these observations to calculate the cost function J, then this
is known as Batch Gradient Descent.
• So for the entire training set, we calculate the cost function.
And then we update the parameters using the rate of change of
this cost function with respect to the parameters.
• An epoch is when the entire training set is passed through the
model.
• In batch Gradient Descent since we are using the entire training
set, the parameters will be updated only once per epoch.
2. Stochastic Gradient Descent
• If you use a single observation to calculate the cost function it is known
as Stochastic Gradient Descent, commonly abbreviated as SGD. We
pass a single observation at a time, calculate the cost and update the
parameters.
• This is a type of gradient descent which processes 1 training example
per iteration.
• Hence, the parameters are being updated even after one iteration in
which only a single example has been processed.
• Hence this is quite faster than batch gradient descent.
• But again, when the number of training examples is large, even then it
processes only one example which can be additional overhead for the
system as the number of iterations will be quite large.
• if we use the SGD, will take the first observation, then pass it through the neural network, calculate
the error and then update the parameters.
• Then will take the second observation and perform similar steps with it.
• This step will be repeated until all observations have been passed through the network and the parameters
have been updated.
• Each time the parameter is updated, it is known as an Iteration.
• Here since we have 5 observations, the parameters will be updated 5 times or we can say that there will be
5 iterations.
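A hedged sketch of batch gradient descent for a simple line y = m*x + c, with the per-observation (SGD) update shown as comments; the data, learning rate, and epoch count are illustrative choices.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0            # true m = 2, c = 1
m, c, lr = 0.0, 0.0, 0.01

for epoch in range(2000):
    # Batch GD: the whole training set is used, so one parameter update per epoch
    err = (m * x + c) - y
    m -= lr * (2 / len(x)) * np.dot(err, x)   # dJ/dm
    c -= lr * (2 / len(x)) * err.sum()        # dJ/dc

print(m, c)   # close to 2 and 1

# SGD variant: inside each epoch, update once per single observation instead:
# for xi, yi in zip(x, y):
#     err_i = (m * xi + c) - yi
#     m -= lr * 2 * err_i * xi
#     c -= lr * 2 * err_i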
https://www.youtube.com/watch?v=vsWrXfO3wWw
Class Imbalance Problem in
Machine Learning
Class Imbalance Problem in Machine Learning
• Class imbalance is the problem when the number of examples available for one or more classes is
far less than other classes.
• In short, the distribution of examples across the known classes is biased.
• For Example: To detect fraud credit card transactions.
Class Imbalance Problem in Machine Learning
For Example: To detect fraud credit card transactions. In a fraud detection data set you have the
following counts:
Total Observations = 1000
Non-Fraud Observations = 980
Fraud Observations = 20
• Examples:
• Datasets to identify customer churn where a vast majority of
customers will continue using the service. Specifically,
Telecommunication companies where Churn Rate is lower than 2 %.
• Data sets to identify rare diseases in medical diagnostics etc.
• Natural Disaster like Earthquakes
Class Imbalance Problem in Machine Learning
• The classes which have a large number of samples are called the majority
classes
• while the classes which have very few samples are called the minority
classes.
• Techniques for handling Class-Imbalance Problem:
• Data Level Methods
• Algorithm/Classifier Level Methods
Class Imbalance Problem in Machine Learning
Data Level Methods (Resampling Techniques):
• Data Level methods are those where we make changes to the distribution of the
training set while keeping the algorithm and its subparts such as loss function,
optimizer constant.
• The data level methods aim to vary the dataset in a way to make standard
algorithms work.
• There are two famous data-level methods readily applied in the machine learning
domain.
• 1. Oversampling:
• 2. Undersampling:
Class Imbalance Problem in Machine Learning
• Data Level Methods:1 Oversampling:
• Upsampling Minority Class
• It is a very simple and widely known technique used to solve the problem of Class Imbalance.
• In this technique, we try to make the distribution of all the classes equal in a mini-batch by
sampling an equal number of samples from all the classes thereby sampling more examples from
minority classes as compared to majority classes.
Class Imbalance Problem in Machine Learning
• Oversampling:
Over-Sampling increases the number of instances in the minority class by randomly
replicating them in order to present a higher representation of the minority class in
the sample.
Total Observations = 1000
Fraud Observations =20
Non Fraud Observations = 980
Event Rate= 2 %
In this case we are replicating 20 fraud observations 20 times.
Non Fraud Observations =980
Fraud Observations after replicating the minority class observations= 400
Total Observations in the new data set after oversampling=980+400=1380
Event Rate for the new data set after oversampling = 400/1380 = 29 %
Class Imbalance Problem in Machine Learning
• Data Level Methods: 1 Oversampling :
• Advantages
• This method leads to no information loss.
• Outperforms under sampling.
• Disadvantages
• It increases the likelihood of overfitting since it replicates the minority class events.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 2 Undersampling:
• Downsampling Majority Class
• It is just the opposite of Oversampling.
• In this, we randomly remove samples from the majority class until all the classes have the same
number of samples.
• This technique has a significant disadvantage in that it discards data which might lead to a
reduction in the number of representative samples in the dataset.
• To fix this shortcoming various methods are used which carefully remove redundant samples
thereby preserving the variability of the dataset.
Data Level Methods: 2 Undersampling:
• Undersampling aims to balance class distribution by randomly eliminating
majority class examples.
• This is done until the majority and minority class instances are balanced out.
• Total Observations = 1000
• Fraudulent Observations =20
• Non Fraudulent Observations = 980
• Event Rate= 2 %
• In this case we are taking 10 % samples without replacement from Non Fraud
instances. And combining them with Fraud instances.
• Non Fraudulent Observations after random under sampling = 10 % of 980 =98
• Total Observations after combining them with Fraudulent observations =
20+98=118
• Event Rate for the new dataset after under sampling = 20/118 = 17%
Class Imbalance Problem in Machine Learning
• Data Level Methods: 2 Undersampling:
• Advantages
• It can help improve run time and storage problems by reducing the number of
training data samples when the training data set is huge.
• Disadvantages
• It can discard potentially useful information which could be important for building
rule classifiers.
• The sample chosen by random under-sampling may be a biased sample.
• And it will not be an accurate representation of the population.
• Thereby, resulting in inaccurate results with the actual test data set.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
• In this case, the K-means clustering algorithm is independently applied to minority
and majority class instances.
• This is to identify clusters in the dataset.
• Subsequently, each cluster is oversampled such that all clusters of the same class
have an equal number of instances and all classes have the same size.
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
Total Observations = 1000
Fraudulent Observations =20
Non Fraudulent Observations = 980
Event Rate= 2 %
Majority Class Clusters
Cluster 1: 150 Observations
Cluster 2: 120 Observations
Cluster 3: 230 observations
Cluster 4: 200 observations
Cluster 5: 150 observations
Cluster 6: 130 observations
Minority Class Clusters
Cluster 1: 8 Observations
Cluster 2: 12 Observations
Class Imbalance Problem in Machine Learning
• Data Level Methods: 3 Cluster-Based Over Sampling:
After oversampling of each cluster, all clusters of the same class contain the same
number of observations.
Majority Class Clusters
Cluster 1: 170 Observations
Cluster 2: 170 Observations
Cluster 3: 170 observations
Cluster 4: 170 observations
Cluster 5: 170 observations
Cluster 6: 170 observations
Minority Class Clusters
Cluster 1: 250 Observations
Cluster 2: 250 Observations
Event Rate post cluster-based oversampling = 500 / (1020 + 500) = 33 %
Data Level Methods: 3 Cluster-Based Over Sampling:
• Advantages
• This clustering technique helps overcome the challenge of between-class
imbalance, where the number of examples representing the positive class differs
from the number of examples representing the negative class.
• It also helps overcome challenges of within-class imbalance, where a class is composed
of different sub-clusters and each sub-cluster does not contain the same
number of examples.
• Disadvantages
• The main drawback of this algorithm, like most oversampling techniques is
the possibility of over-fitting the training data.
Data Level Methods: 4 Informed Over Sampling: Synthetic Minority Over-sampling
Technique for imbalanced data
• Advantages
• Mitigates the problem of overfitting caused by random oversampling as
synthetic examples are generated rather than replication of instances
• No loss of useful information
• Disadvantages
• While generating synthetic examples SMOTE does not take into consideration
neighboring examples from other classes. This can result in increase in
overlapping of classes and can introduce additional noise
• SMOTE is not very effective for high dimensional data
Synthetic Minority Oversampling Algorithm
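A hedged sketch of SMOTE using the third-party imbalanced-learn package (pip install imbalanced-learn); the synthetic 98%/2% dataset mirrors the fraud example but is generated purely for illustration.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
print(Counter(y))                         # roughly 980 majority vs 20 minority

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                     # classes balanced with synthetic minority samples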
Class Imbalance Problem in Machine Learning
Algorithm Level Methods:
• Here we keep the dataset constant but alter the training or inference algorithms.
1. Cost-Sensitive Learning(Penalize Algorithms):
• Here we assign different costs to classes according to their distribution.
• use a higher learning rate for examples belonging to the minority class as compared to
examples belonging to the majority class, or
• use class weighted loss functions which calculate loss by taking the class distribution into
account and hence penalize the classifier more for misclassifying examples from
minority class as compared to majority class.
• Mostly widely used class weighted loss functions are WeightedCrossEntropy and Focal
Loss.
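A hedged scikit-learn sketch of cost-sensitive learning through class weights, so misclassifying the rare class is penalised more heavily; the data and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)

# "balanced" weights classes inversely to their frequency; an explicit dict
# such as {0: 1, 1: 50} is an alternative way to set the costs
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))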
Class Imbalance Problem in Machine Learning
• Ada Boost is the first original boosting technique which creates a highly accurate
prediction rule by combining many weak and inaccurate rules.
• Each classifier is serially trained with the goal of correctly classifying examples in
every round that were incorrectly classified in the previous round.
• For a learned classifier to make strong predictions it should follow the following
three conditions:
• The rules should be simple
• Classifier should have been trained on sufficient number of training examples
• The Classifier should have low training error for the training instances
Adaptive Boosting- Ada Boost techniques for imbalanced data
• Each of the weak hypotheses has an accuracy slightly better than random guessing,
i.e. its error term ε(t) should be at most ½ − β, where β > 0.
• This is the fundamental assumption of this boosting algorithm which can produce
a final hypothesis with a small error
• After each round, it gives more focus to examples that are harder to classify.
• The quantity of focus is measured by a weight, which initially is equal for all
instances.
• After each iteration, the weights of misclassified instances are increased and the
weights of correctly classified instances are decreased.
Adaptive Boosting- Ada Boost techniques for imbalanced data
Adaptive Boosting- Ada Boost techniques for imbalanced data
• For example in a data set containing 1000 observations out of which 20 are
labelled fraudulent.
• Equal weights W1 are assigned to all observations and the base classifier
accurately classifies 400 observations.
• Weight of each of the 600 misclassified observations is increased to w2 and
weight of each of the correctly classified observations is reduced to w3.
• In each iteration, these updated weighted observations are fed to the weak
classifier to improve its performance.
• This process continues till the misclassification rate significantly decreases
thereby resulting in a strong classifier.
Evaluation Metrics For Classification Model
• False Positive Rate (FPR): False Positive Rate corresponds to the proportion of
negative data points that are mistakenly considered as positive, for all negative
data points.
ROC curve (Receiver Operating Characteristic
curve)
They both have values in the range of [0,1] which are computed at varying
threshold values.
The perfect classifier will have high value of true positive rate and low value of
false positive rate.
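A hedged scikit-learn sketch of computing TPR/FPR at varying thresholds and plotting the ROC curve; the model and data are placeholders for illustration.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR and FPR at each threshold
print(roc_auc_score(y_te, scores))               # area under the ROC curve

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()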
Feature Selection Techniques in Machine Learning
• In other words, it is a way of selecting the optimal features from the input dataset.
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken.
1. Filter Methods
• Correlation:
• Correlation explains how one or more variables are related to each other.
These variables can be input data features which have been used to forecast
our target variable.
• Pearson’s Correlation Coefficient is a measure of quantifying the association
between the two continuous variables and the direction of the relationship
with its values ranging from -1 to 1.
1. Filter Methods
Positive Correlation:
• Two features (variables) can be positively correlated with each other.
• It means that when the value of one variable increases, the value of the
other variable(s) also increases.
1. Filter Methods
Negative Correlation:
• Two features (variables) can be negatively correlated with each other.
• It means that when the value of one variable increases, the value of the other
variable(s) decreases.
1. Filter Methods
No Correlation:
• Two features (variables) are not correlated with each other.
• It means that when the value of one variable increases or decreases, the value of the
other variable(s) doesn't increase or decrease accordingly.
Variance Threshold
• Variance Threshold is a feature selector that removes all the
low variance features from the dataset that are of no great use
in modeling.
• It looks only at the features (x), not the desired outputs (y), and
can thus be used for unsupervised learning.
• Default Value of Threshold is 0
• If Variance Threshold = 0 (Remove Constant Features )
• If Variance Threshold > 0 (Remove Quasi-Constant Features )
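A minimal scikit-learn sketch: with the default threshold of 0, only the constant first column of this toy matrix is removed.
from sklearn.feature_selection import VarianceThreshold

X = [[1, 2, 0],
     [1, 4, 1],
     [1, 6, 0],
     [1, 8, 1]]          # first column is constant (variance 0)

selector = VarianceThreshold(threshold=0.0)
print(selector.fit_transform(X))   # constant column dropped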
1. Filter Methods
• Chi-Square Test:
Chi-square method (X2) is generally used to test the relationship between
categorical variables.
It compares the observed values from different attributes of the dataset to
its expected value.
1. Filter Methods
• Variance Threshold – It is an approach where all features are
removed whose variance doesn’t meet the specific
threshold. By default, this method removes features having
zero variance. The assumption made using this method is
higher variance features are likely to contain more
information.
• Mean Absolute Difference (MAD) – This method is similar to
variance threshold method but the difference is there is no
square in MAD. This method calculates the mean absolute
difference from the mean value.
• Information Gain: It is defined as the amount of information
provided by the feature for identifying the target value and
measures reduction in the entropy values. Information gain
of each attribute is calculated considering the target values
for feature selection.
2. Wrappers Methods
• The wrapper method has the same goal as the filter method, but it
takes a machine learning model for its evaluation.
• In this method, some features are fed to the ML model, and
evaluate the performance. The performance decides whether to
add those features or remove to increase the accuracy of the
model.
• This method is more accurate than the filtering method but more
complex to work with.
Some common techniques of wrapper methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
• Forward selection - This method is an iterative approach where
we initially start with an empty set of features and keep adding
the feature which best improves our model after each iteration.
The stopping criterion is when the addition of a new variable
no longer improves the performance of the model.
• Backward elimination - This method is also an iterative
approach where we initially start with all features and after
each iteration remove the least significant feature. The
stopping criterion is when no improvement in the performance of
the model is observed after a feature is removed.
• Bi-directional elimination – This method uses both forward
selection and backward elimination technique simultaneously
to reach to one unique solution.
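A hedged sketch of forward selection with scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the model, dataset, and target of 2 features are illustrative choices, and direction="backward" gives backward elimination.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features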
• Regularization – This method adds a penalty to
different parameters of the machine learning model to
avoid over-fitting of the model. This approach of feature
selection uses Lasso (L1 regularization) and Elastic nets
(L1 and L2 regularization). The penalty is applied over
the coefficients, thus bringing down some coefficients
to zero. The features having zero coefficient can be
removed from the dataset.
• Tree-based methods - Methods such as Random
Forest and Gradient Boosting provide feature
importance as a way to select features as well. Feature
importance tells us which features are more important
in making an impact on the target feature.
Issues in Decision Tree Learning
and How To solve them
Decision tree
• A decision tree is an algorithm for supervised learning.
• It uses a tree structure, in which there are two types of nodes:
decision node and leaf node.
• A decision node splits the data into two branches by asking a Boolean
question on a feature.
• A leaf node represents a class.
• The training process is about finding the “best” split at a certain
feature with a certain value.
• And the predicting process is to reach the leaf node from root by
answering the question at each decision node along the path.
Types of Decision Trees
• Types of decision trees are based on the type of target variable we
have.
• Categorical Variable Decision Tree:
• A decision tree which has a categorical target variable is called
a categorical variable decision tree.
• Continuous Variable Decision Tree:
• A decision tree which has a continuous target variable is called a continuous
variable decision tree.
Important Terminology related to Decision Trees
Decision Tree Example
Decision tree algorithms
• ID3
• C4.5
• CART
Issue 1 Overfitting the Data
• Over-fitting occurs when the model fits the given training data so accurately
that it becomes inaccurate in predicting the outcomes of untrained (unseen) data.
• In decision trees, over-fitting occurs when the tree is designed so as to
perfectly fit all samples in the training data set.
• Thus it ends up with branches with strict rules for sparse data,
• and this affects the accuracy when predicting samples that are not
part of the training set.
Prevent overfitting/Determine how deeply to grow
the decision tree
• 1. we stop splitting the tree at some point;
• we need to introduce two hyperparameters for training like maximum depth of
the tree and minimum size of a leaf.
• 2. We generate a complete tree first, and then get rid of some
branches; this is called pruning.
• In pruning, you trim off the branches of the tree, i.e., remove the decision
nodes starting from the leaf node such that the overall accuracy is not
disturbed.
• This is done by segregating the actual training set into two sets: training data
set, D and validation data set, V.
• Prepare the decision tree using the segregated training data set, D.
• Then continue trimming the tree accordingly to optimize the accuracy of the
validation data set, V.
Overfitting the Data
• Unlike other regression models, decision tree doesn’t use
regularization to fight against overfitting.
• Instead, it employs tree pruning.
• Selecting the right hyperparameters (tree depth and leaf size) also
requires experimentation, e.g. doing cross-validation with a
hyperparameter matrix.
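A hedged scikit-learn sketch of exactly this: limiting tree depth and leaf size, and choosing them by cross-validation over a small hyperparameter grid (the grid values and dataset are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 3, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)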
Issue 2: Continuous-Valued Attributes
• Define new discrete-valued attributes that partition the continuous attribute
values into a discrete set of intervals.
• Find a set of thresholds midway between different target values of the attribute,
e.g. Temperature > 54 and Temperature > 85.
• There are two techniques given below that are used to perform
ensemble decision tree.
1. Bagging
2. Boosting
Bagging
• The total error is nothing but the summation of all the sample weights of misclassified
data points. Here in our dataset let's assume there is 1 wrong output, so our total error will
be 1/5, and alpha (the performance of the stump) will be:
alpha = ½ × ln((1 − Total Error) / Total Error) = ½ × ln(4) ≈ 0.69
working of AdaBoost Algorithm
• Note: Total error will always be between 0 and 1.
• 0 Indicates perfect stump and 1 indicates horrible stump.
• The amount of say (alpha) will be negative when the sample is correctly classified.
• The amount of say (alpha) will be positive when the sample is misclassified.
• There are four correctly classified samples and 1 wrong, here the sample weight of that
datapoint is 1/5 and the amount of say/performance of the stump of Gender is 0.69.
New weight for correctly classified samples = 1/5 × e^(−0.69) ≈ 0.1004; for the wrongly classified sample the updated weight = 1/5 × e^(+0.69) ≈ 0.3988.
working of AdaBoost Algorithm
• Note the sign of alpha when plugging in the values: alpha is negative when the data point is correctly classified, and this decreases the sample weight from 0.2 to 0.1004.
• It is positive when there is a misclassification, and this increases the sample weight from 0.2 to 0.3988.
working of AdaBoost Algorithm
• We know that the total sum of the sample weights must be equal to 1 but here if we sum
up all the new sample weights, we will get 0.8004.
• To bring this sum to 1, we normalize these weights by dividing each of them by the total sum of updated weights, that is, 0.8004. After normalizing the sample weights we get this dataset, and now the sum is equal to 1 (a numeric sketch of these updates follows below).
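A small numeric sketch of these weight updates for the assumed dataset of 5 samples with 1 misclassification. With the rounded alpha of 0.69 the slides get 0.1004, 0.3988 and a sum of 0.8004; the exact alpha of about 0.693 used below gives 0.1, 0.4 and 0.8:

```python
import math

n = 5
weights = [1 / n] * n                         # initial sample weights: 0.2 each
misclassified = [False, False, False, False, True]

total_error = sum(w for w, m in zip(weights, misclassified) if m)   # 1/5
alpha = 0.5 * math.log((1 - total_error) / total_error)             # ~0.693 (slide rounds to 0.69)

# Correctly classified weights shrink (e^-alpha); the misclassified weight grows (e^+alpha).
new_weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
print(sum(new_weights))                       # ~0.8 (slide: 0.8004)

# Normalize so the weights sum to 1 again.
normalized = [w / sum(new_weights) for w in new_weights]
print(normalized, sum(normalized))
```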
working of AdaBoost Algorithm
• Step 5 – Now we need to make a new dataset to see if the errors decreased or not. For this
we will remove the “sample weights” and “new sample weights” column and then based
on the “new sample weights” we will divide our data points into buckets.
working of AdaBoost Algorithm
• Step 6 – We are almost done. The algorithm now selects random numbers between 0 and 1. Since incorrectly classified records have higher sample weights, the probability of selecting those records is very high.
• Suppose the 5 random numbers our algorithm draws are 0.38, 0.26, 0.98, 0.40, 0.55.
• Now we will see where these random numbers fall in the buckets and, accordingly, build the new dataset shown below.
working of AdaBoost Algorithm
• This becomes our new dataset, and we see that the datapoint which was wrongly classified has been selected 3 times because it has a higher weight (see the sketch below).
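A sketch of the bucket-based resampling in Steps 5 and 6, assuming the misclassified sample is the 4th row and the normalized weights are roughly 0.125 for the correct rows and about 0.5 for the wrong one (close to the slide's values); the cumulative weight ranges act as the buckets:

```python
import bisect
import itertools

# Normalized sample weights; assumed ordering: the misclassified sample is row 4.
weights = [0.125, 0.125, 0.125, 0.498, 0.125]

# Buckets are cumulative weight ranges: [0, 0.125), [0.125, 0.25), [0.25, 0.375), ...
buckets = list(itertools.accumulate(weights))

# The random draws from Step 6; each draw selects the row whose bucket it falls into.
draws = [0.38, 0.26, 0.98, 0.40, 0.55]
picked_rows = [bisect.bisect_left(buckets, d) for d in draws]
print(picked_rows)   # row index 3 (the misclassified row) is selected 3 times
```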
• Step 9 – Now this acts as our new dataset and we need to repeat all the above steps, i.e.
1. Assign equal weights to all the datapoints
2. Find the stump that does the best job classifying the new collection of samples by finding their
Gini Index and selecting the one with the lowest Gini index
3. Calculate the “Amount of Say” and “Total error” to update the previous sample weights.
4. Normalize the new sample weights.
5. Iterate through these steps until a low training error is achieved.
6. Suppose with respect to our dataset we have constructed 3 decision trees (DT1, DT2, DT3) in a
sequential manner. If we send our test data now it will pass through all the decision trees and
finally, we will see which class has the majority, and based on that we will do predictions for
our test dataset.
• AdaBoost Algorithm - A Complete Guide for Beginners - Analytics
Vidhya
Difference between Bagging and Boosting

Bagging: Various training data subsets are randomly drawn with replacement from the whole training dataset.
Boosting: Each new subset contains the components that were misclassified by previous models.

Bagging: Attempts to tackle the over-fitting issue.
Boosting: Tries to reduce bias.

Bagging: Applied when the classifier is unstable (high variance).
Boosting: Applied when the classifier is steady and straightforward (high bias).

Bagging: Every model receives an equal weight.
Boosting: Models are weighted by their performance.

Bagging: Objective is to decrease variance, not bias.
Boosting: Objective is to decrease bias, not variance.

Bagging: The easiest way of combining predictions that belong to the same type.
Boosting: A way of combining predictions that belong to different types.

Bagging: Every model is constructed independently.
Boosting: New models are affected by the performance of previously developed models.
Classification of Class-Imbalanced Data Sets
Reinforcement Learning
Types of Machine Learning
What is Reinforcement Learning?
Iteration 1:
• the agent performs a random action in each state. For instance,
look at the following figure. In the first iteration, the agent
moves right from state A and reaches the new state B.
• But since B is the shaded state, the agent will receive a negative
reward and so the agent will understand that moving right is
not a good action in state A.
• When it visits state A next time, it will try out a different action
instead of moving right:
RL agent in the grid world
Iteration 1:
• from state B, the agent moves down and reaches the new
state E. Since E is an unshaded state, the agent will receive a
positive reward, so the agent will understand that
moving down from state B is a good action.
• From state E, the agent moves right and reaches state F.
Since F is an unshaded state, the agent receives a positive
reward, and it will understand that moving right from state E is
a good action.
• From state F, the agent moves down and reaches the goal
state I and receives a positive reward, so the agent will
understand that moving down from state F is a good action
RL agent in the grid world
Iteration 2:
In the second iteration, from state A, instead of
moving right, the agent tries out a different action as the
agent learned in the previous iteration that moving right is
not a good action in state A.
Thus, as Figure shows, in this iteration the agent
moves down from state A and reaches state D. Since D is
an unshaded state, the agent receives a positive reward
and now the agent will understand that moving down is a
good action in state A:
RL agent in the grid world
Iteration 2:
• from state D, the agent moves down and reaches state G. But
since G is a shaded state, the agent will receive a negative
reward and so the agent will understand that moving down is
not a good action in state D, and when it visits state D next
time, it will try out a different action instead of moving down.
• From G, the agent moves right and reaches state H. Since H is
a shaded state, it will receive a negative reward and understand
that moving right is not a good action in state G.
• From H it moves right and reaches the goal state I and receives
a positive reward, so the agent will understand that
moving right from state H is a good action.
RL agent in the grid world
Iteration 3:
• The agent moves down from state A since, in the second iteration, our agent
learned that moving down is a good action in state A.
So, the agent moves down from state A and reaches the next
state, D, as Figure shows.
• Now, from state D, the agent tries a different action instead of moving down
since in the second iteration our agent learned that moving down is not a
good action in state D. So, in this iteration, the agent moves right from state
D and reaches state E.
• From state E, the agent moves right as the agent already learned in the first
iteration that moving right from state E is a good action and reaches state F.
• Now, from state F, the agent moves down since the agent learned in the first
iteration that moving down is a good action in state F, and reaches the goal
state I.
Types of Reinforcement Learning
• 1 Positive Reinforcement Learning:
• In this type of RL, the algorithm receives a type of reward for a certain result. In other words, here we try to add a
reward for every good result in order to increase the likelihood of a good result.
• We can understand this easily with the help of a good example.
• In order to make a child do a certain task, like cleaning their room or studying hard for good marks, some parents often promise them a reward at the end of the task.
• For instance, the parents promise to give the child something that he or she loves, like chocolate. This has a good impact as it automatically makes the child work while thinking of the reward. In this learning, we are adding a good reward to increase the likelihood of task completion.
• This can have good impacts like improvement in performance, sustaining the change for a longer duration, etc., but its
negative side could be that too much of RL could cause overloading of states that could impact the results.
Types of Reinforcement Learning
• 2 Negative Reinforcement Learning:
• This RL Type is a bit different from positive RL. Here, we try to remove something negative in order to improve
performance.
• We can take the same child-parent example here as well. Some parents punish kids for not cleaning their rooms.
• The punishment can be no video games for one week or sometimes a month. To avoid the punishment the kids often
work harder or complete the job assigned to them.
• We can also take the example of getting late for the office. People often sleep late and get up late. To avoid being late
at the office, they try to change their sleep habits.
• From these examples, we understand that the algorithm in this case receives negative feedback, so it learns to avoid the behaviour that produced that feedback. This also has its good impacts: the drive toward performing the task increases, forcing better results.
• The negative impact is that it may only push you to meet the minimum requirement necessary to complete the job.
Supervised vs Unsupervised vs Reinforcement Learning

Criteria         | Supervised ML                                          | Unsupervised ML                            | Reinforcement ML
Type of problems | Regression and classification                          | Association and clustering                 | Exploitation or exploration
Algorithms       | Linear Regression, Logistic Regression, SVM, KNN etc.  | K-Means, C-Means, Apriori                  | Q-Learning, SARSA
Aim              | Calculate outcomes                                     | Discover underlying patterns               | Learn a series of actions
Application      | Risk evaluation, sales forecasting                     | Recommendation systems, anomaly detection  | Self-driving cars, gaming, healthcare
Fundamental concepts of RL
• Math essentials
• Before going ahead, let's quickly recap expectation from our high school
days, as we will be dealing with expectation throughout the book.
• Expectation
• Let's say we have a variable X and it has the values 1, 2, 3, 4, 5, 6.
• To compute the average value of X, we can just sum all the values
of X divided by the number of values of X. Thus, the average of X is
(1+2+3+4+5+6)/6 = 3.5.
• Now, let's suppose X is a random variable.
• A random variable takes values based on a random experiment, such as throwing a die or tossing a coin, and it takes different values with some probabilities. Suppose we throw a fair die; then the possible outcomes (X) are 1, 2, 3, 4, 5, and 6, and the probability of occurrence of each of these outcomes is 1/6, as shown in the table.
Fundamental concepts of RL
• How can we compute the average value of the random variable X? Since each value has a probability of occurrence, we can't just take a plain average.
• Instead, we compute the weighted average, that is, the sum of the values of X multiplied by their respective probabilities; this is called the expectation.
• The expectation of a random variable X can be defined as:
  E[X] = Σᵢ xᵢ p(xᵢ)
• For the fair die, E[X] = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 3.5.
Action space:
• Consider the grid world environment shown in Figure
• In the preceding grid world environment, the goal of the agent is
to reach state I starting from state A without visiting the shaded
states. In each of the states, the agent can perform any of the four
actions—up, down, left, and right—to achieve the goal.
• The set of all possible actions in the environment is called the
action space. Thus, for this grid world environment, the action
space will be [up, down, left, right].
• We can categorize action spaces into two types:
1. Discrete action space
2. Continuous action space
Fundamental concepts of RL-Action
space
• A policy defines the agent's behavior in an environment. The policy tells the
agent what action to perform in each state. For instance, in the grid world
environment, we have states A to I and four possible actions. The policy may
tell the agent to move down in state A, move right in state D, and so on.
• To interact with the environment for the first time, we initialize a random policy,
that is, the random policy tells the agent to perform a random action in each
state.
• Thus, in an initial iteration, the agent performs a random action in each state and
tries to learn whether the action is good or bad based on the reward it obtains.
• Over a series of iterations, an agent will learn to perform good actions in each
state, which gives a positive reward.
• Thus, we can say that over a series of iterations, the agent will learn a good
policy that gives a positive reward.
Fundamental concepts of RL-Policy
• The optimal policy is the policy that gets the agent a good reward
and helps the agent to achieve the goal. For instance, in our grid
world environment, the optimal policy tells the agent to perform an
action in each state such that the agent can reach state I from state
A without visiting the shaded states.
• The agent interacts with the environment by performing some actions, starting from
the initial state and reaches the final state.
• This agent-environment interaction starting from the initial state until the final state
is called an episode.
• For instance, in a car racing video game, the agent plays the game by starting from
the initial state (the starting point of the race) and reaches the final state (the
endpoint of the race). This is considered an episode.
• An episode is also often called a trajectory (the path taken by the agent) and it is
denoted by
Fundamental concepts of RL-Episode
• An agent can play the game for any number of episodes, and each episode is independent
of the others.
• What is the use of playing the game for multiple episodes? In order to learn the optimal
policy, that is, the policy that tells the agent to perform the correct action in each state, the
agent plays the game for many episodes.
• For example, let's say we are playing a car racing game; the first time, we may not win
the game, so we play the game several times to understand more about the game and
discover some good strategies for winning the game.
• Similarly, in the first episode, the agent may not win the game and it plays the game for
several episodes to understand more about the game environment and good strategies to
win the game.
• Say we begin the game from an initial state at a time step t = 0 and reach the final state at
a time step T, then the episode information consists of the agent-environment interaction,
such as state, action, and reward, starting from the initial state until the final state, that is,
(s0, a0, r0, s1, a1, r1,…,sT).
Fundamental concepts of RL-
Episode
• Figure shows an example of an
episode/trajectory:
Episode and the optimal policy with the
grid world environment
• In the grid world environment, the goal of our agent is to reach the final state I starting from the
initial state A without visiting the shaded states. An agent receives a +1 reward when it visits the
unshaded states and a -1 reward when it visits the shaded states.
• When we say generate an episode, it means going from the initial state to the final state. The agent
generates the first episode using a random policy and explores the environment and over several
episodes, it will learn the optimal policy.
• Episode 1
Episode and the optimal policy with the grid world environment
• Episode 2
• In the second episode, the agent tries a different policy to avoid the negative
rewards it received in the previous episode.
• For instance, as we can observe in the previous episode, the agent selected the
action right in state A and received a negative reward, so in this episode, instead of
selecting the action right in state A, it tries a different action, say down, as shown
in figure
Episode and the optimal policy with the grid world environment
• Episode n
• Thus, over a series of episodes, the agent learns the optimal policy, that is, the
policy that takes the agent to the final state I from state A without visiting the
shaded states, as Figure shows:
Fundamental concepts of RL- The value function
• The value function, also called the state value function, denotes the value of a state. The value of a state is the return an agent would obtain starting from that state and following a policy π.
• The value of a state, or value function, is usually denoted by V(s) and it can be expressed as:
  V(s) = E[ R(τ) | s0 = s ]
  where R(τ) is the return of the trajectory τ.
• s0 = s implies that the starting state is s. The value of a state is called the state value.
Fundamental concepts of RL- The value
function
• Let's understand the value function with an example. Let's suppose we
generate the trajectory following some policy
• in our grid world environment, as shown in Figure
Fundamental concepts of RL- The value
function
• The value of state A is the return of the trajectory starting from state A. Thus, V(A) = 1 + 1 − 1 + 1 = 2.
• The value of state D is the return of the trajectory starting from state D. Thus, V(D) = 1 − 1 + 1 = 1.
• The value of state E is the return of the trajectory starting from state E. Thus, V(E) = −1 + 1 = 0.
• The value of state H is the return of the trajectory starting from state H. Thus, V(H) = 1.
• Since I is the final state, we don't make any transition from the final state, so there is no reward and thus no value for the final state I (see the sketch below).
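A minimal sketch of this calculation, assuming the trajectory in the figure visits A → D → E → H → I and collects the rewards +1, +1, −1, +1 in that order (this reproduces the values above):

```python
# States visited along the trajectory and the reward received on leaving each one.
states  = ["A", "D", "E", "H"]
rewards = [ 1,   1,  -1,   1]

# Value of each visited state = return (sum of rewards) from that point onward.
values = {s: sum(rewards[i:]) for i, s in enumerate(states)}
print(values)   # {'A': 2, 'D': 1, 'E': 0, 'H': 1}
```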
Reinforcement Learning Algorithms
• The transition probability indicates the probability of moving from state s to the next state s′.
Markov Decision
Process (MDP)
• Say we have three states (cloudy, rainy, and windy)
in our Markov chain. Then we can represent the
probability of transitioning from one state to another
using a table called a Markov table, as shown in
Table.
• From the state cloudy, we transition to the state
rainy with 70% probability and to the state windy
with 30% probability.
• From the state rainy, we transition to the same state
rainy with 80% probability and to the state cloudy
with 20% probability.
• From the state windy, we transition to the state rainy
with 100% probability.
Markov Decision
Process (MDP)
• We can also formulate the transition probabilities into a matrix called the transition matrix (sketched below).
• We can say that the Markov chain, or Markov process, consists of a set of states along with their transition probabilities.
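A minimal sketch of that transition matrix for the three weather states, using the probabilities stated above; rows are the current state and columns the next state:

```python
import numpy as np

states = ["cloudy", "rainy", "windy"]

# P[i, j] = probability of moving from states[i] to states[j].
P = np.array([
    [0.0, 0.7, 0.3],   # cloudy -> rainy 70%, cloudy -> windy 30%
    [0.2, 0.8, 0.0],   # rainy  -> cloudy 20%, rainy -> rainy 80%
    [0.0, 1.0, 0.0],   # windy  -> rainy 100%
])

# Every row must sum to 1: from any state, the chain moves to some next state.
assert np.allclose(P.sum(axis=1), 1.0)
```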
The Markov Reward Process
• The Markov Reward Process (MRP) is an extension of the Markov chain with the reward
function.
• That is, we learned that the Markov chain consists of states and a transition probability.
• The MRP consists of states, a transition probability, and also a reward function.
• A reward function tells us the reward we obtain in each state.
• For instance, based on our previous weather example, the reward function tells us the
reward we obtain in the state cloudy, the reward we obtain in the state windy, and so on.
• The reward function is usually denoted by R(s).
• Thus, the MRP consists of states s, a transition probability
• and a reward function R(s).
The Markov Decision Process
• The Markov Decision Process (MDP) is an extension of the MRP with actions.
• That is, we learned that the MRP consists of states, a transition probability, and a reward
function.
• The MDP consists of states, a transition probability, a reward function, and also actions.
• The Markov property states that the next state depends only on the current state and not on the sequence of previous states.
• Is the Markov property applicable to the RL setting? Yes! In the RL environment, the
agent makes decisions only based on the current state and not based on the past states.
• So, we can model an RL environment as an MDP.
Q-learning
• Q-learning Definition
• Q*(s,a) is the expected value (cumulative discounted reward) of doing a in state s
and then following the optimal policy.
• Q-learning uses temporal differences (TD) to estimate the value of Q*(s,a). Temporal-difference learning means the agent learns from the environment through episodes, with no prior knowledge of the environment's dynamics.
• The agent maintains a table Q[S, A], where S is the set of states and A is the set of actions.
• Q[s, a] represents its current estimate of Q*(s,a) (a minimal update sketch follows below).
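A minimal sketch of the Q-table and its temporal-difference update, Q[s,a] ← Q[s,a] + α(r + γ·max over a′ of Q[s′,a′] − Q[s,a]); the grid-world states, the action set, and the α and γ values are illustrative assumptions:

```python
# Q-table: the agent's current estimate of Q*(s, a) for every state-action pair.
states  = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
actions = ["up", "down", "left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}

alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed values)

def q_update(s, a, reward, s_next):
    """One Q-learning (TD) update after observing the transition (s, a, reward, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Example: the agent moved right from A to B and received -1 (B is a shaded state).
q_update("A", "right", -1, "B")
print(Q[("A", "right")])   # -0.1
```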
Q-learning Simple Example
• The Monte Carlo method for reinforcement learning learns directly from episodes of experience
without any prior knowledge of MDP transitions.
• Here, the random component is the return or reward.
• Monte Carlo methods require only experience — sample sequences of states, actions, and
rewards from actual or simulated interaction with an environment.
• Learning from actual experience is striking because it requires no prior knowledge of the
environment’s dynamics, yet can still attain optimal behavior.
• What is the Monte Carlo method used for in RL?
• It is a method for estimating the action-value function (Value | State, Action) or the value function (Value | State) using sample runs from the environment for which we are estimating the value function.
Monte Carlo Reinforcement Learning
• Let us consider a system of 3 states: A, B, and terminate.
• We are given two example episodes (we can generate them using a random walk in any environment). Consistent with the calculations that follow, the two episodes are:
  Episode 1: A+3 → A+2 → B-4 → A+4 → B-3 → terminate
  Episode 2: B-2 → A+3 → B-3 → terminate
• A+3 → A+2 means a transition from state A to state A with reward = 3 for this transition.
2 types of Monte Carlo learning on how to average future rewards:
•First Visit Monte Carlo: First visit estimates (Value|State: S1) as the average of the returns following the first
visit to the state S1
•Every Visit Monte Carlo: It estimates (Value|State: S1) as the average of returns for every visit to the State
S1.
Monte Carlo Reinforcement Learning
Every Visit Monte Carlo: It estimates (Value|State: S1) as the average of returns for every visit to the State S1.
Calculating V(A)
For every occurrence of 'A' we create a summation term, adding all rewards coming after that occurrence (including the reward of that visit of A itself); a code sketch follows these calculations.
• From the 1st episode: (3 + 2 − 4 + 4 − 3) + (2 − 4 + 4 − 3) + (4 − 3) = 2 + (−1) + 1
• From the 2nd episode: (3 − 3) = 0
As we got 4 summation terms, we average using N = 4, i.e.
V(A) = (2 + (−1) + 1 + 0) / 4 = 0.5
Monte Carlo Reinforcement Learning
Every Visit Monte Carlo: It estimates (Value|State: S1) as the average of returns for every visit to the State S1.
Calculating V(B)
• From the 1st episode: (−4 + 4 − 3) + (−3) = −3 + (−3)
• From the 2nd episode: (−2 + 3 − 3) + (−3) = −2 + (−3)
As we have 4 summation terms, averaging using N = 4,
V(B) = (−3 + (−3) + (−2) + (−3)) / 4 = −2.75
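A minimal sketch of every-visit Monte Carlo on the two episodes assumed above, with each episode written as a list of (state, reward) pairs; it reproduces V(A) = 0.5 and V(B) = −2.75:

```python
from collections import defaultdict

# Each episode is a list of (state, reward received on that transition) pairs.
episodes = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],   # episode 1
    [("B", -2), ("A", 3), ("B", -3)],                        # episode 2
]

returns = defaultdict(list)
for episode in episodes:
    rewards = [r for _, r in episode]
    for i, (state, _) in enumerate(episode):
        # Every-visit MC: record the return after *every* occurrence of the state.
        returns[state].append(sum(rewards[i:]))

values = {s: sum(g) / len(g) for s, g in returns.items()}
print(values)   # {'A': 0.5, 'B': -2.75}
```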
Dynamic Programming
• Planning by Dynamic Programming: Reinforcement Learning | by
Ryan Wong | Towards Data Science
Example Data
• Now let’s look at an example using random walk (Figure 1) as
our environment.
• The basic idea is that you always start in state ‘D’ and you move
randomly, with a 50% probability, to either the left or right until
you reach the terminal or ending states ‘A’ or ‘G’.
• If you end in state ‘A’ you get a reward of 0, but if you end in
state ‘G’ the reward is 1.
• There are no rewards for states 'B' through 'F' (a minimal simulation of this walk is sketched below).
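A minimal simulation sketch of this random-walk environment as described above (start in 'D', move left or right with 50% probability, reward 1 only when ending in 'G'):

```python
import random

STATES = "ABCDEFG"   # 'A' and 'G' are the terminal states

def run_episode():
    """Simulate one random-walk episode starting from state 'D'."""
    pos = STATES.index("D")
    visited = ["D"]
    while STATES[pos] not in ("A", "G"):
        pos += random.choice([-1, 1])            # 50% left, 50% right
        visited.append(STATES[pos])
    reward = 1 if STATES[pos] == "G" else 0      # reward only for reaching 'G'
    return visited, reward

print(run_episode())
```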
Example Data
• TD(1) makes an update to our values in the same manner as Monte Carlo, at the
end of an episode.
• So back to our random walk, going left or right randomly, until landing in ‘A’ or ‘G’.
• Once the episode ends then the update is made to the prior states.
• As we mentioned above, the higher the lambda value, the further back the credit can be assigned, and here it is at the extreme with lambda equalling 1.
• This is an important distinction because TD(1) and MC only work in episodic
environments meaning they need a ‘finish line’ to make an update.
TD(1) Algorithm
• Gt in Figure 2 above is the discounted sum of all the rewards seen in our episode:
  Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + … + γ^(T−1)·R(t+T)
• So as we travel through our environment, we keep track of all the rewards and sum them together with a discount (γ): the immediate reward (R) at time t+1, plus the discounted future reward at t+2, and so on.
• You can see that rewards further in the future are discounted more heavily, up to γ^(T−1).
• So if γ = 0.2 and you are discounting the reward at time step 6, the discount factor becomes γ^(6−1) = 0.2⁵ = 0.00032.
• Significantly smaller after just 6 time steps.
TD(1) Algorithm
• We use the sum of discounted rewards from above, Gt, that we saw during our episode, and subtract the prior estimate from it.
• This is called the TD error: the updated estimate minus the previous estimate.
• We then multiply by an alpha (α) term to adjust how much of that error we want to update by.
• Lastly, we make the update by simply adding the adjusted TD error to our previous estimate V(St) (Figure 3):
  V(St) ← V(St) + α[Gt − V(St)]
TD(1) Algorithm
• The issue we can observe easily is that we always need a termination state!!
• If such is the case, what will happen to Continuous RL problems that don’t have a
termination state!!
• Also, why should we wait to update Value-Action-Function for all states till the
episode end? Can it be done before the episode ends?
• can be painful when we have 1000s of states
Temporal-Difference Learning
• Here comes Temporal Difference Learning which
• Doesn’t require any info about the environment (Like Monte Carlo)
• Update estimates based in part on other learned estimates, without
waiting for a final outcome (they bootstrap like DP).
• Hence,
• Temporal Difference= Monte Carlo + Dynamic Programming.
• In Temporal Difference, we also decide on how many references we need from the
future to update the current Value-Action-Function.
Temporal-Difference Learning
• It means we can update our present Value-Action-Function
using as many future rewards we want.
• It can be just one future reward TD(0) from the immediate next
future state
• or can be 5 future rewards from the next 5 future states i.e
TD(5). The onus is completely on us.
• Though I would be using TD(0) in the below examples.
Temporal-Difference-TD(0) Learning
Going step by step:
• Input π, i.e. the policy (can be e-greedy, greedy, etc.)
• Initialize the Value-Action-Function for every state (s belonging to S) in the environment
• For each episode e in E (the episodes/epochs we want to train):
    1. Take the initial state of the system.
    2. For each step in the episode:
        A. Choose an action according to the policy π.
        B. Update the Value-Action-Function for the current state using the equation below, and move to the next state S′.
• It's time to demystify the update equation (a runnable sketch follows below):
  V(S) ← V(S) + α[R + γV(S′) − V(S)]
• Here,
  V(S)/V(S,A) = Value-Action-Function for the current state
  α = step-size constant (learning rate)
  R = reward for the present action
  γ = discount factor
  V(S′)/V(S′,A) = Value-Action-Function for the next state S′ reached when action A is taken in state S
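A minimal, self-contained sketch of tabular TD(0) on the random walk described earlier (start in 'D', 50/50 left or right, reward 1 only when terminating in 'G'); the α, γ, and episode-count values are illustrative assumptions:

```python
import random

STATES = list("ABCDEFG")              # 'A' and 'G' are terminal
ALPHA, GAMMA, EPISODES = 0.1, 1.0, 5000

# Initialize the value function for every state.
V = {s: 0.0 for s in STATES}

for _ in range(EPISODES):
    pos = STATES.index("D")                          # 1. initial state
    while STATES[pos] not in ("A", "G"):             # 2. step through the episode
        nxt = pos + random.choice([-1, 1])           #    A. random (50/50) policy
        reward = 1.0 if STATES[nxt] == "G" else 0.0
        s, s_next = STATES[pos], STATES[nxt]
        # B. TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += ALPHA * (reward + GAMMA * V[s_next] - V[s])
        pos = nxt

# The true values of B..F for this walk are 1/6, 2/6, 3/6, 4/6, 5/6.
print({s: round(v, 2) for s, v in V.items()})
```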
Reinforcement Learning in Business,
Marketing, and Advertising
• In money-oriented fields, technology can play a crucial role. Like, here RL models of companies
can analyze customer preferences and help in the better advertisement of the products.
• We know that business requires proper strategizing. The steps need careful planning for a product
or the company to gain profit.
• RL here helps to devise proper strategies by analyzing various possibilities and, by doing so, tries to improve the profit margin of each outcome. Various multinational companies use these models, although the cost of these models is high.
Reinforcement Learning in Gaming
• Gaming is a booming industry and is gradually advancing with technology. The games are now
becoming more realistic and have many more details for them.
• We have environments like PSXLE or PlayStation Reinforcement Learning Environment that
focus on providing better gaming environments by modifying the emulators.
• We have deep reinforcement learning systems like AlphaGo and AlphaZero that play games such as chess, shogi, and Go.
• With these platforms and algorithms, gaming is now more advanced and is helping in creating
games, which have countless possibilities.
• These can also be helpful in making story-mode games of PlayStation.
Reinforcement Learning in Recommendation systems
• RL is now a big help in recommendation systems like news, music apps, and web-
series apps like Netflix, etc. These apps work as per customer preferences.
• In the case of web-series apps like Netflix, the variety of shows that we watch
become a list of preferences for the algorithm.
• Companies like these have sophisticated recommendation systems.
• They consider many things like user preference, trending shows, related genres,
etc. Then according to these preferences, the model will show you the latest
trending shows.
• These models are very much cloud-based, so as users, we will use these models in
our daily lives through information and entertainment platforms.
Reinforcement Learning in Science
• AI and ML technologies nowadays have become an important part of the research. There are
various fields in science where reinforcement learning can come in handy.
• The most talked-about is in atomic science. Both the physics behind atoms and their chemical
properties are researched.
• Reinforcement learning helps to understand chemical reactions. We can try to have cleaner
reactions that yield better products. There can be various combinations of reactions for any
molecule or atom. We can understand their bonding patterns with machine learning.
• In most of these cases, for having better quality results, we would require deep reinforcement
learning. For that, we can use some deep learning algorithms like LSTM.
Reinforcement Learning