Python Unit 4
What is a pattern?
A pattern is some phenomenon that repeats regularly based on a set rule or condition.
(Or)
Patterns are everywhere in the digital world. A pattern can either be observed physically or it can
be observed mathematically by applying algorithms.
Example:
Slide #1 shows a set of training examples and the corresponding label associated with each
example.
Some of you will try to discover the labeling pattern by looking at the data, because we have only a few
examples. Others, with access to machine learning software, will immediately plug in these training
examples and select their favored method of learning, for example decision trees, to figure out the
relationship between the numbers and their labels.
Let us look at our problem again. But this time let us use a different representation for our problem
as shown in Slide #2.
Pattern recognition is the process of recognizing patterns by using machine learning algorithms.
Pattern recognition can be defined as the classification of data based on knowledge already
gained or on statistical information extracted from patterns and/or their representation.
Features:
A feature is a property or characteristic of a pattern. For example, weight and height are
two different features that characterize objects such as chairs and humans.
Types of Features
There are different types of features. We may categorize them as follows:
1. Nominal data:
A nominal feature fN assumes values from a set of distinct values.
Ex: Type of Curve: A possible domain of this feature is: {line, parabola, circle, ellipse}.
2. Ordinal data:
The elements of the domain of the variable are ordered, in addition to being distinct.
EX: Height of an object: domain = {very tall, tall, medium, short, very short}.
Operations possible on ordinal data include comparison, mode, median, and percentile.
3. Interval-valued Data:
For interval-valued data, the differences between values are meaningful, and the values also need to
satisfy the properties of ordinal data.
Mean and standard deviation are two possible operations on such data.
• Here, the training dataset may be represented as a matrix of size (n × d), where each row
corresponds to a pattern and each column represents a feature.
• The class label is a dependent attribute which depends on the ‘d’ independent attributes.
In this case, n=7 and d=6. As can be seen, each pattern has six attributes (or features). Each
attribute in this case is a number between 1 and 9. The last number in each line gives the class of
the pattern. In this case, the class of the patterns is either 1, 2 or 3.
• Patterns may also be represented as strings of characters. For example, a DNA segment is written as
GTGCATCTGACTCCT...
the corresponding RNA is expressed as
GUGCAUCUGACUCCU....
and a protein sequence is written as
VHLTPEEK ....
Each string of characters represents a pattern. Operations like pattern matching or finding the
similarity between strings are carried out with these patterns.
• Another example would be if (has-trunk(x)) and (color(x) = black) and (size(x) = large) then
elephant(x)
In ML, curse of dimensionality can be defined as follows: as the number of features or dimensions
‘d’ grows, the amount of data we require to generalize accurately grows exponentially. As the
dimensions increase the data becomes sparse and as the data becomes sparse it becomes hard to
generalize the model.
Curse of Dimensionality refers to a set of problems that arise when working with high-
dimensional data.
A dataset with a large number of attributes, generally of the order of a hundred or more, is
referred to as high dimensional data.
For example:
Suppose we are building several machine learning models to analyze the performance of a Formula
One (F1) driver. Consider the following cases:
i) Model_1 consists of only two features say the circuit name and the country name.
ii) Model_2 consists of 4 features say weather and max speed of the car including the above two.
iii) Model_3 consists of 8 features say driver’s experience, number of wins, car condition, and
driver’s physical fitness including all the above features.
iv) Model_4 consists of 16 features say driver’s age, latitude, longitude, driver’s height, hair color,
car color, the car company, and driver’s marital status including all the above features.
The following figure shows the decrease in the standard deviation of the distribution as the
number of dimensions increases.
It is observed that on increasing the number of features the accuracy tends to increase until a
certain threshold value and after that, it starts to decrease. From the above example the accuracy
of Model_1 < accuracy of Model_2 < accuracy of Model_3 but if we try to extrapolate this trend it
doesn’t hold true for all the models having more than 8 features.
If we think logically, some of the features provided to Model_4 do not actually contribute anything
towards analyzing the performance of the F1 driver. For example, the driver's height, hair color, car
color, car company, and marital status give no useful information to the model; the model gets
confused by all this extra information, and the accuracy starts to go down.
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with fewer dimensions.
A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2 dimensional space, and a 1-D problem to a simple line.
Principal Component Analysis (PCA):
c) Next, we need to center the values in each column by subtracting the mean column
value.
e) Finally, we calculate the eigenvalues and eigenvectors. The eigenvectors can be
sorted by the eigenvalues in descending order to provide a ranking of the
components or axes of the new subspace for A.
g) Once chosen, data can be projected onto the subspace via matrix multiplication:
P = B^T · A
where A is the original data that we wish to project, B^T is the transpose of the
chosen principal components, and P is the projection of A.
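These PCA steps can be sketched with NumPy; the small matrix A below is purely illustrative and is not taken from the notes.

import numpy as np

# A minimal PCA sketch on a tiny illustrative matrix (hypothetical data).
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Center the values in each column by subtracting the column mean.
M = A.mean(axis=0)
C = A - M

# Covariance matrix of the centered data, then its eigenvalues and eigenvectors.
V = np.cov(C.T)
eigvals, eigvecs = np.linalg.eig(V)

# Sort eigenvectors by eigenvalue in descending order to rank the components.
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order]

# Project the centered data onto the new subspace: P = B^T . A (centered).
P = B.T.dot(C.T)
print(P.T)   # projected data, one row per original pattern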
Linear Discriminant Analysis (LDA):
The first step is to calculate the separability between the different classes (i.e. the distance between the
means of the different classes), which is called the between-class variance.
The second step is to calculate the distance between the mean and the samples of each class, which is called
the within-class variance.
The third step is to construct the lower-dimensional space that maximizes the between-class
variance and minimizes the within-class variance; let P denote the projection onto this lower-dimensional space.
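As a hedged illustration, scikit-learn's LinearDiscriminantAnalysis performs this kind of projection; the iris dataset is used here only as stand-in data.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA sketch: project the data onto a space that maximizes between-class
# variance and minimizes within-class variance.
X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lower = lda.fit_transform(X, y)   # lower-dimensional projection
print(X_lower.shape)                # (150, 2)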
Classification:
Given a pattern, the task of identifying the class to which the pattern belongs is called classification.
In classification, a program learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat
or dog, etc. Classes can also be called targets, labels, or categories.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.
It can be performed on both structured and unstructured data. The process starts with predicting
the class of given data points. The classes are often referred to as target, label or categories.
Generally, a set of patterns is given where the class label of each pattern is known. This is known as
the training data.
The algorithm which implements the classification on a dataset is known as a classifier. There are
two types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Classification algorithms can be further divided into two main categories:
Linear classifier
A linear classifier makes a classification decision based on the value of a linear combination of the
features (characteristics).
Non-linear classifier
Non-linear functions can be used to separate instances that are not linearly separable.
1) Bayesian Classifier:
A Bayesian classifier can be trained by determining the mean vectors and the covariance matrices of
the discriminant functions for the abnormal and normal classes from the training data. Instead of
computing the maximum of the two discriminant functions g_abnormal(x) and g_normal(x), the
decision is based on the ratio g_abnormal(x) / g_normal(x).
A decision threshold T was set, such that if the ratio is larger than T the unknown pattern vector is
classified as abnormal, else as normal.
By changing T, the sensitivity/specificity trade-off of the Bayes classifier can be altered. A larger T
will result in lower TP and FP rates, while a smaller T will result in higher TP and FP rates.
2) Perceptron:
The basic perceptron algorithm is a binary linear classifier for supervised learning. The idea behind
the binary linear classifier can be described as follows:
h(x) = sign(θ · x + θ₀)
where x is the feature vector, θ is the weight vector, and θ₀ is the bias (offset). The sign function is used to
distinguish x as either a positive (+1) or a negative (-1) label.
The decision boundary that separates the data with different labels occurs at θ · x + θ₀ = 0.
The data will be labeled as positive in the region where θ · x + θ₀ > 0, and labeled as negative in the
region where θ · x + θ₀ < 0.
If all the instances in a given dataset are linearly separable, there exist a θ and a θ₀ such that y⁽ⁱ⁾ (θ ·
x⁽ⁱ⁾ + θ₀) > 0 for every i-th data point, where y⁽ⁱ⁾ is the label.
The figure illustrates these concepts for the 2-D case, where x = [x₁ x₂]ᵀ, θ = [θ₁ θ₂]ᵀ and θ₀ is an offset scalar.
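A minimal sketch of the perceptron update and the sign decision rule, assuming a tiny linearly separable toy dataset (the points below are made up for illustration).

import numpy as np

# Hypothetical 2-D training points with labels +1 / -1.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

theta = np.zeros(2)   # weight vector
theta0 = 0.0          # bias / offset scalar

for _ in range(10):                                     # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (np.dot(theta, xi) + theta0) <= 0:      # misclassified point
            theta += yi * xi                            # perceptron update
            theta0 += yi

# Classify a new point using the sign of theta . x + theta0
x_new = np.array([1.0, 1.0])
print(np.sign(np.dot(theta, x_new) + theta0))           # +1 or -1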
3) Nearest Neighbour classifier:
Among the various methods of supervised statistical pattern recognition, the Nearest Neighbour
rule achieves consistently high performance, without a priori assumptions about the distributions
from which the training examples are drawn.
It involves a training set of both positive and negative cases. A new sample is classified by
calculating the distance to the nearest training case; the class (sign) of that nearest case then
determines the classification of the sample.
The k-NN classifier extends this idea by taking the k nearest points and assigning the sign of the
majority. It is common to select k small and odd to break ties.
Larger k values help reduce the effects of noisy points within the training data set, and the choice of
k is often performed through cross-validation.
There are many techniques available for improving the performance and speed of a nearest
neighbour classification.
One approach to this problem is to pre-sort the training sets in some way.
Another solution is to choose a subset of the training data such that classification by the 1-NN rule
(using the subset) approximates the Bayes classifier.
This can result in significant speed improvements as k can now be limited to 1 and redundant data
points have been removed from the training set.
These data modification techniques can also improve the performance through removing points
that cause mis-classifications.
Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.
We can now use the training set to classify an unknown case (Age = 48 and Loan = $142,000) using
Euclidean distance. With K=1, the nearest neighbour is a Default=Y case, so the unknown case is classified as Default=Y.
With K=3, there are two Default=Y and one Default=N among the three closest neighbours. The prediction
for the unknown case is again Default=Y.
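Since the full Age/Loan table is not reproduced here, the training rows in the sketch below are hypothetical stand-ins; the point is only the Euclidean-distance and majority-vote mechanics of k-NN.

import numpy as np

# Hypothetical (Age, Loan) training cases and their Default labels.
train = np.array([[25, 40000], [35, 60000], [45, 80000], [20, 20000],
                  [35, 120000], [52, 18000], [23, 95000], [40, 62000],
                  [60, 100000], [48, 220000], [33, 150000]])
labels = np.array(['N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y'])

query = np.array([48, 142000])                          # Age = 48, Loan = $142,000
dist = np.sqrt(((train - query) ** 2).sum(axis=1))      # Euclidean distances

k = 3
nearest = labels[np.argsort(dist)[:k]]                  # labels of the k closest cases
values, counts = np.unique(nearest, return_counts=True)
print(values[np.argmax(counts)])                        # majority vote among the k neighbours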
4) Logistic regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
Logistic Regression is quite similar to Linear Regression except in how it is used:
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The image below
shows the logistic function:
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within a range of 0 and 1.
The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" form. The S-shaped curve is called the sigmoid function or
the logistic function.
In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend towards 1, and values
below the threshold tend towards 0.
This function can be represented as:
f(x) = 1 / (1 + e^(-x))
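A small sketch of the sigmoid mapping together with a simple threshold rule; the 0.5 threshold is an illustrative assumption.

import numpy as np

# The logistic (sigmoid) function maps any real value into the range (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])
p = sigmoid(z)
print(p)                         # probabilities between 0 and 1
print((p >= 0.5).astype(int))    # threshold: above 0.5 -> class 1, below -> class 0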
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
5) Naïve-Bayes
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms,
helping to build fast machine learning models that can make quick predictions.
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of the other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of the Naïve Bayes classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play on a particular day according to the
weather conditions. To solve this problem, we follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4
Likelihood table of the weather conditions:
Weather     No    Yes
Overcast     0     5     5/14 = 0.35
Rainy        2     2     4/14 = 0.29
Sunny        2     3     5/14 = 0.35
All          4/14 = 0.29     10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), we predict that the player can play on a Sunny day.
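The same calculation can be reproduced in Python from the table counts; exact fractions are used here, so the second posterior comes out as 0.40 rather than the rounded 0.41 above.

# Naive Bayes posterior for "Play" on a Sunny day, using the table counts.
p_sunny_given_yes = 3 / 10       # Sunny days among the 10 "Yes" days
p_sunny_given_no  = 2 / 4        # Sunny days among the 4 "No" days
p_yes, p_no, p_sunny = 10 / 14, 4 / 14, 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny  = p_sunny_given_no  * p_no  / p_sunny
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))   # 0.6 and 0.4
# Since P(Yes|Sunny) > P(No|Sunny), the prediction is to play.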
6) Decision trees
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
Below diagram explains the general structure of a decision tree:
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset.
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the
nodes; the final nodes are called leaf nodes.
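As a sketch, scikit-learn's DecisionTreeClassifier implements an optimized version of CART; the iris dataset and the max_depth value below are illustrative choices, not part of the notes.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# CART-style decision tree: recursively splits on the best attribute.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))   # classes predicted for the first five patterns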
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. To solve this problem, the decision tree starts with the root node. The
root node splits further into the next decision node (distance from the office) and one leaf node
based on the corresponding labels. The next decision node further splits into one decision node
(cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted
offer and Declined offer). Consider the diagram below:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of
that dataset." Instead of relying on one decision tree, the random forest takes the prediction from
each tree and, based on the majority vote of the predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:
Step-1: Select K random data points (a random subset) from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority vote.
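A hedged sketch with scikit-learn's RandomForestClassifier; the dataset and the choice of N = 100 trees are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random Forest: N trees built on random subsets, combined by majority vote.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # N = 100 trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the majority-vote predictions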
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
There are mainly four sectors where Random Forest is mostly used:
Boosting
Boosting algorithms are a family of algorithms that combine weak learners into a strong learner.
Instead of training the models in parallel, we can train them sequentially; this is the main idea of
boosting.
The idea behind boosting algorithms is to learn weak classifiers that are only slightly correlated with
the true classification and to combine them into a strong classifier that is well correlated with the
true classification.
A boosting algorithm iteratively learns weak classifiers and adds them to a final strong classifier. The
added weak classifiers are usually weighted according to their accuracy. After each iteration, the
training data is reweighted so that misclassified instances gain weight and correctly classified
instances lose weight. On the next iteration, the new weak learner concentrates mostly on the
previously misclassified instances. Boosting algorithms mostly differ in the reweighting approach applied
to the training set. Popular boosting algorithms include:
AdaBoost
GBM
XGBM
Light GBM
CatBoost
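As one concrete example from the list above, a short AdaBoost sketch with scikit-learn; the dataset and the number of estimators are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# AdaBoost: weak learners are trained sequentially, with misclassified
# instances reweighted so that later learners concentrate on them.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))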
As discussed before, bagging is an ensemble technique mainly used to reduce the variance of our
predictions by combining the results of multiple classifiers modelled on different sub-samples of the
same dataset.
Working of bagging
Creating multiple datasets: Sampling is done with replacement on the original dataset
and new datasets are formed from the original dataset.
Building multiple classifiers: On each of these smaller datasets, a classifier is built, usually,
the same classifier is built on all the datasets.
Combining classifiers: The predictions of all the individual classifiers are now combined to
give a better classifier, usually with much lower variance than before.
Bagging is similar to Divide and conquer. It is a group of predictive models run on multiple subsets
from the original dataset combined together to achieve better accuracy and model stability.
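A minimal bagging sketch with scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the dataset is an illustrative stand-in.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Bagging: the same base classifier is trained on bootstrap samples
# (sampling with replacement) and the predictions are combined.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))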
Clustering:
Clustering is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled data points into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the
data points into clusters, it follows the bottom-up approach: the algorithm considers each
data point as a single cluster at the beginning, and then starts combining the closest pairs of clusters.
It does this until all the clusters are merged into a single cluster that contains all the
data points.
The working of the AHC algorithm can be explained using the steps below:
Step-1: Treat each data point as a single cluster; hence, there will be N clusters at the start.
Step-2: Take the two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.
Step-4: Repeat Step-3 until only one cluster is left, giving the sequence of clusterings.
Consider the images below:
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
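A short agglomerative clustering sketch with scikit-learn; the six 2-D points and the ward linkage are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Bottom-up clustering of a tiny hypothetical 2-D dataset into two clusters.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

hca = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(hca.fit_predict(X))   # cluster label for each point, e.g. [1 1 1 0 0 0]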
Divisive Hierarchical Clustering:
Also known as the top-down approach. This algorithm also does not require prespecifying the number
of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole
data, and proceeds by splitting clusters recursively until each individual data point has been split into a
singleton cluster.
Partitioning Clustering:
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the number of
pre-defined groups. The cluster centres are created in such a way that the distance between the data
points within a cluster and their own centroid is minimal compared with the distance to the other cluster centroids.
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:
1. The centroids of the newly formed clusters do not change.
2. Data points remain in the same cluster.
3. The maximum number of iterations is reached.
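A minimal K-Means sketch; the toy points and K = 2 are assumptions for illustration, and max_iter shows one of the stopping criteria in code.

import numpy as np
from sklearn.cluster import KMeans

# Partition hypothetical toy points into K = 2 groups around centroids.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)  # max_iter is one stopping criterion
km.fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # final centroids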
Regression:
We can understand the concept of regression analysis using the example below:
Example: Suppose there is a marketing company A that runs various advertisements every year and
gets sales from them. The list below shows the advertisements made by the company in the last 5 years
and the corresponding sales:
Now the company wants to run a $200 advertisement in the year 2019 and wants to know the
prediction for the sales this year. To solve such prediction problems in machine
learning, we need regression analysis.
Types of Regression:
There are various types of regression which are used in data science and machine learning.
Linear Regression:
o Linear regression models the linear relationship between a continuous dependent variable Y and
an independent variable X, and can be represented as:
Y = aX + b
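A sketch of fitting Y = aX + b; since the advertisement/sales table is not reproduced here, the numbers below are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical advertisement spend (X) and corresponding sales (Y).
X = np.array([[90], [120], [150], [100], [130]])
y = np.array([1000, 1300, 1800, 1200, 1380])

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)   # a and b in Y = aX + b
print(reg.predict([[200]]))           # predicted sales for a $200 advertisement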
Logistic Regression:
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o Logistic regression uses the sigmoid function or logistic function, which is a complex cost
function. This sigmoid function is used to model the data in logistic regression. The function
can be represented as:
f(x) = 1 / (1 + e^(-x))
To estimate how poorly models perform, cost functions are employed. Simply put, a cost function is
a measure of how inaccurate the model is in estimating the connection between X and y. This is
usually stated as a difference or separation between the expected and actual values.
The term ‘loss' in machine learning refers to the difference between the anticipated and actual
value. The "Loss Function" is a function that is used to quantify this loss in the form of a single real
number during the training phase. These are utilised in algorithms that apply optimization
approaches in supervised learning.
There are many cost functions in machine learning and each has its use cases depending on
whether it is a regression problem or classification problem.
Regression models deal with predicting a continuous value for example salary of an employee, price
of a car, loan prediction, etc. A cost function used in the regression problem is called “Regression
Cost Function”.
The (categorical) cross-entropy cost function is used in classification problems where there are multiple
classes and each input belongs to exactly one class. Let us now understand how cross-entropy is calculated.
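A small sketch of how cross-entropy is calculated for a single example; the probabilities below are made up for illustration.

import numpy as np

# Cross-entropy for one example with three classes.
y_true = np.array([0, 1, 0])          # one-hot encoded actual class
y_pred = np.array([0.2, 0.7, 0.1])    # predicted class probabilities

cross_entropy = -np.sum(y_true * np.log(y_pred))
print(cross_entropy)                  # -log(0.7) ~ 0.357; lower is better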
K-Nearest Neighbour (K-NN):
The K-NN algorithm can be used for regression as well as for classification, but it is mostly used
for classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an
action on the dataset.
The KNN algorithm simply stores the dataset during the training phase, and when it gets new data,
it classifies that data into the category most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to
know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on
a similarity measure. Our KNN model will find the features of the new data that are similar to the cat
and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:
Below are some points to remember while selecting the value of K in the K-NN algorithm:
There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
Large values for K are good, but too large a value may cause difficulties.
In particular, three data sets are commonly used in different stages of the creation of the model:
training, validation and test sets.
The model is initially fit on a training data set, which is a set of examples used to fit the parameters
(e.g. weights of connections between neurons in artificial neural networks) of the model.
The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised
learning method, for example using optimization methods such as gradient descent or stochastic
gradient descent.
In practice, the training data set often consists of pairs of an input vector (or scalar) and the
corresponding output vector (or scalar), where the answer key is commonly denoted as the target
(or label).
The current model is run with the training data set and produces a result, which is then compared
with the target, for each input vector in the training data set.
Based on the result of the comparison and the specific learning algorithm being used, the
parameters of the model are adjusted. The model fitting can include both variable selection and
parameter estimation.
Successively, the fitted model is used to predict the responses for the observations in a second data
set called the validation data set.
The validation data set provides an unbiased evaluation of a model fit on the training data set while
tuning the model's hyperparameters (e.g. the number of hidden units, the number of layers, and the
layer widths in a neural network).
Validation datasets can be used for regularization by early stopping (stopping training when the
error on the validation data set increases, as this is a sign of over-fitting to the training data set).
This simple procedure is complicated in practice by the fact that the validation dataset's error may
fluctuate during training, producing multiple local minima. This complication has led to the creation
of many ad-hoc rules for deciding when over-fitting has truly begun.
The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the
original data set was partitioned into only two subsets, the test set might be referred to as the
validation set).
Deciding the sizes and strategies for dividing a data set into training, test, and validation sets
depends heavily on the problem and the data available.
Cross-validation
The goal of cross-validation is to test the model's ability to predict new data that was not used in
estimating it.
1) Validation:
In this method, we perform training on 50% of the given dataset and the remaining 50% is used for
testing. The major drawback of this method is that, since we train on only 50% of the dataset, the
remaining 50% may contain important information that is left out while training our model, i.e.
higher bias.
2) LOOCV (Leave One Out Cross Validation):
In this method, we perform training on the whole dataset while leaving out only one data point, and we
iterate this for each data point. It has some advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, and hence the bias is low.
3) K-Fold Cross Validation:
In this method, we split the dataset into k subsets (known as folds), then perform training on
k-1 of the subsets and leave one subset out for the evaluation of the trained model.
We iterate k times, with a different subset reserved for testing each time.
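A hedged k-fold sketch using scikit-learn's cross_val_score; the dataset, the logistic regression model, and k = 5 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# k-fold cross-validation: train on k-1 folds, evaluate on the held-out fold,
# and repeat k times with a different fold reserved for testing each time.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy value per fold
print(scores.mean())   # average performance estimate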
Imbalanced data:
This is a scenario where the number of observations belonging to one class is significantly lower
than the number belonging to the other classes.
This problem is predominant in scenarios where anomaly detection is crucial like fraudulent
transactions in banks, identification of rare diseases, etc.
In this situation, the predictive model developed using conventional machine learning algorithms
could be biased and inaccurate.
This happens because Machine Learning Algorithms are usually designed to improve accuracy by
reducing the error. Thus, they do not take into account the class distribution / proportion or
balance of classes.
Strategies for dealing with imbalanced datasets either improve the classification algorithms or balance the
classes in the training data before providing the data as input to the machine learning algorithm. The latter
technique is preferred as it has wider application.
The main objective of balancing classes is either to increase the frequency of the minority class or
to decrease the frequency of the majority class.
This is done in order to obtain approximately the same number of instances for both the classes.
Let us look at a few resampling techniques:
1) Random Under-Sampling
Random Undersampling aims to balance class distribution by randomly eliminating majority class
examples. This is done until the majority and minority class instances are balanced out.
For example, consider a dataset of 1,000 observations of which 20 are fraudulent, so the event rate is 2%.
Randomly keeping only 10% of the non-fraudulent observations (98) together with the 20 fraudulent
observations gives a new dataset of 118 observations.
Event rate for the new dataset after under-sampling = 20/118 = 17%
2) Random Over-Sampling
Over-Sampling increases the number of instances in the minority class by randomly replicating
them in order to present a higher representation of the minority class in the sample.
Using the same dataset (1,000 observations, 20 fraudulent, event rate 2%), the 20 fraudulent
observations are replicated 20 times, giving 400 minority instances alongside the 980 non-fraudulent ones.
Event rate for the new dataset after over-sampling = 400/1380 = 29%
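The under- and over-sampling arithmetic above can be sketched with plain NumPy; the 1,000-observation label array below is an assumption built to match the 2% event rate.

import numpy as np

# Hypothetical labels: 1 = fraudulent (minority), 0 = non-fraudulent (majority).
rng = np.random.default_rng(0)
y = np.array([1] * 20 + [0] * 980)            # 2% event rate
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Random under-sampling: keep a random 10% of the majority class.
under_majority = rng.choice(majority_idx, size=98, replace=False)
under_idx = np.concatenate([minority_idx, under_majority])
print(len(minority_idx) / len(under_idx))     # new event rate ~ 17%

# Random over-sampling: replicate the minority class 20 times.
over_minority = np.repeat(minority_idx, 20)
over_idx = np.concatenate([over_minority, majority_idx])
print(len(over_minority) / len(over_idx))     # new event rate ~ 29%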
3) Synthetic Minority Over-sampling Technique (SMOTE):
This technique is followed to avoid the overfitting that occurs when exact replicas of minority
instances are added to the main dataset. A subset of data is taken from the minority class as an
example, and new synthetic similar instances are then created. These synthetic instances are
added to the original dataset, and the new dataset is used as a sample to train the classification models.
For example, with 20 fraudulent observations, a sample of 15 instances is taken from the minority class
and similar synthetic instances are generated 20 times.
An alternative approach is to modify existing classification algorithms to make them appropriate
for imbalanced datasets.
The main objective of the ensemble methodology is to improve the performance of single classifiers.
The approach involves constructing several two-stage classifiers from the original data and then
aggregating their predictions.
Bagging is used for reducing Overfitting in order to create strong learners for generating accurate
predictions. Unlike boosting, bagging allows replacement in the bootstrapped sample.
Consider again a dataset with an event rate of 2%. Ten bootstrapped samples are chosen from the
population with replacement. Each sample contains 200 observations, and each sample is different from
the original dataset but resembles it in distribution and variability.
Machine learning algorithms such as logistic regression, neural networks, or decision trees are fitted to
each bootstrapped sample of 200 observations, and the classifiers c1, c2, ..., c10 are aggregated to
produce a compound classifier. This ensemble methodology produces a stronger compound
classifier, since it combines the results of the individual classifiers to come up with an improved one.
Confusion matrix:
A confusion matrix is a technique for summarizing the performance of a classification algorithm.
Classification accuracy alone can be misleading if you have an unequal number of observations in
each class or if you have more than two classes in your dataset.
Calculating a confusion matrix can give you a better idea of what your classification model is getting
right and what types of errors it is making.
The number of correct and incorrect predictions are summarized with count values and broken
down by each class. This is the key to the confusion matrix.
The confusion matrix shows the ways in which your classification model
is confused when it makes predictions.
It gives you insight not only into the errors being made by your classifier but more importantly the
types of errors that are being made.
To calculate a confusion matrix, you need a test dataset or a validation dataset with expected outcome
values, and a prediction from the model for each row in that dataset. The matrix is then laid out as follows:
Expected down the side: each row of the matrix corresponds to an actual (expected) class.
Predicted across the top: each column of the matrix corresponds to a predicted class.
The total number of correct predictions for a class goes into the expected row for that class value and
the predicted column for that same class value.
In the same way, the total number of incorrect predictions for a class goes into the expected row for
that class value and the predicted column of the class that was actually predicted.
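A small sketch with scikit-learn's confusion_matrix, whose rows are the actual (expected) classes and whose columns are the predicted classes; the label lists are made up for illustration.

from sklearn.metrics import confusion_matrix

# Expected (actual) labels versus predicted labels for a binary problem.
expected  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(expected, predicted))
# [[3 1]    3 true negatives, 1 false positive
#  [2 4]]   2 false negatives, 4 true positives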
Evaluation metrics
Evaluation metrics are used to measure the quality of the statistical or machine learning model.
Evaluating machine learning models or algorithms is essential for any project.
There are many different types of evaluation metrics available to test a model. These include
classification accuracy, logarithmic loss, confusion matrix, and others.
Classification accuracy is the ratio of the number of correct predictions to the total number of input
samples, which is usually what we refer to when we use the term accuracy.
In a classification task, the precision for a class is the number of true positives divided by the total
number of elements the classifier labeled as belonging to the positive class (i.e. true positives plus false positives).
Logarithmic loss, also called log loss, works by penalizing the false classifications. Log loss is one of
the most popular measurements of error in applied machine learning.
Errors play an essential role in the machine learning process, as discovering and minimizing them
ultimately maximizes the accuracy of the process.
Evaluating a model or algorithm typically involves using a combination of these individual evaluation
metrics.