ML Unit-2
Regression is a type of supervised learning used to predict continuous numeric values. The
goal of regression is to model the relationship between input features (independent variables)
and the target variable (dependent variable).
Examples:
• Predicting house prices based on features like size, location, and age.
• Forecasting stock prices or sales revenue.
• Estimating temperature based on weather conditions.
Key Characteristics:
• Output: Continuous numeric values (e.g., y=32.5).
• Evaluation Metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-
squared (R2).
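As a quick illustration (with made-up true and predicted values, not data from these notes), these metrics can be computed with scikit-learn:

# Minimal sketch: common regression metrics with scikit-learn.
# The y_true / y_pred values below are hypothetical, for illustration only.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, 5.5, 7.2, 10.1]   # observed target values (hypothetical)
y_pred = [2.8, 6.0, 7.0, 9.5]    # model predictions (hypothetical)

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))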
Distance Metrics: Distance is a measure of similarity or dissimilarity between two data points.
K-Nearest Neighbour
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
• Firstly, we will choose the number of neighbors, so we will choose the k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:
• By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
• As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
Example: The table below represents our data set. We have two columns Brightness and Saturation.
Each row in the table has a class of either Red or Blue. Before we introduce a new data entry, let's
assume the value of K is 5.
Here's the new data entry:
We have a new entry, but it doesn't have a class yet. To know its class, we must calculate the
distance from the latest entry to other entries in the data set using the Euclidean distance
formula.
Here's the formula: d = √((X₂ − X₁)² + (Y₂ − Y₁)²)
Where:
• X₂ = New entry's brightness (20).
• X₁= Existing entry's brightness.
• Y₂ = New entry's saturation (35).
• Y₁ = Existing entry's saturation.
Let's do the calculation together. I'll calculate the first three.
Distance #1
For the first row, d1:
Here's what the table will look like after all the distances have been calculated:
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
As you can see above, the majority class within the 5 nearest neighbors to the new entry is Red.
Therefore, we'll classify the new entry as Red.
New entry: Brightness = 20, Saturation = 35 → Class = Red
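For reference, the same procedure can be run with scikit-learn. The training rows below are illustrative stand-ins (the original table values are not reproduced here); only the new entry (brightness = 20, saturation = 35) and K = 5 come from the example above:

# Minimal KNN sketch with scikit-learn (hypothetical training rows).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[40, 20], [50, 50], [60, 90], [10, 25], [70, 70],
           [60, 10], [25, 80], [55, 45], [30, 35], [15, 30]]
y_train = ["Red", "Blue", "Blue", "Red", "Blue",
           "Red", "Blue", "Blue", "Red", "Red"]

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[20, 35]]))  # class of the new entry (majority of the 5 nearest)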
Decision Trees
Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes
a test on an attribute, each branch represents an outcome of the test, and each leaf node (or
terminal node) holds a class label. The topmost node in a tree is the root node. A typical decision
tree is shown in below figure. It represents the concept buys computer, that is, it predicts
whether a customer at AllElectronics is likely to purchase a computer. Rectangles denote
internal nodes, and ovals denote leaf nodes. Some decision tree algorithms produce only binary
trees (where each internal node branches to exactly two other nodes), whereas others can
produce nonbinary trees.
A decision tree for the concept buys computer, indicating whether an AllElectronics customer
is likely to purchase a computer. Each internal (nonleaf) node represents a test on an
attribute. Each leaf node represents a class
“How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision tree. A
path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
“Why are decision tree classifiers so popular?” The construction of decision tree classifiers
does not require any domain knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery. Decision trees can handle multidimensional data. Their
representation of acquired knowledge in tree form is intuitive and generally easy to assimilate
by humans. The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy. However, successful use may depend
on the data at hand. Decision tree induction algorithms have been used for classification in
many application areas such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology. Decision trees are the basis of several commercial rule
induction systems.
Information Gain
ID3 uses information gain as its attribute selection measure. The expected information needed
to classify a tuple in D is given by
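Info(D) = −Σ pi log2(pi), summed over the m classes i = 1, …, m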
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D| / |D|. Note that, at this point, the information we have is based solely on the
proportions of tuples of each class. Info(D) is also known as the entropy of D. How much more
information would we still need (after the partitioning) to arrive at an exact classification? This
amount is measured by
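Info_A(D) = Σ (|Dj| / |D|) × Info(Dj), summed over the v partitions D1, …, Dv produced by splitting D on attribute A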
Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after
partitioning on A). That is,
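Gain(A) = Info(D) − Info_A(D)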
The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (i.e., m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute
the information gain of each attribute.
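We first compute the expected information needed to classify a tuple in D:
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits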
Next, we need to compute the expected information requirement for each attribute. Let’s start
with the attribute age. We need to look at the distribution of yes and no tuples for each category
of age. For the age category “youth,” there are two yes tuples and three no tuples. For the
category “middle aged,” there are four yes tuples and zero no tuples.For the category “senior,”
there are three yes tuples and two no tuples.
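Hence
Info_age(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × 0 + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits,
so Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.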
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute's values. Notice that the tuples falling into the partition for age = middle aged all belong to the same class. Because they all belong to class "yes," a leaf should therefore be created at the end of this branch and labeled "yes."
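The same calculation can be checked with a few lines of Python, using only the class counts stated above (9 yes / 5 no overall; 2/3, 4/0 and 3/2 for the three age groups):

# Sketch: computing Info(D), Info_age(D) and Gain(age) from the stated counts.
from math import log2

def info(counts):
    """Entropy of a class-count list, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

D = [9, 5]                                  # yes / no counts in D
age_partitions = [[2, 3], [4, 0], [3, 2]]   # youth, middle aged, senior

info_D = info(D)
info_age = sum(sum(p) / sum(D) * info(p) for p in age_partitions)
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# -> 0.94 0.694 0.246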
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to
select attributes having a large number of values. For example, consider an attribute that acts
as a unique identifier such as product ID. A split on product ID would result in a large number
of partitions (as many as there are values), each one containing just one tuple. Because each
partition is pure, the information required to classify data set D based on this partitioning would
be Info_product ID(D) = 0. Therefore, the information gained by partitioning on this attribute is
maximal. Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which
attempts to overcome this bias. It applies a kind of normalization to information gain using a
“split information” value defined analogously with Info(D) as
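SplitInfo_A(D) = −Σ (|Dj| / |D|) × log2(|Dj| / |D|), summed over the v partitions produced by the split on A.
The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfo_A(D).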
The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however,
that as the split information approaches 0, the ratio becomes unstable. A constraint is added to
avoid this, whereby the information gain of the test selected must be at least as great as
the average gain over all tests examined.
Computation of gain ratio for the attribute income. A test on income splits the data of Table
8.1 into three partitions, namely low, medium, and high, containing four, six, and four tuples,
respectively. To compute the gain ratio of income.
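SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557,
so GainRatio(income) = 0.029 / 1.557 ≈ 0.019.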
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index
measures the impurity of D, a data partition or set of training tuples, as
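Gini(D) = 1 − Σ pi², where pi is the probability that a tuple in D belongs to class Ci (estimated by |Ci,D| / |D|) and the sum runs over the m classes.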
For example, if income has three possible values, namely {low, medium, high}, then the
possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low},
{medium}, {high}, and {}. We exclude the full set, {low, medium, high}, and the empty set
from consideration since, conceptually, they do not represent a split. If a binary split on A
partitions D into D1 and D2, the Gini index of D given that partitioning is
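Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)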
The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-
valued attribute A is
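ΔGini(A) = Gini(D) − Gini_A(D)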
Induction of a decision tree using the Gini index. There are nine tuples belonging to the class
buys_computer = yes and the remaining five tuples belong to the class buys_computer = no. A
(root) node N is created for the tuples in D. Use the Gini index to compute the impurity of D:
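Gini(D) = 1 − (9/14)² − (5/14)² = 0.459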
We need to compute the Gini index for each attribute. Let’s start with the attribute income and
consider each of the possible splitting subsets. Consider the subset {low, medium}. This would
result in 10 tuples in partition D1 satisfying the condition “income ∈ {low, medium}.” The
remaining four tuples of D would be assigned to partition D2.
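Using the class counts of the AllElectronics table for these partitions (7 yes / 3 no in D1 and 2 yes / 2 no in D2, an assumption since the table is not reproduced here):
Gini_income ∈ {low,medium}(D) = (10/14) × (1 − (7/10)² − (3/10)²) + (4/14) × (1 − (2/4)² − (2/4)²) ≈ 0.443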
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets
{low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore,
the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes
the Gini index. The attribute and split with the lowest Gini index are selected as the splitting criterion.
Naive Bayes
“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes’ theorem, described next. Studies comparing
classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian
classifier to be comparable in performance with decision tree and selected neural network
classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to
large databases.
Naïve Bayesian Classification:
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X. That is,
the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
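P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
By Bayes’ theorem, P(Ci | X) = P(X | Ci) P(Ci) / P(X); since P(X) is the same for every class, only P(X | Ci) P(Ci) needs to be maximized.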
Thus, we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum a posteriori hypothesis.
Example: Predicting a class label using naïve Bayesian classification. We wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as discussed in decision tree induction. The data tuples are described by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to classify is X = (age = youth, income = medium, student = yes, credit rating = fair).
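The full probability calculation is not reproduced here. A minimal sketch with scikit-learn's CategoricalNB is shown below; the encoded rows follow the class distributions stated in these notes for the AllElectronics data, but should be treated as an illustrative encoding rather than the exact textbook table:

# Sketch: naive Bayes on label-encoded categorical data with scikit-learn.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# columns: age(0=youth,1=middle,2=senior), income(0=low,1=medium,2=high),
#          student(0=no,1=yes), credit(0=fair,1=excellent)  -- illustrative rows
X = np.array([[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
              [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0],
              [0, 0, 1, 0], [2, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1],
              [1, 2, 1, 0], [2, 1, 0, 1]])
y = np.array(["no", "no", "yes", "yes", "yes", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes", "no"])  # buys_computer

clf = CategoricalNB()
clf.fit(X, y)
x_new = np.array([[0, 1, 1, 0]])   # youth, medium income, student, fair credit
print(clf.predict(x_new), clf.predict_proba(x_new))  # predicted class and P(Ci | X)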
Linear Models
Linear models are statistical techniques used to describe the relationship between one or more
independent variables (x) and a dependent variable (y). They assume that this relationship can
be expressed as a linear equation.
General Form of Linear Models
The general form of a linear model is:
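y = β0 + β1x1 + β2x2 + … + βkxk + ϵ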
Where:
• y: Dependent variable (response variable).
• x1,x2,…,xk: Independent variables (predictors).
• β0: Intercept (value of y when all x variables are 0).
• β1,β2,…,βk: Coefficients representing the effect of each independent variable on y.
• ϵ: Error term representing the difference between the observed and predicted y.
Linear models are widely used because they are simple, interpretable, and effective in many
practical applications.
Linear Regression
Linear Regression is one of the simplest machine learning algorithms. It comes under the supervised learning technique and is used for solving regression problems, i.e., for predicting a continuous dependent variable with the help of independent variables.
The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable. If a single independent variable is used for prediction, it is called Simple Linear Regression; if two or more independent variables are used, it is called Multiple Linear Regression. By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables, and this relationship should be linear in nature. The output of linear regression should only be continuous values such as price, age, salary, etc. The relationship between the dependent variable and independent variable can be shown in the below image:
In the above image, the dependent variable (salary) is on the y-axis and the independent variable (experience) is on the x-axis. The regression line can be written as:
The formula y=α+β*x is the common way to represent the equation of a simple linear
regression model, where:
• y: Dependent variable (e.g., Salary)
• x: Independent variable (e.g., Experience)
• α: Intercept (value of y when x=0)
• β: Slope (rate of change of y with respect to x)
This representation is equivalent to y=c+mx, but with different symbols.
The slope β in linear regression is usually calculated as:
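β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²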
where x̄ and ȳ are the means of the x and y values; the intercept is then α = ȳ − β·x̄.
Prediction
For 6 years of experience (x=6):
y=30,000+3,800*6=30,000+22,800=52,800
The predicted salary is $52,800.
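A minimal sketch of fitting such a model with scikit-learn; the experience/salary pairs are hypothetical and are chosen to lie exactly on the line y = 30,000 + 3,800x so the fitted coefficients are easy to check:

# Sketch: simple linear regression with scikit-learn (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # years of experience
y = 30_000 + 3_800 * X.ravel()            # salaries placed exactly on a line

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # ~30000.0, ~3800.0
print(model.predict([[6]]))               # ~[52800.]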
Multiple Linear Regression
Multiple linear regression models the relationship between one dependent variable (y) and two
or more independent variables (x1,x2,…,xk). It generalizes simple linear regression to handle
multiple predictors.
The equation for multiple linear regression is:
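y = β0 + β1x1 + β2x2 + … + βkxk + ϵ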
Where:
• y: Dependent variable (response variable).
• x1,x2,…,xk: Independent variables (predictors).
• β0: Intercept (value of y when all x variables are 0).
• β1,β2,…,βk: Coefficients representing the effect of each independent variable on y.
• ϵ: Error term representing the difference between the observed and predicted y.
Example:Predicting House Prices
You want to predict the price of a house (y) based on:
1. Size in square feet (x1).
2. Number of Bedrooms (x2).
3. Distance to the City Center (x3).
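A minimal sketch for this example with scikit-learn; all data values below are hypothetical:

# Sketch: multiple linear regression for the house-price example.
# Column order: (size_sqft, bedrooms, distance_to_center_km); rows are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1500, 3, 10], [2000, 4, 8], [1200, 2, 15],
              [1800, 3, 5],  [2500, 4, 3]])
y = np.array([300_000, 400_000, 220_000, 380_000, 520_000])  # prices

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # beta_0 and (beta_1, beta_2, beta_3)
print(model.predict([[1600, 3, 7]]))   # predicted price for a new house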
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It can be used for classification as well as regression problems, but it is mainly used for classification. Logistic regression is used to predict a categorical dependent variable with the help of independent variables, and its output always lies between 0 and 1.
Logistic regression can be used wherever the probability of belonging to one of two classes is required, for example whether it will rain today or not (0 or 1, true or false). Logistic regression is based on the concept of Maximum Likelihood Estimation: the parameters are chosen so that the observed data are most probable. In logistic regression, we pass the weighted sum of inputs through an activation function that maps values between 0 and 1. This activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. Consider the below image:
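σ(z) = 1 / (1 + e^(−z))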
where z = β0 + β1x1 + … + βkxk is the weighted sum of the inputs, and σ(z) always lies between 0 and 1.
Types of Logistic Regression
1. Binary Logistic Regression:
o Predicts one of two possible outcomes.
o Example: Predicting if a customer will buy a product (yes/no).
2. Multinomial Logistic Regression:
o Predicts outcomes with three or more unordered categories.
o Example: Predicting the type of transport (car, bus, train).
3. Ordinal Logistic Regression:
o Predicts outcomes with three or more ordered categories.
o Example: Predicting customer satisfaction (low, medium, high).
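A minimal sketch of the binary case (type 1 above) with scikit-learn; the hours-studied values and pass/fail labels are hypothetical:

# Sketch: binary logistic regression with scikit-learn (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # hours studied
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)      # beta_0 and beta_1
print(clf.predict_proba([[4.5]]))     # [P(fail), P(pass)] for 4.5 hours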
Steps to calculate β0 , β1
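The worked steps are not reproduced here. As a rough sketch, β0 and β1 can be estimated numerically by gradient ascent on the log-likelihood (using hypothetical one-feature data):

# Sketch: estimating beta_0 and beta_1 by gradient ascent on the log-likelihood.
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)   # hypothetical feature values
y = np.array([0, 0, 1, 0, 0, 1, 1, 1], dtype=float)   # hypothetical 0/1 labels

b0, b1, lr = 0.0, 0.0, 0.01
for _ in range(10_000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))   # sigmoid of the linear predictor
    b0 += lr * np.sum(y - p)                    # gradient w.r.t. beta_0
    b1 += lr * np.sum((y - p) * X)              # gradient w.r.t. beta_1
print(b0, b1)   # estimated intercept and slope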
Multinomial Logistic Regression
Multinomial logistic regression is an extension of binary logistic regression that is used when
the dependent variable (y) has more than two categories. Here, y is categorical and takes on
one of k classes (y∈{1,2,…,k}).
Example Problem
Dataset
We have the following dataset with x1 (Age) and x2 (Education Level) as predictors and y (Job
Type) as the outcome variable. y has three categories: y=1 (Engineer), y=2 (Teacher), and y=3
(Doctor).
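A minimal sketch with scikit-learn; the rows below are hypothetical stand-ins for the Age/Education table, with education ordinal-encoded and job types coded 1 = Engineer, 2 = Teacher, 3 = Doctor:

# Sketch: multinomial logistic regression (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 1], [30, 2], [35, 3], [28, 2], [45, 3],
              [40, 1], [33, 2], [50, 3], [27, 1], [38, 2]])  # (age, education)
y = np.array([1, 1, 3, 2, 3, 2, 1, 3, 2, 1])                  # job type codes

clf = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial (softmax) by default
print(clf.predict([[32, 2]]))          # predicted job type
print(clf.predict_proba([[32, 2]]))    # probabilities for classes 1, 2, 3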
Ordinal Logistic Regression:
Ordinal Logistic Regression is used when the dependent variable (y) is ordinal—i.e., it has
a natural order but the distances between categories are not assumed to be equal. For instance,
ratings like "Low," "Medium," and "High" have an order, but the difference between "Low"
and "Medium" may not be the same as between "Medium" and "High."
Example:
Consider the following dataset where we predict customer satisfaction (y) based on their
monthly income (x1) and hours spent shopping online (x2):
Here:
• y (Satisfaction) is the ordinal dependent variable with values 1 (Low), 2 (Medium), 3
(High).
• x1 (Income) and x2 (Hours Online) are independent variables.
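The model for the cumulative probabilities is:
logit(P(y ≤ j)) = ln( P(y ≤ j) / (1 − P(y ≤ j)) ) = θj − (β1x1 + β2x2)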
Where:
• j is the category (j=1,2).
• θj are the threshold parameters (cut-points) separating the categories.
• β1 and β2 are the coefficients for x1 (Income) and x2(Hours Online), respectively.
The cumulative probabilities are modeled as:
1. P(y≤1) (Low satisfaction or below),
2. P(y≤2) (Medium satisfaction or below).
The probability for y=3(High satisfaction) is:
P(y=3) = 1−P(y≤2)
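A small sketch turning the cumulative-logit model into the three class probabilities; the θ and β values below are made-up numbers purely for illustration:

# Sketch: from cumulative logits to P(Low), P(Medium), P(High).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = [2.0, 4.0]             # cut-points theta_1, theta_2 (assumed values)
beta = np.array([0.05, 0.3])   # coefficients for income (in $1000s) and hours online (assumed)
x = np.array([40, 5])          # one customer: income = 40k, 5 hours online

p_le_1 = sigmoid(theta[0] - beta @ x)       # P(y <= 1): Low or below
p_le_2 = sigmoid(theta[1] - beta @ x)       # P(y <= 2): Medium or below
print([p_le_1, p_le_2 - p_le_1, 1 - p_le_2])  # P(Low), P(Medium), P(High)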
Difference between Linear regression vs Logistic Regression
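In brief, drawing on the two sections above:
• Linear regression predicts a continuous numeric output; logistic regression predicts a categorical outcome via a probability.
• Linear regression fits a straight best-fit line; logistic regression fits an S-shaped sigmoid curve.
• Linear regression is typically estimated by least squares; logistic regression by maximum likelihood estimation.
• Linear regression output is unbounded; logistic regression output always lies between 0 and 1.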
Generalized Linear Model (GLM)
A Generalized Linear Model (GLM) is an extension of ordinary linear regression that allows
the dependent variable (y) to have a distribution other than the normal distribution. GLMs are
highly flexible and can handle different types of data such as binary, count, and categorical
outcomes.
Key Components of a GLM
1. Random Component:
• Specifies the probability distribution of the response variable (y).
• Examples: Normal, Binomial, Poisson, etc.
2. Systematic Component:
• A linear predictor is used to combine the independent variables (x1,x2,…).
• Form: η = β0 + β1x1 + β2x2 + …
• Where η is the linear predictor.
3. Link Function:
• Links the expected value of the response variable (E(y)=μ) to the linear
predictor.
• Form: g(μ)=η
Types of GLMs
1. Linear Regression
• Distribution: Normal.
• Link Function: Identity g(μ)=η.
• Example: Predicting house prices based on area and location.
Price=β0+β1⋅Area+β2⋅Location
2. Logistic Regression
• Distribution: Binomial.
• Link Function: Logit, g(μ) = ln(μ / (1 − μ)).
• Example: Predicting whether a student passes (1) or fails (0) based on study hours.
ln(μ / (1 − μ)) = β0 + β1⋅Study Hours
3. Poisson Regression
• Distribution: Poisson.
• Link Function: Log, g(μ) = ln(μ).
• Example: Predicting the number of customers arriving at a store based on
time of day.
ln(μ)=β0+β1⋅Time of Day
4. Multinomial Logistic Regression
• Distribution: Multinomial.
• Link Function: Generalized Logit.
• Example: Predicting the choice of transportation (car, bus, train) based on
income and distance.
Example:
A hospital wants to model the number of daily patient arrivals (y) based on the number of staff
on duty (x1) and whether it is a weekend (x2).
Data:
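The data table is not reproduced here; a minimal Poisson GLM sketch with statsmodels and hypothetical rows:

# Sketch: Poisson GLM for the hospital example (hypothetical data).
import numpy as np
import statsmodels.api as sm

staff    = np.array([5, 8, 6, 10, 4, 9, 7, 12])     # staff on duty
weekend  = np.array([0, 0, 1, 0, 1, 0, 1, 0])       # 1 = weekend
arrivals = np.array([20, 28, 30, 35, 22, 31, 33, 40])  # daily patient arrivals

X = sm.add_constant(np.column_stack([staff, weekend]))  # adds the intercept column
model = sm.GLM(arrivals, X, family=sm.families.Poisson()).fit()
print(model.params)                       # beta_0, beta_1 (staff), beta_2 (weekend), log scale
print(model.predict([[1.0, 7.0, 1.0]]))   # expected arrivals with 7 staff on a weekend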
Support Vector Machines
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train the model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Because SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will consider the extreme cases of cats and dogs and, on the basis of the support vectors, classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM:
• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane, Support Vectors and Margin in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance to the closest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Margin
The margin is the separation gap between the two lines drawn through the closest data points of each class. It is calculated as the perpendicular distance from the decision boundary to the support vectors (the closest data points). In SVM, we try to maximize this separation gap so that we get the maximum margin.
Since this is a 2-D space, we can separate the two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called the hyperplane. The SVM algorithm finds the closest points of both classes to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
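A minimal sketch of a linear SVM on a hypothetical 2-D dataset with scikit-learn; support_vectors_ returns the extreme points discussed above:

# Sketch: linear SVM on toy 2-D data (hypothetical points).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)        # the support vectors
print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane
print(clf.predict([[4, 4]]))       # classify a new point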
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be calculated as: z = x² + y²
By adding the third dimension, the sample space becomes as in the below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are now in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circular boundary in the original x–y plane.
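A small sketch of this idea using scikit-learn's make_circles data (generated, not from these notes): points that a linear SVM cannot separate in (x, y) become separable once the extra feature z = x² + y² is added:

# Sketch: adding z = x**2 + y**2 makes circular data linearly separable.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)   # the extra dimension
X3 = np.hstack([X, z])                              # (x, y, z)

linear_2d = SVC(kernel="linear").fit(X, y)          # struggles in 2-D
linear_3d = SVC(kernel="linear").fit(X3, y)         # separable in 3-D
print(linear_2d.score(X, y), linear_3d.score(X3, y))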
Kernel trick
In practice, the SVM algorithm is implemented using a kernel, via a technique called the kernel trick. In simple words, a kernel is a function that maps the data to a higher-dimensional space where the data become separable. A kernel transforms a low-dimensional input data space into a higher-dimensional space, thereby converting non-linearly separable problems into linearly separable ones by adding more dimensions. Thus, the kernel trick helps us build a more accurate classifier, and it is especially useful in non-linear separation problems.
We can define a kernel function as K(x, y) = φ(x)·φ(y), the inner product of the mapped inputs, where φ is the mapping into the higher-dimensional feature space.
Linear kernel
In linear kernel, the kernel function takes the form of a linear function as follows-
Linear kernel: K(xi, xj) = xiᵀ xj
Linear kernel is used when the data is linearly separable. It means that data can be separated
using a single line. It is one of the most common kernels to be used. It is mostly used when
there are large number of features in a dataset. Linear kernel is often used for text classification
purposes.Training with a linear kernel is usually faster, because we only need to optimize the
C regularization parameter. When training with other kernels, we also need to optimize the γ
parameter. So, performing a grid search will usually take more time.Linear kernel can be
visualized with the following figure.
Linear Kernel
Polynomial Kernel
Polynomial kernel represents the similarity of vectors (training samples) in a feature space over
polynomials of the original variables. The polynomial kernel looks not only at the given features of the input samples to determine their similarity, but also at combinations of these features. For degree-d polynomials, the polynomial kernel is defined as follows –
Polynomial kernel: K(xi, xj) = (γ xiᵀ xj + r)ᵈ, γ > 0
Polynomial kernel is very popular in Natural Language Processing. The most common degree
is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems. It can be visualized
with the following diagram.
Polynomial Kernel
Radial Basis Function Kernel
Radial basis function kernel is a general purpose kernel. It is used when we have no prior
knowledge about the data. The RBF kernel on two samples x and y is defined by the following
equation –
Radial Basis Function kernel: K(x, y) = exp(−γ ‖x − y‖²), γ > 0
The following diagram demonstrates the SVM classification with RBF kernel.
SVM Classification with RBF kernel
Sigmoid kernel
Sigmoid kernel has its origin in neural networks. We can use it as the proxy for neural networks.
Sigmoid kernel is given by the following equation.
Sigmoid kernel: K(x, y) = tanh(α xᵀ y + c)
Sigmoid kernel can be visualized with the following diagram-
Sigmoid kernel
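For reference, the four kernels above correspond to the kernel parameter of scikit-learn's SVC; a rough comparison on one generated dataset (results will vary with the data and hyperparameters):

# Sketch: comparing SVC kernels on generated two-moons data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))   # test accuracy per kernel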
Types of Classification Models: Binary, Multiclass, and Multilabel
Machine learning classification has numerous applications across various fields. It ranges from
spam detection in emails to medical diagnosis and sentiment analysis in customer reviews.
Binary classification is a fundamental aspect of machine learning, categorizing data into two
distinct classes. This method is essential for tasks like email spam detection and medical
diagnostics. It provides a clear decision boundary, making it a cornerstone of many
applications.
Logistic regression is a widely used algorithm for binary classification. It determines the
probability of a sample falling into one of two classes. This approach is particularly effective
when a simple yes-or-no decision is necessary.
When evaluating binary classification models, several key metrics are crucial. Accuracy
measures the overall correctness, while precision focuses on true positives. Recall evaluates
the model's ability to identify all positive instances. The F1 score, a balanced measure,
combines precision and recall to assess model performance.
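These metrics can be computed with scikit-learn; the true/predicted labels below are hypothetical:

# Sketch: binary classification metrics with scikit-learn (hypothetical labels).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))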
Despite its simplicity and efficiency, binary classification faces challenges with imbalanced
datasets and setting appropriate decision boundaries.
Multiclass Classification: Expanding Possibilities
Multiclass classification elevates binary classification by categorizing data into three or more
classes. This method is essential for tackling complex problems that require more than simple
yes-or-no answers. It's a powerful tool for a variety of real-world applications.
In multiclass classification, each data point is assigned to one of several classes. Unlike binary
classification, which limits itself to two categories, multiclass models can manage multiple
distinct groups. This flexibility makes it a valuable asset for many tasks.
• One-vs-rest strategy: Trains a separate classifier for each class against all others
• One-vs-one strategy: Trains a separate classifier for every pair of classes
These strategies facilitate the effective management of complex classification tasks. They make multiclass classification a versatile tool in machine learning.
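A minimal sketch of the one-vs-rest and one-vs-one strategies with scikit-learn's wrappers, using the built-in iris dataset (three classes):

# Sketch: OvR and OvO wrappers around a binary classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# k classes -> k OvR models and k*(k-1)/2 OvO models (both equal 3 here)
print(len(ovr.estimators_), len(ovo.estimators_))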
Softmax activation: a direct multiclass approach that outputs probabilities for each class, but requires a neural network architecture.
Multi-label classification is a method designed to handle complex scenarios where data points
can fit into multiple categories at once. Unlike traditional binary or multiclass models, this
technique allows for detailed categorization. It's particularly useful for tasks such as document
tagging and image annotation.
In this approach, each instance can be linked to several labels. For instance, a news article could
be classified under "politics," "economy," and "international affairs" simultaneously. This
flexibility is essential for real-world applications where items often possess multiple attributes
or fall into overlapping categories.
One common strategy in multi-label classification is binary relevance. This method breaks
the problem down into several binary classification tasks, one for each label. Although
straightforward, it might not fully capture the relationships between labels. Other strategies,
like label powerset and algorithm adaptation methods, focus on enhancing these relationships.
When evaluating multi-label models, specialized metrics are necessary. Hamming loss
measures the proportion of incorrectly predicted labels. Precision at k and recall at k evaluate
the model's performance for the top k predicted labels. These metrics are crucial for assessing
the model's accuracy in complex labeling scenarios.
However, multi-label classification also presents challenges. Managing label
correlations and dealing with large label spaces can be complex. As the number of possible
label combinations increases, so does the computational complexity. Researchers are
continually exploring new techniques to address these challenges and enhance multi-label
classification performance across various domains.
MNIST
The MNIST dataset stands for "Modified National Institute of Standards and Technology".
The dataset contains a large collection of handwritten digits that is commonly used for training
various image processing systems. The dataset was created by re-mixing samples from NIST's
original datasets, which were taken from American Census Bureau employees and high school
students. It is designed to help scientists develop and test machine learning algorithms in
pattern recognition and machine learning. It contains 60,000 training images and 10,000 testing
images, each of which is a grayscale image of size 28x28 pixels.
The MNIST dataset is a collection of 70,000 handwritten digits (0-9), with each image being
28x28 pixels. Here is the dataset information in the specified format:
• Target: Column represents the digit (0-9) corresponding to the handwritten image
• Pixel 1-784: Each pixel value (0-255) represents the grayscale intensity of the
corresponding pixel in the image.
1. Training Set: Consists of 60,000 images along with their labels, commonly
used for training machine learning models.
2. Test Set: Contains 10,000 images with their corresponding labels, used for
evaluating the performance of trained models.
The MNIST dataset, which currently represents a primary input for many tasks in image
processing and machine learning, can be traced back to the National Institute of Standards and
Technology (NIST). NIST, a US government agency focused on measurement science and
standards, curates various datasets, including two particularly relevant to handwritten digits:
• Special Database 3 (SD-3): This set was collected from US Census Bureau employees. Because these writers handle written figures on a regular basis, their samples are comparatively clean and easy to recognize, which makes them a convenient source for training.
• Special Database 1 (SD-1): This set was collected from high-school students. These samples look less "official" than those from the Census Bureau, but they cover a much wider variety of writing styles.
These datasets could not be used directly; they first had to be transformed and divided into separate training and test data for machine learning models. NIST's original split between the two collections created a potential bias:
• SD-3 was used as the training set. Because it came from experienced writers, a model trained only on these "clean" digits could become overly biased toward them.
• SD-1 was used as the test set. Without being exposed to more varied writing styles during training, a model could perform poorly when tested on SD-1.
To tackle this bias and obtain a more balanced dataset, the MNIST developers mixed samples from the two NIST special databases, so that both the MNIST training set and the MNIST test set contain digits drawn from SD-1 and SD-3. By using this approach, the data used for training and testing became more representative of the wide range of writing styles, resulting in more generally applicable image processing and machine learning models.
Loading the MNIST dataset in Python can be done in several ways, depending on the libraries
and tools you prefer to use. Below are some of the most common methods to load the MNIST
dataset using different Python libraries:
Loading MNIST dataset using TensorFlow/Keras
This code snippet loads the MNIST dataset using Keras, retrieves the training images and labels, and then plots four images in a row with their corresponding labels. Each image is displayed in grayscale.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

# Load the MNIST training and test splits
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Plot the first four training images with their labels
plt.figure(figsize=(10, 5))
for i in range(4):
    plt.subplot(1, 4, i+1)
    plt.imshow(X_train[i], cmap='gray')
    plt.title(f"Label: {y_train[i]}")
    plt.axis('off')
plt.tight_layout()
plt.show()
Output:
Loading MNIST dataset Using PyTorch
In this example, we explore how to load the MNIST dataset with PyTorch. PyTorch offers a similar utility through torchvision.datasets, which is very convenient, especially when combined with torchvision.transforms to perform basic preprocessing like converting images to tensor format.
import matplotlib.pyplot as plt
import torch
from torchvision import datasets, transforms

# Define the transformation to convert images to PyTorch tensors
transform = transforms.Compose([transforms.ToTensor()])

# Load the MNIST dataset with the specified transformation
mnist_pytorch = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Create a DataLoader to load the dataset in batches
train_loader_pytorch = torch.utils.data.DataLoader(mnist_pytorch, batch_size=1, shuffle=False)

# Create a figure to display the images
plt.figure(figsize=(15, 3))

# Print the first few images in a row
for i, (image, label) in enumerate(train_loader_pytorch):
    if i < 5:  # Print the first 5 samples
        plt.subplot(1, 5, i + 1)
        plt.imshow(image[0].squeeze(), cmap='gray')
        plt.title(f"Label: {label.item()}")
        plt.axis('off')
    else:
        break  # Exit the loop after printing 5 samples
plt.tight_layout()
plt.show()
Output:
Applications of MNIST
While it's primarily used for educational purposes and in benchmarking algorithms in academic
studies, learning and experimenting with the MNIST dataset can also have practical
applications. MNIST dataset finds applications in the Banking Sector, Postal Services, and
Document Management:
1. Banking Sector
• Recognizing Handwritten Numbers on Checks: Banks routinely process handwritten checks. MNIST-style digit recognition is at the core of training systems that classify the digits making up the amount on a check. This reduces manual data entry, eliminates errors, and speeds up check handling.
2. Postal Services
• Automating Postal Code Reading: Accurate parcel sorting and timely delivery depend heavily on correct recognition of postal codes. The MNIST dataset is used to train image recognition models that read zip codes on envelopes despite varied handwriting and print quality. This speeds up sorting and dispatch, facilitating faster delivery and fewer delays.
3. Document Management
• Digitizing Written Documents and Recognizing Numbers: Many documents such as invoices, receipts, and forms contain handwritten numbers. MNIST can be applied in developing systems that extract and recognize those figures during scanning and digitization. Automating this data entry streamlines processing, simplifies data mining, and improves document searchability.
Ranking in Machine Learning
Ranking is a type of machine learning problem where the goal is to predict the order or
preference of items rather than the exact label. It is commonly used in applications like search
engines, recommendation systems, and advertisements.
Types of Ranking Problems:
1. Pointwise Ranking:
o In pointwise ranking, each item is treated independently, and the model predicts
a score for each item. The items are then ranked based on their scores.
o Example: Ranking products based on their relevance to a user’s query.
2. Pairwise Ranking:
o In pairwise ranking, the model learns to predict the relative ranking of two
items. For each pair, the model predicts which item is preferred.
o Example: Ranking search results by learning which document is better between
two given documents.
o RankNet, SVMRank are popular pairwise ranking algorithms.
3. Listwise Ranking:
o Listwise ranking algorithms consider the entire list of items simultaneously and
predict the best order for all items.
o Example: Ranking search results where the whole list of results is considered to
maximize the quality of ranking as a whole.
o LambdaRank, LambdaMART are examples of listwise ranking methods.
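As a minimal illustration of the pointwise approach described above: score items with a regression model and sort by the predicted score (all feature values and relevance labels are hypothetical):

# Sketch: pointwise ranking -- predict a relevance score per item, then sort.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.9, 3], [0.2, 1], [0.5, 2], [0.7, 5]])  # item features (hypothetical)
relevance = np.array([3.0, 0.0, 1.0, 2.0])              # graded relevance labels

model = LinearRegression().fit(X, relevance)
scores = model.predict(X)
ranking = np.argsort(-scores)          # item indices, best first
print(ranking, scores[ranking])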
Applications of Ranking:
• Search Engines: Ranking search results based on relevance to the query.
• Recommendation Systems: Ranking items like products, movies, or music based on a
user’s preferences.
• Ad Placement: Ranking ads based on predicted click-through rates to optimize
advertising revenue.