
Pleaseread this disclaimerbefore proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.

DIGITAL NOTES ON
AD8552 Machine Learning

Department : Artificial Intelligence and Data Science

Batch/Year : 2020-2024/III

Created by : Ms. A. AKILA

Date : 08-08-2022

Signature :

Table of Contents

S.No  Contents                                                    Slide No
1     Contents                                                    5
2     Course Objectives                                           7
3     Pre Requisites (Course Names with Code)                     9
4     Syllabus (With Subject Code, Name, LTPC details)            11
5     Course Outcomes                                             12
6     CO-PO/PSO Mapping                                           15
7     Lecture Plan                                                17
8     Activity Based Learning                                     19
9     Lecture Notes                                               21-71
10    Assignments                                                 73
11    Part A (Q & A)                                              75
12    Part B Qs                                                   80
13    Supportive Online Certification Courses                     82
14    Real time Applications in day to day life and to Industry   84
15    Contents Beyond the Syllabus                                86
16    Assessment Schedule                                         90
17    Prescribed Text Books & Reference Books                     92
18    Mini Project Suggestion                                     94
COURSE OBJECTIVES

 To understand the basics of Machine Learning (ML)

 To understand the methods of Machine Learning

 To know about the implementation aspects of machine learning

 To understand the concepts of Data Analytics and Machine Learning

 To understand and implement use cases of ML
PREREQUISITES

AD8302 – Fundamentals of Data Science
MA8301 – Artificial Intelligence
GE8152 – Problem Solving and Python Programming
SYLLABUS

AD8552 MACHINE LEARNING (L T P C: 3 0 0 3)

UNIT I MACHINE LEARNING BASICS 8


Introduction to Machine Learning (ML) - Essential concepts of ML – Types of learning –
Machine learning methods based on Time – Dimensionality – Linearity and Non linearity –
Early trends in Machine learning – Data Understanding Representation and visualization.

UNIT II MACHINE LEARNING METHODS 11


Linear methods – Regression – Classification – Perceptron and Neural networks – Decision
trees – Support vector machines – Probabilistic models – Unsupervised learning –
Featurization

UNIT III MACHINE LEARNING IN PRACTICE 9


Ranking – Recommendation System - Designing and Tuning model pipelines- Performance
measurement – Azure Machine Learning – Open-source Machine Learning libraries –
Amazon’s Machine Learning Tool Kit: Sagemaker

UNIT IV MACHINE LEARNING AND DATA ANALYTICS 9


Machine Learning for Predictive Data Analytics – Data to Insights to Decisions – Data
Exploration –Information based Learning – Similarity based learning – Probability based
learning – Error based learning – Evaluation – The art of Machine learning to Predictive
Data Analytics.

UNIT V APPLICATIONS OF MACHINE LEARNING 8


Image Recognition – Speech Recognition – Email spam and Malware Filtering – Online
fraud detection – Medical Diagnosis.

Course Outcomes

Course Outcome Statements in the Cognitive Domain, with the expected cognitive level
of attainment:

CO1 – Understand the basics of ML (K2)
CO2 – Explain various Machine Learning methods (K2)
CO3 – Demonstrate various ML techniques using standard packages (K3)
CO4 – Explore knowledge on Machine Learning and Data Analytics (K3)
CO5 – Apply ML to various real-time examples (K3)
CO – PO/PSO Mapping

Correlation Matrix of the Course Outcomes (COs) to Programme Outcomes (POs) and
Programme Specific Outcomes (PSOs), including course enrichment activities, as per the
Anna University curriculum ('-' denotes no correlation; PO4 through PO12 carry no
correlation for any CO):

COs   PO1  PO2  PO3  PO4-PO12  PSO1  PSO2  PSO3
CO1    2    1    1      -        2     1     -
CO2    2    2    1      -        2     2     1
CO3    2    1    1      -        2     1     -
CO4    2    2    1      -        2     1     1
CO5    2    2    1      -        2     1     1
LECTURE PLAN – UNIT II

S.No  Topic                            Periods  Proposed Date(s)         Taxonomy Level  Mode of Delivery
1     Linear methods – Regression      1        24.08.2022               K2              PPT
2     Classification                   1        25.08.2022               K2              PPT
3     Perceptron and Neural networks   1        26.08.2022               K2              PPT
4     Decision trees                   1        01.09.2022               K2              PPT
5     Support vector machines          2        02.09.2022, 05.09.2022   K2              PPT
6     Probabilistic models             2        06.09.2022, 07.09.2022   K2              PPT
7     Unsupervised learning            2        08.09.2022, 09.09.2022   K2              –
8     Featurization                    1        13.09.2022               K2              PPT
ACTIVITY BASED LEARNING – UNIT II
(MODEL BUILDING/PROTOTYPE)

Crossword Puzzle

Across
3. In machine learning, a mechanism for bucketing categorical data.
6. The primary algorithm for performing gradient descent on neural networks.
9. Abbreviation for independently and identically distributed.
12. The more common label in a class-imbalanced dataset.
13. Applying a constraint to an algorithm to ensure one or more definitions of fairness
are satisfied.
18. A process used, as part of training, to evaluate the quality of a machine learning
model using the validation set.
19. A coefficient for a feature in a linear model, or an edge in a deep network.
20. A column-oriented data analysis API.
21. Abbreviation for generative adversarial network.

Down
1. A post-prediction adjustment, typically to account for prediction bias.
2. A TensorFlow programming environment in which operations run immediately.
4. Obtaining an understanding of data by considering samples, measurement, and
visualization.
5. An ensemble approach to finding the decision tree that best fits the training data.
7. State-action value function.
8. Loss function based on the absolute value of the difference between the values that
a model is predicting and the actual values of the labels.
10. A metric that your algorithm is trying to optimize.
11. The recommended format for saving and recovering TensorFlow models.
14. A statistical way of comparing two (or more) techniques, typically an incumbent
against a new rival.
15. When one number in your model becomes a NaN during training, which causes many or
all other numbers in your model to eventually become a NaN.
16. In reinforcement learning, implementing Q-learning by using a table to store the
Q-functions.
17. A popular Python machine learning API.
Lecture Notes – Unit 2
UNIT II MACHINE LEARNING METHODS
Linear methods – Regression – Classification – Perceptron and Neural networks – Decision
trees – Support vector machines – Probabilistic models – Unsupervised learning – Featurization

1. Introduction
In general, machine learning algorithms are divided into two types:
1. Supervised learning algorithms
2. Unsupervised learning algorithms
Supervised learning algorithms deal with problems that involve learning
with guidance. In other words, the training data in supervised learning methods needs
labelled samples. For example, for a classification problem we need samples with class
labels, and for a regression problem we need samples with the desired output value for
each sample. The underlying mathematical model then learns its parameters using the
labelled samples, after which it is ready to make predictions on samples that the model
has not seen, also called test samples.
Unsupervised learning deals with problems that involve data without labels.
In some sense one can argue that this is not really a machine learning problem, as there
is no knowledge to be learned from past experience. Unsupervised approaches instead try
to find some structure or some form of trend in the training data. A common example of
unsupervised learning is clustering.
Linear models are the machine learning models that deal with linear data, or
with nonlinear data that can somehow be transformed into linear data using suitable
transformations. Although these linear models are relatively simple, they illustrate
fundamental concepts in machine learning theory and pave the way for more complex models.
These linear models are the focus of this unit.

The models that operate on strictly linear data are called linear models, and the models
that use some nonlinear transformation to map the original nonlinear data to linear data
and then process it are called generalized linear models.

The concept of linearity in case of supervised learning implies that the

relationship between the input and output can be described using linear equations

For unsupervised learning, the concept of linearity implies that the
distributions that we can impose on the given data are defined using linear equations.
It is important to note that the notion of linearity does not imply any constraints on
the dimensionality. Hence we can have multivariate data that is strictly linear.
In case of one-dimensional input and output, the equation of the relationship
would define a straight line in two-dimensional space. In case of two-dimensional data
with one-dimensional output, the equation would describe a two-dimensional plane in three-
dimensional space and so on.
1.1. Linear Regression
Linear regression is a classic example of a strictly linear model. It is also called
polynomial fitting and is one of the simplest linear methods in machine learning.
1.2 Defining the Problem
The method of linear regression defines the following relationship between the input
x_i and the predicted output ŷ_i in the form of a linear equation:

    ŷ_i = Σ_{j=1..n} x_{ij} · w_j + w_0        (1)

ŷ_i is the predicted output when the actual output is y_i. The w_j, j = 1, ..., n are
called the weight parameters and w_0 is called the bias. Evaluating these parameters is
the objective of training. The same equation can also be written in matrix form as

    ŷ = Xᵀ · w + w_0        (2)

where X = [x_iᵀ], i = 1, ..., p and w = [w_j], j = 1, ..., n. The problem is to find the
values of all the weight parameters using the training data.
1.3 Solving the Problem
The most commonly used method to find the weight parameters is to minimize the
mean square error between the predicted and actual values. It is called the least
squares method. When the error is distributed as Gaussian, this method yields an
estimate called the maximum likelihood estimate, or MLE. This is the best unbiased
estimate one can find given the training data. The optimization problem can be defined as

    min_w Σ_{i=1..p} (y_i − ŷ_i)²        (3)

Expanding the predicted value term, the full minimization problem to find the optimal
weight vector w^lr can be written as

    w^lr = arg min_w Σ_{i=1..p} ( y_i − Σ_{j=1..n} x_{ij} · w_j − w_0 )²        (4)

This is a standard quadratic optimization problem and is widely studied in the
literature. As the entire formulation is defined using linear equations, only linear
relationships between input and output can be modelled. Figure 1.1 shows an example.

Fig. 1.1 Plot of logistic sigmoid function
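The least-squares problem of Eq. 4 has a well-known closed-form solution, which NumPy can compute directly. A minimal sketch (the data and "true" weights below are invented for illustration): appending a column of ones to X lets the bias w_0 be estimated as just another weight.

```python
import numpy as np

# Illustrative data: p = 50 samples, n = 2 features; the true weights
# used to generate y are assumptions for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w, true_w0 = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_w0 + 0.01 * rng.normal(size=50)

# Append a column of ones so w0 is learned together with the weights,
# then solve the least-squares problem of Eq. 4 in closed form.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w_lr, residuals, rank, sv = np.linalg.lstsq(Xb, y, rcond=None)

print(w_lr)  # approximately [2.0, -1.0, 0.5]
```

Because the noise added to y is small, the recovered vector is close to the weights used to generate the data.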

2. Regularized Linear Regression

In general, the solution obtained by solving Eq. 4 gives the best unbiased estimate, but
in some specific cases, where it is known that the error distribution is not Gaussian or
the optimization problem is highly sensitive to noise in the data, the above procedure
can result in what is called overfitting. In such cases, a mathematical technique called
regularization is used.

2.1 Regularization

Regularization is a mathematical technique that modifies the problem statement with
additional constraints. The main idea behind regularization is to simplify the solution.
Regularization adds additional constraints on the solution, thereby making sure that
overfitting is avoided and the solution is more generalizable. There are multiple
regularization approaches; the two most commonly used ones, discussed below, are also
sometimes referred to as shrinkage methods, as they try to shrink the weight parameters
close to zero.

2.2 Ridge Regression


In the ridge regression approach, the minimization problem defined in Eq. 4 is
constrained with

    Σ_{j=1..n} w_j² ≤ t        (5)

where t is a constraint parameter. Using the Lagrangian approach, the joint optimization
problem can be written as

    w^Ridge = arg min_w Σ_{i=1..p} ( y_i − Σ_{j=1..n} x_{ij} · w_j − w_0 )² + λ Σ_{j=1..n} w_j²        (6)

where λ is the standard Lagrange multiplier.
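Eq. 6 also has a closed-form solution, w = (XᵀX + λI)⁻¹ Xᵀy. A minimal NumPy sketch follows; the data and λ values are illustrative, and the bias w_0 is kept unpenalized by centering the data first (a common convention, not stated in the notes).

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression per Eq. 6; the bias w0 is kept
    unpenalized by centering X and y before solving."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    w0 = y_mean - x_mean @ w
    return w, w0

# Illustrative usage: a larger lambda shrinks the weights toward zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + 0.3
w_small, _ = ridge_fit(X, y, lam=0.1)
w_big, _ = ridge_fit(X, y, lam=1000.0)
print(np.abs(w_big).sum() < np.abs(w_small).sum())  # True: shrinkage
```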

2.3 Lasso Regression


In the lasso regression approach, the minimization problem defined in Eq. 4 is
constrained with

    Σ_{j=1..n} |w_j| ≤ t        (7)

where t is a constraint parameter. (Note the absolute values: lasso constrains the sum
of absolute weights, where ridge constrains the sum of squared weights.) Using the
Lagrangian approach, the joint optimization problem can be written as

    w^Lasso = arg min_w Σ_{i=1..p} ( y_i − Σ_{j=1..n} x_{ij} · w_j − w_0 )² + λ Σ_{j=1..n} |w_j|        (8)
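Unlike ridge, the lasso problem of Eq. 8 has no closed-form solution, because the absolute-value penalty is not differentiable at zero. One standard way to solve it is proximal gradient descent (ISTA), sketched below in NumPy; the function names, step-size choice, and data are illustrative assumptions, not from the notes.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrinks v toward 0 by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.5, n_iter=1000):
    """Minimize 0.5*||y - Xw||^2 + lam * sum|w_j| by proximal gradient."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Illustrative usage: with a sparse true weight vector, the lasso drives
# the irrelevant coefficients to (essentially) exact zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, 0.0, 0.0, 2.0, 0.0])
w = lasso_ista(X, y, lam=5.0)
print(w)
```

This sparsity-inducing behavior is exactly the "shrink to zero" property mentioned above, and it is why lasso doubles as a feature-selection method.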

2.3 Generalized Linear Models (GLM)


Generalized Linear Models, or GLMs, generalize linear models by expanding their
scope to handle nonlinear data that can be converted into linear form using suitable
transformations. The obvious limitation of linear regression is the assumption of a
linear relationship between input and output.
 In quite a few cases, the nonlinear relationship between input and output can be converted
into linear relationship by adding an additional step of transforming one of the data (input
or output) into another domain.

 The function that performs such a transformation is called a basis function or link
function. For example, logistic regression uses the logistic function as its basis
function to transform the nonlinearity into linearity.

 The logistic function is a special case that also maps the output into the range
[0, 1], which lets the output be treated as a probability. Also, sometimes the response
between input and output is monotonic, but not necessarily linear due to discontinuities.

 Such cases can also be converted into linear space with the use of specially constructed
basis functions. We will discuss logistic regression to illustrate the concept of GLM.

2.3.1 Logistic Regression


The building block of logistic regression is the logistic sigmoid function σ(x), defined
as

    σ(x) = 1 / (1 + e^(−x))        (9)

Logistic regression applies this function on top of linear regression to constrain the
output y_i ∈ [0, 1]. The relationship between input and predicted output for logistic
regression can be given as

    ŷ_i = σ( Σ_{j=1..n} x_{ij} · w_j + w_0 )        (10)

As the output is constrained between [0, 1], it can be treated as a probabilistic
measure. Also, due to the symmetrical distribution of the output of the logistic
function, it is well suited for classification problems. Due to its validity in
regression as well as classification problems, unlike linear regression, logistic
regression is one of the most commonly used approaches in the field of machine learning
as a default first alternative.
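The sigmoid of Eq. 9 and the model of Eq. 10 can be sketched directly in NumPy. Since there is no closed-form solution, the weights are typically fit by gradient descent on the cross-entropy loss; the learning rate, iteration count, and toy data below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    # Eq. 9: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def logistic_fit(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the cross-entropy loss for the model of Eq. 10."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + w0)   # predicted probabilities
        err = p - y               # gradient of cross-entropy w.r.t. the logits
        w -= lr * X.T @ err / len(y)
        w0 -= lr * err.mean()
    return w, w0

# Toy 1-D example: class 1 whenever the feature is positive.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, w0 = logistic_fit(X, y)
preds = (sigmoid(X @ w + w0) > 0.5).astype(int)
print(preds)  # [0 0 0 1 1 1]
```

Thresholding the probability at 0.5 turns the regression output into a binary classification, which is the dual use described above.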
2.4 k-Nearest Neighbor (KNN) Algorithm

KNN is one of the simplest algorithms in the field of machine learning, so it is apt to
discuss it here at the start of this part. It is also a generic method that can be used
as a classifier or a regressor.

2.4.1 Definition of KNN

In order to illustrate the concept of k-nearest neighbor algorithm, consider a case of 2-


dimensional input data as shown in Fig. 2.1 . The top plot in the figure shows the
distribution of the data. Let us consider that we are using the value of k as 3. As shown in
bottom plot in the figure let there be a test sample located as shown by red dot. Then we
find the 3 nearest neighbors of the test point from the training distribution as shown. Now,
in order to predict the output value for the test point, all we need to do is find the value of
the output for the 3 nearest neighbors and average that value.

Fig. 2.1 Figure showing a distribution of input data and showing the
concept of finding nearest neighbors
This can be written in equation form as

    ŷ = ( Σ_{i=1..k} y_i ) / k        (11)

where y_i is the output value of the i-th nearest neighbor. As can be seen, this is one
of the simplest ways to define the input-to-output mapping.

2.4.2 Classification and Regression

 As the formula expressed in Eq. 11 can be applied to classification as well as
regression problems, KNN can be applied to both types of problems without any change in
the architecture.
 As KNN is a local method, as opposed to a global method, it can easily handle
nonlinear relationships.
 Consider the two-class nonlinear distribution shown in Fig. 2.2. KNN can easily
separate the two classes by creating the circular boundaries shown, based on the local
neighborhood information expressed by Eq. 11.

Fig. 2.2 Figure showing nonlinear distribution of the data
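A minimal sketch of Eq. 11 in NumPy follows; the same averaging rule performs regression directly, and thresholding the average (a majority vote) turns it into a classifier. The data and k are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Eq. 11: average the outputs of the k nearest training samples."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest
    return y_train[nearest].mean()

# Illustrative 2-D data: two clusters with outputs 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])
y_train = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

print(knn_predict(X_train, y_train, np.array([0.1, 0.1])))  # 0.0
print(knn_predict(X_train, y_train, np.array([2.0, 1.9])))  # 1.0
```

Note that no training happens at all: the "model" is simply the stored training set, which is why KNN is called a local, non-parametric method.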


3. Perceptron and Neural Networks

3.1 Introduction
 Frank Rosenblatt (1928 – 1971) was an American psychologist notable in the field of
Artificial Intelligence.
 In 1957 he started something really big. He "invented" a Perceptron program, on an
IBM 704 computer at Cornell Aeronautical Laboratory.
 Scientists had discovered that brain cells (Neurons) receive input from our senses by
electrical signals.
 The Neurons, then again, use electrical signals to store information, and to make
decisions based on previous input.
 Frank had the idea that Perceptrons could simulate brain principles, with the ability to
learn and make decisions.
 A Perceptron is an Artificial Neuron
 It is the simplest possible Neural Network
 Neural Networks are the building blocks of Machine Learning
3.2 Perceptron
 The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).
 The idea was to use different weights to represent the importance of each input, and
that the sum of the values should be greater than a threshold value before making a
decision like true or false (0 or 1).
 Geometrically, a single-layered perceptron with linear mapping represents a linear
plane in n dimensions. In n-dimensional space the input vector is represented as
(x₁, x₂, ..., xₙ) or x. The coefficients or weights are represented as
(w₁, w₂, ..., wₙ) or w. The equation of the perceptron in n dimensions is then written
in vector form as

    x · w = y

Fig. 3.1 Perceptron


A perceptron consists of different parts. These are:
•Input Values or Single Input Layer: This is the first layer, where artificial input
neurons take the initial data into the neural system for further computation. Here the
x₁, x₂, ... are the inputs.
•Weights: These denote the strength of an input's connection between units. The higher
the weight from node 1 to node 2, the more substantial the influence neuron 1 has over
neuron 2.
•Bias: This value acts as an intercept included in a linear equation. This auxiliary parameter
modifies the output in conjunction with the weighted sum of the input within the other
neuron.
•Net Sum: The complete summation of the inputs multiplied by their associated weights,
plus the bias.
•Activation Function: A mathematical function that determines whether the artificial
neuron in the neural network gets activated or not. It is applied to the net sum (the
weighted inputs plus the bias) to give the result.
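The parts above combine into a very small amount of code. A sketch of a single perceptron's forward pass with a step activation; the weights below are illustrative, chosen so the perceptron realizes a logical AND.

```python
import numpy as np

def perceptron_output(x, w, w0):
    """Net sum (weights · inputs + bias) passed through a step activation."""
    net_sum = np.dot(x, w) + w0
    return 1 if net_sum > 0 else 0   # binary output, 0 or 1

# Illustrative weights realizing a logical AND of two binary inputs:
# the net sum exceeds 0 only when both inputs are 1.
w, w0 = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(np.array(x), w, w0))
# -> only [1, 1] produces output 1
```

Here the bias w0 = -1.5 plays the role of the threshold described above: the weighted sum must exceed 1.5 before the neuron "fires".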

There are two types of perceptron networks:

1. Single-Layer Perceptron
2. Multi-Layer Perceptron

 A single-layer perceptron is an artificial neural net that comprises one layer of
computation. In this perceptron type, the neural network computes directly from the
input layer to the output; there is no hidden layer.
 In such a perceptron, the input nodes are directly connected to the final layer, and
each node in the output layer carries the weighted sum of the inputs.
This is what a single-layer perceptron looks like:

Fig. 3.2 single layered perceptron


3.3 Multi-layer perceptron or Artificial Neural Network
 A multi-layer perceptron is a more complex artificial neural net built from multiple
layers of perceptrons. It is feed-forward, generating a collection of outputs from a
collection of inputs. In other words, a multi-layer perceptron is a directed graph of
connected layers within a neural network that transmits the processed signal in one
direction.
 The multi-layer perceptron network comprises an input layer, an output layer, and one
or more hidden layers. Each hidden layer consists of multiple perceptrons termed hidden
units. This is what a multi-layer perceptron looks like:

Fig. 3.3 Multi-layered perceptron


3.3.1 Feedforward Operation

The network shown in Fig. 3.3 also emphasizes another important aspect of the MLP called
the feedforward operation. The information entered at the input propagates through each
layer towards the output. There is no feedback of information from any layer backwards
when the network is used for predicting the output in the form of regression or
classification.

3.3.2 Nonlinear MLP or Nonlinear ANN


The major improvement in the MLP architecture comes from the use of a nonlinear mapping.
Instead of a simple dot product of the input and weights, a nonlinear function, called
an activation function, is used.

3.3.2.1 Activation Functions


 The simplest activation function is the step function, also called the sign function,
as shown in Fig. 3.4. This activation function is suited for applications like binary
classification.
 The continuous version of the step function is called the sigmoid function or
logistic function.
 Sometimes a hyperbolic tangent or tanh function is used, which has a similar shape
but ranges over [-1, 1] instead of [0, 1] as in the case of the sigmoid function.
Figure 3.5 shows the plot of the tanh function.

Fig. 3.4 Activation function sign Fig. 3.5 Activation function tanh
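The three activation functions above can be written down directly. A NumPy sketch (the step function here outputs 0/1, matching the binary-classification case):

```python
import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)   # sign/step: hard 0-1 decision

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # smooth, range (0, 1)

def tanh(x):
    return np.tanh(x)                   # smooth, range (-1, 1)

x = np.array([-2.0, 0.0, 2.0])
print(step(x), sigmoid(x), tanh(x))
# sigmoid(0) = 0.5 and tanh(0) = 0: the same S-shape over shifted ranges
```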
3.3.3 Training MLP

During the training process, the weights of the network are learned from the labelled
training data. Conceptually the process can be described as:
1. Present the input to the neural network.
2. All the weights of the network are assigned some default value.
3. The input is transformed into output by passing through each node or neuron in each layer.
4. The output generated by the network is then compared with the expected output or label.
5. The error between the prediction and label is then used to update the weights of each node.
6. The error is then propagated in backwards direction through every layer, to update the
weights in each layer such that they minimize the error.

Thus the backpropagation algorithm for training and the feedforward operation for
prediction mark the two phases in the life of a neural network. Backpropagation-based
training can be carried out in two different ways:
1. Online or stochastic method
2. Batch method
Online or Stochastic Learning
 In this method a single sample is sent as input to the network and based on the output
error the weights are updated.
 The optimization method most commonly used to update the weights is called stochastic
gradient descent or SGD method.
 The use of stochastic here implies that the samples are drawn randomly from the whole
data set, rather than using them sequentially.
 The process can converge to desired accuracy level even before all the samples are used.
 It is important to understand that in stochastic learning process, single sample is used in
each iteration and the learning path is more noisy.
Batch Learning
 In batch method the total data set is divided into a number of batches.
 Entire batch of samples is sent to the network before computing the error and updating
the weights.
 After entire batch is processed, the weights are updated. Each batch process is called as
one iteration.
 When all the samples are used once, it’s considered as one epoch in the training
process.
 Typically multiple epochs are used before the algorithm fully converges. As batch
learning uses a batch of samples in each iteration, it reduces the overall noise and
the learning path is cleaner.
 However, the process is far more computation-heavy and needs more memory and compute
resources.
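The difference between the two methods is only in how many samples contribute to each weight update. A schematic sketch for a single linear neuron with squared error (so the gradient stays simple); all names, rates, and data are illustrative assumptions:

```python
import numpy as np

def train(X, y, batch_size=1, lr=0.01, epochs=50, seed=0):
    """batch_size=1 gives online/stochastic learning; batch_size=len(X)
    gives batch learning. One full pass over all samples is one epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # draw samples randomly
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            w -= lr * X[idx].T @ err / len(idx)  # one update = one iteration
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -1.0])
w_sgd = train(X, y, batch_size=1)      # noisy path, many updates per epoch
w_batch = train(X, y, batch_size=200)  # clean path, one update per epoch
print(w_sgd, w_batch)
```

With the same learning rate, the stochastic run makes 200 updates per epoch and converges quickly along a noisy path, while the batch run makes one smooth update per epoch and needs more epochs, matching the trade-off described above.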
3.3.4 Hidden Layers

 Hidden layers are not directly connected with the inputs and outputs. Each layer in
an MLP transforms its input to a new dimensional space.
 A hidden layer can have higher dimensionality than the actual input and can thus
transform the input into an even higher-dimensional space.
 Hidden layers, simply put, are layers of mathematical functions each designed to
produce an output specific to an intended result.
 Hidden layers allow for the function of a neural network to be broken down into specific
transformations of the data. Each hidden layer function is specialized to produce a
defined output

3.4 Radial Basis Function Networks

 Radial basis function networks (RBFN), or radial basis function neural networks
(RBFNN), are a variation of the feedforward neural networks (we will call them RBF
networks to avoid confusion).
 The RBF networks are characterized by three layers: an input layer, a single hidden
layer, and an output layer.

Fig. 3.6 Architecture of radial basis function neural network

https://mccormickml.com/2013/08/15/radial-basis-function-network-rbfn-tutorial/
 The input and output layers are linear weighting functions, and the hidden layer has
a radial basis activation function instead of the sigmoid-type activation function used
in a traditional MLP. The basis function is defined as

    f_RBF(x) = e^(−β (x − µ)²)

 The above equation is defined for a scalar input. µ is called the center and lies in
the input space; β represents the spread or variance of the radial basis function.
Figure 3.7 shows the plot of the basis function, which is similar in shape to a Gaussian
distribution.

Fig. 3.7 Plot of radial basis function
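The basis function and the forward pass of a single-hidden-layer RBF network can be sketched as below. The centers, spread, and output weights are illustrative assumptions; in practice the centers are often chosen by clustering the inputs, as the interpretation below suggests.

```python
import numpy as np

def rbf(x, mu, beta):
    # f_RBF(x) = exp(-beta * (x - mu)^2): peaks at the center mu and
    # decays toward 0 as x moves away, faster for larger beta.
    return np.exp(-beta * (x - mu) ** 2)

def rbf_network(x, centers, beta, out_weights):
    """Output = linear combination of the hidden RBF node activations."""
    hidden = np.array([rbf(x, mu, beta) for mu in centers])
    return hidden @ out_weights

centers = np.array([-1.0, 0.0, 1.0])     # one hidden node per center
out_weights = np.array([0.5, 2.0, -0.5]) # linear output layer

print(rbf(0.0, 0.0, beta=1.0))           # 1.0 at the center
print(rbf_network(0.0, centers, 1.0, out_weights))
```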

3.4.1 Interpretation of RBF Networks

 Consider that the desired output values form n clusters corresponding to clusters in
the input space. Each node in the hidden layer can then be thought of as representing
one transformation from an input cluster to an output cluster.
 As can be seen from Fig. 3.7, the value of the radial basis function reduces to 0
rather quickly as the distance between the input and the center µ increases relative to
the spread β.
 Thus the RBF network as a whole maps the input space to the output space by a linear
combination of the outputs generated by each hidden RBF node.
 It is important to choose these cluster centers carefully, to make sure the input
space is mapped uniformly and there are no gaps.
 If these requirements are followed, an RBF network produces accurate predictions.
3. 5 Overfitting and Regularization

Neural networks have scope to improve performance on the given training data by
increasing the complexity of the network. Complexity can be increased by manipulating
various factors such as:
1. Increasing number of hidden layers
2. Increasing the nodes in hidden layers
3. Using complex activation functions
4. Increasing the training epochs
Such improvements in training performance from arbitrary increases in complexity
typically lead to overfitting. Overfitting is a phenomenon where we model the training
data so accurately that in essence we just memorize it, rather than identifying its
features and structure. Such memorization leads to significantly worse performance on
unseen data. However, determining the optimal point where optimization should be stopped,
to keep the model generic enough, is not trivial.
• For example, when the model learns signals as well as noises in the training data but
couldn’t perform appropriately on new data upon which the model wasn’t trained, the
condition/problem of overfitting takes place.
• Overfitting simply states that there is low error with respect to training dataset, and
high error with respect to test datasets.
• Regularization is the most commonly used technique to penalize complex models in
machine learning; it is deployed to reduce overfitting (i.e., to contract generalization
error) by keeping the network weights small.
3.5.1 L1 and L2 Regularization
When you have a large number of features in your data set, you may wish to create a
less complex, more parsimonious model. Two widely used regularization techniques
used to address overfitting and feature selection are L1 and L2 regularization.
 L1 Regularization, also called a lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
 L2 Regularization, also called a ridge regression, adds the “squared magnitude”
of the coefficient as the penalty term to the loss function.

The updated cost function C(x) using L1 or L2 regularization to reduce overfitting can
be written as

    C(x) = L(x) + λ Σ |w|        (L1)
    C(x) = L(x) + λ Σ w²         (L2)

 L(x) is the loss function that depends on the error in prediction, while W stands for
the vector of weights in the neural network.
 The L1 norm tries to minimize the sum of the absolute values of the weights, while
the L2 norm tries to minimize the sum of the squared values of the weights.

Pros and Cons of L1 and L2 Regularization

 L1 regularization requires less computation and is less sensitive to strong outliers,
but it is prone to making weights exactly zero.
 L2 regularization is overall a better metric and provides slow weight decay towards
zero, but it is more computation-intensive.

3.5.2 Dropout Regularization


 This is an interesting method that is only applicable to neural networks.
 In dropout regularization, some neurons are randomly dropped from the network during
training.
 The effect of each dropout on overall accuracy is considered, and after some
iterations an optimal set of neurons is selected for the final model.
 This technique actually makes the model simpler, rather than adding more complexity
as the L1 and L2 regularization techniques do.
 The method is quite popular, specifically in the case of more complex and deep
neural networks.
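A sketch of the dropout mechanic itself: during training, each activation is zeroed with probability p and the survivors are rescaled ("inverted dropout", a common convention) so the expected activation is unchanged; at prediction time nothing is dropped. All names and the rate are illustrative assumptions.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Randomly zero each activation with probability p during training."""
    if not training:
        return activations                 # no dropout at prediction time
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)  # rescale so the mean is preserved

rng = np.random.default_rng(4)
h = np.ones(10)
h_train = dropout(h, p=0.5, training=True, rng=rng)
h_test = dropout(h, p=0.5, training=False)
print(h_train)  # some entries are 0, the survivors are scaled to 2.0
print(h_test)   # unchanged at prediction time
```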

4. Decision Tree
4.1 Introduction
In decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. It uses a tree-like model of decisions. Though it is a
commonly used tool in data mining for deriving a strategy to reach a particular goal,
it is also widely used in machine learning.

Fig:4.1 An example of a Decision Tree


Below are some assumptions that we made while using decision tree:
•At the beginning, we consider the whole training set as the root.
•Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
•On the basis of attribute values records are distributed recursively.
•We use statistical methods for ordering attributes as root or the internal node.
4.2 Why Decision Trees?
Before going into the details of decision tree theory, let’s understand why decision trees are
so important. Here are the advantages of using decision tree algorithms for reference:
1. More human-like behavior.
2. Can work directly on non-numeric data, e.g., categorical features.
3. Can work directly with missing data. As a result, the data cleaning step can be skipped.
4. A trained decision tree has high interpretability compared to the abstract nature of
models trained using other algorithms like neural networks, SVMs, etc.
5. Decision tree algorithms scale easily from linear data to nonlinear data without any
change in core logic.
6. Decision trees can be used as non-parametric models, so hyperparameter tuning
becomes unnecessary.
4.2.1 Types of Decision Trees
Based on the application (classification or regression) there are some differences in how the
trees are built, and consequently they are called classification decision trees and regression
decision trees.
4.3 Algorithms for Building Decision Trees
Most commonly used algorithms for building decision trees are:
• CART or Classification and Regression Tree
• ID3 or Iterative Dichotomiser
• CHAID or Chi-Squared Automatic Interaction Detector
 CART or classification and regression tree is a generic term used for describing the
process of building decision trees as described by Breiman and Friedman.
 ID3 is a variation of the CART methodology with a slightly different optimization
method.
 CHAID uses a significantly different procedure, and we will study it separately.
Let’s consider a two-dimensional space defined by axes (x1, x2). The space is divided into 5
regions (R1, R2, R3, R4, R5) as shown in Fig. 4.2, using the set of rules defined in Fig. 4.3.

Fig. 4.2 Rectangular regions created by decision tree

4.4 Regression

A regression tree is built through a process known as binary recursive partitioning,
which is an iterative process that splits the data into partitions or branches, and then
continues splitting each partition into smaller groups as the method moves up each branch.
Based on the example shown in Figs. 4.2 and 4.3, let the classes be regions R1 to R5 and
the input data be two dimensional. In such a case, the desired response of the decision tree
is defined as

t(x) = r_k, ∀ x_i ∈ R_k

where r_k is a constant value of the output in region R_k.
If we define the optimization problem as minimizing the mean square error,
then a simple calculation shows that the estimate for r_k is the average of the training
outputs in that region:

r̂_k = ave(y_i | x_i ∈ R_k)

Fig. 4.3 Hierarchical rules defining the decision tree

Let us denote the large tree as T0. The algorithm must then apply a pruning technique to
reduce the tree size and find the optimal tradeoff that captures most of the structure in
the data without overfitting it. This is achieved by optimizing a squared-error node
impurity measure.

4.5 Classification Tree


In the case of classification, the output is not a continuous numerical value but a discrete
class label. The development of the large tree follows the same steps as described in
the regression tree subsection, but the pruning methods need to be updated, as the
squared error method is not suitable for classification. Three different types of
measures are popular in the literature:
• Misclassification error
• Gini index
• Cross-entropy or deviance
Let there be “k” classes and “n” nodes. Let the frequency of class (m) predictions at
each node (i) be denoted as f_mi, and the fraction of the classes predicted as m at node i
be denoted as p_mi. Let the majority class at node i be c_i. Hence the fraction of class
c_i at node i is p_ic_i.
4.6 Decision Metrics
Let’s define the metrics used for making the decision at each node. Differences in the metric
definition separate the different decision tree algorithms.
4.6.1 Misclassification Error
Based on the variables defined above, the misclassification rate at node i is defined as
1 − p_ic_i.
As can be seen from Fig. 4.4, this rate is not a smooth function and hence cannot be
differentiated everywhere. However, this is one of the most intuitive formulations and
hence is fairly popular.
4.6.2 Gini Index
The Gini index is the measure of choice in CART. The concept of the Gini index can be
summarized as the probability of misclassification of a randomly selected input sample if it
were labelled based on the distribution of the classes in the given node. Mathematically it
is defined as

Gini_i = Σ_m p_mi (1 − p_mi) = 1 − Σ_m p_mi²

Fig. 4.4 The plot of decision metrics for a case of 2 class problem. X-axis shows the
proportion in class 1. Curves are scaled to fit, without loss of generality

As the plot in Fig. 4.4 shows, this is a smooth function of the proportion, is continuously
differentiable, and can be safely used in optimization.
4.6.3 Cross-Entropy or Deviance
Cross-entropy is an information-theoretic metric defined as

H_i = − Σ_m p_mi log p_mi

 This definition resembles the classical entropy of a single random variable. However, as
the random variable here is already a combination of the class prediction and the nodes
of the tree, it is called cross-entropy.
 ID3 models use cross-entropy as the measure of choice. As the plot in Fig. 4.4 shows,
this is a smooth function of the proportion, is continuously differentiable, and can be
safely used in optimization.
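The three metrics can be computed side by side for a node's class proportions; this small sketch (function name is illustrative) uses the definitions above.

```python
import numpy as np

def node_impurities(p):
    """p: class proportions at a node (must sum to 1).
    Returns (misclassification error, Gini index, cross-entropy in bits)."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()             # 1 minus the majority-class fraction
    gini = 1.0 - np.sum(p ** 2)          # equivalently sum_m p_m (1 - p_m)
    nz = p[p > 0]                        # 0 * log(0) is taken as 0
    entropy = -np.sum(nz * np.log2(nz))
    return misclass, gini, entropy

pure = node_impurities([1.0, 0.0])       # a pure node: all three metrics are 0
mixed = node_impurities([0.5, 0.5])      # maximally mixed 2-class node
```

For the 50/50 node the metrics reach their two-class maxima: misclassification 0.5, Gini 0.5, entropy 1 bit, matching the peaks of the curves in Fig. 4.4.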

4.7 CHAID
 Chi-square automatic interaction detector or CHAID is a decision tree technique that
derives its origin from the statistical chi-square test for goodness of fit. It was first
published by G. V. Kass in 1980, but some parts of the technique were already in use in
the 1950s.
 This test uses the chi-square distribution to compare a sample with a population and
predict, at the desired statistical significance, whether the sample belongs to the
population. The CHAID technique uses this theory to build a decision tree.

4.7.1 CHAID Algorithm


The first task in building the CHAID tree is to find the most dependent variable. This is
directly related to the final application of the tree. The algorithm works best if a single
desired variable can be identified. Once such a variable is identified, it is called the root
node. The algorithm then tries to split the node into two or more nodes, called initial or
parent nodes. All the subsequent nodes are called child nodes, until we reach the final set
of nodes that are not split any further.
These nodes are called terminal nodes. Splitting at each node is entirely
based on statistical dependence, as dictated by the chi-square distribution in the case of
categorical data and by the F-test in the case of continuous data. As each split is based on
the dependency of variables, rather than a more complex expression like Gini impurity or
cross-entropy in the case of CART or ID3-based trees, the tree structure developed using
CHAID is more interpretable and human readable in most cases.
4.8 Training Decision Tree

4.8.1 Steps
1. Start with the training data.
2. Choose the metric of choice (Gini index or cross-entropy).
3. Choose the root node, such that it splits the data with optimal values of metrics into two
branches.
4. Split the data into two parts by applying the decision rule of root node.
5. Repeat the steps 3 and 4 for each branch.
6. Continue the splitting process till leaf nodes are reached in all the branches with
predefined stop rule.
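The training steps above can be sketched with scikit-learn's `DecisionTreeClassifier`; the Iris data, the Gini criterion, and the depth limit (a simple stop rule) are illustrative choices, not from the text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the step-2 metric choice; max_depth acts as a stop rule.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)                       # steps 3-6: recursive splitting until the stop rule

train_acc = clf.score(X, y)         # accuracy on the training data
```

Swapping `criterion="entropy"` switches the split metric from Gini index to cross-entropy without changing anything else.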

Ensemble methods combine several decision trees to produce better predictive
performance than a single decision tree. The main principle behind an ensemble
model is that a group of weak learners comes together to form a strong learner.
There are three main types of ensembles:
1. Bagging
2. Random forest
3. Boosting
4.9 Bagging Ensemble Trees
The term bagging finds its origins in Bootstrap Aggregation. Coincidentally, the literal
meaning of bagging, putting multiple decision trees in a bag, is not too far from the
way the bagging techniques work. The bagging technique can be described using the
following steps:
1. Split the total training data into a predetermined number of sets with random sampling
with replacement. The term with replacement means that the same sample can appear in
multiple sets. Each such set is called a bootstrap sample.
2. Train a decision tree using the CART or ID3 method on each of the data sets.
3. Each learned tree is called a weak learner.
4. Aggregate all the weak learners: by averaging the outputs of individual learners in the
case of regression, and by voting in the case of classification. The aggregation step
involves optimization, such that the prediction error is minimized.
5. The output of the aggregate or ensemble of the weak learners is considered the
final output.
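The five steps above are exactly what scikit-learn's `BaggingClassifier` does (its default base estimator is a decision tree); the breast-cancer dataset and ensemble size here are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True draws each training set with replacement (step 1);
# 50 weak-learner trees are trained (steps 2-3) and aggregated by voting (step 4).
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)

bag_acc = bag.score(X_te, y_te)     # step 5: the ensemble's prediction accuracy
```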
4.10 Random Forest Tree
 Random Forest is a flexible, easy to use machine learning algorithm that
produces a great result most of the time, even without hyper-parameter tuning.
 It is also one of the most used algorithms, because of its simplicity and the fact
that it can be used for both classification and regression tasks.
 Random Forest is a supervised learning algorithm. It creates a forest and makes
it somehow random.
 The forest it builds is an ensemble of decision trees, most of the time trained
with the “bagging” method.
 The general idea of the bagging method is that a combination of learning
models increases the overall result.
 Random forest builds multiple decision trees and merges them together
to get a more accurate and stable prediction.
 One big advantage of random forest is that it can be used for both classification
and regression problems, which form the majority of current machine learning
systems.
Below you can see how a random forest would look like with two trees

 With a few exceptions, a random-forest classifier has all the hyperparameters of
a decision-tree classifier and also all the hyperparameters of a bagging classifier,
to control the ensemble itself.
 Instead of building a bagging classifier and passing it a decision-tree classifier,
you can just use the random-forest classifier class, which is more convenient and
optimized for decision trees.
 Note that there is also a random-forest regressor for regression tasks.
 The random-forest algorithm brings extra randomness into the model when it is growing
the trees.
 Instead of searching for the best feature while splitting a node, it searches for the best
feature among a random subset of features.
 This process creates a wide diversity, which generally results in a better model.
 Therefore, when you are growing a tree in a random forest, only a random subset of the
features is considered for splitting a node.
 You can make trees even more random by additionally using random thresholds for each
feature, rather than searching for the best possible thresholds (as a normal decision tree
does).
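The feature-subsampling behavior described above is exposed through the `max_features` hyperparameter of scikit-learn's `RandomForestClassifier`; the dataset and values below are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features="sqrt": each split considers only a random sqrt(n_features)-sized
# subset of the features, which is the extra randomness discussed above.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)

rf_acc = rf.score(X_te, y_te)
```

A useful side effect of the ensemble is `rf.feature_importances_`, which ranks the features by how much they reduce impurity across all trees.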

4.10.1 Decision Jungles


 Recently a modification to the method of random forests was proposed in the form of
Decision Jungles. One of the drawbacks of random forests is that they can grow
exponentially with data size, and if the compute platform is limited by memory, the
depth of the trees needs to be restricted. This can result in suboptimal performance.
 Decision jungles propose to improve on this by representing each weak learner in the
random forest method by a directed acyclic graph (DAG) instead of an open-ended tree.
 The DAG has the capability to fuse some of the nodes, thereby creating multiple paths to
a leaf from the root node.
 As a result, decision jungles can represent the same logic as random forest trees, but in
a significantly more compact manner.
4.11 Boosted Ensemble Trees
The boosting technique employs a very different approach, where the first tree is
trained on a random sample of data. However, the data used by the second tree
depends on the outcome of training the first tree: the second tree focuses on
the specific samples where the first decision tree is not performing well. Thus training of
the second tree depends on the training of the first tree, and they cannot be trained in
parallel. The training continues in this fashion to the third tree, the fourth, and so on.
Due to this unavailability of parallel computation, training boosted trees
is significantly slower than training trees using bagging or random forests. Once all the
trees are trained, the outputs of all individual trees are combined
with the necessary weights to generate the final output.
4.11.1 AdaBoost
AdaBoost was one of the first boosting algorithms proposed by Freund and Schapire. The
algorithm was primarily developed for the case of binary classification and it was quite
effective in improving the performance of a decision tree in a systematic iterative manner. The
algorithm was then extended to support multi-class classification as well as regression.
4.11.2 Gradient Boosting
Gradient boosting is a generalization of the AdaBoost algorithm using the statistical
framework developed by Breiman and Friedman. In gradient boosted trees, the boosting
problem is stated as a numerical optimization problem with the objective of minimizing the
error by sequentially adding weak learners using the gradient descent algorithm. Gradient
descent being a greedy method, the gradient boosting algorithm is susceptible to
overfitting the training data. Hence regularization techniques are always used with
gradient boosting to limit the overfitting.
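Both boosting variants are available in scikit-learn; this sketch trains each on the same (illustrative) dataset so the sequential-ensemble idea can be tried directly. The datasets and estimator counts are assumptions, not from the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost: each new weak learner reweights the samples the previous ones got wrong.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Gradient boosting: each new tree fits the gradient of the loss of the current
# ensemble; learning_rate is one of the regularization knobs mentioned above.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X_tr, y_tr)

ada_acc, gbt_acc = ada.score(X_te, y_te), gbt.score(X_te, y_te)
```

Lowering `learning_rate` (shrinkage) or subsampling rows per tree are the usual ways to curb the overfitting tendency noted above.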

5. Support Vector Machine

 A Support Vector Machine is responsible for finding the decision boundary that separates
different classes and maximizes the margin.
 Margins are the (perpendicular) distances between the line and the dots closest to the
line.
 A hyperplane is an (n−1)-dimensional subspace of an n-dimensional space. For a
2-dimensional space, the hyperplane is 1-dimensional, which is just a line. For a
3-dimensional space, the hyperplane is 2-dimensional: a plane that slices the cube.

SVM Algorithm

Separable case – Infinitely many boundaries are possible to separate the data into two classes.

Non-separable case – The two classes are not separable but overlap with each other.
Separable case SVM

Let’s understand the working of SVM using an example. Suppose we have a dataset that
has two classes (green and blue). We want to classify a new data point as either blue
or green.

To classify these points, we can have many decision boundaries, but the
question is which is the best and how do we find it? NOTE: since we are plotting the data
points in a 2-dimensional graph we call this decision boundary a straight line, but if we
have more dimensions, we call this decision boundary a “hyperplane”.
The best hyperplane is the plane that has the maximum distance from
both classes, and this is the main aim of SVM. This is done by finding the different
hyperplanes which classify the labels correctly, then choosing the one which is
farthest from the data points, i.e., the one with the maximum margin.
Any hyperplane can be written mathematically as w · x + b = 0.

For a 2-dimensional space, the hyperplane is the line w1x1 + w2x2 + b = 0.

The dots above this line are those (x1, x2) satisfying w1x1 + w2x2 + b > 0.

The dots below this line, by similar logic, satisfy w1x1 + w2x2 + b < 0.

Assuming the label y is either 1 (for green) or -1 (for red), all three lines below are
separating hyperplanes, because they all share the same property: above the line is
green; below the line is red.

This property can be written in math again as follows: w · xi + b > 0 when yi = 1, and
w · xi + b < 0 when yi = -1.

If we further generalize these two into one, it becomes: yi (w · xi + b) > 0.

The distance from either dashed line to the solid line is the margin.

Non-Separable Case

In the linearly separable case, SVM is trying to find the hyperplane that
maximizes the margin, with the condition that both classes are classified correctly. But in
reality, datasets are probably never linearly separable, so the condition of 100% correct
classification by a hyperplane will never be met.
SVM addresses non-linearly separable cases by introducing two concepts:
Soft Margin and Kernel Tricks.
Let’s use an example. If I add one red dot in the green cluster, the dataset is no longer
linearly separable.
Two solutions to this problem:
1. Soft Margin: try to find a line to separate the classes, but tolerate one or a few
misclassified dots (e.g. the dots circled in red)
2. Kernel Trick: try to find a non-linear decision boundary
Soft Margin
Two types of misclassifications are tolerated by SVM under soft margin:
1. The dot is on the wrong side of the decision boundary but on the correct side of the
margin, or on the margin (shown on the left)
2. The dot is on the wrong side of the decision boundary and on the wrong side of the
margin (shown on the right)

Applying the soft margin, SVM tolerates a few dots being misclassified and tries to balance
the trade-off between finding a line that maximizes the margin and minimizing the
misclassification.
Kernels in Support Vector Machine

The most interesting feature of SVM is that it can even work with a non-linear dataset; for
this, we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we
have a dataset like this:

Here we see we cannot draw a single line, or say hyperplane, which can classify the points
correctly. So what we do is try converting this lower-dimensional space to a higher-dimensional
space using some quadratic functions, which will allow us to find a decision boundary that
clearly divides the data points. The functions which help us do this are called kernels, and
which kernel to use is determined purely by hyperparameter tuning.

Different Kernel functions


Some kernel functions which you can use in SVM are given below:
1. Polynomial kernel
Following is the formula for the polynomial kernel:

K(X1, X2) = (X1 · X2 + 1)^d

Here d is the degree of the polynomial, which we need to specify manually.

Suppose we have two features X1 and X2 and output variable Y; using a degree-2
polynomial kernel the mapping introduces the terms X1², X2², and X1·X2, and now we can
see that 2 dimensions got converted into 5 dimensions.

2. Sigmoid kernel
We can use it as a proxy for neural networks. The equation is:

K(X1, X2) = tanh(γ · X1ᵀX2 + r)

It squashes the input through an S-shaped function so that the classes can be
separated by a simple straight line.
3. RBF kernel
What it actually does is create non-linear combinations of our features to lift the
samples onto a higher-dimensional feature space, where we can use a linear decision
boundary to separate the classes. It is the most used kernel in SVM classification; the
following formula explains it mathematically:

K(X₁, X₂) = exp(−||X₁ − X₂||² / (2σ²))

where,
1. ‘σ’ is the variance and our hyperparameter
2. ||X₁ – X₂|| is the Euclidean distance between two points X₁ and X₂
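A small sketch contrasting a linear kernel with the RBF kernel on data that is not linearly separable; the concentric-circles dataset and parameter values are illustrative assumptions, not from the text.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line can separate the two rings.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# C controls the soft margin (how many misclassified dots are tolerated);
# gamma plays the role of 1/(2σ²) in the RBF formula above.
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

lin_acc, rbf_acc = linear.score(X, y), rbf.score(X, y)
```

The linear kernel stays near chance on this data, while the RBF kernel's implicit lift to a higher-dimensional space separates the rings almost perfectly.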

Reference: https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/
6. Probabilistic Models

Probabilistic methods try to assign some form of uncertainty to the unknown variables and
some form of belief probability to known variables, and try to find the unknown values using
the extensive library of probabilistic models. The probabilistic models are mainly classified
into two types:
1. Generative
2. Discriminative

The difference between the two types is given in terms of the probabilities
they deal with. If we have an observable input X and observable output Y, then the
generative models try to model the joint probability P(X, Y), while the discriminative
models try to model the conditional probability P(Y|X).
 Discriminative models are the models that try to predict changes in the output
based only on changes in the input.
 Generative models are the models that try to model changes in the output
based on changes in the input as well as changes in the state.

The probabilistic approaches (discriminative as well as generative) are also divided between
two schools of thought:
1. Maximum likelihood estimation
2. Bayesian approach

6.1 Discriminative Models


Maximum Likelihood Estimation

The maximum likelihood estimation or MLE approach deals with the problem at face
value and parameterizes the information into variables. The values of the variables
that maximize the probability of the observed variables lead to the solution of the
problem.
Let us define the problem using formal notations. Let there be a function f(x; θ) that
produces the observed output y, where x represents the input and
θ ∈ Θ represents a parameter vector that can be single- or multi-dimensional.
The MLE method defines a likelihood function denoted as L(y|θ). Typically the likelihood
function is the joint probability of the parameters and observed variables, L(y|θ) = P(y; θ).
The objective is to find the optimal values for θ that maximize the likelihood function, as
given by

θ̂ = argmax_θ L(y|θ)

or, equivalently, in terms of the log-likelihood,

θ̂ = argmax_θ log L(y|θ)
Bayesian Approach

All the unknowns are modelled as random variables with known prior probability
distributions. Let us denote the conditional prior probability of observing the output y for
parameter vector θ as P(y|θ). The marginal probabilities of these variables are denoted
as P(y) and P(θ). The joint probability of the variables can be written in terms of
conditional and marginal probabilities as

P(y, θ) = P(y|θ) · P(θ)    (1)

The same joint probability can also be given as

P(y, θ) = P(θ|y) · P(y)    (2)

Here the probability P(θ|y) is called the posterior probability. Combining Eqs. 1 and 2,

P(θ|y) · P(y) = P(y|θ) · P(θ)

and rearranging the terms we get

P(θ|y) = P(y|θ) · P(θ) / P(y)    (3)

Equation 3 is called Bayes’ theorem. This theorem gives the relationship between the
posterior probability and the prior probability in a simple and elegant manner. This equation
is the foundation of the entire Bayesian framework.
Each term in the above equation is given a name:
P(θ) is called the prior,
P(y|θ) is called the likelihood,
P(y) is called the evidence, and P(θ|y) is called the posterior.

The Bayes’ estimate is based on maximizing the posterior. Hence, the optimization problem
based on Bayes’ theorem can now be stated as

θ̂ = argmax_θ P(θ|y)

Expanding the term using Eq. 3 (the evidence P(y) does not depend on θ and can be dropped),

θ̂ = argmax_θ P(y|θ) · P(θ)
Comparison of MLE and Bayesian Approach

In order to understand them to the full extent, let us consider a simple
numerical example. Let there be an experiment of tossing a coin 5 times. The
two possible outcomes of each toss are H or T (Head or Tail). The outcome of our
experiment is H, H, T, H, H. The objective is to predict the outcome of the 6th toss.
Let’s work out this problem using the MLE and Bayes’ approaches.

Solution Using MLE

The likelihood function is defined as L(y|θ) = P(y; θ), where y denotes the outcome of the
trial and θ denotes the property of the coin in the form of the probability of a given
outcome. Let the probability of getting a Head be h; the probability of getting a Tail is then
1 − h. The outcome of each toss is independent of the outcomes of the other tosses. Hence
the total likelihood of the experiment can be given as

L(y|h) = h · h · (1 − h) · h · h = h⁴(1 − h)

In order to maximize the likelihood, we use the fundamental principle from
differential calculus that at any maximum or minimum of a continuous function the first
order derivative is 0. Differentiating the above equation with respect to h and equating it
to 0,

d/dh [h⁴(1 − h)] = 4h³ − 5h⁴ = 0

Solving the equation (assuming h ≠ 0) we get h = 4/5. Thus the probability of getting a
Head in the next toss would be 4/5.
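The calculus result can be checked numerically by evaluating the likelihood L(h) = h⁴(1 − h) on a grid and taking its maximizer; the grid resolution is an arbitrary choice.

```python
import numpy as np

# Likelihood of the observed sequence H,H,T,H,H as a function of h.
h = np.linspace(0.001, 0.999, 9999)
likelihood = h ** 4 * (1 - h)

h_mle = h[np.argmax(likelihood)]    # should sit at 4/5
```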

Solution Using Bayes’s Approach


Writing the posterior as per Bayes’ theorem,

P(h|y) ∝ L(y|h) · P(h)

We can now proceed with the optimization problem as before: to maximize the posterior,
we differentiate it with respect to h as before and equate it to 0.

Substituting the values and solving (assuming h ≠ 0), the Bayes’ approach gives a
probability of 9/10 of getting a Head in the next toss. Thus the assumption of a non-trivial
prior with Bayes’ approach leads to a different answer compared to MLE.

Probability density function (pdf) for the prior
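The 9/10 answer can be reproduced numerically once a concrete prior is chosen. The notes only show the prior's shape, so the Beta(6, 1) prior below (density ∝ h⁵, favoring Heads) is an assumption picked to match the stated result.

```python
import numpy as np

h = np.linspace(0.001, 0.999, 9999)
likelihood = h ** 4 * (1 - h)       # 4 heads, 1 tail, as before
prior = h ** 5                      # assumed Beta(6, 1) prior, up to a constant
posterior = likelihood * prior      # numerator of Bayes' theorem; evidence is constant in h

h_map = h[np.argmax(posterior)]     # maximum a posteriori estimate
```

The posterior is proportional to h⁹(1 − h), whose maximizer is 9/10, matching the worked answer; with a flat prior (prior = 1) the same code recovers the MLE answer 4/5.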


6.2 Generative Models

 Generative modeling is an unsupervised form of machine learning where the model
learns to discover the patterns in input data. Using this knowledge, the model can
generate new data on its own, which is relatable to the original training dataset.
 A generative model includes the distribution of the data itself, and tells you how likely a
given example is. For example, models that predict the next word in a sequence are
typically generative models because they can assign a probability to a sequence of
words.
 The generative models can be broadly classified into two types:
(1) Classical models
(2) Deep learning based models.
Few examples of classical generative models:
Mixture Methods
One of the fundamental aspects of generative models is to understand the
composition of the input: how the input data came into existence in the first
place. The most simplistic case would be for all the input data to be the outcome of a
single process. If we can identify the parameters describing that process, we can
understand the input to its fullest extent.
Bayesian Networks
Bayesian networks represent directed acyclic graphs as shown in Fig. 6.1.
Each node represents an observable variable or a state. The edges represent the
conditional dependencies between the nodes. Training of Bayesian network involves
identifying the nodes and predicting the conditional probabilities that best represent the
given data.

Fig. 6.1 Sample Bayesian network


6.3 Some Useful Probability Distributions

Definition : pdf A probability density function or pdf is a function P(X = x)
that provides the probability of occurrence of value x for a given variable X. The plot of
P(X = x) is non-negative, can spread between [−∞, +∞] on the x-axis, and integrates
to 1.
Definition : cdf A cumulative density function or cdf is a function C(X = x)
that provides the sum of probabilities of occurrences of values of X between [−∞, x].
This plot is bounded between [0, 1]. Unlike the pdf, this plot starts at 0 on the left and
ends at 1 on the right.

Normal or Gaussian Distribution


Normal distribution is one of the most widely used probability distributions. It is also called
the bell-shaped distribution due to the shape of its pdf. The distribution has a vast array
of applications, including error analysis. Another reason the normal distribution is popular
is the central limit theorem.

Definition : Central Limit Theorem : The central limit theorem states that, if a sufficiently
large number of samples are taken from a population with any distribution with finite
variance, then the mean of the samples asymptotically approaches the mean of the
population. In other words, the sampling distribution of the mean taken from a population
of any distribution asymptotically approaches the normal distribution.
Hence the normal distribution is sometimes called the distribution of distributions.
The normal distribution is also an example of a continuous and unbounded distribution,
where the value of x can span [−∞, ∞]. Mathematically, the pdf of the normal distribution
is given as

f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))

where µ is the mean and σ is the standard deviation of the distribution.
The variance is σ².
The cdf of the normal distribution is given as

F(x) = (1/2) [1 + erf((x − µ) / (σ√2))]

The plots of the pdf and cdf are shown in the figure.

Bernoulli Distribution

Bernoulli distribution is an example of a discrete distribution, and its most
common application is the probability of a coin toss. The distribution is based on two
parameters p and q, which are related as p = 1 − q. Typically p is called the probability of
success (or, in the case of a coin toss, the probability of getting a Head) and q is
called the probability of failure (or, in the case of a coin toss, the probability of getting a
Tail). Based on these parameters, the pdf (in the case of discrete variables it is sometimes
called a probability mass function or pmf, but for the sake of consistency we will call it a
pdf) of the Bernoulli distribution is given as

P(k) = pᵏ · q^(1−k),  k ∈ {0, 1}

Here we use the discrete variable k instead of the continuous variable x. The cdf is given as

C(k) = 0 for k < 0,  q for 0 ≤ k < 1,  and 1 for k ≥ 1

Binomial Distribution
Binomial distribution generalizes the Bernoulli distribution to multiple trials.
The binomial distribution has two parameters n and p: n is the number of trials of the
experiment, where the probability of success is p. The probability of failure is q = 1 − p
just like the Bernoulli distribution, but it is not considered a separate third parameter. The
pdf for the binomial distribution is given as

P(k) = C(n, k) · pᵏ · (1 − p)^(n−k)

where

C(n, k) = n! / (k! (n − k)!)

It represents the number of combinations of k in n from permutation-combination
theory.

The cdf of the binomial distribution is given as

C(k) = Σᵢ₌₀ᵏ C(n, i) · pⁱ · (1 − p)^(n−i)

Gamma Distribution

The gamma distribution is also one of the most highly studied distributions in the theory of
statistics. It forms a basic distribution for other commonly used distributions like the
chi-squared distribution, exponential distribution, etc., which are special cases of the
gamma distribution. It is defined in terms of two parameters: α and β. The pdf of the
gamma distribution is given as

f(x) = (β^α / Γ(α)) · x^(α−1) · e^(−βx),  x > 0

The cdf of the gamma function cannot be stated easily as a single-valued function, but
rather is given as the sum of an infinite series.
Plot of Gamma pdfs for different values of 𝛼 and ß


Poisson Distribution

The Poisson distribution is a discrete distribution loosely similar to the binomial
distribution. The Poisson distribution was developed to model the number of occurrences
of an outcome in a fixed interval of time. It is named after the French mathematician
Siméon Poisson. The pdf of the Poisson distribution is given in terms of the number of
events (k) in the interval as

P(k) = λᵏ · e^(−λ) / k!

where the single parameter λ is the average number of events in the interval. The cdf
of the Poisson distribution is given as

C(k) = Σᵢ₌₀ᵏ λⁱ · e^(−λ) / i!
Plot of Poisson pdfs for different values of λ
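The distributions above can be evaluated directly with `scipy.stats`; the parameter values below are illustrative choices, not from the text.

```python
from scipy import stats

# Standard normal pdf at its peak: 1/sqrt(2*pi) ~ 0.3989
norm_pdf = stats.norm(loc=0, scale=1).pdf(0.0)

# Bernoulli with success probability p = 0.7: P(k = 1) = p
bern_pmf = stats.bernoulli(p=0.7).pmf(1)

# Binomial: P(2 successes in 5 fair trials) = C(5,2) * 0.5^5 = 0.3125
binom_pmf = stats.binom(n=5, p=0.5).pmf(2)

# Gamma with alpha = 2, beta = 1 (scale = 1/beta): f(1) = 1 * e^(-1)
gamma_pdf = stats.gamma(a=2.0, scale=1.0).pdf(1.0)

# Poisson with lambda = 3: P(k = 3) = 3^3 e^(-3) / 3!
pois_pmf = stats.poisson(mu=3.0).pmf(3)
```

The matching `.cdf(...)` methods give the cumulative values, which is an easy way to visualize the pdf/cdf pairs discussed above.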


7. Unsupervised Learning

Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Below are some main reasons which describe the importance of Unsupervised Learning:
•Unsupervised learning is helpful for finding useful insights from the data.
•Unsupervised learning is much like how a human learns to think through their own
experiences, which makes it closer to real AI.
•Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
•In the real world, we do not always have input data with the corresponding output, so to
solve such cases, we need unsupervised learning
Different aspects of unsupervised learning:
1. Clustering
2. Component Analysis
3. Self Organizing Maps (SOM)
4. Autoencoding neural networks
Clustering
• Clustering is essentially aggregating the samples into groups. The criterion used
for deciding membership in a group is determined using some form of metric or
distance.
• A simple method of clustering is k-means clustering. The variable k denotes the number
of clusters. The method expects the user to determine the value of k before starting to
apply the algorithm.
• k-Means Clustering
The k-means clustering algorithm can be summarized as follows:
1. Start with a default value of k, which is the number of clusters to find in the given data.
2. Randomly initialize the k cluster centers as k samples in training data, such that there are
no duplicates.
3. Assign each of the training samples to one of the k cluster centers based on a chosen
distance metric.
4. Once the classes are created, update the centers of each class as mean of all the
samples in that class.
5. Repeat steps 3 and 4 until there is no change in the cluster centers.
 The distance metric used in the algorithm is typically the Euclidean distance.
 In most practical situations, where the clusters are not well separated, or the
number of naturally occurring clusters differs from the initialized value of k, the
algorithm may not converge.
 The cluster centers can keep oscillating between two different values in
subsequent iterations, or they can keep shifting from one set of clusters to
another. In such cases, multiple optimizations of the algorithm have been proposed.
Optimizations for k-Means Clustering
1. Change the stopping criterion from absolute “no change” to cluster centers to allow for a
small change in the clusters.
2. Restrict the number of iterations to a maximum number of iterations.
3. Find the number of samples in each cluster and if the number of samples is less than a
certain threshold, delete that cluster and repeat the process.
4. Find the intra-cluster distance versus inter-cluster distance, and if two clusters are too close
to each other relative to other clusters, merge them and repeat the process.
5. If some clusters are getting too big, apply a threshold of maximum number of samples in a
cluster and split the cluster into two or more clusters and repeat the process.
Figure: progress of the k-means clustering algorithm iterating through steps to
converge on the desired clusters.
Improvements to k-Means Clustering
Even after the multiple optimizations described above are applied, there are cases
when the results are still suboptimal, and some further improvements can be applied:
• Hierarchical k-Means Clustering
• Fuzzy k-Means Clustering
Hierarchical k-Means Clustering
In some cases, using the same k-means clustering algorithm recursively can be helpful.
After each successful completion of the clustering, a new set of random clusters is
initialized inside each cluster, and the same algorithm is repeated in the subset of each
cluster to find the sub-clusters. This is called the hierarchical k-means clustering method.
Fuzzy k-Means Clustering
In the traditional k-means clustering algorithm, after the cluster centers are chosen or
updated, each training sample is grouped into the single nearest cluster. Instead of such
absolute grouping, the fuzzy k-means algorithm suggests the use of probabilistic grouping.
In this case, each sample has a non-zero probability of belonging to multiple clusters at the
same time: the nearer the cluster, the higher the probability.
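The probabilistic grouping can be sketched as a membership function. This is a sketch of the standard fuzzy c-means membership formula; the fuzzifier m and the toy points are assumptions for the demo:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Probability of each sample belonging to each cluster.

    Nearer centers get higher membership; m > 1 controls how soft the
    assignment is (m close to 1 approaches hard k-means assignment).
    """
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)             # avoid division by zero at a center
    inv = d ** (-2.0 / (m - 1))
    return inv / inv.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.array([[0.1, 0.1],                # very close to the first center
              [2.5, 2.5]])               # exactly between the two centers
U = fuzzy_memberships(X, centers)
```

Each row of U sums to 1: the first sample belongs almost entirely to the first cluster, while the midpoint sample splits 50/50 between the two.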
Component Analysis
Another important aspect of unsupervised machine learning is dimensionality reduction.
Component analysis methods are quite effective in this regard, and are similar in spirit to
principal component analysis (PCA).
Independent Component Analysis (ICA)
ICA takes a very different and more probabilistic approach towards finding the core
dimensions in the data by making the assumption that given data is generated as a result
of combination of a finite set of independent components. These independent components
are not directly observable and hence sometimes referred to as latent components.
Mathematically, ICA models each given data sample xi, i = 1, 2, ..., n, as
    xi = a1 s1 + a2 s2 + ... + ak sk,
where the aj represent the weights for the corresponding k independent components sj.
The cost function used to find the values of aj is typically based on mutual
information.
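The mixing model and its inversion can be illustrated with a minimal FastICA-style fixed-point iteration. This is a sketch on synthetic data: the uniform/Laplace sources, the mixing matrix, and the tanh contrast function are all assumptions for the demo, not part of the text above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# two independent non-Gaussian latent sources s_j (not directly observable)
s = np.vstack([rng.uniform(-1, 1, n),        # sub-Gaussian source
               rng.laplace(size=n)])         # super-Gaussian source
A = np.array([[1.0, 0.5], [0.5, 1.0]])       # unknown mixing weights a_j
x = A @ s                                    # observed data x_i

# center and whiten the observations
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = (E @ np.diag(d ** -0.5) @ E.T) @ x

# symmetric FastICA fixed-point iteration with contrast g = tanh
W = rng.standard_normal((2, 2))
for _ in range(300):
    g = np.tanh(W @ z)
    W_new = (g @ z.T) / n - np.diag((1 - g ** 2).mean(axis=1)) @ W
    u, _, vt = np.linalg.svd(W_new)          # symmetric decorrelation
    W = u @ vt

recovered = W @ z    # estimates of the sources, up to sign and ordering
```

The recovered components match the original sources up to sign and permutation, which is the usual ambiguity in ICA.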
Self Organizing Maps (SOM)
• Self-organizing maps, also called self-organizing feature maps, present a neural-network-
based unsupervised learning system.
• SOMs define a different type of cost function that is based on similarity in the
neighborhood. The idea here is to maintain the topological distribution of the data while
expressing it in smaller dimensions efficiently.
• In order to illustrate the functioning of SOM, it is useful to take an actual example.
Fig. 6.2 shows data distributed in 3 dimensions.
• This is synthetically generated data with an ideal distribution to illustrate the concept.
Figure 6.2: original 3-dimensional distribution of the data and its 2-dimensional
representation generated by SOM.
• The data is essentially a 2-dimensional plane folded into 3 dimensions. SOM unfolds the
plane back into 2 dimensions as shown in the bottom figure. With this unfolding, the
topological behavior of the original distribution is still preserved. All the samples that
are neighbors in the original distribution are still neighbors. Also, the relative distances
of different points from one another are preserved in that order.
Fig.: The SOM essentially unfolds a 2-dimensional plane folded into 3-dimensional space.
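A minimal SOM training loop illustrates the unfolding idea. This is a sketch, not the exact setup of Fig. 6.2; the planar synthetic data, the map size, and the decay schedules are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
# a 2-dimensional plane embedded in 3-dimensional space
uv = rng.random((500, 2))
data = np.column_stack([uv[:, 0], uv[:, 1], uv.mean(axis=1)])

grid = 10                                   # 10x10 map of 3-D weight vectors
w = rng.random((grid, grid, 3))
gy, gx = np.mgrid[0:grid, 0:grid]

def quantization_error(w):
    """Average distance from each sample to its best matching unit."""
    d = np.linalg.norm(data[:, None, :] - w.reshape(-1, 3)[None, :, :], axis=2)
    return d.min(axis=1).mean()

qe_before = quantization_error(w)
n_iter = 3000
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # best matching unit: the grid cell whose weight is closest to the sample
    by, bx = np.unravel_index(((w - x) ** 2).sum(axis=2).argmin(), (grid, grid))
    # learning rate and neighborhood radius both shrink over time
    lr = 0.5 * (1 - t / n_iter)
    sigma = max(1.0, (grid / 2) * (1 - t / n_iter))
    # grid neighbors of the BMU are pulled toward the sample too; this is
    # what preserves the topology of the original distribution
    h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
    w += lr * h[..., None] * (x - w)
qe_after = quantization_error(w)
```

After training, the 10x10 grid of weight vectors has spread out over the embedded plane, so each sample lies close to some map unit while grid neighbors stay neighbors in the data space.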
Autoencoding Neural Networks
• Autoencoding neural networks, or just autoencoders, are a type of neural network that
works without any labels and belongs to the class of unsupervised learning.
• The figure below shows the architecture of an autoencoding neural network. There is an
input layer matching the dimensionality of the input and a hidden layer with reduced
dimensionality, followed by an output layer with the same dimensionality as the input.
• The target here is to regenerate the input at the output stage.
• The network is trained to regenerate the input at the output layer. Thus the labels are
essentially the same as the input.
• The unique aspect of autoencoding networks is the reduced dimensionality of the hidden
layer. If an autoencoding network is successfully trained within the required error
margins, then in essence we are representing the input in a lower-dimensional space
in the form of the coefficients of the nodes at the hidden layer.
• Furthermore, the dimensionality of the hidden layer is programmable. Typically, with
the use of linear activation functions, the lower-dimensional representation
generated by autoencoding networks resembles the dimensionality reduction
obtained from PCA.
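A minimal linear autoencoder trained by gradient descent illustrates the bottleneck idea. This is a sketch: the 4-D data lying near a 2-D subspace, the learning rate, and the iteration count are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
# 4-dimensional data that actually lives near a 2-dimensional subspace
latent = rng.standard_normal((300, 2))
X = latent @ rng.standard_normal((2, 4)) + 0.01 * rng.standard_normal((300, 4))

W1 = 0.1 * rng.standard_normal((4, 2))   # encoder: input -> 2-unit hidden layer
W2 = 0.1 * rng.standard_normal((2, 4))   # decoder: hidden -> reconstructed input

mse_before = ((X - (X @ W1) @ W2) ** 2).mean()
lr = 0.02
for _ in range(5000):
    H = X @ W1                # hidden-layer codes: the reduced representation
    err = H @ W2 - X          # reconstruction error: the target IS the input
    gW2 = H.T @ err / len(X)  # gradients of the mean squared error
    gW1 = X.T @ (err @ W2.T) / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2
mse_after = ((X - (X @ W1) @ W2) ** 2).mean()
```

Because the activations are linear, the 2-unit bottleneck ends up spanning the same subspace that PCA with two components would find, matching the remark above.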
8. Featurization
Featurization is the process of converting varied forms of data into numerical data
which can be used by basic ML algorithms. Data can be text, images, videos,
graphs, various database tables, time-series, categorical features, etc.
Missing Data Imputation:
1.Complete case analysis
2.Mean / Median / Mode imputation
3.Random Sample Imputation
4.Replacement by Arbitrary Value
5.Missing Value Indicator
6.Multivariate imputation
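Mean imputation (item 2 above) can be sketched with NumPy; the toy matrix is a made-up example:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [5.0, np.nan]])

# mean imputation: replace each NaN by the mean of the observed
# values in the same column
col_means = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```

Median and mode imputation work the same way, swapping `np.nanmean` for the corresponding statistic.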
Categorical Encoding:
1.One hot encoding
2.Count and Frequency encoding
3.Target encoding / Mean encoding
4.Ordinal encoding
5.Weight of Evidence
6.Rare label encoding
7.BaseN, feature hashing and others
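One-hot encoding and count/frequency encoding (items 1 and 2 above) can be sketched in plain Python; the colour column is a made-up example:

```python
from collections import Counter

colours = ["red", "green", "blue", "green", "red", "red"]

# one-hot encoding: one binary column per distinct category
categories = sorted(set(colours))            # ['blue', 'green', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colours]

# count and frequency encoding: replace each category by how often it occurs
counts = Counter(colours)
count_enc = [counts[c] for c in colours]
freq_enc = [counts[c] / len(colours) for c in colours]
```

Count/frequency encoding keeps a single numeric column per feature, whereas one-hot encoding expands the feature into as many columns as there are categories.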
Variable Transformation:
1.Logarithm
2.Reciprocal
3.Square root
4.Exponential
5.Yeo-Johnson
6.Box-Cox
Discretisation:
1.Equal frequency discretisation
2.Equal width discretisation
3.Discretisation with trees
4.Discretisation with ChiMerge
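Equal frequency discretisation (item 1 above) can be sketched with quantile bin edges; the toy data is assumed. Note how the outlier 100 does not distort the bins the way equal-width bins would:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# equal frequency discretisation: bin edges at quantiles, so every bin
# receives roughly the same number of samples regardless of outliers
n_bins = 3
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
bins = np.digitize(x, edges[1:-1])     # bin index (0..n_bins-1) per value
```

Each of the three bins ends up with three or four of the eleven values, even though the value range of the last bin is much wider than the others.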
Outlier Removal:
1.Removing outliers
2.Treating outliers as NaN
3.Capping, Winsorisation
Feature Scaling:
1.Standardisation
2.MinMax Scaling
3.Mean Scaling
4.Max Absolute Scaling
5.Unit norm-Scaling
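Standardisation and min-max scaling (items 1 and 2 above) can be sketched as follows; the toy vector is assumed:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# standardisation: shift and scale to zero mean and unit variance
standardised = (x - x.mean()) / x.std()

# min-max scaling: rescale the values into the range [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
```

Standardisation is preferred when the downstream model assumes roughly Gaussian inputs; min-max scaling when a bounded range is required.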
Date and Time Engineering:
1.Extracting days, months, years, quarters, time elapsed
Feature Creation:
1.Sum, subtraction, mean, min, max, product, quotient of group of features
Aggregating Transaction Data:
1.Same as above but in same feature over time window
Extracting features from text:
1.Bag of words
2.tfidf
3.n-grams
4.word2vec
5.topic extraction
And finally extracting features from images.
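Bag of words and tf-idf (items 1 and 2 under text features) can be sketched in plain Python. The three toy documents and the simple log(N/df) idf variant are assumptions; library implementations such as scikit-learn's differ in smoothing details:

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# bag of words: raw term counts per document
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# tf-idf: term count scaled down for words that appear in many documents
n_docs = len(docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
tfidf = [[Counter(doc)[w] * math.log(n_docs / df[w]) for w in vocab]
         for doc in tokenized]
```

A word like "the" that occurs in every document gets idf = log(3/3) = 0, so tf-idf suppresses it while bag of words still counts it.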
Assignments
ASSIGNMENT - 1
Consider the UCI Adult Salary Predictor. The UCI Machine Learning Repository is one of the
well-known resources for finding sample problems in machine learning. It hosts multiple
data sets targeting a variety of different problems. The data presented here contains
information collected by a census about the salaries of people from different work-classes,
age groups, locations, etc. The objective is to predict the salary, specifically a binary
classification between salary greater than $50K/year and less than $50K/year.

73
Part A – Q & A
Unit - II
PART -A
S.No Question and Answer CO,K
1. What is a linear model and a nonlinear model? CO2, K1
Nonlinear regression is a form of regression analysis in
which data is fit to a model and then expressed as a
mathematical function. Simple linear regression relates two
variables (X and Y) with a straight line (y = mx + b), while
nonlinear regression relates the two variables in a nonlinear
(curved) relationship.
2. Why is linear regression called linear? CO2, K1
Linear regression fits a straight line or surface that minimizes
the discrepancies between predicted and actual output
values. There are simple linear regression calculators that use a
“least squares” method to discover the best-fit line for a set of
paired data
3. What is an example of linear regression? CO2, K1
We could use the equation to predict weight if we knew an
individual's height. In this example, if an individual was 70 inches
tall, we would predict his weight to be: Weight = 80 + 2 x (70)
= 220 lbs. In this simple linear regression, we are examining the
impact of one independent variable on the outcome.
4. Why is it called a regression? CO2, K1
"Regression" comes from "regress" which in turn comes from
latin "regressus" - to go back (to something). In that sense,
regression is the technique that allows "to go back" from messy,
hard to interpret data, to a clearer and more meaningful model.
5. Why is regression used? CO2, K1
Typically, a regression analysis is done for one of two purposes: In
order to predict the value of the dependent variable for individuals
for whom some information concerning the explanatory variables is
available, or in order to estimate the effect of some explanatory
variable on the dependent variable.
6. Why is classification used in machine learning? CO2, K1
A common job of machine learning algorithms is to recognize
objects and being able to separate them into categories.
This process is called classification, and it helps us segregate vast
quantities of data into discrete values, i.e. :distinct, like 0/1,
True/False, or a pre-defined output label

75
PART -A

S.No Question and Answer CO,K
7. What is the use of classification? CO2, K1
Classification is a machine learning technique used to categorize data
into a given number of classes. It will predict the class labels or
categories for the new data. A decision tree is a supervised machine
learning technique that predicts the class label of data objects
8. What are a perceptron and a neural network? CO2, K1
A perceptron is a single-layer neural network, and a multi-layer perceptron is called a
neural network. The perceptron is a linear (binary) classifier used in supervised
learning; it helps to classify the given input data.
9. What are the different types of perceptrons? CO2, K1
Based on the layers, Perceptron models are divided into two types.
These are as follows: Single-layer Perceptron Model. Multi-layer
Perceptron model.
10. What is a decision tree in machine learning? CO2, K1
Decision Trees are a type of Supervised Machine Learning
(that is you explain what the input is and what the corresponding
output is in the training data) where the data is continuously split
according to a certain parameter. The tree can be explained by two
entities, namely decision nodes and leaves.
11. How are decision trees used in learning? CO2, K1
A Decision tree is the denotative representation of a decision-making
process. Decision trees in artificial intelligence are used to arrive at
conclusions based on the data available from decisions made in
the past.
12. What are the types of decision tree models? CO2, K1
There are 4 popular types of decision tree algorithms: ID3, CART
(Classification and Regression Trees), Chi-Square and
Reduction in Variance.

13. What are the advantages and disadvantages of decision trees? CO2, K1
They are very fast and efficient compared to KNN and other
classification algorithms. Easy to understand, interpret, and visualize. Decision
trees can handle any type of data, whether numerical, categorical, or boolean.
Normalization is not required in the decision tree.
76
PART -A
S.No Question and Answer CO,K
14 What is SVM example? CO2, K1
Support Vector Machine (SVM) is a supervised machine learning
algorithm capable of performing classification, regression and
even outlier detection. The linear SVM classifier works by drawing a
straight line between two classes
15 How does SVM predict? CO2, K1
This is exactly what SVM does! It tries to find a line/hyperplane (in
multidimensional space) that separates these two classes. Then it
classifies the new point depending on whether it lies on the positive or
negative side of the hyperplane depending on the classes to predict.
16 What are the parameters of SVM? CO2, K1
SVM has the penalty parameter which controls the trade-off between
minimizing the training error and maximizing the classification margin.
Moreover, kernel parameters determine the distances between patterns
into the new space, dimensions of the new space, and the complexity of
the classification model
17 Can SVM overfit? CO2, K1
SVMs avoid overfitting by choosing a specific hyperplane among the
many that can separate the data in the feature space. SVMs find the
maximum margin hyperplane, the hyperplane that maximixes the
minimum distance from the hyperplane to the closest training point
18 How does SVM deal with outliers? CO2, K1
To deal with outliers, robust variants of SVM have been proposed,
such as the robust outlier detection algorithm and an SVM with
a bounded loss called the ramp loss. In this paper, we propose a
robust variant of SVM and investigate its robustness in terms of the
breakdown point
19 What are probabilistic models in machine learning? CO2, K1
ML models are probabilistic models, both in the sense that they assign
probabilities to predictions in a supervised learning context (see
later) and because they create distributions of the data in latent space
representations.
20 What is an example of a probabilistic model? CO2, K1
For example, if you live in a cold climate you know that traffic
tends to be more difficult when snow falls and covers the roads.
We could go a step further and hypothesize that there will be a strong
correlation between snowy weather and increased traffic incidents
77
PART -A
S.No Question and Answer CO,K
21 What is featurization? CO2, K1
Featurization is the process to convert varied forms of data to
numerical data which can be used for basic ML algorithms. Data
can be text data, images, videos, graphs, various database tables, time-
series, categorical features, etc
22 What are the main steps of feature engineering? CO2, K1
Feature engineering in ML consists of four main steps: Feature
Creation, Transformations, Feature Extraction, and Feature Selection.
Feature engineering consists of creation, transformation, extraction, and
selection of features, also known as variables, that are most conducive
to creating an accurate ML algorithm.
23 What is unsupervised learning in machine learning, with an example? CO2, K1
Unsupervised learning, also known as unsupervised machine learning,
uses machine learning algorithms to analyze and cluster
unlabeled datasets. These algorithms discover hidden patterns or
data groupings without the need for human intervention
24 What are the applications of unsupervised learning? CO2, K1
The main applications of unsupervised learning include clustering,
visualization, dimensionality reduction, finding association
rules, and anomaly detection
25 Which are the unsupervised machine learning algorithms? CO2, K1
Below is the list of some popular unsupervised learning
algorithms:
K-means clustering.
KNN (k-nearest neighbors)
Hierarchal clustering.
Anomaly detection.
Neural Networks.
Principle Component Analysis.
Independent Component Analysis.
Apriori algorithm.
78
Part B – Questions
PART -B
S.No Question and Answer CO,K
1. Briefly explain about Regression and its types with suitable examples. CO2,K3

2. Write short Notes on a) KNN algorithms b) classification and regression CO2,K3

3. Explain in detail about perceptron and neural networks . CO2,K3

4. Explain about decision tree and its various algorithms . CO2,K3

5. Explain in detail about support vector machines with suitable examples. CO2,K3

6. Explain about probabilistic models in machine learning. CO2,K3

7. Explain in detail about the featurization steps in machine learning. CO2,K3

80
Supportive online
Certification courses
(NPTEL, Swayam,
Coursera, Udemy, etc.,)
SUPPORTIVE ONLINE COURSES

S No  Course provider  Course title  Link
1  NPTEL  Introduction to Machine Learning  https://onlinecourses.nptel.ac.in/noc22_cs29
2  Coursera  Building Machine Learning Models  http://surl.li/crwpi
3  Coursera  Predictive Analytics  https://www.coursera.org/learn/population-health-predictive-analytics

82
Real time Applications in
day to day life and to
Industry
REAL TIME APPLICATIONS IN DAY TO DAY LIFE
AND TO INDUSTRY

i) Optical character recognition
The optical character recognition problem, which is the problem of recognizing
character codes from their images, is an example of classification problem. This is an
example where there are multiple classes, as many as there are characters we would
like to recognize. Especially interesting is the case when the characters are
handwritten. People have different handwriting styles; characters may be written small
or large, slanted, with a pen or pencil, and there are many possible images
corresponding to the same character.
ii) Face recognition
In the case of face recognition, the input is an image, the classes are people to be
recognized, and the learning program should learn to associate the face images to
identities. This problem is more difficult than optical character recognition because
there are more classes, input image is larger, and a face is three-dimensional and
differences in pose and lighting cause significant changes in the image.
iii) Speech recognition
In speech recognition, the input is acoustic and the classes are words that can be
uttered.
iv) Medical diagnosis
In medical diagnosis, the inputs are the relevant information we have about the patient
and the classes are the illnesses. The inputs contain the patient’s age, gender, past
medical history, and current symptoms. Some tests may not have been applied to the
patient, and thus these inputs would be missing.

84
Content Beyond Syllabus
Contents beyond the Syllabus

Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent
itemsets in a dataset for boolean association rules. The algorithm is named Apriori because it
uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search
where frequent k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important
property is used called Apriori property which helps by reducing the search space.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the
Apriori algorithm is the anti-monotonicity of the support measure: if an itemset is
infrequent, all its supersets will be infrequent.
Consider the following dataset; we will find its frequent itemsets and generate association
rules from them.

minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called
C1(candidate set)

86
(II) compare candidate set item’s support count with minimum support count(here
min_support=2 if support_count of candidate set items is less than min_support then remove
those items). This gives us itemset L1.

Step-2: K=2
•Generate candidate set C2 using L1 (this is called join step). Condition of joining Lk-1 and
Lk-1 is that it should have (K-2) elements in common.
•Check all subsets of an itemset are frequent or not and if not frequent remove that
itemset.(Example subset of{I1, I2} are {I1}, {I2} they are frequent.Check for each itemset)
•Now find support count of these itemsets by searching in dataset.

(II) compare candidate (C2) support count with minimum support count(here min_support=2 if
support_count of candidate set item is less than min_support then remove those items) this
gives us itemset L2.
Step-3:
•Generate candidate set C3 using L2 (join step). Condition of joining Lk-1 and Lk-1 is that it
should have (K-2) elements in common. So here, for L2, first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}
•Check if all subsets of these itemsets are frequent or not and if not, then remove that
itemset.(Here subset of {I1, I2, I3} are {I1, I2},{I2, I3},{I1, I3} which are frequent. For {I2, I3,
I4}, subset {I3, I4} is not frequent so remove it. Similarly check for every itemset)
•find support count of these remaining itemset by searching in dataset.
(II) Compare candidate (C3) support count with minimum support count(here min_support=2
if support_count of candidate set item is less than min_support then remove those items) this
gives us itemset L3

Step-4:
•Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4)
is that they should have (K-2) elements in common. So here, for L3, the first 2 elements
(items) should match.
•Check all subsets of these itemsets are frequent or not (Here itemset formed by joining L3
is {I1, I2, I3, I5} so its subset contains {I1, I3, I5}, which is not frequent). So no itemset in
C4
•We stop here because no frequent itemsets are found further

Thus, we have discovered all the frequent item-sets. Now generation of strong
association rule comes into picture. For that we need to calculate confidence of each
rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread
also bought butter.
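The level-wise search above can be sketched in plain Python. The nine transactions below are an assumption: the classic I1–I5 example dataset whose supports match the steps quoted above (e.g. support({I1, I2, I5}) = 2):

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support = 2

def support(itemset):
    # number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
frequent = list(L)

k = 2
while L:
    # join step: merge (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # prune step (Apriori property): every (k-1)-subset must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent
                         for s in combinations(c, k - 1))}
    L = [c for c in candidates if support(c) >= min_support]
    frequent += L
    k += 1
```

Association rules are then read off the frequent itemsets, keeping those whose confidence, support(X ∪ Y) / support(X), meets the 60% threshold.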
Assessment Schedule
(Proposed Date & Actual
Date)
Assessment Schedule

S.no Assessment Test Date


1. First Internal Assessment 21.09.2022
2. Second Internal Assessment 07.11.2022
3 Model Examination 08.12.2022

90
Prescribed Text Books &
Reference
Prescribed Text Books & Reference Books

TEXT BOOKS
1. Ameet V Joshi, Machine Learning and Artificial Intelligence, Springer Publications, 2020
2. John D. Kelleher, Brain Mac Namee, Aoife D’ Arcy, Fundamentals of Machine learning for
Predictive Data Analytics, Algorithms, Worked Examples and case studies, MIT press,2015
REFERENCES
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer Publications,
2011
2. Stuart Jonathan Russell, Peter Norvig, John Canny, Artificial Intelligence: A Modern
Approach, Prentice Hall, 2020
3. Machine Learning Dummies, John Paul Muller, Luca Massaron, Wiley Publications, 2021

92
Mini Project Suggestions
Mini Project Suggestion

a) An emergency room in a hospital measures 17 variables like blood pressure, age, etc. of
newly admitted patients. A decision has to be made whether to put the patient in an
ICU. Due to the high cost of ICU, only patients who may survive a month or more are
given higher priority. Such patients are labeled as “low-risk patients” and others are
labeled “high-risk patients”. The problem is to device a rule to classify a patient as a
“low-risk patient” or a “high-risk patient”.

b) A credit card company receives hundreds of thousands of applications for new cards.
The applications contain information regarding several attributes like annual salary, age,
etc. The problem is to devise a rule to classify the applicants to those who are credit-
worthy, who are not credit-worthy or to those who require further analysis.

c) Astronomers have been cataloguing distant objects in the sky using digital images
created using special devices. The objects are to be labeled as star, galaxy, nebula, etc.
The data is highly noisy and are very faint. The problem is to device a rule using which
a distant object can be correctly labeled.

94
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.

95
