ML Notes Unit 1-2


UNIT-I

Introduction:
Basic concepts: Definition of learning systems, Goals and applications of machine learning. Aspects of
developing a learning system: training data, concept representation, function approximation.
Types of Learning: Supervised learning and unsupervised learning. Overview of classification: setup, training,
test, validation dataset, over fitting.
Classification Families: linear discriminative, non-linear discriminative, decision trees, probabilistic
(conditional and generative), nearest neighbor. [T1,
T2][No. of Hrs: 12]
UNIT-II
Logistic regression, Perceptron, Exponential family, Generative learning algorithms, Gaussian discriminant
analysis, Naive Bayes, Support vector machines: Optimal hyper plane, Kernels. Model selection and feature
selection. Combining classifiers: Bagging, boosting (The Ada boost algorithm), Evaluating and debugging
learning algorithms, Classification errors.
[T1, T2][No. of Hrs: 11]
1.1 Basic concepts:

Definition of learning systems

A machine is said to be learning from past experience (data fed in) with
respect to some class of tasks if its performance at those tasks improves with
the experience.
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks T, as measured by P, improves with
experience E.

Examples

Handwriting recognition learning problem

• Task T: Recognising and classifying handwritten words within images

• Performance P: Percent of words correctly classified

• Training experience E: A dataset of handwritten words with given classifications

A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.

1.1.1 Basic components of learning process

The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Various
components and the steps involved in the learning process.
1. Data storage- Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage as a
foundation for advanced reasoning.

2. Abstraction- The second component of the learning process is known as abstraction.


Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application
of known models and creation of new models.

3. Generalization- The third component of the learning process is known as generalisation.


The term generalization describes the process of turning the knowledge about stored data into
a form that can be utilized for future action. These actions are to be carried out on tasks that
are similar, but not identical, to those that have been seen before. In generalization, the goal
is to discover those properties of the data that will be most relevant to future tasks.

4. Evaluation- Evaluation is the last component of the learning process. It is the process of
giving feedback to the user to measure the utility of the learned knowledge. This feedback is
then utilised to effect improvements in the whole learning process.

1.1.2 Machine learning algorithms build a mathematical model based on sample data,
known as “training data”, in order to make predictions or decisions without being explicitly
programmed to do so.

Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being
explicitly programmed.”

Machine Learning (ML) is basically that field of computer science with the help of which
computer systems can provide sense to data in much the same way as human beings do. In
simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by
using an algorithm or method. The key focus of ML is to allow computer systems to learn
from experience without being explicitly programmed or human intervention.

1.1.3 Goals and applications of machine learning


Goals of ML:

1. To make the computers smarter, more intelligent. The more direct objective in this aspect
is to develop systems (programs) for specific practical learning tasks in application domains

2. To develop computational models of human learning process and perform computer


simulations, study in this aspect is also called cognitive modelling
3. To explore new learning methods and develop general learning algorithms independent of
applications.

Applications of machine learning:

1) Supervised ML Algorithms

Multiple Regression Modeling: this is where continuous statistical values are computed on a continual basis.

Multi-class Classification: this is where the ML system must choose among more than two
possible outcomes.

Binary Classification: this is where the system must assign each data point to one of exactly
two distinct categories.

2) Unsupervised ML Algorithms

Dimensionality Reduction: Here, variables that are not needed are automatically excluded to
streamline the prediction process.

Association Mining/Skewness Detection: This is where records that are similar to one another,
or unlike the rest, are detected and identified.

Statistical Clustering: Here, datasets are subdivided based on their degree of similarity to one another.

3) Supervised and Unsupervised ML Algorithms

Foreign Language Translation: this is where the ML system is taught to translate a language while being provided with less than 50% of the total words and vocabulary.

Data Labeling: In these applications, ML systems can take a very small number of inputs
(datasets), and automatically create and apply labels to a much larger dataset.

4) Reinforcement ML Algorithms

Resource Allocation: a key principle of economics is how to do more with a finite set of
resources. These ML systems are used to determine the optimal mix of inputs that yields the
maximum possible output.

Robotics Technology: typically found in the manufacturing sector, such as the production of
cars. When these types of algorithms are used, robots powered by machine learning can learn
how best to accomplish tasks based on the mistakes or achievements they make.

Effective machine learning techniques can solve complex problems that engineers may
struggle with. Used correctly, machine learning is capable of disrupting and creating new
capabilities in most industries. This can help improve productivity while reducing costs and
increasing ROI. 

Machine learning has also seen usage in the business sector when it comes to understanding
their respective customer base.

Typical examples include the following:

1. Breaking down the lifecycle of the customer acquisition, from identifying prospects to
becoming paying customers.
2. Knowing when customers leave to acquire products and services from the
competition.
3. Learning how to market the right products/services to the appropriate customer
segment at the proper timing.
4. In retail business, machine learning is used to study consumer behaviour and
understand the buying patterns of the most valuable customers.
5. In finance, banks analyze their past data to build models to use in credit applications,
fraud detection, and the stock market
6. In manufacturing, learning models are used for optimization, control, and
troubleshooting.
7. In medicine, learning programs are used for medical diagnosis.
8. In telecommunications, call patterns are analyzed for network optimization and
maximizing the quality of service.
9. In science, large amounts of data in physics, astronomy, and biology can only be
analyzed fast enough by computers. The World Wide Web is huge; it is constantly
growing and searching for relevant information cannot be done manually.
10. In artificial intelligence, it is used to teach a system to learn and adapt to changes so
that the system designer need not foresee and provide solutions for all possible
situations.
11. It is used to find solutions to many problems in vision, speech recognition, and
robotics.
12. Machine learning methods are applied in the design of computer-controlled vehicles
to steer correctly when driving on a variety of roads.
13. Machine learning methods have been used to develop programmes for playing games
such as chess, backgammon and Go.

1.1.4 History:-

The early history of Machine Learning (Pre-1940):

● 1834: In 1834, Charles Babbage, the father of the computer, conceived a device that
could be programmed with punch cards. However, the machine was never built, but
all modern computers rely on its logical structure.

● 1936: In 1936, Alan Turing described how a machine could determine and
execute a set of instructions (the universal Turing machine).
The era of stored program computers:

● 1940s: ENIAC, the first electronic general-purpose computer, was built; it was
programmed manually. Stored-program computers such as EDSAC (1949) and
EDVAC (1951) followed.

● 1943: In 1943, a neural network was first modelled using electrical circuits. Around
1950, scientists started applying the idea and analysing how human neurons might
work.

Computer machinery and intelligence:

● 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can
machines think?"

Machine intelligence in Games:

● 1952: Arthur Samuel, who was a pioneer of machine learning, created a program
that helped an IBM computer play checkers. The program performed better the more
it played.

● 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:

● The period from 1974 to 1980 was a tough time for AI and ML researchers; this
period is known as the AI winter.

● During this period, machine translation failed to deliver on its promises, public
interest in AI dropped, and government funding for research was reduced.

Machine Learning from theory to reality

● 1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.

● 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in
one week.

● 1997: The IBM's Deep blue intelligent computer won the chess game against the
chess expert Garry Kasparov, and it became the first computer which had beaten a
human chess expert.

Machine Learning at 21st century

● 2006: In 2006, computer scientist Geoffrey Hinton rebranded neural-network
research as "deep learning," and it has since become one of the most trending
technologies.

● 2012: In 2012, Google created a deep neural network which learned to recognize the
image of humans and cats in YouTube videos.

● 2014: In 2014, the chatbot "Eugene Goostman" was claimed to have passed the Turing Test. It was
the first chatbot to convince 33% of the human judges that it was not a machine.

● 2014: DeepFace was a deep neural network created by Facebook, and they claimed
that it could recognize a person with the same precision as a human can do.

● 2016: AlphaGo beat Lee Sedol, one of the world's top-ranked Go players. In
2017 it beat the then world number one, Ke Jie.

● 2017: In 2017, Alphabet's Jigsaw team built an intelligent system to detect online
trolling. It read millions of comments from different websites in order to learn to
flag toxic comments.

1.1.5 Aspects of developing a learning system: training data, concept


representation, function approximation.

To make a system learn, a training data is to be provided. For example if we want to make a
system learn to differentiate between animals and birds, then we provide it with some
examples of animals and some example of birds (labelled dataset i.e. each example also
contains information of whether it is a bird or animal).

From these examples, the system learns some concept, for example, birds have wings, a beak
and two legs whereas animals have four legs and have a nose, etc.

After concept representation, the system should be able to identify new objects. However, the
learning system needs to use function approximation, as some new examples may not exactly
match the learned concepts; e.g. an elephant has a trunk (which is neither a nose nor a
beak), so the system should be able to approximate whether it is an animal or a bird.

To design a learning system that follows the learning process, we need to consider a
few design choices. The design choices will be to decide the following key components.

● Training data
● Concept representation
● Function approximation

Training Data:-Training data is also known as training dataset, learning set & training set, it
is an essential component of every machine learning model and helps them make accurate
predictions or perform a desired task.

REMEMBER: A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.

Problem 1: Handwriting recognition learning problem

For handwriting recognition learning problem, TPE would be,

Task T: To recognize and classify handwritten words within the given images.
Performance measure P: Total percent of words being correctly classified by the program.

Training experience E: A set of handwritten words with given classifications/labels.

Problem 2: Spam Mail detection learning problem

For a system being designed to detect spam emails, TPE would be,

Task T: To recognize and classify mails into 'spam' or 'not spam'.

Performance measure P: Total percent of mails being correctly classified as 'spam' (or 'not
spam' ) by the program.

Training experience E: A set of mails with given labels ('spam' / 'not spam').

Problem 3: Checkers learning problem

For a checkers learning problem, TPE would be,

Task T: To play checkers

Performance measure P: Total percent of the game won in the tournament.

Training experience E: A set of games played against itself

If we are able to find the factors T, P, and E of a learning problem, we will be able to decide
the following three key components:

1. The exact type of knowledge to be learned (Choosing the Target Function)

2. A representation for this target knowledge (Choosing a representation for the Target
Function)

3. A learning mechanism (Choosing an approximation algorithm for the Target Function)

Steps for Designing Learning System are:

STEP 1: Choosing the Training Experience:

The first and most important task is to choose the training data or training experience which
will be fed to the machine learning algorithm. It is important to note that this data or
experience has a significant impact on the success or failure of the model, so the training
data or experience should be chosen wisely.
Below are the attributes which impact the success or failure of the model:
 The training experience should be able to provide direct or indirect feedback regarding
choices. For example: while playing chess, the training data can provide feedback to the
learner such as "instead of this move, choosing that move increases the chances of
success."
 The second important attribute is the degree to which the learner controls the sequence
of training examples. For example: when training data is first fed to the machine its
accuracy is very low, but as it gains experience by playing again and again against
itself or an opponent, the algorithm gets feedback and controls the chess game
accordingly.
 The third important attribute is how well the training experience represents the
distribution of examples over which performance will be measured. A machine learning
algorithm gains experience by going through a number of different cases and examples;
the more examples it passes through, the more its performance will increase.

STEP 2: Choose the Target Function:-

The next important step is choosing the target function. This means that, according to the
knowledge fed to the algorithm, the machine learning system will choose a NextMove function
which describes what type of legal move should be taken. For example: while playing chess
against an opponent, when the opponent plays, the machine learning algorithm decides which of
the possible legal moves to take in order to succeed. Let's take the example of a
checkers-playing program that can generate the legal moves (M) from any board state (B).
The program needs only to learn how to choose the best move from among these legal
moves. Let's assume a function NextMove such that:

NextMove: B -> M

Here, B denotes the set of board states and M denotes the set of legal moves given a board
state. NextMove is our target function.

Concept representation:-

STEP 3: Choose the Representation of the Target Function

Once the machine knows all the possible legal moves, the next step is to choose a
representation for selecting the optimized move, e.g. linear equations, a hierarchical graph
representation, a tabular form, etc. The NextMove function should pick, out of these moves,
the one which gives the higher success rate. For example: if the machine has 4 possible
moves while playing chess, it will choose the optimized move which leads to success. We need
to choose a representation that the learning algorithm will use to describe the function
NextMove. Here, NextMove will be calculated as a linear combination of the following board
features:

X1: the number of black pieces on the board
X2: the number of red pieces on the board
X3: the number of black kings on the board
X4: the number of red kings on the board
X5: the number of black pieces threatened by red
X6: the number of red pieces threatened by black

Thus, the learned target function will be a linear function of the form:

NextMove = w0 + w1X1 + w2X2 + w3X3 + w4X4 + w5X5 + w6X6

where w0 through w6 are numerical coefficients, or weights, that will be chosen by the
learning algorithm.

The learned values of w1 through w6 determine the relative importance of the various board
features in determining the value of the board, while w0 provides an additive constant to the
board value.


STEP 4: Choose a Function Approximation Algorithm:-

An optimized move cannot be chosen from the training data alone. The learner has to work
through a set of examples; from these examples it approximates which moves should be chosen,
and the feedback it receives is used to refine that approximation. For example: when training
data for playing chess is fed to the algorithm, the machine will at first fail or succeed, and
from each failure or success it estimates which move should be chosen next and what its
success rate is. To learn the target function NextMove, we require a set of training examples,
each describing a specific board state b and a training value (the correct move or board
value) y for b. The training algorithm learns/approximates the coefficients w0, w1, ..., w6
from these training examples by estimating and adjusting the weights.
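A minimal Python sketch of this step, under the assumption that the board features X1..X6 have already been extracted and that a training value v_train is available for each board (e.g. derived from game outcomes); the weight-adjustment rule shown is the standard least-mean-squares update, and the feature values in the usage line are hypothetical.

# Linear evaluation and LMS weight adjustment (illustrative sketch).
def evaluate(weights, features):
    # NextMove estimate: w0 + w1*X1 + ... + w6*X6
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, lr=0.01):
    # Nudge every weight so the estimate moves toward the training value.
    error = v_train - evaluate(weights, features)
    weights[0] += lr * error                      # additive constant w0
    for i, x in enumerate(features):
        weights[i + 1] += lr * error * x
    return weights

w = [0.0] * 7                                     # w0..w6 start at zero
w = lms_update(w, [3, 2, 1, 0, 1, 2], v_train=1.0)  # one hypothetical board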

STEP 5: Final Design:

The final design emerges after the system has gone through a number of examples, failures and
successes, and correct and incorrect decisions, learning what the next step should be.
Example: Deep Blue, an ML-based intelligent computer, won a chess match against the chess
expert Garry Kasparov and became the first computer to beat a human chess expert.

(Figure: Choices in designing the checkers learning problem)
1.2 Types of Learning:

a) Supervised learning: The computer is presented with example inputs and their
desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
Supervised learning is used in predictive analytics i.e. classification and regression.
Classification deals with discrete output, and regression deals with continuous output.
Types of Supervised Learning:
1. Classification: a supervised learning task where the output takes defined (discrete)
labels. For example, an output variable "Purchased" may take the labels 0 or 1, where 1
means the customer will purchase and 0 means the customer won't purchase. The goal here is
to predict discrete values belonging to a particular class and to evaluate the model on the
basis of accuracy.
Example: Gmail classifies mails into more than one class, such as social, promotions, updates
and forums.
2. Regression: a supervised learning task where the output takes a continuous value.
Example: an output variable such as wind speed does not take discrete values but is
continuous within a particular range. The goal here is to predict a value as close to the
actual output as the model can, and evaluation is then done by calculating the error; the
smaller the error, the greater the accuracy of the regression model.

Disadvantages of supervised learning:-

1. Computation time can be very large for supervised learning.
2. Unwanted data can bring down efficiency.
3. Preprocessing the data is no small challenge.
4. The model is always in need of updates.
5. Supervised algorithms can easily be overfitted.

Advantages of supervised learning:-

1. Automation of everything.
2. Wide range of applications.
3. Scope of improvement.
4. Efficient handling of data.
5. Best for education.

b) Unsupervised learning: No labels are given to the learning algorithm, leaving it on
its own to find patterns in its input. Unsupervised learning deals with pattern mining
and clustering.

Types of Unsupervised Learning:-


· Clustering: broadly, this technique is applied to group data based on the different
patterns our machine learning model finds. For example, if we are not given an output
parameter value, this technique can be used to group clients based on the input
parameters provided by our data (see the sketch below).
· Association: this is a rule-based ML technique that finds useful relations between
parameters of a large data set. For example, shopping stores use algorithms based on this
technique to find the relationship between the sale of one product and the sales of
others, based on customer behaviour. Once trained well, such models can be used to
increase sales by planning different offers.
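A minimal clustering sketch in Python, assuming scikit-learn is installed; the client feature values (income and spending score) are made up purely for illustration.

# Group unlabeled clients into clusters with k-means.
from sklearn.cluster import KMeans
import numpy as np

# Each row: [annual_income, spending_score] for one client (hypothetical values).
clients = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [60, 50], [62, 42]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clients)
print(kmeans.labels_)           # cluster index assigned to each client
print(kmeans.cluster_centers_)  # the centre of each discovered group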

Advantages of Unsupervised Learning:-

1) Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.

2) Unsupervised learning is preferable as it is easier to get unlabeled data than labeled data.

Disadvantages of Unsupervised Learning:-

1) Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.

2) The result of the unsupervised learning algorithm might be less accurate as input data is
not labeled, and algorithms do not know the exact output in advance

The main differences between Supervised and Unsupervised learning are given below:

1. Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
2. A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
3. A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
4. In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
5. The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
6. Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
7. Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
8. Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
9. A supervised learning model generally produces an accurate result; an unsupervised learning model may give a less accurate result in comparison.
10. Supervised learning is not close to true Artificial Intelligence, as we first train the model on each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns much as a child learns daily routine things from experience.
11. Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree and Bayesian Logic; unsupervised learning includes algorithms such as Clustering, KNN and the Apriori algorithm.
c) Semi-supervised Learning: its working lies between Supervised and Unsupervised
techniques. These techniques are used when we are dealing with data that is a little bit
labelled and the rest large portion of it is unlabelled. We can use the unsupervised techniques
to predict labels and then feed these labels to supervised techniques. This technique is mostly
applicable in the case of image data sets where usually all images are not labelled.

d) Reinforcement learning: A computer program interacts with a dynamic environment in
which it must perform a certain goal (such as driving a vehicle or playing a game against an
opponent). The program is provided feedback in terms of rewards and punishments as it
navigates its problem space. Reinforcement learning differs from standard supervised
learning in that correct input/output pairs are never presented, nor sub-optimal actions
explicitly corrected. It finds its main applications in control theory and robotics.
· Step 1 − First, we need to prepare an agent with some initial set of strategies.
· Step 2 − Then observe the environment and its current state.
· Step 3 − Next, select the optimal policy for the current state of the environment and
perform the appropriate action.
· Step 4 − Now, the agent receives a corresponding reward or penalty in accordance with
the action taken in the previous step.
· Step 5 − Now, the strategies can be updated if required.
· Step 6 − At last, repeat steps 2-5 until the agent has learned and adopted the optimal
policies (a minimal sketch follows below).
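A minimal Python sketch of the agent-environment loop in steps 1-6, using a simple Q-table with an epsilon-greedy policy; the 5-state "chain" environment and its reward function are toy assumptions made only for illustration.

import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]    # Step 1: initial strategy (Q-table)

def step(state, action):
    # Toy environment: action 1 moves right, action 0 moves left; reward at the last state.
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

alpha, gamma, epsilon = 0.1, 0.9, 0.2
for episode in range(200):                           # Step 6: repeat until the policy is learned
    state = 0                                        # Step 2: observe the current state
    for t in range(20):
        if random.random() < epsilon:                # Step 3: select an action (epsilon-greedy)
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        nxt, reward = step(state, action)            # Step 4: receive reward or penalty
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])  # Step 5: update
        state = nxt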

1.2.1 Datasets

Training Dataset:

Sample of data used to fit the model. It is the actual dataset that we use to train the model
(weights and biases). The model sees and learns from this data.

Validation Dataset

Sample of data used to provide an unbiased evaluation of a model fit on the training dataset
while tuning model hyper-parameters. Evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration. The model occasionally sees
this data, but never does it “learn” from this

Test Dataset

Sample of data used to provide an unbiased evaluation of a final model fit on the training
dataset. It is used once a model is completely trained (using the train and validation sets).
Test set is generally what is used to evaluate competing models. It contains carefully sampled
data that spans the various classes that the model would face, when used in the real world
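A minimal Python sketch of producing the three datasets, assuming scikit-learn is available; the iris dataset and the 60/20/20 ratios are illustrative choices, not a prescription.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First carve out a 20% test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%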

1.2.2 Overfitting and Underfitting

Let us consider that we are designing a machine learning model. A model is
said to be a good machine learning model if it generalizes to any new input
data from the problem domain in a proper way. This helps us make predictions
on future data that the model has never seen. Now, suppose we want to check
how well our machine learning model learns and generalizes to new data. For
that, we have overfitting and underfitting, which are majorly responsible for
the poor performance of machine learning algorithms.
Before diving further, let's understand two important terms:
 Bias: the assumptions made by a model to make the target function easier
to learn.
 Variance: if you train a model on training data and obtain a very low
error, but on changing the data and training the same model you
experience a high error, that is variance.

Overfitting

A statistical model is said to be overfitted when it fits the training data too closely (just
like fitting ourselves into oversized pants!). When a model is trained this way, it starts
learning from the noise and inaccurate data entries in our data set. The model then does not
categorize new data correctly, because of too many details and too much noise. Overfitting is
often caused by non-parametric and non-linear methods, because these types of machine learning
algorithms have more freedom in building the model from the dataset and can therefore build
unrealistic models. Solutions to avoid overfitting include using a linear algorithm if we have
linear data, or using parameters such as the maximal depth if we are using decision trees.
In a nutshell: high variance and low bias.

Techniques to reduce overfitting :

1. Increase training data.

2. Reduce model complexity.

3. Early stopping during the training phase (keep an eye on the loss over the training period;
as soon as the loss begins to increase, stop training).

4. Ridge Regularization and Lasso Regularization

5. Use dropout for neural networks to tackle overfitting.
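A minimal Python sketch of point 2 (reducing model complexity), assuming scikit-learn is installed; the breast-cancer dataset is only an example. Limiting the maximal depth of a decision tree typically narrows the gap between training and test accuracy that a fully grown (overfitted) tree shows.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):               # None = fully grown tree, 3 = restricted depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))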

Underfitting

A statistical model or a machine learning algorithm is said to be underfitting when it
cannot capture the underlying trend of the data (it's just like trying to fit into undersized
pants!). Underfitting destroys the accuracy of our machine learning model. Its occurrence
simply means that our model or algorithm does not fit the data well enough. It usually
happens when we have too little data to build an accurate model, or when we try to fit a
linear model to non-linear data. In such cases, the rules of the machine learning model are
too simple and flexible to be applied to such minimal data, and therefore the model will
probably make a lot of wrong predictions. Underfitting can be avoided by using more data and
by reducing the number of features through feature selection.

In a nutshell: high bias and low variance.

Techniques to reduce underfitting:

1. Increase model complexity

2. Increase number of features, performing feature engineering

3. Remove noise from the data.

4. Increase the number of epochs or increase the duration of training to get better results.

Good Fit in a Statistical Model:  


Ideally, a model that makes predictions with zero error is said to have a good fit on the
data. This situation is achievable at a spot between overfitting and underfitting. To
understand it, we have to look at the performance of our model over time as it learns from
the training dataset.
Over time, the model keeps learning, and the error on the training and testing data keeps
decreasing. If it learns for too long, the model becomes more prone to overfitting due to the
presence of noise and less useful details, and the performance of the model decreases. In
order to get a good fit, we stop at a point just before the test error starts increasing. At
this point, the model is said to have good skill on the training dataset as well as on our
unseen testing dataset.

1.3 Classification Families:

Classification is the process of categorizing a given set of data into classes. It can be
performed on both structured and unstructured data. The process starts with predicting the
class of given data points. The classes are often referred to as targets, labels or categories.

Some of the most important applications of classification algorithms are as follows −

• Speech Recognition

• Handwriting Recognition

• Biometric Identification

• Document Classification

Advantages:

• Mining Based Methods are cost-effective and efficient

• Helps in identifying criminal suspects


• Helps in predicting the risk of diseases

• Helps Banks and Financial Institutions to identify defaulters so that they may
approve Cards, Loan, etc.

Disadvantages:

Privacy: when data is collected, there is a chance that a company may give some information
about its customers to other vendors or use that information for its own profit.

Accuracy Problem: an accurate model must be selected in order to get the best accuracy and
results.

1.3.1 Linear Discriminative

Linear Discriminant Analysis (LDA), also called Normal Discriminant Analysis or Discriminant
Function Analysis, is a dimensionality reduction technique commonly used for supervised
classification problems. It is used for modelling differences between groups, i.e. separating
two or more classes. It projects features from a higher-dimensional space into a
lower-dimensional space.

For example, we have two classes and we need to separate them efficiently. Classes can have
multiple features. Using only a single feature to classify them may result in some
overlapping. So, we will keep on increasing the number of features for proper classification.

Here, LDA uses both axes (X and Y) to create a new axis and projects the data onto this new
axis in a way that maximizes the separation of the two categories, hence reducing the 2D
graph to a 1D graph.

Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes

2. Minimize the variation within each class
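A minimal Python sketch of this projection, assuming scikit-learn is installed; the two-class dataset is synthetic and only for illustration.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes described by 5 features (synthetic data, purely for illustration).
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)   # at most (n_classes - 1) components
X_1d = lda.fit_transform(X, y)                     # 5-D data projected onto a single axis
print(X_1d.shape)                                  # (200, 1)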


1.3.2 Non-linear Discriminative

LDA fails when the means of the class distributions are shared (i.e. the classes have the
same mean), because it then becomes impossible for LDA to find a new axis that makes the
classes linearly separable. In such cases, we use non-linear discriminant analysis.

Extensions to LDA:

1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).

2. Flexible Discriminant Analysis (FDA): non-linear combinations of inputs are used, such as splines.

3. Regularized Discriminant Analysis (RDA): introduces regularization into the estimate of the variance (actually the covariance), moderating the influence of different variables on LDA.

1.3.3 Decision Trees

· The tree is built by splitting the source set, which constitutes the root node of the tree,
into subsets, which constitute the successor children.

· Splitting is based on a set of splitting rules derived from classification features.

· The process is repeated on each derived subset in a recursive manner called recursive
partitioning.

· Recursion is complete when the subset at a node has all the same values of the target
variable, or when splitting no longer adds value to the predictions.

· Entropy is a measure of disorder or uncertainty.

· A metric is needed to measure the reduction of this disorder in our target variable/class
given additional information (features / independent variables) about it. This is Information
Gain (see the sketch below).
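A minimal Python sketch of the entropy and information-gain calculation used to choose a split; the label and feature columns below are the classic 14-example "play tennis" data, used here only as an illustration.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Entropy of the parent minus the weighted entropy of the children after the split.
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

play    = ['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']
outlook = ['sunny','sunny','overcast','rainy','rainy','rainy','overcast',
           'sunny','sunny','rainy','sunny','overcast','overcast','rainy']
print(information_gain(play, outlook))   # about 0.247 for this split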

Advantages of Decision Tree

1. Clear Visualization: The algorithm is simple to understand, interpret and visualize as the
idea is mostly used in our daily lives. Output of a Decision Tree can be easily interpreted by
humans.

2. Simple and easy to understand: Decision Tree looks like simple if-else statements which
are very easy to understand.

3. Decision Tree can be used for both classification and regression problems.

4. Decision Tree can handle both continuous and categorical variables.

5. No feature scaling required: No feature scaling (standardization and normalization)


required in case of Decision Tree as it uses rule based approach instead of distance
calculation.

6. Handles non-linear parameters efficiently: non-linear parameters don't affect the
performance of a Decision Tree, unlike curve-based algorithms. So, if there is high
non-linearity between the independent variables, Decision Trees may outperform other
curve-based algorithms.

7. Decision Tree can automatically handle missing values.

8. Decision Tree is usually robust to outliers and can handle them automatically.

9. Less Training Period: Training period is less as compared to Random Forest because it
generates only one tree unlike forest of trees in the Random Forest.

Disadvantages of Decision Tree

1. Overfitting: This is the main problem of the Decision Tree. It generally leads to overfitting
of the data which ultimately leads to wrong predictions. In order to fit the data (even noisy
data), it keeps generating new nodes and ultimately the tree becomes too complex to
interpret. In this way, it loses its generalization capabilities. It performs very well on the
trained data but starts making a lot of mistakes on the unseen data.

2. High variance: As mentioned in point 1, Decision Tree generally leads to the overfitting of
data. Due to the overfitting, there are very high chances of high variance in the output which
leads to many errors in the final estimation and shows high inaccuracy in the results. In order
to achieve zero bias (overfitting), it leads to high variance.

3. Unstable: Adding a new data point can lead to re-generation of the overall tree and all
nodes need to be recalculated and recreated.
4. Affected by noise: Little bit of noise can make it unstable which leads to wrong
predictions.

5. Not suitable for large datasets: If data size is large, then one single tree may grow complex
and lead to overfitting. So in this case, we should use Random Forest instead of a single
Decision Tree

1.4 Probabilistic (Conditional and Generative)

Probabilistic model gives the probability of each class for the given data instance. The class
whose probability comes out to be highest for the instance is assigned. A probabilistic model
can be further classified as conditional/discriminative and generative. A conditional model
uses a discriminant function to calculate the probabilities e.g. logistic regression, whereas a
generative model models the probability distribution of the given dataset and based on this
distribution predicts the probabilities of the instance belonging to each class e.g. Naive
Bayes.

Discriminative Model
The discriminative model is used particularly for supervised machine learning. Also called a
conditional model, it learns the boundaries between classes or labels in a dataset. It
separates classes using probability estimates and maximum likelihood, but it is not capable
of generating new data points. The ultimate goal of discriminative models is to separate one
class from another.

Types of discriminative models in machine learning include:

Logistic regression: It is one of the most popular machine learning algorithms. Logistic
regression is a mathematical model in statistics that uses previous data to estimate an event’s
probability. The output is a categorical or a discrete value. In principle, logistic regression is
similar to linear regression. However, there is a small difference between them. While linear
regression is used for regression problems, logistic regression is used for classification
problems.

Support vector machine: It is a supervised learning algorithm used both for classification and
regression problems. A type of discriminative modelling, support vector machine (SVM)
creates a decision boundary to segregate n-dimensional space into classes. The best decision
boundary is called a hyperplane created by choosing the extreme points called the support
vectors.

Decision tree: A type of supervised machine learning model where data is continuously split
according to certain parameters. It has two main entities–decision nodes and leaves. While
leaves are the final outcomes or decisions, nodes are the points where data is split.

Random forest: It is a flexible and easy-to-use machine learning algorithm that gives great
results without even using hyper-parameter tuning. Because of its simplicity and diversity, it
is one of the most used algorithms for both classification and regression tasks.

Generative models

Read link: https://developers.google.com/machine-learning/gan/generative

Generative models are a class of statistical models that generate new data instances. These
models are used in unsupervised machine learning to perform tasks such as probability and
likelihood estimation, modelling data points, and distinguishing between classes using these
probabilities. Generative models rely on the Bayes theorem to find the joint probability.
Common examples of generative models are:

Latent Dirichlet Allocation (LDA): a generative probabilistic model for collections of
discrete data, in which each collection is modelled as a finite mixture over latent topics.
Some common applications of LDA are collaborative filtering and content-based image retrieval.

Bayesian Network: also known as a Bayes network, it is a generative probabilistic graphical
model that gives an efficient representation of the joint probability distribution over a set
of random variables.

Hidden Markov model: It is a statistical model known for its effectiveness in modelling the
correlation between adjacent symbols, events or domains, finding major application in speech
recognition and digital communication.

Autoregressive model: An AR model predicts future values based on past values. This kind
of model is good at handling a wide range of time-series patterns.

Generative adversarial network: GANs have gained much popularity recently. A GAN model
has two parts–generator and discriminator. The generative model captures the data
distribution and the discriminative model estimates the probability of sample coming from
training data rather than the generative model.

Difference:

Discriminative models draw boundaries in the data space, while generative models describe how
data is placed throughout the space. Mathematically speaking, a discriminative model learns
parameters that maximise the conditional probability P(Y|X), whereas a generative model
learns parameters by maximising the joint probability P(X, Y).

Because of their different approaches to machine learning, both are suited for specific tasks.
Generative models are useful for unsupervised learning tasks and discriminative models work
better for supervised learning tasks.

1.5 Nearest Neighbour

READ THE FOLLOWING LINKS:

https://www.analyticsvidhya.com/blog/2021/04/simple-understanding-and-implementation-of-knn-algorithm/

https://www.geeksforgeeks.org/k-nearest-neighbours/

K-Nearest Neighbours (K-NN) is one of the simplest machine learning algorithms, based on the
supervised learning technique. The K-NN algorithm assumes similarity between the new case/data
and the available cases and puts the new case into the category that is most similar to the
available categories.

It stores all the available data and classifies a new data point based on its similarity to
that data, so when new data appears it can easily be classified into a well-suited category.
K-NN can be used for regression as well as classification. It is a non-parametric algorithm,
which means it makes no assumption about the underlying data, and a lazy learner algorithm,
because it does not learn from the training set immediately; instead it stores the dataset
and, at the time of classification, performs an action on the dataset.

1. Select the number K of the neighbors

2. Calculate the Euclidean distance of K number of neighbors

3. Take the K nearest neighbors as per the calculated Euclidean distance

4. Among these k neighbors, count the number of the data points in each category

5. Assign the new data point to the category for which the number of neighbors is maximum
(a minimal sketch follows below)
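A minimal Python sketch of the five steps above; the toy training points, the "bird"/"animal" labels and k = 3 are hypothetical values chosen purely for illustration.

from math import dist
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Steps 2-3: Euclidean distance to every stored point, keep the k nearest.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    # Steps 4-5: count labels among the neighbours and return the majority category.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 2), (6, 6), (7, 7), (6, 7)]
train_y = ['bird', 'bird', 'bird', 'animal', 'animal', 'animal']
print(knn_predict(train_X, train_y, query=(2, 1)))   # 'bird'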

Advantages

· simple to implement

· robust to the noisy training data


· more effective if the training data is large

Disadvantages

· needs to determine the value of K, which may sometimes be complex

· computation cost is high because the distance between the new data point and all the
training samples must be calculated
UNIT-II

2.1 Logistic regression


Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, true or
false, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.

o Logistic Regression is very similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.

o Logistic Regression is a significant machine learning algorithm because it has the


ability to provide probabilities and classify new data using continuous and discrete
datasets.

o Logistic Regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification. The logistic
(sigmoid) function is described below:
Logistic Function (Sigmoid Function):

● The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

● It maps any real value into another value within a range of 0 and 1.
● The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.

● In logistic regression, we use the concept of a threshold value, which defines the
boundary between classes 0 and 1: values above the threshold tend to 1, and values below
the threshold tend to 0.

Assumptions for Logistic Regression:

● The dependent variable must be categorical in nature.


● The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:

● We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

● In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation
by (1 - y):

y / (1 - y); this is 0 for y = 0 and infinity for y = 1

● But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.
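A minimal Python sketch showing how the sigmoid turns the linear score into a probability and how a threshold turns that probability into a class label; the coefficients b0 and b1 below are made-up values, not learned from any dataset.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -4.0, 1.5                        # hypothetical learned coefficients
for x in (1.0, 3.0, 5.0):
    p = sigmoid(b0 + b1 * x)              # probability that y = 1
    print(x, round(p, 3), int(p >= 0.5))  # class label after thresholding at 0.5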

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

● Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.

● Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

● Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".

Unlike linear regression, which outputs continuous numeric values, logistic regression
transforms its output using the logistic sigmoid function to return a probability value,
which can then be mapped to two or more discrete classes.
2.2 Perceptron

A perceptron is the simplest model of an Artificial Neural Network. It consists of a single
artificial neuron with the Heaviside step function as the activation function.

The Heaviside step function is written as:

f(x) = +1 if x > θ
f(x) = -1 if x ≤ θ
The perceptron is a linear binary classifier. The training phase of perceptron performs
multiple iterations on the training data points. Each iteration goes through all the training
instances, and if a misclassified instance is encountered, the parameters of the hyperplane are
changed so that the misclassified instance moves closer to the hyperplane or maybe even
across the hyperplane onto the correct side.

A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary
classifiers decide whether an input, usually represented by a vector of values, belongs to a
specific class. In short, a perceptron is a single-layer neural network. It consists of four
main parts: input values, synaptic weights and bias, a net sum, and an activation function.

Perceptron Learning Algorithm

Set all the weights to zero.
Until all the instances in the training data are classified correctly:
    For each instance I in the training data:
        If I is classified incorrectly by the perceptron:
            If I belongs to the first class, add it to the weight vector;
            else, subtract it from the weight vector.
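A minimal Python sketch of the learning loop above with the threshold θ taken as 0; the inputs are augmented with a constant 1 so the bias is learned like any other weight, and the logical-AND data at the bottom is a hypothetical linearly separable example.

def perceptron_train(X, y, epochs=100):
    w = [0.0] * (len(X[0]) + 1)                        # weights start at zero
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):                   # target is +1 or -1
            xi = [1.0] + list(xi)                      # bias input
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1
            if pred != target:                         # misclassified instance
                # add it to the weights for the first class (+1), subtract otherwise (-1)
                w = [wj + target * xj for wj, xj in zip(w, xi)]
                errors += 1
        if errors == 0:                                # all instances classified correctly
            break
    return w

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]                                    # logical AND with -1/+1 labels
print(perceptron_train(X, y))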

2.3 Exponential family
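A brief sketch of the idea, stated from the standard definition: a family of distributions is an exponential family if its density can be written in the form below, where η is the natural parameter, T(y) the sufficient statistic, a(η) the log-partition function and b(y) the base measure. The Bernoulli distribution (which underlies logistic regression) and the Gaussian distribution (which underlies Gaussian discriminant analysis) are both members of this family.

p(y; η) = b(y) exp(η^T T(y) − a(η))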


2.4 Generative learning algorithms

Generative Adversarial Networks (GANs) are a powerful class of neural networks that are
used for unsupervised learning. It was developed and introduced by Ian J. Goodfellow in
2014. GANs are basically made up of a system of two competing neural network models
which compete with each other and are able to analyze, capture and copy the variations
within a dataset. Generative approaches try to build a model of the positives and a model of
the negatives. You can think of a model as a “blueprint” for a class. A decision boundary is
formed where one model becomes more likely. As these create models of each class they can
be used for generation.

To create these models, a generative learning algorithm learns the joint probability
distribution P(x, y).

Now time for some maths!

The joint probability can be written as:

P(x, y) = P(x | y) . P(y) ….(i)

Also, using Bayes’ Rule we can write:

P(y | x) = P(x | y) . P(y) / P(x) ….(ii)

Since, to predict a class label y, we are only interested in the arg max , the denominator can
be removed from (ii).

Hence to predict the label y from the training example x, generative models evaluate:

f(x) = argmax_y P(y | x) = argmax_y P(x | y) . P(y)

The most important part in the above is P(x | y). This is what allows the model to be
generative! P(x | y) means – what x (features) are there given class y. Hence, with the joint
probability distribution function (i), given a y, you can calculate (“generate”) its
corresponding x. For this reason they are called generative models!

Generative learning algorithms make strong assumptions on the data.
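A minimal Python sketch of f(x) = argmax_y P(x|y) P(y) for a one-dimensional model with Gaussian class-conditionals; the per-class means, variances and priors are made-up numbers used only to show the argmax rule.

import math

classes = {
    # class label: (mean, variance, prior P(y)) -- hypothetical values
    'negative': (0.0, 1.0, 0.7),
    'positive': (3.0, 1.0, 0.3),
}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x):
    # Pick the class maximising P(x|y) * P(y); P(x) is the same for every class.
    return max(classes, key=lambda c: gaussian_pdf(x, *classes[c][:2]) * classes[c][2])

print(predict(0.5), predict(2.8))   # 'negative' 'positive'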

How does GANs work?

Generative Adversarial Networks (GANs) can be broken down into three parts:
· Generative: To learn a generative model, which describes how data is generated in
terms of a probabilistic model.
· Adversarial: The training of a model is done in an adversarial setting.
· Networks: Use deep neural networks as the artificial intelligence (AI) algorithms for
training purpose.

Different types of GANs:


GANs are now a very active topic of research and there have been many different types of
GAN implementation. Some of the important ones that are actively being used currently are
described below:
1. Vanilla GAN: This is the simplest type GAN. Here, the Generator and the
Discriminator are simple multi-layer perceptrons. In vanilla GAN, the algorithm is
really simple, it tries to optimize the mathematical equation using stochastic gradient
descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in
which some conditional parameters are put into place. In CGAN, an additional
parameter ‘y’ is added to the Generator for generating the corresponding data. Labels
are also put into the input to the Discriminator in order for the Discriminator to help
distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular also the
most successful implementation of GAN. It is composed of ConvNets in place of
multi-layer perceptrons. The ConvNets are implemented without max pooling, which
is in fact replaced by convolutional stride. Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible
image representation consisting of a set of band-pass images, spaced an octave apart,
plus a low-frequency residual. This approach uses multiple numbers of Generator and
Discriminator networks and different levels of the Laplacian Pyramid. This approach
is mainly used because it produces very high-quality images. The image is down-
sampled at first at each layer of the pyramid and then it is again up-scaled at each
layer in a backward pass where the image acquires some noise from the Conditional
GAN at these layers until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of
designing a GAN in which a deep neural network is used along with an adversarial
network in order to produce higher resolution images. This type of GAN is
particularly useful in optimally up-scaling native low-resolution images to enhance its
details minimizing errors while doing so.

2.5 Gaussian/Linear discriminant analysis

LDA is a generative learner as it makes assumption about the data distribution.

LDA makes some simplifying assumptions about your data:

· That your data is Gaussian: each variable is shaped like a bell curve when plotted.

· That each attribute has the same variance: the values of each variable vary around the mean
by the same amount on average.

With these assumptions, the LDA model estimates the mean and variance from your data for each
class.

The discriminant function used by LDA is:

f_i = μ_i C^-1 x_k^T - 0.5 μ_i C^-1 μ_i^T + ln(p_i)

where,

f_i is the discriminant score (proportional to the probability of the input belonging to class i),
μ_i is the mean vector of the features for class i,
C^-1 is the inverse of the pooled covariance matrix,
x_k is the object which is to be classified,
p_i is the prior probability of class i.

We assign the object k with features x_k to the group i that has the maximum f_i.
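A minimal NumPy sketch of the discriminant function above for two classes with two features; the class means, pooled covariance and priors are made-up illustrative numbers.

import numpy as np

mu = {0: np.array([1.0, 2.0]), 1: np.array([4.0, 5.0])}    # per-class feature means
C_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.5]]))  # inverse pooled covariance
prior = {0: 0.6, 1: 0.4}                                    # class priors p_i

def f(i, x_k):
    # f_i = mu_i C^-1 x_k^T - 0.5 * mu_i C^-1 mu_i^T + ln(p_i)
    return mu[i] @ C_inv @ x_k - 0.5 * mu[i] @ C_inv @ mu[i] + np.log(prior[i])

x_k = np.array([3.0, 3.0])
print(max(prior, key=lambda i: f(i, x_k)))   # class with the largest discriminant score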

For contrast, a discriminative (conditional) model:

● models the decision boundary between the classes
● learns the conditional probability distribution p(y|x)
● works as follows:
1. assume some functional form for p(y|x)
2. estimate the parameters of p(y|x) directly from the training data
● examples:
○ logistic regression
○ support vector machine
○ traditional neural networks
○ nearest neighbour

Steps:

1. Train on the data and obtain a discriminant function, which tells which class a data point
has a higher probability of belonging to.
2. Compute μ and σ for each class, then calculate the probability that the data belongs to
each class; the class with the highest probability is chosen.
2.6 Naive Bayes

Naive Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving
classification problems. It is mainly used in text classification, which involves
high-dimensional training datasets. It is one of the simplest and most effective
classification algorithms. It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object. Examples are spam filtering, sentiment analysis, and
classifying articles. It assumes that the occurrence of a certain feature is independent of
the occurrence of other features.

It uses Bayes' theorem: p(C_k|x) = p(C_k) p(x|C_k) / p(x)

Naive Bayes classifier is based on Bayes theorem which says that

P(H|E) = P(E|H) * P(H) / P(E)

where H is some hypothesis based on some evidence E, e.g. evidence = fever, hypothesis = dengue.

P(E), P(H) and P(E|H) are probabilities estimated from the data and used to calculate the
conditional (posterior) probability P(H|E).

In Naive Bayes, we have to predict the class (C) of an example(X), so the equation can be re-
written as
P(C|X) = P(X|C) * P(C) / P(X)

Let's understand Naive Bayes with an example. Suppose we are given the standard 14-example
"play tennis" training dataset (attributes: outlook, temperature, humidity, windy), where
"Play" is the output with the 2 labels "yes" and "no".

So we have to build a classifier using this training set, i.e. we have to calculate the
probabilities P(C), P(X|C) and P(X).

As we have only two classes in our training dataset, P(C) means P(yes) and P(no).

P(C) = number of examples belonging to class C / total examples

P(yes) = 9/14

P(no) = 5/14

P(X) = number of examples having X / total examples

P(sunny) = 5/14

P(overcast) = 4/14

P(rainy) = 5/14

P(hot) = 4/14

P(mild) = 6/14

P(cool) = 4/14
P(high) = 7/14

P(normal) = 7/14

P(false) = 8/14

P(true) = 6/14

P(X|C) = number of times X is associated with C / number of examples belonging to class C

P(sunny|yes) = 2/9, P(sunny|no) = 3/5

P(overcast|yes) = 4/9, P(overcast|no) = 0/5

P(rainy|yes) = 3/9, P(rainy|no) = 2/5

P(hot|yes) = 2/9, P(hot|no) = 2/5

P(mild|yes) = 4/9, P(mild|no) = 2/5

P(cool|yes) = 3/9, P(cool|no) = 1/5

P(high|yes) = 3/9, P(high|no) = 4/5

P(normal|yes) = 6/9, P(normal|no) = 1/5

P(false|yes) = 6/9, P(false|no) = 2/5

P(true|yes) = 3/9, P(true|no) = 3/5

We have obtained all three kinds of probabilities from the training dataset. Now, we want to
classify a new, unseen example.

Let the example be {sunny, cool, high, true}; we have to predict its class. The class can be
predicted using the formula

P(C|X) = [P(C) * Π P(Xi|C)] / Π P(Xi)

Case I: Yes

P(yes|sunny,cool,high,true) = P(yes) * P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes) /
[P(sunny) * P(cool) * P(high) * P(true)]
= (9/14 * 2/9 * 3/9 * 3/9 * 3/9) / Π P(Xi)

Case II: No

P(no|sunny,cool,high,true) = P(no) * P(sunny|no) * P(cool|no) * P(high|no) * P(true|no) /
[P(sunny) * P(cool) * P(high) * P(true)]
= (5/14 * 3/5 * 1/5 * 4/5 * 3/5) / Π P(Xi)

Result:

As the denominator Π P(Xi) is the same in both cases, we can ignore it, giving the unnormalised scores

P(yes|sunny,cool,high,true) = 0.00529

P(no|sunny,cool,high,true) = 0.02057

As P(no|sunny,cool,high,true) > P(yes|sunny,cool,high,true), we assign the label "no" to it.
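
The hand calculation above can be checked with a few lines of Python; the probabilities below are simply copied from the tables derived from the training set.

prior = {"yes": 9/14, "no": 5/14}
likelihood = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}
example = ["sunny", "cool", "high", "true"]

scores = {}
for c in prior:
    p = prior[c]
    for value in example:
        p *= likelihood[c][value]      # naive independence assumption
    scores[c] = p                      # P(X) ignored: it is the same for both classes

print(scores)                          # {'yes': ~0.00529, 'no': ~0.02057}
print(max(scores, key=scores.get))     # 'no'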

Types

1. Gaussian:
○ assumes that features follow a normal distribution
○ if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution
2. Multinomial:
○ used when the data is multinomial distributed
○ used for document classification problems
○ uses the frequency of words for the predictors
3. Bernoulli:
○ works similarly to the Multinomial classifier, but the predictor variables are
independent Boolean variables (a short sketch of all three variants follows this list)
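
A minimal sketch of the three variants using scikit-learn (assuming it is installed); the tiny arrays below are made up purely to show which kind of input each variant expects.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.9, 6.8]])   # continuous features
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])     # word counts
X_bool = (X_counts > 0).astype(int)                                   # word present / absent
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))      # Gaussian NB
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))  # Multinomial NB
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1]]))      # Bernoulli NB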

2.7 Support vector machines: Optimal hyper plane,

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine.

Types of SVM
SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data and the classifier used is called a Linear SVM
classifier.

● Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data and the classifier used is called a Non-linear SVM classifier.
Such data can often be made separable by adding a new dimension, e.g. z = x^2 + y^2.


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that helps classify the
data points. This best boundary is known as the hyperplane of SVM.

o The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the
hyperplane is a 2-dimensional plane.

o We always create the hyperplane that has the maximum margin, i.e. the maximum distance
between the hyperplane and the nearest data points of each class.

o A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of
that space that divides the space into two disconnected parts. There can be multiple
lines/decision boundaries to segregate the classes; the SVM algorithm finds the closest
points of the two classes to the boundary, and these points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and the goal
of SVM is to maximize this margin. The hyperplane with maximum margin is called the
optimal hyperplane (a small sketch follows below).
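
A minimal sketch of fitting a linear SVM with scikit-learn and reading off the maximum-margin hyperplane w·x + b = 0; the six toy points are made up for illustration.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)      # w and b of the separating hyperplane
print(clf.support_vectors_)           # the extreme points (support vectors)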

Kernels.

● SVM algorithms use a set of mathematical functions that are defined as the kernel
● the function of a kernel is to take data as input and transform it into the required form
● different SVM algorithms use different types of kernel functions, for example linear,
nonlinear, polynomial, radial basis function (RBF) and sigmoid
● the most used type of kernel function is RBF, because it has a localized and finite
response along the entire x-axis
● kernel functions return the inner product between two points in a suitable feature
space, thus defining a notion of similarity, with little computational cost even in very
high-dimensional spaces
● examples:

1. Polynomial

2. Gaussian

3. Gaussian radial basis function

4. Laplace RBF

Radial Basis Function (RBF) kernel

A radial basis function is a real-valued function whose value depends only on the distance
from the origin. Any function that satisfies the property ϕ(x)=ϕ(||x||) is a radial function.

There are various types of RBF: Gaussian, Multi-quadratic, Inverse quadratic, etc.

Gaussian Kernel

The Gaussian kernel is an example of an RBF kernel. The adjustable parameter sigma plays a
major role in the performance of the kernel and should be carefully tuned to the problem at
hand. If sigma is over-estimated, the exponential will behave almost linearly and the higher-
dimensional projection will start to lose its non-linear power. On the other hand, if it is under-
estimated, the function will lack regularization and the decision boundary will be highly
sensitive to noise in the training data.
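
The sigma trade-off described above can be seen by varying scikit-learn's gamma parameter (which plays the role of 1/(2*sigma^2)) on synthetic data; the gamma values chosen below are only illustrative.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.01, 1.0, 100.0):          # large sigma ... small sigma
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(gamma, clf.score(X, y))         # very large gamma tends to overfit the training set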

Exponential kernel

The exponential kernel is closely related to the Gaussian kernel, with only the square of the
norm left out. It is also a radial basis function kernel.
2.8 Model selection and feature selection.

Model selection

· Given a set of models, choose the model that is expected to give the best results.

· Choosing among different learning algorithms, e.g. choosing kNN over other classification
algorithms.

· Choosing parameters within the same learning model, e.g. choosing the value of k in kNN
(see the cross-validation sketch below).
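
A small sketch of parameter selection by cross-validation, choosing k for kNN with scikit-learn; the data set and the candidate values of k are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate k
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)      # pick the model expected to give the best results
print(scores, best_k)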

Feature Selection- Selecting a useful subset from all the features.

Why Feature Selection?

· Some algorithms scale (computationally) poorly with increased dimension

· Irrelevant features can confuse some algorithms

· Redundant features adversely affect regularization

· Removal of features can increase (relative) margin (and generalization)

· Reduces data set and resulting model size

· Note: Feature Selection is different from Feature Extraction. The latter transforms the
original features to get a small set of new features.

How?

· Remove a binary feature if nearly all of its values are the same.

· Use some criterion to rank features and keep the top-ranked features (see the sketch below).

· Wrapper methods: require repeated runs of the learning algorithm with different sets of
features.
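
A hedged sketch of two of the ideas above with scikit-learn: dropping near-constant (low-variance) features and keeping only the top-ranked features according to a univariate score. The data set and the threshold/k values are illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.2).fit_transform(X)            # drop low-variance columns
X_top = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)   # keep the 2 best-ranked features
print(X.shape, X_var.shape, X_top.shape)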
2.9 Combining classifiers: Bagging, boosting (The Ada boost algorithm),

Ensemble Models

Bagging

Its objective is to create several subsets of data from the training sample, chosen randomly
with replacement. Each subset of the data is used to train its own decision tree, so we get an
ensemble of different models. The average of all the predictions from the different trees is
used, which is more robust than a single decision tree classifier.

Steps:
1. Suppose there are n observations and m features in the training data set; a sample is
taken from the training data set randomly with replacement.
2. A subset of the features is selected randomly, and whichever feature gives the best
split is used to split the node, iteratively.
3. Each tree is grown to its largest extent.
4. The above steps are repeated a number of times, and the prediction is given based on
the aggregation of the predictions from the resulting trees.

Advantages:

Reduces over-fitting of the model

Handles higher dimensionality data very well

Maintains accuracy for missing data

Disadvantages:

Since the final prediction is based on the mean of the predictions from the subset trees, it
won't give precise values for the classification and regression model.
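
A minimal sketch of bagging with scikit-learn, mirroring the steps above (bootstrap samples, one decision tree per sample, aggregated prediction); the data set is just an example, and BaggingClassifier's default base learner is a decision tree.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))          # aggregated prediction is usually more robust than one tree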

Boosting

It is used to create a collection of predictors. Learners are trained sequentially, with early
learners fitting simple models to the data, and the data is then analysed for errors. Consecutive
trees are fit and, at every step, the goal is to improve on the accuracy of the prior tree. When an
input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more
likely to classify it correctly. This process converts weak learners into a better-performing model.
Steps:
1. Draw a random subset of training samples without replacement from the training
set to train a weak learner.
2. Draw a second random training subset without replacement from the training set,
add back a portion of the samples that were previously misclassified, and use it to train a
second weak learner.
3. Find the training samples d3 in the training set D on which the first two weak learners
disagree, and use them to train a third weak learner.
4. Combine all the weak learners via majority voting.

Advantages

Supports different loss functions

Works well with interactions.

Disadvantages

Prone to over-fitting

Requires careful tuning of different hyper-parameters

Adaboost

Weak models are added sequentially, trained using the weighted training data.

The training weights are updated giving more weight to incorrectly predicted instances, and
less weight to correctly predicted instances.

The process continues until a pre-set number of weak learners have been created (a user
parameter) or no further improvement can be made on the training dataset.

Once completed, you are left with a pool of weak learners each with a stage value.

A stage value is calculated for the trained model which provides a weighting for any
predictions that the model makes.

Predictions are made by calculating the weighted average of the weak classifiers.
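
A short sketch of AdaBoost with scikit-learn; the estimator_weights_ attribute corresponds to the per-learner stage values mentioned above. The data set is illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.estimator_weights_[:5])     # stage value (weight) of each weak learner in the vote
print(ada.score(X, y))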
2.10 Evaluating and debugging learning algorithms, Classification errors.

Evaluating your machine learning algorithm is an essential part of any project.


Your model may give you satisfying results when evaluated using one metric, say
accuracy_score, but poor results when evaluated against other metrics such as
logarithmic_loss. Most of the time we use classification accuracy to measure the
performance of our model; however, it is not enough to truly judge the model. This
section covers the different types of evaluation metrics available.
Classification Accuracy

Classification Accuracy is what we usually mean when we use the term accuracy. It is the
ratio of the number of correct predictions to the total number of input samples.

It works well only if there are an equal number of samples belonging to each class.

For example, consider that there are 98% samples of class A and 2% samples of
class B in our training set. Then our model can easily get 98% training accuracy by
simply predicting every training sample belonging to class A.

When the same model is tested on a test set with 60% samples of class A and
40% samples of class B, then the test accuracy would drop down to 60%.
Classification Accuracy is great, but gives us the false sense of achieving high
accuracy.

The real problem arises when the cost of misclassifying the minority-class samples is very
high. If we deal with a rare but fatal disease, the cost of failing to diagnose a sick person is
much higher than the cost of sending a healthy person for more tests.
Logarithmic Loss

Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for
multi-class classification. When working with Log Loss, the classifier must assign a
probability to each class for every sample. Suppose there are N samples belonging to M
classes; then the Log Loss is calculated as below:

Log Loss = -(1/N) * Σi Σj y_ij * log(p_ij)

where,

y_ij, indicates whether sample i belongs to class j or not

p_ij, indicates the probability of sample i belonging to class j

Log Loss has no upper bound and exists on the range [0, ∞). A Log Loss nearer to 0 indicates
higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy.
In general, minimising Log Loss gives greater accuracy for the classifier.
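
A small sketch of the Log Loss formula on a made-up 3-sample, 2-class example, checked against scikit-learn's log_loss.

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1]
y_prob = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]      # p_ij for each sample i and class j

Y = np.eye(2)[y_true]                              # y_ij indicator matrix
manual = -np.mean(np.sum(Y * np.log(y_prob), axis=1))
print(manual, log_loss(y_true, y_prob))            # both around 0.228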
Confusion Matrix

Confusion Matrix, as the name suggests, gives us a matrix as output and describes the
complete performance of the model.

Let's assume we have a binary classification problem, with samples belonging to two classes:
YES or NO. Also, we have our own classifier which predicts a class for a given input sample.
On testing our model on 165 samples, we get the following result.

Confusion Matrix

There are 4 important terms :

True Positives : The cases in which we predicted YES and the actual output was
also YES.

True Negatives : The cases in which we predicted NO and the actual output was
NO.

False Positives : The cases in which we predicted YES and the actual output was
NO.

False Negatives : The cases in which we predicted NO and the actual output was
YES.

Accuracy can be calculated from the matrix as the sum of the values lying on the "main
diagonal" divided by the total number of samples, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).

The Confusion Matrix forms the basis for the other types of metrics.
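
A sketch of building a confusion matrix and computing accuracy from it; the two label vectors are made up and are not the 165-sample example referred to above.

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                          # TN=4, FP=1, FN=1, TP=4
print((tp + tn) / (tp + tn + fp + fn))         # accuracy from the main diagonal = 0.8
print(accuracy_score(y_true, y_pred))          # same value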
Area Under Curve

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for
binary classification problems. The AUC of a classifier is equal to the probability that the
classifier will rank a randomly chosen positive example higher than a randomly chosen
negative example. Before defining AUC, let us understand two basic terms:

True Positive Rate (Sensitivity) : True Positive Rate is defined as TP/ (FN+TP). True
Positive Rate corresponds to the proportion of positive data points that are
correctly considered as positive, with respect to all positive data points.

True Negative Rate (Specificity): True Negative Rate is defined as TN / (FP+TN). True
Negative Rate corresponds to the proportion of negative data points that are correctly
considered as negative, with respect to all negative data points.

False Positive Rate : False Positive Rate is defined as FP / (FP+TN). False Positive
Rate corresponds to the proportion of negative data points that are mistakenly
considered as positive, with respect to all negative data points.

False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR
are both computed at varying threshold values such as (0.00, 0.02, 0.04, ..., 1.00) and a graph
is drawn. AUC is the area under the curve obtained by plotting True Positive Rate against
False Positive Rate at these different thresholds.

As is evident, AUC has a range of [0, 1]. The greater the value, the better the performance of
our model.
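
A small sketch of computing FPR/TPR at varying thresholds and the resulting AUC with scikit-learn; the scores below are made up.

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))                 # points of the ROC curve
print(roc_auc_score(y_true, y_score))      # area under that curve, in [0, 1]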
F1 Score

F1 Score is used to measure a test’s accuracy

F1 Score is the Harmonic Mean between precision and recall. The range for F1 Score is
[0, 1]. It tells you how precise your classifier is (how many of the instances it labels as
positive are actually positive), as well as how robust it is (whether it misses a significant
number of positive instances).

High precision but low recall gives you a classifier that is very accurate on the positives it
does predict, but misses a large number of instances that are difficult to classify. The greater
the F1 Score, the better the performance of our model. Mathematically, it can be expressed as:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

F1 Score tries to find the balance between precision and recall.


Precision: It is the number of correct positive results divided by the number of positive
results predicted by the classifier.

Precision = TP / (TP + FP)

Recall: It is the number of correct positive results divided by the number of all relevant
samples (all samples that should have been identified as positive).

Recall = TP / (TP + FN)
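
Using the same kind of toy labels as in the confusion-matrix sketch above, precision, recall and F1 can be computed directly.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)        # TP / (TP + FP) = 4/5
r = recall_score(y_true, y_pred)           # TP / (TP + FN) = 4/5
print(p, r, 2 * p * r / (p + r))           # harmonic mean = F1
print(f1_score(y_true, y_pred))            # same value from scikit-learn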
Mean Absolute Error

Mean Absolute Error is the average of the absolute differences between the original values
and the predicted values. It gives us a measure of how far the predictions were from the actual
output. However, it doesn't give us any idea of the direction of the error, i.e. whether we are
under-predicting or over-predicting the data. Mathematically, it is represented as:

MAE = (1/N) * Σ |actual_i - predicted_i|

Mean Squared Error

Mean Squared Error (MSE) is quite similar to Mean Absolute Error; the only difference is
that MSE takes the average of the square of the difference between the original values and
the predicted values, MSE = (1/N) * Σ (actual_i - predicted_i)^2. The advantage of MSE is
that its gradient is easier to compute, whereas the gradient of Mean Absolute Error is not
defined at zero, which makes it harder to optimise directly. As we take the square of the
error, the effect of larger errors becomes more pronounced than that of smaller errors, so the
model can now focus more on the larger errors.
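
A minimal sketch of MAE and MSE on made-up regression outputs.

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))   # mean of |actual - predicted| = 0.5
print(mean_squared_error(y_true, y_pred))    # mean of (actual - predicted)^2 = 0.375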
