
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY

DR. A.P.J ABDUL KALAM TECHNICAL UNIVERSITY

Theory Manual
Machine Learning Techniques
(BCAI-601)
Session: 2024-25 (EVEN Semester)
Programme Name: B. Tech
Semester: VI
Name of the Department: CSE (AI&ML)
Dr. Pawan
Associate Professor
MIET MEERUT
UNIT – 1

1. INTRODUCTION
1.1 Learning, Types of Learning, Well-defined Learning Problems, Designing a Learning System

WHY
a. To understand the basics of Machine Learning and the types of Learning.
b. To understand the history of Machine Learning.

WHAT
a. To implement various algorithms of Supervised, Unsupervised, and Reinforcement Machine Learning.

WHERE
a. In the selection of datasets for various Machine Learning problems.
b. In applications of Clustering and Classification.

Machine Learning refers to the process by which algorithms improve their performance
on tasks over time, based on experience or data. This can involve supervised learning,
where the system learns from labeled data, or unsupervised learning, where it identifies
patterns in unlabeled data.

Why Learning is Important:

 Improves Accuracy: Learning allows models to adapt and improve their predictions
or classifications based on new data.

 Adaptability: Enables systems to handle changing environments or new data distributions.

 Efficiency: Automates tasks that would otherwise require manual intervention.

What Learning Entails:

 Data Processing: Involves processing large datasets to extract insights or patterns.

 Model Training: Training models using algorithms to optimize performance metrics.

 Evaluation: Assessing model performance on unseen data to ensure generalizability.

Where Learning is Applied:

 Healthcare: Predicting patient outcomes, diagnosing diseases.


 Finance: Predicting stock prices, detecting fraud.

 Retail: Personalized recommendations, inventory management.

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems that learn, or improve performance, based on the data they ingest. Artificial intelligence is a broad term that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently discussed together, and the terms are occasionally used interchangeably, although they do not mean the same thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.

What is Machine Learning?

Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one would have ever come across. As is evident from the name, it gives the computer something that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.

Features of Machine learning

 Machine learning is a data-driven technology. A large amount of data is generated by organizations on a daily basis, and by identifying notable relationships in that data, organizations can make better decisions.
 A machine can learn by itself from past data and automatically improve.
 From the given dataset, it detects various patterns in the data.
 For big organizations, branding is important, and machine learning makes it easier to target a relatable customer base.
 It is similar to data mining because it also deals with huge amounts of data.

Designing a Learning System in Machine Learning

Designing a learning system in machine learning requires careful consideration of several key factors, including the type of data being used, the desired outcome, and the available resources. In this section, we will explore the key steps involved in designing a learning system in machine learning and discuss some best practices to keep in mind.

o The first step in designing a learning system in machine learning is to identify the
type of data that will be used. This can include structured data, such as numerical
and categorical data, as well as unstructured data, such as text and images. The type
of data will determine the type of machine learning algorithms that can be used and
the preprocessing steps required.
o Once the data has been identified, the next step is to determine the desired
outcome of the learning system. This can include classifying data, making
predictions, or identifying patterns in the data. The desired outcome will determine
the type of machine learning algorithm that should be used, as well as the
evaluation metrics that will be used to measure the performance of the learning
system.
o Next, the resources available for the learning system must be considered. This
includes the amount of data available, the computational power available, and the
amount of time available to train the model. These resources will determine the
complexity of the machine learning algorithm that can be used and the amount of
data that can be used for training.
o Once the data, desired outcome, and resources have been identified, it is time to
select a machine-learning algorithm and begin the training process. Decision trees,
SVMs, and neural networks are examples of common algorithms. It is crucial to
assess the effectiveness of the learning system using the right assessment measures,
such as recall, accuracy, and precision.
o After the learning system is trained, it is important to fine-tune the model by
adjusting the parameters and hyperparameters. This can be done using techniques
such as cross-validation and grid search. The final model should be tested on a
hold-out test set to evaluate its performance on unseen data.

When constructing a machine learning system, there are some other recommended
practices to bear in mind in addition to these essential processes. A crucial factor to take
into account is making sure that the training data are indicative of the data that will be
encountered in the actual world. To do this, the data may be divided into training,
validation, and test sets.

Another best practice is to use appropriate regularization techniques to prevent overfitting. This can include techniques such as L1 and L2 regularization and dropout. It is also important to use feature scaling and normalization to ensure that the data is in a format that is suitable for the machine learning algorithm being used.
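As an illustration of the steps and practices above, here is a minimal sketch, assuming scikit-learn is installed and using a synthetic dataset (the feature counts and parameter grid are arbitrary example choices, not prescriptions): it splits the data into training and hold-out test sets, scales the features, fits an L2-regularized logistic regression, and tunes the regularization strength with grid search and cross-validation.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 1,000 samples, 20 numeric features, binary labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set so the final evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling plus an L2-regularized model, chained in a single pipeline.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])

# Tune the regularization strength C with 5-fold cross-validation on the training set.
search = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["clf__C"])
print("Hold-out test accuracy:", search.score(X_test, y_test))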

Following are the qualities that you need to keep in mind while designing a
learning system:

Reliability
The system must be capable of carrying out the proper task at the appropriate degree of
performance in a given setting. Testing the dependability of ML systems that learn from
data is challenging because a system's failure need not result in an error; instead, it could
simply produce garbage results, meaning that some results were produced even though
the system had not been trained with the corresponding ground truth.

When a typical system fails, you receive an error message, such as "The crew is addressing a technical issue and will return soon."

When a machine learning (ML) system fails, it usually does so silently. For instance, when translating from English to Hindi or vice versa, even if the model has not seen all of the words, it may nevertheless produce a translation, one that is illogical.

Scalability

There should be practical methods for coping with the system's expansion as it changes (in
terms of data amount, traffic volume, or complexity). Because certain essential applications
might lose millions of dollars or their credibility with just one hour of outage or failure,
there should be an automated provision to grow computing and storage capacity.

For instance, if a feature on an e-commerce website fails to function as planned on a busy day, it might result in a loss of millions of dollars in sales.

Maintainability

The performance of the model may fluctuate as a result of changes in data distribution
over time. In the ML system, there should be a provision to first determine whether there is
any model drift or data drift, and once the major drift is noticed, how to re-train/re-fresh
and enable new ML models without interfering with the ML system's present functioning.

Adaptability

The availability of fresh data with increased features or changes in business objectives,
such as conversion rate vs. customer engagement time for e-commerce, are the other
changes that occur most frequently in machine learning (ML) systems. As a result, the
system has to be adaptable to fast upgrades without causing any service disruptions.

Data

1. Feature expectations are captured in a schema: the ranges of the feature values are carefully recorded to avoid any unanticipated value, which could produce a garbage answer. For example, human age and height have expected value ranges and cannot be too large, such as an age of 150+ or a height of 10 feet.
2. All features are advantageous; features introduced to the system should be valuable
in some way, such as being a predictor or an identifier, as each feature has a
handling cost.
3. No feature should cost more than it is worth; each new feature should be evaluated
in terms of cost vs. benefits in order to eliminate those that would be difficult to
implement or manage.
4. The data pipeline has the necessary privacy protections in place; for instance,
personally identifiable information (PII) should be managed carefully because any
leaking of sensitive information may have legal repercussions.
5. If any new external component has an influence on the system, it will be easier to
introduce new features to boost system performance.
6. All input feature code, including one-hot encoding/binning features and the
handling of unseen levels in one-hot encoded features, must be checked in order to
avoid any intermediate values from departing from the desired range.

Model

1. Model specifications are evaluated and submitted; for quicker re-training, correct
versioning of the model learning code is required.
2. Correlation between offline and online metrics: model metrics (log loss, MAPE, MSE) should be strongly associated with the application's goal, such as revenue/cost/time.
3. Hyperparameters like learning rates, the number of layers, the size of the layers, the
maximum depth, and regularisation coefficients must be modified for the use case
because the selection of hyperparameter values can significantly affect the accuracy
of predictions.
4. To keep the model in production up to date, it is important to understand how frequently to retrain models depending on changes in data distribution, and to know the impact of model staleness.
5. Simple linear models with high-level characteristics are a good starting point for
functional testing and doing cost-benefit analyses when compared to more
complex models. However, a simpler model is not always better.
6. Model performance must be assessed using adequately representative data to
ensure that model quality is satisfactory on significant data slices.
7. The model is tested for inclusion: model features should be thoroughly examined against their predictive importance, since in some applications specific features may skew outcomes in favour of particular categories, which matters for reasons of fairness.

Types of Machine Learning

Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions.

Machine learning contains a set of algorithms that work on a huge amount of data. Data is
fed to these algorithms to train them, and on the basis of training, they build the model &
perform a specific task.

These ML algorithms help to solve different business problems such as Regression, Classification, Forecasting, Clustering, and Association.

Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:

1. Supervised Machine Learning


2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
In this topic, we will provide a detailed description of the types of Machine Learning along
with their respective algorithms:

1. Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on that training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with the inputs and corresponding outputs, and then we ask the machine to predict the output for a test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset
of cats and dog images. So, first, we will provide the training to the machine to understand
the images, such as the shape & size of the tail of cat and dog, Shape of eyes, colour,
height (dogs are taller, cats are smaller), etc. After completion of training, we input the
picture of a cat and ask the machine to identify the object and predict the output. Now,
the machine is well trained, so it will check all the features of the object, such as height,
shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat
category. This is the process of how the machine identifies the objects in Supervised
Learning.

The main goal of the supervised learning technique is to map the input variable(x)
with the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are given
below:

o Classification
o Regression
a) Classification

Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.

Some popular classification algorithms are given below:

1. Random Forest Algorithm


2. Decision Tree Algorithm
3. Logistic Regression Algorithm
4. Support Vector Machine Algorithm
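As a brief illustration of the classification algorithms listed above, here is a minimal sketch, assuming scikit-learn is installed; the iris dataset is used only as a convenient example of labelled data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Labelled dataset: inputs X (flower measurements) mapped to outputs y (species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the labelled examples, then predict categories for unseen inputs.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))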

b) Regression

Regression algorithms are used to solve regression problems, in which there is a relationship between the input variables and a continuous output variable. They are used to predict continuous output values, such as market trends, weather, etc.

Some popular Regression algorithms are given below :

1. Simple Linear Regression Algorithm


2. Multivariate Regression Algorithm
3. Decision Tree Algorithm
4. Lasso Regression
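As a brief illustration, the following is a minimal sketch of simple linear regression, assuming scikit-learn and NumPy are installed; the data are synthetic and generated around the hypothetical relationship y = 3x + 2:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic continuous data generated around y = 3x + 2 with some noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=100)

# Fit a straight line to the data and use it to predict a continuous value.
model = LinearRegression().fit(X, y)
print("Learned slope:", model.coef_[0], "and intercept:", model.intercept_)
print("Prediction at x = 5:", model.predict([[5.0]])[0])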

Advantages and Disadvantages of Supervised Learning

Advantages:

1. Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

1. These algorithms are not able to solve complex tasks.


2. It may predict the wrong output if the test data is different from the training data.
3. It requires lots of computational time to train the algorithm.
Applications of Supervised Learning

Some common applications of Supervised Learning are given below:

1. Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
2. Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and historical data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
3. Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.
4. Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent
to the spam folder.
5. Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning

Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output
without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.

Let's take an example to understand it more clearly: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.

So, the machine will discover its own patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data.
It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of other
groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
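As a brief illustration of clustering on unlabeled data, here is a minimal sketch using the K-Means algorithm from the list above, assuming scikit-learn and NumPy are installed; the two synthetic customer groups are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: annual spend and number of visits for 100 customers.
rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal(loc=[200, 5], scale=[30, 1], size=(50, 2)),    # a low-spend group
    rng.normal(loc=[900, 20], scale=[50, 3], size=(50, 2)),   # a high-spend group
])

# No labels are given; K-Means groups the customers purely by similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(customers)
print("Cluster centres:", kmeans.cluster_centers_)
print("First five cluster assignments:", kmeans.labels_[:5])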

2) Association

Association rule learning is an unsupervised learning technique which finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-
growth algorithm.
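The core idea behind association rule learning, support and confidence, can be sketched without any specialized library. The tiny set of transactions below is hypothetical, and the rule examined is {bread} -> {butter}:

# A tiny hand-rolled example of support and confidence for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n             # P(bread)
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # P(bread and butter)
confidence = support_both / support_bread                               # P(butter | bread)

print("support({bread, butter}) =", support_both)
print("confidence(bread -> butter) =", round(confidence, 2))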

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with an unlabelled dataset that does not map to a known output.

Applications of Unsupervised Learning

o Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques for building recommendation applications for different web
applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to
discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to
extract particular information from the database. For example, extracting
information of each user located at a particular location.

3. Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data) and uses a combination of labelled and unlabeled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabeled data. Labels are costly to obtain, so in practice an organization may have only a few of them. This setting is different from both supervised and unsupervised learning, which are defined by the presence or absence of labels.

To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept of semi-supervised learning was introduced.

The main aim of semi-supervised learning is to effectively use all of the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered using an unsupervised learning algorithm, and this clustering then helps to label the unlabeled data. This approach is used because labelled data is comparatively more expensive to acquire than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student
is under the supervision of an instructor at home and college. Further, if that student is
self-analysing the same concept without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student has to revise himself
after analyzing the same concept under the guidance of an instructor at college.
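As a brief illustration, the following is a minimal sketch of the semi-supervised setting, assuming scikit-learn and NumPy are installed: it starts from a fully labelled dataset, hides most of the labels (unlabelled points are marked with -1), and uses a label-spreading model to recover them from the few remaining labels plus the structure of the data.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

# Start from a fully labelled dataset, then hide most labels to mimic the
# semi-supervised setting (unlabelled points are marked with -1).
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
hidden = rng.random(len(y)) < 0.9        # hide roughly 90% of the labels
y_partial[hidden] = -1

# The model uses the few labels plus the structure of the unlabelled data.
model = LabelSpreading().fit(X, y_partial)
recovered = model.transduction_[hidden]
print("Accuracy on originally unlabelled points:", (recovered == y[hidden]).mean())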

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o The algorithms are simple and easy to understand.
o It is highly efficient.
o It is used to overcome the drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:

o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial, taking actions, learning from experience, and improving its performance.

The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.

The reinforcement learning process is similar to the way a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
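As a brief illustration of this feedback loop, here is a minimal sketch of tabular Q-learning, assuming NumPy is installed; the five-state corridor environment, rewards, and hyperparameter values are hypothetical choices for the example:

import numpy as np

# Tabular Q-learning on a toy 5-state corridor MDP: the agent starts in state 0
# and receives a reward of +1 only when it reaches state 4 (the goal).
n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    # Environment response: the new state and the reward for the transition.
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: explore while all values are still zero or with probability epsilon.
        if rng.random() < epsilon or not q_table[state].any():
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state].max() - q_table[state, action])
        state = next_state

print("Learned greedy policy (0 = left, 1 = right):", q_table.argmax(axis=1))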
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning increases the tendency that the required behaviour will occur again by adding something desirable. It strengthens the behaviour of the agent and affects it positively.
o Negative Reinforcement Learning: Negative reinforcement learning works in exactly the opposite way to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding or removing a negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Popular examples of systems built with RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL in computer systems to automatically learn to schedule resources among waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial
and manufacturing area, and these robots are made more powerful with
reinforcement learning. There are different industries that have their vision of
building intelligent robots using AI and Machine learning technology.
o Text Mining:
Text mining, one of the important applications of NLP, is now being implemented with the help of reinforcement learning by Salesforce.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems that are difficult to solve with general techniques.
o The learning model of RL is similar to the way human beings learn; hence, very accurate results can be obtained.
o It helps in achieving long-term results.
Disadvantage

o RL algorithms are not preferred for simple problems.
o RL algorithms require huge amounts of data and computation.
o Too much reinforcement learning can lead to an overload of states which can
weaken the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.

History of Machine Learning

Some 40-50 years ago, machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day lives easier, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below are some milestones that have occurred in the history of machine learning:

The early history of Machine Learning (Pre-1940):

1834: In 1834, Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. The machine was never built, but all modern computers rely on its logical structure.

1936: In 1936, Alan Turing gave a theory of how a machine can determine and execute a set of instructions.

The era of stored program computers:

1945: In 1945, "ENIAC", the first electronic general-purpose computer, was built. After that, stored-program computers such as EDSAC in 1949 and EDVAC in 1951 were invented.

1943: In 1943, a human neural network was first modeled with an electrical circuit. Around 1950, scientists started applying this idea and analyzing how human neurons might work.

Computing machinery and intelligence:

1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can machines think?"
Machine intelligence in Games:

1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play the game of checkers. The more it played, the better it performed.

1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:

The period from 1974 to 1980 was a tough time for AI and ML researchers, and this period is known as the AI winter.

During this period, machine translation failed and people lost interest in AI, which led to reduced government funding for research.

Machine Learning from theory to reality

1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.

1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in one
week.

1997: IBM's Deep Blue intelligent computer won a chess match against the chess expert Garry Kasparov, becoming the first computer to beat a human world chess champion.

Machine Learning in the 21st century

2006:

Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.

The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable computing resources that made it easier to create and implement machine learning models.

2007:

The Netflix Prize competition began, tasking participants with increasing the accuracy of Netflix's recommendation algorithm.

Reinforcement learning made notable progress when a group of researchers used it to train a computer to play backgammon at a high level.

2008:

Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.

Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained attention for their ability to model complex data distributions.

2009:

Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.

The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling massive datasets.

2010:

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving progress in computer vision and prompting the development of deep convolutional neural networks (CNNs).

2011:

IBM's Watson defeated human champions on Jeopardy!, demonstrating the potential of question-answering systems and natural language processing.

2012:

AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, significantly improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.

Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network to recognize cats from unlabeled YouTube videos.

2013:

Ian Goodfellow introduced generative adversarial networks (GANs), which made it possible to create realistic synthetic data.

Google later acquired the startup DeepMind Technologies, which focused on deep learning and artificial intelligence.

2014:

Facebook presented the DeepFace system, which achieved near-human accuracy in facial recognition.

AlphaGo, a program created by DeepMind at Google, was developed; it went on to defeat a world champion Go player and demonstrated the potential of reinforcement learning in challenging games.

2015:

Microsoft released the Cognitive Toolkit (also known as CNTK), an open-source deep learning library.

The introduction of attention mechanisms enhanced the performance of sequence-to-sequence models in tasks like machine translation.

2016:

Explainable AI, which focuses on making machine learning models easier to understand, began to receive attention.

Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-playing ability without human knowledge, using only reinforcement learning.

2017:

Transfer learning gained prominence, allowing pretrained models to be used for different tasks with limited data.

Better synthesis and generation of complex data were made possible by the introduction of generative models like variational autoencoders (VAEs) and Wasserstein GANs.

These are only some of the notable advancements and milestones in machine learning during this period. The field continued to advance rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.

Machine Learning at present:

The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates clustering, classification, decision trees, SVM algorithms, and reinforcement learning, as well as unsupervised and supervised learning.

Present-day ML models can be used for making many kinds of predictions, including weather prediction, disease prediction, stock market analysis, and so on.

Prerequisites

Before learning machine learning, you should have basic knowledge of the following so that you can easily understand the concepts of machine learning:

Fundamental knowledge of probability and linear algebra.

The ability to code in any computer language, especially Python.

Knowledge of calculus, especially derivatives of single-variable and multivariate functions.

Comparison of Supervised Learning and Unsupervised Learning

o Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output, whereas an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output, whereas an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights in an unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
o Supervised learning can be used where we know the inputs as well as the corresponding outputs; unsupervised learning can be used where we have only input data and no corresponding output data.
o A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result as compared to supervised learning.
o Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns in the way a child learns daily routine things from experience.
o Supervised learning includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, and Bayesian Logic; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
1. INTRODUCTION

1.2 Introduction to Machine Learning Approaches

WHY
a. To understand various machine learning approaches such as Artificial Neural Networks, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian Networks, Support Vector Machines, and Genetic Algorithms.
b. To understand the issues related to Machine Learning and Data Science.

WHAT
a. To implement and analyse small applications of Machine Learning.
b. To implement and analyse probability distributions using Bayes' Theorem and Bayesian Networks.

WHERE
a. In the applications of Machine Learning.

Artificial Neural Network

The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a
computational network based on biological neural networks that construct the
structure of the human brain. Similar to a human brain has neurons interconnected to
each other, artificial neural networks also have neurons that are linked to each other in
various layers of the networks. These neurons are known as nodes.

What is Artificial Neural Network?

The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are
known as nodes.
Dendrites from Biological Neural Network represent inputs in Artificial Neural
Networks, cell nucleus represents Nodes, synapse represents Weights, and Axon
represents Output.

Relationship between Biological neural network and artificial neural network:

Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output

An Artificial Neural Network is an attempt, in the field of artificial intelligence, to mimic the network of neurons that makes up a human brain so that computers have an option to understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.

The human brain contains on the order of 100 billion neurons, and each neuron has somewhere between 1,000 and 100,000 connection points. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from our memory in parallel when necessary. We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs: if one or both of the inputs are "On," the output is "On"; if both inputs are "Off," the output is "Off." Here the output depends only on the input. Our brain does not perform the same task: the output-to-input relationship keeps changing because the neurons in our brain are "learning."

The Architecture of an artificial neural network:

To understand the architecture of an artificial neural network, we first have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, which are termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.

An Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:

The hidden layer lies between the input and output layers. It performs all the calculations to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.

The artificial neural network takes the inputs, computes the weighted sum of the inputs, and includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not. Only the nodes that fire make it to the output layer. There are distinct activation functions available that can be applied depending on the sort of task we are performing.
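As a brief illustration of this forward computation, here is a minimal sketch of a three-layer network, assuming NumPy is installed; the layer sizes are arbitrary and the weights are random because the training step is not shown:

import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# A tiny fully connected network: 3 inputs -> 4 hidden units -> 1 output.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 4))   # weights between the input and hidden layers
b_hidden = np.zeros(4)               # bias of the hidden layer
W_output = rng.normal(size=(4, 1))   # weights between the hidden and output layers
b_output = np.zeros(1)               # bias of the output layer

x = np.array([0.5, -1.2, 0.3])       # one input pattern, given as a vector

# Each layer computes the weighted sum of its inputs plus a bias,
# then passes the result through the activation function.
hidden = sigmoid(x @ W_hidden + b_hidden)
output = sigmoid(hidden @ W_output + b_output)
print("Network output:", output)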

Advantages of Artificial Neural Network (ANN)

Parallel processing capability:
Artificial neural networks can perform more than one task simultaneously.
Storing data on the entire network:
The data is stored on the whole network, not in a single database, so the disappearance of a few pieces of data in one place does not prevent the network from working.
Capability to work with incomplete knowledge:
After training, an ANN may produce output even with inadequate data. The loss of performance here depends on the significance of the missing data.
Having a memory distribution:
For an ANN to be able to adapt, it is important to determine suitable examples and to train the network toward the desired output by showing these examples to the network. The success of the network is directly proportional to the chosen instances, and if the event cannot be shown to the network in all its aspects, it can produce false output.
Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


 Assurance of proper network structure:
There is no particular guideline for determining the structure of artificial neural
networks. The appropriate network structure is accomplished through
experience, trial, and error.
 Unrecognized behavior of the network:
It is the most significant issue of ANN. When ANN produces a testing solution, it
does not provide insight concerning why and how. It decreases trust in the
network.
 Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their structure. Therefore, their realization depends on suitable hardware.
 Difficulty of showing the issue to the network:
ANNs can work with numerical data. Problems must be converted into numerical
values before being introduced to ANN. The presentation mechanism to be
resolved here will directly impact the performance of the network. It relies on the
user's abilities.
 The duration of the network is unknown:
The network is trained until the error is reduced to a specific value, and reaching this value does not guarantee optimal results, so the required training duration is not known in advance.
How do artificial neural networks work?

An Artificial Neural Network can best be represented as a weighted directed graph, where the artificial neurons form the nodes. The connections between neuron outputs and neuron inputs can be viewed as directed edges with weights. The Artificial Neural Network receives the input signal from an external source in the form of a pattern or an image, represented as a vector. These inputs are then mathematically denoted by the notation x(n) for each of the n inputs.

Afterward, each input is multiplied by its corresponding weight (these weights are the details used by the artificial neural network to solve a specific problem). In general terms, these weights represent the strength of the interconnections between neurons inside the artificial neural network. All the weighted inputs are summed inside the computing unit.

If the weighted sum is zero, a bias is added to make the output non-zero or to otherwise scale up the system's response. The bias has a fixed input of 1 and its own weight. Here, the total of the weighted inputs can lie in the range from 0 to positive infinity. To keep the response within the limits of the desired value, a certain maximum value is benchmarked, and the total of the weighted inputs is passed through the activation function.

The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, primarily either linear or non-linear sets of functions. Some of the commonly used activation functions are the binary, linear, and tan hyperbolic sigmoidal activation functions. Let us take a look at them in detail:

Binary:

In the binary activation function, the output is either a one or a zero. To accomplish this, a threshold value is set up: if the net weighted input of the neuron exceeds this threshold, the final output of the activation function is returned as one; otherwise the output is returned as zero.

Sigmoidal Hyperbolic:

The sigmoidal hyperbolic function is generally seen as an "S"-shaped curve. Here the tan hyperbolic function is used to approximate the output from the actual net input. The function is defined as:

F(x) = 1 / (1 + exp(-βx))

where β is the steepness parameter.
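As a brief illustration, the binary and sigmoidal activation functions described above can be written as follows (a sketch assuming NumPy is installed; the steepness parameter corresponds to the β in the formula above):

import numpy as np

def binary_step(x, threshold=0.0):
    # Binary activation: the node "fires" (outputs 1) only above the threshold.
    return np.where(x > threshold, 1, 0)

def sigmoid(x, steepness=1.0):
    # Sigmoidal activation: F(x) = 1 / (1 + exp(-steepness * x)).
    return 1.0 / (1.0 + np.exp(-steepness * x))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example net inputs
print("binary :", binary_step(z))
print("sigmoid:", np.round(sigmoid(z, steepness=1.0), 3))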

Types of Artificial Neural Network:

There are various types of Artificial Neural Networks (ANNs), which perform tasks in a way that depends on how neurons and networks function in the human brain. The majority of artificial neural networks have some similarities with their more complex biological counterpart and are very effective at their intended tasks, for example segmentation or classification.

1. Feedback ANN:

In this type of ANN, the output is fed back into the network to achieve the best results internally. Feedback networks feed information back into themselves and are well suited to solving optimization problems; they are used for internal system error corrections.

2. Feed-Forward ANN:

A feed-forward network is a basic neural network comprising an input layer, an output layer, and at least one hidden layer of neurons. By assessing its output in relation to its input, the strength of the network can be observed based on the group behaviour of the associated neurons, and the output is decided. The primary advantage of this network is that it learns to evaluate and recognize input patterns.

Clustering

Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size,
color, behavior, etc., and divides them as per the presence and absence of those similar
patterns.
It is an unsupervised learning method; hence, no supervision is provided to the algorithm, and it deals with an unlabeled dataset.
After applying this clustering technique, each cluster or group is given a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhere similar to the classification algorithm, but the difference
is the type of dataset that we are using. In classification, we work with the labeled data
set, whereas in clustering, we work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit any shopping mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and similarly, in the fruit and vegetable section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
The clustering technique can be widely used in various tasks. Some most common uses
of this technique are:

1. Market Segmentation
2. Statistical data analysis
3. Social network analysis
4. Image segmentation
5. Anomaly detection, etc.

The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.

Types of Clustering Methods

The clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group). Various other clustering approaches also exist. Below are the main clustering methods used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the
number of pre-defined groups. The cluster center is created in such a way that the
distance between the data points of one cluster is minimum as compared to another
cluster centroid.

2. Density-Based Clustering

The density-based clustering method connects the highly-dense areas into clusters, and
the arbitrarily shaped distributions are formed as long as the dense region can be
connected. This algorithm does it by identifying different clusters in the dataset and
connects the areas of high densities into clusters. The dense areas in data space are
divided from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done
by assuming some distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM).

4. Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as
there is no requirement of pre-specifying the number of clusters to be created. In this
technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. The observations or any number of clusters can be selected
by cutting the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.

5. Fuzzy Clustering

Fuzzy clustering is a type of soft clustering method in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.

Clustering Algorithms

The clustering algorithms can be divided based on the models explained above. There are many clustering algorithms published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using: for example, some algorithms need the number of clusters in the given dataset to be specified, whereas others need to find the minimum distance between the observations of the dataset.

Here we discuss the most popular clustering algorithms that are widely used in machine learning; a brief code sketch using two of them follows the list:

1. K-Means algorithm: The k-means algorithm is one of the most popular


clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variances. The number of clusters must be specified in
this algorithm. It is fast with fewer computations required, with the linear
complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that
works on updating the candidates for centroid to be the center of the points
within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of
high density are separated by the areas of low density. Because of this, the
clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used
as an alternative for the k-means algorithm or for those cases where K-means
can be failed. In GMM, it is assumed that the data points are Gaussian
distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require the number of clusters to be specified. In this algorithm, each pair of data points exchanges messages until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
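As a brief illustration of two of the algorithms listed above, here is a minimal sketch, assuming scikit-learn is installed; the two-moon dataset is synthetic and the eps and min_samples values are arbitrary example settings:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Synthetic "two moons" data: non-spherical clusters that centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Density-based clustering: clusters grow out of dense regions; noise points get the label -1.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Bottom-up hierarchical clustering: each point starts as its own cluster and clusters are merged.
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("DBSCAN found", len(set(db_labels) - {-1}), "clusters")
print("Agglomerative cluster sizes:", [list(agg_labels).count(c) for c in set(agg_labels)])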

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:
1. In Identification of Cancer Cells: The clustering algorithms are widely used for
the identification of cancerous cells. It divides the cancerous and non-cancerous
data sets into different groups.
2. In Search Engines: Search engines also work on the clustering technique. The
search result appears based on the closest object to the search query. It does it
by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
3. Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
4. In Biology: It is used in the biology stream to classify different species of plants
and animals using the image recognition technique.
5. In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.

Decision Tree Classification Algorithm

 Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple
branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
 The decisions or the test are performed on the basis of features of the given
dataset.
 It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

Below diagram explains the general structure of a decision tree:


Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.

The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.


Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the
tree. The complete process can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, says S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

Step-3: Divide the S into subsets that contains possible values for the best attributes.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in
step -3. Continue this process until a stage is reached where you cannot further classify
the nodes and called the final node as a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision tree
starts with the root node (Salary attribute by ASM). The root node splits further into the
next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure (ASM). With this measure, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM:

1. Information Gain
2. Gini Index

1. Information Gain:

Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute. It calculates how much information a feature provides us
about a class. According to the value of information gain, we split the node and build the
decision tree.

A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Information Gain = Entropy(S) − [Weighted Avg × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the
randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

S= Total number of samples

P(yes)= probability of yes

P(no)= probability of no
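As an illustration of the two formulas above, here is a minimal sketch (with a small hand-made dataset, assumed only for demonstration) that computes Entropy(S) and the Information Gain of one attribute.

```python
# Minimal sketch: entropy and information gain for a tiny labelled dataset.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of P(class) * log2 P(class)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Information Gain = Entropy(S) - weighted average entropy of each feature value."""
    total = len(labels)
    weighted = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        weighted += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical example: does the candidate accept the offer, split by salary level?
accept = ["yes", "yes", "no", "yes", "no", "no"]
salary = ["high", "high", "low", "high", "low", "low"]
print("Entropy(S)       =", round(entropy(accept), 3))
print("Information Gain =", round(information_gain(accept, salary), 3))
```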

2. Gini Index:

Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.

The CART algorithm creates only binary splits, and it uses the Gini index to create them.

Gini index can be calculated using the below formula:

Gini Index = 1 − ∑j Pj²
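A minimal sketch of the Gini Index formula, with hypothetical label sets:

```python
# Minimal sketch: Gini Index = 1 - sum over classes of P(class)^2.
from collections import Counter

def gini_index(labels):
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_index(["yes", "yes", "no", "no"]))    # 0.5 (maximum impurity for two classes)
print(gini_index(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node)
```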

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size of
the learning tree without reducing accuracy is known as Pruning. There are mainly two
types of tree pruning technology used:
1. Cost Complexity Pruning
2. Reduced Error Pruning.

Advantages of the Decision Tree

 It is simple to understand as it follows the same process which a human follows
while making any decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

 The decision tree contains lots of layers, which makes it complex.


 It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
 For more class labels, the computational complexity of the decision tree may
increase.
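To tie the above together, here is a minimal sketch (the Iris dataset and parameter values are assumptions made only for illustration) of training a CART-style decision tree with scikit-learn, using an attribute selection measure and cost-complexity pruning:

```python
# Minimal sketch: a CART decision tree with pruning in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" uses the Gini Index; criterion="entropy" uses Information Gain.
# ccp_alpha enables cost-complexity pruning to limit overfitting.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=load_iris().feature_names))
```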

Bayesian Belief Network in artificial intelligence

A Bayesian belief network is a key technology for dealing with probabilistic events and
for solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of


variables and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian


model.

Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.

Bayesian Network can be used for building models from data and experts opinions,
and it consists of two parts:

Directed Acyclic Graph

Table of conditional probabilities.

The generalized form of Bayesian network that represents and solve decision
problems under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

 Each node corresponds to a random variable, and a variable can be continuous or
discrete.
 Arcs (directed arrows) represent the causal relationships or conditional
probabilities between random variables. These directed links or arrows connect
pairs of nodes in the graph.

These links represent that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.

 In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
 If we are considering node B, which is connected to node A by a directed arrow,
then node A is called the parent of node B.
 Node C is independent of node A.

Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is
known as a directed acyclic graph or DAG.

The Bayesian network has mainly two components:

1. Causal Component
2. Actual numbers

Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.

A Bayesian network is based on the joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3, ....., xn, then the probabilities of the different
combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.

P[x1, x2, x3,....., xn], it can be written as the following way in terms of the joint
probability distribution.

= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]

= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed
acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The
alarm responds reliably to a burglary but also responds to minor earthquakes. Harry
has two neighbors, David and Sophia, who have taken the responsibility to inform
Harry at work when they hear the alarm. David always calls Harry when he hears the
alarm, but sometimes he gets confused with the phone ringing and calls then too. On
the other hand, Sophia likes to listen to loud music, so sometimes she misses the
alarm. Here we would like to compute probabilities in the burglar alarm network.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and both David and Sophia have called Harry.

Solution:

The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect
the probability of the alarm going off, while David's and Sophia's calls depend only on
the alarm.

The network represents our assumptions that David and Sophia do not perceive the
burglary directly, do not notice minor earthquakes, and do not confer with each other
before calling.

The conditional distributions for each node are given as conditional probabilities
table or CPT.

Each row in the CPT must sum to 1 because the entries in the table represent an
exhaustive set of cases for the variable.

In a CPT, a Boolean variable with k Boolean parents has 2^k rows of probabilities.
Hence, if there are two parents, the CPT will contain 4 rows of probability values.

List of all events occurring in this network:

 Burglary (B)
 Earthquake(E)
 Alarm(A)
 David Calls(D)
 Sophia calls(S)

We can write the events of problem statement in the form of probability: P[D, S, A,
B, E], can rewrite the above probability statement using joint probability distribution:

P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]

=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P [D| A]. P [ S| A, B, E]. P[ A, B, E]

= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]

= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake.

P(E= False)= 0.999, which is the probability that an earthquake has not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The Conditional probability of Alarm A depends on Burglar and earthquake:

B        E        P(A= True)   P(A= False)

True     True     0.94         0.06

True     False    0.95         0.05

False    True     0.31         0.69

False    False    0.001        0.999

Conditional probability table for David Calls:

The conditional probability that David calls depends on the probability of the Alarm.

A        P(D= True)   P(D= False)

True     0.91         0.09

False    0.05         0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm".

A        P(S= True)   P(S= False)

True     0.75         0.25

False    0.02         0.98

From the formula of joint distribution, we can write the problem statement in the
form of probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
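The same calculation can be reproduced programmatically. The sketch below simply multiplies the factors of the joint distribution using the CPT values given above (the A=False rows of David's and Sophia's tables are not needed for this particular query):

```python
# Minimal sketch: computing P(D, S, A, ¬B, ¬E) from the CPTs of the alarm network.
p_b = {True: 0.002, False: 0.998}            # P(Burglary)
p_e = {True: 0.001, False: 0.999}            # P(Earthquake)
p_a = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(Alarm=True | B, E)
p_d = {True: 0.91, False: 0.05}              # P(David calls=True | Alarm)
p_s = {True: 0.75, False: 0.02}              # P(Sophia calls=True | Alarm)

def joint(d, s, a, b, e):
    """P(D, S, A, B, E) = P(D|A) * P(S|A) * P(A|B,E) * P(B) * P(E)."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pd = p_d[a] if d else 1 - p_d[a]
    ps = p_s[a] if s else 1 - p_s[a]
    return pd * ps * pa * p_b[b] * p_e[e]

# P(David calls, Sophia calls, Alarm, no Burglary, no Earthquake)
print(joint(d=True, s=True, a=True, b=False, e=False))   # ≈ 0.00068045
```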

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which is
given below:
1. To understand the network as the representation of the Joint probability
distribution.
It is helpful to understand how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements.
It is helpful in designing inference procedure.
Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed the
Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs. If we
want a model that can accurately identify whether it is a cat or a dog, such a model
can be created by using the SVM algorithm. We first train our model with lots of
images of cats and dogs so that it can learn their different features, and then we test
it with this strange creature. The SVM creates a decision boundary between the two
classes (cat and dog) and chooses the extreme cases (support vectors) of each class.
On the basis of the support vectors, it will classify the creature as a cat. Consider the
below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called the Linear
SVM classifier.

2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called the Non-linear SVM
classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes
in n-dimensional space, but we need to find the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features present in the
dataset: if there are 2 features (as shown in the image), the hyperplane will be a
straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect
the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector.

How does SVM work?

Linear SVM:
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has
two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of
coordinates in either green or blue. Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points
of both classes to the line. These points are called support vectors. The distance
between the support vectors and the hyperplane is called the margin, and the goal of
SVM is to maximize this margin. The hyperplane with the maximum margin is called
the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the dataset into classes in the following way. Consider the
below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the
x-axis. If we convert it back to 2-D space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
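The following minimal sketch (synthetic datasets and parameter values are assumptions for illustration) shows a Linear SVM on linearly separable data and a Non-linear SVM, where an RBF kernel plays the role of the extra z = x² + y² dimension described above:

```python
# Minimal sketch: linear vs non-linear SVM with scikit-learn.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data -> Linear SVM classifier.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
linear_svm = SVC(kernel="linear", C=1.0).fit(X_lin, y_lin)
print("Linear SVM accuracy:", linear_svm.score(X_lin, y_lin))
print("Support vectors per class:", linear_svm.n_support_)

# Non-linearly separable data (concentric circles) -> kernel SVM.
X_cir, y_cir = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_cir, y_cir)
print("RBF SVM accuracy:", rbf_svm.score(X_cir, y_cir))
```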


Genetic Algorithm in Machine Learning

A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's
theory of evolution in Nature." It is used to solve optimization problems in machine
learning. It is one of the important algorithms as it helps solve complex problems
that would take a long time to solve.

Genetic Algorithms are being widely used in different real-world applications, for
example, Designing electronic circuits, code-breaking, image processing, and
artificial creativity.

In this topic, we will explain the Genetic algorithm in detail, including the basic
terminologies used in the Genetic algorithm, how it works, and the advantages and
limitations of the genetic algorithm.

What is a Genetic Algorithm?

Before understanding the Genetic algorithm, let's first go through the basic
terminologies used in this algorithm:
Population: The population is the subset of all possible or probable solutions that can
solve the given problem.
Chromosome: A chromosome is one of the solutions in the population for the given
problem, and a collection of genes forms a chromosome.
Gene: A gene is an element of the chromosome; a chromosome is divided into genes.
Allele: An allele is the value given to a gene within a particular chromosome.
Fitness Function: The fitness function is used to determine an individual's fitness level
in the population, i.e., the ability of an individual to compete with other individuals. In
every iteration, individuals are evaluated based on their fitness function.
Genetic Operators: In a genetic algorithm, the best individuals mate to generate
offspring better than the parents. Genetic operators change the genetic composition
of the next generation.
Selection

After calculating the fitness of every individual in the population, a selection process is
used to determine which individuals in the population will get to reproduce and
produce the offspring that will form the next generation.

Types of selection methods available:

1. Roulette wheel selection
2. Tournament selection
3. Rank-based selection

So, now we can define a genetic algorithm as a heuristic search algorithm to solve
optimization problems. It is a subset of evolutionary algorithms, which is used in
computing. A genetic algorithm uses genetic and natural selection concepts to solve
optimization problems.

How Genetic Algorithm Work?

The genetic algorithm works on the evolutionary generational cycle to generate
high-quality solutions. These algorithms use different operations that either enhance
or replace the population to give an improved fit solution.
It basically involves five phases to solve the complex optimization problems, which
are given as below:
1. Initialization
2. Fitness Assignment
3. Selection
4. Reproduction
5. Termination

1. Initialization

The process of a genetic algorithm starts by generating the set of individuals, which
is called population. Here each individual is the solution for the given problem. An
individual contains or is characterized by a set of parameters called Genes. Genes
are combined into a string and generate chromosomes, which is the solution to the
problem. One of the most popular techniques for initialization is the use of random
binary strings.
2. Fitness Assignment

The fitness function is used to determine how fit an individual is, i.e., its ability to
compete with other individuals. In every iteration, individuals are evaluated based on
their fitness function. The fitness function provides a fitness score to each individual.
This score determines the probability of being selected for reproduction: the higher
the fitness score, the greater the chance of getting selected for reproduction.

3. Selection

The selection phase involves selecting individuals for the reproduction of offspring.
All the selected individuals are then arranged in pairs of two for reproduction, and
these individuals transfer their genes to the next generation.
There are three types of Selection methods available, which are:
1. Roulette wheel selection
2. Tournament selection
3. Rank-based selection

4. Reproduction

After the selection process, the creation of a child occurs in the reproduction step.
In this step, the genetic algorithm uses two variation operators that are applied to
the parent population. The two operators involved in the reproduction phase are
given below:
 Crossover: The crossover plays a most significant role in the reproduction
phase of the genetic algorithm. In this process, a crossover point is selected
at random within the genes. Then the crossover operator swaps genetic
information of two parents from the current generation to produce a new
individual representing the offspring.

The genes of the parents are exchanged among themselves until the crossover point is
reached. These newly generated offspring are added to the population. This process is
also called recombination or crossover. Types of crossover styles available:
 One-point crossover
 Two-point crossover
 Uniform crossover

Mutation
The mutation operator inserts random genes in the offspring (new child) to
maintain the diversity in the population. It can be done by flipping some bits in the
chromosomes.
Mutation helps in solving the issue of premature convergence and enhances
diversification. The below image shows the mutation process:

Types of mutation styles available,


 Flip bit mutation
 Gaussian mutation
 Exchange/Swap mutation
5. Termination
After the reproduction phase, a stopping criterion is applied as a base for
termination. The algorithm terminates after the threshold fitness solution is reached.
It will identify the final solution as the best solution in the population.
General Workflow of a Simple Genetic Algorithm
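The workflow can be illustrated with a minimal sketch (a toy "maximise the number of 1-bits" problem with illustrative parameter values, not taken from the manual) that walks through the five phases described above:

```python
# Minimal sketch: a simple genetic algorithm for the one-max problem.
import random

random.seed(0)
GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(chromosome):                 # 2. fitness assignment
    return sum(chromosome)

def select(population):                  # 3. selection (tournament style)
    return max(random.sample(population, 3), key=fitness)

def crossover(p1, p2):                   # 4a. one-point crossover
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(chromosome):                  # 4b. flip-bit mutation
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

# 1. initialization: random binary strings
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
best = max(population, key=fitness)
for generation in range(GENERATIONS):
    if fitness(best) == GENES:           # 5. termination: threshold fitness reached
        break
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
    best = max(population + [best], key=fitness)   # keep track of the best so far

print("Best chromosome:", best, "fitness:", fitness(best))
```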

Advantages of Genetic Algorithm

 Genetic algorithms have excellent parallel capabilities.
 They help in optimizing various problems such as discrete functions, multi-objective
problems, and continuous functions.
 It provides a solution for a problem that improves over time.
 A genetic algorithm does not need derivative information.
Limitations of Genetic Algorithms

 Genetic algorithms are not efficient algorithms for solving simple problems.
 It does not guarantee the quality of the final solution to a problem.
 Repetitive calculation of fitness values may generate some computational
challenges.

Difference between Genetic Algorithms and Traditional Algorithms

 A search space is the set of all possible solutions to the problem. In the
traditional algorithm, only one set of solutions is maintained, whereas, in a
genetic algorithm, several sets of solutions in search space can be used.
 Traditional algorithms need more information in order to perform a search,
whereas genetic algorithms need only one objective function to calculate the
fitness of an individual.
 Traditional algorithms cannot easily work in parallel, whereas genetic algorithms
can, because the fitness of each individual is calculated independently.
 One big difference is that rather than operating directly on candidate solutions, a
genetic algorithm operates on their representations (or encodings), frequently
referred to as chromosomes.
 Traditional Algorithms can only generate one result in the end, whereas
Genetic Algorithms can generate multiple optimal results from different
generations.
 A traditional algorithm is less likely to generate optimal results for complex
problems. Genetic algorithms do not guarantee a globally optimal result either, but
there is a good possibility of obtaining an optimal result because they use genetic
operators such as crossover and mutation.
 Traditional algorithms are deterministic in nature, whereas Genetic
algorithms are probabilistic and stochastic in nature.
Issues in Machine Learning

"Machine Learning" is one of the most popular technology among all data scientists and
machine learning enthusiasts. It is the most effective Artificial Intelligence technology that
helps create automated learning systems to take future decisions without being constantly
programmed. It can be considered an algorithm that automatically constructs various
computer software using past experience and training data. It can be seen in every
industry, such as healthcare, education, finance, automobile, marketing, shipping,
infrastructure, automation, etc. Almost all big companies like Amazon, Facebook, Google,
Adobe, etc., are using various machine learning techniques to grow their businesses. But
everything in this world has bright as well as dark sides. Similarly, Machine Learning offers
great opportunities, but some issues need to be solved.

This article will discuss some major practical issues and their business implementation, and
how we can overcome them. So let's start with a quick introduction to Machine Learning.

Common issues in Machine Learning

Although machine learning is being used in every industry and helps organizations make
more informed and data-driven choices that are more effective than classical
methodologies, it still has so many problems that cannot be ignored. Here are some
common issues in Machine Learning that professionals face to inculcate ML skills and
create an application from scratch.

1. Inadequate Training Data

The major issue that arises while using machine learning algorithms is the lack of quality
as well as quantity of data. Although data plays a vital role in the processing of machine
learning algorithms, many data scientists report that inadequate, noisy, and unclean data
severely hamper machine learning algorithms. For example, a simple task requires
thousands of samples, while an advanced task such as speech or image recognition needs
millions of sample data examples. Further, data quality is also important for the algorithms
to work ideally, but poor data quality is common in Machine Learning applications. Data
quality can be affected by factors such as the following:

o Noisy Data- It is responsible for an inaccurate prediction that affects the decision as
well as accuracy in classification tasks.
o Incorrect data- It is also responsible for faulty programming and results obtained in
machine learning models. Hence, incorrect data may affect the accuracy of the
results also.
o Generalizing of output data- Sometimes, it is also found that generalizing output
data becomes complex, which results in comparatively poor future actions.
2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it must
be of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data
lead to less accuracy in classification and low-quality results. Hence, data quality can also
be considered as a major common problem while processing machine learning algorithms.

3. Non-representative training data

To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases to which we need to generalize. The
training data must cover all the kinds of cases that have already occurred as well as those
that are occurring.

Further, if we use non-representative training data in the model, it results in less accurate
predictions. A machine learning model is said to be ideal if it predicts well for generalized
cases and provides accurate decisions. If there is too little training data, there will be
sampling noise in the model; a flawed sampling method gives a non-representative
training set, which will not be accurate in its predictions and will be biased against one
class or group.

Hence, we should use representative data in training to protect against bias and to make
accurate predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by Machine Learning engineers and
data scientists. Whenever a machine learning model is trained on noisy or biased data, it
starts capturing the noise and inaccuracies of the training data set, which negatively
affects the performance of the model. Let's understand this with a simple example where
we have a training data set containing 1000 mangoes, 1000 apples, 1000 bananas, and
5000 papayas. There is then a considerable probability of identifying an apple as a papaya
because we have a massive amount of biased data in the training data set, so the
prediction gets negatively affected. One common reason behind overfitting is the use of
highly flexible non-linear methods, which can build unrealistic models of the data; one
way to reduce overfitting is to use simpler linear and parametric algorithms in the machine
learning models.

Methods to reduce overfitting:

o Increase training data in a dataset.


o Reduce model complexity by simplifying the model by selecting one with fewer
parameters
o Ridge Regularization and Lasso Regularization (see the sketch after this list)
o Early stopping during the training phase
o Reduce the noise
o Reduce the number of attributes in training data.
o Constraining the model.
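As a concrete illustration of the Ridge and Lasso remedies mentioned in the list above, here is a minimal sketch on synthetic data (the dataset and alpha values are assumptions for illustration):

```python
# Minimal sketch: constraining a linear model with Ridge/Lasso to reduce overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))                  # few samples, many features -> easy to overfit
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Plain", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    print(f"{name:6s} train R2={model.score(X_train, y_train):.2f} "
          f"test R2={model.score(X_test, y_test):.2f}")
```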

Underfitting:

Underfitting is just the opposite of overfitting. Whenever a machine learning model is too
simple or is trained with too little data, it produces incomplete and inaccurate predictions,
and the accuracy of the machine learning model suffers.

Underfitting occurs when our model is too simple to capture the underlying structure of
the data, just like an undersized pant. This generally happens when we have limited data
in the data set and we try to build a linear model with non-linear data. In such scenarios,
the model lacks the complexity it needs, its rules become too simple to apply to this data
set, and it starts making wrong predictions as well.

Methods to reduce Underfitting:

o Increase model complexity


o Remove noise from the data
o Trained on increased and better features
o Reduce the constraints
o Increase the number of epochs to get better results.

5. Monitoring and maintenance

As we know, generalized output is mandatory for any machine learning model; hence,
regular monitoring and maintenance become compulsory. As the data changes, the model
may produce different results for the same actions; hence editing the code, as well as
allocating resources for monitoring it, also becomes necessary.

6. Getting bad recommendations

A machine learning model is trained in a specific context; when that context changes, the
model starts producing bad recommendations and suffers concept drift. Let's understand
this with an example: at a specific time a customer is looking for some gadgets, but the
customer's requirements change over time, while the machine learning model keeps
showing the same recommendations even though the customer's expectations have
changed. This incident is called Data Drift. It generally occurs when new data is introduced
or the interpretation of the data changes. However, we can overcome this by regularly
updating and monitoring the data according to expectations.

7. Lack of skilled resources

Although Machine Learning and Artificial Intelligence are continuously growing in the
market, these industries are still newer in comparison to others. The absence of skilled
resources in the form of manpower is also an issue. Hence, we need manpower with in-
depth knowledge of mathematics, science, and technology for developing and managing
machine learning systems.

8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning
algorithm: the model must distinguish the customers who act on the recommendations it
shows from those who do not even check them. Hence, an algorithm is necessary to
recognize customer behavior and trigger a relevant recommendation for the user based
on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by
machine learning engineers and data scientists. Machine Learning and Artificial Intelligence
are still relatively new technologies, largely in an experimental phase and continuously
changing over time. Much of the work is hit-and-trial experimentation, so the probability of
error is higher than expected. Further, the process also includes analyzing the data,
removing data bias, training data, applying complex mathematical calculations, etc.,
making the procedure more complicated and quite tedious.

10. Data Bias

Data Biasing is also found a big challenge in Machine Learning. These errors exist when
certain elements of the dataset are heavily weighted or need more importance than others.
Biased data leads to inaccurate results, skewed outcomes, and other analytical errors.
However, we can resolve this error by determining where data is actually biased in the
dataset. Further, take necessary steps to reduce it.

Methods to remove Data Bias:

o Research more for customer segmentation.


o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.
o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.
o Use multi-pass annotation such as sentiment analysis, content moderation, and intent
recognition.

11. Lack of Explainability

This basically means that the outputs of a model cannot be easily comprehended,
because the model is built to deliver results for certain conditions in ways that are hard to
trace. Hence, a lack of explainability is also found in machine learning algorithms, which
reduces the credibility of the algorithms.

12. Slow implementations and results

This issue is also very commonly seen in machine learning models. Machine learning
models can be highly effective at producing accurate results, but doing so is time-
consuming. Slow programs, excessive requirements and overloaded data take more time
than expected to provide accurate results. This requires continuous maintenance and
monitoring of the model to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if we
feed garbage data as input, then the result will also be garbage. Hence, we should use
relevant features in our training sample. A machine learning model is said to be good if
training data has a good set of features or less to no irrelevant features.

Difference between Data Science and Machine Learning:

Definition
Data Science: A multidisciplinary field focused on extracting knowledge and insights from data.
Machine Learning: A subset of AI and data science focusing on building systems that learn from data and improve from experience.

Objective
Data Science: To analyze and interpret complex data to aid decision-making and strategic planning.
Machine Learning: To develop algorithms that can learn from and make predictions or decisions based on data.

Scope
Data Science: Broader, encompassing various techniques for data analysis, including machine learning.
Machine Learning: More focused, primarily on developing and tuning algorithms that can learn and make predictions.

Tools and Technologies
Data Science: Python, R, SQL, Tableau, Hadoop, etc.
Machine Learning: Python, R, TensorFlow, Scikit-Learn, PyTorch, etc.

Processes Involved
Data Science: Data cleaning, data analysis, data visualization, and interpretation.
Machine Learning: Data preprocessing, model training, model testing, and model deployment.

Applications
Data Science: Market analysis, data reporting, business analytics, predictive modeling.
Machine Learning: Predictive analytics, speech recognition, recommendation systems, self-driving cars.

Skills Required
Data Science: Statistical analysis, data visualization, big data platforms, domain-specific knowledge.
Machine Learning: Deep understanding of algorithms, neural networks, statistical modeling, and natural language processing.

End Goal
Data Science: To extract insights and knowledge from data in various formats.
Machine Learning: To enable machines to learn from data so they can provide accurate predictions and decisions.

Career Path
Data Science: Data Analyst, Data Scientist, Data Engineer, Business Analyst.
Machine Learning: Machine Learning Engineer, AI Engineer, Research Scientist, Data Scientist.
UNIT – 2

2. REGRESSION,BAYESIAN LEARNINGAND SUPPORT VECTOR MACHINE

2.1 REGRESSION
a. To Understand linear regression and logistic regression machine approaches for
WHY separation of data categorically and numerically.

a. Practice with data of various Problems and analyse the needs of Learning
WHAT algorithms.
b. implement various Machine Learning methods with various problems.

WHERE a. Used to analyse and evaluate the data in Machine Learning Problems.

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on


developing systems that learn—or improve performance—based on the data they ingest.
Artificial intelligence is a broad word that refers to systems or machines that resemble
human intelligence. Machine learning and AI are frequently discussed together, and the
terms are occasionally used interchangeably, although they do not signify the same thing.
A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.

What is Machine Learning?

Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed. ML is one of the most exciting technologies one could come
across. As is evident from the name, it gives the computer the capability that makes it
more similar to humans: the ability to learn. Machine learning is actively being used today,
perhaps in many more places than one would expect.

Linear Regression vs Logistic Regression


Linear Regression and Logistic Regression are two famous Machine Learning algorithms
that come under the supervised learning technique. Since both algorithms are supervised
in nature, they use labeled datasets to make predictions. The main difference between
them is how they are used: Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems. The description of
both algorithms is given below, along with a difference table.
Linear Regression:
Linear Regression is one of the simplest Machine Learning algorithms; it comes under the
supervised learning technique and is used for solving regression problems.

It is used for predicting the continuous dependent variable with the help of independent
variables.

The goal of Linear Regression is to find the best fit line that can accurately predict the
output for the continuous dependent variable.

If a single independent variable is used for prediction, it is called Simple Linear Regression,
and if there is more than one independent variable, such regression is called Multiple
Linear Regression.

By finding the best fit line, the algorithm establishes the relationship between the
dependent variable and the independent variables, and this relationship should be linear
in nature.

The output for Linear regression should only be the continuous values such as price, age,
salary, etc. The relationship between the dependent variable and independent variable can
be shown in below image:
In above image the dependent variable is on Y-axis (salary) and independent variable is on
x-axis(experience). The regression line can be written as:

y= a0+a1x+ ε

Where, a0 and a1 are the coefficients and ε is the error term.

Logistic Regression:
Logistic Regression is one of the most popular Machine Learning algorithms; it comes
under the supervised learning technique.

It can be used for classification as well as for regression problems, but it is mainly used for
classification problems.

Logistic Regression is used to predict the categorical dependent variable with the help of
independent variables.

The output of a Logistic Regression problem can only be between 0 and 1.

Logistic Regression can be used where the probability of one of two classes is required,
such as whether it will rain today or not: either 0 or 1, true or false, etc.
Logistic regression is based on the concept of Maximum Likelihood estimation. According
to this estimation, the observed data should be most probable.

In logistic regression, we pass the weighted sum of inputs through an activation function
that can map values in between 0 and 1. Such activation function is known as sigmoid
function and the curve obtained is called as sigmoid curve or S-curve. Consider the below
image:

o The equation for logistic regression (the logit form of the sigmoid) is:

log[y / (1 − y)] = a0 + a1x1 + a2x2 + … + anxn

o Difference between Linear Regression and Logistic Regression:

Linear Regression: Linear regression is used to predict the continuous dependent variable using a given set of independent variables.
Logistic Regression: Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.

Linear Regression: It is used for solving regression problems.
Logistic Regression: It is used for solving classification problems.

Linear Regression: We predict the value of continuous variables.
Logistic Regression: We predict the values of categorical variables.

Linear Regression: We find the best fit line, by which we can easily predict the output.
Logistic Regression: We find the S-curve, by which we can classify the samples.

Linear Regression: The least squares estimation method is used for estimation of accuracy.
Logistic Regression: The maximum likelihood estimation method is used for estimation of accuracy.

Linear Regression: The output must be a continuous value, such as price, age, etc.
Logistic Regression: The output must be a categorical value such as 0 or 1, Yes or No, etc.

Linear Regression: It is required that the relationship between the dependent variable and the independent variables is linear.
Logistic Regression: It is not required to have a linear relationship between the dependent and independent variables.

Linear Regression: There may be collinearity between the independent variables.
Logistic Regression: There should not be collinearity between the independent variables.
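The contrast in the table above can be illustrated with a minimal sketch using scikit-learn (the synthetic data and numbers are assumptions for illustration):

```python
# Minimal sketch: continuous prediction vs categorical prediction.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))          # a single independent variable

# Regression problem: salary (continuous), y = a0 + a1*x + noise.
salary = 30000 + 5000 * x[:, 0] + rng.normal(scale=2000, size=100)
lin = LinearRegression().fit(x, salary)
print("Coefficients a0, a1:", lin.intercept_, lin.coef_[0])
print("Predicted salary at x=4:", lin.predict([[4.0]])[0])

# Classification problem: rain today (1) or not (0), predicted via the sigmoid curve.
rain = (x[:, 0] + rng.normal(scale=1.0, size=100) > 5).astype(int)
log = LogisticRegression().fit(x, rain)
print("P(rain=1 | x=4):", log.predict_proba([[4.0]])[0, 1])
print("Predicted class at x=4:", log.predict([[4.0]])[0])
```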

2. REGRESSION,BAYESIAN LEARNINGAND SUPPORT VECTOR MACHINE

2.2 BAYESIAN LEARNING


a. To Understand various concepts like Bayes theorem, Concept Learning, optimal
WHY classifier and Bayesian Belief Networks and Elimination algorithms.
b. To Understand the issues related to Machine Learning and Data Science.
a. Implement and analyse various machine learning problems with probability
WHAT distributions using Bayes theorem and belief networks.
b. Analyse and implement various algorithms and approaches of Machine Learning
a. Used in various problems machine learning problems for assured and accurate
WHERE outcomes.

Bayes Theorem in Machine learning


Machine Learning is one of the most rapidly emerging technologies of Artificial
Intelligence. We are living in the 21st century, which is completely driven by new
technologies and gadgets, some of which are yet to be used and few of which are at their
full potential.
is also a technology that is still in its developing phase. There are lots of concepts that
make machine learning a better technology such as supervised learning, unsupervised
learning, reinforcement learning, perceptron models, Neural networks, etc. In this article
"Bayes Theorem in Machine Learning", we will discuss another most important concept of
Machine Learning theorem i.e., Bayes Theorem. But before starting this topic you should
have essential understanding of this theorem such as what exactly is Bayes theorem, why it
is used in Machine Learning, examples of Bayes theorem in Machine Learning and much
more. So, let's start the brief introduction of Bayes theorem.

Introduction to Bayes Theorem in Machine Learning


Bayes' theorem is named after the English statistician, philosopher, and Presbyterian
minister Thomas Bayes (18th century). Bayes contributed foundational ideas to decision
theory, which is extensively used with important mathematical concepts such as
probability. Bayes' theorem is also widely used in Machine Learning, where we need to
predict classes precisely and accurately. An important concept based on Bayes' theorem,
the Bayesian method, is used to calculate conditional probability in Machine Learning
applications that include classification tasks. Further, a simplified version of Bayes'
theorem (Naïve Bayes classification) is also used to reduce computation time and the
average cost of projects.

Bayes' theorem is also known by other names such as Bayes' rule or Bayes' law. Bayes'
theorem helps to determine the probability of an event using uncertain prior knowledge.
It is used to calculate the probability of one event occurring given that another one has
already occurred. It is a useful method for relating conditional probability and marginal
probability.

In simple words, we can say that Bayes' theorem helps to produce more accurate results.

Bayes' theorem is used to estimate the precision of values and provides a method for
calculating conditional probability. Although it is a seemingly simple calculation, it is used
to easily calculate the conditional probability of events where intuition often fails. Some
data scientists assume that Bayes' theorem is used mostly in the financial industry, but
that is not the case. Besides finance, Bayes' theorem is also extensively applied in health
and medicine, the research and survey industry, the aeronautical sector, etc.
What is Bayes Theorem?
Bayes theorem is one of the most popular machine learning concepts that helps to
calculate the probability of occurring one event with uncertain knowledge while other one
has already occurred.

Bayes' theorem can be derived using product rule and conditional probability of event X
with known event Y:

o According to the product rule, we can express the probability of event X together with
known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y)   {equation 1}

o Further, the probability of event Y together with known event X:

P(X ∩ Y) = P(Y|X) P(X)   {equation 2}
Mathematically, Bayes' theorem can be obtained by equating the right-hand sides of the
two equations and dividing by P(Y). We get:

P(X|Y) = [P(Y|X) P(X)] / P(Y)

Here, X and Y are any two events of the experiment whose conditional probabilities we
want to relate.

The above equation is called Bayes' Rule or Bayes' Theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated
probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of evidence when hypothesis is
true.
o P(X) is called the prior probability, probability of hypothesis before considering the
evidence
o P(Y) is called marginal probability. It is defined as the probability of evidence under
any consideration.
Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
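A minimal numerical sketch of this formula, with hypothetical numbers for a diagnostic-test scenario:

```python
# Minimal sketch: posterior = likelihood * prior / evidence with hypothetical numbers.
prior = 0.01                       # P(disease): 1% prevalence (assumed)
likelihood = 0.90                  # P(positive | disease): sensitivity (assumed)
false_positive = 0.05              # P(positive | no disease) (assumed)

evidence = likelihood * prior + false_positive * (1 - prior)   # marginal P(positive)
posterior = likelihood * prior / evidence                      # P(disease | positive)
print(round(posterior, 3))         # ≈ 0.154
```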


Prerequisites for Bayes Theorem
While studying the Bayes theorem, we need to understand few important concepts. These
are as follows:

1. Experiment

An experiment is defined as a planned operation carried out under controlled conditions,
such as tossing a coin, drawing a card, or rolling a die.

2. Sample Space

The results we can get during an experiment are called outcomes, and the set of all
possible outcomes of an experiment is known as the sample space. For example, if we are
rolling a die, the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment is related to toss a coin and recording its outcomes, then
sample space will be:

S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space of an experiment; in other words, it is
a set of outcomes.

Assume in our experiment of rolling a dice, there are two event A and B such that;

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}


o Probability of the event A, P(A) = Number of favourable outcomes / Total number of
possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, Probability of the event B ''P(B)''= Number of favourable outcomes / Total
number of possible outcomes
=2/6
=1/3
=0.333
o Union of event A and B:
A∪B = {2, 4, 5, 6}

o Intersection of event A and B:


A∩B= {6}

o Disjoint Event: If the intersection of the event A and B is an empty set or null then such
events are known as disjoint event or mutually exclusive events also.

4. Random Variable:

It is a real-valued function that maps the sample space of an experiment to the real line.
A random variable takes on various values, each with some probability. Strictly speaking,
it is neither random nor a variable, but it behaves as a function, and it can be discrete,
continuous, or a combination of both.

5. Exhaustive Event:

As the name suggests, a set of events of which at least one must occur when the
experiment is performed is called an exhaustive set of events of that experiment.

Thus, two events A and B are said to be exhaustive if either A or B must definitely occur;
for example, while tossing a coin, the outcome is either a Head or a Tail (here the two
events are also mutually exclusive).

6. Independent Event:

Two events are said to be independent when occurrence of one event does not affect the
occurrence of another event. In simple words we can say that the probability of outcome
of both events does not depends one another.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another
event B has already occurred (i.e., A conditioned on B). This is represented by P(A|B), and
we can define it as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring irrespective of
any other event B. It can be thought of as the probability of the evidence under any
consideration.

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)


Here ~B represents the event that B does not occur.

How to apply Bayes Theorem or Bayes rule in Machine Learning?


Bayes' theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and
P(A). This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B),
and P(A) and need to determine the fourth term.

The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is used in
classification algorithms to separate data into classes quickly and accurately.

Let's understand the use of Bayes theorem in machine learning with below example.

Suppose we have a vector A with i attributes. It means

A = A1, A2, A3, A4, ……, Ai

Further, we have n classes represented as C1, C2, C3, C4, ……, Cn.

These are the two conditions given to us, and our classifier, which works using Machine
Learning, has to assign A to the most probable class. With the help of Bayes' theorem, we
can write this as:

P(Ci/A) = [ P(A/Ci) * P(Ci) ] / P(A)

Here;

P(A) is the class-independent term.

P(A) remains constant across the classes, meaning it does not change its value when the
class changes. Therefore, to maximize P(Ci/A) we only have to maximize the value of the
term P(A/Ci) * P(Ci).

With n classes on the probability list, let's assume that every class is equally likely to be
the right answer. Considering this, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ….. = P(Cn).

This assumption helps us to reduce the computation cost as well as time. This is how
Bayes' theorem plays a significant role in Machine Learning, and the Naïve Bayes
assumption simplifies the conditional probability computation without greatly affecting
the precision, by factorizing the likelihood over the attributes:

P(A/Ci) = P(A1/Ci) * P(A2/Ci) * P(A3/Ci) * …… * P(An/Ci)

Hence, by using Bayes theorem in Machine Learning we can easily describe the
possibilities of smaller events.
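The class-prediction rule above can be illustrated with a minimal sketch; the tiny training set and the query instance are hypothetical and used only to show how P(Ci) and the factorised P(A|Ci) combine:

```python
# Minimal sketch: choose the class Ci that maximises P(Ci) * product_j P(Aj | Ci).
from collections import Counter

# Tiny hypothetical training set: (Outlook, Wind) -> Play
data = [("Sunny", "Weak", "No"), ("Sunny", "Strong", "No"), ("Rain", "Weak", "Yes"),
        ("Overcast", "Weak", "Yes"), ("Rain", "Strong", "No"), ("Overcast", "Strong", "Yes")]

class_counts = Counter(row[-1] for row in data)

def score(query, c):
    """P(Ci) times the product over attributes of P(Aj = query[j] | Ci)."""
    rows_c = [row for row in data if row[-1] == c]
    p = class_counts[c] / len(data)                                     # prior P(Ci)
    for j, value in enumerate(query):
        p *= sum(1 for row in rows_c if row[j] == value) / len(rows_c)  # likelihood factor
    return p

query = ("Rain", "Weak")
best = max(class_counts, key=lambda c: score(query, c))
print({c: round(score(query, c), 3) for c in class_counts}, "->", best)
```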

Concept Learning in Machine Learning

The problem of inducing general functions from specific training examples is central to
learning.

Concept learning can be formulated as a problem of searching through a predefined space


of potential hypotheses for the hypothesis that best fits the training examples.

What is Concept Learning…?

“A task of acquiring potential hypothesis (solution) that best fits the given training
examples.”
Consider the example task of learning the target concept “days on which my friend
Prabhas enjoys his favorite water sport.”

Below Table describes a set of example days, each represented by a set of attributes. The
attribute EnjoySport indicates whether or not Prabhas enjoys his favorite water sport on
this day. The task is to learn to predict the value of EnjoySport for an arbitrary day, based
on the values of its other attributes.

What hypothesis representation shall we provide to the learner in this case?

Let us begin by considering a simple representation in which each hypothesis consists of a
conjunction of constraints on the instance attributes.

In particular, let each hypothesis be a vector of six constraints, specifying the values of the
six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.

For each attribute, the hypothesis will either

• indicate by a “?’ that any value is acceptable for this attribute,

• specify a single required value (e.g., Warm) for the attribute, or

• indicate by a “ø” that no value is acceptable.

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a
positive example (h(x) = 1).

To illustrate, the hypothesis that Prabhas enjoys his favorite sport only on cold days with
high humidity (independent of the values of the other attributes) is represented by the
expression

(?, Cold, High, ?, ?, ?)

Most General and Specific Hypothesis

The most general hypothesis - that every day is a positive example - is represented by

(?, ?, ?, ?, ?, ?)

and the most specific possible hypothesis-that no day is a positive example-is represented
by

(ø, ø, ø, ø, ø, ø)

A CONCEPT LEARNING TASK – Search

Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.

The goal of this search is to find the hypothesis that best fits the training examples.

It is important to note that by selecting a hypothesis representation, the designer of the
learning algorithm implicitly defines the space of all hypotheses that the program can ever
represent and therefore can ever learn.

Instance Space

Consider, for example, the instances X and hypotheses H in the EnjoySport learning task.

Given that the attribute Sky has three possible values, and that AirTemp, Humidity,
Wind, Water, and Forecast each have two possible values, the instance space X contains
exactly

3 . 2 . 2 . 2 . 2 . 2 = 96 distinct instances.

Example:

Let’s assume there are two features F1 and F2, where F1 has A and B as possible values and F2 has X and Y as possible values.

F1 – > A, B

F2 – > X, Y

Instance Space: (A, X), (A, Y), (B, X), (B, Y) – 4 Examples

Hypothesis Space (syntactically distinct, 4 · 4 = 16): (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y), (B, ø), (B, ?), (ø, X), (ø, Y), (ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16

Hypothesis Space (semantically distinct, 1 + 3 · 3 = 10): (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), plus the single empty hypothesis – 10
[Figures: instance space and hypothesis space diagrams for this example]

Similarly there are 5 . 4 . 4 . 4 . 4 . 4 = 5120 syntactically distinct hypotheses within H.

Notice, however, that every hypothesis containing one or more “ø” symbols represents the
empty set of instances; that is, it classifies every instance as negative.

Therefore, the number of semantically distinct hypotheses is only 1 + (4 . 3 . 3 . 3 . 3 . 3) = 973.
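
These counts can be verified with a short calculation; the sketch below assumes the EnjoySport attribute sizes quoted above (Sky with 3 values, the remaining five attributes with 2 values each).

# Quick check of the instance and hypothesis space sizes for EnjoySport.
values_per_attr = [3, 2, 2, 2, 2, 2]

instances = 1
for v in values_per_attr:
    instances *= v                      # 3 * 2 * 2 * 2 * 2 * 2 = 96 distinct instances

syntactic = 1
semantic_nonempty = 1
for v in values_per_attr:
    syntactic *= v + 2                  # each attribute may also be '?' or 'ø'
    semantic_nonempty *= v + 1          # any 'ø' collapses to the one empty hypothesis

print(instances)                        # 96
print(syntactic)                        # 5120 syntactically distinct hypotheses
print(1 + semantic_nonempty)            # 973 semantically distinct hypotheses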

Our EnjoySport example is a very simple learning task, with a relatively small, finite
hypothesis space.
General-to-Specific Ordering of Hypotheses

To illustrate the general-to-specific ordering, consider the two hypotheses

h1 = (Sunny, ?, ?, Strong, ?, ?)

h2 = (Sunny, ?, ?, ?, ?, ?)

Now consider the sets of instances that are classified positive by h1 and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive.

In fact, any instance classified positive by h1 will also be classified positive by h2.
Therefore, we say that h2 is more general than h1.

For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) =
1.

We define the more_general_than_or_equal_to relation in terms of the sets of instances that satisfy the two hypotheses.
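
A simple way to make this relation concrete is the following illustrative Python check (a sketch, not taken from the manual): h2 is more general than or equal to h1 when every instance satisfying h1 also satisfies h2.

# Illustrative check of the more_general_than_or_equal_to relation.
def more_general_or_equal(h2, h1):
    """True if every instance that satisfies h1 also satisfies h2."""
    if "ø" in h1:
        return True                     # h1 covers no instances, so the relation holds trivially
    for c2, c1 in zip(h2, h1):
        if c2 == "?":
            continue                    # '?' in h2 accepts everything for this attribute
        if c2 != c1:                    # h2 requires a value h1 does not guarantee
            return False
    return True

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?",      "?", "?")
print(more_general_or_equal(h2, h1))    # True: h2 is more general than h1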

Bayes Optimal Classifier and Naive Bayes Classifier


The Bayes Optimal Classifier is a probabilistic model that predicts the most likely outcome for a new situation. In this section, we will have a look at the Bayes optimal classifier and the Naive Bayes classifier.

The Bayes theorem is a method for calculating a hypothesis’s probability based on its prior probability, the probabilities of observing specific data given the hypothesis, and the observed data itself.

What is Naïve Bayes Classifier in Machine Learning


The Naïve Bayes classifier is a supervised algorithm, which is based on Bayes theorem and used to solve classification problems. It is one of the simplest and most effective classification algorithms in Machine Learning, and it enables us to build ML models that make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Some popular applications of Naïve Bayes are spam filtering, sentiment analysis, and article classification.

Advantages of Naïve Bayes Classifier in Machine Learning:


It is one of the simplest and most effective methods for calculating conditional probabilities and for text classification problems.

A Naïve Bayes classifier can outperform more complex models when the assumption of independent predictors holds true.

It is easier to implement than many other models.

It requires only a small amount of training data to estimate its parameters, which keeps the training time short.

It can be used for binary as well as multi-class classification.


Disadvantages of Naïve Bayes Classifier in Machine Learning:
The main disadvantage of the Naïve Bayes classifier is its assumption of independent predictors: it implicitly assumes that all attributes are independent of (unrelated to) one another, whereas in real life it is rarely feasible to obtain mutually independent attributes.
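
For reference, a hedged usage sketch with scikit-learn's GaussianNB is shown below; it assumes scikit-learn is installed and uses the bundled Iris dataset purely for illustration.

# Naive Bayes with Gaussian class-conditional likelihoods in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))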

Bayesian Belief Network in artificial intelligence

A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:

Directed Acyclic Graph

Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:
 Each node corresponds to a random variable, which can be continuous or discrete.
 Arcs or directed arrows represent causal relationships or conditional dependencies between random variables. These directed links connect pairs of nodes in the graph.

A directed link indicates that one node directly influences the other; if there is no directed link between two nodes, they are independent of each other.

 In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
 If we are considering node B, which is connected to node A by a directed arrow, then node A is called the parent of node B.
 Node C is independent of node A.

Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph or DAG.

The Bayesian network has mainly two components:

1. Causal Component
2. Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which quantifies the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3,....., xn, then the probabilities of a different
combination of x1, x2, x3.. xn, are known as Joint probability distribution.

Using the chain rule, P[x1, x2, x3, ....., xn] can be written in terms of the joint probability distribution in the following way:

P[x1, x2, x3, ....., xn] = P[x1 | x2, x3, ....., xn] P[x2, x3, ....., xn]

= P[x1 | x2, x3, ....., xn] P[x2 | x3, ....., xn] .... P[xn-1 | xn] P[xn]

In general, for each variable Xi, we can write the equation as:

P(Xi | Xi-1, ........., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary but also responds to minor earthquakes. Harry has two neighbours, David and Sophia, who have taken the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so she sometimes misses hearing the alarm. Here we would like to compute probabilities related to the burglar alarm.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:

The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm probability.

The network also represents our assumptions: David and Sophia do not directly perceive the burglary, do not notice minor earthquakes, and do not confer with each other before calling.

The conditional distributions for each node are given as a conditional probability table, or CPT.

Each row in a CPT must sum to 1 because the entries in a row represent an exhaustive set of cases for the variable.

In a CPT, a Boolean variable with k Boolean parents has 2^k rows, one for each combination of parent values. Hence, if there are two parents, the CPT will contain 4 rows of probability values.

List of all events occurring in this network:

 Burglary (B)
 Earthquake(E)
 Alarm(A)
 David Calls(D)
 Sophia calls(S)

We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and we can rewrite this probability statement using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] . P[S, A, B, E]

= P[D | S, A, B, E] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B | E] . P[E]


Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True) = 0.001, which is the probability of a minor earthquake.

P(E= False) = 0.999, which is the probability that an earthquake has not occurred.

We can provide the conditional probabilities as per the below tables:


Conditional probability table for Alarm (A):

The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A=True)    P(A=False)
True     True     0.94         0.06
True     False    0.95         0.05
False    True     0.31         0.69
False    False    0.001        0.999

Conditional probability table for David Calls (D):

The conditional probability that David calls depends on the probability of Alarm:

A        P(D=True)    P(D=False)
True     0.91         0.09
False    0.02         0.98

Conditional probability table for Sophia Calls (S):

The conditional probability that Sophia calls depends on its parent node "Alarm":

A        P(S=True)    P(S=False)
True     0.75         0.25
False    0.02         0.98

From the formula of joint distribution, we can write the problem statement in the
form of probability distribution:

P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)

= 0.75 * 0.91 * 0.001 * 0.998 * 0.999

= 0.00068045

Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
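
The computation above can be reproduced with a few lines of Python; the sketch below is illustrative and simply multiplies the CPT values quoted in this example.

# Joint probability P(S, D, A, ¬B, ¬E) from the CPT values quoted above.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A_given_BE = {(True, True): 0.94, (True, False): 0.95,
                (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D_given_A = {True: 0.91, False: 0.02}                        # P(D=True | A)
P_S_given_A = {True: 0.75, False: 0.02}                        # P(S=True | A)

# P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B, ¬E) * P(¬B) * P(¬E)
b, e, a = False, False, True
joint = (P_S_given_A[a] * P_D_given_A[a] *
         P_A_given_BE[(b, e)] * P_B[b] * P_E[e])
print(round(joint, 8))                                         # ≈ 0.00068045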

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which are given below:
1. To understand the network as a representation of the joint probability distribution.
This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements.
This view is helpful for designing inference procedures.
EM Algorithm in Machine Learning

The EM algorithm is a latent variable method for finding the local maximum likelihood parameters of a statistical model; it was proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm is one of the most commonly used techniques in machine learning for obtaining maximum likelihood estimates when some of the relevant variables are observable and others are not. It is also applicable to unobserved data, sometimes called latent data. It has various real-world applications in statistics, including obtaining the mode of the posterior marginal distribution of parameters in machine learning and data mining applications.

In most real-life applications of machine learning, it is found that several relevant learning features are available, but only a few of them are observable and the rest are unobservable. If a variable is observable, its value can be predicted from the training instances. For variables that are latent, i.e., not directly observable, the Expectation-Maximization (EM) algorithm plays a vital role in estimating their values, provided that the general form of the probability distribution governing those latent variables is known to us. In this topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM algorithm, its applications, and the advantages and disadvantages of the EM algorithm.

What is an EM algorithm?

The Expectation-Maximization (EM) algorithm is defined as a combination of various unsupervised machine learning algorithms, and it is used to determine local maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for unobservable variables in statistical models. Further, it is a technique for finding maximum likelihood estimates when latent variables are present. It is also referred to as the latent variable model.

A latent variable model consists of both observable and unobservable variables, where observable variables can be predicted directly while unobserved variables are inferred from the observed variables. These unobservable variables are known as latent variables.

Key Points:
o It is known as the latent variable model to determine MLE and MAP parameters for
latent variables.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values occurs.

EM Algorithm

The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-means clustering algorithm. Being an iterative approach, it consists of two modes. In the first mode, we estimate the missing or latent variables; hence it is referred to as the Expectation/estimation step (E-step). The other mode is used to optimize the parameters of the model so that it can explain the data more clearly; this second mode is known as the Maximization step (M-step).

o Expectation step (E - step): It involves the estimation (guess) of all missing values
in the dataset so that after completing this step, there should not be any missing
value.
o Maximization step (M - step): This step involves the use of estimated data in the
E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data
to update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?

Convergence here has the intuitive probabilistic meaning: if two successive estimates (or two random variables) differ only by a very small amount in probability, they are said to have converged. In other words, whenever the values of the given variables essentially stop changing from one iteration to the next, we call it convergence.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps, which include the Initialization step, Expectation step, Maximization step, and Convergence step. These steps are explained as follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system
is provided with incomplete observed data with the assumption that data is
obtained from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data.
Further, E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete
data obtained from the 2nd step to update the parameter values. Further, M-step
primarily updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are converging or
not. If it gets "yes", then stop the process; else, repeat the process from step 2 until
the convergence occurs.
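
The following minimal Python sketch (illustrative, not from the manual) runs these four steps for a two-component one-dimensional Gaussian mixture, assuming equal mixing weights and a fixed, shared variance.

# Minimal EM sketch for a two-component 1-D Gaussian mixture (fixed variance).
import math
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + \
       [random.gauss(5.0, 1.0) for _ in range(100)]

mu1, mu2, sigma = -1.0, 1.0, 1.0        # Initialization step: initial parameter guesses

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for _ in range(50):                     # repeat E-step and M-step until convergence
    # E-step: responsibility of component 1 for each point (latent variable estimate)
    resp = []
    for x in data:
        p1, p2 = gaussian(x, mu1, sigma), gaussian(x, mu2, sigma)
        resp.append(p1 / (p1 + p2))
    # M-step: update the means using the responsibilities
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
    new_mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)
    converged = abs(new_mu1 - mu1) < 1e-6 and abs(new_mu2 - mu2) < 1e-6
    mu1, mu2 = new_mu1, new_mu2
    if converged:                       # Convergence step: estimates have stopped changing
        break

print(round(mu1, 2), round(mu2, 2))     # should be close to 0 and 5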

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model, or GMM, is defined as a mixture model that combines several Gaussian probability distribution functions whose parameters are initially unspecified. Accordingly, GMM requires estimated statistics such as the mean and standard deviation (the parameters). It is used to estimate the parameters of the probability distributions that best fit the density of a given training dataset. Although there are plenty of techniques available to estimate the parameters of a Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular among them.

Let's consider a case where we have a dataset with multiple data points generated by two different processes. Both processes produce data from similar Gaussian probability distributions, and the data are combined, so it is very difficult to discriminate which distribution a given point belongs to.

The processes used to generate the data points represent a latent variable, i.e., unobservable data. In such cases, the Expectation-Maximization algorithm is one of the best techniques to estimate the parameters of the Gaussian distributions. In the EM algorithm, the E-step estimates the expected value of each latent variable, whereas the M-step optimizes the parameters using Maximum Likelihood Estimation (MLE). This process is repeated until a good set of latent values and a maximum likelihood fit to the data are achieved.
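
In practice, a library implementation is usually used; the hedged sketch below uses scikit-learn's GaussianMixture, which runs EM internally, on synthetic data with two clusters.

# Fitting a GMM with scikit-learn (EM runs inside fit()).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, (100, 1)),
                    rng.normal(5.0, 1.0, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)
print(gmm.means_.ravel())        # estimated component means, close to 0 and 5
print(gmm.predict(X[:5]))        # most likely component for the first few points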

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable
model has a broad range of real-life applications in machine learning. These are as
follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the value of the parameter in mixed models such as
the Gaussian Mixture Model and quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.

Advantages of EM algorithm

o It is very easy to implement the two basic steps of the EM algorithm, the E-step and M-step, in various machine learning problems.
o It is mostly guaranteed that the likelihood will increase after each iteration.
o It often yields a closed-form solution for the M-step.

Disadvantages of EM algorithm

o The convergence of the EM algorithm is very slow.
o It can converge to a local optimum only.
o It takes both forward and backward probabilities into consideration, in contrast to numerical optimization, which considers only forward probabilities.
2. REGRESSION, BAYESIAN LEARNING AND SUPPORT VECTOR MACHINE

2.3 SUPPORT VECTOR MACHINE


WHY
a. To Understand various SVM kernels like Linear, Polynomial and Gaussian.
b. To Understand the importance and significance of the hyperplane.
c. To Understand the properties and issues in Support Vector Machines.

WHAT
a. Analyse the data of various problems using graphs and various Machine Learning algorithms.

WHERE
a. Configuring and analysing the data in multiple dimensions.
b. Evaluate models generated from data.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the diagram below:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

1. Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.

2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
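
The contrast between the two types can be seen with a short scikit-learn sketch (assuming scikit-learn is installed); make_moons produces data that a linear SVM cannot separate well but an RBF SVM can.

# Linear vs non-linear SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X_train, y_train)

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
print("Support vectors used by the RBF model:", rbf_svm.support_vectors_.shape[0])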
Major Kernel Functions in Support Vector Machine

What is Kernel Method?

A set of techniques known as kernel methods is used in machine learning to address classification, regression, and other prediction issues. They are built around the idea of kernels, which are functions that gauge how similar two data points are to one another in a high-dimensional feature space.

The fundamental premise of kernel methods is to convert the input data into a high-dimensional feature space, which makes it simpler to distinguish between classes or generate predictions. Kernel methods employ a kernel function to implicitly map the data into the feature space, as opposed to manually computing the feature space.

The most popular kind of kernel approach is the Support Vector Machine (SVM), a
binary classifier that determines the best hyperplane that most effectively divides
the two groups. In order to efficiently locate the ideal hyperplane, SVMs map the
input into a higher-dimensional space using a kernel function.

Other examples of kernel methods include kernel ridge regression, kernel PCA, and
Gaussian processes. Since they are strong, adaptable, and computationally efficient,
kernel approaches are frequently employed in machine learning. They are resilient
to noise and outliers and can handle sophisticated data structures like strings and
graphs.

Kernel Method in SVMs

Support Vector Machines (SVMs) use kernel methods to transform the input data
into a higher-dimensional feature space, which makes it simpler to distinguish
between classes or generate predictions. Kernel approaches in SVMs work on the
fundamental principle of implicitly mapping input data into a higher-dimensional
feature space without directly computing the coordinates of the data points in that
space.

The kernel function in SVMs is essential in determining the decision boundary that
divides the various classes. In order to calculate the degree of similarity between any
two points in the feature space, the kernel function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis function (RBF) kernel. The RBF kernel maps the input data into an infinite-dimensional feature space using a Gaussian function. This kernel function is popular because it can capture complex nonlinear relationships in the data.

Other types of kernel functions that can be used in SVMs include the polynomial
kernel, the sigmoid kernel, and the Laplacian kernel. The choice of kernel function
depends on the specific problem and the characteristics of the data.

Basically, kernel methods in SVMs are a powerful technique for solving classification
and regression problems, and they are widely used in machine learning because
they can handle complex data structures and are robust to noise and outliers.

Characteristics of Kernel Function

Kernel functions used in machine learning, including in SVMs (Support Vector Machines), have several important characteristics, including:

Mercer's condition: A kernel function must satisfy Mercer's condition to be valid. This condition ensures that the kernel function is positive semi-definite, which means that the kernel matrix it produces for any set of inputs is always greater than or equal to zero in the quadratic-form sense.

Positive definiteness: A kernel function is positive definite if it is strictly greater than zero except when the inputs are equal to each other.

Non-negativity: A kernel function is non-negative, meaning that it produces non-negative values for all inputs.

Symmetry: A kernel function is symmetric, meaning that it produces the same value regardless of the order in which the inputs are given, i.e., K(x, y) = K(y, x).

Reproducing property: A kernel function satisfies the reproducing property if it can be used to reconstruct the input data in the feature space.

Smoothness: A kernel function is said to be smooth if it produces a smooth transformation of the input data into the feature space.

Complexity: The complexity of a kernel function is an important consideration, as more complex kernel functions may lead to overfitting and reduced generalization performance.
Basically, the choice of kernel function depends on the specific problem and the
characteristics of the data, and selecting an appropriate kernel function can
significantly impact the performance of machine learning algorithms.

Major Kernel Function in Support Vector Machine

In Support Vector Machines (SVMs), there are several types of kernel functions that
can be used to map the input data into a higher-dimensional feature space. The
choice of kernel function depends on the specific problem and the characteristics of
the data.

Here are some most commonly used kernel functions in SVMs:

Linear Kernel

A linear kernel is a type of kernel function used in machine learning, including in SVMs (Support Vector Machines). It is the simplest and most commonly used kernel function, and it is defined as the dot product between the input vectors in the original feature space.

The linear kernel can be defined as:

K(x, y) = x · y

Where x and y are the input feature vectors. The dot product of the input vectors is
a measure of their similarity or distance in the original feature space.

When using a linear kernel in an SVM, the decision boundary is a linear hyperplane
that separates the different classes in the feature space. This linear boundary can be
useful when the data is already separable by a linear decision boundary or when
dealing with high-dimensional data, where the use of more complex kernel
functions may lead to overfitting.

Polynomial Kernel

A polynomial kernel is a particular kind of kernel function utilised in machine learning, such as in SVMs (Support Vector Machines). It is a nonlinear kernel function that employs polynomial functions to transfer the input data into a higher-dimensional feature space.

The polynomial kernel can be defined as:

K(x, y) = (x · y + c)^d

where x and y are the input feature vectors, c is a constant term, and d is the degree of the polynomial. The constant term is added to the dot product of the input vectors, and the result is raised to the degree of the polynomial.

The decision boundary of an SVM with a polynomial kernel is a nonlinear hyperplane, so it can capture more intricate correlations between the input characteristics. The degree of nonlinearity in the decision boundary is determined by the degree of the polynomial.

The polynomial kernel has the benefit of being able to detect both linear and
nonlinear correlations in the data. It can be difficult to select the proper degree of
the polynomial, though, as a larger degree can result in overfitting while a lower
degree cannot adequately represent the underlying relationships in the data.

In general, the polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear correlations
between the input characteristics.
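
A quick numerical check of this definition is shown below; it is illustrative only and compares the hand-computed value with scikit-learn's polynomial_kernel for the same parameters.

# Check K(x, y) = (x · y + c)^d against scikit-learn's polynomial_kernel.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[3.0, 0.5]])
c, d = 1.0, 3

manual = ((x @ y.T).item() + c) ** d                       # (x · y + c)^d
library = polynomial_kernel(x, y, degree=d, gamma=1.0, coef0=c)
print(manual, library.item())                              # both should be (3 + 1 + 1)^3 = 125.0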

Gaussian (RBF) Kernel

The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a
popular kernel function used in machine learning, particularly in SVMs (Support
Vector Machines). It is a nonlinear kernel function that maps the input data into a
higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||^2)

Where x and y are the input feature vectors, gamma is a parameter that controls the
width of the Gaussian function, and ||x - y||^2 is the squared Euclidean distance
between the input vectors.

When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture complex nonlinear relationships between the input features. The width of the Gaussian function, controlled by the gamma parameter, determines the degree of nonlinearity in the decision boundary.

One advantage of the Gaussian kernel is its ability to capture complex relationships in the data without the need for explicit feature engineering. However, the choice of the gamma parameter can be challenging, as a smaller value may result in underfitting, while a larger value may result in overfitting.
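
The RBF formula can be checked in the same illustrative way against scikit-learn's rbf_kernel:

# Check K(x, y) = exp(-gamma * ||x - y||^2) against scikit-learn's rbf_kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 4.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x - y) ** 2))             # exp(-0.5 * (1 + 4)) = exp(-2.5)
library = rbf_kernel(x, y, gamma=gamma)
print(round(float(manual), 6), round(library.item(), 6))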

Laplace Kernel

The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a
type of kernel function used in machine learning, including in SVMs (Support Vector
Machines). It is a non-parametric kernel that can be used to measure the similarity
or distance between two input feature vectors.

The Laplacian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||)

Where x and y are the input feature vectors, gamma is a parameter that controls the
width of the Laplacian function, and ||x - y|| is the L1 norm or Manhattan distance
between the input vectors.

When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture complex relationships between the input features. The width of the Laplacian function, controlled by the gamma parameter, determines the degree of nonlinearity in the decision boundary.

One advantage of the Laplacian kernel is its robustness to outliers, as it places less
weight on large distances between the input vectors than the Gaussian kernel.
However, like the Gaussian kernel, choosing the correct value of the gamma
parameter can be challenging.
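
Similarly, the Laplacian kernel can be checked against scikit-learn's laplacian_kernel; the sketch below is illustrative only.

# Check K(x, y) = exp(-gamma * ||x - y||_1) against scikit-learn's laplacian_kernel.
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 4.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum(np.abs(x - y)))            # exp(-0.5 * (1 + 2)) = exp(-1.5)
library = laplacian_kernel(x, y, gamma=gamma)
print(round(float(manual), 6), round(library.item(), 6))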

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the data points of the two classes.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image below:

Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called the hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly separable, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the image below:

To separate these data points, we need to add one more dimension. For linear data, we have used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will look like the image below:

Now, SVM will divide the datasets into classes in the following way. Consider the image below:
Since we are now in a 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:

Hence we get a circumference of radius 1 in the case of non-linear data.
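
The z = x² + y² trick can be demonstrated with a short illustrative sketch: after adding the third feature, a plain linear SVM separates data that was circular (non-linear) in the original 2-D space. The synthetic data and parameters below are assumptions for illustration.

# Explicit feature map z = x^2 + y^2 followed by a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[0.5 * np.cos(angles[:100]), 0.5 * np.sin(angles[:100])]   # class 0: small circle
outer = np.c_[1.5 * np.cos(angles[100:]), 1.5 * np.sin(angles[100:])]   # class 1: large circle
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)   # the extra dimension z = x^2 + y^2
X3 = np.hstack([X, z])

clf = LinearSVC().fit(X3, y)                       # a linear boundary now separates the classes
print("Training accuracy with the added z feature:", clf.score(X3, y))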

Properties of SVM

 Flexibility in choosing a similarity function
 Sparseness of the solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
 Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
 Overfitting can be controlled by the soft margin approach
 Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
 Feature selection

The Disadvantages of Support Vector Machine (SVM) are:

1. Unsuitable to Large Datasets

2. Large training time

3. More features, more complexities

4. Bad performance on high noise

5. Does not determine Local optima


1) Unsuitable to Large Datasets

Support Vector Machines create a margin of separation between the data points to be classified. Using large datasets has its drawbacks even if we use the kernel trick for classification. No matter how computationally efficient the calculation is, the method is suitable for small to medium-sized datasets, because the feature space can be very high dimensional, or even infinite dimensional, and the method becomes infeasible for large datasets. For large datasets, such kernels can still give us rich feature space representations with many fewer dimensions than data points, but SVM will not handle large datasets and many dimensions at the same time.

2) Large training time

Due to the high computational complexity noted above, even if the kernel trick is used, SVM classification can be tedious, since the calculations require a lot of processing time. This results in a long time to train on the dataset itself.

3) More features, more complexities

The more features are taken into consideration, the more dimensions come into play. If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.

4) Bad performance on high noise

SVM does not perform very well when the data set has a lot of noise. Noisy data contains many overlapping points, which makes it hard to draw a clear hyperplane without misclassifying.
Soft margin classification allows misclassification to a small extent, but as the noise increases, the number of overlapping data points grows and results in more misclassifications, which is not ideal.

5) Does not determine Local optima

The SVM optimization problem is convex, so if you use gradient descent (or any standard solver) you will always converge to the global minimum; SVM does not get stuck in local optima.
