Unit - 1+2
Theory Manual
Machine Learning Techniques
(BCAI-601)
Session: 2024-25 (EVEN Semester)
Programme Name: B. Tech
Semester: VI
Name of the Department: CSE (AI&ML)
Dr. Pawan
Associate Professor
MIET MEERUT
UNIT – 1
1. INTRODUCTION
1.1 Learning, Types of Learning, Well-defined Learning Problems, Designing a Learning System
WHY
a. To understand the basics of Machine Learning and the types of learning.
b. To understand the history of Machine Learning.
Machine Learning refers to the process by which algorithms improve their performance
on tasks over time, based on experience or data. This can involve supervised learning,
where the system learns from labeled data, or unsupervised learning, where it identifies
patterns in unlabeled data.
Improves Accuracy: Learning allows models to adapt and improve their predictions
or classifications based on new data.
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could come across. As is evident from the name, it gives the computer the ability that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
o The first step in designing a learning system in machine learning is to identify the
type of data that will be used. This can include structured data, such as numerical
and categorical data, as well as unstructured data, such as text and images. The type
of data will determine the type of machine learning algorithms that can be used and
the preprocessing steps required.
o Once the data has been identified, the next step is to determine the desired
outcome of the learning system. This can include classifying data, making
predictions, or identifying patterns in the data. The desired outcome will determine
the type of machine learning algorithm that should be used, as well as the
evaluation metrics that will be used to measure the performance of the learning
system.
o Next, the resources available for the learning system must be considered. This
includes the amount of data available, the computational power available, and the
amount of time available to train the model. These resources will determine the
complexity of the machine learning algorithm that can be used and the amount of
data that can be used for training.
o Once the data, desired outcome, and resources have been identified, it is time to
select a machine-learning algorithm and begin the training process. Decision trees,
SVMs, and neural networks are examples of common algorithms. It is crucial to
assess the effectiveness of the learning system using the right assessment measures,
such as recall, accuracy, and precision.
o After the learning system is trained, it is important to fine-tune the model by
adjusting the parameters and hyperparameters. This can be done using techniques
such as cross-validation and grid search. The final model should be tested on a
hold-out test set to evaluate its performance on unseen data.
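To make the tuning step above concrete, here is a minimal sketch, assuming scikit-learn is available; the toy dataset, the parameter grid, and the decision-tree model are illustrative choices, not prescribed by this text:

# Hedged sketch: hyperparameter tuning with grid search and 5-fold cross-validation,
# followed by evaluation on a hold-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate hyperparameter values (illustrative, not exhaustive)
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}

# Cross-validated grid search over the training data only
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Final check on unseen (hold-out) data
print("Best parameters:", search.best_params_)
print("Hold-out accuracy:", accuracy_score(y_test, search.predict(X_test)))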
When constructing a machine learning system, there are some other recommended practices to bear in mind in addition to these essential steps. A crucial factor to take into account is making sure that the training data are representative of the data that will be encountered in the real world. To do this, the data may be divided into training, validation, and test sets, as sketched below.
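A minimal sketch of such a split, assuming scikit-learn; the 60/20/20 proportions and the placeholder data are illustrative assumptions:

# Hedged sketch: dividing data into training, validation, and test sets (60/20/20).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = X.ravel() % 2                    # placeholder labels

# First carve out the test set, then split the remainder into training and validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200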
Following are the qualities that you need to keep in mind while designing a
learning system:
Reliability
The system must be capable of carrying out the proper task at the appropriate degree of
performance in a given setting. Testing the dependability of ML systems that learn from
data is challenging because a system's failure need not result in an error; instead, it could
simply produce garbage results, meaning that some results were produced even though
the system had not been trained with the corresponding ground truth.
When a typical system fails, you receive an error message, such as "The crew is addressing a technical issue and will return soon."
When a machine learning (ML) system fails, it usually does so silently and goes unnoticed. For instance, when translating from English to Hindi or vice versa, even if the model has not seen all of the words, it may still produce a translation that is illogical.
Scalability
There should be practical methods for coping with the system's expansion as it changes (in
terms of data amount, traffic volume, or complexity). Because certain essential applications
might lose millions of dollars or their credibility with just one hour of outage or failure,
there should be an automated provision to grow computing and storage capacity.
Maintainability
The performance of the model may fluctuate as a result of changes in data distribution
over time. In the ML system, there should be a provision to first determine whether there is any model drift or data drift, and once major drift is noticed, to retrain/refresh and deploy new ML models without interfering with the ML system's present functioning.
Adaptability
The availability of fresh data with increased features or changes in business objectives,
such as conversion rate vs. customer engagement time for e-commerce, are the other
changes that occur most frequently in machine learning (ML) systems. As a result, the
system has to be adaptable to fast upgrades without causing any service disruptions.
Data
1. Feature expectations are recorded in a schema: the valid ranges of feature values are carefully captured to catch any unanticipated value, which could otherwise produce a garbage answer. For example, human age and height have expected value ranges and cannot be implausibly large, such as an age of 150+ years or a height of 10 feet.
2. All features are advantageous; features introduced to the system should be valuable
in some way, such as being a predictor or an identifier, as each feature has a
handling cost.
3. No feature should cost more than it is worth; each new feature should be evaluated
in terms of cost vs. benefits in order to eliminate those that would be difficult to
implement or manage.
4. The data pipeline has the necessary privacy protections in place; for instance,
personally identifiable information (PII) should be managed carefully because any
leaking of sensitive information may have legal repercussions.
5. The influence of any new external component on the system should be understood, so that new features can be introduced to boost system performance without surprises.
6. All input feature code, including one-hot encoding/binning of features and the handling of unseen levels in one-hot encoded features, must be tested to prevent any intermediate value from departing from the desired range.
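As a small illustration of point 6, the sketch below, assuming scikit-learn, shows one common way of handling a level that was never seen during training; the colour values are made up:

# Hedged sketch: one-hot encoding that tolerates unseen levels at prediction time.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colours = np.array([["red"], ["green"], ["blue"]])
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colours)

# "purple" was never seen during training; it encodes to an all-zero row
# instead of raising an error or producing an out-of-range value.
print(encoder.transform(np.array([["red"], ["purple"]])).toarray())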
Model
1. Model specifications are evaluated and submitted; for quicker re-training, correct
versioning of the model learning code is required.
2. Correlation between offline and online metrics: model metrics (log loss, MAPE, MSE) should be strongly associated with the application's goal, such as revenue, cost, or time.
3. Hyperparameters like learning rates, the number of layers, the size of the layers, the
maximum depth, and regularisation coefficients must be modified for the use case
because the selection of hyperparameter values can significantly affect the accuracy
of predictions.
4. To keep the model in production up to date, it is important to understand how frequently to retrain models depending on changes in data distribution; the impact of model staleness should be known.
5. Simple linear models with high-level characteristics are a good starting point for
functional testing and doing cost-benefit analyses when compared to more
complex models. However, a simpler model is not always better.
6. Model performance must be assessed using adequately representative data to
ensure that model quality is satisfactory on significant data slices.
7. The model is tested for inclusion: model features should be thoroughly examined against their predictive importance, since in some applications specific features may skew outcomes in favour of particular categories, which raises fairness concerns.
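As a small illustration of the offline metrics named in point 2, the sketch below, assuming scikit-learn, computes log loss for a classifier and MSE for a regressor; the true values and predictions are made-up numbers:

# Hedged sketch: computing two common offline model metrics.
from sklearn.metrics import log_loss, mean_squared_error

# Classification: true labels and predicted class probabilities (illustrative numbers)
y_true_cls = [0, 1, 1, 0]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
print("log loss:", log_loss(y_true_cls, y_prob))

# Regression: true values and predictions (illustrative numbers)
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.8, 5.4, 2.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))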
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data. Data is
fed to these algorithms to train them, and on the basis of training, they build the model &
perform a specific task.
Based on the methods and the way of learning, machine learning is divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
Supervised machine learning is based on supervision: the machine is trained on a labelled dataset, and on the basis of that training it predicts the output for new data.
Let's understand supervised learning with an example. Suppose we have an input dataset
of cats and dog images. So, first, we will provide the training to the machine to understand
the images, such as the shape & size of the tail of cat and dog, Shape of eyes, colour,
height (dogs are taller, cats are smaller), etc. After completion of training, we input the
picture of a cat and ask the machine to identify the object and predict the output. Now,
the machine is well trained, so it will check all the features of the object, such as height,
shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat
category. This is the process of how the machine identifies the objects in Supervised
Learning.
The main goal of the supervised learning technique is to map the input variable(x)
with the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is a continuous numerical value and there is a relationship between the input and output variables. These are used to predict continuous output variables, such as market trends, weather forecasts, etc.
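A compact sketch contrasting the two problem types, assuming scikit-learn; the bundled toy datasets and the two models are illustrative choices:

# Hedged sketch: a classifier for a categorical target and a regressor for a continuous target.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a category (iris species)
Xc, yc = load_iris(return_X_y=True)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: predict a continuous value (a disease-progression score)
Xr, yr = load_diabetes(return_X_y=True)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))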
Advantages:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
1. These algorithms require a labelled dataset, which can be expensive and time-consuming to obtain.
2. They may predict the wrong output if the test data is very different from the training data.
Applications of Supervised Learning
1. Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
2. Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past labelled data with labels for disease
conditions. With such a process, the machine can identify a disease for the new
patients.
3. Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.
4. Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent
to the spam folder.
5. Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. It means that in unsupervised machine learning, the machine is trained using an unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
So, now the machine will discover its patterns and differences, such as colour difference,
shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data.
It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of other
groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.
2) Association
Association rule learning is an unsupervised learning technique that finds interesting relationships among variables in a large dataset, such as items that are frequently bought together. Some popular algorithms of association rule learning are the Apriori algorithm, Eclat, and the FP-growth algorithm. A small sketch of the underlying idea is given below.
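The sketch below illustrates the two core quantities behind Apriori-style algorithms, support and confidence, in plain Python; the market baskets and the candidate rule are made up for illustration:

# Hedged sketch: support and confidence for a candidate rule {bread} -> {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support({bread, butter}) =", support({"bread", "butter"}))          # 0.5
print("confidence(bread -> butter) =", confidence({"bread"}, {"butter"}))  # ~0.67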
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with an unlabelled dataset that does not map to a known output.
3. Semi-Supervised Learning
The main aim of semi-supervised learning is to effectively use all the available data, rather
than only labelled data like in supervised learning. Initially, similar data is clustered along
with an unsupervised learning algorithm, and further, it helps to label the unlabeled data
into labelled data. It is because labelled data is a comparatively more expensive acquisition
than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student
is under the supervision of an instructor at home and college. Further, if that student is
self-analysing the same concept without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student has to revise himself
after analyzing the same concept under the guidance of an instructor at college.
Advantages:
o It greatly reduces the amount of expensive labelled data that is required, since most of the dataset can remain unlabeled.
o It is simple to understand and can give higher accuracy than a purely unsupervised approach.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an agent explores its environment by performing actions and learning from the results of those actions. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.
The reinforcement learning process is similar to a human being; for example, a child learns
various things by experiences in his day-to-day life. An example of reinforcement learning
is to play a game, where the Game is the environment, moves of an agent at each step
define states, and the goal of the agent is to get a high score. Agent receives feedback in
terms of punishment and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems. A minimal sketch of the core learning loop is given below, followed by some application areas.
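The sketch below shows the reward-driven learning loop with tabular Q-learning on a made-up five-state corridor; the environment, reward values, and hyperparameters are illustrative assumptions, not taken from this text:

# Hedged sketch: tabular Q-learning on a tiny corridor of 5 states.
# Actions: 0 = move left, 1 = move right. Reaching state 4 gives a reward of +1.
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

for episode in range(500):
    state = 0
    while state != 4:                             # an episode ends at the goal state
        # epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0  # reward only at the goal
        # Q-learning update: move the estimate towards reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print("Learned preference for moving right in each state:",
      [round(Q[s][1] - Q[s][0], 2) for s in range(n_states)])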
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Some well-known systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how RL can be used to automatically learn to allocate and schedule computer resources for waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial
and manufacturing area, and these robots are made more powerful with
reinforcement learning. There are different industries that have their vision of
building intelligent robots using AI and Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with
the help of Reinforcement Learning by Salesforce company.
Advantages:
o Reinforcement learning can learn from interaction alone, without labelled data, and it is well suited to sequential decision-making problems such as games and robotics.
History of Machine Learning
About 40 to 50 years ago, machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below are some milestones in the history of machine learning:
1834: In 1834, Charles Babbage, the father of the computer, conceived a device that
could be programmed with punch cards. However, the machine was never built, but all
modern computers rely on its logical structure.
1936: In 1936, Alan Turing published his theory of how a machine can read, determine, and execute a set of instructions.
1943: In 1943, a neural network was first modeled with an electrical circuit. Around 1950, scientists started applying this idea and analysing how human neurons might work.
1945: In 1945, ENIAC, the first electronic general-purpose computer, was completed. After that, stored-program computers such as EDSAC in 1949 and EDVAC in 1951 were built.
1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can machines think?"
Machine intelligence in Games:
1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play the game of checkers. It performed better the more it played.
1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
The period from 1974 to 1980 was a tough time for AI and ML researchers, and this duration is known as the AI winter.
During this period, machine translation failed and people lost interest in AI, which led to reduced government funding for research.
1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.
1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in one
week.
1997: IBM's Deep Blue computer won a chess match against the chess grandmaster Garry Kasparov, becoming the first computer to beat a reigning world chess champion.
2006:
Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable computing resources that made it easier to create and deploy machine learning models.
2007:
The Netflix Prize competition began, tasking participants with improving the accuracy of Netflix's recommendation algorithm.
2008:
Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.
2009:
Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling huge datasets.
2010:
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving progress in computer vision and prompting the development of deep convolutional neural networks (CNNs).
2011:
IBM's Watson won the quiz show Jeopardy! against former champions, demonstrating large-scale natural language question answering.
2012:
AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, dramatically improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.
Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network to recognize cats from unlabeled YouTube videos.
2013:
DeepMind Technologies, a startup focused on deep learning and artificial intelligence, demonstrated deep reinforcement learning agents that learned to play Atari games; Google acquired the company the following year.
2014:
Ian Goodfellow and colleagues introduced Generative Adversarial Networks (GANs), and Google completed its acquisition of DeepMind Technologies.
2015:
Google open-sourced the TensorFlow machine learning framework, and deep residual networks (ResNets) set new benchmarks for image recognition on ImageNet.
2016:
DeepMind's AlphaGo defeated the Go world champion Lee Sedol, a landmark for reinforcement learning. The goal of explainable AI, which focuses on making machine learning models easier to understand, also began to receive attention.
2017:
Google researchers introduced the Transformer architecture in the paper "Attention Is All You Need", and AlphaGo Zero learned Go from self-play alone, without human game data.
These are only some of the notable advancements and milestones in AI and machine learning during this period. The field continued to advance rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.
The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates supervised and unsupervised learning, including clustering, classification, decision tree, and SVM algorithms, as well as reinforcement learning.
Present-day machine learning models can be used to make many kinds of predictions, including weather prediction, disease prediction, stock market analysis, and so on.
Prerequisites
Before learning machine learning, you should have basic knowledge of the following so that you can easily understand its concepts:
o Fundamental knowledge of probability and linear algebra.
o The ability to code in some programming language, especially Python.
o Knowledge of calculus, especially derivatives of single-variable and multivariate functions.
Difference between Supervised and Unsupervised Learning
o Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output, whereas an unsupervised learning model does not take any feedback.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from the unknown dataset.
o Supervised learning needs supervision to train the model, whereas unsupervised learning does not need any supervision.
o Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs, whereas unsupervised learning can be used for cases where we have only input data and no corresponding output data.
o Supervised learning is not close to true artificial intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true artificial intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a
computational network based on biological neural networks that construct the
structure of the human brain. Similar to a human brain has neurons interconnected to
each other, artificial neural networks also have neurons that are linked to each other in
various layers of the networks. These neurons are known as nodes.
The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are
known as nodes.
Dendrites from a biological neural network represent inputs in an artificial neural network, the cell nucleus represents the nodes, synapses represent the weights, and the axon represents the output.
Biological Neural Network -> Artificial Neural Network
Dendrites -> Inputs
Cell nucleus -> Nodes
Synapse -> Weights
Axon -> Output
The human brain contains roughly 86 billion neurons, and each neuron is connected to somewhere between 1,000 and 100,000 others. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from our memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors.
We can understand the artificial neural network with the example of a digital logic gate that takes an input and gives an output. Consider an "OR" gate, which takes two inputs: if one or both inputs are "On", the output is "On"; if both inputs are "Off", the output is "Off". Here the output depends only on the input. Our brain does not work the same way: the relationship between outputs and inputs keeps changing, because the neurons in our brain are "learning".
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer lies between the input and output layers. It performs all the calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs, and adds a bias: Y = Σ (weight × input) + bias. This computation is represented in the form of a transfer function.
Artificial Neural Network can be best represented as a weighted directed graph, where
the artificial neurons form the nodes. The association between the neurons outputs and
neuron inputs can be viewed as the directed edges with weights. The Artificial Neural
Network receives the input signal from the external source in the form of a pattern and
image in the form of a vector. These inputs are then mathematically assigned by the
notations x(n) for every n number of inputs.
Afterward, each of the inputs is multiplied by its corresponding weight (these weights are the details utilized by the artificial neural network to solve a specific problem). In general terms, these weights represent the strength of the interconnections between neurons inside the artificial neural network. All the weighted inputs are then summed inside the computing unit.
If the weighted sum is zero, a bias is added to make the output non-zero, or to otherwise scale up the system's response; the bias can be viewed as an extra input fixed at 1 with its own weight. The total of the weighted inputs can lie anywhere in the range from 0 to positive infinity, so to keep the response within the limits of the desired value, a maximum benchmark value is set and the total weighted input is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, but they are primarily either linear or non-linear sets of functions. Some of the commonly used activation functions are the binary, linear, and tan hyperbolic sigmoidal activation functions. Let us take a look at each of them in detail:
Binary:
In the binary activation function, the output is either a one or a zero. To accomplish this, a threshold value is set up: if the net weighted input of the neuron exceeds the threshold, the activation function returns one; otherwise it returns zero.
Sigmoidal Hyperbolic:
The sigmoidal hyperbolic function is generally seen as an "S"-shaped curve. Here the tan hyperbolic function is used to approximate the output from the actual net input. The function is defined as:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
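The sketch below puts the pieces together in plain Python: a weighted sum of inputs plus a bias, passed through each of the activation functions named above; the inputs, weights, bias, and threshold are illustrative values:

# Hedged sketch: one artificial neuron with three activation-function choices.
import math

def weighted_sum(inputs, weights, bias):
    """Net input of the neuron: sum(w_i * x_i) + bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def binary_step(net, threshold=0.0):
    """Binary activation: 1 if the net input exceeds the threshold, else 0."""
    return 1 if net > threshold else 0

def linear(net):
    """Linear activation: passes the net input through unchanged."""
    return net

def tanh(net):
    """Tan hyperbolic (sigmoidal) activation, an S-shaped curve between -1 and 1."""
    return (math.exp(net) - math.exp(-net)) / (math.exp(net) + math.exp(-net))

x = [0.5, -1.0, 0.25]   # illustrative inputs
w = [0.4, 0.3, 0.9]     # illustrative weights
b = 0.1                 # bias

net = weighted_sum(x, w, b)
print("net input:", net)
print("binary   :", binary_step(net))
print("linear   :", linear(net))
print("tanh     :", tanh(net))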
There are various types of artificial neural networks (ANNs) which, like the neurons and networks of the human brain, perform their tasks in a similar way. Most artificial neural networks bear some resemblance to their more complex biological counterparts and are very effective at their intended tasks, for example segmentation or classification.
1. Feedback ANN:
In this type of ANN, the output is fed back into the network to achieve the best-evolved results internally. Feedback networks feed information back into themselves and are well suited to solving optimization problems. Internal system error corrections utilize feedback ANNs.
2. Feed-Forward ANN:
In a feed-forward ANN, information flows in only one direction, from the input layer through any hidden layers to the output layer, with no feedback loops. It is the most common type of artificial neural network and is used for tasks such as classification and pattern recognition.
Clustering
Clustering, or cluster analysis, is an unsupervised machine learning technique that groups an unlabelled dataset so that objects with the most similarities end up in the same group.
Example: Let's understand the clustering technique with the real-world example of
Mall: When we visit any shopping mall, we can observe that the things with similar
usage are grouped together. Such as the t-shirts are grouped in one section, and
trousers are at other sections, similarly, at vegetable sections, apples, bananas,
Mangoes, etc., are grouped in separate sections, so that we can easily find out the
things. The clustering technique also works in the same way. Other examples of
clustering are grouping documents according to the topic.
The clustering technique can be widely used in various tasks. Some most common uses
of this technique are:
1. Market Segmentation
2. Statistical data analysis
3. Social network analysis
4. Image segmentation
5. Anomaly detection, etc.
The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group). There are also various other approaches to clustering. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centres are chosen so that the distance between the data points of one cluster and its centroid is minimal compared with the distance to other cluster centroids. A short sketch is given below.
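A short K-Means sketch, assuming scikit-learn; the synthetic blob data and the choice of k = 3 are illustrative:

# Hedged sketch: partitioning (centroid-based) clustering with K-Means, k = 3.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("cluster centres:\n", kmeans.cluster_centers_)
print("first ten labels:", labels[:10])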
2. Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and
the arbitrarily shaped distributions are formed as long as the dense region can be
connected. This algorithm does it by identifying different clusters in the dataset and
connects the areas of high densities into clusters. The dense areas in data space are
divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done
by assuming some distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM).
4. Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as
there is no requirement of pre-specifying the number of clusters to be created. In this
technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. The observations or any number of clusters can be selected
by cutting the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
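A minimal sketch of agglomerative hierarchical clustering, assuming SciPy; the points are made up, and cutting the resulting tree (dendrogram) at different levels yields different numbers of clusters:

# Hedged sketch: building a dendrogram bottom-up and cutting it into clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [25, 25]])  # illustrative points

Z = linkage(X, method="ward")                      # the merge tree (dendrogram structure)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into at most 3 clusters

print(labels)   # points close to each other share a cluster label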
5. Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients, which express its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. Many clustering algorithms have been published, but only a few are commonly used. The choice of algorithm depends on the kind of data we are using: some algorithms need the number of clusters in the given dataset to be guessed, whereas others need to find the minimum distance between observations of the dataset.
Some popular clustering algorithms that are widely used in machine learning are the K-Means algorithm, the mean-shift algorithm, DBSCAN, Expectation-Maximization clustering using Gaussian Mixture Models, and the agglomerative hierarchical clustering algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:
1. In Identification of Cancer Cells: The clustering algorithms are widely used for
the identification of cancerous cells. It divides the cancerous and non-cancerous
data sets into different groups.
2. In Search Engines: Search engines also work on the clustering technique. The
search result appears based on the closest object to the search query. It does it
by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
3. Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
4. In Biology: It is used in the biology stream to classify different species of plants
and animals using the image recognition technique.
5. In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome.
In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given
dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified any further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision tree
starts with the root node (Salary attribute by ASM). The root node splits further into the
next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
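A short sketch of this process, assuming scikit-learn and its CART-based decision tree; the iris dataset and the depth limit are illustrative choices:

# Hedged sketch: fitting a CART-style decision tree and printing its learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each internal node tests one attribute; each leaf gives the predicted class.
print(export_text(tree, feature_names=load_iris().feature_names))
print("Prediction for one sample:", tree.predict(X[:1]))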
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve this problem there is a technique called the Attribute Selection Measure, or ASM. Using this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
1. Information Gain
2. Gini Index
1. Information Gain:
Information gain is the measurement of the change in entropy after a dataset is segmented on the basis of an attribute; it tells us how much information a feature provides about the class. A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula:
Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]
where Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), S is the set of samples, P(yes) is the probability of "yes", and P(no) is the probability of "no".
2. Gini Index:
The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should be preferred over one with a high Gini index. CART creates only binary splits, and it uses the Gini index to create them. The Gini index can be calculated as:
Gini Index = 1 - Σj (Pj)², where Pj is the probability of class j in the node.
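A small pure-Python sketch of the two measures for a binary-class split; the class counts of the parent node and its children are made-up numbers:

# Hedged sketch: entropy, information gain, and Gini index for a binary-class split.
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); 0 for a pure node."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def gini(pos, neg):
    """Gini index = 1 - sum of squared class probabilities."""
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

# Parent node: 9 "yes" and 5 "no"; a candidate attribute splits it into two children.
parent = (9, 5)
children = [(6, 2), (3, 3)]   # illustrative child-node class counts

weighted = sum((p + n) / sum(parent) * entropy(p, n) for p, n in children)
info_gain = entropy(*parent) - weighted

print("Entropy(parent)  :", round(entropy(*parent), 3))
print("Information gain :", round(info_gain, 3))
print("Gini(parent)     :", round(gini(*parent), 3))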
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size of
the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques used:
1. Cost Complexity Pruning
2. Reduced Error Pruning.
Bayesian Belief Network
A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as: "a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o A directed acyclic graph
o A table of conditional probabilities
The generalized form of Bayesian network that represents and solve decision
problems under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node represents a random variable and each arc represents the direct influence of one node on another. A directed link means that one node directly influences the other; if there is no directed link, the nodes are independent of each other.
Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is
known as a directed acyclic graph or DAG.
The Bayesian network has mainly two components:
1. Causal component
2. Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which quantifies the effect of the parents on that node.
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution. Using the chain rule of probability, the joint probability can be written as:
P[x1, x2, x3, ..., xn] = P[x1 | x2, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi in a Bayesian network we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Problem:
Calculate the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and that both David and Sophia have called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on the alarm.
The network thus represents our assumptions: the callers do not directly perceive the burglary, do not notice a minor earthquake, and do not confer with each other before calling.
The conditional distribution for each node is given as a conditional probability table, or CPT.
Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.
The five variables in the network are:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the event of the problem statement in the form of a probability, P[D, S, A, ¬B, ¬E], and rewrite this probability statement using the joint probability distribution as:
P[D, S, A, ¬B, ¬E] = P[D | A] · P[S | A] · P[A | ¬B, ¬E] · P[¬B] · P[¬E]
Let's take the observed probabilities for the Burglary and Earthquake components, using the values commonly quoted for this textbook example (they are consistent with the final result computed below): P(B = True) = 0.002 and P(E = True) = 0.001, so P(¬B) = 0.998 and P(¬E) = 0.999, and the probability that the alarm goes off when there is neither a burglary nor an earthquake is P(A | ¬B, ¬E) = 0.001.
The conditional probability that David calls depends on the state of the alarm: P(D = True | A = True) = 0.91.
The conditional probability that Sophia calls likewise depends on its parent node "Alarm": P(S = True | A = True) = 0.75.
From the formula of the joint distribution, we can write the problem statement in the form of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B, ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
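The same calculation can be reproduced in a few lines of Python; the probability values are the ones stated above for this textbook example (an assumption, since the original tables are not reproduced here):

# Hedged sketch: P(S, D, A, ~B, ~E) for the burglary-alarm network,
# using the probability values stated in the example above.
P_B = 0.002                  # P(Burglary = True)
P_E = 0.001                  # P(Earthquake = True)
P_A_given_notB_notE = 0.001  # P(Alarm | no burglary, no earthquake)
P_D_given_A = 0.91           # P(David calls | Alarm)
P_S_given_A = 0.75           # P(Sophia calls | Alarm)

joint = (P_S_given_A * P_D_given_A * P_A_given_notB_notE
         * (1 - P_B) * (1 - P_E))
print(round(joint, 8))       # ~0.00068045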
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To understand the network as a representation of the joint probability distribution. This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements. This view is helpful for designing inference procedures.
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we
want a model that can accurately identify whether it is a cat or dog, so such a model
can be created by using the SVM algorithm. We will first train our model with lots of
images of cats and dogs so that it can learn about different features of cats and
dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
1. Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-Linear SVM: Non-Linear SVM is used for non-linearly separable data; if a dataset cannot be classified by using a single straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect
the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector.
Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point
of the lines from both the classes. These points are called support vectors. The
distance between the vectors and the hyperplane is called as margin. And the goal
of SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z, calculated as:
z = x² + y²
By adding the third dimension, the sample space will look like the image below:
Now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, the boundary becomes a circle of radius 1 around the inner class.
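A brief sketch contrasting a linear SVM with a non-linear (RBF-kernel) SVM, assuming scikit-learn; the concentric-circle data, which no single straight line can separate, is an illustrative choice:

# Hedged sketch: linear vs. non-linear SVM on circularly arranged data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf").fit(X_tr, y_tr)   # the kernel adds the extra "dimension" implicitly

print("linear kernel accuracy:", linear_svm.score(X_te, y_te))   # poor: not linearly separable
print("RBF kernel accuracy   :", rbf_svm.score(X_te, y_te))      # near perfect
print("support vectors per class:", rbf_svm.n_support_)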
Genetic Algorithm
Genetic algorithms are widely used in different real-world applications, for example designing electronic circuits, code-breaking, image processing, and artificial creativity.
After calculating the fitness of every individual in the population, a selection process is used to determine which of the individuals in the population will get to reproduce and create the offspring that will form the next generation.
So, now we can define a genetic algorithm as a heuristic search algorithm to solve
optimization problems. It is a subset of evolutionary algorithms, which is used in
computing. A genetic algorithm uses genetic and natural selection concepts to solve
optimization problems.
1. Initialization
The process of a genetic algorithm starts by generating the set of individuals, which
is called the population. Here each individual is a candidate solution for the given problem. An individual is characterized by a set of parameters called genes; genes are combined into a string to form a chromosome, which represents a solution to the problem. One of the most popular techniques for initialization is the use of random binary strings.
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, that is, its ability to compete with other individuals. In every iteration, individuals are evaluated using the fitness function, which assigns a fitness score to each individual. This score determines the probability of being selected for reproduction: the higher the fitness score, the greater the chance of being selected for reproduction.
3. Selection
The selection phase involves selecting individuals for the reproduction of offspring. All the selected individuals are then arranged in pairs of two to enhance reproduction, and these individuals transfer their genes to the next generation.
There are three types of Selection methods available, which are:
1. Roulette wheel selection
2. Tournament selection
3. Rank-based selection
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step.
In this step, the genetic algorithm uses two variation operators that are applied to
the parent population. The two operators involved in the reproduction phase are
given below:
Crossover: The crossover plays a most significant role in the reproduction
phase of the genetic algorithm. In this process, a crossover point is selected
at random within the genes. Then the crossover operator swaps genetic
information of two parents from the current generation to produce a new
individual representing the offspring.
The genes of the parents are exchanged among themselves until the crossover point is reached. The newly generated offspring are added to the population. This process is also called recombination. The types of crossover available include:
One-point crossover
Two-point crossover
Uniform crossover
Mutation
The mutation operator inserts random genes in the offspring (new child) to
maintain the diversity in the population. It can be done by flipping some bits in the
chromosomes.
Mutation helps in solving the problem of premature convergence and enhances diversification. A minimal end-to-end sketch of the whole genetic algorithm loop is given below.
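Here is that end-to-end sketch in plain Python, applied to the illustrative problem of maximizing the number of 1-bits in a 20-bit string; the problem, the tournament selection method, and all parameter values are assumptions for demonstration only:

# Hedged sketch: a genetic algorithm that maximizes the number of 1s in a 20-bit string.
import random

LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 60, 0.02

def fitness(chromosome):
    """Fitness = number of 1-bits; higher is fitter."""
    return sum(chromosome)

def tournament_selection(population, k=3):
    """Selection: pick the fittest of k randomly chosen individuals."""
    return max(random.sample(population, k), key=fitness)

def crossover(parent1, parent2):
    """One-point crossover: combine the head of one parent with the tail of the other."""
    point = random.randrange(1, LENGTH)
    return parent1[:point] + parent2[point:]

def mutate(chromosome):
    """Mutation: flip each bit with a small probability to keep diversity."""
    return [1 - gene if random.random() < MUTATION_RATE else gene for gene in chromosome]

# 1. Initialization: a population of random binary strings
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # 2-4. Fitness assignment, selection, and reproduction (crossover + mutation)
    population = [mutate(crossover(tournament_selection(population),
                                   tournament_selection(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "chromosome:", "".join(map(str, best)))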
Limitations of Genetic Algorithms:
Genetic algorithms are not efficient for solving simple problems.
They do not guarantee the quality of the final solution to a problem.
Repetitive calculation of fitness values may generate some computational
challenges.
Difference between Genetic Algorithms and Traditional Algorithms:
A search space is the set of all possible solutions to the problem. A traditional algorithm maintains only one set of solutions, whereas a genetic algorithm works with several sets of candidate solutions in the search space.
Traditional algorithms need more information in order to perform a search,
whereas genetic algorithms need only one objective function to calculate the
fitness of an individual.
Traditional algorithms cannot easily work in parallel, whereas genetic algorithms can work in parallel (calculating the fitness of the individuals is independent).
One big difference is that, rather than operating directly on candidate solutions, genetic algorithms operate on their representations (or encodings), frequently referred to as chromosomes.
Traditional Algorithms can only generate one result in the end, whereas
Genetic Algorithms can generate multiple optimal results from different
generations.
A traditional algorithm is not guaranteed to produce an optimal result; genetic algorithms do not guarantee a globally optimal result either, but there is a good chance of obtaining a near-optimal result for a problem because they use genetic operators such as crossover and mutation.
Traditional algorithms are deterministic in nature, whereas Genetic
algorithms are probabilistic and stochastic in nature.
Issues in Machine Learning
"Machine Learning" is one of the most popular technology among all data scientists and
machine learning enthusiasts. It is the most effective Artificial Intelligence technology that
helps create automated learning systems to take future decisions without being constantly
programmed. It can be considered an algorithm that automatically constructs various
computer software using past experience and training data. It can be seen in every
industry, such as healthcare, education, finance, automobile, marketing, shipping,
infrastructure, automation, etc. Almost all big companies like Amazon, Facebook, Google,
Adobe, etc., are using various machine learning techniques to grow their businesses. But
everything in this world has bright as well as dark sides. Similarly, Machine Learning offers
great opportunities, but some issues need to be solved.
This article will discuss some major practical issues and their business implementation, and
how we can overcome them. So let's start with a quick introduction to Machine Learning.
Although machine learning is being used in every industry and helps organizations make
more informed and data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some common issues in Machine Learning that professionals face while building ML skills and creating applications from scratch.
1. Inadequate training data
The major issue that arises while using machine learning algorithms is the lack of quality as well as quantity of data. Although data plays a vital role in the processing of machine learning algorithms, many data scientists argue that inadequate, noisy, and unclean data is extremely taxing for machine learning algorithms. For example, a simple task may require thousands of sample data points, and an advanced task such as speech or image recognition may need millions of examples. Further, data quality is also important for the algorithms to work ideally, but poor data quality is often found in machine learning applications. Data quality can be affected by factors such as the following:
o Noisy Data- It is responsible for an inaccurate prediction that affects the decision as
well as accuracy in classification tasks.
o Incorrect data- It is also responsible for faulty programming and results obtained in
machine learning models. Hence, incorrect data may affect the accuracy of the
results also.
o Generalizing of output data- Sometimes, it is also found that generalizing output
data becomes complex, which results in comparatively poor future actions.
2. Poor quality of data
As we have discussed above, data plays a significant role in machine learning, and it must
be of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data
lead to less accuracy in classification and low-quality results. Hence, data quality can also
be considered as a major common problem while processing machine learning algorithms.
3. Non-representative training data
To make sure our trained model generalizes well, we have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover all the cases that have already occurred as well as those that are occurring.
Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; this is called a non-representative training set, it won't be accurate in its predictions, and it will be biased towards one class or group. Hence, we should use representative data in training to protect against bias and to make accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by Machine Learning engineers and
data scientists. Whenever a machine learning model is trained with a huge amount of data,
it starts capturing noise and inaccurate data into the training data set. It negatively affects
the performance of the model. Let's understand with a simple example where we have a
few training data sets such as 1000 mangoes, 1000 apples, 1000 bananas, and 5000
papayas. Then there is a considerable probability of identification of an apple as papaya
because we have a massive amount of biased data in the training data set; hence
prediction is negatively affected. A common cause of overfitting is the use of highly flexible non-linear methods in machine learning algorithms, as they can build unrealistic data models. We can reduce overfitting by using simpler linear or parametric algorithms, by regularization, or by gathering more representative training data.
Underfitting:
Underfitting occurs when our model is too simple to understand the base structure of the
data, just like an undersized pair of trousers. This generally happens when we have limited data in the data set and we try to build a linear model with non-linear data. In such scenarios, the model is too simple, its rules are too crude to capture the data set, and it starts making wrong predictions as well.
Generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance of the model are compulsory. Different results for different actions require changes in the data, so the code and the resources used for monitoring must also be updated.
A machine learning model operates within a specific context; when that context changes, the model can produce bad recommendations due to concept drift. For example, at a particular time a customer may be looking for some gadgets, but the customer's requirements change over time while the machine learning model keeps showing the same recommendations, even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. We can overcome it by regularly monitoring the data and updating the model according to current expectations.
Although Machine Learning and Artificial Intelligence are continuously growing in the market, these industries are still younger than others. The absence of skilled human resources is another issue: we need people with in-depth knowledge of mathematics, science, and technology to develop and manage machine learning solutions.
Process complexity of Machine Learning
The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are still relatively new technologies, largely in an experimental phase and continuously changing over time. Much of the work is trial and error, so the probability of error is higher than expected. Further, the process involves analysing the data, removing data bias, training the model, and applying complex mathematical calculations, which makes the procedure complicated and quite tedious.
Data bias is another big challenge in Machine Learning. Such errors occur when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. We can address this by identifying where the data is actually biased and then taking the necessary steps to reduce it.
This issue is also very common in machine learning models. Machine learning models can be highly accurate, but producing results is time-consuming: slow programs, excessive requirements, and overloaded data make the models take longer than expected to deliver accurate results. This calls for continuous maintenance and monitoring of the model.
Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is considered good when the training data has a good set of features and few to no irrelevant features.
2.1 REGRESSION
WHY a. To understand the linear regression and logistic regression approaches for separating data categorically and numerically.
WHAT a. Practice with the data of various problems and analyse the need for learning algorithms.
b. Implement various machine learning methods on various problems.
WHERE a. Used to analyse and evaluate the data in machine learning problems.
Linear Regression:
It is used for predicting a continuous dependent variable with the help of independent variables.
The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable.
If a single independent variable is used for prediction, it is called Simple Linear Regression, and if more than one independent variable is used, it is called Multiple Linear Regression.
By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables, and this relationship should be linear in nature.
The output of linear regression is a continuous value such as price, age, or salary. The relationship between the dependent variable and the independent variable can be shown in the image below:
In the image above, the dependent variable (salary) is on the Y-axis and the independent variable (experience) is on the X-axis. The regression line can be written as:
y = a0 + a1x + ε
where a0 is the intercept, a1 is the slope (regression coefficient), and ε is the random error term.
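Below is a minimal illustrative sketch (not part of the original manual; the salary figures are assumed) showing how the best-fit line y = a0 + a1x can be obtained with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed toy data: years of experience vs. salary (illustrative values only)
experience = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
salary = np.array([35000, 42000, 50000, 58000, 66000])

model = LinearRegression()
model.fit(experience, salary)          # learns a0 (intercept) and a1 (slope)

print("Intercept a0:", model.intercept_)
print("Slope a1:", model.coef_[0])
print("Predicted salary for 6 years:", model.predict([[6.0]])[0])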
Logistic Regression:
Logistic regression is one of the most popular machine learning algorithms that come under supervised learning techniques.
It can be used for classification as well as regression problems, but it is mainly used for classification problems.
Logistic regression is used to predict a categorical dependent variable with the help of independent variables.
The output of a logistic regression problem lies only between 0 and 1.
Logistic regression can be used where the probability of belonging to one of two classes is required, such as whether it will rain today or not, either 0 or 1, true or false, etc.
Logistic regression is based on the concept of Maximum Likelihood Estimation. According to this estimation, the parameters are chosen so that the observed data is most probable.
In logistic regression, we pass the weighted sum of inputs through an activation function that maps values between 0 and 1. This activation function is known as the sigmoid function, σ(z) = 1 / (1 + e^(-z)), and the curve obtained is called the sigmoid curve or S-curve. Consider the image below:
Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
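As a brief illustrative sketch (the rain-style toy data is an assumption), the snippet below fits a logistic regression model and shows that its output is a probability between 0 and 1, which is then thresholded into a class label:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed toy data: a single feature vs. whether it rained (0/1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(class 0) and P(class 1), both between 0 and 1
print(clf.predict_proba([[2.5]]))
print(clf.predict([[2.5]]))           # thresholded class label (0 or 1)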
Bayes theorem is also known by other names such as Bayes' rule or Bayes' law. Bayes theorem helps to determine the probability of an event using uncertain or incomplete knowledge.
It is used to calculate the probability of one event occurring given that another event has already occurred. It is the standard method for relating conditional probability and marginal probability. In simple words, we can say that Bayes theorem helps to produce more accurate results.
Bayes theorem is used to estimate the precision of values and provides a method for calculating conditional probability. Although it is apparently a simple calculation, it is used to easily compute the conditional probability of events where intuition often fails.
Some data scientists assume that Bayes theorem is mostly used in the financial industries, but that is not the case. Besides finance, Bayes theorem is also extensively applied in health and medicine, research and surveys, the aeronautical sector, etc.
What is Bayes Theorem?
Bayes theorem is one of the most popular machine learning concepts. It helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X with a known event Y:
o According to the product rule, we can express the probability of event X occurring together with a known event Y as:
P(X ∩ Y) = P(X|Y) P(Y)      {equation 1}
o Similarly, the probability of event Y occurring together with a known event X is:
P(X ∩ Y) = P(Y|X) P(X)      {equation 2}
Equating equation 1 and equation 2 and dividing both sides by P(Y) gives Bayes theorem:
P(X|Y) = P(Y|X) P(X) / P(Y)
1. Experiment
An experiment is defined as a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
The results we get during an experiment are called possible outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its outcome, then the sample space will be:
S2 = {Head, Tail}
3. Event
An event is defined as a subset of the sample space of an experiment; it can also be described as a set of outcomes.
Assume that in our experiment of rolling a die there are two events A and B such that:
o Disjoint Event: If the intersection of events A and B is the empty (null) set, then the events are known as disjoint events, or mutually exclusive events.
4. Random Variable:
A random variable is a real-valued function that maps the sample space of an experiment onto the real line. It takes on random values, each with some probability. Strictly speaking, it is neither random nor a variable; it behaves as a function, which can be discrete, continuous, or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events of which at least one must occur at a time is called an exhaustive set of events of an experiment.
Thus, two events A and B are said to be exhaustive if either A or B definitely occurs, and they are mutually exclusive if they cannot both occur; for example, when tossing a coin the outcome will be either a Head or a Tail.
6. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other.
7. Conditional Probability:
Conditional probability is the probability of an event A occurring given that another event B has already occurred. It is written as P(A|B) = P(A ∩ B) / P(B).
8. Marginal Probability:
Marginal probability is the probability of an event occurring irrespective of the outcomes of the other random variables; it is an unconditional probability.
Naïve Bayes classifier is one of the simplest applications of Bayes theorem; it is used in classification algorithms to separate data by class quickly and accurately.
Let's understand the use of Bayes theorem in machine learning with the example below.
Suppose we are given a feature vector A and a set of candidate classes Ci, and our classifier, built with machine learning, has to predict the best possible class for A. With the help of Bayes theorem, we can write this as:
P(Ci|A) = P(A|Ci) P(Ci) / P(A)
Here,
P(A) remains constant across all classes, i.e. it does not change its value from class to class. Therefore, to maximize P(Ci|A) we only have to maximize the term P(A|Ci) * P(Ci).
With n classes on the probability list, let's assume that every class is equally likely to be the right answer. Under this assumption we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ….. = P(Cn)
This assumption reduces the computation cost as well as the time. This is how Bayes theorem plays a significant role in machine learning, and the Naïve Bayes classifier simplifies the conditional probability computations without greatly affecting the precision. Hence, we can conclude that, by using Bayes theorem in machine learning, we can easily estimate the probabilities of such events.
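The following is a small illustrative sketch (the toy feature values are assumptions, not taken from the manual) of the same idea using scikit-learn's Gaussian Naïve Bayes, which picks the class Ci that maximizes P(A|Ci) * P(Ci):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Assumed toy data: two numeric features and two classes (0 and 1)
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],   # class 0 examples
              [3.0, 3.9], [3.2, 4.1], [2.9, 4.0]])  # class 1 examples
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)                      # estimates P(Ci) and the per-class feature likelihoods

print(clf.predict([[1.1, 2.0]]))   # most probable class for a new point
print(clf.predict_proba([[1.1, 2.0]]))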
The problem of inducing general functions from specific training examples is central to
learning.
“A task of acquiring potential hypothesis (solution) that best fits the given training
examples.”
Consider the example task of learning the target concept “days on which my friend
Prabhas enjoys his favorite water sport.”
Below Table describes a set of example days, each represented by a set of attributes. The
attribute EnjoySport indicates whether or not Prabhas enjoys his favorite water sport on
this day. The task is to learn to predict the value of EnjoySport for an arbitrary day, based
on the values of its other attributes.
In particular, let each hypothesis be a vector of six constraints, specifying the values of the
six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
To illustrate, the hypothesis that Prabhas enjoys his favorite sport only on cold days with high humidity (independent of the values of the other attributes) is represented by the expression
(?, Cold, High, ?, ?, ?)
and the most specific possible hypothesis, that no day is a positive example, is represented by
(ø, ø, ø, ø, ø, ø)
Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
The goal of this search is to find the hypothesis that best fits the training examples.
Instance Space
Consider, for example, the instances X and hypotheses H in the EnjoySport learning task.
Given that the attribute Sky has three possible values, and that AirTemp, Humidity,
Wind, Water, and Forecast each have two possible values, the instance space X contains
exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Example:
Let’s assume there are two features F1 and F2 with F1 has A and B as possibilities and F2 as
X and Y as possibilities.
F1 – > A, B
F2 – > X, Y
Instance Space: (A, X), (A, Y), (B, X), (B, Y) – 4 Examples
Hypothesis Space (syntactically distinct): (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y), (B, ø), (B, ?), (ø, X), (ø, Y), (ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16
Hypothesis Space (semantically distinct): (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), plus the single empty hypothesis (ø, ø) – 10
(Figures: diagrams of the instance space and the hypothesis space for this example.)
Notice, however, that every hypothesis containing one or more “ø” symbols represents the
empty set of instances; that is, it classifies every instance as negative.
Our EnjoySport example is a very simple learning task, with a relatively small, finite
hypothesis space.
General-to-Specific Ordering of Hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.
For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) =
1.
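As a small illustrative sketch of this representation (the helper function and the instance values are assumptions for demonstration), the following code checks whether an instance satisfies a hypothesis, using '?' for "any value" and None in place of the ø constraint:

# '?' means any value is acceptable; None stands for the 'ø' constraint (no value acceptable)
def satisfies(instance, hypothesis):
    """Return 1 if the instance meets every attribute constraint of the hypothesis, else 0."""
    for value, constraint in zip(instance, hypothesis):
        if constraint is None:                      # 'ø': rejects every value
            return 0
        if constraint != '?' and constraint != value:
            return 0
    return 1

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
x  = ('Sunny', 'Warm', 'High', 'Weak', 'Warm', 'Same')   # assumed example day

print(satisfies(x, h1))   # 0: the Wind constraint 'Strong' is not met
print(satisfies(x, h2))   # 1: h2 is more general, so it also accepts this instance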
The Bayes theorem is a method for calculating a hypothesis's probability based on its prior probability, the probability of observing specific data given the hypothesis, and the observed data itself.
A Naïve Bayes classifier performs better than many other models when the assumption of independent predictors holds true.
It requires only a small amount of training data to estimate the quantities needed for classification, which keeps the training time short.
Bayesian belief networks are a key computer technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as a probabilistic graphical model that represents a set of variables and their conditional dependencies using a directed acyclic graph.
Bayesian networks are probabilistic because they are built from a probability distribution, and they also use probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationships between multiple events we need a Bayesian network. It can be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o a directed acyclic graph, and
o a table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs, or directed arrows, represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph, and a link indicates that one node directly influences the other; if there is no directed link between two nodes, they are independent of each other.
Note: A Bayesian network graph does not contain any cycle. Hence, it is known as a directed acyclic graph, or DAG.
A Bayesian network has mainly two components:
1. Causal Component
2. Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution:
If we have variables x1, x2, x3, ....., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
Using the chain rule, P[x1, x2, x3, ....., xn] can be written in terms of conditional probabilities as:
P[x1, x2, x3, ....., xn] = P[x1 | x2, x3, ....., xn] P[x2 | x3, ....., xn] .... P[xn-1 | xn] P[xn]
In general, for each variable Xi in a Bayesian network we can write the equation as:
P(Xi | Xi-1, ......, X1) = P(Xi | Parents(Xi))
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to a burglary, but it also responds to minor earthquakes. Harry has two neighbours, David and Sophia, who have taken the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls at that time too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses the alarm. Here we would like to compute the probability of the burglar alarm going off.
Problem:
Calculate the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.
The network thus represents the assumptions that David and Sophia do not directly perceive the burglary, do not notice a minor earthquake, and do not confer with each other before calling.
The conditional distribution for each node is given as a conditional probability table, or CPT.
Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive set of cases for the variable.
The events in this network are:
Burglary (B)
Earthquake (E)
Alarm (A)
David calls (D)
Sophia calls (S)
We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and we can rewrite this using the joint probability distribution as:
P[D, S, A, B, E] = P[D | A] P[S | A] P[A | B, E] P[B] P[E]
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The conditional probability of David calling depends on the state of the Alarm.
The conditional probability of Sophia calling also depends on its parent node, Alarm.
(CPT for Sophia's call, with columns A, P(S = True), and P(S = False).)
From the formula of the joint distribution, we can write the problem statement as a probability expression:
P(D, S, A, ¬B, ¬E) = P(D | A) P(S | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
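As a hedged, illustrative sketch (the CPT values below are assumptions chosen for demonstration; only P(E = False) = 0.999 appears in the text above), the following snippet evaluates the joint-distribution product P(D|A)·P(S|A)·P(A|¬B,¬E)·P(¬B)·P(¬E) directly in Python:

# Assumed illustrative CPT values; replace with the actual CPTs of the network
p_not_burglary   = 0.998   # P(B = False), assumed
p_not_earthquake = 0.999   # P(E = False), as stated in the text
p_alarm_given_none   = 0.001  # P(A = True | B = False, E = False), assumed
p_david_given_alarm  = 0.91   # P(D = True | A = True), assumed
p_sophia_given_alarm = 0.75   # P(S = True | A = True), assumed

# Joint probability of: alarm sounded, no burglary, no earthquake, both neighbours called
p_query = (p_david_given_alarm * p_sophia_given_alarm *
           p_alarm_given_none * p_not_burglary * p_not_earthquake)
print(round(p_query, 8))   # with these assumed numbers, approximately 0.00068045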
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To understand the network as a representation of the joint probability distribution. This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements. This view is helpful for designing inference procedures.
EM Algorithm in Machine Learning
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of model parameters when the data set involves missing or unobserved (latent) variables.
Key Points:
o It is used with latent variable models to determine maximum likelihood (MLE) and maximum a posteriori (MAP) estimates of the parameters.
o It is used to predict values of parameters in instances where data is missing or unobservable, and this is repeated until the values converge.
EM Algorithm
o Expectation step (E - step): It involves the estimation (guess) of all missing values
in the dataset so that after completing this step, there should not be any missing
value.
o Maximization step (M - step): This step involves the use of estimated data in the
E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data
to update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?
Convergence means that repeating the E-step and M-step no longer changes the estimated values, or changes them by only a negligibly small amount; at that point the algorithm has reached a (possibly local) maximum of the likelihood and can stop.
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. Further, the system
is provided with incomplete observed data with the assumption that data is
obtained from a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data.
Further, E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete
data obtained from the 2nd step to update the parameter values. Further, M-step
primarily updates the hypothesis.
o 4th Step: The last step is to check whether the values of the latent variables are converging. If yes, stop the process; otherwise, repeat from the 2nd step until convergence occurs.
The Gaussian Mixture Model, or GMM, is a mixture model that represents data as a combination of several Gaussian probability distributions whose parameters are not specified in advance. GMM therefore requires estimated statistics such as the mean and standard deviation of each component. It is used to estimate the parameters of the probability distributions that best fit the density of a given training dataset. Although there are plenty of techniques available to estimate the parameters of a Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular among them.
Let's understand a case where we have a dataset with multiple data points generated by two different processes. Both processes produce data from Gaussian probability distributions, and the data is combined; hence it is very difficult to tell which distribution a given point belongs to.
The process that generated each data point is a latent variable, i.e. unobservable data. In such cases, the Expectation-Maximization algorithm is one of the best techniques for estimating the parameters of the Gaussian distributions. In the EM algorithm, the E-step estimates the expected value of each latent variable, whereas the M-step optimizes the parameters using Maximum Likelihood Estimation (MLE). This process is repeated until a good set of latent values and a maximum likelihood fit of the data are achieved.
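As a minimal sketch (added for illustration; the synthetic two-process data is an assumption), scikit-learn's GaussianMixture fits exactly this kind of model with the EM algorithm, alternating E- and M-steps internally until convergence:

import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed synthetic data: points generated by two different Gaussian processes
rng = np.random.RandomState(42)
process_1 = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
process_2 = rng.normal(loc=5.0, scale=1.5, size=(200, 1))
X = np.vstack([process_1, process_2])

gmm = GaussianMixture(n_components=2, random_state=42)  # EM runs inside fit()
gmm.fit(X)

print("Estimated means:", gmm.means_.ravel())
print("Estimated variances:", gmm.covariances_.ravel())
print("Which component generated 4.8?", gmm.predict([[4.8]])[0])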
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable
model has a broad range of real-life applications in machine learning. These are as
follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the value of the parameter in mixed models such as
the Gaussian Mixture Model and quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.
Advantages of EM algorithm
o It is very easy to implement the first two basic steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.
o It is mostly guaranteed that the likelihood will increase after each iteration.
o It often yields a closed-form solution for the M-step.
Disadvantages of EM algorithm
o Its convergence can be very slow.
o It is guaranteed to converge only to a local optimum, not necessarily the global one.
o It is sensitive to the initial values chosen for the parameters.
Support Vector Machine (SVM)
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it on this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each class; on the basis of these support vectors, it will classify the creature as a cat. Consider the diagram below:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-Linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by a single straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier (both variants are illustrated in the sketch after this list).
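The following sketch (illustrative only; the toy data set is an assumption) trains scikit-learn's SVC with a linear kernel and with an RBF kernel, corresponding to the linear and non-linear SVM variants described above:

import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two classes in a 2-D feature space
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear").fit(X, y)   # linear decision boundary (hyperplane)
rbf_svm    = SVC(kernel="rbf").fit(X, y)      # non-linear boundary via the RBF kernel

print("Support vectors (linear):", linear_svm.support_vectors_)
print("Linear prediction for [4, 4]:", linear_svm.predict([[4, 4]])[0])
print("RBF prediction for [4, 4]:", rbf_svm.predict([[4, 4]])[0])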
The fundamental premise of kernel methods is to convert the input data into a high-dimensional feature space in which it becomes simpler to distinguish between classes or to generate predictions. Kernel methods employ a kernel function to map the data into that feature space implicitly, as opposed to computing the feature-space coordinates manually.
The most popular kind of kernel approach is the Support Vector Machine (SVM), a
binary classifier that determines the best hyperplane that most effectively divides
the two groups. In order to efficiently locate the ideal hyperplane, SVMs map the
input into a higher-dimensional space using a kernel function.
Other examples of kernel methods include kernel ridge regression, kernel PCA, and
Gaussian processes. Since they are strong, adaptable, and computationally efficient,
kernel approaches are frequently employed in machine learning. They are resilient
to noise and outliers and can handle sophisticated data structures like strings and
graphs.
Support Vector Machines (SVMs) use kernel methods to transform the input data
into a higher-dimensional feature space, which makes it simpler to distinguish
between classes or generate predictions. Kernel approaches in SVMs work on the
fundamental principle of implicitly mapping input data into a higher-dimensional
feature space without directly computing the coordinates of the data points in that
space.
The kernel function in SVMs is essential in determining the decision boundary that
divides the various classes. In order to calculate the degree of similarity between any
two points in the feature space, the kernel function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis
function (RBF) kernel. The RBF kernel maps the input data into an infinite-
dimensional feature space using a Gaussian function. This kernel function is popular
because it can capture complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial
kernel, the sigmoid kernel, and the Laplacian kernel. The choice of kernel function
depends on the specific problem and the characteristics of the data.
Basically, kernel methods in SVMs are a powerful technique for solving classification
and regression problems, and they are widely used in machine learning because
they can handle complex data structures and are robust to noise and outliers.
Symmetry: A kernel function is symmetric, meaning that it produces the same value
regardless of the order in which the inputs are given.
In Support Vector Machines (SVMs), there are several types of kernel functions that
can be used to map the input data into a higher-dimensional feature space. The
choice of kernel function depends on the specific problem and the characteristics of
the data.
Linear Kernel
K(x, y) = x · y
Where x and y are the input feature vectors. The dot product of the input vectors is
a measure of their similarity or distance in the original feature space.
When using a linear kernel in an SVM, the decision boundary is a linear hyperplane
that separates the different classes in the feature space. This linear boundary can be
useful when the data is already separable by a linear decision boundary or when
dealing with high-dimensional data, where the use of more complex kernel
functions may lead to overfitting.
Polynomial Kernel
K(x, y) = (x · y + c)^d, where x and y are the input feature vectors, c is a constant term, and d is the degree of the polynomial. The constant term is added to the dot product of the input vectors, and the result is raised to the power of the degree d.
The decision boundary of an SVM with a polynomial kernel might capture more
intricate correlations between the input characteristics because it is a nonlinear
hyperplane.
The polynomial kernel has the benefit of being able to detect both linear and nonlinear correlations in the data. It can be difficult to select the proper degree of the polynomial, though, as a larger degree can result in overfitting while a lower degree may not adequately represent the underlying relationships in the data.
In general, the polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear correlations
between the input characteristics.
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a
popular kernel function used in machine learning, particularly in SVMs (Support
Vector Machines). It is a nonlinear kernel function that maps the input data into a
higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-gamma · ||x - y||^2)
Where x and y are the input feature vectors, gamma is a parameter that controls the
width of the Gaussian function, and ||x - y||^2 is the squared Euclidean distance
between the input vectors.
One advantage of the Gaussian kernel is its ability to capture complex relationships in the data without the need for explicit feature engineering. However, choosing the gamma parameter can be challenging, as a smaller value may result in underfitting, while a larger value may result in overfitting.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a
type of kernel function used in machine learning, including in SVMs (Support Vector
Machines). It is a non-parametric kernel that can be used to measure the similarity
or distance between two input feature vectors.
It can be defined as K(x, y) = exp(-gamma · ||x - y||_1), where x and y are the input feature vectors, gamma is a parameter that controls the width of the Laplacian function, and ||x - y||_1 is the L1 norm, or Manhattan distance, between the input vectors.
One advantage of the Laplacian kernel is its robustness to outliers, as it places less
weight on large distances between the input vectors than the Gaussian kernel.
However, like the Gaussian kernel, choosing the correct value of the gamma
parameter can be challenging.
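As a brief illustrative sketch (added here, with example vectors and parameter values chosen purely as assumptions), the snippet below evaluates the four kernel functions discussed above for a pair of feature vectors using NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # assumed example feature vectors
y = np.array([2.0, 1.0, 4.0])
gamma, c, d = 0.5, 1.0, 2       # assumed kernel parameters

linear     = np.dot(x, y)                                   # K(x, y) = x . y
polynomial = (np.dot(x, y) + c) ** d                        # K(x, y) = (x . y + c)^d
rbf        = np.exp(-gamma * np.sum((x - y) ** 2))          # Gaussian / RBF kernel
laplacian  = np.exp(-gamma * np.sum(np.abs(x - y)))         # Laplace kernel (L1 norm)

print(linear, polynomial, rbf, laplacian)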
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Since the data lies in a 2-D space, we can easily separate the two classes with just a straight line. But there can be multiple lines that separate these classes. Consider the image below:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of each class closest to the boundary; these points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space becomes as shown in the image below:
Now, SVM will divide the datasets into classes in the following way. Consider the image below:
Since we are in a 3-D space, the separating surface looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
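As a small illustrative sketch (the circularly arranged points are an assumption), the snippet below applies the z = x² + y² transformation by hand and shows that two classes that no straight line can separate in 2-D become separable by a simple threshold on z:

import numpy as np

# Assumed non-linear data: an inner class (near the origin) and an outer ring
inner = np.array([[0.5, 0.2], [-0.4, 0.3], [0.1, -0.5]])      # class 0
outer = np.array([[2.0, 0.1], [-1.8, 1.0], [0.3, -2.2]])      # class 1

def add_third_dimension(points):
    """Append z = x^2 + y^2 as a new feature to each 2-D point."""
    z = (points ** 2).sum(axis=1, keepdims=True)
    return np.hstack([points, z])

print(add_third_dimension(inner))   # z values are small for the inner class
print(add_third_dimension(outer))   # z values are large, so a plane z = constant separates the classes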
Properties of SVM