ML R20 Material
Introduction:
ARTIFICIAL INTELLIGENCE:
Artificial intelligence (AI) is the ability of a computer or a robot controlled by a
computer to do tasks that are usually done by humans because they require human
intelligence and discernment. Although there are no AIs that can perform the wide
variety of tasks an ordinary human can do, some AIs can match humans in specific
tasks.
MACHINE LEARNING:
Machine learning is the concept that a computer program can learn and adapt to
new data without human intervention. Machine learning is a field of artificial
intelligence (AI) that keeps a computer’s built-in algorithms current as the
underlying data changes, for example with changes in the worldwide economy.
Machine learning can be applied in a variety of areas, such as in investing,
advertising, lending, organizing news, fraud detection, and more.
DEEP LEARNING:
Deep learning is a subset of machine learning (ML),
which is itself a subset of artificial intelligence (AI).
The concept of AI has been around since the 1950s,
with the goal of making computers able to think and
reason in a way similar to humans. As part of making
machines able to think, ML is focused on how to make
them learn without being explicitly programmed.
Deep learning goes beyond ML by creating more
complex hierarchical models that are meant to mimic
how humans learn new information.
TYPES OF MACHINE LEARNING:
Machine learning is a subset of AI, which enables the machine to automatically
learn from data, improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data.
Data is fed to these algorithms to train them, and on the basis of training, they build
the model and perform a specific task.
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machine using a "labelled" dataset,
and based on that training, the machine predicts the output. Here, labelled data
means that the inputs are already mapped to the correct outputs. More precisely,
we first train the machine with inputs and their corresponding outputs, and then
we ask the machine to predict the output for the test dataset.
Supervised machine learning can be classified into two types of problems, which
are given below:
o Classification
o Regression
2. Unsupervised Machine Learning
Unsupervised learning is different from the Supervised learning technique; as its
name suggests, there is no need for supervision. It means, in unsupervised machine
learning, the machine is trained using the unlabeled dataset, and the machine
predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither
classified nor labelled, and the model acts on that data without any supervision.
Unsupervised Learning can be further classified into two types, which are given
below:
o Clustering
o Association
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate
ground between Supervised (With Labelled training data) and Unsupervised
learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and
unsupervised learning, the data it operates on is mostly unlabeled and contains
only a few labels. Labels are costly to obtain, so in practice an organization may
have only a small number of them. Supervised and unsupervised learning, by
contrast, are defined by the presence and absence of labels respectively.
Semi-supervised learning was introduced to overcome the drawbacks of supervised
and unsupervised learning algorithms. Its main aim is to use all the available data
effectively, rather than only the labelled data as in supervised learning. Initially,
similar data points are clustered with an unsupervised learning algorithm, and the
clusters are then used to assign labels to the unlabeled data. This is useful because
labelled data is comparatively more expensive to acquire than unlabeled data.
We can picture these algorithms with an example. Supervised learning is where a
student is under the supervision of an instructor at home and at college. If that
student analyses the same concept alone, without any help from the instructor,
it comes under unsupervised learning. Under semi-supervised learning, the student
revises the concept alone after first analysing it under the guidance of an
instructor at college.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
o The algorithm is simple and easy to understand.
o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning
algorithms.
Disadvantages:
o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent
(a software component) automatically explores its surroundings by trial and error:
taking actions, learning from experience, and improving its performance. The agent
is rewarded for each good action and punished for each bad action; hence
the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data as in supervised learning, and
agents learn from their experiences only.
The reinforcement learning process is similar to how a human being learns; for
example, a child learns various things through experience in day-to-day life. An
example of reinforcement learning is playing a game, where the game is the
environment, the moves of the agent at each step define states, and the goal of the
agent is to get a high score. The agent receives feedback in terms of punishments
and rewards.
Due to its way of working, reinforcement learning is employed in different fields
such as game theory, operations research, information theory, and multi-agent
systems.
A reinforcement learning problem can be formalized using a Markov Decision
Process (MDP). In an MDP, the agent constantly interacts with the environment and
performs actions; after each action, the environment responds and generates a new
state.
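The agent-environment loop described above can be sketched with tabular Q-learning. This is only an illustration under stated assumptions: the five-cell corridor, the reward of +1 at the last cell, the function names, and the learning parameters are all hypothetical choices, and Q-learning is just one of several reinforcement learning algorithms.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0 and receives a reward of +1 only when it reaches cell 4.
N_STATES = 5
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right


def step(state, action):
    """Environment response: return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)  # explore, or break ties randomly
            else:
                a = 0 if q[state][0] > q[state][1] else 1  # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate towards reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy policy for the non-terminal cells 0..3.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it only sees rewards, and the learned table ends up preferring the action that leads towards the rewarded cell.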
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning increases
the tendency that the required behaviour will occur again by adding
something. It strengthens the behaviour of the agent and positively
impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works in the
opposite way to positive RL. It increases the tendency that a specific behaviour
will occur again by removing or avoiding a negative condition.
Advantages:
o It helps in solving complex real-world problems that are difficult to
solve with general techniques.
o The learning model of RL is similar to the learning of human beings; hence
it can produce highly accurate results.
o It helps in achieving long-term results.
Disadvantages:
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can
weaken the results.
Even if our model is “AWESOME”, if we feed it garbage data, the result (output)
will also be garbage. Our training data must contain mostly relevant features and
few to no irrelevant ones.
Much of the credit for a successful machine learning project goes to coming up with
a good set of features on which to train (often referred to as feature
engineering), which includes feature selection, feature extraction, and creating
new features.
4. Nonrepresentative training data:
To make sure that our model generalizes well, our training data must be
representative of the new cases that we want to generalize to.
If we train our model on a nonrepresentative training set, its predictions won’t be
accurate; it will be biased towards or against one class or group.
For example, let us say you are trying to build a model that recognizes the genre of
music. One way to build your training set is to search on YouTube and use the
resulting data. Here we assume that YouTube’s search engine provides
representative data, but in reality the search will be biased towards popular artists,
and maybe even artists that are popular in your location (if you live in India, you
will get the music of Arijit Singh, Sonu Nigam, etc.).
So use representative data during training, so that your model won’t be biased
towards one or two classes when it works on testing data.
5. Overfitting and Underfitting :
Let’s start with an example. Say one day you are walking down a street to buy
something and a dog comes out of nowhere. You offer him something to eat, but
instead of eating he starts barking and chasing you; somehow you get away safely.
After this particular incident, you might think that no dog is worth treating nicely.
This kind of overgeneralization is what we humans do most of the time, and
unfortunately a machine learning model does the same if we do not pay attention.
In machine learning, we call this overfitting, i.e., the model performs well on
training data but fails to generalize well.
Overfitting happens when our model is too complex relative to the amount and
noisiness of the training data.
Things we can do to overcome this problem:
1. Simplify the model by selecting one with fewer parameters.
2. Reduce the number of attributes in the training data.
3. Constrain the model.
4. Gather more training data.
STATISTICAL LEARNING:
INTRODUCTION:
An Introduction to Statistical Learning provides a broad and less technical
treatment of key topics in statistical learning. Each chapter includes an R lab. This
book is appropriate for anyone who wishes to use contemporary tools for data
analysis.
SUPERVISED LEARNING:
As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machine using a "labelled" dataset,
and based on that training, the machine predicts the output. Here, labelled data
means that the inputs are already mapped to the correct outputs. More precisely,
we first train the machine with inputs and their corresponding outputs, and then
we ask the machine to predict the output for the test dataset.
Let's understand supervised learning with an example. Suppose we have an input
dataset of cat and dog images. First, we train the machine to understand the
images using features such as the shape and size of the tail, the shape of the eyes,
colour, and height (dogs are taller, cats are smaller). After training is complete,
we input the picture of a cat and ask the machine to identify the object and predict
the output. The machine is now well trained, so it will check all the features of the
object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat.
So, it will put it in the Cat category. This is how the machine identifies objects in
supervised learning.
The main goal of the supervised learning technique is to map the input variable (x)
to the output variable (y). Some real-world applications of supervised learning
are risk assessment, fraud detection, spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which
are given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the
output variable is categorical, such as Yes or No, Male or Female, Red or Blue, etc.
The classification algorithms predict the categories present in the dataset. Some
real-world examples of classification algorithms are spam detection, email
filtering, etc.
Some popular classification algorithms are given below:
o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
b) Regression
Regression algorithms are used to solve regression problems, in which the output
variable is continuous and the model captures the relationship between input and
output variables (a linear relationship in the simplest case). These are used to
predict continuous output variables, such as market trends, weather prediction, etc.
Some popular Regression algorithms are given below:
o Simple Linear Regression Algorithm
o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
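As an illustration of the first algorithm in the list, simple linear regression can be sketched with the ordinary least-squares formulas for slope and intercept. The function name and the hours/scores data below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: hours studied (input) and exam score (output).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

slope, intercept = fit_simple_linear_regression(hours, scores)
predicted = slope * 6.0 + intercept  # prediction for an unseen input
print(slope, intercept, predicted)
```

The model is trained on labelled pairs (input, output) and then used to predict a continuous value for an input it has not seen.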
Advantages and Disadvantages of Supervised Learning
Advantages:
o Since supervised learning works with a labelled dataset, we can have an
exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
o These algorithms are not able to solve complex tasks.
o It may predict the wrong output if the test data is different from the training
data.
o It requires lots of computational time to train the algorithm.
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
o Image Segmentation - Supervised learning algorithms are used in image
segmentation. In this process, image classification is performed on different
image data with pre-defined labels.
o Medical Diagnosis - Supervised algorithms are also used in the medical field
for diagnosis purposes. It is done by using medical images and past data
labelled with disease conditions. With such a process, the machine can
identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using
historic data to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are
used. These algorithms classify an email as spam or not spam. The spam
emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various
identifications can be done using the same, such as voice-activated
passwords, voice commands, etc.
UNSUPERVISED LEARNING:
Unsupervised learning is different from the supervised learning technique; as its
name suggests, there is no need for supervision. It means that, in unsupervised
machine learning, the machine is trained using an unlabeled dataset, and the
machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither
classified nor labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the
unsorted dataset according to similarities, patterns, and differences. Machines
are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of
fruit images, and we input it into the machine learning model. The images are
totally unknown to the model, and the task of the machine is to find the patterns
and categories of the objects.
The machine will discover patterns and differences, such as differences in colour
and shape, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given
below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the
most similarities remain in one group and have fewer or no similarities with the
objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
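o Principal Component Analysis (strictly a dimensionality-reduction technique,
often used together with clustering)
o Independent Component Analysis (likewise a decomposition technique rather
than a clustering algorithm)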
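To make the idea of grouping similar objects concrete, here is a minimal sketch of K-Means clustering (the first algorithm in the list). It assumes 2-D points, squared Euclidean distance, and the first k points as starting centroids; the customer data and the function name are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Plain K-Means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignment


# Hypothetical customers: (visits per month, items per visit).
customers = [(1, 2), (9, 8), (2, 1), (8, 9), (1, 1), (9, 9)]
centroids, assignment = k_means(customers, k=2)
print(assignment)
```

No labels are given to the algorithm; the two groups emerge only from the distances between the points.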
2) Association
Association rule learning is an unsupervised learning technique, which finds
interesting relations among variables within a large dataset. The main aim of this
learning algorithm is to find the dependency of one data item on another data item
and map those variables accordingly so that it can generate maximum profit. This
algorithm is mainly applied in Market Basket analysis, Web usage mining,
continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
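The dependency of one item on another is usually quantified with support and confidence, the two measures that algorithms such as Apriori build on. These measures are not defined in the text above, so the sketch below is an assumption about how one would compute them; the shopping baskets and function names are hypothetical.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


bread_support = support({"bread"}, transactions)
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
print(bread_support, bread_to_milk)
```

A rule such as "bread -> milk" is considered interesting when both its support and its confidence exceed thresholds chosen by the analyst.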
Advantages and Disadvantages of Unsupervised Learning Algorithm
Advantages:
o These algorithms can be used for more complicated tasks than supervised
ones because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks, as getting an
unlabeled dataset is easier than getting a labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset
is not labelled and the algorithms are not trained with the exact output
beforehand.
o Working with unsupervised learning is more difficult, as it works with an
unlabelled dataset that is not mapped to an output.
Applications of Unsupervised Learning
o Network Analysis: Unsupervised learning is used for identifying plagiarism
and copyright in document network analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use
unsupervised learning techniques for building recommendation applications
for different web applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of
unsupervised learning, which can identify unusual data points within the
dataset. It is used to discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used
to extract particular information from the database. For example, extracting
information about each user located at a particular location.
TRAINING AND TESTING:
What is Training Dataset?
The training data is the biggest (in size) subset of the original dataset, which is
used to train or fit the machine learning model. Firstly, the training data is fed to
the ML algorithms, which lets them learn how to make predictions for the given task.
For example, for training a sentiment analysis model, the training data could be as
below:
Input                     Output (Labels)
The New UI is Great       Positive
Update is really Slow     Negative
The training data varies depending on whether we are using Supervised Learning
or Unsupervised Learning Algorithms.
For Unsupervised learning, the training data contains unlabeled data points, i.e.,
inputs are not tagged with the corresponding outputs. Models are required to find
the patterns from the given training datasets in order to make predictions.
On the other hand, for supervised learning, the training data contains labels in
order to train the model and make predictions.
The type of training data that we provide to the model is highly responsible for the
model's accuracy and prediction ability. It means that the better the quality of the
training data, the better the performance of the model will be. Training data
typically makes up 60% or more of the total data for an ML project.
What is Test Dataset?
Once we train the model with the training dataset, it's time to test the model with
the test dataset. This dataset evaluates the performance of the model and ensures
that the model can generalize well to a new or unseen dataset. The test
dataset is another subset of the original data, which is independent of the training
dataset. However, it has similar types of features and a similar class probability
distribution, and it is used as a benchmark for model evaluation once the model
training is completed. Test data is a well-organized dataset that contains data for
each type of scenario for a given problem that the model would face when
used in the real world. Usually, the test dataset is approximately 20-25% of the total
original data for an ML project.
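The split into training and test subsets described above can be sketched as follows. This is a minimal illustration, assuming a random shuffle with a fixed seed and a 20% test fraction; the function name and parameters are hypothetical and not taken from any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 data points
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before cutting helps both subsets follow a similar distribution, and keeping them disjoint ensures the test data stays unseen during training.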
Machine learning algorithms enable machines to make predictions and solve
problems on the basis of past observations or experiences. The algorithm takes
these experiences or observations from the training data that is fed to it. Further,
one of the great things about ML algorithms is that they can learn and improve over
time on their own, as they are trained with the relevant training data.
Once the model is trained enough with the relevant training data, it is tested with
the test data. We can understand the whole process of training and testing in three
steps, which are as follows:
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in
Supervised Learning), and the model transforms the training data into text
vectors or a number of data features.
3. Test: In the last step, we test the model by feeding it with the test
data/unseen dataset. This step ensures that the model is trained efficiently
and can generalize well.
[Figure: flowchart of the training and testing process]
High Bias
[Figure: hypothesis for a high-bias problem]
Variance
The variability of a model's prediction for a given data point, which tells us the
spread of the predictions, is called the variance of the model. A model with high
variance fits the training data in a very complex way and thus is not able to fit
accurately on data it hasn’t seen before. As a result, such models perform very
well on training data but have high error rates on test data.
When a model has high variance, it is said to be overfitting the data.
Overfitting means fitting the training set accurately via a complex curve and a
high-order hypothesis, but it is not the solution, as the error on unseen data is high.
While training a model, variance should be kept low.
[Figure: high-variance data]
High Variance
[Figure: hypothesis for a high-variance problem]
The best point chosen for training the algorithm is the one that gives low error
on both training and testing data.
ESTIMATING RISK STATISTICS:
Unraveling the genetic background of human diseases serves a number of goals.
One aim is to identify genes that modify the susceptibility to disease. In this context,
we ask questions like: “Is this genetic variant more frequent in patients with the
disease of interest than in unaffected controls?” or “Is the mean phenotype higher
in carriers of this genetic variant than in non-carriers?” From the answers, we
possibly learn about the pathogenesis of the disease, and we can identify possible
targets for therapeutic interventions. Looking back at the past decade, it can be
summarized that genome-wide association (GWA) studies have been useful in this
endeavor (Hindorff et al. 2012).
When we consider classical measures for strength of association on the one hand,
such as the odds ratio (OR), and for classification on the other hand, such as
sensitivity (sens) and specificity (spec), there is a simple relationship between
them, with
OR = [sens / (1 - sens)] x [spec / (1 - spec)].
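As a small numeric sketch of how OR, sens, and spec are tied together, the function below assumes the standard identity that the odds ratio is the odds of a positive result among the diseased multiplied by the odds of a negative result among the non-diseased. The function name and the example values are hypothetical.

```python
def odds_ratio(sens, spec):
    """Odds ratio implied by a sensitivity and a specificity.

    Assumes OR = [sens / (1 - sens)] * [spec / (1 - spec)].
    """
    if not (0.0 < sens < 1.0 and 0.0 < spec < 1.0):
        raise ValueError("sens and spec must lie strictly between 0 and 1")
    return (sens / (1.0 - sens)) * (spec / (1.0 - spec))


# Hypothetical marker with 80% sensitivity and 90% specificity.
example_or = odds_ratio(0.8, 0.9)
print(example_or)
```

Read in the other direction, the identity shows that a marker needs a very large odds ratio before it classifies individuals well.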
If the population standard deviation σ is not known, we cannot assume that the
standardized sample mean X̅ is normally distributed. If certain conditions are
satisfied (explained below), then we can transform X̅ to another random variable t
such that
t = (X̅ - μ) / (s / √n),
where μ is the population mean and s is the sample standard deviation.
The random variable t is said to follow the t-distribution with n-1 degrees of freedom,
where n is the sample size. The t-distribution is bell-shaped and symmetric (just like
the normal distribution) but has fatter tails compared to the normal distribution. This
means values further away from the mean have a higher likelihood of occurring
compared to that in the normal distribution.
The conditions to use the t-distribution for the random variable t are as follows
(Sharpe et al., 2020, pp. 415–420):
o If X is normally distributed, the t-distribution can be used even for small
sample sizes (n<15).
o If the sample size is between 15 and 40, the t-distribution can be used as long
as X is unimodal and reasonably symmetric.
o For sample sizes greater than 40, the t-distribution can be used unless X’s
distribution is heavily skewed.
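The transformation of the sample mean into the variable t can be sketched as below, assuming the usual one-sample form t = (X̅ - μ) / (s / √n) with s the sample standard deviation. The sample values and the hypothesized mean are made up for illustration.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """Return (t, degrees_of_freedom) for a sample and hypothesized mean mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical small sample and hypothesized population mean.
sample = [2.0, 4.0, 6.0, 8.0, 10.0]
t_value, dof = t_statistic(sample, mu0=5.0)
print(t_value, dof)
```

The resulting value is compared against the t-distribution with n-1 degrees of freedom rather than against the normal distribution.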
Empirical risk minimization (ERM):
We assume that our samples come from an underlying data distribution, and we use
our dataset as an approximation of it.
If we compute the loss using the data points in our dataset, it’s called the empirical
risk. It is “empirical” and not “true” because we are using a dataset that’s a subset
of the whole population.
When we build our learning model, we have to pick a function that minimizes the
empirical risk, that is, the difference between predicted output and actual output
for the data points in the dataset.
The process of finding this function is called empirical risk minimization (ERM).
What we really want is to minimize the true risk.
We don’t have the information that would allow us to do that, so we hope that the
empirical risk will be almost the same as the true risk.
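Picking the function with the lowest empirical risk can be illustrated with a very small hypothesis class. The sketch below assumes a 0-1 loss and a handful of threshold classifiers; the dataset, the candidate thresholds, and the function name are hypothetical.

```python
# Hypothetical labelled dataset: (input value, label).
dataset = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

# Hypothesis class: "predict 1 if x >= threshold" for a few thresholds.
candidates = [0.0, 2.5, 4.5, 9.0]


def empirical_risk(threshold, dataset):
    """Average 0-1 loss of the threshold classifier on the dataset."""
    errors = sum((1 if x >= threshold else 0) != y for x, y in dataset)
    return errors / len(dataset)


# ERM: choose the hypothesis with the smallest empirical risk.
best_threshold = min(candidates, key=lambda t: empirical_risk(t, dataset))
print(best_threshold, empirical_risk(best_threshold, dataset))
```

The chosen threshold is only guaranteed to be good on this sample; whether it is also good on the whole population is exactly the gap between empirical and true risk.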
In the equation below, we can define the true error, which is based on the whole
domain X:
Since we only have access to S, a subset of the input domain, we learn based on
that sample of training examples. We don’t have access to the true error, but to
the empirical error:
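The two quantities are commonly written as follows. This is a sketch in standard textbook notation rather than necessarily the notation originally used here: it assumes a 0-1 loss, a data distribution D over the domain X, a true labelling function f, a hypothesis h, and a training sample S = ((x_1, y_1), ..., (x_m, y_m)).

```latex
% True error: probability of a wrong prediction over the whole domain
L_{\mathcal{D},f}(h) = \Pr_{x \sim \mathcal{D}}\left[\, h(x) \neq f(x) \,\right]

% Empirical error: fraction of wrong predictions on the sample S of size m
L_S(h) = \frac{\left|\{\, i \in \{1,\dots,m\} : h(x_i) \neq y_i \,\}\right|}{m}
```

ERM returns a hypothesis h that minimizes L_S(h), in the hope that L_S(h) is close to the true error.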
Unit-2
Supervised Learning Algorithm
Supervised learning is a type of Machine learning in which the machine needs
external supervision to learn. The supervised learning models are trained
using the labeled dataset. Once the training and processing are done, the
model is tested by providing a sample test data to check whether it predicts
the correct output.
The goal of supervised learning is to map input data with the output data.
Supervised learning is based on supervision, and it is the same as when a
student learns things in the teacher's supervision. The example of supervised
learning is spam filtering.
Supervised learning can be divided further into two categories of problem:
o Classification
o Regression
Distance-based models
Like Linear models, distance-based models are based on the geometry of
data. As the name implies, distance-based models work on the concept of
distance. In the context of Machine learning, the concept of distance is not
based on merely the physical distance between two points.
o In the K-Nearest Neighbor (K-NN) example, the 3 nearest neighbors are from
category A, hence the new data point is assigned to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-
NN algorithm:
o There is no particular way to determine the best value for "K", so we
need to try some values to find the best among them. A commonly
used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the
model sensitive to outliers.
o Large values for K reduce the effect of noise, but they can blur the
boundaries between classes and increase the computation.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which may sometimes be
complex.
o The computation cost is high because of calculating the distance
between the data points for all the training samples.
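The K-NN idea of classifying a new point by the majority category among its K nearest neighbors can be sketched as follows, assuming Euclidean distance and K=3. The training points, categories, and function name are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort all training points by Euclidean distance to the query point.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (point, category).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

label_near_a = knn_predict(training, (2.0, 2.0), k=3)
label_near_b = knn_predict(training, (6.0, 7.0), k=3)
print(label_near_a, label_near_b)
```

The sort over all training points at prediction time is what makes the computation cost grow with the size of the training set.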
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems, but mostly it is preferred
for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the
outcome.
o In a decision tree, there are two types of nodes: the Decision
Node and the Leaf Node. Decision nodes are used to make a decision
and have multiple branches, whereas leaf nodes are the output of
those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the
given dataset.
o It is a graphical representation for getting all the possible solutions to
a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a tree-
like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer
(Yes/No), it further splits the tree into subtrees.
[Figure: general structure of a decision tree]
Note: A decision tree can contain categorical data (YES/NO) as well as
numeric data.
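How a decision node chooses its question can be sketched with the Gini impurity, the split criterion CART uses for classification. The sketch below handles a single numeric feature and assumes questions of the form "value <= threshold"; the data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for a mixed node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' question."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: customer age and whether the customer bought the product.
ages = [22, 25, 30, 35, 40, 45]
buys = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, buys)
print(threshold, impurity)
```

A full tree repeats this search on each resulting subset until the leaf nodes are pure or another stopping rule applies.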
Example: SVM (Support Vector Machine) can be understood with the example that
we used for the KNN classifier. Suppose we see a strange cat that also has some
features of dogs. If we want a model that can accurately identify whether it is a
cat or a dog, such a model can be created by using the SVM algorithm. We first
train our model with lots of images of cats and dogs so that it can learn the
different features of cats and dogs, and then we test it with this strange
creature. The SVM creates a decision boundary between the two classes (cat
and dog) and chooses the extreme cases (support vectors) of each class. On
the basis of the support vectors, it will classify the creature as a cat.
[Figure: SVM classification of the cat/dog example]
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used
is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable
data. If a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is
called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate
the classes in n-dimensional space, but we need to find the best decision
boundary that helps to classify the data points. This best boundary is known
as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present
in the dataset: if there are 2 features, the hyperplane will be a straight line,
and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means
the maximum distance between the hyperplane and the nearest data points of
each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which
affect the position of the hyperplane are termed support vectors. Since
these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the
dataset has two features x1 and x2. We want a classifier that can classify the
pair(x1, x2) of coordinates in either green or blue. Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate
these two classes. But there can be multiple lines that can separate these
classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary;
this best boundary is called the hyperplane. The SVM algorithm finds the
points of both classes that lie closest to the boundary. These points are
called support vectors. The distance between the support vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal hyperplane.
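The margin idea can be sketched in a few lines of Python. This is only an illustration: the green and blue points and the candidate line x1 + x2 - 5.5 = 0 are assumed toy values, and the line is not claimed to be the optimal hyperplane. The sketch only measures the margin of one given boundary and finds the points that act as its support vectors.

```python
import math

def signed_distance(w, b, point):
    """Signed distance from `point` to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

# Hypothetical toy data with two features (x1, x2) and two tags.
green = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
blue = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]

# Assumed candidate boundary: x1 + x2 - 5.5 = 0.
w, b = (1.0, 1.0), -5.5

distances = {p: signed_distance(w, b, p) for p in green + blue}

# The margin of this boundary is the distance to the closest point.
margin = min(abs(d) for d in distances.values())

# Points lying exactly at the margin play the role of support vectors.
support_vectors = [p for p, d in distances.items()
                   if math.isclose(abs(d), margin)]

print("margin:", round(margin, 4))
print("support vectors:", support_vectors)
```

An SVM trainer searches over w and b for the boundary whose margin, computed as above, is as large as possible.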
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line,
but for non-linear data, we cannot draw a single straight line. Consider the
below image:
So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data, we
will add a third dimension z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-d space, the boundary looks like a plane parallel to the
x-axis (a plane of constant z). If we convert it back to 2-d space with
z = 1, it becomes the circle x^2 + y^2 = 1, i.e. a circle of radius 1:
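The effect of the extra dimension can be checked with a small sketch. The two rings of points below are assumed toy data, and the threshold z = 1 is the same value used in the text; a real SVM would learn the separating plane rather than have it fixed by hand.

```python
import math

def lift(x, y):
    """Add the third dimension z = x^2 + y^2 to a 2-d point."""
    return (x, y, x * x + y * y)

# Hypothetical toy data: one class on a small circle, the other on a large one.
angles = [2 * math.pi * i / 8 for i in range(8)]
inner = [(0.5 * math.cos(a), 0.5 * math.sin(a)) for a in angles]  # radius 0.5
outer = [(2.0 * math.cos(a), 2.0 * math.sin(a)) for a in angles]  # radius 2.0

inner_z = [lift(x, y)[2] for x, y in inner]
outer_z = [lift(x, y)[2] for x, y in outer]

def classify(x, y, threshold=1.0):
    """In the lifted space the flat plane z = threshold separates the classes."""
    return "inner" if lift(x, y)[2] < threshold else "outer"

print("largest z of inner class:", max(inner_z))
print("smallest z of outer class:", min(outer_z))
```

No straight line in the (x, y) plane separates the two rings, but after lifting, every inner point has z below 1 and every outer point has z above 1.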
About Ranking
Ranking is a machine learning technique to rank items.
Ranking is useful for many applications in information retrieval such as
e-commerce, social networks, recommendation systems, and so on. For
example, a user searches for an article or an item to buy online. To build a
recommendation system, it is important that similar articles or relevant
items appear to the user, so that the user clicks or purchases the item. A
simple regression model can predict the probability that a user clicks an
article or buys an item. However, it is more practical to use a ranking
technique and order or rank the articles or items to maximize the chances of
getting a click or purchase. The prioritization of the articles or the items
influences the decision of the users.
The ranking technique directly ranks items by training a model to predict the
ranking of one item over another item. In the training model, it is possible to
have items, ranking one over the other by having a "score" for each item.
Higher ranked items have higher scores and lower ranked items have lower
scores. Using these scores, a model is built to predict which item ranks higher
than the other.
Ranking Methods
Oracle Machine Learning supports pairwise and listwise ranking methods
through XGBoost.
The training data consists of a number of sets, where each set contains
objects and labels representing their ranking. A ranking function is
constructed by minimizing a certain loss function on the training data. Using
test data, the ranking function is applied to get a ranked list of objects.
Ranking is enabled for XGBoost using the regression function.
Pairwise ranking: This approach regards a pair of objects as the learning
instance. The pairs and lists are defined by supplying the same case_id value.
Given a pair of objects, this approach gives an optimal ordering for that pair.
Pairwise losses are defined by the order of the two objects. In OML4SQL, the
algorithm uses LambdaMART to perform pairwise ranking with the goal of
minimizing the average number of inversions in ranking.
Listwise ranking: This approach takes multiple lists of ranked objects as
learning instance. The items in a list must have the same case_id. The
algorithm uses LambdaMART to perform list-wise ranking.
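To make the notion of an inversion concrete, here is a minimal sketch that counts wrongly ordered pairs for one list of items. It is not the OML4SQL or XGBoost implementation; the relevance labels and model scores are made-up values for illustration only.

```python
from itertools import combinations

def count_inversions(scores, relevance):
    """Count pairs where the more relevant item did not get the higher score."""
    inversions = 0
    for i, j in combinations(range(len(scores)), 2):
        if relevance[i] == relevance[j]:
            continue  # ties in relevance carry no ordering information
        higher, lower = (i, j) if relevance[i] > relevance[j] else (j, i)
        if scores[higher] <= scores[lower]:
            inversions += 1
    return inversions

# Hypothetical items of one list (one case_id): true relevance and two models.
relevance = [3, 2, 1, 0]
good_scores = [0.9, 0.7, 0.4, 0.1]
bad_scores = [0.9, 0.2, 0.7, 0.1]

# Ranking the items means sorting them by predicted score, highest first.
ranked = sorted(range(len(good_scores)), key=lambda i: good_scores[i], reverse=True)

print("inversions (good model):", count_inversions(good_scores, relevance))
print("inversions (bad model):", count_inversions(bad_scores, relevance))
```

A pairwise method tries to drive this count down over all pairs that share a list, while a listwise method evaluates the whole ordering at once.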
See Also:
"Ranking Measures and Loss Functions in Learning to Rank" a research
paper presentation at https://www.researchgate.net/
Oracle Database PL/SQL Packages and Types Reference for a listing
and explanation of the available model settings for XGBoost.
Note:
The term hyperparameter is also interchangeably used for model setting.
Related Topics
XGBoost
DBMS_DATA_MINING — Algorithm Settings: XGBoost
Ranking Algorithms
Ranking falls under the Regression function.
OML4SQL supports XGBoost algorithm for ranking.
Structured outputs
Structured prediction or structured (output) learning is an umbrella
term for supervised machine learning techniques that involve predicting
structured objects [1], rather than scalar discrete or real values.
Similar to commonly used supervised learning techniques, structured
prediction models are typically trained by means of observed data in which
the true prediction value is used to adjust model parameters. Due to the
complexity of the model and the interrelations of predicted variables, the
processes of prediction using a trained model and of training itself are
often computationally infeasible, so approximate inference and learning
methods are used.
Applications
For example, the problem of translating a natural language sentence into a
syntactic representation such as a parse tree can be seen as a structured
prediction problem in which the structured output domain is the set of all
possible parse trees [2]. Structured prediction is also used in a wide
variety of application domains including bioinformatics, natural language
processing, speech recognition, and computer vision.
Example: sequence tagging
Sequence tagging is a class of problems prevalent in natural language
processing, where input data are often sequences (e.g. sentences of text).
The sequence tagging problem appears in several guises, e.g. part-of-speech
tagging and named entity recognition. In POS tagging, for example, each
word in a sequence must receive a "tag" (class label) that expresses its "type"
of word:
This DT
is VBZ
a DT
tagged JJ
sentence NN
DIFFERENCE
Bagging | Boosting
Various training data subsets are randomly drawn with replacement from the whole training dataset. | Each new subset contains the components that were misclassified by previous models.
Bagging attempts to tackle the over-fitting issue. | Boosting tries to reduce bias.
If the classifier is unstable (high variance), then we need to apply bagging. | If the classifier is steady and straightforward (high bias), then we need to apply boosting.
Every model receives an equal weight. | Models are weighted by their performance.
Objective to decrease variance, not bias. | Objective to decrease bias, not variance.
It is the easiest way of connecting predictions that belong to the same type. | It is a way of connecting predictions that belong to the different types.
Every model is constructed independently. | New models are affected by the performance of the previously developed model.
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner, and their outputs are
combined by a meta learner to produce better predictions.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to
learn how to best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the
Model Averaging Ensemble technique, in which all sub-models participate according
to their performance weights and build a new model with better predictions. This
new model is stacked up on top of the others; this is the reason why it is named
stacking.
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of
two or more base (learner) models and a meta-model that combines the predictions
of the base models. These base models are called level 0 models, and the
meta-model is known as the level 1 model. So, the stacking ensemble method
includes original (training) data, primary level models, primary level predictions,
a secondary level model, and the final prediction. The basic architecture of
stacking can be represented as shown in the image below.
o Original data: This data is divided into n folds and is used as both training
data and test data.
o Base models: These models are also referred to as level-0 models. These
models use training data and provide compiled predictions (level-0) as an
output.
o Level-0 Predictions: Each base model is trained on some training data and
provides different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one
meta-model, which helps to best combine the predictions of the base models.
The meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the
predictions of the base models and is trained on different predictions made
by individual base models, i.e., data not used to train the base models are fed
to the meta-model, predictions are made, and these predictions, along with
the expected outputs, provide the input and output pairs of the training
dataset used to fit the meta-model.
Steps to implement Stacking models:
There are some important steps to implementing stacking models in machine
learning.
These are as follows:
o Split the training dataset into n folds, for example using
RepeatedStratifiedKFold, which is a common approach to preparing training
datasets for meta-models.
o Fit the first base model on n-1 folds and make predictions for the remaining
(nth) fold.
o Add the predictions made in the above step to the x1_train list.
o Repeat steps 2 and 3 for the remaining n-1 folds, which gives an x1_train
array with predictions for all n folds.
o Now train the model on all n folds and make predictions for the test data.
o Add these predictions to the y1_test list.
o In the same way, find x2_train, y2_test, x3_train, and y3_test by using
Model 2 and Model 3, respectively, to get their level-0 predictions.
o Now train the meta-model on these predictions, which are used as features
for the model.
o Finally, the meta-model can be used to make predictions on the test data in
the stacking model.
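The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production recipe: the two base models (a mean predictor and a nearest-point predictor), the simple interleaved fold split, the blending-weight meta-model, and the toy data are all hypothetical choices made to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Split sample indices into n_folds interleaved folds."""
    return [list(range(start, n_samples, n_folds)) for start in range(n_folds)]

def fit_mean(xs, ys):
    """Level-0 model 1: always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_nearest(xs, ys):
    """Level-0 model 2: predicts the target of the nearest training point."""
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

def out_of_fold_predictions(fit, xs, ys, n_folds):
    """Predict every sample with a model that never saw it during training."""
    predictions = [None] * len(xs)
    for held_out in k_fold_indices(len(xs), n_folds):
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predictions[i] = model(xs[i])
    return predictions

def squared_error(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys))

def fit_meta(level0_a, level0_b, ys, steps=20):
    """Level-1 model: the weight w that best blends the level-0 predictions."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: squared_error(
        [w * a + (1 - w) * b for a, b in zip(level0_a, level0_b)], ys))

# Hypothetical toy regression data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.2, 2.9, 4.1, 5.2, 5.8, 7.1]

# Level-0 predictions are the features used to train the meta-model.
oof_mean = out_of_fold_predictions(fit_mean, xs, ys, n_folds=4)
oof_nearest = out_of_fold_predictions(fit_nearest, xs, ys, n_folds=4)
weight = fit_meta(oof_mean, oof_nearest, ys)

# Refit the base models on all data; the meta-model combines their outputs.
final_mean, final_nearest = fit_mean(xs, ys), fit_nearest(xs, ys)

def stacked_predict(x):
    return weight * final_mean(x) + (1 - weight) * final_nearest(x)

print("meta-model weight on the mean predictor:", weight)
print("stacked prediction at x = 3.4:", stacked_predict(3.4))
```

The key design point is that the meta-model is fitted only on out-of-fold predictions, so it never learns from predictions a base model made on its own training rows.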
pasting
One approach is to use the same training algorithm for every predictor, but
to train them on different random subsets of the training set. When sampling
is performed with replacement, this method is called bagging (short for
bootstrap aggregating). When sampling is performed without replacement, it is
called pasting.
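The difference between the two sampling schemes can be shown with the standard library alone. The helper name draw_subset and the ten-item training set are assumptions made for this sketch.

```python
import random

def draw_subset(data, size, with_replacement, rng):
    """Draw one training subset for a single predictor."""
    if with_replacement:
        return rng.choices(data, k=size)  # bagging: an item may repeat
    return rng.sample(data, k=size)       # pasting: every item is distinct

rng = random.Random(42)          # fixed seed so the run is repeatable
training_set = list(range(10))   # hypothetical training instances 0..9

# Bagging can draw more items than exist, because items are put back.
bagging_subset = draw_subset(training_set, 15, with_replacement=True, rng=rng)

# Pasting can never draw more than len(training_set) distinct items.
pasting_subset = draw_subset(training_set, 6, with_replacement=False, rng=rng)

print("bagging:", bagging_subset)
print("pasting:", pasting_subset)
```

In an ensemble, one such subset would be drawn per predictor and each predictor would be trained on its own subset.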
The first thing to understand is the decision boundary (the red line in the figure
above). Consider these lines as being at some distance, say 'a', from the
hyperplane. So, these are the lines that we draw at distance '+a' and '-a' from the
hyperplane. This 'a' is usually referred to as epsilon.
Our main aim here is to choose a decision boundary at distance 'a' from the original
hyperplane such that the data points closest to the hyperplane, the support vectors,
are within that boundary line.
Hence, we are going to take only those points that are within the decision boundary
and have the least error rate, or are within the margin of tolerance. This gives us a
better fitting model.
A real-world dataset contains features that vary in magnitude, unit, and range.
Normalization is advisable when the scale of a feature is irrelevant or
misleading.
Feature scaling helps to normalize the data within a particular range.
Several commonly used classes perform feature scaling internally. The SVR class
does not, so we should perform feature scaling ourselves in Python.
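from sklearn.preprocessing import StandardScaler
# X and y are assumed to be NumPy arrays loaded in an earlier step.
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
# StandardScaler expects a 2-D array, so reshape y into a single column.
y = sc_y.fit_transform(y.reshape(-1, 1))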
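What standardization does to the numbers can be sketched without any library. The helper names and the salary values below are hypothetical; the formula (subtract the mean, divide by the population standard deviation) is the one StandardScaler applies to each column.

```python
import math

def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values], mean, std

def inverse_standardize(scaled, mean, std):
    """Map scaled values back to the original units."""
    return [s * std + mean for s in scaled]

# Hypothetical salary column.
salaries = [45000.0, 50000.0, 60000.0, 80000.0, 110000.0]

scaled, mean, std = standardize(salaries)
restored = inverse_standardize(scaled, mean, std)

print("scaled:", [round(s, 3) for s in scaled])
print("restored:", [round(r, 1) for r in restored])
```

The inverse step matters for SVR: predictions made on scaled targets have to be mapped back before they can be read as salaries.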
Kernel is the most important feature. There are many types of kernels – linear,
Gaussian, etc. Each is used depending on the dataset. To learn more about this, read
this: Support Vector Machine (SVM) in Python and R
Step 6. Visualizing the SVR results (for higher resolution and smoother curve)
import numpy as np
import matplotlib.pyplot as plt
# A fine grid (step 0.01) is needed because the data is feature scaled.
# `regressor` is the SVR model fitted in the earlier step.
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
This is what we get as output- the best fit line that has a maximum number of points.
Quite accurate!
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The name of the Naïve Bayes algorithm is made up of two words, Naïve and Bayes,
which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Hence each feature
individually contributes to identifying it as an apple, without depending on
the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
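A small worked example may help here. The probabilities below are invented for illustration: A is "the mail is spam" and B is "the mail contains a particular word". The marginal probability P(B) is obtained by summing over both cases of A.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    # Marginal probability of the evidence, P(B), over A and not-A.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers for a spam filter.
p_spam = 0.2               # P(A): prior probability that a mail is spam
p_word_given_spam = 0.6    # P(B|A): the word appears in spam
p_word_given_ham = 0.05    # P(B|not A): the word appears in normal mail

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print("P(spam | word) =", round(p_spam_given_word, 4))
```

With these assumed numbers, P(B) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16, so the posterior is 0.12 / 0.16 = 0.75: seeing the word raises the probability of spam from 0.2 to 0.75.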
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means if predictors take continuous values instead of
discrete, then the model assumes that these values are sampled from the
Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data
follows a multinomial distribution. It is primarily used for document
classification problems, i.e. deciding which category, such as Sports,
Politics, Education, etc., a particular document belongs to.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables,
such as whether a particular word is present or not in a document. This model
is also well known for document classification tasks.
Python Implementation of the Naïve Bayes algorithm:
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will
use the "user_data" dataset, which we have used in our other classification model.
Therefore we can easily compare the Naive Bayes model with the other models.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result
1) Data Pre-processing step:
In this step, we will pre-process/prepare the data so that we can use it efficiently in
our code. It is similar to what we did in the data pre-processing step.
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training
set. Below is the code for it:
3) Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions
4) Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:
5) Visualizing the training set result:
Next we will visualize the training set result using Naïve Bayes Classifier. Below is
the code for it:
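Steps 2 to 4 can be sketched from scratch as follows. This is a minimal Gaussian Naive Bayes written for illustration, not the scikit-learn implementation, and the small (age, salary in thousands) table is a hypothetical stand-in for the "user_data" dataset, which is not reproduced here.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for features, label in zip(X, y):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict(model, features):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0 for _ in labels] for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[labels.index(t)][labels.index(p)] += 1
    return matrix

# Hypothetical data: (age, salary in thousands) -> purchased (1) or not (0).
X_train = [(22, 20), (25, 25), (28, 30), (30, 28),
           (45, 80), (48, 90), (52, 85), (55, 95)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [(24, 22), (50, 88), (27, 27), (47, 82)]
y_test = [0, 1, 0, 1]

model = fit_gaussian_nb(X_train, y_train)          # step 2: fit
y_pred = [predict(model, row) for row in X_test]   # step 3: predict
cm = confusion_matrix(y_test, y_pred)              # step 4: confusion matrix
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)

print("predictions:", y_pred)
print("confusion matrix:", cm)
print("accuracy:", accuracy)
```

The independence assumption shows up in predict: the log likelihoods of the individual features are simply added, one feature at a time.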
Unit-4
Unsupervised learning techniques:
Clustering :
Clustering or cluster analysis is a machine learning technique, which groups the
unlabelled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the
possible similarities remain in a group that has less or no similarities with
another group."
It does it by finding some similar patterns in the unlabelled dataset such as
shape, size, color, behavior, etc., and divides them as per the presence and
absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a
cluster- ID. ML system can use this id to simplify the processing of large and
complex datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhat similar to the classification algorithm, but the
difference is the type of dataset that we are using. In classification, we work with
the labeled dataset, whereas in clustering, we work with the unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example
of a mall: When we visit any shopping mall, we can observe that the things with
similar usage are grouped together. For example, the t-shirts are grouped in one
section and trousers are in another section; similarly, in the vegetable
section, apples, bananas, mangoes, etc., are grouped in separate sections, so
that we can easily find the things. The clustering technique also works in the
same way. Another example of clustering is grouping documents according to the
topic.
The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on past product searches.
Netflix also uses this technique to recommend movies and web series to its
users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see
the different fruits are divided into several groups with similar properties.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve
the clustering problems in machine learning or data science. In this topic, we will
learn what is K-means clustering algorithm, how the algorithm works, along
with the Python implementation of k-means clustering.
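As a preview, the standard K-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain Python. The points, the two starting centroids, and the fixed number of iterations are assumptions chosen for this illustration.

```python
def k_means(points, centroids, iterations=10):
    """Plain K-means: alternate the assignment step and the update step."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-d points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print("centroids:", centroids)
print("clusters:", clusters)
```

The result depends on the starting centroids, which is why practical implementations usually try several initializations.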
Density connected:
A point i is said to be density connected to a point j with respect to Eps and
MinPts if there is a point o such that both i and j are density reachable from
o with respect to Eps and MinPts.
Gaussian Discriminant Analysis
There are two types of Supervised Learning algorithms used in Machine
Learning for classification.
1. Discriminative Learning Algorithms
2. Generative Learning Algorithms
Logistic Regression, Perceptron, and similar methods are examples of
discriminative learning algorithms. These algorithms attempt to determine a
boundary between classes in the learning process. For example, a
Discriminative Learning Algorithm might be used to solve a classification
problem that determines whether a patient has malaria. A new example is then
checked to see on which side of the boundary it falls, i.e., the algorithm
models P(y|X): given a feature set X, what is its probability of belonging to
the class "y".
Generative Learning Algorithms, on the other hand, take a different approach.
They try to capture each class distribution separately rather than finding a
boundary between classes. A Generative Learning Algorithm, as mentioned, will
examine the distribution of infected and healthy patients separately. It will
then attempt to learn each distribution's features individually. When a new
example is presented, it will be compared to both distributions, and the class
that it most closely resembles will be assigned. The algorithm models P(X|y)
together with P(y); here, P(y) is known as the class prior.
Generative learning algorithms use Bayes' theorem to make predictions. By
using only the values of P(X|y) and P(y) for each class, we can determine
P(y|X), i.e., considering the characteristics of a sample, how likely is it
that it belongs to class "y".
Gaussian Discriminant Analysis is a Generative Learning Algorithm that aims to
determine the distribution of every class. It fits a separate Gaussian
distribution to each category of data. Under such a generative model, the
likelihood of an example is very high if it is close to the centre of the
contour that corresponds to its class, and it diminishes as we move away from
the centre of the contour. Below are images that illustrate the differences
between Discriminative and Generative Learning Algorithms.
should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Eigenvectors corresponding to the largest eigenvalues are used to
reconstruct a large fraction of the variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and there might have
been some data loss in the process. But the most important variances should be
retained by the remaining eigenvectors.
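The three steps can be sketched without any library. To stay short, this illustration finds only the eigenvector with the largest eigenvalue, using power iteration instead of a full eigendecomposition; the four data points and the fixed starting vector are assumptions made for the example.

```python
import math

def covariance_matrix(rows):
    """Step 1: sample covariance matrix of the (mean-centred) data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenvector(matrix, iterations=100):
    """Step 2 (simplified): power iteration for the dominant eigenpair."""
    d = len(matrix)
    # Assumed start vector; it must not be orthogonal to the answer.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return eigenvalue, v

# Hypothetical 2-d data that varies mostly along the diagonal direction.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]

cov = covariance_matrix(data)
eigenvalue, component = top_eigenvector(cov)

# Fraction of total variance kept by the first principal component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Step 3: project the centred data onto the component (2-d reduced to 1-d).
mean = [sum(col) / len(col) for col in zip(*data)]
projected = [sum((x - m) * c for x, m, c in zip(row, mean, component))
             for row in data]

print("first principal component:", [round(c, 4) for c in component])
print("explained variance ratio:", round(explained, 4))
```

For this data the covariance matrix is [[5/3, 1], [1, 5/3]], so the first component points along the diagonal and keeps 80% of the variance; the remaining 20% is the data loss mentioned above.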
Advantages of Dimensionality Reduction:
It helps in data compression, and hence reduced storage space.
It reduces computation time.
It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction:
It may lead to some amount of data loss.
PCA tends to find linear correlations between variables, which is sometimes
undesirable.
PCA fails in cases where mean and covariance are not enough to define
datasets.
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that
is used for the dimensionality reduction in machine learning. It is a
statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory
data analysis and predictive modeling. It is a technique to draw strong
patterns from the given dataset by reducing the number of dimensions
while retaining as much variance as possible.
PCA generally tries to find a lower-dimensional surface onto which to
project the high-dimensional data.
PCA works by considering the variance of each attribute, because high
variance indicates a good split between the classes, and hence it reduces
the dimensionality. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power
allocation in various communication channels. It is a feature extraction
technique, so it keeps the important variables and drops the least
important ones.
The PCA algorithm is based on some mathematical concepts such as:
In the kernel space the two classes are linearly separable. Kernel PCA uses a
kernel function to project the dataset into a higher-dimensional space, where it is
linearly separable.
Randomized PCA algorithm
Both SVD and NIPALS are not very efficient when the number of rows in the
dataset is very large (e.g. hundreds of thousands of values or even more).
Such datasets can easily be obtained in the case of, for example,
hyperspectral images. Direct use of the traditional algorithms with such
datasets often leads to a lack of memory and long computational time.
One solution here is to use probabilistic algorithms, which make it possible
to reduce the number of values needed for estimation of principal components.
Starting from 0.9.0, one of the probabilistic approaches is also implemented
in mdatools. The original idea can be found in this paper and some examples on
using the approach for PCA analysis of hyperspectral
A Randomized Algorithm for PCA
• Form Y = AΩ.
• QR decompose Y and discard R.
The main theoretical result is:
$$\mathbb{E}\,\|A - QQ^{*}A\| \le \left[1 + \frac{4\sqrt{k+p}}{p-1}\,\sqrt{\min(m,n)}\right]\sigma_{k+1}(A).$$
Proof Sketch.
Apply the triangle inequality many times in order to split the error into a part that
involves optimizing over a space of dimension k and a separate high dimensional
part.
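Let $\Omega \in \mathbb{R}^{n \times (k+12)}$, $W \in \mathbb{R}^{(k+12) \times n}$ and $Z \in \mathbb{R}^{k \times (k+12)}$.
$$\|A - QQ^{*}A\| \le 2\,\|A - A\Omega W\| + 2\,\|A\Omega - QZ\|\,\|W\|.$$
We want to choose W and Z to show
$$\|A - QQ^{*}A\| \le C\,\sigma_{k+1}(A).$$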
The algorithm forms Q’s columns from singular vectors corresponding to the k +
p greatest singular values of AΩ. This lets us choose Z such that
$$\|A - QQ^{*}A\| \le \sigma_{k+1}(A\Omega) \le \|\Omega\|\,\sigma_{k+1}(A)$$
where we understand the second inequality by recalling that we are working
with the spectral norm in this note.
The existence of a $(k+p) \times n$ matrix W such that $\|A - A\Omega W\| \le C\,\sigma_{k+1}(A)$ is tedious
and shown in the appendix of using results from [1]. A few notes about the
result:
• A few iterations of the power method in our computation of Y can improve the
accuracy of our method.
• We expect the bound in to involve a factor of $\sigma_{k+1}(A)$ as $\sigma_{k+1}(A)$ is the
theoretical best bound we can find.
• Notice that increasing p greatly improves accuracy.
Unit V:
Neural Networks and Deep Learning: Introduction to Artificial Neural
Networks with Keras, Implementing MLPs with Keras, Installing TensorFlow 2,
Loading and Preprocessing Data with TensorFlow
After that,
Step 3: We will be brought to another page, where we will need to select
either the x86-64 or amd64 installer to install Python.
Now, Python is installing.
Step 4: For this tutorial, we choose the option to add Python 3.5 to PATH.
Step 5: Now, we will be able to see the message "Set-up was successful." A
way to confirm that it has installed successfully is to open the Command
Prompt and check the version.
What is pip?
pip is a package management system that is used to install and manage
software packages written in Python. pip is used to download, search,
install, uninstall, and manage third-party Python packages. (pip3 is the
version of pip for Python 3, which comes with the Python 3.5.x version that
we have just downloaded.)
Installing our TensorFlow
Once we have downloaded the latest version of Python, we can now put our
finishing touches by installing our TensorFlow.
Step 1: To install TensorFlow, start the terminal. Make sure that we run the
cmd as an administrator.
If we do not know how to run cmd as an administrator, here is how:
Open the Start menu, search for cmd, and then right-click on it and select
Run as administrator.
Step 2: Once we are done with that, we have to write the command in the
command prompt to finish installing TensorFlow on Windows.
Enter this command:
Choose the vc_redist.x64.exe on the page and click on "Next" after that it will
be downloaded.
Finally, it will be successfully installed on our system.
We will read conda install TensorFlow in our next tutorial.
How is data loaded with TensorFlow?
In memory data
For any small CSV dataset, the simplest way to train a TensorFlow model on it
is to load it into memory as a pandas DataFrame or a NumPy array. A
relatively simple example is the abalone dataset. The dataset is small, and
all the input features are limited-range floating point values.