AD8552 MACHINE LEARNING (MODEL BUILDING/PROTOTYPE)
Unit I
The term machine learning was first coined in the 1950s when Artificial Intelligence
pioneer Arthur Samuel built the first self-learning system for playing checkers. He
noticed that the more the system played, the better it performed.
Fueled by advances in statistics and computer science, as well as better datasets and
the growth of neural networks, machine learning has truly taken off in recent years.
Today, whether you realize it or not, machine learning is everywhere ‒ automated
translation, image recognition, voice search technology, self-driving cars, and beyond.
1.1 What Is Machine Learning?
• Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to
“self-learn” from training data and improve over time, without being explicitly
programmed. Machine learning algorithms are able to detect patterns in data and learn
from them, in order to make their own predictions. In short, machine learning
algorithms and models learn through experience.
1.2 History of Machine Learning (ML)
Key milestones include Alan Turing's 1950 paper proposing a test for machine intelligence, Arthur Samuel's self-learning checkers program described above, Frank Rosenblatt's perceptron (1957), the revival of neural networks through backpropagation in the 1980s, and the deep learning boom of the 2010s driven by large datasets and cheap computation.
2. Essential concepts of Machine Learning
Several core concepts recur throughout machine learning, whether one organizes the field by technique, application, or implementation. These concepts lie at the heart of machine learning theory and its applications and deserve special attention.
2.2 Types of Machine Learning
Machine learning algorithms are broadly classified into three types: supervised learning, unsupervised learning, and reinforcement learning.
2.2.1 Supervised learning
In supervised learning, an AI system is presented with labeled data, which means that each data point is tagged with the correct label.
The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.
For example, suppose we take a set of emails and mark each one as 'Spam' or 'Not Spam'. This labelled data is used to train the supervised model. Once the model is trained, we can test it on new, unseen mails and check whether it predicts the right output.
Advantages:
• Since supervised learning works with a labelled dataset, we have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• They may predict the wrong output if the test data differs from the training data.
• Training the algorithm requires a lot of computational time.
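To make the supervised workflow concrete, here is a minimal sketch of the spam example using scikit-learn; the toy mails and labels below are illustrative assumptions, not data from the source.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labelled training data: each mail is tagged 'Spam' or 'Not Spam'
mails = ["win money now", "meeting at noon",
         "free prize claim", "project status update"]
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

# Turn text into word-count features (the input x)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)

# Learn the mapping from x to the output label Y
model = MultinomialNB().fit(X, labels)

# Test the trained model on a new, unseen mail
test = vectorizer.transform(["claim your free money now"])
print(model.predict(test))  # expected: ['Spam']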
2.2.2 Unsupervised learning
In unsupervised learning, the training data carries no labels, and the system must discover the structure in the data on its own.
For example, suppose we give our model a set of characters that are 'Ducks' and 'Not Ducks', but provide no label for any of them in the training data. The unsupervised model separates the two kinds of characters by looking at the type of data, and it models the underlying structure or distribution in the data in order to learn more about it.
Advantages:
• These algorithms can be used for more complicated tasks than the supervised ones, because they work on unlabeled datasets.
• Unsupervised algorithms are preferable for many tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled one.
Disadvantages:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labeled and the algorithm is not trained with the exact output in advance.
• Working with unsupervised learning is more difficult, as it deals with unlabeled data that does not map to a known output.
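As a sketch of the unsupervised idea, k-means clustering (one common unsupervised algorithm, used here purely for illustration) can group unlabeled points without ever seeing a label:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups, but no class labels are given
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# k-means discovers the underlying structure (two clusters) on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1]: the two groups found
print(kmeans.cluster_centers_)  # centroids of the discovered groups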
2.2.3 Reinforcement learning
In reinforcement learning, an agent learns by interacting with an environment and receiving rewards or penalties for its actions. For example, suppose an agent is given two options: a path with water or a path with fire. A reinforcement algorithm works by rewarding the system: if the agent takes the fire path, rewards are subtracted, and the agent learns that it should avoid the fire path. If it chooses the water path, i.e., the safe path, points are added to its reward, and the agent gradually learns which paths are safe and which are not. By leveraging the rewards obtained, the agent improves its knowledge of the environment in order to select the next action.
Reinforcement learning problems are commonly formalized as a Markov Decision Process (MDP).
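The following is a minimal Q-learning sketch of the fire/water example; the reward values, learning rate, and exploration rate are illustrative assumptions, not prescribed by the source.

import random

# Two possible actions and their (assumed) rewards: fire subtracts, water adds
actions = ["fire_path", "water_path"]
rewards = {"fire_path": -10, "water_path": +10}

Q = {a: 0.0 for a in actions}   # the agent's value estimate for each action
alpha, epsilon = 0.1, 0.2       # learning rate and exploration probability

for episode in range(200):
    # epsilon-greedy: mostly exploit the best-known action, sometimes explore
    a = random.choice(actions) if random.random() < epsilon else max(Q, key=Q.get)
    # move the value estimate toward the reward actually received
    Q[a] += alpha * (rewards[a] - Q[a])

print(Q)  # Q["water_path"] ends up higher: the agent has learned the safe path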
2.3 Machine Learning Methods Based on Time
Another way to slice machine learning methods is to classify them based on the type of data that they deal with. Systems that take in static labelled data are called static learning methods. Systems that deal with data that changes continuously with time are called dynamic methods. Each type of method can be supervised or unsupervised; however, reinforcement learning methods are always dynamic.
2.4 Dimensionality
Dimensionality refers to the number of attributes or features used to describe each sample. As the number of dimensions grows, the data becomes sparse and harder to visualize and model, a difficulty often called the curse of dimensionality.
2.5 Linearity and Nonlinearity
• The concepts of linearity and nonlinearity apply both to the data and to the model that is built on top of it. Data is called linear if the relationship between the input and output is linear: when the value of the input increases, the value of the output increases proportionally, and vice versa.
• All models that use linear equations to model the relationship between input and output are called linear models. However, sometimes, by preconditioning the input or output, a nonlinear relationship in the data can be converted into a linear relationship, and then a linear model can be applied to it.
For example, suppose the input and output are related by the exponential relationship y = 2e^x. This data is clearly nonlinear. However, instead of building the model on the original data, we can build it after applying a log operation. This transforms the original nonlinear relationship into a linear one, log y = log 2 + x. We then build a linear model to predict log y instead of y, and convert the prediction back to y by taking the exponent.
There can also be cases where a problem can be broken down into multiple parts and a linear model applied to each part, ultimately solving a nonlinear problem.
[Figure: examples of a converted linear relationship and a piecewise linear relationship, respectively.]
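Returning to the exponential example, here is a minimal sketch of the log-transform trick, assuming NumPy and scikit-learn; the synthetic data below is generated only to illustrate the y = 2e^x relationship.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic nonlinear data following y = 2 * exp(x), with slight noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 50).reshape(-1, 1)
y = 2 * np.exp(x).ravel() * rng.normal(1.0, 0.01, size=50)

# Fit a linear model to log(y): log y = log 2 + x is linear in x
model = LinearRegression().fit(x, np.log(y))
print(model.intercept_, model.coef_)  # approx. log(2) = 0.693 and slope 1.0

# Convert predictions back to the original scale by exponentiating
y_pred = np.exp(model.predict(x))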
Linear models are the simplest to understand, build, and interpret. Our brain is highly tuned for
linear models, as most of our experiences tend to have linear trends. All the models in the
theory of machine learning can handle linear data. Examples of purely linear models are linear
regression, support vector machines without nonlinear kernels, etc. Nonlinear models inherently
use some nonlinear functions to approximate the nonlinear characteristics of the data. Examples
of nonlinear models include neural networks, decision trees, probabilistic models based on
nonlinear distributions, etc.
2.6 Early Trends in Machine Learning
Before machine learning took off commercially in the true sense, a few other systems were already pushing the boundary of routine computation. One notable application was Expert Systems.
2.6.1 Expert Systems
The definition by Alan Turing marks the beginning of the era in which machine intelligence was recognized, and with it the field of AI was born. However, in the early days (all the way until the 1980s), the field of machine intelligence or machine learning was limited to what were called Expert Systems or Knowledge-Based Systems. One of the leading experts in the field of expert systems, Dr. Edward Feigenbaum, once defined an expert system as:
Definition (Expert Systems): An intelligent computer program that uses knowledge and inference procedures to solve problems that are difficult enough to require significant human expertise for their solution.
Such systems were capable of replacing experts in certain areas. These machines were programmed to perform complex heuristic tasks based on elaborate logical operations. However, in spite of being able to replace human experts in specific areas, these systems were not "intelligent" in the true sense if we compare them with human intelligence.
The reason is that the systems were "hard-coded" to solve only a specific type of problem; if a simpler but completely different problem needed to be solved, they would quickly become completely useless.
Nonetheless, these systems were quite popular and successful, specifically in areas where repeated but highly accurate performance was needed, e.g., diagnosis, inspection, monitoring, and control.
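As a toy illustration of the knowledge-plus-inference idea (not any specific historical system), a forward-chaining rule engine for a hypothetical diagnosis task might look like this:

# Toy rule engine: knowledge is encoded as if-then rules, and an
# inference loop applies them until no new fact can be derived.
rules = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "short_breath"}, "see_doctor"),
]

facts = {"fever", "cough", "short_breath"}

# Forward chaining: fire any rule whose conditions are all satisfied
changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)  # now includes 'flu_suspected' and 'see_doctor'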
3. Data Understanding, Representation, and Visualization
Each entity can contain multiple attributes. The raw data for each application can contain multiple such entities (Table 3.1). In the case of the Iris data, we have only one such entity, in the form of the dimensions of the sepals and petals of the flowers. However, if one is trying to solve this classification problem and finds that the data about sepals and petals alone is not sufficient, then he/she can add more information in the form of additional entities. For example, more information about the flowers, such as their colors, smells, or the longevity of the plants that produce them, can be added to improve the classification performance.
Table 3.1 Sample from the Iris data set containing 3 classes and 4 attributes

Sepal length (cm)  Sepal width (cm)  Petal length (cm)  Petal width (cm)  Class
5.1                3.5               1.4                0.2               Iris-setosa
7.0                3.2               4.7                1.4               Iris-versicolor
6.3                3.3               6.0                2.5               Iris-virginica
3.1.2 Understanding Attributes
Each attribute can be thought of as a column in the file or table. In the case of the Iris data, the attributes of the single given entity are sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. If we had added additional entities like color, smell, etc., each of those entities would have its own attributes. It is important to note that in the current data, all the columns are features, and there is no ID column.
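A short sketch of inspecting these attributes, assuming the copy of the Iris data bundled with scikit-learn and pandas for the tabular view:

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data and arrange it as a table: one column per attribute
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = [iris.target_names[t] for t in iris.target]

print(df.head())  # sepal/petal lengths and widths in cm, plus the class label
print(df.shape)   # (150, 5): all columns are features, no ID column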
The last step in preprocessing the data is visualization: showing how the data is distributed and how it relates to the output or class label. We live in a 3-dimensional world, so any data that is up to 3-dimensional can be plotted and visualized.
However, when there are more than 3 dimensions, it gets tricky. The Iris data, for example, has 4 dimensions. There is no way we can plot the full information in each sample in a single plot that we can visualize.
Principal Component Analysis (PCA)
One popular technique for handling such data is principal component analysis (PCA), which scales perfectly fine to higher dimensions. The principal components are vectors, but they are not chosen at random. The first principal component is computed so that it explains the greatest amount of variance in the original features. The second component is orthogonal to the first, and it explains the greatest amount of variance left after the first principal component.
The algorithm can be used on its own, or it can serve as a data cleaning or data preprocessing technique before another machine learning algorithm. Two common uses are:
1. Reduce the dimensionality of the data, e.g., so that high-dimensional data can be projected down to 2 or 3 dimensions and visualized.
2. De-noise the data. Because PCA is computed by finding the components which explain the greatest amount of variance, it captures the signal in the data and omits the noise.
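A minimal sketch of PCA on the Iris data using scikit-learn, projecting the 4 attributes down to 2 principal components for plotting:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4-dimensional Iris data onto its first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

print(X_2d.shape)                     # (150, 2): now plottable in a plane
print(pca.explained_variance_ratio_)  # share of variance each component explains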
Applications of PCA
PCA is widely used for visualizing high-dimensional data, compressing images, filtering noise, and extracting features before classification.
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) is a supervised dimensionality reduction technique that projects the data onto a lower-dimensional space while maximizing the separability between classes.
Overlapping
When the data points of different classes overlap in the feature space, the classification process becomes difficult. To overcome the overlapping issue in the classification process, we must increase the number of features regularly.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane.
[Figure: two overlapping classes of data points in a 2-D plane.]
Suppose it is impossible to draw a straight line in the 2-D plane that separates these data points efficiently. Using linear discriminant analysis, we can reduce the 2-D plane to a 1-D line in which the classes are separated. This technique can also be used to maximize the separability between multiple classes.
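A minimal sketch of LDA-based dimensionality reduction with scikit-learn, again using the Iris data for illustration; unlike PCA, LDA uses the class labels:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()

# LDA uses the labels to find the projection that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(iris.data, iris.target)

print(X_lda.shape)  # (150, 2): 4 features reduced to 2 discriminant axes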
Applications of LDA
• Face recognition: LDA reduces the large number of pixel features to a smaller set of discriminants before the faces are classified.
• Medical: classifying the severity of a patient's disease based on patient parameters.
• Customer identification: identifying the group of customers most likely to buy a particular product.
• Predictions: LDA can likewise be applied to other classification and prediction problems.