ML Unit-1
Machine Learning:
Machine Learning is the science (and art) of programming computers so they can learn
from data. Arthur Samuel in the year 1959 defined Machine Learning as the field of study that
gives computers the ability to learn without being explicitly programmed.
Learning:
A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with experience
E.
For example, a spam filter is a Machine Learning program that, given examples of spam
emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails, can
learn to flag spam. The examples that the system uses to learn are called the training set. Each
training example is called a training instance (or sample).
In this case, the task T is to flag spam for new emails, the experience E is the training data,
and the performance measure P needs to be defined; for example, the ratio of correctly classified
emails can be used as a performance measure. This particular performance measure is called
accuracy, and it is often used in classification tasks.
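As a minimal sketch of the accuracy measure just described, the ratio of correctly classified emails can be computed as follows. The function name `accuracy` and the toy labels are illustrative assumptions, not part of the original notes.

```python
def accuracy(true_labels, predicted_labels):
    """Return the ratio of correctly classified examples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy data (hypothetical): true classes and a filter's predictions.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(y_true, y_pred))  # 4 of the 5 emails are classified correctly
```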
Examples:
i) Handwriting recognition learning problem
• Task T: Recognizing and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training Experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training Experience E: A sequence of images and steering commands recorded while
observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Definition
A computer program which learns from experience is called a machine learning program
or simply a learning program. Such a program is sometimes also referred to as a learner.
Components of Learning
Basic components of learning process
K.Ramya Laxmi, Assistant Professor, CSE
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates
the various components and the steps involved in the learning process.
1. Data storage :
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning. In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals. Computers use hard disk drives, flash memory, random access memory
and similar devices to store data and use cables and other technology to retrieve data.
2. Abstraction :
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts about
the data as a whole. The creation of knowledge involves application of known models and creation
of new models. The process of fitting a model to a dataset is known as training. When the model
has been trained, the data is transformed into an abstract form that summarizes the original
information.
3. Generalization :
The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form that
can be utilized for future action. These actions are to be carried out on tasks that are similar, but
not identical, to those that have been seen before. In generalization, the goal is to discover those
properties of the data that will be most relevant to future tasks.
4. Evaluation :
Evaluation is the last component of the learning process. It is the process of giving feedback
to the user to measure the utility of the learned knowledge. This feedback is then utilized to effect
improvements in the whole learning process.
Traditional Approaches:
Consider the steps involved in writing a spam filter using traditional programming techniques:
1. Firstly, analyze what spam typically looks like. Notice that some words or phrases (such
as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject line, along
with a few other patterns in the sender’s name, the email’s body, and other parts of the email.
2. Write a detection algorithm for each of the patterns noticed, and the program would flag
emails as spam if a number of these patterns were detected.
3. Test the program and repeat steps 1 and 2 until it is good enough to launch.
Since the problem is difficult, the program will likely become a long list of complex rules—pretty
hard to maintain.
In contrast, a spam filter based on Machine Learning techniques automatically learns which
words and phrases are good predictors of spam by detecting unusually frequent patterns of words
in the spam examples compared to the ham examples. The program is much shorter, easier to
maintain, and most likely more accurate.
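The idea of detecting words that are unusually frequent in spam compared to ham can be sketched as below. This is only an illustration under stated assumptions: the function name `spam_indicator_words`, the `min_ratio` threshold, the add-one smoothing, and the toy emails are all hypothetical choices, not the method of any particular spam filter.

```python
from collections import Counter

def spam_indicator_words(spam_emails, ham_emails, min_ratio=2.0):
    """Return words whose relative frequency in spam is at least
    min_ratio times their relative frequency in ham."""
    spam_counts = Counter(w for e in spam_emails for w in e.lower().split())
    ham_counts = Counter(w for e in ham_emails for w in e.lower().split())
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab_size = len(set(spam_counts) | set(ham_counts))
    indicators = {}
    for word, count in spam_counts.items():
        # Add-one smoothing so words never seen in ham do not divide by zero.
        spam_freq = (count + 1) / (spam_total + vocab_size)
        ham_freq = (ham_counts[word] + 1) / (ham_total + vocab_size)
        if spam_freq / ham_freq >= min_ratio:
            indicators[word] = spam_freq / ham_freq
    return indicators

# Toy training set (hypothetical).
spam = ["free credit card 4u", "amazing free offer 4u"]
ham = ["meeting at noon", "lunch offer for the team", "see you at the meeting"]
indicators = spam_indicator_words(spam, ham)
print(sorted(indicators))
```

Note that no rule about any specific word was written by hand; the list of suspicious words comes entirely from the examples.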
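Moreover, if spammers notice that all their emails containing “4U” are blocked, they might start
writing “For U” instead. A spam filter written with traditional programming techniques would
need to be updated by hand with a new rule. In contrast, a spam filter based on Machine Learning
techniques automatically notices that “For U” has become unusually frequent in spam flagged by
users, and it starts flagging such emails without your intervention (Figure 1-3).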
Machine Learning can help humans learn (Figure 1-4). ML algorithms can be inspected
to see what they have learned (although for some algorithms this can be tricky). For instance,
once a spam filter has been trained on enough spam, it can easily be inspected to reveal the list of
words and combinations of words that it believes are the best predictors of spam. Sometimes this
will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of
the problem. Applying ML techniques to dig into large amounts of data can help discover
patterns that were not immediately apparent. This is called data mining.
These criteria can be combined in any way depending on the application. For example, a
state-of-the-art spam filter may learn on the fly using a deep neural network model trained
using examples of spam and ham; this makes it an online, model-based, supervised learning
system.
Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of
supervision they get during training.
There are four major categories:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning, and
4. Reinforcement Learning
Supervised Learning:
In supervised learning, the training set fed to the algorithm includes the desired
solutions, called labels.
It has two typical tasks to be performed:
1. Classification
2. Regression
For example, the spam filter is a good example of classification: it is trained with many
example emails along with their class (spam or ham), and it must learn how to classify new
emails.
Another typical task is to predict a target numeric value, such as the price of a car, given a
set of features (mileage, age, brand, etc.) called predictors. This sort of task is called
regression. To train the system, many examples of cars, including both their predictors and
their labels (i.e., their prices) must be given to it.
In Machine Learning, an attribute is a data type (e.g., “mileage”), while a feature generally
means an attribute plus its value (e.g., “mileage = 15,000”).
Some regression algorithms can be used for classification as well, and vice versa. For
example, Logistic Regression is commonly used for classification, as it can output a value
that corresponds to the probability of belonging to a given class (e.g., 20% chance of being
spam).
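To illustrate how a classifier of this kind can output a probability, the sketch below passes a weighted sum of features through the logistic (sigmoid) function. The feature meanings, weights, and bias are made-up values for illustration only; in practice they would be learned from labeled examples.

```python
import math

def predict_spam_probability(features, weights, bias):
    """Logistic model: squash a linear score into a probability in (0, 1)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical features: [count of "free", count of "meeting"].
weights = [1.5, -2.0]   # assumed values, not learned here
bias = -1.0             # assumed value
p = predict_spam_probability([2, 0], weights, bias)
label = "spam" if p >= 0.5 else "ham"   # 0.5 is a common default threshold
print(p, label)
```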
Clustering:
For example, say there is a lot of data about a blog’s visitors. A clustering algorithm
can be used to detect groups of similar visitors. The algorithm is not given any information
about which group a visitor belongs to; it finds those connections on its own. For example, it
might notice that 40% of your visitors are males who love comic books and generally read
your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends.
If a hierarchical clustering algorithm is used, it may also subdivide each group into smaller
groups. This may help target the posts for each group.
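One way to make the clustering idea concrete is a very small k-means implementation. The notes do not name a specific clustering algorithm, so k-means is an assumption here, as are the visitor features (age, usual visiting hour) and the naive initialization.

```python
def k_means(points, k, iterations=10):
    """Very small k-means for 2-D points; returns (centroids, assignments)."""
    centroids = list(points[:k])  # naive initialization: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assignments

# Hypothetical visitors: (age, usual visiting hour). No group labels are given.
visitors = [(35, 20), (16, 11), (38, 21), (18, 10), (33, 19), (17, 12)]
centroids, groups = k_means(visitors, k=2)
print(groups)
```

The algorithm is never told which visitor belongs to which group; the grouping emerges from the distances alone.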
Dimensionality Reduction:
A related task is dimensionality reduction, in which the goal is to simplify the data
without losing too much information. One way to do this is to merge several correlated
features into one.
For example, a car’s mileage may be strongly correlated with its age, so the dimensionality
reduction algorithm will merge them into one feature that represents the car’s wear and
tear. This is called feature extraction.
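A rough sketch of merging two correlated features into one is shown below. It standardizes mileage and age and adds them; for two standardized, positively correlated features this sum is proportional to their first principal component. The function names and the toy car data are assumptions for illustration.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def wear_and_tear(mileages, ages):
    """Combine two correlated features into a single 'wear' feature."""
    z_mileage = standardize(mileages)
    z_age = standardize(ages)
    return [(m + a) / 2 ** 0.5 for m, a in zip(z_mileage, z_age)]

# Hypothetical cars: mileage and age in years.
mileages = [15000, 60000, 120000, 200000]
ages = [1, 4, 8, 12]
wear = wear_and_tear(mileages, ages)
print(wear)  # one number per car instead of two
```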
It is often good to reduce the dimension of the training data using a dimensionality
reduction algorithm before feeding it to another Machine Learning algorithm.
Anomaly Detection:
Another important unsupervised task is anomaly detection—for example, detecting
unusual credit card transactions to prevent fraud, catching manufacturing defects, or
automatically removing outliers from a dataset before feeding it to another learning
algorithm. The system is shown mostly normal instances during training, so it learns to
recognize them; then, when it sees a new instance, it can tell whether it looks like a normal
one or whether it is likely an anomaly.
A very similar task is novelty detection: it aims to detect new instances that look
different from all instances in the training set. This requires having a very “clean” training
set, devoid of any instance that you would like the algorithm to detect. For example, if you
have thousands of pictures of dogs, and 1% of these pictures represent Chihuahuas, then a
novelty detection algorithm should not treat new pictures of Chihuahuas as novelties. On
the other hand, anomaly detection algorithms may consider these dogs as so rare and so
different from other dogs that they would likely classify them as anomalies.
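A very simple way to sketch anomaly detection is to learn the mean and spread of normal instances and flag anything that lies too far from them. The 3-standard-deviation threshold, the function names, and the transaction amounts are assumptions made for this illustration; real systems use more robust methods.

```python
import statistics

def fit_normal_profile(amounts):
    """Learn what 'normal' looks like: the mean and standard deviation."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_anomaly(amount, profile, threshold=3.0):
    """Flag an instance lying more than `threshold` deviations from the mean."""
    mean, std = profile
    return abs(amount - mean) / std > threshold

# Hypothetical, mostly normal card transactions used for training.
normal = [42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 39.9, 44.0]
profile = fit_normal_profile(normal)
print(is_anomaly(43.0, profile))    # looks like the training data
print(is_anomaly(950.0, profile))   # far outside the learned range
```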
Association rule learning:
Finally, another common unsupervised task is association rule learning, in which
the goal is to dig into large amounts of data and discover interesting relations between
attributes. For example, consider a supermarket. Running an association rule on its sales
logs may reveal that people who purchase barbecue sauce and potato chips also tend to
buy steak. Thus, it may make sense to place these items close to one another.
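Two standard quantities used to judge an association rule are its support (how often all the items appear together) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both for the barbecue example; the function name and the basket data are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical sales log: one set of items per purchase.
baskets = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"barbecue sauce", "potato chips"},
    {"bread", "milk"},
    {"steak", "wine"},
]
support, confidence = rule_metrics(
    baskets, {"barbecue sauce", "potato chips"}, {"steak"})
print(support, confidence)
```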
Semi-supervised Learning:
Since labeling data is usually time-consuming and costly, often there are plenty of
unlabeled instances, and few labeled instances. Some algorithms can deal with data that’s
partially labeled. This is called Semi-supervised learning.
Some photo-hosting services, such as Google Photos, are good examples of this.
Once all the family photos are uploaded to the service, it automatically recognizes
that the same person A shows up in photos 1, 5, and 11, while another person B shows up
in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all
the system needs is for you to tell it who these people are. Just add one label per person and
it is able to name everyone in every photo, which is useful for searching photos.
Most semi-supervised learning algorithms are combinations of unsupervised and
supervised algorithms. For example, deep belief networks (DBNs) are based on
unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of
one another. RBMs are trained sequentially in an unsupervised manner, and then the whole
system is fine-tuned using supervised learning techniques.
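The photo example can be sketched in a few lines: assume the clustering step has already grouped the detected faces, then spread one user-supplied label per cluster to every face in that cluster. The data layout, the function name `name_photos`, and the names "Alice" and "Bob" are hypothetical; this is not how any real photo service is implemented.

```python
from collections import defaultdict

def name_photos(faces, labeled_faces):
    """faces: list of (photo_id, cluster_id) pairs from the clustering step.
    labeled_faces: {index into faces: name} supplied by the user."""
    # Supervised part: one label per cluster is enough.
    cluster_to_name = {faces[i][1]: name for i, name in labeled_faces.items()}
    names_by_photo = defaultdict(set)
    for photo_id, cluster_id in faces:
        names_by_photo[photo_id].add(cluster_to_name.get(cluster_id, "unknown"))
    return dict(names_by_photo)

# Person A (cluster 0) is in photos 1, 5, 11; person B (cluster 1) in 2, 5, 7.
faces = [(1, 0), (5, 0), (11, 0), (2, 1), (5, 1), (7, 1)]
result = name_photos(faces, {0: "Alice", 3: "Bob"})  # two labels in total
print(result)
```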
Batch learning:
In batch learning, the system is incapable of learning incrementally: it must be
trained using all the available data. This will generally take a lot of time and computing
resources, so it is typically done offline. First the system is trained, and then it is launched
into production and runs without learning anymore; it just applies what it has learned. This
is called offline learning.
For a batch learning system to know about new data (such as a new type of spam),
a new version of the system must be trained from scratch on the full dataset (not
just the new data, but also the old data); then the old system is stopped and replaced with
the new one.
The whole process of training, evaluating, and launching a Machine Learning
system can be automated fairly easily, so even a batch learning system can adapt to change.
Simply update the data and train a new version of the system from scratch as often as
needed.
This solution is simple and often works fine, but training using the full set of data
can take many hours, so a new system is typically trained only every 24 hours or even just
weekly. If the system needs to adapt to rapidly changing data (e.g., to predict stock prices),
then a more reactive solution is needed.
Also, training on the full set of data requires a lot of computing resources like CPU,
memory space, disk space, disk I/O, network I/O, etc. If there is a lot of data and there is a
need to automate the system to train from scratch every day, it will end up costing a lot of
money. If the amount of data is huge, it may even be impossible to use a batch learning
algorithm.
Finally, if the system needs to be able to learn autonomously and it has limited
resources (e.g., a smartphone application or a rover on Mars), then carrying around large
amounts of training data and taking up a lot of resources to train for hours every day is a
showstopper. A better option in all these cases is to use algorithms that are capable of
learning incrementally.
Online learning:
In online learning the system is trained incrementally by feeding it data instances
sequentially, either individually or in small groups called mini-batches. Each learning step
is fast and cheap, so the system can learn about new data on the fly, as it arrives.
A big challenge with online learning is that if bad data is fed to the system, the
system’s performance will gradually decline. If it’s a live system, the clients will notice.
For example, bad data could come from a malfunctioning sensor on a robot, or from
someone spamming a search engine to try to rank high in search results. To reduce this
risk, the system must be monitored closely, and if a drop in performance is detected, learning
must be promptly switched off and the system possibly reverted to a previously working state.
The input data can also be monitored so that the system can react to abnormal data, for
example by using an anomaly detection algorithm.
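Incremental training can be sketched with a linear model that is updated by one stochastic gradient step per incoming instance. The class name, the learning rate, and the synthetic data stream are assumptions for illustration; each instance is used once and then discarded.

```python
class OnlineLinearModel:
    """Linear model y = theta0 + theta1 * x trained one instance at a time."""

    def __init__(self, learning_rate=0.05):
        self.theta0 = 0.0
        self.theta1 = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.theta0 + self.theta1 * x

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single instance.
        error = self.predict(x) - y
        self.theta0 -= self.learning_rate * error
        self.theta1 -= self.learning_rate * error * x

model = OnlineLinearModel()
# Simulated stream (hypothetical): instances arrive one by one, y = 1 + 2x.
for i in range(2000):
    x = float(i % 4)
    model.learn_one(x, 1.0 + 2.0 * x)
print(model.theta0, model.theta1)
```

Each call to `learn_one` is cheap, which is what allows the system to keep learning while it runs. The same property means a run of bad instances would steadily pull the parameters away from good values.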
Instance-based learning
The most trivial form of learning is simply to learn by heart. For example, a spam
filter created this way would just flag all emails that are identical to emails that
have already been flagged by users. It can be programmed to also flag emails that are very
similar to known spam emails. This requires a measure of similarity between two emails.
A similarity measure between two emails could be to count the number of words they have
in common. The system would flag an email as spam if it has many words in common with
a known spam email. This is called instance-based learning: the system learns the examples
by heart, then generalizes to new cases by using a similarity measure to compare them to
the learned examples (or a subset of them).
For example, in Figure 1-15 the new instance would be classified as a triangle
because the majority of the most similar instances belong to that class.
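The similarity measure described above (counting the words two emails have in common) can be sketched as follows. The threshold `min_common` and the example emails are assumptions chosen for illustration.

```python
def common_words(email_a, email_b):
    """Similarity measure: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def is_spam(email, known_spam, min_common=3):
    """Flag an email that shares many words with any memorized spam email."""
    return any(common_words(email, s) >= min_common for s in known_spam)

# The "learning" step is just memorizing the flagged examples (hypothetical).
known_spam = ["win a free credit card now", "amazing offer just 4u"]
print(is_spam("get your free credit card today", known_spam))
print(is_spam("meeting moved to friday", known_spam))
```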
Model-based learning
Another way to generalize from a set of examples is to build a model of these
examples and then use that model to make predictions. This is called model-based learning
(Figure 1-16).
Before using the model, the parameter values θ0 and θ1 must be defined. To do so, specify
a performance measure: it can be either a utility function (or fitness function) that measures
how good the model is, or a cost function that measures how bad it is.
For Linear Regression problems, a cost function is mostly used. It measures the
distance between the linear model’s predictions and the training examples; the objective is
to minimize this distance.
The Linear Regression algorithm is fed the training examples, and it finds the
parameters that make the linear model fit best to the data. This is called training the model.
In this case, the algorithm finds that the optimal parameter values are θ0 = 4.85 and θ1 =
4.91 × 10⁻⁵.
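As a sketch of what training means here, the code below fits a line by ordinary least squares and reports the mean squared error as the cost. It assumes the linear model has the form prediction = θ0 + θ1 × input; the toy data and function names are hypothetical, while the two parameter values used at the end are the ones quoted above.

```python
def mse_cost(theta0, theta1, xs, ys):
    """Cost function: mean squared distance between predictions and targets."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Toy training set (hypothetical) that lies exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1, mse_cost(theta0, theta1, xs, ys))

# Using the trained parameters from the text on an assumed GDP per capita.
prediction = 4.85 + 4.91e-5 * 20000
print(prediction)
```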
Here, the word “model” can refer to a type of model (e.g., Linear Regression), to a
fully specified model architecture (e.g., Linear Regression with one input and one output),
or to the final trained model ready to be used for predictions (e.g., Linear Regression with
one input and one output, using θ0 = 4.85 and θ1 = 4.91 × 10⁻⁵).
Model selection consists in choosing the type of model and fully specifying its
architecture. Training a model means running an algorithm to find the model parameters
that will make it best fit the training data (and hopefully make good predictions on new
data).
Finally, the model is ready to run and make predictions.
If an instance-based learning algorithm is used instead, the country with the closest GDP per
capita to that of Cyprus, namely Slovenia ($20,732), is identified, and since the OECD data tells
us that Slovenians’ life satisfaction is 5.7, a life satisfaction of 5.7 would be predicted for
Cyprus. The two next-closest countries can also be considered: Portugal and Spain, with life
satisfactions of 5.1 and 6.5, respectively. Averaging these three values gives 5.77, which is
pretty close to the model-based prediction. This simple algorithm is called k-Nearest Neighbors
regression (in this example, k = 3).
Replacing the Linear Regression model with k-Nearest Neighbors regression in the
previous code is as simple as replacing these two lines:
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()
with these two:
import sklearn.neighbors
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
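The averaging behind k-Nearest Neighbors regression can also be written out by hand, as in the sketch below. Slovenia's GDP per capita and the three life-satisfaction values come from the text; every other number (the GDP figures for Portugal, Spain, and Cyprus, and the two distant countries) is a placeholder assumed for illustration.

```python
def knn_regress(training_data, query_x, k=3):
    """Predict the mean target of the k training points closest to query_x."""
    nearest = sorted(training_data, key=lambda pair: abs(pair[0] - query_x))[:k]
    return sum(y for _, y in nearest) / k

# (GDP per capita, life satisfaction)
countries = [
    (20732, 5.7),   # Slovenia (both values from the text)
    (19000, 5.1),   # Portugal (GDP assumed)
    (26000, 6.5),   # Spain (GDP assumed)
    (9000, 4.9),    # hypothetical distant country
    (50000, 7.3),   # hypothetical distant country
]
cyprus_gdp = 22000  # assumed value
prediction = knn_regress(countries, cyprus_gdp, k=3)
print(round(prediction, 2))  # mean of 5.7, 5.1 and 6.5
```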
For example, the set of countries used earlier for training the linear model was not
perfectly representative; a few countries were missing. Figure 1-21 shows what the data
looks like when the missing countries are added. If a linear model is trained on this data,
we get the solid line, while the old model is represented by the dotted line. Observe that
not only does adding a few missing countries significantly alter the model, but it makes it
clear that such a simple linear model is probably never going to work well.
It seems that very rich countries are not happier than moderately rich countries (in
fact, they seem unhappier), and conversely some poor countries seem happier than many
rich countries. It is crucial to use a training set that is representative of the cases that are to
be generalized. If the sample is too small, it will have sampling noise (i.e., nonrepresentative
data as a result of chance). Even very large samples can be nonrepresentative if
the sampling method is flawed. This is called sampling bias.
3. Poor-Quality Data:
If the training data is full of errors, outliers, and noise (e.g., due to poor-quality
measurements), it will be harder for the system to detect the underlying patterns, so
the system is less likely to perform well. It is often well worth the effort to spend time
cleaning up the training data. The following are a couple of examples of when the training
data needs to be cleaned up:
• If some instances are clearly outliers, it may help to simply discard them or try to fix the
errors manually.
• If some instances are missing a few features (e.g., 5% of the customers did not specify
their age), it must be decided whether to ignore this attribute altogether, ignore these
instances, fill in the missing values (e.g., with the median age), or train one model with the
feature and one model without it.
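One of the options listed above, filling in missing values with the median, can be sketched as follows. Representing a missing age as `None` and the sample ages themselves are assumptions made for this example.

```python
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

# Hypothetical customer ages; None marks customers who did not specify one.
ages = [25, None, 40, 31, None, 58]
print(fill_missing_with_median(ages))
```

The median should be computed on the training set only and then reused to fill gaps in new data, so that the same value is applied everywhere.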
4. Irrelevant Features:
A critical part of the success of a Machine Learning project is coming up with a good
set of features to train on. This process, called feature engineering, involves the following
steps:
• Feature selection: Selecting the most useful features to train on among existing features
• Feature extraction: Combining existing features to produce a more useful one—
dimensionality reduction algorithms can help
• Creating new features by gathering new data
5. Overfitting the Training Data:
Overfitting means that the model performs well on the training data, but it does not
generalize well.
Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization. For example, the linear model defined earlier has two parameters, θ0 and
θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the
training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If θ1 were forced
to be 0, the algorithm would have only one degree of freedom and would have a much harder time
fitting the data properly: all it could do is move the line up or down to get as close as
possible to the training instances, so it would end up around the mean.
A very simple model indeed! If the algorithm is allowed to modify θ1 but is forced to keep it
small, then the learning algorithm will effectively have somewhere in between one and two
degrees of freedom. It will produce a model that’s simpler than one with two degrees of
freedom, but more complex than one with just one. The goal is to find the right balance between
fitting the training data perfectly and keeping the model simple enough to ensure that it will
generalize well.
Figure 1-23 shows three models. The dotted line represents the original model that
was trained on the countries represented as circles (without the countries represented as
squares), the dashed line is the second model trained with all countries (circles and squares),
and the solid line is a model trained with the same data as the first model but with a
regularization constraint. Observe that regularization forced the model to have a smaller
slope: this model does not fit the training data (circles) as well as the first model, but it
actually generalizes better to new examples that it did not see during training (squares).
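The effect of a regularization constraint on the slope can be sketched with a ridge-style (L2) penalty applied to θ1 only. The notes do not say which form of regularization was used for the figure, so this choice, the penalty strength `alpha`, and the toy data are all assumptions.

```python
def fit_line(xs, ys, alpha=0.0):
    """Least-squares line with an optional penalty alpha * theta1**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / (sxx + alpha)       # alpha > 0 shrinks the slope toward 0
    theta0 = mean_y - theta1 * mean_x  # the intercept is not penalized
    return theta0, theta1

# Toy training data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 9.0]
plain = fit_line(xs, ys)
regularized = fit_line(xs, ys, alpha=10.0)
print(plain, regularized)
```

With `alpha = 0` the model has its full two degrees of freedom; as `alpha` grows, the slope is pushed toward 0 and the line approaches a horizontal line at the mean.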
To decide between two models, one option is to train both and compare how well they
generalize using the test set. Now suppose that the linear model generalizes better, but you
want to apply some regularization to avoid overfitting.
To choose the value of the regularization hyperparameter, one option is to train 100
different models using 100 different values for this hyperparameter. Suppose you find the best
hyperparameter value, which produces a model with the lowest generalization error (say,
just 5% error). When this model is launched into production, however, it does not perform
as well as expected and produces 15% errors.
The problem is that the generalization error was measured multiple times on the test set, and
the model and hyperparameters were adapted to produce the best model for that particular set.
This means that the model is unlikely to perform as well on new data.
A common solution to this problem is called holdout validation. Simply hold out part
of the training set to evaluate several candidate models and select the best one. The new
held-out set is called the validation set or sometimes the development set, or dev set.
Train multiple models with various hyperparameters on the reduced training set i.e.,
the full training set minus the validation set, and select the model that performs best on the
validation set. After this holdout validation process, train the best model on the full training
set including the validation set, and this gives the final model. Lastly, evaluate this final
model on the test set to get an estimate of the generalization error.
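The holdout validation procedure can be sketched end to end as below: select the hyperparameter on the validation set, retrain on the training and validation data together, and touch the test set only once at the end. The synthetic data, the split sizes, the candidate values, and the ridge-style penalty on the slope are assumptions for illustration.

```python
import random

def fit_line(xs, ys, alpha):
    """Least-squares line with a penalty alpha * theta1**2 on the slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / (sxx + alpha)
    return mean_y - theta1 * mean_x, theta1

def mse(theta, xs, ys):
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (hypothetical): y = 1 + 2x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
data = [(x, 1.0 + 2.0 * x + rng.gauss(0, 1)) for x in xs]
train, validation, test = data[:36], data[36:48], data[48:]

def validation_error(alpha):
    # Train on the reduced training set, evaluate on the validation set.
    theta = fit_line(*zip(*train), alpha)
    return mse(theta, *zip(*validation))

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_alpha = min(candidates, key=validation_error)

# Retrain on the full training set (including the validation set),
# then estimate the generalization error once on the untouched test set.
final_theta = fit_line(*zip(*(train + validation)), best_alpha)
test_error = mse(final_theta, *zip(*test))
print(best_alpha, test_error)
```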
If the validation set is too small, then model evaluations will be imprecise: we may
end up selecting a suboptimal model by mistake. Conversely, if the validation set is too
large, then the remaining training set will be much smaller than the full training set. One
way to solve this problem is to perform repeated cross-validation, using many small
validation sets. Each model is evaluated once per validation set after it is trained on the rest
of the data. By averaging out all the evaluations of a model, a much more accurate measure
of its performance is obtained. There is a drawback, however: the training time is
multiplied by the number of validation sets.
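A small sketch of cross-validation is given below: the data is cut into k folds, each fold serves once as the validation set, and the k scores are averaged. The two candidate models (a constant model and a line), the interleaved fold assignment, and the noise-free toy data are assumptions for illustration.

```python
def fit_mean(rows):
    """Constant model: always predict the mean target (slope fixed at 0)."""
    return sum(y for _, y in rows) / len(rows), 0.0

def fit_line(rows):
    """Ordinary least-squares line; returns (theta0, theta1)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    theta1 = sxy / sxx
    return mean_y - theta1 * mean_x, theta1

def mse(theta, rows):
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in rows) / len(rows)

def cross_validate(rows, k, fit):
    """Average validation error over k folds; the model is trained k times."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(mse(fit(training), validation))
    return sum(scores) / k

# Toy data (hypothetical) lying exactly on y = 1 + 2x.
rows = [(float(x), 1.0 + 2.0 * x) for x in range(12)]
mean_cv = cross_validate(rows, k=4, fit=fit_mean)
line_cv = cross_validate(rows, k=4, fit=fit_line)
print(mean_cv, line_cv)
```

Each call to `cross_validate` trains the model k times, which is the training-time cost mentioned above.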
9. Data Mismatch:
It’s easy to get a large amount of data for training, but this data probably won’t be
perfectly representative of the data that will be used in production. For example, suppose
we want to create a mobile app to take pictures of flowers and automatically determine
their species. We can easily download millions of pictures of flowers on the web, but they
won’t be perfectly representative of the pictures that will actually be taken using the app
on a mobile device. Perhaps we only have 10,000 representative pictures (i.e., actually
taken with the app). In this case, the most important rule to remember is that the validation
set and the test set must be as representative as possible of the data you expect to use in
production, so they should be composed exclusively of representative pictures: we can
shuffle them and put half in the validation set and half in the test set.
Hold out some of the training pictures (from the web) in yet another set called the
train-dev set. After the model is trained on the training set, evaluate it on the train-dev set.
If it performs well, then the model is not overfitting the training set. If it then performs
poorly on the validation set, the problem must be coming from the data mismatch.
Try to tackle this problem by preprocessing the web images to make them look
more like the pictures that will be taken by the mobile app, and then retraining the model.
Conversely, if the model performs poorly on the train-dev set, then it must have overfit the
training set, so try to simplify or regularize the model, get more training data, and clean up
the training data.
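The reasoning above amounts to a simple decision procedure, sketched below. The function name `diagnose`, the `gap` threshold used to decide that one error rate is noticeably worse than another, and the example error rates are assumptions; in practice the threshold depends on the task.

```python
def diagnose(train_error, train_dev_error, validation_error, gap=0.05):
    """Compare error rates on the training, train-dev and validation sets."""
    if train_dev_error - train_error > gap:
        # Poor on held-out data drawn from the same source as the training set.
        return "overfitting"
    if validation_error - train_dev_error > gap:
        # Fine on train-dev, poor on representative data.
        return "data mismatch"
    return "ok"

# Hypothetical error rates for the flower-classification app.
print(diagnose(0.02, 0.15, 0.16))  # train-dev much worse than training
print(diagnose(0.02, 0.03, 0.20))  # validation much worse than train-dev
```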