
UNIT 1

Machine Learning:

Machine Learning is the science (and art) of programming computers so they can learn
from data. Arthur Samuel in the year 1959 defined Machine Learning as the field of study that
gives computers the ability to learn without being explicitly programmed.
Learning:
A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with experience
E.
For example, a spam filter is a Machine Learning program that, given examples of spam
emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails, can
learn to flag spam. The examples that the system uses to learn are called the training set. Each
training example is called a training instance (or sample).
In this case, the task T is to flag spam for new emails, the experience E is the training data,
and the performance measure P needs to be defined; for example, the ratio of correctly classified
emails can be used as a performance measure. This particular performance measure is called
accuracy, and it is often used in classification tasks.
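As a quick illustration of this performance measure, here is a minimal Python sketch; the predicted and actual labels below are made up purely for illustration:

# Hypothetical predicted and true labels for six emails ("spam" or "ham")
predicted = ["spam", "ham", "spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "ham", "ham", "ham", "spam"]

# Accuracy P = number of correct predictions / total number of predictions
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # 5 correct out of 6, about 0.83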
Examples:
i) Handwriting recognition learning problem
• Task T: Recognizing and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training Experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training Experience E: A sequence of images and steering commands recorded while
observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Definition
A computer program which learns from experience is called a machine learning program
or simply a learning program. Such a program is sometimes also referred to as a learner.
Components of Learning
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates
the various components and the steps involved in the learning process.

1. Data storage :
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning. In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals. Computers use hard disk drives, flash memory, random access memory
and similar devices to store data and use cables and other technology to retrieve data.
2. Abstraction :
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts about
the data as a whole. The creation of knowledge involves application of known models and creation
of new models. The process of fitting a model to a dataset is known as training. When the model
has been trained, the data is transformed into an abstract form that summarizes the original
information.
3. Generalization :
The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form that
can be utilized for future action. These actions are to be carried out on tasks that are similar, but
not identical, to those that have been seen before. In generalization, the goal is to discover those
properties of the data that will be most relevant to future tasks.
4. Evaluation :
Evaluation is the last component of the learning process. It is the process of giving feedback
to the user to measure the utility of the learned knowledge. This feedback is then utilized to effect
improvements in the whole learning process.

Traditional Approaches:
Consider the steps involved in writing a spam filter using traditional programming techniques:
1. First, analyze what spam typically looks like. Notice that some words or phrases (such
as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject line, and
that other patterns appear in the sender’s name, the email’s body, and other parts of the email.
2. Write a detection algorithm for each of the patterns noticed, and the program would flag
emails as spam if a number of these patterns were detected.
3. Test the program and repeat steps 1 and 2 until it is good enough to launch.

Since the problem is difficult, the program will likely become a long list of complex rules—pretty
hard to maintain.
In contrast, a spam filter based on Machine Learning techniques automatically learns which
words and phrases are good predictors of spam by detecting unusually frequent patterns of words
in the spam examples compared to the ham examples. The program is much shorter, easier to
maintain, and most likely more accurate.

Moreover, if spammers notice that their emails containing “4U” are getting blocked, they
may start writing “For U” instead. A spam filter based on Machine Learning techniques
automatically notices that “For U” has become unusually frequent in spam flagged by users,
and it starts flagging such emails without your intervention (Figure 1-3).
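The idea of spotting unusually frequent words can be illustrated with a minimal sketch; the toy emails below are made up, and a real filter would of course learn from a large labeled training set:

from collections import Counter

# Toy labeled emails (hypothetical)
spam_emails = ["win a free credit card 4U", "free amazing offer 4U now"]
ham_emails = ["meeting agenda for tomorrow", "please review the free draft"]

spam_counts = Counter(word for email in spam_emails for word in email.split())
ham_counts = Counter(word for email in ham_emails for word in email.split())

# Words that show up more often in spam than in ham are candidate predictors of spam
for word in spam_counts:
    if spam_counts[word] > ham_counts.get(word, 0):
        print(word, spam_counts[word], ham_counts.get(word, 0))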

Machine Learning can help humans learn (Figure 1-4). ML algorithms can be inspected
to see what they have learned (although for some algorithms this can be tricky). For instance,
once a spam filter has been trained on enough spam, it can easily be inspected to reveal the list of
words and combinations of words that it believes are the best predictors of spam. Sometimes this
will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of
the problem. Applying ML techniques to dig into large amounts of data can help discover
patterns that were not immediately apparent. This is called data mining.

To summarize, Machine Learning is great for:


• Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one
Machine Learning algorithm can often simplify code and perform better than the traditional
approach.
• Complex problems for which using a traditional approach yields no good solution: the best
Machine Learning techniques can perhaps find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
Examples of Applications
Examples of Machine Learning tasks, along with the techniques that can tackle them:
1. Analyzing images of products on a production line to automatically classify them:
This is image classification and is typically performed using convolutional neural
networks (CNNs).
2. Detecting tumors in brain scans:
This is semantic segmentation, where each pixel in the image is classified to
determine the exact location and shape of tumors, typically using CNNs as well.
3. Automatically classifying news articles:
This is natural language processing (NLP), and more specifically text classification,
which can be tackled using recurrent neural networks (RNNs), CNNs, or Transformers.
4. Automatically flagging offensive comments on discussion forums:
This is also text classification, using the same NLP tools. A related task is summarizing
long documents automatically; this is a branch of NLP called text summarization, again
using the same tools.
5. Creating a chatbot or a personal assistant:
This involves many NLP components, including natural language understanding
(NLU) and question-answering modules.
6. Forecasting the company’s revenue next year, based on many performance metrics:
This is a regression task (i.e., predicting values) that may be tackled using any
regression model, such as a Linear Regression or Polynomial Regression model, a
regression SVM, a regression Random Forest, or an artificial neural network.
7. Making your app react to voice commands:
This is speech recognition, which requires processing audio samples: since they are
long and complex sequences, they are typically processed using RNNs, CNNs, or
Transformers
8. Detecting credit card fraud:
This is anomaly detection. A related task is segmenting clients based on their purchases
so that a different marketing strategy can be designed for each segment; this is clustering.
9. Representing a complex, high-dimensional dataset in a clear and insightful diagram:
This is data visualization, often involving dimensionality reduction techniques.
10. Recommending a product that a client may be interested in, based on past purchases:
This is a recommender system. One approach is to feed past purchases and other
information about the client to an artificial neural network, typically trained on past
sequences of purchases across all clients.
11. Building an intelligent bot for a game:
This is often tackled using Reinforcement Learning, which is a branch of Machine
Learning that trains agents (such as bots) to pick the actions that will maximize their
rewards over time (e.g., a bot may get a reward every time the player loses some life
points), within a given environment (such as the game).
The famous AlphaGo program that beat the world champion at the game of Go was built
using RL.
Types of Machine Learning Systems:
There are so many different types of Machine Learning systems that it is useful to
classify them in broad categories, based on the following criteria:
• Whether or not they are trained with human supervision (supervised, unsupervised,
semi-supervised, and Reinforcement Learning)
• Whether or not they can learn incrementally on the fly (online versus batch
learning)
• Whether they work by simply comparing new data points to known data points, or
instead by detecting patterns in the training data and building a predictive model,
much like scientists do (instance-based versus model-based learning)

These criteria can be combined in any way depending on the application. For example, a
state-of-the-art spam filter may learn on the fly using a deep neural network model trained
using examples of spam and ham; this makes it an online, model-based, supervised learning
system.
Supervised/Unsupervised Learning:
Machine Learning systems can be classified according to the amount and type of
supervision they get during training.
There are four major categories:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning, and
4. Reinforcement Learning

Supervised Learning:
In supervised learning, the training set fed to the algorithm includes the desired
solutions, called labels.
It has two typical tasks to be performed:
1. Classification
2. Regression
For example, the spam filter is a good example of classification: it is trained with many
example emails along with their class (spam or ham), and it must learn how to classify new
emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a
set of features (mileage, age, brand, etc.) called predictors. This sort of task is called
regression. To train the system, many examples of cars, including both their predictors and
their labels (i.e., their prices) must be given to it.
In Machine Learning, an attribute is a data type (e.g., “mileage”), while a feature generally
means an attribute plus its value (e.g., “mileage = 15,000”).
Some regression algorithms can be used for classification as well, and vice versa. For
example, Logistic Regression is commonly used for classification, as it can output a value
that corresponds to the probability of belonging to a given class (e.g., 20% chance of being
spam).
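A minimal sketch of this behaviour with Scikit-Learn's LogisticRegression, on made-up one-feature data (the numbers are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature per email, label 0 = ham, 1 = spam (made up)
X = np.array([[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the probability of each class (ham, spam)
print(clf.predict_proba([[0.8]]))
# predict returns the class with the highest probability
print(clf.predict([[0.8]]))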

Some of the most important supervised learning algorithms:
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised learning:
In unsupervised learning, the training data is unlabeled. The system tries to learn
without a teacher.
Some of the most important unsupervised learning algorithms are:
• Clustering
—K-Means
—DBSCAN
—Hierarchical Cluster Analysis (HCA)
• Anomaly detection and novelty detection
—One-class SVM
—Isolation Forest
• Visualization and dimensionality reduction
—Principal Component Analysis (PCA)
—Kernel PCA
—Locally Linear Embedding (LLE)
—t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
—Apriori
—Eclat

Clustering:
For example, say there is a lot of data about a blog’s visitors. A clustering algorithm
can be used to detect groups of similar visitors. The algorithm is not given any information
about which group a visitor belongs to; it finds those connections by itself. For example, it
might notice that 40% of your visitors are males who love comic books and generally read
your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends.
If a hierarchical clustering algorithm is used, it may also subdivide each group into smaller
groups. This may help target the posts for each group.
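A minimal clustering sketch with Scikit-Learn's KMeans on synthetic visitor data; the two features and the group centers are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical visitor features: [age, visits per week]
rng = np.random.default_rng(42)
visitors = np.vstack([
    rng.normal([25, 6], 1.5, size=(50, 2)),  # one group of visitors
    rng.normal([45, 2], 1.5, size=(50, 2)),  # another group
])

# Ask K-Means to find 2 groups; no labels are provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(visitors)

print(kmeans.cluster_centers_)  # approximate centers of the two groups
print(groups[:10])              # cluster index assigned to the first visitors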



Visualization:
Visualization algorithms are also good examples of unsupervised learning
algorithms. A lot of complex and unlabeled data is fed to them, and they output a 2D or 3D
representation of the data that can easily be plotted. These algorithms try to preserve as
much structure as they can (e.g., trying to keep separate clusters in the input space from
overlapping in the visualization) to understand how the data is organized and perhaps
identify unsuspected patterns.

Dimensionality Reduction:
A related task is dimensionality reduction, in which the goal is to simplify the data
without losing too much information. One way to do this is to merge several correlated
features into one.
For example, a car’s mileage may be strongly correlated with its age, so the dimensionality
reduction algorithm will merge them into one feature that represents the car’s wear and
tear. This is called feature extraction.
It is often good to reduce the dimension of the training data using a dimensionality
reduction algorithm before feeding it to another Machine Learning algorithm such as a
supervised learning algorithm. It will run much faster, the data will take up less disk and
memory space, and in some cases it may also perform better.
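A minimal sketch of this kind of feature extraction, using PCA on synthetic car data (the mileage and age figures are made up; in practice the features would usually be scaled first):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical car data: [mileage, age], strongly correlated by construction
rng = np.random.default_rng(0)
age = rng.uniform(1, 15, size=100)
mileage = 12000 * age + rng.normal(0, 5000, size=100)
cars = np.column_stack([mileage, age])

# Reduce the two correlated features to a single "wear and tear" component
pca = PCA(n_components=1)
wear = pca.fit_transform(cars)

print(pca.explained_variance_ratio_)  # close to 1.0, so little information is lost
print(wear[:5])                       # one value per car instead of two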

Anomaly Detection:
Another important unsupervised task is anomaly detection—for example, detecting
unusual credit card transactions to prevent fraud, catching manufacturing defects, or
automatically removing outliers from a dataset before feeding it to another learning
algorithm. The system is shown mostly normal instances during training, so it learns to
recognize them; then, when it sees a new instance, it can tell whether it looks like a normal
one or whether it is likely an anomaly.

A very similar task is novelty detection: it aims to detect new instances that look
different from all instances in the training set. This requires having a very “clean” training
set, devoid of any instance that you would like the algorithm to detect. For example, if you
have thousands of pictures of dogs, and 1% of these pictures represent Chihuahuas, then a
novelty detection algorithm should not treat new pictures of Chihuahuas as novelties. On
the other hand, anomaly detection algorithms may consider these dogs as so rare and so
different from other dogs that they would likely classify them as anomalies.
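A minimal anomaly-detection sketch using Scikit-Learn's IsolationForest on synthetic transaction amounts (the figures are made up, and exact results can vary):

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical transaction amounts: mostly normal, plus a few extreme values
rng = np.random.default_rng(1)
normal = rng.normal(50, 10, size=(200, 1))    # typical purchases
outliers = np.array([[500.0], [900.0]])       # unusually large purchases
transactions = np.vstack([normal, outliers])

detector = IsolationForest(random_state=1)
detector.fit(transactions)

# predict returns +1 for instances that look normal and -1 for likely anomalies
print(detector.predict([[55.0], [800.0]]))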
Association rule learning:
Finally, another common unsupervised task is association rule learning, in which
the goal is to dig into large amounts of data and discover interesting relations between
attributes. For example, consider a supermarket. Running an association rule on its sales
logs may reveal that people who purchase barbecue sauce and potato chips also tend to
buy steak. Thus, these items are placed close to one another.
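The idea can be illustrated with a crude pair-counting sketch on made-up baskets; real association rule learners such as Apriori or Eclat do this far more efficiently and also compute support and confidence:

from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets
baskets = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"potato chips", "soda"},
    {"barbecue sauce", "potato chips", "steak"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)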

Semi-supervised Learning:
Since labeling data is usually time-consuming and costly, often there are plenty of
unlabeled instances, and few labeled instances. Some algorithms can deal with data that’s
partially labeled. This is called Semi-supervised learning.

Some photo-hosting services, such as Google Photos, are good examples of this.
Once all the family photos are uploaded to the service, it automatically recognizes
that the same person A shows up in photos 1, 5, and 11, while another person B shows up
in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all
the system needs is to be told who these people are. Add just one label per person and it is
able to name everyone in every photo, which is useful for searching photos.
Most semi-supervised learning algorithms are combinations of unsupervised and
supervised algorithms. For example, deep belief networks (DBNs) are based on
unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of
one another. RBMs are trained sequentially in an unsupervised manner, and then the whole
system is fine-tuned using supervised learning techniques.

Reinforcement Learning:
Reinforcement Learning is very different from the above types. The learning
system, called an agent in this context, can observe the environment, select and perform
actions, and get rewards in return (or penalties in the form of negative rewards). It must
then learn by itself the best strategy, called a policy, to get the most reward over
time. A policy defines what action the agent should choose when it is in a given situation.
For example, many robots implement Reinforcement Learning algorithms to learn how to
walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning:
it made the headlines in May 2017 when it beat the world champion Ke Jie at the game of
Go. It learned its winning policy by analyzing millions of games, and then playing many
games against itself. Note that learning was turned off during the games against the
champion; AlphaGo was just applying the policy it had learned.

Batch and Online Learning:


Another criterion used to classify Machine Learning systems is whether or not the
system can learn incrementally from a stream of incoming data.

Batch learning:
In batch learning, the system is incapable of learning incrementally: it must be
trained using all the available data. This will generally take a lot of time and computing
resources, so it is typically done offline. First the system is trained, and then it is launched
into production and runs without learning anymore; it just applies what it has learned. This
is called offline learning.
For a batch learning system to know about new data (such as a new type of spam),
a new version of the system needs to be trained from scratch on the full dataset (not
just the new data, but also the old data); the old system is then stopped and replaced with
the new one.
The whole process of training, evaluating, and launching a Machine Learning
system can be automated fairly easily, so even a batch learning system can adapt to change.
Simply update the data and train a new version of the system from scratch as often as
needed.
This solution is simple and often works fine, but training using the full set of data
can take many hours, so typically a new system is trained only every 24 hours or even just
weekly. If the system needs to adapt to rapidly changing data (e.g., to predict stock prices),
then a more reactive solution is needed.
Also, training on the full set of data requires a lot of computing resources like CPU,
memory space, disk space, disk I/O, network I/O, etc. If there is a lot of data and there is a
need to automate the system to train from scratch every day, it will end up costing a lot of
money. If the amount of data is huge, it may even be impossible to use a batch learning
algorithm.
Finally, if the system needs to be able to learn autonomously and it has limited
resources (e.g., a smartphone application or a rover on Mars), then carrying around large
amounts of training data and taking up a lot of resources to train for hours every day is a
showstopper. A better option in all these cases is to use algorithms that are capable of
learning incrementally.


Online learning:
In online learning the system is trained incrementally by feeding it data instances
sequentially, either individually or in small groups called mini-batches. Each learning step
is fast and cheap, so the system can learn about new data on the fly, as it arrives.

Online learning is great for systems that receive data as a continuous flow (e.g.,
stock prices) and need to adapt to change rapidly or autonomously. It is also a good option
if the computing resources are limited: once an online learning system has learned about
new data instances, it does not need them anymore, so they can be discarded. This can save
a huge amount of space.
Online learning algorithms can also be used to train systems on huge datasets that
cannot fit in one machine’s main memory (this is called out-of-core learning). The
algorithm loads part of the data, runs a training step on that data, and repeats the process
until it has run on all of the data (see Figure 1-14).
One important parameter of online learning systems is how fast they should adapt
to changing data: this is called the learning rate. If the learning rate is high, then the system
will rapidly adapt to new data, but it will also tend to quickly forget the old data. Conversely,
if a low learning rate is set, the system will have more inertia; that is, it will learn more
slowly, but it will also be less sensitive to noise in the new data or to sequences of non-
representative data points (outliers).
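A minimal sketch of incremental training with Scikit-Learn's SGDRegressor and partial_fit on a synthetic stream of mini-batches; the data and the constant learning rate below are illustrative choices, not recommendations:

import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical stream: y = 3x + noise, arriving in mini-batches of 20 instances
rng = np.random.default_rng(7)
model = SGDRegressor(learning_rate="constant", eta0=0.001)

for _ in range(100):                     # each iteration = one incoming mini-batch
    X_batch = rng.uniform(0, 10, size=(20, 1))
    y_batch = 3 * X_batch.ravel() + rng.normal(0, 0.5, size=20)
    model.partial_fit(X_batch, y_batch)  # one fast, cheap learning step

print(model.coef_, model.intercept_)     # the coefficient should end up close to 3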

A big challenge with online learning is that if bad data is fed to the system, the
system’s performance will gradually decline. If it’s a live system, the clients will notice.
For example, bad data could come from a malfunctioning sensor on a robot, or from
someone spamming a search engine to try to rank high in search results. To reduce this
risk, the system must be monitored closely: if a drop in performance is detected, learning
should be switched off promptly and the system possibly reverted to a previously working
state. The input data can also be monitored, and abnormal data can be detected using an
anomaly detection algorithm.

Instance-Based Versus Model-Based Learning
One more way to categorize Machine Learning systems is by how they generalize.
Most Machine Learning tasks are about making predictions. This means that given a
number of training examples, the system needs to be able to make good predictions for
(generalize to) examples it has never seen before. Having a good performance measure on
the training data is good, but insufficient; the true goal is to perform well on new instances.
There are two main approaches to generalization:
1. Instance-based learning and
2. Model-based learning.

Instance-based learning
The most trivial form of learning is simply to learn by heart. For example, a spam filter
created this way would just flag all emails that are identical to emails that have already
been flagged by users. It can be programmed to also flag emails that are very
similar to known spam emails. This requires a measure of similarity between two emails.
A similarity measure between two emails could be to count the number of words they have
in common. The system would flag an email as spam if it has many words in common with
a known spam email. This is called instance-based learning: the system learns the examples
by heart, then generalizes to new cases by using a similarity measure to compare them to
the learned examples (or a subset of them).
For example, in Figure 1-15 the new instance would be classified as a triangle
because the majority of the most similar instances belong to that class.
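A minimal instance-based sketch along these lines, using word overlap as the similarity measure on made-up emails:

# Hypothetical known emails, learned "by heart" along with their labels
known_emails = [
    ("win a free credit card now", "spam"),
    ("free amazing offer just 4U", "spam"),
    ("meeting agenda for tomorrow", "ham"),
]

def similarity(email_a, email_b):
    # Similarity measure: number of words the two emails have in common
    return len(set(email_a.split()) & set(email_b.split()))

def classify(new_email):
    # Compare the new instance to every learned instance and copy the label
    # of the most similar one (1-nearest neighbor with word overlap)
    best = max(known_emails, key=lambda pair: similarity(new_email, pair[0]))
    return best[1]

print(classify("claim your free credit card"))  # expected: spam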

Model-based learning
Another way to generalize from a set of examples is to build a model of these
examples and then use that model to make predictions. This is called model-based learning
(Figure 1-16).

Example:
For example, to find out whether money makes people happy,
download the Better Life Index data from the OECD’s website and stats about gross
domestic product (GDP) per capita from the IMF’s website. Then join the tables and sort
by GDP per capita.
Table 1-1 shows an excerpt of this.

Plot the data for these countries (Figure 1-17).

Although the data is noisy (i.e., partly random), it looks like life satisfaction goes
up more or less linearly as the country’s GDP per capita increases. So life satisfaction can
be modeled as a linear function of GDP per capita. This step is called model selection: a
linear model of life satisfaction with just one attribute, GDP per capita, is selected.
A simple linear model
life_satisfaction = θ0 + θ1 × GDP_per_capita
This model has two model parameters, θ0 and θ1. By tweaking these parameters, the model can
represent any linear function.

Before using the model, the parameter values θ0 and θ1 must be defined. Specify
a performance measure: it can be either a utility function (or fitness function) that measures
how good the model is, or a cost function that measures how bad it is.
For Linear Regression problems, a cost function is typically used. It measures the
distance between the linear model’s predictions and the training examples; the objective is
to minimize this distance.
The Linear Regression algorithm is fed the training examples, and it finds the
parameter values that make the linear model fit the data best. This is called training the model.
In this case, the algorithm finds that the optimal parameter values are θ0 = 4.85 and θ1 =
4.91 × 10–5.
Here, the word “model” can refer to a type of model (e.g., Linear Regression), to a
fully specified model architecture (e.g., Linear Regression with one input and one output),
or to the final trained model ready to be used for predictions (e.g., Linear Regression with
one input and one output, using θ0 = 4.85 and θ1 = 4.91 × 10–5).
Model selection consists in choosing the type of model and fully specifying its
architecture. Training a model means running an algorithm to find the model parameters
that will make it best fit the training data (and hopefully make good predictions on new
data).
Finally, the model is ready to make predictions. For example, say we
want to know how happy Cypriots are, and the Organization for Economic Co-operation
and Development (OECD) data does not have the answer. The model can be used to make
a good prediction: plugging Cyprus’s GDP per capita, $22,587, into the model gives a life
satisfaction of around 4.85 + 22,587 × 4.91 × 10–5 ≈ 5.96.

Example 1-1. Training and running a linear model using Scikit-Learn

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data (prepare_country_stats is a helper, defined in the accompanying
# material, that merges the two tables and keeps the "GDP per capita" and
# "Life satisfaction" columns)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]

If an instance-based learning algorithm is used instead, the country with the GDP per capita
closest to that of Cyprus, Slovenia ($20,732), is identified, and since the OECD data tells us that
Slovenians’ life satisfaction is 5.7, a life satisfaction of 5.7 would be predicted for Cyprus. The two
next-closest countries, Portugal and Spain, with life satisfactions of 5.1 and 6.5 respectively, can
also be considered. Averaging these three values gives 5.77, which is pretty close to the
model-based prediction. This simple algorithm is called k-Nearest Neighbors regression (in this
example, k = 3).
Replacing the Linear Regression model with k-Nearest Neighbors regression in the
previous code is as simple as replacing these two lines:

import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()

with these two:

import sklearn.neighbors
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

Main Challenges of Machine Learning:

1. Insufficient Quantity of Training Data:


It takes a lot of data for most Machine Learning algorithms to work properly. Even
very simple problems typically need thousands of examples, and complex problems
such as image or speech recognition may need millions of examples.

2. Nonrepresentative Training Data:
In order to generalize well, it is crucial that the training data be representative of the
new cases to which the model must generalize. This is true for both instance-based learning and
model-based learning.

For example, the set of countries used earlier for training the linear model was not
perfectly representative; a few countries were missing. Figure 1-21 shows what the data
looks like when the missing countries are added. If a linear model is trained on this data,
we get the solid line, while the old model is represented by the dotted line. Observe that
not only does adding a few missing countries significantly alter the model, but it makes it
clear that such a simple linear model is probably never going to work well.
It seems that very rich countries are not happier than moderately rich countries (in
fact, they seem unhappier), and conversely some poor countries seem happier than many
rich countries. It is crucial to use a training set that is representative of the cases that are to
be generalized. If the sample is too small, it will have sampling noise (i.e., non-
representative data as a result of chance). Even very large samples can be non-representative
if the sampling method is flawed; this is called sampling bias.
3. Poor-Quality Data:
If the training data is full of errors, outliers, and noise (e.g., due to poor-quality
measurements), it will make it harder for the system to detect the underlying patterns, so
the system is less likely to perform well. It is often well worth the effort to spend time
cleaning up the training data. The following are a couple of examples of when the training
data needs to be cleaned up:
• If some instances are clearly outliers, it may help to simply discard them or try to fix the
errors manually.
• If some instances are missing a few features (e.g., 5% of the customers did not specify
their age), it must be decided whether to ignore this attribute altogether, ignore these
instances, fill in the missing values (e.g., with the median age, as in the sketch below), or
train one model with the feature and one model without it.
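A minimal sketch of the median-fill option using Scikit-Learn's SimpleImputer, on made-up ages:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical customer ages, with some values missing (np.nan)
ages = np.array([[25.0], [32.0], [np.nan], [41.0], [np.nan], [29.0]])

# Fill in missing values with the median age
imputer = SimpleImputer(strategy="median")
ages_filled = imputer.fit_transform(ages)

print(imputer.statistics_)   # the median used for filling
print(ages_filled.ravel())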

4. Irrelevant Features:
A critical part of the success of a Machine Learning project is coming up with a good
set of features to train on. This process, called feature engineering, involves the following
steps:
• Feature selection: Selecting the most useful features to train on among existing features (a sketch follows this list)
• Feature extraction: Combining existing features to produce a more useful one—
dimensionality reduction algorithms can help
• Creating new features by gathering new data
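A minimal feature-selection sketch with Scikit-Learn's SelectKBest on synthetic data; the features and scoring function here are illustrative assumptions:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Feature selection: keep the k features most related to the target
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # which features were kept (the noise feature is dropped)
print(X_selected.shape)        # (100, 2)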
5. Overfitting the Training Data:
Overfitting means that the model performs well on the training data, but it does not
generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model
that strongly overfits the training data. Even though it performs much better on the training
data than the simple linear model, can its predictions really be trusted? For example, feed the life
satisfaction model many more attributes, including uninformative ones such as the
country’s name. In that case, a complex model may detect patterns like the fact that all
countries in the training data with a ‘w’ in their name have a life satisfaction greater than
7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5).
There is no guarantee that this w-satisfaction rule generalizes to Rwanda or Zimbabwe. Obviously
this pattern occurred in the training data by pure chance, but the model has no way to tell whether
a pattern is real or simply the result of noise in the data.
Overfitting happens when the model is too complex relative to the amount and
noisiness of the training data. Here are possible solutions:
• Simplify the model by selecting one with fewer parameters (e.g., a linear model rather
than a high-degree polynomial model), by reducing the number of attributes in the training
data, or by constraining the model.
• Gather more training data.
• Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization. For example, the linear model defined earlier has two parameters, θ0 and
θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the
training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If θ1 = 0, the
algorithm would have only one degree of freedom and would have a much harder time
fitting the data properly: all it could do is move the line up or down to get as close as
possible to the training instances, so it would end up around the mean.
A very simple model indeed! If the algorithm is allowed to modify θ1 but is forced to keep it
small, then the learning algorithm will effectively have somewhere in between one and two
degrees of freedom. It will produce a model that’s simpler than one with two degrees of
freedom, but more complex than one with just one. The right balance must be found between
fitting the training data well and keeping the model simple enough to ensure that it
generalizes well.

Figure 1-23 shows three models. The dotted line represents the original model that
was trained on the countries represented as circles (without the countries represented as
squares), the dashed line is the second model trained with all countries (circles and squares),
and the solid line is a model trained with the same data as the first model but with a
regularization constraint. Observe that regularization forced the model to have a smaller
slope: this model does not fit the training data (circles) as well as the first model, but it
actually generalizes better to new examples that it did not see during training (squares).

The amount of regularization to apply during learning can be controlled by a
hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model).
As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains
constant during training. If the regularization hyperparameter is set to a very large value,
an almost flat model (a slope close to zero) is obtained; the learning algorithm will almost
certainly not overfit the training data, but it will be less likely to find a good solution.
Tuning hyperparameters is an important part of building a Machine Learning system.
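A minimal sketch of this effect using Scikit-Learn's Ridge regression on synthetic one-feature data (not the OECD data); the alpha parameter plays the role of the regularization hyperparameter:

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical one-feature dataset, just to show the effect of regularization
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(30, 1))
y = 4.5 + 0.5 * X.ravel() + rng.normal(0, 0.3, size=30)

for alpha in (0.001, 10.0, 1e4):
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # A larger regularization hyperparameter forces a smaller slope;
    # with a huge alpha the model is almost flat
    print(alpha, model.coef_[0], model.intercept_)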

6. Under-fitting the Training Data:


Underfitting occurs when the model is too simple to learn the underlying structure of
the data. For example, a linear model of life satisfaction is prone to underfit; reality is just
more complex than the model, so its predictions are bound to be inaccurate, even on the
training examples. Here are the main options for fixing this problem:

• Select a more powerful model, with more parameters.
• Feed better features to the learning algorithm (feature engineering).
• Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).
7. Testing and Validating:
The only way to know how well a model will generalize to new cases is to actually
try it out on new cases. One way is to put the model in production and monitor how well it
performs; a better option is to split the data into two sets: the training set and
the test set. Train the model using the training set, and test it using the test set. The error
rate on new cases is called the generalization error (or out-of-sample error), and by
evaluating the model on the test set, an estimate of this error is found. This value tells how
well the model will perform on instances it has never seen before. If the training error is
low but the generalization error is high, it means that your model is overfitting the training
data.
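A minimal sketch of this split on synthetic data, comparing the training error with the estimated generalization error:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset, purely to illustrate the split
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=200)

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

model = LinearRegression().fit(X_train, y_train)

# Training error vs. estimated generalization (out-of-sample) error
print(mean_squared_error(y_train, model.predict(X_train)))
print(mean_squared_error(y_test, model.predict(X_test)))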

8. Hyper-parameter Tuning and Model Selection:


The simplest way of evaluating a model is to use a test set.

To decide between two models: One option is to train both and compare how well they
generalize using the test set. Now suppose that the linear model generalizes better, but you
want to apply some regularization to avoid overfitting.

To choose the value of the regularization hyper-parameter: One option is to train 100
different models using 100 different values for this hyper-parameter. Suppose the best
hyper-parameter value produces a model with the lowest generalization error, say
just 5% error. If this model is launched into production, unfortunately it does not perform
as well as expected and produces 15% errors.

The problem is that the generalization error was measured multiple times on the test set,
and the model and hyper-parameters were adapted to produce the best model for that
particular set. This means that the model is unlikely to perform as well on new data.

A common solution to this problem is called holdout validation. Simply hold out part
of the training set to evaluate several candidate models and select the best one. The new
held-out set is called the validation set or sometimes the development set, or dev set.

Train multiple models with various hyperparameters on the reduced training set i.e.,
the full training set minus the validation set, and select the model that performs best on the
validation set. After this holdout validation process, train the best model on the full training
set including the validation set, and this gives the final model. Lastly, evaluate this final
model on the test set to get an estimate of the generalization error.
If the validation set is too small, then model evaluations will be imprecise: we may
end up selecting a suboptimal model by mistake. Conversely, if the validation set is too
large, then the remaining training set will be much smaller than the full training set. One
way to solve this problem is to perform repeated cross-validation, using many small
validation sets. Each model is evaluated once per validation set after it is trained on the rest
of the data. By averaging out all the evaluations of a model, a much more accurate measure
of its performance is obtained. There is a drawback, however: the training time is
multiplied by the number of validation sets.
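A minimal holdout-validation sketch on synthetic data: a regularization hyper-parameter is chosen on a validation set, the best model is retrained on the full training set, and the generalization error is estimated once on the test set (the hyper-parameter values tried are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset, split into a training set and a test set
rng = np.random.default_rng(8)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=300)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8)

# Hold out part of the training set as a validation (dev) set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=8)

# Try several hyper-parameter values; keep the one that does best on the validation set
best_alpha, best_error = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    error = mean_squared_error(y_val, model.predict(X_val))
    if error < best_error:
        best_alpha, best_error = alpha, error

# Retrain the best model on the full training set, then estimate the
# generalization error once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
print(best_alpha, mean_squared_error(y_test, final_model.predict(X_test)))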

9. Data Mismatch:
It’s easy to get a large amount of data for training, but this data probably won’t be
perfectly representative of the data that will be used in production. For example, suppose
we want to create a mobile app to take pictures of flowers and automatically determine
their species. We can easily download millions of pictures of flowers on the web, but they
won’t be perfectly representative of the pictures that will actually be taken using the app
on a mobile device. Perhaps we only have 10,000 representative pictures (i.e., actually
taken with the app). In this case, the most important rule to remember is that the validation
set and the test set must be as representative as possible of the data you expect to use in
production, so they should be composed exclusively of representative pictures: we can
shuffle them and put half in the validation set and half in the test set.

Hold out some of the training pictures (from the web) in yet another set called the
train-dev set. After the model is trained on the training set, evaluate it on the train-dev set.
If it performs well, then the model is not overfitting the training set. If it then performs
poorly on the validation set, the problem must be coming from the data mismatch.
Try to tackle this problem by preprocessing the web images to make them look
more like the pictures that will be taken by the mobile app, and then retraining the model.
Conversely, if the model performs poorly on the train-dev set, then it must have overfit the
training set, so try to simplify or regularize the model, get more training data, and clean up
the training data.
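A minimal sketch of this splitting strategy, with dummy arrays standing in for the web and app pictures (the names, shapes, and split sizes are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: a large set of web pictures and a small set of pictures taken with the app
web_pictures, web_labels = np.zeros((1000, 32, 32, 3)), np.zeros(1000)
app_pictures, app_labels = np.zeros((100, 32, 32, 3)), np.zeros(100)

# The representative app pictures go only into the validation and test sets
val_pics, test_pics, val_labels, test_labels = train_test_split(
    app_pictures, app_labels, test_size=0.5, shuffle=True, random_state=0)

# Hold out some of the web pictures as the train-dev set; the rest is used for training
train_pics, train_dev_pics, train_labels, train_dev_labels = train_test_split(
    web_pictures, web_labels, test_size=0.1, random_state=0)

# Train on train_pics, then check the model on train_dev_pics: doing well there
# but poorly on val_pics points to data mismatch rather than overfitting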

Differences between Supervised and Unsupervised Learning:
