
Chapter 1. The Machine Learning Landscape

What Is Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data.

Why Use Machine Learning?
Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):

1. First you would consider what spam typically looks like. You might notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject line. Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and other parts of the email.
2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns were detected.
3. You would test your program and repeat steps 1 and 2 until it was good enough to launch.

Figure 1-1. The traditional approach

Since the problem is difficult, your program will likely become a long list of complex rules that is hard to maintain.

In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate.

Figure 1-2. The Machine Learning approach
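To make the contrast concrete, here is a minimal sketch of the Machine Learning approach to spam filtering, assuming scikit-learn is available; the tiny inline dataset is invented for illustration:

# Learn spam predictors from examples instead of hand-writing rules.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Free credit card offer 4U, amazing deal",  # spam
    "Meeting notes from Tuesday attached",      # ham
    "You won a free prize, click now",          # spam
    "Lunch tomorrow at noon?",                  # ham
]
labels = ["spam", "ham", "spam", "ham"]

# The pipeline counts words and learns which ones predict spam vs. ham.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["Amazing free offer just 4U"]))  # likely ['spam']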
Examples of Applications
Let’s look at some concrete examples of Machine Learning tasks, along with the techniques that can tackle them:

Analyzing images of products on a production line to automatically classify them
• This is image classification, typically performed using convolutional neural networks (CNNs).

Detecting tumors in brain scans
• This is semantic segmentation, where each pixel in the image is classified (as we want to determine the exact location and shape of tumors), typically using CNNs as well.

Automatically classifying news articles
• This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs), CNNs, or Transformers.

Automatically flagging offensive comments on discussion forums
• This is also text classification, using the same NLP tools.

Summarizing long documents automatically
• This is a branch of NLP called text summarization, again using the same tools.

Creating a chatbot or a personal assistant
• This involves many NLP components, including natural language understanding (NLU) and question-answering modules.

Forecasting your company’s revenue next year, based on many performance metrics
• This is a regression task (i.e., predicting values) that may be tackled using any regression model, such as a Linear Regression or Polynomial Regression model, a regression SVM, a regression Random Forest, or an artificial neural network. If you want to take into account sequences of past performance metrics, you may want to use RNNs, CNNs, or Transformers.

Making your app react to voice commands
• This is speech recognition, which requires processing audio samples: since they are long and complex sequences, they are typically processed using RNNs, CNNs, or Transformers.

Detecting credit card fraud
• This is anomaly detection.

Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment
• This is clustering (see the sketch after this list).

Representing a complex, high-dimensional dataset in a clear and insightful diagram
• This is data visualization, often involving dimensionality reduction techniques.

Recommending a product that a client may be interested in, based on past purchases
• This is a recommender system. One approach is to feed past purchases (and other information about the client) to an artificial neural network and get it to output the most likely next purchase. This neural net would typically be trained on past sequences of purchases across all clients.

Building an intelligent bot for a game
• This is often tackled using Reinforcement Learning (RL), which is a branch of Machine Learning that trains agents (such as bots) to pick the actions that will maximize their rewards over time (e.g., a bot may get a reward every time the player loses some life points), within a given environment (such as the game). The famous AlphaGo program that beat the world champion at the game of Go was built using RL.

This list could go on and on, but hopefully it gives you a sense of the incredible breadth and complexity of the tasks that Machine Learning can tackle, and the types of techniques that you would use for each task.
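As a taste of one of these tasks, here is a minimal sketch of clustering for customer segmentation (the sketch referenced from the clustering item above), assuming scikit-learn; the purchase features are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Each row is one client: [annual spend, number of purchases].
X = np.array([[120, 4], [150, 5], [900, 30], [950, 32], [60, 2], [880, 28]])

# KMeans groups similar clients together; no labels are needed.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(X)
print(segments)  # e.g., one marketing strategy per segment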
Types of Machine Learning Systems
There are so many different types of Machine Learning systems that it is useful to classify them in broad categories, based on the following criteria:

• Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)

Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
Supervised learning
In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).

Figure 1-5. A labeled training set for spam classification (an example of supervised learning)

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6). To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
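Here is a minimal sketch of such a regression task, assuming scikit-learn; the car data is invented for illustration:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Predictors: [mileage in km, age in years]; labels: prices in dollars.
X_train = np.array([[20_000, 1], [60_000, 3], [120_000, 7], [90_000, 5]])
y_train = np.array([24_000, 18_000, 9_000, 13_000])

# The training set includes the desired solutions (the labels).
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
print(model.predict([[45_000, 2]]))  # predicted price for an unseen car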
Unsupervised learning
In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.

Figure 1-7. An unlabeled training set for unsupervised learning
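Here is a minimal sketch of learning without a teacher, assuming scikit-learn: PCA is given only a feature matrix, with no labels at all; the data is randomly generated for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # 100 unlabeled instances, 4 features each

# fit_transform uses X alone: there are no labels to supervise it.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (100, 2): ready for a 2-D visualization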
Semisupervised learning
Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances and few labeled instances. Some algorithms can deal with data that’s partially labeled. This is called semisupervised learning (Figure 1-11).

Figure 1-11. Semisupervised learning with two classes (triangles and squares): the unlabeled examples (circles) help classify a new instance (the cross) into the triangle class rather than the square class, even though it is closer to the labeled squares
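Here is a minimal sketch of semisupervised learning, assuming scikit-learn, where instances labeled -1 are treated as unlabeled; the toy data is invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, -1, -1, 1, -1, -1])  # only two instances are labeled

# The classifier labels its most confident unlabeled points, retrains,
# and repeats, so the plentiful unlabeled data still helps.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)
print(model.predict([[1.1], [5.1]]))  # likely [0 1]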
Instance-Based Versus Model-Based Learning
One more way to categorize Machine Learning systems is by how they generalize. Most Machine Learning tasks are about making predictions. This means that given a few training examples, the system needs to be able to make good predictions for (generalize to) examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

Model-based learning
Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning (Figure 1-16).

Figure 1-16. Model-based learning

For example, suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD’s website and stats about gross domestic product (GDP) per capita from the IMF’s website. Then you join the tables and sort by GDP per capita. Table 1-1 shows an excerpt of what you get.

Table 1-1. Does money make people happier?

Let’s plot the data for these countries (Figure 1-17).

Figure 1-17

There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country’s GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita. This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1).

Equation 1-1. A simple linear model
life_satisfaction = θ0 + θ1 × GDP_per_capita

This model has two model parameters, θ0 and θ1. By tweaking these parameters, you can make your model represent any linear function, as shown in Figure 1-18.

Figure 1-18. A few possible linear models

Before you can use your model, you need to define the parameter values θ0 and θ1. How can you know which values will make your model perform best? To answer this question, you need to specify a performance measure. You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is. For Linear Regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance.

This is where the Linear Regression algorithm comes in: you feed it your training examples, and it finds the parameters that make the linear model fit best to your data. This is called training the model. In our case, the algorithm finds that the optimal parameter values are θ0 = 4.85 and θ1 = 4.91 × 10⁻⁵.

WARNING
Model selection consists in choosing the type of model and fully specifying its architecture. Training a model means running an algorithm to find the model parameters that will make it best fit the training data (and hopefully make good predictions on new data).

Now the model fits the training data as closely as possible (for a linear model), as you can see in Figure 1-19.

Figure 1-19. The linear model that fits the training data best

You are finally ready to run the model to make predictions. For example, say you want to know how happy Cypriots are, and the OECD data does not have the answer. Fortunately, you can use your model to make a good prediction: you look up Cyprus’s GDP per capita, find $22,587, and then apply your model and find that life satisfaction is likely to be somewhere around 4.85 + 22,587 × 4.91 × 10⁻⁵ = 5.96.
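Here is a minimal sketch of this whole workflow, assuming scikit-learn; the (GDP per capita, life satisfaction) pairs below are illustrative stand-ins for the joined OECD/IMF data, not the actual Table 1-1 values:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[12_240], [27_195], [37_675], [50_962]])  # GDP per capita
y = np.array([4.9, 5.8, 6.5, 7.3])                      # life satisfaction

# Training finds the intercept (theta0) and the slope (theta1).
model = LinearRegression()
model.fit(X, y)

X_new = [[22_587]]           # Cyprus's GDP per capita
print(model.predict(X_new))  # roughly comparable to the 5.96 above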

If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better-quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).

In summary:

• You studied the data.
• You selected a model.
• You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).
• Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.

Main Challenges of Machine Learning
In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Nonrepresentative Training Data
In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

Figure 1-21. A more representative training sample

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact, they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data because of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

Poor-Quality Data
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. The following are a couple of examples of when you’d want to clean up training data (a sketch follows the list):

• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it.
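Here is a minimal sketch of both cleanups, assuming pandas; the customer table is invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "age":   [25, 31, None, 45, 38, 230],  # None = missing, 230 = outlier
    "spend": [120, 150, 90, 200, 170, 160],
})

# Discard the clear outlier, but keep rows whose age is merely missing.
df = df[df["age"].isna() | (df["age"] < 120)]

# Fill the missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())
print(df)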
Overfitting the Training Data
Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.
Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?

Figure 1-22. Overfitting the training data

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself.
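Here is a minimal sketch of such overfitting, assuming scikit-learn: a degree-9 polynomial fitted to ten noisy, randomly generated points tracks the training data almost perfectly but gives wild answers elsewhere:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 3, size=(10, 1)), axis=0)
y = X.ravel() + rng.normal(scale=0.3, size=10)  # noisy linear data

# With as many coefficients as points, the curve can chase the noise.
model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5], [3.5]]))  # 3.5 lies outside the training range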
WARNING
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:

• Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
• Gather more training data.
• Reduce the noise in the training data (e.g., fix data errors and remove outliers).

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, the linear model we defined earlier has two parameters, θ0 and θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If we forced θ1 = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data properly: all it could do is move the line up or down to get as close as possible to the training instances, so it would end up around the mean. A very simple model indeed! If we allow the algorithm to modify θ1 but we force it to keep it small, then the learning algorithm will effectively have somewhere in between one and two degrees of freedom. It will produce a model that’s simpler than one with two degrees of freedom, but more complex than one with just one. You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.
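Here is a minimal sketch of regularization, assuming scikit-learn: Ridge regression lets the model keep its slope but penalizes large values, which is the “keep it small” option described above (the data is invented):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.2, 3.8])

unconstrained = LinearRegression().fit(X, y)  # two full degrees of freedom
regularized = Ridge(alpha=10.0).fit(X, y)     # the slope is penalized

print(unconstrained.coef_, regularized.coef_)  # the Ridge slope is smaller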

Underfitting the Training Data
As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

Here are the main options for fixing this problem:

• Select a more powerful model, with more parameters.
• Feed better features to the learning algorithm (feature engineering).
• Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).

Testing and Validating
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain, which is not the best idea.

A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.

TIP
It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
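Here is a minimal sketch of this train/test workflow, assuming scikit-learn; the dataset is randomly generated for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=1.0, size=200)

# Hold out 20% of the data as a test set the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print(train_error, test_error)  # the test error estimates the generalization error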
