
UNIT-III MACHINE LEARNING

What is machine learning and why should you care about it?
“Machine learning is a field of study that gives computers the ability to learn without being
explicitly programmed.”
—Arthur Samuel, 1959
The definition of machine learning coined by Arthur Samuel is often quoted and is genius in its
broadness, but it leaves you with the question of how the computer learns. To achieve machine learning,
experts develop general-purpose algorithms that can be used on large classes of learning problems. When
you want to solve a specific task you only need to feed the algorithm more specific data. In a way, you’re
programming by example. In most cases a computer will use data as its source of information and
compare its output to a desired output and then correct for it. The more data or “experience” the computer
gets, the better it becomes at its designated job, like a human does.

When machine learning is seen as a process, the following definition is insightful:


“Machine learning is the process by which a computer can work more accurately as it collects and learns
from the data it is given.”
—Mike Roberts
For example, as a user writes more text messages on a phone, the phone learns more about the
messages’ common vocabulary and can predict (autocomplete) their words faster and more accurately.
In the broader field of science, machine learning is a subfield of artificial intelligence and is
closely related to applied mathematics and statistics. All this might sound a bit abstract, but machine
learning has many applications in everyday life.
Applications for machine learning in data science
Regression and classification are of primary importance to a data scientist. To achieve these goals, one of
the main tools a data scientist uses is machine learning. The uses for regression and automatic
classification are wide ranging, such as the following:
■ Finding oil fields, gold mines, or archeological sites based on existing sites (classification
and regression)
■ Finding place names or persons in text (classification)
■ Identifying people based on pictures or voice recordings (classification)
■ Recognizing birds based on their whistle (classification)
■ Identifying profitable customers (regression and classification)
■ Proactively identifying car parts that are likely to fail (regression)
■ Identifying tumors and diseases (classification)
■ Predicting the amount of money a person will spend on product X (regression)
■ Predicting the number of eruptions of a volcano in a period (regression)
■ Predicting your company’s yearly revenue (regression)
■ Predicting which team will win the Champions League in soccer (classification)
Occasionally data scientists build a model (an abstraction of reality) that provides insight into the
underlying processes of a phenomenon. When the goal of a model isn't prediction but interpretation, it's
called root cause analysis. Here are a few examples:
■ Understanding and optimizing a business process, such as determining which products add value
to a product line
■ Discovering what causes diabetes
■ Determining the causes of traffic jams
This list of machine learning applications can only be seen as an appetizer, because machine learning is
ubiquitous within data science. Regression and classification are two important techniques, but the
repertoire and the applications don't end there; clustering, for example, is another valuable technique. Machine learning
techniques can be used throughout the data science process, as we’ll discuss in the next section.
The modeling process
The modeling phase consists of four steps:
1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
Before you find a good model, you’ll probably iterate among the first three steps.
The last step isn’t always present because sometimes the goal isn’t prediction but explanation
(root cause analysis). For instance, you might want to find out the causes of species’ extinctions but not
necessarily predict which one is next in line to leave our planet.
It’s possible to chain or combine multiple techniques. When you chain multiple models, the output
of the first model becomes an input for the second model. When you combine multiple models, you train
them independently and combine their results. This last technique is also known as ensemble learning.
A model consists of constructs of information called features or predictors and a target or response
variable. Your model’s goal is to predict the target variable, for example, tomorrow’s high temperature.
Engineering features and selecting a model
When engineering features, you come up with and create possible predictors for the model.
This is one of the most important steps in the process because a model recombines these features to
achieve its predictions. Often you may need to consult an expert or the appropriate literature to come up
with meaningful features.

Training your model


With the right predictors in place and a modeling technique in mind, you can progress to model
training. In this phase you present to your model data from which it can learn.
The most common modeling techniques have industry-ready implementations in almost every
programming language, including Python. These enable you to train your models by executing a few lines
of code. For more state-of-the-art data science techniques, you'll probably end up doing heavy
mathematical calculations and implementing them with modern computer science techniques. Once a
model is trained, it's time to test whether it can be extrapolated to reality: model validation.

Validating a model
Data science has many modeling techniques, and the question is which one is the right one to use.
A good model has two properties: it has good predictive power and it generalizes well to data it hasn’t
seen. To achieve this you define an error measure (how wrong the model is) and a validation strategy.
Two common error measures in machine learning are the classification error rate for classification
problems and the mean squared error for regression problems. The classification error rate is the
percentage of observations in the test data set that your model mislabeled; lower is better.
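A sketch of how these two error measures are computed in practice (the numbers below are illustrative, not from the text):

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification error rate = 1 - accuracy (share of mislabeled observations).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(1 - accuracy_score(y_true, y_pred))  # 0.2 -> a 20% error rate

# Mean squared error for a regression problem: average squared deviation.
t_true = [2.0, 3.5, 4.0]
t_pred = [2.5, 3.0, 5.0]
print(mean_squared_error(t_true, t_pred))  # (0.25 + 0.25 + 1.0) / 3 = 0.5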
Many validation strategies exist, including the following common ones (a code sketch follows the list):
■ Dividing your data into a training set with X% of the observations and keeping the rest as a
holdout data set (a data set that’s never used for model creation)—This is the most common
technique.
■ K-folds cross validation—This strategy divides the data set into k parts and uses each part one
time as a test data set while using the others as a training data set. This has the advantage that you
use all the data available in the data set.
■ Leave-1 out—This approach is the same as k-folds, but with each fold holding a single
observation: you always leave one observation out and train on the rest of the data. This is used only on
small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.
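A sketch of these three strategies in scikit-learn (the 70/30 split, k=5, and the toy model are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Strategy 1: holdout -- keep 30% of the observations out of model creation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(model.fit(X_train, y_train).score(X_test, y_test))

# Strategy 2: k-folds cross validation -- each part is the test set once.
print(cross_val_score(model, X, y, cv=5).mean())

# Strategy 3: leave-1 out -- one observation is held out per round.
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())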
Another popular term in machine learning is regularization. When applying regularization, you
incur a penalty for every extra variable used to construct the model. With L1 regularization you ask for a
model with as few predictors as possible. This is important for the model’s robustness: simple solutions
tend to hold true in more situations. L2 regularization aims to keep the variance between the coefficients
of the predictors as small as possible. Overlapping variance between predictors makes it hard to make
out the actual impact of each predictor. Keeping their variance from overlapping will increase
interpretability. To keep it simple: regularization is mainly used to stop a model from using too many
features and thus prevent over-fitting.
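To make the two penalties concrete, here is a sketch using scikit-learn's linear models, where Lasso applies an L1 penalty and Ridge an L2 penalty (the data and alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 10 candidate predictors
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)  # only 2 matter

# L1 (Lasso): extra variables are penalized away -- coefficients shrink
# to zero, asking for a model with as few predictors as possible.
print(Lasso(alpha=0.5).fit(X, y).coef_)

# L2 (Ridge): coefficients are shrunk, keeping their variance small
# and the model more stable and interpretable.
print(Ridge(alpha=0.5).fit(X, y).coef_)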
Validation is extremely important because it determines whether your model works in real-life
conditions. To put it bluntly, it’s whether your model is worth a dime. Even so, every now and then
people send in papers to respected scientific journals (and sometimes even succeed at publishing them)
with faulty validation. The result of this is they get rejected or need to retract the paper because
everything is wrong. Situations like this are bad for your mental health so always keep this in mind: test
your models on data the constructed model has never seen and make sure this data is a true representation
of what it would encounter when applied on fresh observations by other people.
For classification models, instruments like the confusion matrix (introduced in chapter 2 but thoroughly
explained later in this chapter) are golden; embrace them. Once you’ve constructed a good model, you
can (optionally) use it to predict the future.
Predicting new observations
If you’ve implemented the first three steps successfully, you now have a performant model that
generalizes to unseen data. The process of applying your model to new data is called model scoring. In
fact, model scoring is something you implicitly did during validation, only now you don’t know the
correct outcome. By now you should trust your model enough to use it for real.
Model scoring involves two steps. First, you prepare a data set that has features exactly as
defined by your model. This boils down to repeating the data preparation you did in step one of the
modeling process but for a new data set. Then you apply the model on this new data set, and this results in
a prediction.
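A sketch of those two scoring steps, again in scikit-learn (the model and the new observations are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)  # the already trained model

# Step 1: prepare a data set with features exactly as defined by the model
# (same columns, same order, same preparation as in training).
new_observations = np.array([[5.1, 3.5, 1.4, 0.2],
                             [6.7, 3.0, 5.2, 2.3]])

# Step 2: apply the model -- this time we don't know the correct outcome.
print(model.predict(new_observations))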
Now let’s look at the different types of machine learning techniques: adifferent problem requires a
different approach.
Types of machine learning
Broadly speaking, we can divide the different approaches to machine learning by the amount of
human effort that’s required to coordinate them and how they use labeled data—data with a category or a
real-value number assigned to it that represents the outcome of previous observations.
■ Supervised learning techniques attempt to discern results and learn by trying to find patterns in
a labeled data set. Human interaction is required to label the data.
■ Unsupervised learning techniques don’t rely on labeled data and attempt to find patterns in a
data set without human interaction.
■ Semi-supervised learning techniques need labeled data, and therefore human interaction, to find
patterns in the data set, but they can still progress toward a result and learn even if passed
unlabeled data as well.
Supervised learning
As stated before, supervised learning is a learning technique that can only be applied on labeled
data. An example implementation of this would be discerning digits from images. Let’s dive into a case
study on number recognition.

CASE STUDY: DISCERNING DIGITS FROM IMAGES


One of the many common approaches on the web to stopping computers from hacking into user
accounts is the Captcha check—a picture of text and numbers that the human user must decipher and
enter into a form field before sending the form back to the web server. Something like figure 3.3 should
look familiar.
Step 2 of the data science process: fetching the digital image data:-
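The code listing for this step is not reproduced here; based on the surrounding text, fetching the data plausibly amounts to loading scikit-learn's bundled digits images (an assumption, sketched below):

from sklearn import datasets

digits = datasets.load_digits()  # bundled set of 8x8 digit images
print(digits.images.shape)       # (1797, 8, 8): 1797 gray images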

Working with images isn’t much different from working with other data sets. In the case of a gray
image, you put a value in every matrix entry that depicts the gray value to be shown. The following code
demonstrates this process and is step four of the data science process: data exploration.
Step 4 of the data science process: using Scikit-learn:-
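The listing itself is missing from this copy; a minimal sketch of the exploration step it describes, using pylab's matshow() on scikit-learn's digits set:

import pylab as pl
from sklearn import datasets

digits = datasets.load_digits()

# Show the first image and the gray-value matrix behind it (figure 3.4).
pl.matshow(digits.images[0], cmap=pl.cm.gray_r)
pl.show()
print(digits.images[0])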

Figure 3.4 shows how a blurry “0” image translates into a data matrix.
Figure 3.4 shows the actual code output, but perhaps figure 3.5 can clarify this slightly, because it
shows how each element in the vector is a piece of the image. Easy so far, isn’t it? There is, naturally,
a little more work to do. The Naïve Bayes classifier expects a list of values, but each entry in
digits.images is a two-dimensional array (a matrix) reflecting the shape of the image. To flatten it into a list,
we need to call reshape() on digits.images. The net result will be a one-dimensional array that
looks something like this:
array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0.,
0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0.,
0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0.,
0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
Now that we have a way to pass the contents of an image into the classifier, we need to pass it a
training data set so it can start learning how to predict the numbers in the images.
Image data classification problem on images of digits:-
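The original listing is not reproduced here; based on the description that follows, a sketch of the classification step could look like this (the 50/50 train/test split and the GaussianNB variant are our assumptions):

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()

# Flatten each 8x8 image matrix into a one-dimensional list of 64 values,
# which is what the Naive Bayes classifier expects.
n = len(digits.images)
X = digits.images.reshape((n, -1))
y = digits.target

# Train on one half of the labeled data, test on the other half.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
model = GaussianNB().fit(X_train, y_train)

# The confusion matrix: entry (i, j) counts images of i predicted as j.
print(confusion_matrix(y_test, model.predict(X_test)))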
The end result of this code is called a confusion matrix, such as the one shown in figure 3.6. Returned as
a two-dimensional array, it shows on the main diagonal how often the predicted number was the correct
number, and in matrix entry (i,j) how often the model predicted j while the image actually showed i.
Looking at figure 3.6 we can see that the model predicted the number 2 correctly 17 times (at coordinates
3,3), but also that it predicted the number 8 fifteen times when the image actually showed a 2 (at 9,3).

Confusion matrices:-
A confusion matrix is a matrix showing how wrongly (or correctly) a model predicted, how
much it got “confused.” In its simplest form it will be a 2x2 table for models that try to classify
observations as being A or B. Let’s say we have a classification model that predicts whether somebody
will buy our newest product: deep-fried cherry pudding.
We can either predict: “Yes, this person will buy” or “No, this customer won’t buy.” Once we
make our prediction for 100 people we can compare this to their actual behavior, showing us how many
times we got it right. An example is shown in table 3.1.
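Table 3.1 itself is not reproduced here; reconstructed from the counts discussed next, it would look like this:

                        Predicted "will buy"   Predicted "won't buy"
Actually bought                  35                     10
Actually didn't buy              15                     40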
The model was correct in (35+40) 75 cases and incorrect in (15+10) 25 cases, resulting in a (75 correct/100
total observations) 75% accuracy.
All the correctly classified observations are added up on the diagonal (35+40) while everything else
(15+10) is incorrectly classified. When the model only predicts two classes (binary), our correct guesses are
two groups: true positives (predicted to buy and did so) and true negatives (predicted they wouldn’t buy and
they didn’t). Our incorrect guesses are divided into two groups: false positives (predicted they would buy
but they didn’t) and false negatives (predicted not to buy but they did). The matrix is useful to see where the
model is having the most problems. In this case we tend to be overconfident in our product and classify
customers as future buyers too easily (false positive).
From the confusion matrix, we can deduce that for most images the predictions are quite accurate. In
a good model you’d expect the sum of the numbers on the main diagonal of the matrix (also known as the
matrix trace) to be very high compared to the sum of all matrix entries, indicating that the predictions were
correct for the most part.
Let’s assume we want to show off our results in a more easily understandable way or we want to
inspect several of the images and the predictions our program has made: we can use the following code to
display one next to the other. Then we can see where the program has gone wrong and needs a little more
training. If we’re satisfied with the results, the model building ends here and we arrive at step six: presenting
the results.
Inspecting predictions vs actual numbers:-
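The listing is again absent from this copy; a plausible sketch that displays several test images next to the program's predictions:

import pylab as pl
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
n = len(digits.images)
X = digits.images.reshape((n, -1))
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, random_state=0)
predicted = GaussianNB().fit(X_train, y_train).predict(X_test)

# Show the first six test images with the label the model predicted,
# so we can see where the program has gone wrong.
for i in range(6):
    pl.subplot(2, 3, i + 1)
    pl.axis('off')
    pl.imshow(X_test[i].reshape(8, 8), cmap=pl.cm.gray_r)
    pl.title('Predicted: %d' % predicted[i])
pl.show()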

Figure 3.7 shows how all predictions seem to be correct except for the digit number 2, which it labels as
8. We should forgive this mistake as this 2 does share visual similarities with 8. The bottom left number
is ambiguous, even to humans; is it a 5 or a 3? It’s debatable, but the algorithm thinks it’s a 3.
By discerning which images were misinterpreted, we can train the model further by labeling them
with the correct number they display and feeding them back into the model as a new training set (step 5 of
the data science process). This will make the model more accurate, so the cycle of learn, predict, correct
continues and the predictions become more accurate. This is a controlled data set we’re using for the
example.
All the examples are the same size and they are all in 16 shades of gray. Expand that to variable-size
images of variable-length strings of alphanumeric characters in variable shades, as shown in the Captcha
check, and you can appreciate why a model accurate enough to predict any Captcha image doesn't
exist yet.
In this supervised learning example, it’s apparent that without the labels associated with each image
telling the program what number that image shows, a model cannot be built and predictions cannot be made.
By contrast, an unsupervised learning approach doesn’t need its data to be labeled and can be used to give
structure to an unstructured data set.

Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means
that some of the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that teaches
the machines to predict the output correctly. It applies the same concept as a student learning under the
supervision of a teacher.

Supervised learning is the process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input
variable (x) to the output variable (y); in other words, to learn y = f(x).

In the real world, supervised learning can be used for risk assessment, image classification, fraud
detection, spam filtering, etc.
How Supervised Learning Works

In supervised learning, models are trained using a labelled dataset, where the model learns about each
type of data. Once the training process is completed, the model is tested on the basis of test data (a subset of
the dataset held out from training), and then it predicts the output.

The working of supervised learning can be easily understood by the following example:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and polygons. The first step is to train the model on each shape:

If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.

If the given shape has three sides, then it will be labelled as a triangle.

If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the
shape on the basis of its number of sides and predicts the output.
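A toy sketch of this shape example, reduced to a single side-count feature (our own simplification; a square and a rectangle can't be told apart by side count alone, so we use square, triangle, and hexagon):

from sklearn.tree import DecisionTreeClassifier

# Training data: each shape described by one feature, its number of sides.
sides = [[4], [3], [6], [4], [3], [6]]
labels = ['square', 'triangle', 'hexagon',
          'square', 'triangle', 'hexagon']

model = DecisionTreeClassifier().fit(sides, labels)

# A new, unseen shape with three sides is classified by its side count.
print(model.predict([[3]]))  # -> ['triangle']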

Steps Involved in Supervised Learning:

First, determine the type of training dataset.

Collect/Gather the labelled training data.

Split the dataset into a training set, a test set, and a validation set.
Determine the input features of the training dataset, which should carry enough information for the
model to accurately predict the output.

Determine the suitable algorithm for the model, such as support vector machine, decision tree, etc.

Execute the algorithm on the training dataset. Sometimes we need a validation set to tune the control
parameters; this is a subset of the training dataset.

Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output,
our model is accurate.

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market Trends,
etc. Below are some popular Regression algorithms which come under supervised learning:

Linear Regression

Regression Trees

Non-Linear Regression

Bayesian Linear Regression

Polynomial Regression
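The sketch referenced above: a minimal linear regression, with data points invented for illustration.

from sklearn.linear_model import LinearRegression

# Hours of sunshine (input x) vs. ice cream sales (continuous output y).
X = [[1], [2], [3], [4], [5]]
y = [12, 19, 29, 37, 45]

model = LinearRegression().fit(X, y)
print(model.predict([[6]]))  # predicted sales for 6 hours of sunshine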

2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two
or more classes, such as Yes/No, Male/Female, True/False, etc. A typical application is spam filtering.
Below are some popular classification algorithms (a brief sketch follows the list):

Random Forest

Decision Trees

Logistic Regression

Support vector Machines
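And the classification sketch referenced above, using logistic regression on invented toy spam features:

from sklearn.linear_model import LogisticRegression

# Two features per email: number of links, number of spam trigger words.
X = [[0, 0], [1, 0], [8, 5], [6, 7], [0, 1], [9, 9]]
y = ['not spam', 'not spam', 'spam', 'spam', 'not spam', 'spam']

model = LogisticRegression().fit(X, y)
print(model.predict([[7, 4]]))  # many links and trigger words -> ['spam']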

Advantages of Supervised learning:

With the help of supervised learning, the model can predict the output on the basis of prior
experiences.

In supervised learning, we can have an exact idea about the classes of objects.

Supervised learning models help us to solve various real-world problems such as fraud detection,
spam filtering, etc.

Disadvantages of supervised learning:

Supervised learning models are not suitable for handling complex tasks.

Supervised learning cannot predict the correct output if the test data is different from the training
dataset.

Training requires a lot of computation time.

In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning

In the previous topic, we learned about supervised machine learning, in which models are trained using
labeled data. But there may be many cases in which we do not have labeled data and need to find the
hidden patterns in a given dataset. To solve such cases in machine learning, we need unsupervised
learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in which models are
not supervised using a training dataset. Instead, the models themselves find the hidden patterns and
insights in the given data. It can be compared to the learning that takes place in the human brain when
learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because,
unlike supervised learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of a dataset, group the data according to
similarities, and represent the dataset in a compressed format.

Example: Suppose an unsupervised learning algorithm is given an input dataset containing images
of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it
does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is
to identify the image features on its own. It will perform this task by clustering the image dataset into
groups according to the similarities between images.

Why use Unsupervised Learning?

Unsupervised learning is helpful for finding useful insights from the data.

Unsupervised learning is much like how a human learns to think from their own experiences, which
makes it closer to real AI.

Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more
important.

In the real world, we do not always have input data with corresponding output, so to solve such
cases, we need unsupervised learning.

Working of Unsupervised Learning

The working of unsupervised learning can be understood as follows:

Here, we take unlabeled input data, meaning it is not categorized and the corresponding outputs are
not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the
model interprets the raw data to find the hidden patterns in it, and then applies a suitable algorithm such
as k-means clustering or hierarchical clustering.

Once a suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another group. Cluster
analysis finds the commonalities between the data objects and categorizes them according to the presence
and absence of those commonalities.
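A small k-means sketch (the two-dimensional points and the choice of k=2 are illustrative):

from sklearn.cluster import KMeans

# Unlabeled 2-D points: two visible groups, but no labels are given.
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # which cluster each point was assigned to
print(model.cluster_centers_)  # the center of each discovered group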

Association: An association rule is an unsupervised learning method used for finding relationships
between variables in a large database. It determines the sets of items that occur together in the dataset.
Association rules make marketing strategies more effective: for example, people who buy item X (say,
bread) also tend to purchase item Y (butter or jam). A typical example of association rules is
market basket analysis.
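As a minimal sketch of the idea behind association rules, here is a plain-Python co-occurrence count over toy baskets (a real analysis would use a dedicated algorithm such as Apriori; the baskets are invented):

from collections import Counter
from itertools import combinations

# Toy market baskets: which items were bought together?
baskets = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'jam'},
    {'butter', 'milk'},
    {'bread', 'butter', 'jam'},
]

# Count how often each pair of items occurs together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests a rule like "bread -> butter".
print(pair_counts.most_common(3))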

Advantages of Unsupervised Learning

Unsupervised learning is used for more complex tasks compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.

Unsupervised learning is preferable in many settings because unlabeled data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning

Unsupervised learning is intrinsically more difficult than supervised learning, as there is no
corresponding output to learn from.
The result of an unsupervised learning algorithm might be less accurate, as the input data is not
labeled and the algorithm does not know the exact output in advance.

Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into classification and regression problems. | Unsupervised learning can be classified into clustering and association problems.
Supervised learning can be used for cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for cases where we have only input data and no corresponding output data.
A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result compared to supervised learning.
Supervised learning is not close to true artificial intelligence, as we first train the model for each piece of data, and only then can it predict the correct output. | Unsupervised learning is closer to true artificial intelligence, as it learns in a similar way to how a child learns daily routine things from their own experiences.
