Machine Learning Notes


Thanks to machine learning, exciting changes are happening across all areas of society, including:

 Recent advancements in industries such as autonomous vehicles.

 Accurate and rapid translation of text into hundreds of languages.

 AI assistants you might find in your home.

 Worker safety improvements.

 Quicker pharmaceutical design and development.

Machine learning is a complex subject area. Our goal in this lesson is to introduce you to some of the most common
terms and ideas used in machine learning. We will then walk you through the different steps involved in machine
learning and finish with a series of examples that use machine learning to solve real-world problems.

Let's look at the outline for this lesson.

Machine learning is part of the broader field of artificial intelligence. This field is concerned with the capability of
machines to perform activities using human-like intelligence. Within machine learning there are several different
kinds of tasks or techniques:

 In supervised learning, every training sample from the dataset has a corresponding label or output value
associated with it. As a result, the algorithm learns to predict labels or output values. We will explore this in-
depth in this lesson.

 In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn
the underlying patterns or distributions that govern the data. We will explore this in-depth in this lesson.

 In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward
(in the form of a number) on the way to reaching a specific goal. This is a completely different approach than
supervised and unsupervised learning. We will dive deep into this in the next lesson.

In traditional problem-solving with software, a person analyzes a problem and engineers a solution in code to solve
that problem. For many real-world problems, this process can be laborious (or even impossible) because a correct
solution would need to take a vast number of edge cases into consideration.
Imagine, for example, the challenging task of writing a program that can detect if a cat is present in an image. Solving
this in the traditional way would require careful attention to details like varying lighting conditions, different types of
cats, and various poses a cat might be in.

In machine learning, the problem solver abstracts away part of their solution as a flexible component called a model,
and uses a special program called a model training algorithm to adjust that model to real-world data. The result is a
trained model which can be used to predict outcomes that are not part of the dataset used to train it.

In a way, machine learning automates some of the statistical reasoning and pattern-matching the problem solver
would traditionally do.

The overall goal is to use a model created by a model-training algorithm to generate predictions or find patterns in
data that can be used to solve a problem.
Machine learning is a new field created at the intersection of statistics, applied math, and computer science. Because
of the rapid and recent growth of machine learning, each of these fields might use slightly different formal definitions
of the same terms.

Terminology

Machine learning, or ML, is a modern software development technique that enables computers to solve problems by
using examples of real-world data.

In supervised learning, every training sample from the dataset has a corresponding label or output value associated
with it. As a result, the algorithm learns to predict labels or output values.

In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the
form of a number) on the way to reaching a specific goal.

In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the
underlying patterns or distributions that govern the data.

Step 1: Define the problem

Is it possible to find clusters of similar books based on the presence of common words in the book descriptions?

You do editorial work for a book recommendation company, and you want to write an article on the largest book
trends of the year. You believe that a trend called "micro-genres" exists, and you have confidence that you can use
the book description text to identify these micro-genres.

By using an unsupervised machine learning technique called clustering, you can test your hypothesis that the book
description text can be used to identify these "hidden" micro-genres.

Identify the machine learning task you could use

By using an unsupervised machine learning technique called clustering, you can test your hypothesis that the book
description text can be used to identify these "hidden" micro-genres.

Earlier in this lesson, you were introduced to the idea of unsupervised learning. This machine learning task is
especially useful when your data is not labeled.
Step 2: Build your dataset

To test the hypothesis, you gather book description text for 800 romance books published in the current year. You
plan to use this text as your dataset.

Data exploration, cleaning, and preprocessing

In the lesson about building your dataset, you learned that sometimes it is necessary to change the format of the data that you want to use. In this case study, we need to use a process called vectorization. Vectorization is a
process whereby words are converted into numbers.

Data cleaning and exploration


For this project, you believe capitalization and verb tense will not matter, and therefore you remove capitals and
convert all verbs to the same tense using a Python library built for processing human language. You also remove
punctuation and words you don’t think have useful meaning, like 'a' and 'the'. The machine learning community
refers to these words as stop words.
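
To make this step more concrete, here is a minimal cleaning sketch. It assumes the NLTK library (the lesson only says "a Python library built for processing human language"), approximates "converting verbs to the same tense" with lemmatization, and uses a hypothetical list name book_descriptions for the 800 gathered texts.

# A minimal cleaning sketch, assuming NLTK; the exact library and steps may differ.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(description):
    # Lowercase the text and strip punctuation.
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    # Drop stop words like 'a' and 'the', and normalize verb forms (lemmatization).
    words = [lemmatizer.lemmatize(w, pos="v") for w in text.split() if w not in stop_words]
    return " ".join(words)

cleaned_descriptions = [clean(d) for d in book_descriptions]  # book_descriptions is hypothetical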

Data preprocessing
Before you can train the model, you need to do a type of data preprocessing called data vectorization, which is used
to convert text into numbers.
As shown in the following image, you transform this book description text into what is called a bag of words
representation, so that it is understandable by machine learning models.

How the bag of words representation works is beyond the scope of this lesson. If you are interested in learning more,
see the What's next section at the end of this chapter.
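
Although the inner workings of bag of words are beyond the scope of this lesson, the sketch below shows roughly what the transformation could look like in code. It assumes scikit-learn's CountVectorizer and the cleaned_descriptions list from the previous step; neither is prescribed by the lesson.

# A minimal bag-of-words sketch, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(cleaned_descriptions)

print(vectorizer.get_feature_names_out())  # the vocabulary of words found in the descriptions
print(bag_of_words.toarray()[0])           # word counts for the first book description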
Step 3: Train the model

Now you are ready to train your model.

You pick a common cluster-finding model called k-means. In this model, you can change a model parameter, k, to be
equal to how many clusters the model will try to find in your dataset.

Your data is unlabeled and you don't know how many micro-genres might exist. So, you train your model multiple times
using different values for k each time.

What does this even mean? In the following graphs, you can see examples of when k=2 and when k=3.
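
As a rough illustration, training the model for several values of k might look like the following sketch, assuming scikit-learn's KMeans and the bag_of_words matrix built during preprocessing.

# A minimal k-means training sketch, assuming scikit-learn.
from sklearn.cluster import KMeans

models = {}
for k in range(2, 21):                        # try several candidate values of k
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(bag_of_words)                   # assign each book description to one of k clusters
    models[k] = model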

During the model evaluation phase, you plan on using a metric to find which value for k is the most appropriate.

Step 4: Model evaluation

In machine learning, numerous statistical metrics or methods are available to evaluate a model. In this use case,
the silhouette coefficient is a good choice. This metric describes how well your data was clustered by the model. To
find the optimal number of clusters, you plot the silhouette coefficient as shown in the following image. You
find the optimal value is when k=19.
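
A minimal sketch of this evaluation, assuming scikit-learn's silhouette_score, matplotlib, and the models dictionary trained above, might look like this:

# A minimal silhouette-coefficient sketch, assuming scikit-learn and matplotlib.
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

scores = {k: silhouette_score(bag_of_words, m.labels_) for k, m in models.items()}

plt.plot(list(scores.keys()), list(scores.values()))
plt.xlabel("k (number of clusters)")
plt.ylabel("silhouette coefficient")
plt.show()

best_k = max(scores, key=scores.get)  # in this case study, the best value turns out to be k=19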
Often, machine learning practitioners do a manual evaluation of the model's findings.

You find one cluster that contains a large collection of books that you can categorize as "paranormal teen romance."
This trend is known in your industry, and therefore you feel somewhat confident in your machine learning approach.
You don’t know if every cluster is going to be as cohesive as this, but you decide to use this model to see if you can
find anything interesting about which to write an article.

Step 5: Model inference

As you inspect the different clusters found when k=19, you find a surprisingly large cluster of books. Here's an
example from fictionalized cluster #7.

As you inspect the preceding table, you can see that most of these text snippets indicate that the characters are in
some kind of long-distance relationship. You see a few other self-consistent clusters and feel you now have enough
useful data to begin writing an article on unexpected modern romance micro-genres.
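
Inspecting a single cluster could be as simple as the following sketch, assuming the k=19 model trained earlier and the original (uncleaned) book_descriptions list; cluster 7 here is the fictionalized cluster from the example.

# A minimal cluster-inspection sketch.
import numpy as np

labels = models[19].labels_
descriptions = np.array(book_descriptions, dtype=object)

for snippet in descriptions[labels == 7][:5]:   # print a few descriptions from cluster #7
    print(snippet[:120], "...")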

Wrap-up

In this example, you saw how you can use machine learning to help find micro-genres in books by using the text
found on the back of the book. Here is a summary of key moments from the lesson you just finished.

One
For some applications of machine learning, you need to not only clean and preprocess the data but also convert the
data into a format that is machine readable. In this example, the words were converted into numbers through a
process called data vectorization.
Two
Solving problems in machine learning requires iteration. In this example, you saw how it was necessary to train the model multiple times for different values of k. After training your model over multiple iterations, you saw how the silhouette coefficient could be used to determine the optimal value for k.

Three
During model inference, you continued to inspect the clusters for accuracy to ensure that your model was generating useful predictions.

Terminology

 Bag of words: A technique used to extract features from text. It counts how many times a word appears in a
document (corpus), and then transforms that information into a dataset.

 Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used
by a machine learning model.

 Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero
indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A
score approaching 1 indicates successful identification of discrete non-overlapping clusters.

 Stop words: A list of words removed by natural language processing tools when building your dataset. There
is no single universal list of stop words used by all natural language processing tools.

Step 1: Defining the problem


Imagine you run a company that offers specialized on-site janitorial services.
One client - an industrial chemical plant - requires a fast response for spills
and other health hazards. You realize if you could automatically detect spills
using the plant's surveillance system, you could mobilize your janitorial team
faster.

Machine learning could be a valuable tool to solve this problem.


Choosing a model
As shown in the image above, your goal will be to predict if each image
belongs to one of the following classes:
 Contains spill
 Does not contain spill

Step 2: Building a dataset


Collecting
 Using historical data, as well as safely staged spills, quickly build a
collection of images that contain both spills and non-spills in multiple
lighting conditions and environments.
Exploring and cleaning
 Go through all of the photos to ensure that the spill is clearly in the shot.
There are Python tools and other techniques available to improve image
quality, which you can use later if you determine that you need to
iterate.
Data vectorization (converting to numbers)
 Many models require numerical data, so all of your image data must be transformed into a numerical format. Python tools can help you do this automatically (see the sketch after this list).
 In the following image, you can see how each pixel in the image
immediately below can be represented in the image beneath it using a
number between 0 and 1, with 0 being completely black and 1 being
completely white.
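
Here is a minimal sketch of that conversion, assuming the Pillow and NumPy libraries (the lesson only says "Python tools") and a hypothetical image file name.

# A minimal image-vectorization sketch, assuming Pillow and NumPy.
from PIL import Image
import numpy as np

image = Image.open("camera_frame_001.jpg").convert("L")   # hypothetical file; "L" = grayscale
pixels = np.asarray(image) / 255.0    # each pixel becomes a number between 0 (black) and 1 (white)
print(pixels.shape, pixels.min(), pixels.max())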

Split the data


 Split your image data into a training dataset and a test dataset.
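
A minimal sketch of this split, assuming scikit-learn and arrays named images and labels built from the vectorized photos above (both names are assumptions):

# A minimal train/test split sketch, assuming scikit-learn.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42, stratify=labels
)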

Step 3: Model training


 Traditionally, solving this problem would require hand-engineering features on top of the underlying
pixels (for example, locations of prominent edges and corners in the image), and then training a
model on these features.
Today, deep neural networks are the most common tool used for solving this kind of problem. Many
deep neural network models are structured to learn the features on top of the underlying pixels so
you don’t have to engineer them yourself. You’ll have a chance to take a deeper look at this in the next lesson, so
we’ll keep things high-level for now.
 CNN (convolutional neural network)
 Neural networks are beyond the scope of this lesson, but you can think of them as a collection of
very simple models connected together. These simple models are called neurons, and the
connections between these models are trainable model parameters called weights.
Convolutional neural networks are a special type of neural network that is particularly good at
processing images.
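
To make this slightly more concrete, here is a minimal sketch of a small convolutional neural network, assuming Keras and 128x128 grayscale images. The lesson does not prescribe a framework or an architecture, so treat this as one possible setup rather than the solution.

# A minimal CNN sketch for "contains spill" vs. "does not contain spill", assuming Keras.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),                      # 128x128 grayscale images (an assumption)
    layers.Conv2D(16, kernel_size=3, activation="relu"),    # learns low-level features from the pixels
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),    # learns higher-level features
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),                   # probability that the image contains a spill
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, validation_split=0.1)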

Step 4: Model evaluation


As you saw in the last example, there are many different statistical metrics
that you can use to evaluate your model. As you gain more experience in
machine learning, you will learn how to research which metrics can help you
evaluate your model most effectively. Here's a list of common metrics:
 Accuracy
 Confusion matrix
 F1 score
 False positive rate
 False negative rate
 Log loss
 Negative predictive value
 Precision
 Recall
 ROC Curve
 Specificity
In cases such as this, accuracy might not be the best evaluation mechanism.

Why not? The model will see the 'does not contain spill' class almost all the
time, so any model that just predicts 'no spill' most of the time will seem pretty
accurate.
What you really care about is an evaluation tool that rarely misses a real spill.
After doing some internet sleuthing, you realize this is a common problem and
that precision and recall will be effective. Think of precision as answering the
question, "Of all predictions of a spill, how many were right?" and recall as
answering the question, "Of all actual spills, how many did we detect?"
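A minimal sketch of computing these two metrics on the test set, assuming scikit-learn and the Keras model trained in the previous step:

# A minimal precision/recall sketch, assuming scikit-learn.
from sklearn.metrics import precision_score, recall_score

y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()  # threshold the predicted probabilities

print("Precision:", precision_score(y_test, y_pred))  # of all predicted spills, how many were right?
print("Recall:   ", recall_score(y_test, y_pred))      # of all actual spills, how many did we detect?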
Manual evaluation plays an important role. If you are unsure if your staged
spills are sufficiently realistic compared to actual spills, you can get a better
sense of how well your model performs on actual spills by finding additional
examples from historical records. This allows you to confirm that your model
is performing satisfactorily.
Step 5: Model inference
The model can be deployed on a system that enables you to run machine
learning workloads such as AWS Panorama.

Thankfully, most of the time, the results will be from the 'does not contain spill' class.

But when the 'contains spill' class is detected, a simple paging system could
alert the team to respond.
Wrap-up
In this example, you saw how you can use machine learning to help detect
spills in a work environment. This example also used a modern machine
learning technique called a convolutional neural network (CNN).

Here is a summary of key moments from the lesson that you just finished.
One
For some applications of machine learning, you need to use more complicated
techniques to solve the problem. While modern neural networks are a powerful
tool, don’t forget that they come at a cost: they are not easily explained.

Two
High-quality data was once again very important to the success of this
application, to the point where even staging some fake data was required.
Data vectorization was also required: the images had to be converted into
numbers so that they could be used by the neural network.

Three
During model inference, you continued to inspect the predictions for accuracy.
This is especially important in this case because you created some fake data to
use when training your model.
Terminology
Neural networks: a collection of very simple models connected together.
 These simple models are called neurons.
 The connections between these models are trainable model parameters
called weights.
Convolutional neural networks (CNN): a special type of neural network
particularly good at processing images.
