AWS ML PDF
These notes follow LinuxAcademy structure which can be found here: https://
linuxacademy.com/course/aws-certified-machine-learning-specialty/. I would
recommend viewing the course to gain full and detailed explanations.
I have created these notes as part of my personal learning and hope to be able to
help and inspire others.
As I am also learning, there may well be mistakes. Please do reach out and let me
know, so I can correct them.
• Advancements in compute power have brought a new wave of artificial intelligence
• Machine Learning provides the ability to learn without being explicitly programmed.
• It focuses on the development of programs that can access data and use it to learn for themselves
• Machine learning is when you load lots of data into a computer program and choose a model to
“fit” the data, which allows the computer (without your help) to come up with predictions.
• The way the computer makes the model is through algorithms, which can range from a simple
equation (like the equation of a line) to a very complex system of logic/math that gets the
computer to the best predictions.
What is Machine Learning?
[Figure: height vs weight scatter plot]
This would then become our training data for making inferences about height and weight, given that we have one of the values.
[Figure: height vs weight testing data, with a straight line and a curved blue line fitted]
This blue line may appear to be a better fit for predicting weights than the straight line.
We can use our test data to test the line and see how well it fits.
[Figure: DATA → TRAIN MODEL → PREDICTION, with an algorithm such as linear regression in the middle]
This is a very simplified view of what we are doing with machine learning.
We have only looked at two dimensions, which is easy to visualise; however, once we
get beyond three it becomes more difficult. Having lots of dimensions is closer to
reality, and considering we cannot draw a 200-dimension graph, machine learning
can help towards solving these problems.
What is Deep Learning?
Deep learning is based on the principles of an organic brain with the aim to get machines to
learn in a similar way.
Neurons are chained together as a Neural Network with inputs and outputs.
[Figure: ML process cycle: train, test, improve, deploy, infer, make predictions]
Process Data
[Figure: example dataset with features and a label]
Feature reduction
We want as much data as possible when training our model; however, we don't want to pass data that is not related. This can be difficult, as you may be looking for relationships in the data that you are not aware of.
Encoding
In the previous image, the star signs are numeric values, therefore the string has been encoded. We could look up the data in a separate table.
Formatting
The file format that we will use for providing the data to the ML algorithm.
The Algorithm
• Train
◦Can see and is directly influenced by the training data
◦Uses, but is indirectly influenced by, the validation data
◦Does not see the testing data during training
[Figure: table of heights and weights used as labelled training data]
This is another example of supervised data, where we can infer the weight based on the height.
[Figure: clustering of dogs vs cats]
Unsupervised learning involves finding relationships where we did not know there was one.
It is best used when we are trying to analyse data with lots of dimensions in order to find relationships between the data points that we would not normally find using conventional methods.
[Figure: reinforcement learning loop: action, reward, score]
Summary
• Supervised learning uses labelled data, so the model can learn the relationship between the features and a known answer, e.g. inferring weight from height.
• Unsupervised learning involves looking for patterns when it is not initially evident there are any. It is best used with hundreds of dimensions, where it is not possible to plot the data on graphs.
• Reinforcement learning involves providing a reward when the model does something correct and taking away the reward when it is incorrect. It involves a lot of trial and error to get it right.
Optimisation
[Figure: height vs weight plots with candidate fit lines]
[Figure: sum of squared differences plotted against the slope of the model line, forming a parabola]
The job of the machine learning algorithm is to find the lowest point of the parabola.
The bottom of this curve would show the line with the best fit, as it has the least amount of difference.
It is easy for us to see the bottom of the slope, but the computer needs to be able to calculate this.
[Figure: stepping down the curve towards the best-fit slope for the model]
You can then tell whether you are heading towards a reduction or an increase in slope, in order to understand the gradient.
It is then possible to keep stepping until you get to the bottom of the graph.
How far you move each time is the step size. If it is too large then we might miss the bottom of the graph; too small and it would be inefficient.
• Graph
◦If we plot the sum of the squares vs the slope of the model line, we will end up with a parabola. The algorithm needs to find the lowest point.
◦The bottom of the curve is where the slope is 0, and this is the best fit.
• Gradient Descent
◦In order to discover the gradient, the model will pick a point, find the gradient and move in the direction where it is less steep.
◦This technique is called gradient descent (see the sketch after this list).
• Learning Rate
◦The step size sets the learning rate.
◦If the step size is too large it might miss the bottom of the graph; too small is not efficient.
• Important
◦The other thing to bear in mind is that there might be multiple dips (local minima) in the line.
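As a rough illustration (the loss curve and numbers below are made up, not from the course), gradient descent repeatedly steps downhill on the loss curve, with the learning rate controlling the step size:

def loss(slope):
    return (slope - 3.0) ** 2          # a parabola with its minimum at slope = 3

def gradient(slope):
    return 2 * (slope - 3.0)           # derivative of the loss

slope = 0.0                            # random starting guess
learning_rate = 0.1                    # step size

for step in range(50):
    slope -= learning_rate * gradient(slope)   # step downhill

print(round(slope, 4))                 # converges towards 3.0, the bottom of the curve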
Regularisation
Your sample data may fit well, but real-world data generally does not fit so well straight away.
• Regularisation is a technique applied when our model does not fit real-world data very well.
• Looking at the graph we can see that small differences can have a larger effect overall.
Hyperparameters
The main hyperparameters are:
◦Learning rate
◦Epochs
◦Batch size
Learning Rate
◦Determines the size of the step taken during gradient descent optimisation.
◦It is set between 0 and 1.
Batch Size
◦The batch size is the number of samples used to train at any one time.
◦It could be all of the data (batch), a single sample (stochastic), or some of the data (mini-batch, often 32, 64 or 128).
◦It can be calculated based on your infrastructure and on the amount of data you have.
◦If you span over multiple servers then you might use a batch size that splits across that infrastructure.
Epochs
◦The number of times the algorithm will process the entire data set.
◦Each time it passes through the data, the intention is to improve the accuracy of the algorithm.
◦Common values are high numbers, since the algorithm will sample the data set many times.
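A minimal sketch, with made-up data, of where these three hyperparameters sit in a typical mini-batch training loop:

import numpy as np

learning_rate = 0.01
batch_size = 32
epochs = 10

X = np.random.rand(1000, 5)        # made-up training data
y = np.random.rand(1000)
weights = np.zeros(5)

for epoch in range(epochs):                        # one full pass over the data per epoch
    for start in range(0, len(X), batch_size):     # mini-batches of 32 samples
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = xb.T @ (xb @ weights - yb) / len(xb)   # gradient of the squared error
        weights -= learning_rate * grad               # step scaled by the learning rate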
Cross Validation
[Figure: data split into folds that rotate between training and validation]
As a result we use all data for training and for validation, which is called k-fold cross validation.
This technique can also be used to compare different algorithms and to validate different data sets.
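A minimal sketch of k-fold cross validation using scikit-learn (the dataset and model choice here are just for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross validation: each fold takes a turn as the validation set
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())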
Feature Selection and Engineering
This example dataset can be used to understand if people like coffee or not.
The first thing to do is remove anything in the data set which does not have anything to do
with the inference we are making, however this does require specific domain knowledge in
order to establish if we are taking away the correct features or not.
In this data set the name is not relevant and can therefore be removed, which also helps to make the algorithm more efficient, as it won't try to find a relationship between someone's name and whether they like coffee. We need to be careful we do not remove a feature that would have been useful.
The result will be a faster-trained model and also one that is more accurate.
[Table: COUNTRY, AGE, HEIGHT, LIKES COFFEE, e.g. UK / 33 / 170 / YES, BRAZIL / 23 / 133 / NO, INDIA / 39 / 175 / YES; one suspicious-looking value is flagged]
The other way to establish relevance is by checking if there is any correlation between the label and the feature.
This also needs domain-level knowledge and trial and error.
Another strategy is to engineer new
features.
[Figure: 3-D data set plotted on score 1, score 2 and score 3]
PCA looks for the aspects of the data which influence it the most by finding the central point of the data set. We do that by finding the mean value on score 1, score 2 and score 3. Once we find that central point, the whole data set is moved so that it is centred around the origin.
PCA generally does this for us; once the data is centred, a line is drawn through the direction of the largest spread, which is principal component 1. The next longest is principal component 2, followed by 3, which gives us the spread of data that most influences the data set.
[Figure: data projected onto principal components 1 and 2]
We can then leave out the 3rd component.
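A minimal sketch of PCA with scikit-learn, assuming a made-up table of score 1 / score 2 / score 3 values; the third component is dropped, as described above:

import numpy as np
from sklearn.decomposition import PCA

scores = np.random.rand(100, 3)      # made-up score 1 / score 2 / score 3 data

pca = PCA(n_components=2)            # keep 2 principal components, drop the 3rd
reduced = pca.fit_transform(scores)  # the data is centred around the origin internally

print(pca.explained_variance_ratio_) # how much of the spread each component explains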
• Try and source more data because the thing you are looking for is not represented as well as you
would like.
• If it is not possible to get more data, another option is to oversample the data, but then faults will likely look like whatever you already have in your training data
• We can synthesise data, understanding what can vary and affect the data set, so that the ML algorithm can approximate the data
• Finally we can try a different algorithm; often people keep using the same algorithm simply because they know and understand it.
Label and One Hot Encoding
UK         0 0 0 1
BRAZIL     1 0 0 0
USA        0 0 1 0
AUSTRALIA  0 1 0 0
In this scenario one-hot encoding comes into play, whereby new features are introduced into the data set: each country becomes a feature, giving a table of 0s and 1s.
This is useful because it avoids creating a numerical relationship or implied hierarchy between the countries.
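A minimal sketch of one-hot encoding the country column with pandas (hypothetical data frame):

import pandas as pd

df = pd.DataFrame({"country": ["UK", "BRAZIL", "USA", "AUSTRALIA"]})
one_hot = pd.get_dummies(df, columns=["country"])   # one new 0/1 column per country
print(one_hot)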
Logistic Regression
Supervised ML algorithm
RESTING HEART RATE  70  88  65  89  78  61  69  98  82
LIKES CATS           Y   N   Y   N   Y   Y   N   N   N
[Figure: the yes/no values plotted against resting heart rate]
A way to do this would be to draw a line using linear regression to find the best fit. A problem with this is that there may be outliers which can skew the data set and therefore lead to the wrong inferences.
Instead we can fit a sigmoid function, which does not get skewed like the linear regression line, but instead looks for the cut-off point between the yes and no values.
There are methods to fine-tune this to understand what is most important.
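A minimal sketch of fitting a logistic regression (a sigmoid) to the resting heart rate data above with scikit-learn; the prediction value is arbitrary:

import numpy as np
from sklearn.linear_model import LogisticRegression

heart_rate = np.array([70, 88, 65, 89, 78, 61, 69, 98, 82]).reshape(-1, 1)
likes_cats = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0])   # Y = 1, N = 0

model = LogisticRegression().fit(heart_rate, likes_cats)
print(model.predict_proba([[75]]))   # probability of liking cats at a resting heart rate of 75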
Linear Regression
Supervised model
LATITUDE         4  7  20  28  38  45  59  70  76
COFFEE CONSUMED  6  2   0   0  24  35  18  49  24
[Figure: scatter plot of coffee consumed against latitude]
The example data set might be latitude for where you live on the planet and then the
amount of coffee consumed.
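A minimal sketch of fitting a linear regression to the latitude and coffee values read from the example table above, using scikit-learn; the prediction value is arbitrary:

import numpy as np
from sklearn.linear_model import LinearRegression

latitude = np.array([4, 7, 20, 28, 38, 45, 59, 70, 76]).reshape(-1, 1)
coffee = np.array([6, 2, 0, 0, 24, 35, 18, 49, 24])   # cups consumed

model = LinearRegression().fit(latitude, coffee)
print(model.predict([[50]]))     # inferred coffee consumption at latitude 50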
Support Vector Machines (SVM)
Supervised model
[Figure: two classes of points separated by a line, with the support vectors marking the boundary]
How do we best identify where we should draw the line? By identifying the boundaries (support vectors) of our data sets.
Decision Trees
Supervised algorithm
[Figure: decision tree with a root node, internal nodes and leaf nodes, based on 'likes walking' and 'likes running']
Decision trees are essentially flow diagrams which have root nodes, internal nodes and leaf nodes.
How do we choose our root? We would need to understand which feature aligns most closely with the question we are asking. In this example, when analysing the data set, we see that it is 'likes running' for who is a dog person vs a cat person.
We would then filter the data based on that feature and identify the next most important feature, which in turn would make up the next node, and go through the other branches.
You may find some of the features were not selected as they had no correlation with the question.
We won't see the actual decision tree when it is created, but we can give it new data and categorise the data to see its behaviour.
Random Forest
Random forests are supervised algorithms
[Figure: several decision trees built from randomly chosen features such as 'likes running', 'likes walking' and 'kms walked']
When you create a decision tree you need to know what question you will place in the root node. A random forest will check 2 different features, chosen randomly, and follow down the branch.
We build the decision tree in this way and continue until we have a collection of decision trees with random variance.
K-Means
Unsupervised ML algorithm
[Figure: 2-D scatter plot with three cluster centre points placed over the data]
If we want to find 3 classes of data, the algorithm makes some random guesses and places 3 centre points across the dataset.
It then goes through each data point and checks which centre point it is closest to. The next step is to work out all the closest data points. At this point the data classification will be wrong, so it will move the centre point to the middle of its class.
The algorithm will then go through the cycle again, including moving the centre point, until the distributions make sense. We need to find the equilibrium where moving the centre point does not affect the classification.
[Figure: reduction in variation plotted against the number of clusters (K), showing an elbow]
To find out how many clusters to use, we can graph the number of clusters against the reduction in variation.
With one cluster the reduction in variation is 0, and as we increase the number of clusters we will eventually see an elbow in the plot where the variation does not change much.
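A minimal sketch of the elbow approach with scikit-learn's KMeans, using made-up data; the inertia value is the within-cluster variation:

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(300, 2) * 100          # made-up 2-D data

# Fit k-means for a range of K and record the within-cluster variation
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    print(k, km.inertia_)                      # look for the "elbow" where it stops dropping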
Latent Dirichlet Allocation (LDA)
Unsupervised algorithm
[Figure: documents made up of words, with the words grouped into topics]
There are data analysis steps which are done before any processing, which involve removing particular 'stop words' such as 'and'. These words do not help towards understanding the content.
We then apply stemming, so that words such as learned, learning and learn are all condensed into a single word, i.e. learn. Once this is complete, we can then tokenise the words into an array.
Finally we choose the number of topics we want LDA to find, and this is K.
So we take all the words in our array and, if we select 3 topics to find, the algorithm will randomly assign a topic number to all the words.
[Table: count of how often each word appears in each topic]
WORD                      TOPIC 1  TOPIC 2  TOPIC 3
MACHINE LEARNING          22       33       43
FUN RUN                   32       34       23
DEEP LEARNING             44       23       34
LAMBDA                    51       43       23
STORAGE                   33       64       54
ARTIFICIAL INTELLIGENCE   45       33       23

We then calculate, for each word, how often it appears in each topic.
Once that is complete, we can then check each document and how often each topic appears there.

[Table: count of how often each topic appears in a particular document, e.g. STORAGE 123 / 23 / 34, MACHINE LEARNING 43 / 143 / 45, LAMBDA 24 / 35 / 132]

We take the number of times a word appears in a topic and how many times it appeared for a particular document and multiply them together.
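A minimal sketch of topic modelling with scikit-learn's LatentDirichletAllocation, using made-up documents (stop-word removal is shown, stemming is omitted):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning on aws", "lambda and storage", "deep learning fun run"]  # made-up documents

counts = CountVectorizer(stop_words="english").fit_transform(docs)  # remove stop words, tokenise
lda = LatentDirichletAllocation(n_components=3, random_state=0)     # K = 3 topics
doc_topics = lda.fit_transform(counts)    # how strongly each topic appears in each document
print(doc_topics)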
Neural Networks
[Figure: neurons connected by weights; plots of the ReLU, Sigmoid and Tanh activation functions]
There are 3 main types of activation function:
• ReLU
◦Does not consider any negative values (they become 0)
• Sigmoid
◦Generally places values between 0 and 1
• Tanh
◦Is similar to Sigmoid but also trends to negative 1 on the y axis
If we plot the x value on the function, the y value is the output the activation function provides.
We do not tend to use Sigmoid or Tanh much; ReLU is most commonly used.
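A minimal sketch of the three activation functions in NumPy:

import numpy as np

def relu(x):
    return np.maximum(0, x)          # negative values become 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # output squashed between 0 and 1

def tanh(x):
    return np.tanh(x)                # output between -1 and 1

x = np.array([-2.5, 0.0, 2.5])
print(relu(x), sigmoid(x), tanh(x))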
[Figure: a neuron with inputs, weights (w) and a bias (b)]
The bias is there to prevent our neuron from being deactivated. If the result was 0 then it would not influence anything, and the more neurons you have turned off, the less effective the network is.
At this point the output will be wrong because everything is random; this first pass through the network is called forward propagation.
[Figure: forward propagation through the network ('how correct am I?'), a loss function at the end, then back propagation of the error]
Once we get to the end we apply a loss function, which is an evaluation of the calculations that were made.
The network then performs back propagation, which uses gradient descent and learning rates to reduce the loss. It looks at how to update the weights and biases.
Each iteration of forward and back propagation is an epoch, and this is how the network learns.
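A toy, made-up example of forward propagation, a loss function and back propagation for a single sigmoid neuron, just to show the mechanics described above:

import numpy as np

x, target = 2.0, 1.0
w, b = np.random.randn(), np.random.randn()   # random start, so the first output is wrong
learning_rate = 0.1

for epoch in range(100):
    y = 1 / (1 + np.exp(-(w * x + b)))        # forward propagation (sigmoid activation)
    loss = (y - target) ** 2                  # loss function: how correct am I?
    grad = 2 * (y - target) * y * (1 - y)     # back propagation: gradient of the loss
    w -= learning_rate * grad * x             # gradient descent update of the weight
    b -= learning_rate * grad                 # ...and the bias

print(round(loss, 4))                         # the loss shrinks as the epochs go by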
Convolutional Neural Networks (CNN)
Supervised Algorithm
Recurrent Neural Networks (RNN)
Supervised Algorithm
[Figure: a sequence of repeated activities at various times during the day]
Let's say we do these repeated activities at various times during the day; there is a linear relationship here between the activities.
The main thing is that we take the output and feed it back into the model. It has a memory of previous predictions, which influences future predictions.
[Figure: candidate algorithms (SVM, Decision Trees, Logistic Regression) and a table of known truths: likes dogs, likes cats, does not like either]
We can use different algorithms on our data, but the question is: which algorithm is best suited to our desired inference?
We can split our data into training and testing data and use Logistic Regression, SVM or Decision Trees.
As we have labelled data, we can push the testing data through the models and get a result, but we want to establish which is best suited for our scenario.
One of the tools to do this is called a confusion matrix. This matrix maps the model's predictions against the known truths, so that we can see the accuracy.
Simply put, the model predicted they do like animals when they didn't, or the model predicted they don't like animals when they did.
[Confusion matrix for SVM: known truths vs predictions]
                   LIKES DOGS   LIKES CATS
Predicted dogs     120          98
Predicted cats     109          200

[Confusion matrix for Logistic Regression: known truths vs predictions]
                   LIKES DOGS   LIKES CATS
Predicted dogs     240          40
Predicted cats     45           202

We would build this confusion matrix across the different algorithms to be able to see which algorithm performs better.
It is not always clear which one is better unless we understand our question in more detail; we would then choose based on our particular use case.
Sensitivity and Specificity
Sensitivity = TP / (TP + FN), i.e. True Positives / (True Positives + False Negatives)
Specificity = TN / (TN + FP), i.e. True Negatives / (True Negatives + False Positives)
The closer the sensitivity value is to 1, the more accurate it is.
[Figure: confusion matrix of known truths (yes/no) against predictions, labelling true positives, false positives, false negatives and true negatives]
Banks are more interested in the sensitivity score, since they are looking for fraudulent activities.
It is more important to catch fraud than to avoid falsely identifying it; if it was not fraud, the issue can be fixed or the account unblocked, for example. Therefore the ML model will have higher sensitivity.
This could be similar in medical scenarios too: if it turns out to be a false identification, the doctor can use additional methods to verify.
Specificity is used, for example, when we have a child watching videos on YouTube. False positives are not acceptable: we can put up with videos that would have been suitable not being shown, but displaying unsuitable content will cause issues.
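A minimal sketch of the two formulas, with made-up confusion-matrix counts:

def sensitivity(tp, fn):
    return tp / (tp + fn)       # how many actual positives were caught

def specificity(tn, fp):
    return tn / (tn + fp)       # how many actual negatives were correctly left alone

print(sensitivity(tp=120, fn=30), specificity(tn=200, fp=45))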
Accuracy and precision
Accuracy is the proportion of all the predictions that were correctly identified.
We need to be careful how we frame the question when it comes to identifying this, and do so in a technical manner.
Accuracy = (TP + TN) / total
Precision = TP / (TP + FP), i.e. of everything predicted positive, how much really was positive.
An accuracy of 100% means the model is likely overfit and needs to be more generalised.
We can calculate the accuracy and precision for Logistic Regression against Decision Trees, for example, and then we can see the difference between them.
ROC/AUC
[Figure: probability of liking coffee plotted for each person, with a horizontal cut-off line]
If we consider a logistic regression graph for a binary situation, i.e. likes coffee vs does not, we can then model this behaviour. It must, however, be binary, and we also need to identify where that cut-off is actually located.
If we move the cut-off line up, we are increasing specificity, which means you do not want any of the classifications to be incorrect.
If we move it down, then we are increasing sensitivity; we don't mind if some people are captured who are false positives, as at least they are captured, and we can address this with further checks and balances later.
The question is where we draw this line, and it depends on what we want to show.
The other consideration is where the best balance between sensitivity and specificity lies.
One extreme or the other is not going to be useful, as it will always return the same result.
[Figure: labelled test data (likes coffee vs does not) plotted against the probability of liking coffee, with a vertical cut-off line splitting the predictions into true positives, false positives, false negatives and true negatives]
In this example there is some test data that has been labelled as likes to drink coffee vs does not.
Everything on the right of the vertical line will be classified as liking coffee, and everything on the left as not liking it.
Here we can see we correctly identified all 5 as liking coffee, and we got 3 true negatives.
What is the best point for our cut-off with all our data?
[Figure: ROC curve plotting true positive rate (TPR) against false positive rate (FPR)]
The curve is the ROC, which is the Receiver Operating Characteristic. The point where we go from the upper slope to the flat line is the cut-off point for maximum sensitivity; the start of the slope is the best model for maximum specificity.
Gini Impurity
In decision trees, the algorithm goes through the data looking for the feature that represents the biggest split. This can be calculated in various ways, and Gini impurity is one of them.
What splits the data best? We need to look at each of the features.
Gini impurity = 1 - (probability of dog)² - (probability of cat)²

LIKES WALKING  LIKES RUNNING  COLOR  TYPE
NO             YES            GREEN  DOG
NO             NO             BLUE   CAT
YES            YES            RED    DOG
YES            NO             GREEN  CAT
YES            NO             GREEN  DOG
YES            YES            BLUE   DOG
NO             NO             RED    CAT

[Figure: splitting on 'likes walking' gives 120 people who do (97 dogs, 23 cats, Gini ≈ 0.31) and 98 who do not (30 dogs, 68 cats, Gini ≈ 0.425)]
Likes walking has the lowest weighted Gini impurity, so it best separates people who like dogs
over cats. We will use likes walking as our root node.
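A minimal sketch of the weighted Gini impurity calculation for the 'likes walking' split, using the counts from the figure above:

def gini(dogs, cats):
    total = dogs + cats
    return 1 - (dogs / total) ** 2 - (cats / total) ** 2

yes_gini = gini(dogs=97, cats=23)    # the 120 people who like walking, roughly 0.31
no_gini = gini(dogs=30, cats=68)     # the 98 people who do not, roughly 0.425

# Weight each branch by its share of the 218 people
weighted = (120 / 218) * yes_gini + (98 / 218) * no_gini
print(round(weighted, 3))            # about 0.36, the lowest of the candidate splits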
F1 Score
F1 = 2 × (Recall × Precision) / (Recall + Precision)
• Whenever you see F1, it is discussed in terms of recall rather than sensitivity; that is why it is expressed in that manner.
• If you have an uneven class distribution, then F1 proves to be a better way to analyse the model.
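A minimal sketch of the formula, with made-up recall and precision values:

def f1(recall, precision):
    return 2 * (recall * precision) / (recall + precision)

print(f1(recall=0.8, precision=0.5))   # about 0.615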
AWS Services, ML and DL Frameworks
• An algorithm such as a CNN, together with a framework such as MXNet, makes up the model, which is then trained to create inferences
• TensorFlow has been developed by Google and powers suggested videos, spam filtering etc.
• AWS has done considerable work with MXNet and SageMaker. MXNet is very good at scaling across cloud infrastructure.
• PyTorch is the runner-up to TensorFlow as an established machine learning framework, and scikit-learn is an easier framework to use that natively supports many algorithms.
Athena
• It provides an SQL interface into S3
• Source data from multiple S3 locations
• Athena looks at the schema of the data, which comes from Glue
• We can do feature engineering on the original dataset to then use for analysis or to train our algorithm
[Figure: SageMaker and EMR querying data in S3 via Athena SQL, with the schema provided by Glue]
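A minimal sketch of querying S3 data through Athena with boto3; the database, table and output bucket names are hypothetical:

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT country, AVG(age) FROM customers GROUP BY country",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
)
print(response["QueryExecutionId"])   # poll get_query_execution / get_query_results for the output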
Amazon Rekognition
Image moderation
Facial analysis
Celebrity recognition
Face comparison
Text in image
Use Cases
Create a filter to prevent inappropriate images being sent via a messaging platform. This can
include nudity or offensive text.
Enhance metadata catalog of an image library to include the number of people in each image
[Figure: S3 → Lambda → Rekognition pipeline, with completion notifications going to SNS and SQS and a second Lambda collecting the results]
The Lambda function uses Rekognition, which goes to the S3 bucket and gets the data.
Rekognition will go through the data and send a message to an SNS topic on completion, which will be written to an SQS queue.
Another Lambda function will see the message in the queue and go back to Rekognition to get the completed job.
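A minimal sketch of the moderation check with boto3; the bucket and object names are hypothetical:

import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_moderation_labels(
    Image={"S3Object": {"Bucket": "my-images-bucket", "Name": "uploads/photo.jpg"}},
    MinConfidence=80,
)
for label in response["ModerationLabels"]:
    print(label["Name"], label["Confidence"])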
Use Cases
Detect people of interest in a live video stream for a public safety application.
Amazon Polly
You can enter some plain text and it will be transformed into speech.
Custom lexicons give you the ability to create your own specific words and pronunciations.
SSML (Speech Synthesis Markup Language) allows you to add syntax to change the way something is spoken, i.e. you could apply an effect like 'whispered', which would say it in a whispered tone.
There is a variety of languages, such as French, German, Hindi, Italian, Romanian etc.
Use Cases
Create an automated voice response (AVR) solution for a telephony system (including Connect)
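A minimal sketch of text-to-speech with boto3; the voice and output file are arbitrary choices:

import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Thank you for calling, how can I help?",
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("greeting.mp3", "wb") as f:
    f.write(response["AudioStream"].read())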
Amazon Transcribe
You can either speak directly into the mic or pass it audio files, which will be transcribed to text
Use Cases
Create a call centre monitoring solution that integrates with other services to analyse caller
sentiment
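A minimal sketch of starting an asynchronous transcription job with boto3; the job name and S3 URI are hypothetical:

import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="call-centre-recording-001",
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "s3://my-audio-bucket/calls/recording.mp3"},
)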
Provides a variety of metrics, such as the ability to see successful request count, throttled request
count and character count along with others
Use Cases
Create a chatbot that triages customer support requests directly on the product page of a website
AWS Step Functions lets you coordinate multiple AWS services into serverless workflows.
It allows you to stitch together services such as Transcribe and Comprehend, along with Lambda functions and other services.
[Figure: Step Functions workflow: speech in S3 → Lambda → Amazon Transcribe → Lambda → Amazon Comprehend]
We could then call a Lambda function based on an event, which would go into a Step Functions state machine that orchestrates the desired behaviour between the different services.
It triggers another Lambda function, which in turn speaks with Transcribe and kicks off a job against the S3 bucket.
We could then use another function, after a period of time, to check whether the job has completed, and based on the response we can decide what we want to do next.
Once we have the desired response, we can use another Lambda function to speak with Amazon Comprehend, which allows you to extract key phrases, entities, sentiment and language, amongst other things.
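A minimal sketch of the Comprehend step with boto3, using a hypothetical transcript:

import boto3

comprehend = boto3.client("comprehend")
text = "The support agent resolved my issue quickly, thank you."   # hypothetical transcript

sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")

print(sentiment["Sentiment"])
print([p["Text"] for p in phrases["KeyPhrases"]])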
Amazon SageMaker
It covers the entire machine learning workflow: label and prepare your data, choose an algorithm, train the model, tune and optimise it for deployment, make predictions and take action.
[Figure: SageMaker training job fed by parameters and data channels]
There is also managed spot training, and you can keep checkpoints of the model state in S3.
Once you have the above, you will end up with a model which can then be used to make inferences.
SageMaker - Batch / Realtime
• Real Time
◦It is possible to do real-time inferences by allowing the application to invoke the SageMaker endpoint, which then calls on the model.
[Figure: application → InvokeEndpoint → SageMaker endpoint, backed by the model artefact in S3 and the container image in ECR]
• Batch
◦Batch Transform jobs
◦We put in the data that we want to get inferences from
◦We could then push that into our classification model, for example, to understand if we have a high-value customer
[Figure: input data in S3 → SageMaker Batch Transform (Docker container) → results written back to S3]
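A minimal sketch of invoking a deployed SageMaker endpoint for a real-time inference with boto3; the endpoint name and CSV payload are hypothetical:

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-classification-endpoint",
    ContentType="text/csv",
    Body="38,24,1",
)
print(response["Body"].read().decode())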
SageMaker - Deploy
At this point you could run a command and give it a new file and use the model to create an
inference.