ML Cheatsheet
1 Linear Regression
2 Gradient Descent
3 Logistic Regression
4 Glossary
5 Calculus
6 Linear Algebra
7 Probability
8 Statistics
9 Notation
10 Concepts
11 Forwardpropagation
12 Backpropagation
13 Activation Functions
14 Layers
16 Optimizers
17 Regularization
18 Architectures
21 Regression Algorithms
23 Datasets
24 Libraries
25 Papers
27 Contribute
Brief visual explanations of machine learning concepts with diagrams, code examples and links to resources for
learning more.
Warning: This document is under early stage development. If you find errors, please raise an issue or contribute
a better definition!
Basics
CHAPTER 1
Linear Regression
• Introduction
• Simple regression
– Making predictions
– Cost function
– Gradient descent
– Training
– Model evaluation
– Summary
• Multivariable regression
– Growing complexity
– Normalization
– Making predictions
– Initialize weights
– Cost function
– Gradient descent
– Simplifying with matrices
– Bias term
– Model evaluation
1.1 Introduction
Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It's used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). There are two main types:
Simple regression
Simple linear regression uses traditional slope-intercept form, where 𝑚 and 𝑏 are the variables our algorithm will try
to “learn” to produce the most accurate predictions. 𝑥 represents our input data and 𝑦 represents our prediction.
𝑦 = 𝑚𝑥 + 𝑏
Multivariable regression
A more complex, multi-variable linear equation might look like this, where 𝑤 represents the coefficients, or weights,
our model will try to learn.
𝑓 (𝑥, 𝑦, 𝑧) = 𝑤1 𝑥 + 𝑤2 𝑦 + 𝑤3 𝑧
The variables 𝑥, 𝑦, 𝑧 represent the attributes, or distinct pieces of information, we have about each observation. For
sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers.
1.2 Simple regression

Let's say we are given a dataset with the following columns (features): how much a company spends on Radio advertising each year and its annual Sales in terms of units sold. We are trying to develop an equation that will let us predict units sold based on how much a company spends on radio advertising. The rows (observations) represent companies.
Our prediction function outputs an estimate of sales given a company’s radio advertising spend and our current values
for Weight and Bias.
Weight: the coefficient for the Radio independent variable. In machine learning we call coefficients weights.

Radio: the independent variable. In machine learning we call these variables features.

Bias: the intercept, where our line crosses the y-axis. In machine learning we can call intercepts bias. Bias offsets all predictions that we make.
Our algorithm will try to learn the correct values for Weight and Bias. By the end of our training, our equation will
approximate the line of best fit.
Code
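A minimal sketch of the prediction function, assuming radio is a numpy array (or list) of advertising spends:

def predict_sales(radio, weight, bias):
    # y = mx + b, applied to each company's radio spend
    return weight*radio + bias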
1.2.2 Cost function

The prediction function is nice, but for our purposes we don't really need it. What we need is a cost function so we can start optimizing our weights.
Let’s use MSE (L2) as our cost function. MSE measures the average squared difference between an observation’s
actual and predicted values. The output is a single number representing the cost, or score, associated with our current
set of weights. Our goal is to minimize MSE to improve the accuracy of our model.
Math

$$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - (mx_i + b))^2$$

Note:

• $N$ is the total number of observations (data points)

• $\frac{1}{N}\sum_{i=1}^{N}$ is the mean

• $y_i$ is the actual value of an observation and $mx_i + b$ is our prediction
Code
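A minimal sketch of the MSE cost function in plain Python; the helper name cost_function is reused by the training loop later in this section:

def cost_function(radio, sales, weight, bias):
    # Average of squared differences between actual and predicted sales
    companies = len(radio)
    total_error = 0.0
    for i in range(companies):
        total_error += (sales[i] - (weight*radio[i] + bias))**2
    return total_error / companies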
1.2.3 Gradient descent

To minimize MSE we use Gradient Descent to calculate the gradient of our cost function. Gradient descent works by looking at the error our current weight gives us, using the derivative of the cost function to find the gradient (the slope of the cost function at our current weight), and then updating our weight to move opposite the gradient. Since the gradient points up the slope rather than down it, moving against it decreases our error.
Math
There are two parameters (coefficients) in our cost function we can control: weight 𝑚 and bias 𝑏. Since we need to
consider the impact each one has on the final prediction, we use partial derivatives. To find the partial derivatives, we
use the Chain rule. We need the chain rule because $(y - (mx + b))^2$ is really 2 nested functions: the inner function $y - (mx + b)$ and the outer function $x^2$.
Returning to our cost function:

$$f(m, b) = \frac{1}{N}\sum_{i=1}^{N}(y_i - (mx_i + b))^2$$

Applying the chain rule, the partial derivatives with respect to each parameter are:

$$\frac{df}{dm} = \frac{1}{N}\sum_{i=1}^{N} -2x_i(y_i - (mx_i + b)) \qquad \frac{df}{db} = \frac{1}{N}\sum_{i=1}^{N} -2(y_i - (mx_i + b))$$
Code
To solve for the gradient, we iterate through our data points using our new weight and bias values and take the average
of the partial derivatives. The resulting gradient tells us the slope of our cost function at our current position (i.e.
weight and bias) and the direction we should update to reduce our cost function (we move in the direction opposite the
gradient). The size of our update is controlled by the learning rate.
def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)

    for i in range(companies):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        weight_deriv += -2*radio[i] * (sales[i] - (weight*radio[i] + bias))

        # -2(y - (mx + b))
        bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))

    # We subtract because the derivatives point in direction of steepest ascent
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate

    return weight, bias
1.2.4 Training
Training a model is the process of iteratively improving your prediction equation by looping through the dataset
multiple times, each time updating the weight and bias values in the direction indicated by the slope of the cost
function (gradient). Training is complete when we reach an acceptable error threshold, or when subsequent training
iterations fail to reduce our cost.
Before training we need to initialize our weights (set default values), set our hyperparameters (learning rate and
number of iterations), and prepare to log our progress over each iteration.
Code
def train(radio, sales, weight, bias, learning_rate, iters):
    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)

        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)

        # Log Progress
        if i % 10 == 0:
            print("iter={:d} weight={:.2f} bias={:.4f} cost={:.2f}".format(i, weight, bias, cost))

    return weight, bias
If our model is working, we should see our cost decrease after every iteration.
1.2.5 Model evaluation

Logging

Visualizing

Cost history
1.2.6 Summary
By learning the best values for weight (.46) and bias (.25), we now have an equation that predicts future sales based
on radio advertising investment.
How would our model perform in the real world? I’ll let you think about it :)
1.3 Multivariable regression

Let's say we are given data on TV, radio, and newspaper advertising spend for a list of companies, and our goal is to predict sales in terms of units sold.
As the number of features grows, the complexity of our model increases and it becomes increasingly difficult to
visualize, or even comprehend, our data.
One solution is to break the data apart and compare 1-2 features at a time. In this example we explore how Radio and
TV investment impacts Sales.
1.3.2 Normalization
As the number of features grows, calculating the gradient takes longer to compute. We can speed this up by “normalizing”
our input data to ensure all values are within the same range. This is especially important for datasets with high
standard deviations or differences in the ranges of the attributes. Our goal now will be to normalize our features so
they are all in the range -1 to 1.
Code
Our input is a 200 x 3 matrix containing TV, Radio, and Newspaper data. Our output is a normalized matrix of the
same shape with all values between -1 and 1.
def normalize(features):
    '''
    features - (200, 3)
    features.T - (3, 200)
    '''
    for feature in features.T:
        fmean = np.mean(feature)
        frange = np.amax(feature) - np.amin(feature)

        # Vector subtraction
        feature -= fmean

        # Vector division
        feature /= frange

    return features
Note: Matrix math. Before we continue, it’s important to understand basic Linear Algebra concepts as well as
numpy functions like numpy.dot().
Our predict function outputs an estimate of sales given our current weights (coefficients) and a company’s TV, radio,
and newspaper spend. Our model will try to identify weight values that most reduce our cost function.
W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
[W1],
[W2],
[W3]
])
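A minimal sketch of the prediction function using numpy's dot product; the name predict matches how it is called in the vectorized code later in this chapter:

import numpy as np

def predict(features, weights):
    '''
    features - (200, 3) matrix of TV, Radio, Newspaper spend
    weights  - (3, 1) matrix of coefficients
    Returns a (200, 1) matrix of sales predictions.
    '''
    return np.dot(features, weights)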
Now we need a cost function to audit how our model is performing. The math is the same, except we swap the 𝑚𝑥 + 𝑏
expression for 𝑊1 𝑥1 + 𝑊2 𝑥2 + 𝑊3 𝑥3 . We also divide the expression by 2 to make derivative calculations simpler.
$$MSE = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - (W_1x_1 + W_2x_2 + W_3x_3)\right)^2$$
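A minimal numpy sketch of this cost function, assuming the predict helper above:

def cost_function(features, targets, weights):
    # Mean squared error, halved to simplify the derivative
    N = len(targets)
    predictions = predict(features, weights)
    sq_error = (predictions - targets)**2
    return 1.0 / (2*N) * sq_error.sum()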
Again using the Chain rule we can compute the gradient: a vector of partial derivatives describing the slope of the cost function for each weight.
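Here is a sketch of that per-feature update, assuming the predict helper and numpy arrays with the shapes used above:

import numpy as np

def update_weights(features, targets, weights, lr):
    '''
    Features: (200, 3)
    Targets:  (200, 1)
    Weights:  (3, 1)
    '''
    predictions = predict(features, weights)
    error = (targets - predictions).flatten()  # shape (200,)

    # Extract each feature column
    x1 = features[:, 0]
    x2 = features[:, 1]
    x3 = features[:, 2]

    # Partial derivative of the cost with respect to each weight: -x * error
    d_w1 = -x1 * error
    d_w2 = -x2 * error
    d_w3 = -x3 * error

    # Average each derivative, scale by the learning rate, and descend
    weights[0][0] -= lr * np.mean(d_w1)
    weights[1][0] -= lr * np.mean(d_w2)
    weights[2][0] -= lr * np.mean(d_w3)

    return weights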
The gradient descent code above has a lot of duplication. Can we improve it somehow? One way to refactor would be to loop through our features and weights, allowing our function to handle any number of features. However there is another even better technique: vectorized gradient descent.
Math
We use the same formula as above, but instead of operating on a single feature at a time, we use matrix multiplication to operate on all features and weights simultaneously. We replace the $x_i$ terms with a single feature matrix $X$.
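Under these assumptions, the whole gradient of the halved MSE cost collapses to one matrix expression:

$$\nabla_W \, MSE = -\frac{1}{N} X^T (y - XW)$$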
Code
X = [
    [x1, x2, x3],
    [x1, x2, x3],
    ...
    [x1, x2, x3]
]

targets = [
    [1],
    [2],
    [3]
]

def update_weights_vectorized(X, targets, weights, lr):
    '''
    gradient = X.T * (predictions - targets) / N
    X: (200, 3) numpy array
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    companies = len(X)

    #1 - Get Predictions
    predictions = predict(X, weights)

    #2 - Calculate error/loss
    error = targets - predictions

    #3 - Transpose features from (200, 3) to (3, 200) so we can multiply
    #    by the (200, 1) error matrix; returns a (3, 1) gradient
    gradient = np.dot(-X.T, error)

    #4 - Take the average error derivative for each feature
    gradient /= companies

    #5 - Multiply the gradient by our learning rate
    gradient *= lr

    #6 - Subtract from our weights to minimize cost
    weights -= gradient

    return weights
Our train function is the same as for simple linear regression, however we’re going to make one final tweak before
running: add a bias term to our feature matrix.
In our example, it’s very unlikely that sales would be zero if companies stopped advertising. Possible reasons for this
might include past advertising, existing customer relationships, retail locations, and salespeople. A bias term will help
us capture this base case.
Code
Below we prepend a column of constant 1s to our feature matrix. Paired with the bias weight, this constant input makes the bias act as a fixed offset in every prediction.
bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)
After training our model through 1000 iterations with a learning rate of .0005, we finally arrive at a set of weights we
can use to make predictions:
CHAPTER 2
Gradient Descent
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction
of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update
the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.
2.1 Introduction
Consider the 3-dimensional graph below in the context of a cost function. Our goal is to move from the mountain in the top right corner (high cost) to the dark blue sea in the bottom left (low cost). The arrows represent the direction of steepest descent (negative gradient) from any given point, the direction that decreases the cost function as quickly as possible.
Starting at the top of the mountain, we take our first step downhill in the direction specified by the negative gradient. Next we recalculate the negative gradient (passing in the coordinates of our new point) and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of our graph, or to a point where we can no longer move downhill, a local minimum.
2.2 Learning rate

The size of these steps is called the learning rate. With a high learning rate we can cover more ground each step, but
we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate,
we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low
learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get
to the bottom.
2.3 Cost function

A loss function tells us “how good” our model is at making predictions for a given set of parameters. The cost
function has its own curve and its own gradients. The slope of this curve tells us how to update our parameters to make
the model more accurate.
2.4 Step-by-step
Now let’s run gradient descent using our new cost function. There are two parameters in our cost function we can
control: 𝑚 (weight) and 𝑏 (bias). Since we need to consider the impact each one has on the final prediction, we need
to use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and
store the results in a gradient.
Math
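Restating the partial derivatives derived in the Linear Regression chapter, the gradient we store is:

$$f'(m, b) = \begin{bmatrix} \frac{df}{dm} \\ \frac{df}{db} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N} -2x_i(y_i - (mx_i + b)) \\ \frac{1}{N}\sum_{i=1}^{N} -2(y_i - (mx_i + b)) \end{bmatrix}$$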
To solve for the gradient, we iterate through our data points using our new 𝑚 and 𝑏 values and compute the partial
derivatives. This new gradient tells us the slope of our cost function at our current position (current parameter values)
and the direction we should move to update our parameters. The size of our update is controlled by the learning rate.
Code
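A minimal sketch mirroring the update_weights function from the Linear Regression chapter, with X and Y as the input and target arrays:

def update_weights(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / N) * learning_rate
    b -= (b_deriv / N) * learning_rate

    return m, b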
CHAPTER 3
Logistic Regression
• Introduction
– Comparison to linear regression
– Types of logistic regression
• Binary logistic regression
– Sigmoid activation
– Decision boundary
– Making predictions
– Cost function
– Gradient descent
– Mapping probabilities to classes
– Training
– Model evaluation
• Multiclass logistic regression
– Procedure
– Softmax activation
– Scikit-Learn example
3.1 Introduction
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear
regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid
function to return a probability value which can then be mapped to two or more discrete classes.
Given data on time spent studying and exam scores, linear regression and logistic regression can predict different things:
• Linear Regression could help us predict the student’s test score on a scale of 0 - 100. Linear regression
predictions are continuous (numbers in a range).
• Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions
are discrete (only specific values or categories are allowed). We can also view probability scores underlying the
model’s classifications.
Types of logistic regression:

• Binary (Pass/Fail)
• Multi (Cats, Dogs, Sheep)
• Ordinal (Low, Medium, High)
3.2 Binary logistic regression

Say we're given data on student exam results and our goal is to predict whether a student will pass or fail based on
number of hours slept and hours spent studying. We have two features (hours slept, hours studied) and two classes:
passed (1) and failed (0).
In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value into
another value between 0 and 1. In machine learning, we use sigmoid to map predictions to probabilities.
Math
$$S(z) = \frac{1}{1 + e^{-z}}$$
Note:
• 𝑠(𝑧) = output between 0 and 1 (probability estimate)
• 𝑧 = input to the function (your algorithm’s prediction e.g. mx + b)
• 𝑒 = base of natural log
Graph
Code
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))
Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class
(true/false, cat/dog), we select a threshold value or tipping point above which we will classify values into class 1 and
below which we classify values into class 2.
𝑝 ≥ 0.5, 𝑐𝑙𝑎𝑠𝑠 = 1
𝑝 < 0.5, 𝑐𝑙𝑎𝑠𝑠 = 0
For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as
positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple
classes we could select the class with the highest predicted probability.
Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A
prediction function in logistic regression returns the probability of our observation being positive, True, or “Yes”. We
call this class 1 and its notation is 𝑃 (𝑐𝑙𝑎𝑠𝑠 = 1). As the probability gets closer to 1, our model is more confident that
the observation is in class 1.
Math
Let’s use the same multiple linear regression equation from our linear regression tutorial.
𝑧 = 𝑊0 + 𝑊1 𝑆𝑡𝑢𝑑𝑖𝑒𝑑 + 𝑊2 𝑆𝑙𝑒𝑝𝑡
This time however we will transform the output using the sigmoid function to return a probability value between 0 and
1.
$$P(class = 1) = \frac{1}{1 + e^{-z}}$$
If the model returns .4, it believes there is only a 40% chance of passing. If our decision boundary was .5, we would categorize this observation as “Fail.”
Code
We wrap the sigmoid function around the same prediction function we used in multiple linear regression.
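A minimal sketch, reusing the sigmoid helper above; the docstring shapes are assumptions consistent with the rest of this chapter:

import numpy as np

def predict(features, weights):
    '''
    Returns 1D array of probabilities
    that the class label == 1
    '''
    z = np.dot(features, weights)
    return sigmoid(z)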
Unfortunately we can't (or at least shouldn't) use the same cost function MSE (L2) as we did for linear regression. Why? There is a great math explanation in chapter 3 of Michael Nielsen's deep learning book5, but for now I'll simply say it's because our prediction function is non-linear (due to the sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the optimal global minimum.
Math
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy
loss can be divided into two separate cost functions: one for 𝑦 = 1 and one for 𝑦 = 0.
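Combined into a single expression, the standard form of this loss (written out here from the description above) is:

$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\Big]$$

where $h_\theta(x)$ is the sigmoid prediction.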
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0.
These smooth monotonic functions7 (always increasing or always decreasing) make it easy to calculate the gradient
and minimize cost. Image from Andrew Ng’s slides on logistic regression1 .
5 http://neuralnetworksanddeeplearning.com/chap3.html
7 https://en.wikipedia.org/wiki/Monotonic_function
1 http://www.holehouse.org/mlclass/06_Logistic_Regression.html
The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident
and right predictions! The corollary is increasing prediction accuracy (closer to 0 or 1) has diminishing returns on
reducing cost due to the logistic nature of our cost function.
Multiplying by $y$ and $(1 - y)$ in the above equation is a sneaky trick that lets us use the same equation to solve for both y=1 and y=0 cases. If y=0, the first term cancels out. If y=1, the second term cancels out. In both cases we only perform the operation we need to perform.
Code
def cost_function(features, labels, weights):
    '''
    Features:(100,3)
    Labels: (100,1)
    Weights:(3,1)
    '''
    observations = len(labels)
    predictions = predict(features, weights)

    # Error when label=1 plus error when label=0, averaged
    cost = -labels*np.log(predictions) - (1-labels)*np.log(1-predictions)

    return cost.sum() / observations
To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other, more sophisticated optimization algorithms out there such as conjugate gradient and BFGS, but you don't have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!
Math
One of the neat properties of the sigmoid function is that its derivative is easy to calculate. If you're curious, there is a good walk-through derivation on Stack Overflow6. Michael Nielsen also covers the topic in chapter 3 of his book.
𝐶 ′ = 𝑥(𝑠(𝑧) − 𝑦)
Note:
• 𝐶 ′ is the derivative of cost with respect to weights
• 𝑦 is the actual class label (0 or 1)
• 𝑠(𝑧) is your model’s prediction
• 𝑥 is your feature or feature vector.
Notice how this gradient is the same as the MSE (L2) gradient, the only difference is the hypothesis function.
6 http://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x
Pseudocode

Repeat {
  1. Calculate gradient average
  2. Multiply by learning rate
  3. Subtract from weights
}
Code
def update_weights(features, labels, weights, lr):
    '''
    Vectorized Gradient Descent
    Features:(200, 3)
    Labels: (200, 1)
    Weights:(3, 1)
    '''
    N = len(features)

    #1 - Get Predictions
    predictions = predict(features, weights)

    #2 - Transpose features: (3, 200) x error (200, 1) = (3, 1) gradient
    gradient = np.dot(features.T, predictions - labels)

    #3 - Average, scale by the learning rate, and descend
    gradient /= N
    weights -= lr * gradient

    return weights
Decision boundary
def decision_boundary(prob):
    return 1 if prob >= .5 else 0

def classify(predictions):
    '''
    input - N element array of predictions between 0 and 1
    output - N element array of 0s (False) and 1s (True)
    '''
    boundary = np.vectorize(decision_boundary)
    return boundary(predictions).flatten()
Example output
3.2.7 Training
def train(features, labels, weights, lr, iters):
    for i in range(iters):
        weights = update_weights(features, labels, weights, lr)

        # Log Progress
        if i % 1000 == 0:
            cost = cost_function(features, labels, weights)
            print("iter: " + str(i) + " cost: " + str(cost))

    return weights
If our model is working, we should see our cost decrease after every iteration.
Cost history
Accuracy
Accuracy measures how correct our predictions were. In this case we simply compare predicted labels to true labels
and divide by the total.
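A minimal numpy sketch, assuming both arguments are arrays of 0s and 1s:

import numpy as np

def accuracy(predicted_labels, actual_labels):
    # Fraction of predictions that match the true labels
    diff = predicted_labels - actual_labels
    return 1.0 - (float(np.count_nonzero(diff)) / len(diff))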
Decision boundary
Another helpful technique is to plot the decision boundary on top of our predictions to see how our labels compare to
the actual labels. This involves plotting our predicted probabilities and coloring them with their true labels.
import matplotlib.pyplot as plt

def plot_decision_boundary(trues, falses):
    # trues/falses: predicted probabilities split by their actual label
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(range(len(trues)), trues, s=25, c='b', marker="o", label='Trues')
    ax.scatter(range(len(falses)), falses, s=25, c='r', marker="s", label='Falses')
    plt.legend(loc='upper right')
    ax.set_title("Decision Boundary")
    ax.set_xlabel('N/2')
    ax.set_ylabel('Predicted Probability')
    plt.axhline(.5, color='black')
    plt.show()
3.3 Multiclass logistic regression

Instead of $y = 0, 1$ we will expand our definition so that $y = 0, 1...n$. Basically we re-run binary classification multiple times, once for each class.
3.3.1 Procedure
1. Divide the problem into n+1 binary classification problems (+1 because the class index starts at 0).
2. For each class. . .
3. Predict the probability the observations are in that single class.
4. prediction = max(probability of the classes)
For each sub-problem, we select one class (YES) and lump all the others into a second class (NO). Then we take the
class with the highest predicted value.
3.3.2 Softmax activation

The softmax function (softargmax or normalized exponential function) is a function that takes as input a vector of
K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the
exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or
greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval [ 0 , 1
] , and the components will add up to 1, so that they can be interpreted as probabilities. The standard (unit) softmax
function is defined by the formula
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } z = (z_1, \dots, z_K) \tag{3.2}$$
In words: we apply the standard exponential function to each element $z_i$ of the input vector $z$ and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector $\sigma(z)$ is 1.
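A minimal numpy sketch of the formula above:

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

print(softmax(np.array([1.0, 2.0, 3.0])))  # [0.09 0.24 0.67] -> sums to 1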
3.3.3 Scikit-Learn example

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create Test/Train
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.4)

# Fit the classifier on the training split
model = LogisticRegression()
model.fit(features_train, labels_train)
CHAPTER 4
Glossary
Learning Rate The size of the update steps to take during optimization loops like Gradient Descent. With a high
learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of
the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the
negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating
the gradient is time-consuming, so it will take us a very long time to get to the bottom.
Loss Loss = true_value (from dataset) − predicted_value (from model). The lower the loss, the better a model (unless the model has over-fitted to the training data). The loss is calculated on the training and validation sets, and its interpretation is how well the model is doing on those two sets. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
Machine Learning Mitchell (1997) provides a succinct definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” In simpler language, machine learning is a field in which human-made algorithms learn from data on their own and make predictions for unseen data.
Model A data structure that stores a representation of a dataset (weights and biases). Models are created/learned when
you train an algorithm on a dataset.
Neural Networks Neural Networks are mathematical algorithms modeled after the brain’s architecture, designed to
recognize patterns and relationships in data.
Normalization Restricting the values of weights in regression to avoid overfitting and to improve computation speed.
Noise Any irrelevant information or randomness in a dataset which obscures the underlying pattern.
Null Accuracy Baseline accuracy that can be achieved by always predicting the most frequent class (“B has the highest frequency, so let's guess B every time”).
Observation A data point, row, or sample in a dataset. Another term for instance.
Outlier An observation that deviates significantly from other observations in the dataset.
Overfitting Overfitting occurs when your model learns the training data too well and incorporates details and noise
specific to your dataset. You can tell a model is overfitting when it performs great on your training/validation
set, but poorly on your test set (or new real-world data).
Parameters Parameters are properties of training data learned by training a machine learning model or classifier.
They are adjusted using optimization algorithms and unique to each experiment.
Examples of parameters include:
• weights in an artificial neural network
• support vectors in a support vector machine
• coefficients in a linear or logistic regression
Precision In the context of binary classification (Yes/No), precision measures the model’s performance at classifying
positive observations (i.e. “Yes”). In other words, when a positive value is predicted, how often is the prediction
correct? We could game this metric by only returning positive for the single observation we are most confident
in.
$$P = \frac{TruePositives}{TruePositives + FalsePositives}$$
Recall Also called sensitivity. In the context of binary classification (Yes/No), recall measures how “sensitive” the
classifier is at detecting positive instances. In other words, for all the true observations in our sample, how many
did we “catch.” We could game this metric by always classifying observations as positive.
$$R = \frac{TruePositives}{TruePositives + FalseNegatives}$$
Recall vs Precision Say we are analyzing Brain scans and trying to predict whether a person has a tumor (True) or
not (False). We feed it into our model and our model starts guessing.
• Precision is the % of True guesses that were actually correct. Say there are 100 images, 10 of which contain tumors. If we guess only 1 image is True and that image is actually True, then our precision is 100%! Our results aren't helpful however, because we missed 9 brain tumors. We were super precise when we tried, but we didn't try hard enough.
• Recall, or Sensitivity, provides another lens with which to view how good our model is. Again let's say
there are 100 images, 10 with brain tumors, and we correctly guessed 1 had a brain tumor. Precision is
100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors!
Regression Predicting a continuous output (e.g. price, sales).
Regularization Regularization is a technique utilized to combat the overfitting problem. This is achieved by adding
a complexity term to the loss function that gives a bigger loss for more complex models
Reinforcement Learning Training a model to maximize a reward via iterative trial and error.
ROC (Receiver Operating Characteristic) Curve A plot of the true positive rate against the false positive rate at
all classification thresholds. This is used to evaluate the performance of a classification model at different
classification thresholds. The area under the ROC curve can be interpreted as the probability that the model
correctly distinguishes between a randomly chosen positive observation (e.g. “spam”) and a randomly chosen
negative observation (e.g. “not spam”).
Segmentation The process of partitioning a dataset into multiple distinct sets, such that members of the same set are similar to each other and different from the members of other sets.
Specificity In the context of binary classification (Yes/No), specificity measures the model's performance at classifying negative observations (i.e. “No”). In other words, when the correct label is negative, how often is the prediction correct? We could game this metric by predicting everything as negative.
$$S = \frac{TrueNegatives}{TrueNegatives + FalsePositives}$$
Universal Approximation Theorem A neural network with one hidden layer can approximate any continuous func-
tion but only for inputs in a specific range. If you train a network on inputs between -2 and 2, then it will work
well for inputs in the same range, but you can’t expect it to generalize to other inputs without retraining the
model or adding more hidden neurons.
Unsupervised Learning Training a model to find patterns in an unlabeled dataset (e.g. clustering).
Validation Set A set of observations used during model training to provide feedback on how well the current param-
eters generalize beyond the training set. If training error decreases but validation error increases, your model is
likely overfitting and you should pause training.
Variance How tightly packed are your predictions for a particular observation relative to each other?
• Low variance suggests your model is internally consistent, with predictions varying little from each other
after every iteration.
• High variance (with low bias) suggests your model may be overfitting and reading too deeply into the
noise found in every training set.
CHAPTER 5
Calculus
• Introduction
• Derivatives
– Geometric definition
– Taking the derivative
– Step-by-step
– Machine learning use cases
• Chain rule
– How It Works
– Step-by-step
– Multiple functions
• Gradients
– Partial derivatives
– Step-by-step
– Directional derivatives
– Useful properties
• Integrals
– Computing integrals
– Applications of integration
* Computing probabilities
* Expected value
* Variance
5.1 Introduction
You need to know some basic calculus in order to understand how functions change over time (derivatives), and to
calculate the total amount of a quantity that accumulates over a time period (integrals). The language of calculus will
allow you to speak precisely about the properties of functions and better understand their behaviour.
Normally taking a calculus course involves doing lots of tedious calculations by hand, but having the power of computers on your side can make the process much more fun. This section describes the key ideas of calculus which you'll need to know to understand machine learning concepts.
5.2 Derivatives
In geometry slope represents the steepness of a line. It answers the question: how much does 𝑦 or 𝑓 (𝑥) change given
a specific change in 𝑥?
Using this definition we can easily calculate the slope between two points. But what if I asked you, instead of the slope
between two points, what is the slope at a single point on the line? In this case there isn’t any obvious “rise-over-run”
to calculate. Derivatives help us answer this question.
A derivative outputs an expression we can use to calculate the instantaneous rate of change, or slope, at a single point
on a line. After solving for the derivative you can use it to calculate the slope at every other point on the line.
5.2.3 Step-by-step
Calculating the derivative is the same as calculating normal slope, however in this case we calculate the slope between our point and a point infinitesimally close to it. We use the variable $h$ to represent this infinitesimally small distance. Here are the steps:
1. Given the function:
𝑓 (𝑥) = 𝑥2
2. Increment $x$ by a very small value $h$, where $h = \Delta x$:
𝑓 (𝑥 + ℎ) = (𝑥 + ℎ)2
3. Apply the slope formula
$$\frac{f(x+h) - f(x)}{h}$$

4. Simplify the equation:

$$\frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$$
5. Set $h$ to 0 (the limit as $h$ heads toward 0):

$$2x + 0 = 2x$$
So what does this mean? It means for the function $f(x) = x^2$, the slope at any point equals $2x$. The formula is defined as:

$$\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
Code
Let’s write code to calculate the derivative of any function 𝑓 (𝑥). We test our function works as expected on the input
𝑓 (𝑥) = 𝑥2 producing a value close to the actual derivative 2𝑥.
In general it’s preferable to use the math to obtain exact derivative formulas, but keep in mind you can always compute
derivatives numerically by computing the rise-over-run for a “small step” ℎ.
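A minimal sketch of such a numerical derivative; the step size h = 0.0001 is an arbitrary small choice:

def get_derivative(func, x):
    '''Approximate the derivative of func at x via rise-over-run.'''
    h = 0.0001
    return (func(x + h) - func(x)) / h

def f(x):
    return x**2

print(get_derivative(f, 3))  # ~6.0001, close to the exact value 2x = 6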
Machine learning uses derivatives in optimization problems. Optimization algorithms like gradient descent use derivatives to decide whether to increase or decrease weights in order to maximize or minimize some objective (e.g. a model's accuracy or error functions). Derivatives also help us approximate nonlinear functions as linear functions (tangent lines), which have constant slopes. With a constant slope we can decide whether to move up or down the slope (increase or decrease our weights) to get closer to the target value (class label).
The chain rule is a formula for calculating the derivatives of composite functions. Composite functions are functions
composed of functions inside other function(s).
Given a composite function 𝑓 (𝑥) = 𝐴(𝐵(𝑥)), the derivative of 𝑓 (𝑥) equals the product of the derivative of 𝐴 with
respect to 𝐵(𝑥) and the derivative of 𝐵 with respect to 𝑥.
$$f(x) = h(g(x))$$

$$\frac{df}{dx} = \frac{dh}{dg} \cdot \frac{dg}{dx}$$
5.3.2 Step-by-step
Say $f(x)$ is composed of two functions $h(x) = x^3$ and $g(x) = x^2$. And that:

$$\begin{aligned} \frac{df}{dx} &= \frac{dh}{dg} \cdot \frac{dg}{dx} \qquad &(5.4) \\ &= \frac{dh}{d(x^2)} \cdot \frac{dg}{dx} \qquad &(5.5) \end{aligned}$$
Steps
In the above example we assumed a composite function containing a single inner function. But the chain rule can also
be applied to higher-order functions like:
𝑓 (𝑥) = 𝐴(𝐵(𝐶(𝑥)))
The chain rule tells us that the derivative of this function equals:
$$\frac{df}{dx} = \frac{dA}{dB} \cdot \frac{dB}{dC} \cdot \frac{dC}{dx}$$
We can also write this derivative using $f'$ notation:

$$f'(x) = A'(B(C(x))) \cdot B'(C(x)) \cdot C'(x)$$
Steps
5.4 Gradients
A gradient is a vector that stores the partial derivatives of multivariable functions. It helps us calculate the slope at a
specific point on a curve for functions with multiple independent variables. In order to calculate this more complex
slope, we need to isolate each variable to determine how it impacts the output on its own. To do this we iterate
through each of the variables and calculate the derivative of the function after holding all other variables constant.
Each iteration produces a partial derivative which we store in the gradient.
In functions with 2 or more variables, the partial derivative is the derivative of one variable with respect to the others.
If we change 𝑥, but hold all other variables constant, how does 𝑓 (𝑥, 𝑧) change? That’s one partial derivative. The next
variable is 𝑧. If we change 𝑧 but hold 𝑥 constant, how does 𝑓 (𝑥, 𝑧) change? We store partial derivatives in a gradient,
which represents the full derivative of the multivariable function.
5.4.2 Step-by-step
Here are the steps to calculate the gradient for a multivariable function:
1. Given a multivariable function
𝑓 (𝑥, 𝑧) = 2𝑧 3 𝑥2
2. Calculate the derivative with respect to 𝑥
$$\frac{df}{dx}(x, z)$$
3. Swap 2𝑧 3 with a constant value 𝑏
𝑓 (𝑥, 𝑧) = 𝑏𝑥2
4. Calculate the derivative with $b$ held constant:

$$\begin{aligned} \frac{df}{dx} &= \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} & (5.15) \\ &= \lim_{h \to 0} \frac{b(x+h)^2 - bx^2}{h} & (5.16) \\ &= \lim_{h \to 0} \frac{b(x^2 + 2xh + h^2) - bx^2}{h} & (5.18) \\ &= \lim_{h \to 0} \frac{2bxh + bh^2}{h} & (5.20) \\ &= \lim_{h \to 0} 2bx + bh & (5.21) \end{aligned}$$

As $h \to 0$:

$$2bx + 0 = 2bx$$

5. Swap $2z^3$ back into the equation, to find the derivative with respect to $x$:

$$\frac{df}{dx}(x, z) = 2(2z^3)x = 4z^3x \qquad (5.23)$$
6. Repeat the above steps to calculate the derivative with respect to 𝑧
$$\frac{df}{dz}(x, z) = 6x^2z^2$$
7. Store the partial derivatives in a gradient
$$\nabla f(x, z) = \begin{bmatrix} \frac{df}{dx} \\ \frac{df}{dz} \end{bmatrix} = \begin{bmatrix} 4z^3x \\ 6x^2z^2 \end{bmatrix}$$
Another important concept is directional derivatives. When calculating the partial derivatives of multivariable functions we use our old technique of analyzing the impact of infinitesimally small increases to each of our independent variables. By increasing each variable we alter the function output in the direction of the slope.
But what if we want to change directions? For example, imagine we’re traveling north through mountainous terrain
on a 3-dimensional plane. The gradient we calculated above tells us we’re traveling north at our current location. But
what if we wanted to travel southwest? How can we determine the steepness of the hills in the southwest direction?
Directional derivatives help us find the slope if we move in a direction different from the one specified by the gradient.
Math
The directional derivative is computed by taking the dot product11 of the gradient of 𝑓 and a unit vector ⃗𝑣 of “tiny
nudges” representing the direction. The unit vector describes the proportions we want to move in each direction. The
output of this calculation is a scalar number representing how much 𝑓 will change if the current input moves with
vector ⃗𝑣 .
Let’s say you have the function 𝑓 (𝑥, 𝑦, 𝑧) and you want to compute its directional derivative along the following
vector2 :
$$\vec{v} = \begin{bmatrix} 2 \\ 3 \\ -1 \end{bmatrix}$$
As described above, we take the dot product of the gradient and the directional vector:
$$\nabla_{\vec{v}} f = \begin{bmatrix} \frac{df}{dx} \\ \frac{df}{dy} \\ \frac{df}{dz} \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \\ -1 \end{bmatrix} = 2\frac{df}{dx} + 3\frac{df}{dy} - \frac{df}{dz}$$
This should make sense because a tiny nudge along ⃗𝑣 can be broken down into two tiny nudges in the x-direction,
three tiny nudges in the y-direction, and a tiny nudge backwards, by 1 in the z-direction.
There are two additional properties of gradients that are especially useful in deep learning. The gradient of a function:
1. Always points in the direction of greatest increase of a function
2. Is zero at a local maximum or local minimum
5.5 Integrals
The integral of 𝑓 (𝑥) corresponds to the computation of the area under the graph of 𝑓 (𝑥). The area under 𝑓 (𝑥) between
the points 𝑥 = 𝑎 and 𝑥 = 𝑏 is denoted as follows:
$$A(a, b) = \int_a^b f(x) \, dx$$
11 https://en.wikipedia.org/wiki/Dot_product
2 https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/directional-derivative-introduction
The area $A(a, b)$ is bounded by the function $f(x)$ from above, by the $x$-axis from below, and by two vertical lines at $x = a$ and $x = b$. The points $x = a$ and $x = b$ are called the limits of integration. The $\int$ sign comes from the Latin word summa. The integral is the “sum” of the values of $f(x)$ between the two limits of integration.
The integral function 𝐹 (𝑐) corresponds to the area calculation as a function of the upper limit of integration:
$$F(c) \equiv \int_0^c f(x) \, dx$$
There are two variables and one constant in this formula. The input variable 𝑐 describes the upper limit of integration.
The integration variable 𝑥 performs a sweep from 𝑥 = 0 until 𝑥 = 𝑐. The constant 0 describes the lower limit of
integration. Note that choosing 𝑥 = 0 for the starting point of the integral function was an arbitrary choice.
The integral function 𝐹 (𝑐) contains the “precomputed” information about the area under the graph of 𝑓 (𝑥). The
derivative function 𝑓 ′ (𝑥) tells us the “slope of the graph” property of the function 𝑓 (𝑥) for all values of 𝑥. Similarly,
the integral function 𝐹 (𝑐) tells us the “area under the graph” property of the function 𝑓 (𝑥) for all possible limits of
integration.
The area under 𝑓 (𝑥) between 𝑥 = 𝑎 and 𝑥 = 𝑏 is obtained by calculating the change in the integral function as follows:
$$A(a, b) = \int_a^b f(x) \, dx = F(b) - F(a)$$
We can approximate the total area under the function 𝑓 (𝑥) between 𝑥 = 𝑎 and 𝑥 = 𝑏 by splitting the region into tiny
vertical strips of width ℎ, then adding up the areas of the rectangular strips. The figure below shows how to compute
the area under 𝑓 (𝑥) = 𝑥2 between 𝑥 = 1 and 𝑥 = 3 by approximating it as four rectangular strips of width ℎ = 0.5.
Usually we want to choose ℎ to be a small number so that the approximation is accurate. Here is some sample code
that performs integration.
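A minimal sketch of such strip-based integration; the strip width h is an arbitrary small choice:

def integrate(func, a, b, h=0.0001):
    # Sum the areas of rectangular strips of width h between a and b
    total = 0.0
    x = a
    while x < b:
        total += h * func(x)
        x += h
    return total

print(integrate(lambda x: x**2, 1, 3))  # ~8.667, the exact answer is 26/3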
You can find integral functions using the derivative formulas and some reverse engineering. To find an integral function
of the function 𝑓 (𝑥), we must find a function 𝐹 (𝑥) such that 𝐹 ′ (𝑥) = 𝑓 (𝑥). Suppose you’re given a function 𝑓 (𝑥)
and asked to find its integral function 𝐹 (𝑥):
$$F(x) = \int f(x) \, dx$$
This is equivalent to finding a function $F(x)$ such that

$$F'(x) = f(x)$$

For example, suppose you want to find the indefinite integral $\int x^2 \, dx$. We can rephrase this problem as the search for some $F(x)$ such that

$$F'(x) = x^2$$

Remembering the derivative formulas we saw above, you guess that $F(x)$ must contain an $x^3$ term. Taking the derivative of a cubic term results in a quadratic term. Therefore, the function you are looking for has the form $F(x) = cx^3$, for some constant $c$. Pick the constant $c$ that makes this equation true:

$$F'(x) = 3cx^2 = x^2$$

Solving $3c = 1$, we find $c = \frac{1}{3}$, and so the integral function is

$$F(x) = \int x^2 \, dx = \frac{1}{3}x^3 + C$$

You can verify that $\frac{d}{dx}\left[\frac{1}{3}x^3 + C\right] = x^2$.
You can also verify integrals analytically; a table of standard integral formulas is a handy reference here.
Integral calculations have widespread applications to more areas of science than are practical to list here. Let’s explore
a few examples related to probabilities.
Computing probabilities
A continuous random variable 𝑋 is described by its probability density function 𝑝(𝑥). A probability density function
𝑝(𝑥) is a positive function for which the total area under the curve is 1:
$$p(x) \geq 0, \ \forall x \quad \text{and} \quad \int_{-\infty}^{\infty} p(x) \, dx = 1$$
Thus, the notion of integration is central to probability theory with continuous random variables.
We also use integration to compute certain characteristic properties of the random variable. The expected value and
the variance are two properties of any random variable 𝑋 that capture important aspects of its behaviour.
Expected value
The expected value of the random variable 𝑋 is computed using the formula
$$\mu = \int_{-\infty}^{\infty} x \, p(x) \, dx$$
The expected value is a single number that tells us what value of 𝑋 we can expect to obtain on average from the
random variable 𝑋. The expected value is also called the average or the mean of the random variable 𝑋.
Variance
The variance formula computes the expectation of the squared distance of the random variable 𝑋 from its expected
value. The variance 𝜎 2 , also denoted var(𝑋), gives us an indication of how clustered or spread the values of 𝑋 are.
A small variance indicates the outcomes of 𝑋 are tightly clustered near the expected value 𝜇, while a large variance
indicates the outcomes of 𝑋 are widely spread. The square root of the variance is called the standard deviation and is
usually denoted 𝜎.
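Written as an integral, the definition described above is:

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 \, p(x) \, dx$$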
The expected value 𝜇 and the variance 𝜎 2 are two central concepts in probability theory and statistics because they
allow us to characterize any random variable. The expected value is a measure of the central tendency of the random
variable, while the variance 𝜎 2 measures its dispersion. Readers familiar with concepts from physics can think of the
expected value as the centre of mass of the distribution, and the variance as the moment of inertia of the distribution.
CHAPTER 6
Linear Algebra
• Vectors
– Notation
– Vectors in geometry
– Scalar operations
– Elementwise operations
– Dot product
– Hadamard product
– Vector fields
• Matrices
– Dimensions
– Scalar operations
– Elementwise operations
– Hadamard product
– Matrix transpose
– Matrix multiplication
– Test yourself
• Numpy
– Dot product
– Broadcasting
Linear algebra is a mathematical toolbox that offers helpful techniques for manipulating groups of numbers simultaneously. It provides structures like vectors and matrices (spreadsheets) to hold these numbers and new rules for how to add, subtract, multiply, and divide them. Here is a brief overview of basic linear algebra concepts taken from my linear algebra post on Medium.
6.1 Vectors
Vectors are 1-dimensional arrays of numbers or terms. In geometry, vectors store the magnitude and direction of a potential change to a point. The vector [3, -2] says go right 3 and down 2. An array with more than one dimension is called a matrix.
6.1.1 Notation
There are a variety of ways to represent vectors. Here are a few you might come across in your reading.
$$v = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$$
Vectors typically represent movement from a point. They store both the magnitude and direction of potential changes
to a point. The vector [-2,5] says move left 2 units and up 5 units1 .
A vector can be applied to any point in space. The vector’s direction equals the slope of the hypotenuse created moving
up 5 and left 2. Its magnitude equals the length of the hypotenuse.
1 http://mathinsight.org/vector_introduction
Scalar operations involve a vector and a number. You modify the vector in-place by adding, subtracting, or multiplying
the number from all the values in the vector.
$$\begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} + 1 = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}$$
In elementwise operations like addition, subtraction, and division, values that correspond positionally are combined to
produce a new vector. The 1st value in vector A is paired with the 1st value in vector B. The 2nd value is paired with
the 2nd, and so on. This means the vectors must have equal dimensions to complete the operation.*
$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \end{bmatrix}$$
y = np.array([1,2,3])
x = np.array([2,3,4])
y + x = [3, 5, 7]
y - x = [-1, -1, -1]
y / x = [.5, .67, .75]
The dot product of two vectors is a scalar. Dot product of vectors and matrices (matrix multiplication) is one of the
most important operations in deep learning.
$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = a_1b_1 + a_2b_2$$
y = np.array([1,2,3])
x = np.array([2,3,4])
np.dot(y,x) = 20
Hadamard product is elementwise multiplication, and it outputs a vector.

y = np.array([1,2,3])
x = np.array([2,3,4])
y * x = [2, 6, 12]
A vector field shows how far the point (x,y) would hypothetically move if we applied a vector function to it like
addition or multiplication. Given a point in space, a vector field shows the power and direction of our proposed change
at a variety of points in a graph2 .
This vector field is an interesting one since it moves in different directions depending the starting point. The reason is
that the vector behind this field stores terms like 2𝑥 or 𝑥2 instead of scalar values like -2 and 5. For each point on the
graph, we plug the x-coordinate into 2𝑥 or 𝑥2 and draw an arrow from the starting point to the new location. Vector
fields are extremely useful for visualizing machine learning techniques like Gradient Descent.
6.2 Matrices
A matrix is a rectangular grid of numbers or terms (like an Excel spreadsheet) with special rules for addition, subtraction, and multiplication.
6.2.1 Dimensions
We describe the dimensions of a matrix in terms of rows by columns.

a = np.array([
    [1,2,3],
    [4,5,6]
])
a.shape == (2,3)
2 https://en.wikipedia.org/wiki/Vector_field
Scalar operations with matrices work the same way as they do for vectors. Simply apply the scalar to every element
in the matrix—add, subtract, divide, multiply, etc.
$$\begin{bmatrix} 2 & 3 \\ 2 & 3 \\ 2 & 3 \end{bmatrix} + 1 = \begin{bmatrix} 3 & 4 \\ 3 & 4 \\ 3 & 4 \end{bmatrix}$$
# Addition
a = np.array(
[[1,2],
[3,4]])
a + 1
[[2,3],
[4,5]]
In order to add, subtract, or divide two matrices they must have equal dimensions. We combine corresponding values
in an elementwise fashion to produce a new matrix.
$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} a+1 & b+2 \\ c+3 & d+4 \end{bmatrix}$$
a = np.array([
[1,2],
[3,4]])
b = np.array([
[1,2],
[3,4]])
a + b
[[2, 4],
[6, 8]]
a - b
[[0, 0],
[0, 0]]
Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to
produce a new matrix.
$$\begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \odot \begin{bmatrix} b_1 & b_2 \\ b_3 & b_4 \end{bmatrix} = \begin{bmatrix} a_1 \cdot b_1 & a_2 \cdot b_2 \\ a_3 \cdot b_3 & a_4 \cdot b_4 \end{bmatrix}$$
a = np.array(
    [[2,3],
     [2,3]])
b = np.array(
    [[3,4],
     [5,6]])

# Uses python's multiply operator
a * b
[[ 6, 12],
 [10, 18]]
In numpy you can take the Hadamard product of a matrix and vector as long as their dimensions meet the requirements
of broadcasting.
$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \odot \begin{bmatrix} b_1 & b_2 \\ b_3 & b_4 \end{bmatrix} = \begin{bmatrix} a_1 \cdot b_1 & a_1 \cdot b_2 \\ a_2 \cdot b_3 & a_2 \cdot b_4 \end{bmatrix}$$
Neural networks frequently process weights and inputs of different sizes where the dimensions do not meet the requirements of matrix multiplication. Matrix transposition (often denoted by a superscript 'T', e.g. Mᵀ) provides a way to “rotate” one of the matrices so that the operation complies with multiplication requirements and can continue.
There are two steps to transpose a matrix:
1. Rotate the matrix right 90°
2. Reverse the order of elements in each row (e.g. [a b c] becomes [c b a])
As an example, transpose matrix M into T:
⎡ ⎤
𝑎 𝑏 [︂ ]︂
𝑎 𝑐 𝑒
⎣𝑐 𝑑⎦ ⇒
𝑏 𝑑 𝑓
𝑒 𝑓
a = np.array([
[1, 2],
[3, 4]])
a.T
[[1, 3],
[2, 4]]
Matrix multiplication specifies a set of rules for multiplying matrices together to produce a new matrix.

Rules

Not all matrices are eligible for multiplication, and there is a requirement on the dimensions of the resulting matrix output.
1. The number of columns of the 1st matrix must equal the number of rows of the 2nd
2. The product of an M x N matrix and an N x K matrix is an M x K matrix. The new matrix takes the rows of the
1st and columns of the 2nd
Steps
Matrix multiplication relies on dot product to multiply various combinations of rows and columns. In the image
below, taken from Khan Academy’s excellent linear algebra course, each entry in Matrix C is the dot product of a row
in matrix A and a column in matrix B3 .
The operation a1 · b1 means we take the dot product of the 1st row in matrix A (1, 7) and the 1st column in matrix B
(3, 5).
$$a_1 \cdot b_1 = \begin{bmatrix} 1 \\ 7 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 5 \end{bmatrix} = (1 \cdot 3) + (7 \cdot 5) = 38$$
6.3 Numpy
Numpy uses the function np.dot(A,B) for both vector and matrix multiplication. It has some other interesting features
and gotchas so I encourage you to read the documentation here before use.
a = np.array([
[1, 2]
])
a.shape == (1,2)
b = np.array([
[3, 4],
[5, 6]
])
b.shape == (2,2)
# Multiply
mm = np.dot(a,b)
mm == [13, 16]
mm.shape == (1,2)
6.3.2 Broadcasting
In numpy the dimension requirements for elementwise operations are relaxed via a mechanism called broadcasting.
Two matrices are compatible if the corresponding dimensions in each matrix (rows vs rows, columns vs columns)
meet the following requirements:
1. The dimensions are equal, or
2. One dimension is of size 1
a = np.array([
    [1],
    [2]
])
b = np.array([
    [3,4],
    [5,6]
])

# a has one column, b has two: a's single column broadcasts across both
a * b
[[ 3,  4],
 [10, 12]]
CHAPTER 7
Probability
• Links
• Screenshots
• License
7.1 Links
7.2 Screenshots
7.3 License
CHAPTER 8
Statistics
CHAPTER 9
Notation
• Algebra
• Calculus
• Linear algebra
• Probability
• Set theory
• Statistics
Note: Use the table generator to quickly add new symbols. Import current tables into tablesgenerator from
figures/*.tgn. Export and save your changes. Also see helpful multiline editing in Sublime.
9.1 Algebra
9.2 Calculus
9.4 Probability
9.6 Statistics
CHAPTER 10
Concepts
• Neural Network
• Neuron
• Synapse
• Weights
• Bias
• Layers
• Weighted Input
• Activation Functions
• Loss Functions
• Optimization Algorithms
• Gradient Accumulation
10.1 Neural Network

Neural networks are a class of machine learning algorithms used to model complex patterns in datasets using multiple
hidden layers and non-linear activation functions. A neural network takes an input, passes it through multiple layers of
hidden neurons (mini-functions with unique coefficients that must be learned), and outputs a prediction representing
the combined input of all the neurons.
Neural networks are trained iteratively using optimization techniques like gradient descent. After each cycle of training, an error metric is calculated based on the difference between prediction and target. The derivatives of this error metric are calculated and propagated back through the network using a technique called backpropagation. Each neuron's coefficients (weights) are then adjusted relative to how much they contributed to the total error. This process is repeated iteratively until the network error drops below an acceptable threshold.
10.2 Neuron
A neuron takes a group of weighted inputs, applies an activation function, and returns an output.
Inputs to a neuron can either be features from a training set or outputs from a previous layer’s neurons. Weights are
applied to the inputs as they travel along synapses to reach the neuron. The neuron then applies an activation function
to the “sum of weighted inputs” from each incoming synapse and passes the result on to all the neurons in the next
layer.
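A minimal numpy sketch of a single neuron; using relu as the activation is an illustrative assumption:

import numpy as np

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights):
    # Weighted sum of the incoming synapses, passed through the activation
    return relu(np.dot(inputs, weights))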
10.3 Synapse
Synapses are like roads in a neural network. They connect inputs to neurons, neurons to neurons, and neurons to
outputs. In order to get from one neuron to another, you have to travel along the synapse paying the “toll” (weight)
along the way. Each connection between two neurons has a unique synapse with a unique weight attached to it. When
we talk about updating weights in a network, we’re really talking about adjusting the weights on these synapses.
10.4 Weights
Weights are values that control the strength of the connection between two neurons. That is, inputs are typically
multiplied by weights, and that defines how much influence the input will have on the output. In other words: when
the inputs are transmitted between neurons, the weights are applied to the inputs along with an additional value (the
bias).
10.5 Bias
Bias terms are additional constants attached to neurons and added to the weighted input before the activation function
is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example,
if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your
features have an effect? Bias terms typically accompany weights and must also be learned by your model.
10.6 Layers
Input Layer
Holds the data your model will train on. Each neuron in the input layer represents a unique attribute in your dataset
(e.g. height, hair color, etc.).
Hidden Layer
Sits between the input and output layers and applies an activation function before passing on the results. There are often multiple hidden layers in a network. In traditional networks, hidden layers are typically fully-connected layers, where each neuron receives input from all the previous layer's neurons and sends its output to every neuron in the next layer. This contrasts with convolutional layers, where neurons send their output to only some of the neurons in the next layer.
Output Layer
The final layer in a network. It receives input from the previous hidden layer, optionally applies an activation function,
and returns an output representing your model’s prediction.
10.7 Weighted Input

A neuron's input equals the sum of weighted outputs from all neurons in the previous layer. Each input is multiplied
by the weight associated with the synapse connecting the input to the current neuron. If there are 3 inputs or neurons in the previous layer, each neuron in the current layer will have 3 distinct weights, one for each synapse.
Single Input

$$Z = Input \cdot Weight = XW$$

Multiple Inputs

$$Z = \sum_{i=1}^{n} x_i w_i = x_1w_1 + x_2w_2 + x_3w_3$$
Notice, it’s exactly the same equation we use with linear regression! In fact, a neural network with a single neuron
is the same as linear regression! The only difference is the neural network post-processes the weighted input with an
activation function.
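A one-line numpy sketch of the multiple-inputs case (the numbers are arbitrary):

import numpy as np

X = np.array([0.5, 0.3, 0.2])  # outputs from the previous layer
W = np.array([0.4, 0.7, 0.2])  # one weight per synapse
Z = np.dot(X, W)               # x1*w1 + x2*w2 + x3*w3 = 0.45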
10.8 Activation Functions

Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power: by modifying inputs with non-linear functions, neural networks can model highly complex relationships between features. Popular activation functions include relu and sigmoid.
Activation functions typically have the following properties:
• Non-linear - In linear regression we’re limited to a prediction equation that looks like a straight line. This is
nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in
our dataset were non-linear? (e.g. 𝑥2 , sin, log). To model these relationships we need a non-linear prediction
equation.1 Activation functions provide this non-linearity.
• Continuously differentiable—To improve our model with gradient descent, we need our output to have a nice
slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1
(perceptron), we wouldn’t know in which direction to update our weights to reduce our error.
• Fixed Range—Activation functions typically squash the input data into a narrow range that makes training the
model more stable and efficient.
A loss function, or cost function, is a wrapper around our model’s predict function that tells us “how good” the model
is at making predictions for a given set of parameters. The loss function has its own curve and its own derivatives. The
slope of this curve tells us how to change our parameters to make the model more accurate! We use the model to make
predictions. We use the cost function to update our parameters. Our cost function can take a variety of forms as there
are many different cost functions available. Popular loss functions include: MSE (L2) and Cross-entropy Loss.
Gradient accumulation is a mechanism to split the batch of samples used for training a neural network into several mini-batches of samples that will be run sequentially.
This is used to enable large batch sizes that require more GPU memory than is available. Gradient accumulation makes this possible by using mini-batches that each fit within the available GPU memory.
Gradient accumulation means running all mini-batches sequentially (generally on the same GPU) while accumulating
their calculated gradients and not updating the model variables - the weights and biases of the model. The model
variables must not be updated during the accumulation in order to ensure all mini-batches use the same model variable
values to calculate their gradients. Only after accumulating the gradients of all those mini-batches will we generate
and apply the updates for the model variables.
This results in the same updates for the model parameters as if we were to use the global batch.
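A minimal PyTorch-style sketch of this loop, assuming `model`, `loss_func`, `optimizer`, and a `loader` of mini-batches are defined as usual; the value of `accum_steps` is illustrative:

accum_steps = 4  # illustrative: number of mini-batches per global batch

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    output = model(inputs)
    # scale so the accumulated gradient matches the global-batch gradient
    loss = loss_func(output, targets) / accum_steps
    loss.backward()  # gradients are summed into the .grad buffers

    if (step + 1) % accum_steps == 0:
        optimizer.step()       # apply the accumulated update once
        optimizer.zero_grad()  # reset for the next global batch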
More details, a technical and algorithmic deep-dive, how-to tutorials, and examples can be found at [2].
References
Forwardpropagation
• Simple Network
– Steps
– Code
• Larger Network
– Architecture
– Weight Initialization
– Bias Terms
– Working with Matrices
– Dynamic Resizing
– Refactoring Our Code
– Final Result
Forward propagation is how neural networks make predictions. Input data is “forward propagated” through the network layer by layer to the final layer which outputs a prediction. For the toy neural network above, a single pass of forward propagation translates mathematically to:

$Prediction = A(\,A(\,X W_h\,)\,W_o\,)$

Where $A$ is an activation function like ReLU, $X$ is the input and $W_h$ and $W_o$ are weights.
11.1.1 Steps
1. Calculate the weighted input to the hidden layer by multiplying 𝑋 by the hidden weight 𝑊ℎ
2. Apply the activation function and pass the result to the final layer
3. Repeat step 2 except this time 𝑋 is replaced by the hidden layer’s output, 𝐻
11.1.2 Code
Let’s write a method feed_forward() to propagate input data through our simple network of 1 hidden layer. The output
of this method represents our model’s prediction.
def relu(z):
    return max(0, z)

def feed_forward(x):
    # Hidden layer
    Zh = x * Wh
    H = relu(Zh)

    # Output layer
    Zo = H * Wo
    output = relu(Zo)
    return output
x is the input to the network, Zo and Zh are the weighted inputs and Wo and Wh are the weights.
The simple network above is helpful for learning purposes, but in reality neural networks are much larger and more
complex. Modern neural networks have many more hidden layers, more neurons per layer, more variables per input,
more inputs per training set, and more output variables to predict. Here is a slightly larger network that will introduce
us to matrices and the matrix operations used to train arbitrarily large neural networks.
11.2.1 Architecture
To accommodate arbitrarily large inputs or outputs, we need to make our code more extensible by adding a few parameters to our network's __init__ method: inputLayerSize, hiddenLayerSize, outputLayerSize. We'll still limit ourselves to using one hidden layer, but now we can create layers of different sizes to respond to the different inputs or outputs.
INPUT_LAYER_SIZE = 1
HIDDEN_LAYER_SIZE = 2
OUTPUT_LAYER_SIZE = 2
Unlike last time where Wh and Wo were scalar numbers, our new weight variables will be numpy arrays. Each array
will hold all the weights for its own layer, one weight for each synapse. Below we initialize each array with numpy's np.random.randn(rows, cols) method, which returns a matrix of random numbers drawn from a normal distribution with mean 0 and variance 1.
def init_weights():
    Wh = np.random.randn(INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE) * \
                np.sqrt(2.0/INPUT_LAYER_SIZE)
    Wo = np.random.randn(HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE) * \
                np.sqrt(2.0/HIDDEN_LAYER_SIZE)
    return Wh, Wo

arr = np.random.randn(1, 2)

print(arr)
>> [[-0.36094661 -1.30447338]]

print(arr.shape)
>> (1,2)
As you’ll soon see, there are strict requirements on the dimensions of these weight matrices. The number of rows must
equal the number of neurons in the previous layer. The number of columns must match the number of neurons in the
next layer.
A good explanation of random weight initialization can be found in the Stanford CS231 course notes1 chapter on neural networks.
Bias terms allow us to shift our neuron’s activation outputs left and right. This helps us model datasets that do not
necessarily pass through the origin.
Using the numpy method np.full() below, we create two 1-dimensional bias arrays filled with the default value 0.1. The first argument to np.full is a tuple of array dimensions. The second is the default value for cells in the array.
def init_bias():
Bh = np.full((1, HIDDEN_LAYER_SIZE), 0.1)
Bo = np.full((1, OUTPUT_LAYER_SIZE), 0.1)
return Bh, Bo
To take advantage of fast linear algebra techniques and GPUs, we need to store our inputs, weights, and biases in
matrices. Here is our neural network diagram again with its underlying matrix representation.
1 http://cs231n.github.io/neural-networks-2/#init
What’s happening here? To better understand, let’s walk through each of the matrices in the diagram with an emphasis
on their dimensions and why the dimensions are what they are. The matrix dimensions above flow naturally from the
architecture of our network and the number of samples in our training set.
Matrix dimensions
Before we continue I want to point out how the matrix dimensions change with changes to the network architecture or
size of the training set. For example, let’s build a network with 2 input neurons, 3 hidden neurons, 2 output neurons,
and 4 observations in our training set.
Now let's use the same number of layers and neurons but reduce the number of observations in our dataset to 1 instance:

As you can see, the number of columns in all matrices remains the same. The only thing that changes is the number of rows in the layer matrices, which fluctuates with the size of the training set. The dimensions of the weight matrices remain unchanged. This shows us we can use the same network, the same lines of code, to process any number of observations.
Here is our new feed forward code which accepts matrices instead of scalar inputs.
def feed_forward(X):
    '''
    X    - input matrix
    Zh   - hidden layer weighted input
    Zo   - output layer weighted input
    H    - hidden layer activation
    yHat - output layer predictions
    '''
    # Hidden layer
    Zh = np.dot(X, Wh) + Bh
    H = relu(Zh)

    # Output layer
    Zo = np.dot(H, Wo) + Bo
    yHat = relu(Zo)
    return yHat
Weighted input
The first change is to update our weighted input calculation to handle matrices. Using dot product, we multiply the
input matrix by the weights connecting them to the neurons in the next layer. Next we add the bias vector using matrix
addition.
Zh = np.dot(X, Wh) + Bh
The first value in Bh is added to all the rows of the first column of the dot product of X and Wh. The second value in Bh is added to all the elements in the second column. The result is a new matrix, Zh, which has a column for every neuron in the hidden layer and a row for every observation in our dataset. Since all the layers in our network are fully-connected, there is one weight for every neuron-to-neuron connection between the layers.
The same process is repeated for the output layer, except the input is now the hidden layer activation H and the weights
Wo.
ReLU activation
The second change is to refactor ReLU to operate elementwise on matrices. It's only a small change, but it's necessary if we want to work with matrices. np.maximum() is actually extensible and can handle both scalar and array inputs.
def relu(Z):
return np.maximum(0, Z)
In the hidden layer activation step, we apply the ReLU activation function np.maximum(0,Z) to every cell in the
new matrix. The result is a matrix where all negative values have been replaced by 0. The same process is repeated
for the output layer, except the input is Zo.
Putting it all together we have the following code for forward propagation with matrices.
import numpy as np

INPUT_LAYER_SIZE = 1
HIDDEN_LAYER_SIZE = 2
OUTPUT_LAYER_SIZE = 2

def init_weights():
    Wh = np.random.randn(INPUT_LAYER_SIZE, HIDDEN_LAYER_SIZE) * \
                np.sqrt(2.0/INPUT_LAYER_SIZE)
    Wo = np.random.randn(HIDDEN_LAYER_SIZE, OUTPUT_LAYER_SIZE) * \
                np.sqrt(2.0/HIDDEN_LAYER_SIZE)
    return Wh, Wo

def init_bias():
    Bh = np.full((1, HIDDEN_LAYER_SIZE), 0.1)
    Bo = np.full((1, OUTPUT_LAYER_SIZE), 0.1)
    return Bh, Bo

def relu(Z):
    return np.maximum(0, Z)

def relu_prime(Z):
    '''
    Z - weighted input matrix

    Returns the gradient of Z where all
    negative values are set to 0 and
    all positive values are set to 1
    '''
    Z[Z < 0] = 0
    Z[Z > 0] = 1
    return Z

def feed_forward(X):
    '''
    X    - input matrix
    Zh   - hidden layer weighted input
    Zo   - output layer weighted input
    H    - hidden layer activation
    yHat - output layer predictions
    '''
    # Hidden layer
    Zh = np.dot(X, Wh) + Bh
    H = relu(Zh)

    # Output layer
    Zo = np.dot(H, Wo) + Bo
    yHat = relu(Zo)
    return yHat
References
Backpropagation
The goals of backpropagation are straightforward: adjust each weight in the network in proportion to how much it
contributes to overall error. If we iteratively reduce each weight’s error, eventually we’ll have a series of weights that
produce good predictions.
As seen above, forward propagation can be viewed as a long series of nested equations. If you think of feed forward this way, then backpropagation is merely an application of the chain rule to find the derivatives of cost with respect to any variable in the nested equation. Given a forward propagation function:

$f(x) = A(B(C(x)))$

A, B, and C are activation functions at different layers. Using the chain rule we easily calculate the derivative of $f(x)$ with respect to $x$:

$f'(x) = f'(A) \cdot A'(B) \cdot B'(C) \cdot C'(x)$
How about the derivative with respect to B? To find the derivative with respect to B you can pretend 𝐵(𝐶(𝑥)) is a
constant, replace it with a placeholder variable B, and proceed to find the derivative normally with respect to B.
This simple technique extends to any variable within a function and allows us to precisely pinpoint the exact impact
each variable has on the total output.
Let’s use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule
will help us identify how much each weight contributes to our overall error and the direction to update each weight to
reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost: the weighted input $Z = XW$, the activation $R = max(0, Z)$, and the cost $C(R)$, which compares the prediction $\hat{y} = R$ to the label $y$.

Given a network consisting of a single neuron, total cost could be calculated as the nested function:

$Cost = C(R(Z(W)))$

Using the chain rule we can easily find the derivative of Cost with respect to weight W.
$C'(W) = C'(R) \cdot R'(Z) \cdot Z'(W) = (\hat{y} - y) \cdot R'(Z) \cdot X$
Now that we have an equation to calculate the derivative of cost with respect to any weight, let’s go back to our toy
neural network example above
And just for fun, what if our network had 10 hidden layers. What is the derivative of cost for the first weight 𝑤1 ?
$C'(w_1) = \frac{dC}{d\hat{y}} \cdot \frac{d\hat{y}}{dZ_{11}} \cdot \frac{dZ_{11}}{dH_{10}} \cdot \frac{dH_{10}}{dZ_{10}} \cdot \frac{dZ_{10}}{dH_9} \cdot \frac{dH_9}{dZ_9} \cdots \frac{dH_1}{dZ_1} \cdot \frac{dZ_1}{dW_1}$
See the pattern? The number of calculations required to compute cost derivatives increases as our network grows
deeper. Notice also the redundancy in our derivative calculations. Each layer’s cost derivative appends two new terms
to the terms that have already been calculated by the layers above it. What if there was a way to save our work
somehow and avoid these duplicate calculations?
Memoization is a computer science term which simply means: don’t recompute the same thing over and over. In
memoization we store previously computed results to avoid recalculating the same function. It’s handy for speeding
up recursive functions of which backpropagation is one. Notice the pattern in the derivative equations below.
Each of these layers is recomputing the same derivatives! Instead of writing out long derivative equations for every
weight, we can use memoization to save our work as we backprop error through the network. To do this, we define 3
equations (below), which together encapsulate all the calculations needed for backpropagation. The math is the same,
but the equations provide a nice shorthand we can use to track which calculations we’ve already performed and save
our work as we move backwards through the network.
We first calculate the output layer error and pass the result to the hidden layer before it. After calculating the hidden
layer error, we pass its error value back to the previous hidden layer before it. And so on and so forth. As we move
back through the network we apply the 3rd formula at every layer to calculate the derivative of cost with respect that
layer’s weights. This resulting derivative tells us in which direction to adjust our weights to reduce overall cost.
Note: The term layer error refers to the derivative of cost with respect to a layer’s input. It answers the question: how
does the cost function output change when the input to that layer changes?
To calculate output layer error we need to find the derivative of cost with respect to the output layer input, 𝑍𝑜 . It
answers the question—how are the final layer’s weights impacting overall error in the network? The derivative is then:
$C'(Z_o) = (\hat{y} - y) \cdot R'(Z_o)$

To simplify notation, we call this quantity the output layer error, $E_o$:

$E_o = (\hat{y} - y) \cdot R'(Z_o)$
To calculate hidden layer error we need to find the derivative of cost with respect to the hidden layer input, Zh.
$C'(Z_h) = (\hat{y} - y) \cdot R'(Z_o) \cdot W_o \cdot R'(Z_h)$
Next we can swap in the 𝐸𝑜 term above to avoid duplication and create a new simplified equation for Hidden layer
error:
𝐸ℎ = 𝐸𝑜 · 𝑊𝑜 · 𝑅′ (𝑍ℎ )
This formula is at the core of backpropagation. We calculate the current layer’s error, and pass the weighted error back
to the previous layer, continuing the process until we arrive at our first hidden layer. Along the way we update the
weights using the derivative of cost with respect to each weight.
Let’s return to our formula for the derivative of cost with respect to the output layer weight 𝑊𝑜 .
$C'(W_o) = (\hat{y} - y) \cdot R'(Z_o) \cdot H$
We know we can replace the first part with our equation for output layer error 𝐸𝑜 . H represents the hidden layer
activation.
𝐶 ′ (𝑊𝑜 ) = 𝐸𝑜 · 𝐻
So to find the derivative of cost with respect to any weight in our network, we simply multiply the corresponding
layer’s error times its input (the previous layer’s output).
Note: Input refers to the activation from the previous layer, not the weighted input, Z.
Summary
Here are the final 3 equations that together form the foundation of backpropagation:

$E_o = (\hat{y} - y) \cdot R'(Z_o)$

$E_h = E_o \cdot W_o \cdot R'(Z_h)$

$C'(w) = E_{layer} \cdot input$
Here is the process visualized using our toy neural network example above.
def relu_prime(z):
if z > 0:
return 1
return 0
# Layer Error
Eo = (yHat - y) * relu_prime(Zo)
Eh = Eo * Wo * relu_prime(Zh)

# Cost derivative for weights
dWo = Eo * H
dWh = Eh * x

# Update weights
Wh -= lr * dWh
Wo -= lr * dWo
References
Activation Functions
• Linear
• ELU
• ReLU
• LeakyReLU
• Sigmoid
• Tanh
• Softmax
13.1 Linear
A straight line function where activation is proportional to input ( which is the weighted sum from neuron ).
Function: $R(z, m) = z \cdot m$

Derivative: $R'(z, m) = m$
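A minimal implementation matching the formulas above:

def linear(z, m):
    return m * z

def linear_prime(z, m):
    return m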
Pros
Cons
• For this function, the derivative is a constant. That means the gradient has no relationship with X.
• It is a constant gradient, and descent proceeds along that constant gradient.
• If there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in input, delta(x)!
13.2 ELU
Exponential Linear Unit, widely known as ELU, is a function that tends to converge cost to zero faster and produce more accurate results. Unlike other activation functions, ELU has an extra alpha constant, which should be a positive number.

ELU is very similar to ReLU except for negative inputs. They are both identity functions for non-negative inputs. For negative inputs, ELU smoothly approaches $-\alpha$, whereas ReLU has a sharp kink at zero.
Function:

$R(z) = \begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \leq 0 \end{cases}$

Derivative:

$R'(z) = \begin{cases} 1 & z > 0 \\ \alpha e^z & z < 0 \end{cases}$
import numpy as np

def elu(z, alpha):
    return z if z >= 0 else alpha * (np.exp(z) - 1)

def elu_prime(z, alpha):
    return 1 if z > 0 else alpha * np.exp(z)
Pros
• ELU smoothly approaches $-\alpha$ for negative inputs, whereas ReLU has a sharp kink at zero.
• ELU is a strong alternative to ReLU.
• Unlike ReLU, ELU can produce negative outputs.
Cons
• For x > 0, it can blow up the activation with the output range of [0, inf].
13.3 ReLU
A relatively recent invention which stands for Rectified Linear Units. The formula is deceptively simple: $max(0, z)$. Despite its name and appearance, it's not linear and provides the same benefits as Sigmoid but with better performance.
Function:

$R(z) = \begin{cases} z & z > 0 \\ 0 & z \leq 0 \end{cases}$

Derivative:

$R'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$
def relu(z):
return max(0, z)
def relu_prime(z):
return 1 if z > 0 else 0
Pros
Cons
• One of its limitations is that it should only be used within the hidden layers of a neural network model.
• Some gradients can be fragile during training and can die. This can cause a weight update that makes the neuron never activate on any data point again. Simply put, ReLU can result in dead neurons.
• In other words, for activations in the region x < 0 of ReLU, the gradient will be 0, so the weights will not get adjusted during descent. Neurons that go into that state stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
• The range of ReLU is [0, inf). This means it can blow up the activation.
Further reading
13.4 LeakyReLU
LeakyRelu is a variant of ReLU. Instead of being 0 when 𝑧 < 0, a leaky ReLU allows a small, non-zero, constant
gradient 𝛼 (Normally, 𝛼 = 0.01). However, the consistency of the benefit across tasks is presently unclear.1
Function:

$R(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}$

Derivative:

$R'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z < 0 \end{cases}$
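A minimal implementation matching the formulas above:

def leakyrelu(z, alpha):
    return max(alpha * z, z)

def leakyrelu_prime(z, alpha):
    return 1 if z > 0 else alpha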
Pros
• Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or
so).
1 http://cs231n.github.io/neural-networks-1/
Cons
• Because it retains some linearity, it can't be used for complex classification tasks. It lags behind Sigmoid and Tanh for some use cases.
Further reading
• Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He
et al. (2015)
13.5 Sigmoid
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s easy to work with and has all the
nice properties of activation functions: it’s non-linear, continuously differentiable, monotonic, and has a fixed output
range.
Function: $S(z) = \frac{1}{1 + e^{-z}}$

Derivative: $S'(z) = S(z) \cdot (1 - S(z))$
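A minimal implementation matching the formulas above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))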
Pros
• The output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear
function. So we have our activations bound in a range. Nice, it won’t blow up the activations then.
Cons
• Towards either end of the sigmoid function, the Y values respond much less to changes in X.
• It gives rise to the problem of "vanishing gradients".
• Its output isn't zero-centered, which makes the gradient updates go too far in different directions. Since 0 < output < 1, optimization is harder.
• Sigmoids saturate and kill gradients.
• The network refuses to learn further, or learning is drastically slow (depending on the use case, and until the gradient/computation gets hit by floating-point value limits).
Further reading
13.6 Tanh
Tanh squashes a real-valued number to the range [-1, 1]. It’s non-linear. But unlike Sigmoid, its output is zero-centered.
Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.1
Function: $tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Derivative: $tanh'(z) = 1 - tanh(z)^2$
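A minimal implementation matching the formulas above:

import numpy as np

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2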
Pros
• The gradient is stronger for tanh than sigmoid ( derivatives are steeper).
Cons

• Like sigmoid, tanh saturates for large inputs, so it can also suffer from the vanishing gradient problem.
13.7 Softmax
The softmax function calculates the probability distribution of an event over 'n' different events. In general terms, this function calculates the probability of each target class over all possible target classes. The calculated probabilities are then helpful for determining the target class for the given inputs.
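A minimal sketch of a numerically stable softmax over a numpy vector (shifting by the max is a stability convention and does not change the result):

import numpy as np

def softmax(z):
    # shift by the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)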
References
Layers
• BatchNorm
• Convolution
• Dropout
• Linear
• LSTM
• Pooling
• RNN
14.1 BatchNorm
BatchNorm accelerates convergence by reducing internal covariate shift inside each batch. If the individual observa-
tions in the batch are widely different, the gradient updates will be choppy and take longer to converge.
The batch norm layer normalizes the incoming activations and outputs a new batch where the mean equals 0 and
standard deviation equals 1. It subtracts the mean and divides by the standard deviation of the batch.
Code
import numpy as np

# a minimal sketch of the batch norm forward pass; gamma and beta
# are the learned scale and shift parameters
def batchnorm_forward(X, gamma, beta, eps=1e-8):
    N, D = X.shape
    mu = np.mean(X, axis=0)
    var = np.var(X, axis=0)

    X_mu = X - mu
    std_inv = 1. / np.sqrt(var + eps)

    # normalize, then scale and shift with the learned parameters
    X_norm = X_mu * std_inv
    out = gamma * X_norm + beta
    return out
Further reading
• Original Paper
• Implementing BatchNorm in Neural Net
• Understanding the backward pass through Batch Norm
14.2 Convolution
In a CNN, a convolution is a linear operation that involves multiplying a weight matrix (kernel/filter) with the input, and it does most of the heavy lifting.

A convolution layer consists of 2 major components: 1. Kernel (Filter) 2. Stride

1. Kernel (Filter): A convolution layer can have more than one filter. The size of the filter should be smaller than the size of the input dimension. This is intentional, as it allows the filter to be applied multiple times at different points (positions) on the input. Filters are helpful in understanding and identifying important features in the given input. Applying different filters (more than one filter) on the same input helps in extracting different features from that input. The output of multiplying the filter with the input is a two-dimensional array, and as such the output of this operation is called a "Feature Map".

2. Stride: This property controls the movement of the filter over the input. When the value is set to 1, the filter moves 1 column at a time over the input. When the value is set to 2, the filter jumps 2 columns at a time as it moves over the input.
Code
def conv_filter_output(input_img_section, filter_value):
    # performs elementwise multiplication of an input patch and the filter
    # and returns a single value of the feature map
    value = 0
    for i in range(len(filter_value)):
        for j in range(len(filter_value[0])):
            value = value + (input_img_section[i][j] * filter_value[i][j])
    return value

# img_mx is assumed to be a 2D numpy array holding the image pixels
filter = [[1, 0],
          [0, 1]]
filterX, filterY = len(filter), len(filter[0])

filtered_result = []
for i in range(0, len(img_mx) - filterX + 1):
    clm = []
    for j in range(0, len(img_mx[0]) - filterY + 1):
        clm.append(conv_filter_output(img_mx[i:i+filterX, j:j+filterY], filter))
    filtered_result.append(clm)

print(filtered_result)
Further reading
• cs231n reference
14.3 Dropout
A dropout layer takes the output of the previous layer's activations and randomly sets a certain fraction (the dropout rate) of the activations to 0, cancelling or 'dropping' them out.

It is a common regularization technique used to prevent overfitting in neural networks.

The dropout rate is a tunable hyperparameter that is adjusted to measure performance with different values. It is typically set between 0.2 and 0.5 (but may be arbitrarily set).

Dropout is only used during training; at test time no activations are dropped. Instead, activations are scaled down by a factor equal to the dropout rate, to account for more units being active at test time than at training time.
For example:
• A layer in a neural net outputs a tensor (matrix) A of shape (batch_size, num_features).
• The dropout rate of the layer is set to 0.5 (50%).
• A random 50% of the values in A will be set to 0.
• These will then be multiplied with the weight matrix to form the inputs to the next layer.
The premise behind dropout is to introduce noise into a layer in order to disrupt any interdependent learning or
coincidental patterns that may occur between units in the layer, that aren’t significant.
Code
# Training time: zero out 50% of the activations, then scale up
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5

# OR

layer_output *= 0.5  # Scaling down during TEST time instead
All references, images and code examples, unless mentioned otherwise, are from section 4.4.3 of Deep Learning with Python by François Chollet.
14.4 Linear
14.5 LSTM
14.6 Pooling
Pooling layers often take convolution layers as input. A complicated dataset with many objects will require a large number of filters, each responsible for finding a pattern in the image, so the dimensionality of the convolutional layer can get large. This causes an increase in parameters, which can lead to overfitting. Pooling layers are methods for reducing this high dimensionality. Just like the convolution layer, a pooling layer has a kernel size and a stride. The size of the kernel is smaller than the feature map; in most cases the kernel size is 2x2 and the stride is 2. There are mainly two types of pooling layers.

The first type is the max pooling layer. A max pooling layer takes a stack of feature maps (the convolution layer) as input. The value of each node in the max pooling layer is simply the maximum of the pixels contained in the window.

The other type is the average pooling layer, which calculates the average of the pixels contained in the window. It's not used often, but you may see it in applications for which smoothing an image is preferable.
Code
import numpy

def max_pooling(feature_map, size=2, stride=2):
    # output shape: reduced height and width, same depth
    pool_shape = ((feature_map.shape[0] - size) // stride + 1,
                  (feature_map.shape[1] - size) // stride + 1,
                  feature_map.shape[-1])
    pool_out = numpy.zeros(pool_shape)
    for layer in range(feature_map.shape[-1]):
        # for each layer
        row = 0
        for r in numpy.arange(0, feature_map.shape[0] - size + 1, stride):
            col = 0
            for c in numpy.arange(0, feature_map.shape[1] - size + 1, stride):
                pool_out[row, col, layer] = numpy.max(
                    feature_map[r:r+size, c:c+size, layer])
                col = col + 1
            row = row + 1
    return pool_out
14.7 RNN
References
Loss Functions
• Cross-Entropy
• Hinge
• Huber
• Kullback-Leibler
• MAE (L1)
• MSE (L2)
15.1 Cross-Entropy
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value
between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting
a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect
model would have a log loss of 0.
The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted
probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss
increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and
wrong!
Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error
rates between 0 and 1 they resolve to the same thing.
Code
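A minimal binary cross-entropy sketch, assuming yHat holds predicted probabilities and y holds 0/1 labels (the clipping constant is a numerical-stability assumption):

import numpy as np

def cross_entropy(yHat, y, eps=1e-12):
    # clip to avoid log(0)
    yHat = np.clip(yHat, eps, 1 - eps)
    return -(y * np.log(yHat) + (1 - y) * np.log(1 - yHat))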
Math
In binary classification, where the number of classes $M$ equals 2, cross-entropy can be calculated as:

$-(y \log(p) + (1 - y) \log(1 - p))$
If 𝑀 > 2 (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the
result.
$-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$
Note:
• M - number of classes (dog, cat, fish)
• log - the natural log
• y - binary indicator (0 or 1) if class label 𝑐 is the correct classification for observation 𝑜
• p - predicted probability observation 𝑜 is of class 𝑐
15.2 Hinge
Code
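A minimal hinge-loss sketch for a single prediction, assuming labels are encoded as -1/+1:

def hinge(yHat, y):
    # assumes labels y are encoded as -1 or +1
    return max(0, 1 - yHat * y)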
15.3 Huber
Typically used for regression. It’s less sensitive to outliers than the MSE as it treats error as square only inside an
interval.
{︂ 1
ˆ)2
2 (𝑦 − 𝑦 𝑖𝑓 |(𝑦 − 𝑦ˆ)| < 𝛿
𝐿𝛿 =
𝛿((𝑦 − 𝑦ˆ) − 12 𝛿) 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
Code
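A vectorized implementation of the piecewise formula above (delta defaults to 1 as an illustrative choice):

import numpy as np

def huber(yHat, y, delta=1.0):
    return np.where(np.abs(y - yHat) < delta,
                    0.5 * (y - yHat) ** 2,
                    delta * (np.abs(y - yHat) - 0.5 * delta))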
15.4 Kullback-Leibler
Code
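A minimal sketch of the KL divergence between two discrete distributions, assuming both are numpy arrays of probabilities that sum to 1 with no zero entries:

import numpy as np

def kl_divergence(p, q):
    # p: true distribution, q: predicted distribution
    return np.sum(p * np.log(p / q))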
Mean Absolute Error, or L1 loss. Excellent overview below [6] and [10].
Code
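A minimal MAE implementation, assuming yHat and y are numpy arrays of the same shape:

import numpy as np

def mae(yHat, y):
    return np.sum(np.absolute(yHat - y)) / y.size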
Mean Squared Error, or L2 loss. Excellent overview below [6] and [10].
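Code (a minimal implementation over numpy arrays):

import numpy as np

def mse(yHat, y):
    return np.sum((yHat - y) ** 2) / y.size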
References
Optimizers
What is an optimizer?

It is very important to tweak the weights of the model during the training process to make our predictions as correct and optimized as possible. But how exactly do you do that? How do you change the parameters of your model, by how much, and when?

The answer to all of the above questions is optimizers. They tie together the loss function and model parameters by updating the model in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into its most accurate possible form by futzing with the weights. The loss function is the guide to the terrain, telling the optimizer when it's moving in the right or wrong direction.

Below is a list of example optimizers:
• Adagrad
• Adadelta
• Adam
• Conjugate Gradients
• BFGS
• Momentum
• Nesterov Momentum
• Newton’s Method
• RMSProp
• SGD
16.1 Adagrad
Adagrad (short for adaptive gradient) adaptively sets the learning rate for each parameter.

• Parameters that have higher gradients or frequent updates should have a slower learning rate so that we do not overshoot the minimum value.
• Parameters that have low gradients or infrequent updates should have a faster learning rate so that they get trained quickly.
• It divides the learning rate by the sum of squares of all previous gradients of the parameter.
• When the sum of the squared past gradients has a high value, it divides the learning rate by a high value, so the learning rate becomes small.
• Similarly, if the sum of the squared past gradients has a low value, it divides the learning rate by a lower value, so the learning rate becomes high.
• This implies that the learning rate is inversely proportional to the sum of the squares of all the previous gradients of the parameter.
$g_t^i = \frac{\partial \mathcal{J}(w_t^i)}{\partial W}$

$W = W - \alpha \frac{\partial \mathcal{J}(w_t^i)}{\sqrt{\sum_{r=1}^{t} (g_r^i)^2 + \varepsilon}}$
Note:
• $g_t^i$ - the gradient of a parameter at iteration t
• $\alpha$ - the learning rate
• $\epsilon$ - very small value to avoid dividing by zero
def Adagrad(data, weights, lr=0.01, epsilon=1e-8, num_iterations=100):
    # compute_gradients is assumed to be defined elsewhere
    gradient_sums = np.zeros(weights.shape)
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)
        gradient_sums += gradients ** 2
        gradient_update = gradients / (np.sqrt(gradient_sums + epsilon))
        weights = weights - lr * gradient_update
    return weights
16.2 Adadelta
AdaDelta belongs to the family of stochastic gradient descent algorithms that provide adaptive techniques for hyperparameter tuning. Adadelta is probably short for 'adaptive delta', where delta here refers to the difference between the current weight and the newly updated weight.

The main disadvantage of Adagrad is its accumulation of the squared gradients. During the training process, the accumulated sum keeps growing. From the above formula we can see that as the accumulated sum increases, the learning rate shrinks and eventually becomes infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done.

With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
Implementation is something like this,
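A minimal sketch of one Adadelta step, assuming the gradients come from elsewhere and the running averages E_g2 and E_dx2 start at zero:

import numpy as np

def Adadelta(weights, gradients, E_g2, E_dx2, rho=0.95, epsilon=1e-6):
    # decaying average of squared gradients
    E_g2 = rho * E_g2 + (1 - rho) * gradients ** 2
    # update scaled by the ratio of RMS values; no learning rate needed
    update = -(np.sqrt(E_dx2 + epsilon) / np.sqrt(E_g2 + epsilon)) * gradients
    # decaying average of squared updates
    E_dx2 = rho * E_dx2 + (1 - rho) * update ** 2
    return weights + update, E_g2, E_dx2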
16.3 Adam
Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive
learning rates for each parameter and works as follows.
• First, it computes the exponentially weighted average of past gradients (𝑣𝑑𝑊 ).
• Second, it computes the exponentially weighted average of the squares of past gradients (𝑠𝑑𝑊 ).
• Third, these averages have a bias towards zero, and to counteract this a bias correction is applied ($v_{dW}^{corrected}$, $s_{dW}^{corrected}$).
• Lastly, the parameters are updated using the information from the calculated averages.
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1) \frac{\partial \mathcal{J}}{\partial W}$

$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) \left( \frac{\partial \mathcal{J}}{\partial W} \right)^2$

$v_{dW}^{corrected} = \frac{v_{dW}}{1 - (\beta_1)^t}$

$s_{dW}^{corrected} = \frac{s_{dW}}{1 - (\beta_2)^t}$

$W = W - \alpha \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \varepsilon}$
Note:
• 𝑣𝑑𝑊 - the exponentially weighted average of past gradients
• 𝑠𝑑𝑊 - the exponentially weighted average of past squares of gradients
• 𝛽1 - hyperparameter to be tuned
• 𝛽2 - hyperparameter to be tuned
• $\frac{\partial \mathcal{J}}{\partial W}$ - cost gradient with respect to the current layer
• 𝑊 - the weight matrix (parameter to be updated)
• 𝛼 - the learning rate
• 𝜖 - very small value to avoid dividing by zero
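A minimal sketch of one Adam step implementing the equations above; v, s, and the 1-based iteration count t are carried between calls:

import numpy as np

def Adam(weights, gradients, v, s, t,
         lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # exponentially weighted averages of gradients and squared gradients
    v = beta1 * v + (1 - beta1) * gradients
    s = beta2 * s + (1 - beta2) * gradients ** 2
    # bias correction
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    weights = weights - lr * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return weights, v, s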
16.5 BFGS
16.6 Momentum
Used in conjunction with Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent, Momentum takes into account past gradients to smooth out the update. This is seen in variable $v$, which is an exponentially weighted average of the gradient on previous steps. This results in minimizing oscillations and faster convergence.

$v_{dW} = \beta v_{dW} + (1 - \beta) \frac{\partial \mathcal{J}}{\partial W}$

$W = W - \alpha v_{dW}$
Note:
• 𝑣 - the exponentially weighted average of past gradients
• $\frac{\partial \mathcal{J}}{\partial W}$ - cost gradient with respect to the current layer weight tensor
• 𝑊 - weight tensor
• 𝛽 - hyperparameter to be tuned
• 𝛼 - the learning rate
16.9 RMSProp
Another adaptive learning rate optimization algorithm, Root Mean Square Prop (RMSProp) works by keeping an exponentially weighted average of the squares of past gradients. RMSProp then divides the learning rate by the square root of this average to speed up convergence.
$s_{dW} = \beta s_{dW} + (1 - \beta) \left( \frac{\partial \mathcal{J}}{\partial W} \right)^2$

$W = W - \alpha \frac{\frac{\partial \mathcal{J}}{\partial W}}{\sqrt{s_{dW}} + \varepsilon}$
Note:
• 𝑠 - the exponentially weighted average of past squares of gradients
• $\frac{\partial \mathcal{J}}{\partial W}$ - cost gradient with respect to the current layer weight tensor
• 𝑊 - weight tensor
• 𝛽 - hyperparameter to be tuned
• 𝛼 - the learning rate
• 𝜖 - very small value to avoid dividing by zero
16.10 SGD
SGD stands for Stochastic Gradient Descent. In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole dataset for each iteration. In Gradient Descent, the term "batch" denotes the total number of samples from a dataset that is used for calculating the gradient in each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Although using the whole dataset is really useful for getting to the minima in a less noisy or less random manner, the problem arises when our datasets get really huge.

Stochastic Gradient Descent solves this problem. SGD uses only a single sample to perform each iteration. The sample is randomly shuffled and selected for performing the iteration.

Since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than your typical Gradient Descent algorithm. But that doesn't matter much, because the path taken by the algorithm does not matter, as long as we reach the minima with significantly shorter training time.
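A minimal sketch of a single SGD step, assuming a helper grad_func that computes the loss gradient for one sample:

import numpy as np

def sgd_step(weights, data, labels, lr, grad_func):
    # pick a single sample at random and take one gradient step
    i = np.random.randint(len(data))
    gradient = grad_func(data[i], labels[i], weights)
    return weights - lr * gradient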
References
Regularization
• Data Augmentation
• Dropout
• Early Stopping
• Ensembling
• Injecting Noise
• L1 Regularization
• L2 Regularization
What is overfitting?
What is Regularization?
Having more data (samples) is the best way to get better, more consistent estimators (ML models). In the real world, getting a large volume of useful data for training a model is cumbersome, and labelling is an extremely tedious task.
Labelling typically requires manual annotation. For example, to create a better image classifier we can use Mechanical Turk and involve more manpower to generate a dataset, or run a survey on social media asking people to participate and generate data. These approaches can yield good datasets; however, they are difficult to carry out and expensive. Having a small dataset will lead to the well-known overfitting problem.
Data augmentation is an interesting regularization technique to resolve the above problem. The concept is very simple: this technique generates new training data from the given original dataset. Dataset augmentation provides a cheap and easy way to increase the amount of your training data.

This technique can be used for both NLP and CV.

In CV we can use techniques like jitter, PCA and flipping. Similarly, in NLP we can use techniques like synonym replacement, random insertion, random deletion and word embeddings.

It is worth knowing that Keras provides an ImageDataGenerator for data augmentation.
Sample code for random deletion
import random

def random_deletion(words, p):
    # randomly keep each word with probability 1 - p
    new_words = [word for word in words if random.uniform(0, 1) > p]

    # if you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return [words[rand_int]]

    return new_words
Furthermore, when comparing two machine learning algorithms, train both with either the augmented or the non-augmented dataset. Otherwise, no objective decision can be made on which algorithm performed better.
Further reading
17.2 Dropout
What is Dropout?
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data.

Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.

Simply put, it is the process of ignoring some of the neurons in a particular forward or backward pass.

Dropout can be easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 0.1) each weight update cycle.
Most importantly Dropout is only used during the training of a model and is not used when evaluating the model.
import numpy as np

A = np.arange(20).reshape((5,4))
# inverted dropout: keep each value with probability 0.5,
# then scale the survivors up by 1/0.5
mask = np.random.binomial(1, 0.5, size=A.shape)
print(A * mask / 0.5)
Given input:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
After Dropout:
[[ 0. 2. 0. 0.]
[ 8. 0. 0. 14.]
[16. 18. 0. 22.]
[24. 0. 0. 0.]
[32. 34. 36. 0.]]
Further reading
• Dropout https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
One of the biggest problems in training a neural network is deciding how long to train the model.

Training too little will lead to underfitting on the train and test sets. Training too much will lead to overfitting on the training set and poor results on the test set.

Here the challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not so long that it overfits the training data.

One possible solution is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best accuracy on the train or a holdout test dataset. The problem is that this requires multiple models to be trained and discarded.
Clearly, after 't' epochs, the model starts overfitting. This is clear from the increasing gap between the train and the validation error in the above plot.

One alternative technique to prevent overfitting is to use the validation error to decide when to stop. This approach is called Early Stopping.

While building the model, it is evaluated on the holdout validation dataset after each epoch. If the accuracy of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.
import numpy as np

def early_stopping(theta0, x_train, y_train, x_valid, y_valid, n=1, p=100):
    """The early stopping meta-algorithm for determining the best amount
    of time to train.

    Parameters:
        n: int; Number of steps between evaluations.
        p: int; "patience", the number of evaluations to observe a worsening
           validation set error before stopping.
        theta0: Network object; the initial network.

    Returns:
        theta_prime: Network object; The output network.
        i_prime: int; The number of iterations for the output network.
        v: float; The validation error for the output network.
    """
    # Initialize variables
    theta = theta0.clone()       # The active network
    i = 0                        # The number of training steps taken
    j = 0                        # The number of evaluations since the last
                                 # update of theta_prime
    v = np.inf                   # The best validation error observed so far
    theta_prime = theta.clone()  # The best network found so far
    i_prime = i                  # Training steps taken by the best network

    while j < p:
        # Update theta by running the training algorithm for n steps
        for _ in range(n):
            theta.train(x_train, y_train)

        # Update values
        i += n
        v_new = theta.error(x_valid, y_valid)

        # If better validation error, reset the waiting time, save the
        # network, and update the best error value
        if v_new < v:
            j = 0
            theta_prime = theta.clone()
            i_prime = i
            v = v_new
        else:
            j += 1

    return theta_prime, i_prime, v
Further reading
• Regularization
17.4 Ensembling
Ensemble methods combine several machine learning techniques into one predictive model. There are a few different
methods for ensembling, but the two most common are:
Bagging
• Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together
multiple estimates.
• It trains a large number of “strong” learners in parallel.
• A strong learner is a model that’s relatively unconstrained.
• Bagging then combines all the strong learners together in order to “smooth out” their predictions.
Boosting
• Boosting refers to a family of algorithms that are able to convert weak learners to strong learners.
• It trains a number of "weak" learners in sequence, where each one in the sequence focuses on learning from the mistakes of the one before it.
• Boosting then combines all the weak learners into a single strong learner.
Bagging uses complex base models and tries to “smooth out” their predictions, while boosting uses simple base models
and tries to “boost” their aggregate complexity.
Noise is often introduced to the inputs as a dataset augmentation strategy. When we have a small dataset the network
may effectively memorize the training dataset. Instead of learning a general mapping from inputs to outputs, the model
may learn the specific input examples and their associated outputs. One approach for improving generalization error
and improving the structure of the mapping problem is to add random noise.
Adding noise means that the network is less able to memorize training samples because they are changing all of the
time, resulting in smaller network weights and a more robust network that has lower generalization error.
Noise is only added during training. No noise is added during the evaluation of the model or when the model is used
to make predictions on new data.
Random noise can be added to other parts of the network during training. Some examples include:
• Noise on the output labels: in real-world datasets we can expect some amount of mistakes in the output labels. One way to remedy this is to explicitly model the noise on the labels.
Further reading
• Regularization
17.6 L1 Regularization
Let’s define a model to see how L1 Regularization works. For simplicity, We define a simple linear regression model
Y with one independent variable.
In this model, W represent Weight, b represent Bias.
𝑊 = 𝑤1 , 𝑤2 ...𝑤𝑛
𝑋 = 𝑥1 , 𝑥2 ...𝑥𝑛
𝑌̂︀ = 𝑤1 𝑥1 + 𝑤2 𝑥2 + ...𝑤𝑛 𝑥𝑛 + 𝑏
The following formula calculates the error without the regularization term:

$Loss = Error(Y, \hat{Y})$

With L1 regularization, a penalty proportional to the absolute value of the weights is added:

$Loss = Error(Y, \hat{Y}) + \lambda \sum_{i=1}^{n} |w_i|$
Note: Here, if the value of lambda is zero, the above loss function becomes Ordinary Least Squares, whereas a very large value makes the coefficients (weights) zero, and hence it under-fits.
One thing to note is that |𝑤| is differentiable when w!=0 as shown below,
$\frac{d|w|}{dw} = \begin{cases} 1 & w > 0 \\ -1 & w < 0 \end{cases}$
To understand the Note above,
Let’s substitute the formula in finding new weights using Gradient Descent optimizer.
$w_{new} = w - \eta \frac{\partial L1}{\partial w}$
When we apply the L1 in above formula it becomes,
$w_{new} = w - \eta \cdot \left( Error(Y, \hat{Y}) + \lambda \frac{d|w|}{dw} \right) = \begin{cases} w - \eta \cdot (Error(Y, \hat{Y}) + \lambda) & w > 0 \\ w - \eta \cdot (Error(Y, \hat{Y}) - \lambda) & w < 0 \end{cases}$
From the above formula,
• If w is positive, the regularization parameter $\lambda > 0$ will push w to be less positive, by subtracting $\lambda$ from w.
• If w is negative, the regularization parameter $\lambda > 0$ will push w to be less negative, by adding $\lambda$ to w. Hence this has the effect of pushing w towards 0.
Simple python implementation
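A minimal sketch of one gradient-descent update with an L1 penalty; `error_grad` (the gradient of the error term) is assumed to be computed elsewhere:

import numpy as np

def update_weights_l1(weights, error_grad, lr, lam):
    # subgradient of lambda * |w|: +lambda where w > 0, -lambda where w < 0
    weights = weights - lr * (error_grad + lam * np.sign(weights))
    return weights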
Use Case
L1 regularization (or a variant of this concept) is a model of choice when the number of features is high, since it provides sparse solutions. We can get a computational advantage because features with zero coefficients can simply be ignored.
Further reading
• Linear Regression
17.7 L2 Regularization
A regression model that uses the L2 regularization technique is called Ridge Regression. The main difference between L1 and L2 regularization is that L2 uses the "squared magnitude" of the coefficients as the penalty term in the loss function.
Let’s define a model to see how L2 Regularization works. For simplicity, We define a simple linear regression model
Y with one independent variable.
In this model, W represent Weight, b represent Bias.
𝑊 = 𝑤1 , 𝑤2 ...𝑤𝑛
𝑋 = 𝑥1 , 𝑥2 ...𝑥𝑛
$\hat{Y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$

With L2 regularization, a penalty proportional to the square of the weights is added to the loss:

$Loss = Error(Y, \hat{Y}) + \lambda \sum_{i=1}^{n} w_i^2$
Note: Here, if lambda is zero, then we get back OLS. However, if lambda is very large, then it will add too much weight, which leads to under-fitting.
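A minimal sketch of one gradient-descent update with an L2 penalty, under the same assumption that `error_grad` is computed elsewhere:

def update_weights_l2(weights, error_grad, lr, lam):
    # gradient of lambda * w^2 is 2 * lambda * w
    weights = weights - lr * (error_grad + 2 * lam * weights)
    return weights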
Use Case
L2 regularization can address the multicollinearity problem by constraining the coefficient norm and keeping all the variables. L2 regression can be used to estimate predictor importance and penalize predictors that are not important. One issue with collinearity is that the variance of the parameter estimates is huge. In cases where the number of features is greater than the number of observations, the matrix used in OLS may not be invertible, but Ridge Regression enables this matrix to be inverted.
Further reading
• Ridge Regression
References
Architectures
• Autoencoder
• CNN
• GAN
• MLP
• RNN
• VAE
18.1 Autoencoder
TODO: Description of Autoencoder use case and basic architecture. Figure from [1].
Model
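A minimal sketch of a fully-connected autoencoder in PyTorch; the flattened 28*28 input size and layer widths are illustrative assumptions, chosen so the reconstruction training loop below applies directly (with net = Autoencoder()):

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # compress the input down to a small latent code
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 32),
        )
        # reconstruct the input from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, 28*28),
            nn.Sigmoid(),
        )

    def forward(self, tens):
        return self.decoder(self.encoder(tens))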
Training
output = net(inputs)
loss = loss_func(output, inputs)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Further reading
• Convolutional Autoencoders
• Deep Learning Book
18.2 CNN
TODO: Description of CNN use case and basic architecture. Figure from [2].
Model
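A minimal CNN sketch in PyTorch; the channel counts and the single-channel 28x28 input are illustrative assumptions:

import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, t):
        t = self.pool(F.relu(self.conv1(t)))  # 28x28 -> 14x14
        t = self.pool(F.relu(self.conv2(t)))  # 14x14 -> 7x7
        t = t.view(t.size(0), -1)             # flatten for the classifier
        return self.fc(t)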
Training
Further reading
18.3 GAN
A Generative Adversarial Network (GAN) is a type of network which creates novel tensors (often images, voices, etc.). The generative portion of the architecture competes with the discriminator part of the architecture in a zero-sum game. The goal of the generative network is to create novel tensors which the adversarial network attempts to classify as real or fake. The goal of the generative network is to generate tensors for which the discriminator network determines the tensor has a 50% chance of being fake and a 50% chance of being real.
Figure from [3].
Model
Generator
class Generator(nn.Module):
def __init__(self):
        super().__init__()
self.net = nn.Sequential(
nn.ConvTranspose2d( 200, 32 * 8, 4, 1, 0, bias=False),
nn.BatchNorm2d(32 * 8),
nn.ReLU(),
nn.ConvTranspose2d(32 * 8, 32 * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(32 * 4),
nn.ReLU(),
nn.ConvTranspose2d( 32 * 4, 32 * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(32 * 2),
nn.ReLU(),
nn.ConvTranspose2d( 32 * 2, 32, 4, 2, 1, bias=False),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.ConvTranspose2d( 32, 1, 4, 2, 1, bias=False),
nn.Tanh()
)
def forward(self, tens):
return self.net(tens)
Discriminator
class Discriminator(nn.Module):
def __init__(self):
        super().__init__()
self.net = nn.Sequential(
nn.Conv2d(1, 32, 4, 2, 1, bias=False),
nn.LeakyReLU(0.2),
nn.Conv2d(32, 32 * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(32 * 2),
nn.LeakyReLU(0.2),
nn.Conv2d(32 * 2, 32 * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(32 * 4),
nn.LeakyReLU(0.2),
# state size. (32*4) x 8 x 8
nn.Conv2d(32 * 4, 32 * 8, 4, 2, 1, bias=False),
nn.BatchNorm2d(32 * 8),
nn.LeakyReLU(0.2),
# state size. (32*8) x 4 x 4
nn.Conv2d(32 * 8, 1, 4, 1, 0, bias=False),
nn.Sigmoid()
)
Training
output = netD(realtens)
errD_real = loss_func(output, label)
errD_real.backward() # backprop discriminator fake and real based on label
noise = torch.randn(b_size, 200, 1, 1, device=device)
fake = netG(noise)
label.fill_(0)
output = netD(fake.detach()).view(-1)
errD_fake = loss_func(output, label)
errD_fake.backward() # backprop discriminator fake and real based on label
errD = errD_real + errD_fake # discriminator error
optimizerD.step()
netG.zero_grad()
label.fill_(1)
output = netD(fake).view(-1)
errG = loss_func(output, label) # generator error
errG.backward()
optimizerG.step()
Further reading
18.4 MLP
A Multi Layer Perceptron (MLP) is a neural network with only fully connected layers. Figure from [5].
Model
class MLP(nn.Module):
def __init__(self):
super(MLP,self).__init__()
# define layers
self.fc1 = nn.Linear(in_features=28*28, out_features=500)
self.fc2 = nn.Linear(in_features=500, out_features=200)
self.fc3 = nn.Linear(in_features=200, out_features=100)
self.out = nn.Linear(in_features=100, out_features=10)
    def forward(self, t):
        # flatten the image, then pass through the fully connected layers
        # (F is torch.nn.functional)
        t = t.view(t.size(0), -1)
        t = F.relu(self.fc1(t))
        t = F.relu(self.fc2(t))
        t = F.relu(self.fc3(t))
        return self.out(t)
Training
output = net(inputs)
loss = loss_func(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# print statistics
running_loss = loss.item()
print('Training loss: %.3f' %( running_loss))
Evaluating
def main():
train_set = torchvision.datasets.FashionMNIST(
root = './FMNIST',
train = True,
download = False,
transform = transforms.Compose([
transforms.ToTensor()
])
)
mlp = MLP()
loader = torch.utils.data.DataLoader(train_set, batch_size = 1000)
optimizer = optim.Adam(mlp.parameters(), lr=0.01)
loss_func=nn.CrossEntropyLoss()
for i in range(0,15):
train(mlp,loader,loss_func,optimizer)
Further reading
TODO
18.5 RNN
Model
class RNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.hid_fc = nn.Linear(185, 128)
        self.out_fc = nn.Linear(185, n_classes)
        self.softmax = nn.LogSoftmax()

    def forward(self, inputs, hidden):
        # a minimal forward pass: concatenate the current input with
        # the previous hidden state, then compute both outputs
        combined = torch.cat([inputs.view(1, -1), hidden], dim=1)
        hid_out = self.hid_fc(combined)
        out = self.softmax(self.out_fc(combined))
        return out, hid_out
Training
In this example, our input is a list of last names, where each name is a variable-length array of one-hot encoded characters. Our target is a list of indices representing the class (language) of the name.
for p in model.parameters():
p.data.add_(-.001, p.grad.data)
Further reading
• Jupyter notebook
• Deep Learning Book
18.6 VAE
Autoencoders can encode an input image to a latent vector and decode it, but they can’t generate novel images.
Variational Autoencoders (VAE) solve this problem by adding a constraint: the latent vector representation should
model a unit gaussian distribution. The Encoder returns the mean and variance of the learned gaussian. To generate a
new image, we pass a new mean and variance to the Decoder. In other words, we “sample a latent vector” from the
gaussian and pass it to the Decoder. It also improves network generalization and avoids memorization. Figure from
[4].
Loss Function
The VAE loss function combines reconstruction loss (e.g. Cross Entropy, MSE) with KL divergence.
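A minimal sketch of this combined loss, assuming the encoder returns the mean and log-variance of the learned gaussian and recon_loss_func is the chosen reconstruction loss:

import torch

def vae_loss(output, target, mean, logvar, recon_loss_func):
    # reconstruction term (e.g. Cross Entropy or MSE)
    recon_loss = recon_loss_func(output, target)
    # KL divergence between N(mean, exp(logvar)) and the unit gaussian
    kl_loss = torch.mean(0.5 * torch.sum(
        torch.exp(logvar) + mean**2 - 1. - logvar, 1))
    return recon_loss + kl_loss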
Model
class VAE(nn.Module):
def __init__(self, in_shape, n_latent):
super().__init__()
self.in_shape = in_shape
self.n_latent = n_latent
c,h,w = in_shape
self.z_dim = h//2**2 # receptive field downsampled 2 times
self.encoder = nn.Sequential(
nn.BatchNorm2d(c),
nn.Conv2d(c, 32, kernel_size=4, stride=2, padding=1), # 32, 16, 16
nn.BatchNorm2d(32),
nn.LeakyReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), # 32, 8, 8
nn.BatchNorm2d(64),
nn.LeakyReLU(),
)
self.z_mean = nn.Linear(64 * self.z_dim**2, n_latent)
self.z_var = nn.Linear(64 * self.z_dim**2, n_latent)
self.z_develop = nn.Linear(n_latent, 64 * self.z_dim**2)
self.decoder = nn.Sequential(
            # a plausible completion: mirror the encoder and upsample
            # back to the input shape
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.ConvTranspose2d(32, c, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid()
        )
Training
optimizer.zero_grad()
loss.backward()
optimizer.step()
Further reading
• Original Paper
• VAE Explained
• Deep Learning Book
References
Classification Algorithms
Classification problems are those where our output Y is always a category, such as positive vs. negative in sentiment analysis, dog vs. cat in image classification, and disease vs. no disease in medical diagnosis.
19.1 Bayesian
Overlaps..
Intuitions
A decision tree works by successively splitting the dataset into small segments until the target variable is the same or until the dataset can no longer be split. It's a greedy algorithm which makes the best decision at the given time without concern for global optimality2.

The concept behind a decision tree is straightforward. The following flowchart shows a simple email classification system based on a decision tree. If the address is "myEmployer.com", it will classify the email as "Email to read when bored". Then, if the email contains the word "hockey", it will be classified as "Email from friends". Otherwise, it will be identified as "Spam: don't read". Image source2.
2 Machine Learning in Action by Peter Harrington
Algorithm Explained
There are various kinds of decision tree algorithms such as ID3 (Iterative Dichotomiser 3), C4.5 and CART (Classifi-
cation and Regression Trees). The constructions of decision tree are similar5 :
1. Assign all training instances to the root of the tree. Set current node to root node.
2. Find the split feature and split value based on the split criterion such as information gain, information gain ratio
or gini coefficient.
3. Partition all data instances at the node based on the split feature and threshold value.
4. Denote each partition as a child node of the current node.
5. For each child node:
1. If the child node is “pure” (has instances from only one class), tag it as a leaf and return.
2. Else, set the child node as the current node and recurse to step 2.
ID3 creates a multiway tree. For each node, it tries to find the categorical feature that will yield the largest information gain for the target variable.

C4.5 is the successor of ID3 and removes the restriction that features must be categorical by dynamically defining a discrete attribute that partitions the continuous attribute into a discrete set of intervals.

CART is similar to C4.5, but it differs in that it constructs binary trees and supports regression problems3.
The main differences are shown in the following table:
Code Implementation
We used object-oriented patterns to create the code for ID3, C4.5 and CART. We will first introduce the base class for
these three algorithms, then we explain the code of CART in details.
First, we create the base class TreeNode class and DecisionTree
class TreeNode:
def __init__(self, data_idx, depth, child_lst=[]):
self.data_idx = data_idx
self.depth = depth
self.child = child_lst
self.label = None
self.split_col = None
self.child_cate_order = None
class DecisionTree:
    def __init__(self, max_depth=10):
        # stop splitting below this depth
        self.max_depth = max_depth

    def fit(self, X, y):
"""
X: train data, dimensition [num_sample, num_feature]
y: label, dimension [num_sample, ]
"""
self.data = X
self.labels = y
num_sample, num_feature = X.shape
self.feature_num = num_feature
data_idx = list(range(num_sample))
# Set the root of the tree
self.root = TreeNode(data_idx=data_idx, depth=0, child_lst=[])
queue = [self.root]
while queue:
node = queue.pop(0)
# Check if the terminate criterion has been met
if node.depth>self.max_depth or len(node.data_idx)==1:
# Set the label for the leaf node
self.set_label(node)
else:
# Split the node
child_nodes = self.split_node(node)
if not child_nodes:
self.set_label(node)
else:
queue.extend(child_nodes)
The CART algorithm, when constructing the binary tree, will try searching for the feature and threshold that will yield the largest gain or the least impurity. The split criterion is a combination of the child nodes' impurity. For the child nodes' impurity, the gini coefficient or information gain are adopted in classification. For regression problems, mean-square-error or mean-absolute-error are used. Example code is shown below. For more details about the formulas, please refer to Mathematical formulation for decision tree in the scikit-learn documentation.
class CART(DecisionTree):
Introduction
K-Nearest Neighbor is a supervised learning algorithm both for classification and regression. The principle is to find
the predefined number of training samples closest to the new point, and predict the label from these training samples1 .
For example, when a new point comes, the algorithm will follow these steps:
1. Calculate the Euclidean distance between the new point and all training data
2. Pick the top-K closest training data
3. For regression problem, take the average of the labels as the result; for classification problem, take the most
common label of these labels as the result.
Code
Below is a NumPy implementation of a K-Nearest Neighbor function, reconstructed here as a runnable sketch: the func argument is assumed to aggregate the neighbors' labels (e.g. np.mean for regression, a majority vote for classification). Refer to the code example for details.
[1] https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two feature vectors.
    return np.sqrt(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def knn(training_data, target, k, func):
    # Step one: calculate the Euclidean distance between the new point and
    # every training example (each row stores the features, then the label).
    neighbors = []
    for index, data in enumerate(training_data):
        # distance between the target data and the current example from the data.
        distance = euclidean_distance(data[:-1], target)
        neighbors.append((distance, index))
    # Step two: pick the top-K closest training data.
    k_nearest = sorted(neighbors)[:k]
    k_nearest_labels = [training_data[index][-1] for _, index in k_nearest]
    # Step three: for regression problems, take the average of the labels as the
    # result; for classification problems, take the most common label.
    return func(k_nearest_labels)
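A brief usage sketch (the array below is made-up data whose last column is the label; np.mean and a majority vote stand in for the func argument):

import numpy as np
from collections import Counter

train = np.array([[1.0, 1.1, 0],
                  [1.2, 0.9, 0],
                  [3.0, 3.2, 1],
                  [3.1, 2.9, 1]])

# Regression-style aggregation: average the neighbor labels.
print(knn(train, target=[1.1, 1.0], k=3, func=np.mean))

# Classification-style aggregation: majority vote over neighbor labels.
majority = lambda labels: Counter(labels).most_common(1)[0][0]
print(knn(train, target=[1.1, 1.0], k=3, func=majority))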
19.6 Boosting
Clustering Algorithms
20.1 Centroid
20.2 Density
20.3 Distribution
20.4 Hierarchical
20.5 K-Means
Regression Algorithms
21.1 Ordinary Least Squares (OLS)
OLS is the method with which linear regression is performed. For every data point, the square of the difference between the observed value and the model's prediction is taken, and the summed loss function

l = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

is minimized.
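As a quick sketch (toy numbers, NumPy assumed), the OLS solution can be computed in closed form with np.linalg.lstsq:

import numpy as np

# Toy data: one feature plus an explicit bias column.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Minimize the summed squared residuals ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [slope, intercept]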
21.2 Polynomial
Polynomial regression is a modification of linear regression in which the existing features are mapped to a polynomial form. The problem remains a linear regression problem (the model is still linear in its weights), but the input vector is mapped to a higher-dimensional vector that serves as a pseudo-input vector.
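A minimal sketch of this mapping (NumPy assumed; np.vander builds the polynomial pseudo-inputs), fitting a cubic with ordinary least squares:

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100)
y = 0.5 * x ** 3 - x + rng.randn(100) * 0.5

# Map the single feature x to [x^3, x^2, x, 1]; the model is still linear in w.
X_poly = np.vander(x, N=4)
w, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(w)  # approximately [0.5, 0, -1, 0]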
21.3 Lasso
Lasso regression tries to reduce the ordinary least squares error similar to vanilla regression, but adds an extra term: the L1 norm of the weight vector, multiplied by a hyperparameter \alpha. This reduces model complexity and helps prevent overfitting.

l = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |w_j|
21.4 Ridge
Ridge regression is similar to lasso regression, but the regularization term uses the L2 norm of the weights instead.

l = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} w_j^2
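For illustration, both penalties are available in scikit-learn; the sketch below (synthetic data, alpha chosen arbitrarily) shows the characteristic difference: the L1 penalty drives some weights exactly to zero, while L2 only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_w = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ true_w + 0.1 * rng.randn(100)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # sparse: some coefficients are exactly 0
print(Ridge(alpha=0.1).fit(X, y).coef_)  # dense: all coefficients shrunk, none 0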
21.5 Stepwise
Stepwise regression, or spline regression, helps us fit a piecewise function to the data. It is usually used with linear models, but it can be generalized to higher degrees as well. The regression equation takes the form

y = ax + b(x - \bar{x})H_{\alpha} + c

where H_{\alpha} is a step function that switches on past the knot \bar{x}, so the slope changes from a to a + b there.
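A sketch of fitting this form with ordinary least squares (NumPy assumed; the knot is placed at \bar{x} and H is taken to be the step function):

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 200)
knot = x.mean()  # the knot x-bar where the slope is allowed to change
y = 1.5 * x + 4.0 * np.maximum(x - knot, 0.0) + 2.0 + 0.3 * rng.randn(200)

# Design matrix for y = a*x + b*(x - knot)*H + c, with H = 1 once x > knot.
H = (x > knot).astype(float)
A = np.column_stack([x, (x - knot) * H, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.5, 4.0, 2.0]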
Reinforcement Learning
Datasets
Public datasets in vision, NLP and more, forked from caesar0301's awesome datasets wiki.
• Agriculture
• Art
• Biology
• Chemistry/Materials Science
• Climate/Weather
• Complex Networks
• Computer Networks
• Data Challenges
• Earth Science
• Economics
• Education
• Energy
• Finance
• GIS
• Government
• Healthcare
• Image Processing
• Machine Learning
• Museums
• Music
• Natural Language
• Neuroscience
• Physics
• Psychology/Cognition
• Public Domains
• Search Engines
• Social Networks
• Social Sciences
• Software
• Sports
• Time Series
• Transportation
23.1 Agriculture
23.2 Art
23.3 Biology
• 1000 Genomes
• American Gut (Microbiome Project)
• Broad Bioimage Benchmark Collection (BBBC)
• Broad Cancer Cell Line Encyclopedia (CCLE)
• Cell Image Library
• Complete Genomics Public Data
• EBI ArrayExpress
• EBI Protein Data Bank in Europe
• Electron Microscopy Pilot Image Archive (EMPIAR)
• ENCODE project
• Ensembl Genomes
23.5 Climate/Weather
23.8 Data Challenges
• Bruteforce Database
• Challenges in Machine Learning
• CrowdANALYTIX dataX
• D4D Challenge of Orange
• DrivenData Competitions for Social Good
• ICWSM Data Challenge (since 2009)
• Kaggle Competition Data
23.10 Economics
23.11 Education
23.12 Energy
• AMPds
• BLUEd
• COMBED
• Dataport
• DRED
• ECO
• EIA
• HES - Household Electricity Study, UK
• HFED
• iAWE
• PLAID - the Plug Load Appliance Identification Dataset
• REDD
• Tracebase
• UK-DALE - UK Domestic Appliance-Level Electricity
• WHITED
23.13 Finance
23.14 GIS
23.15 Government
23.16 Healthcare
23.19 Museums
23.20 Music
23.22 Neuroscience
• FCP-INDI
• Human Connectome Project
• NDAR
• NeuroData
• Neuroelectro
• NIMH Data Archive
• OASIS
• OpenfMRI
• Study Forrest
23.23 Physics
23.24 Psychology/Cognition
23.25 Public Domains
• Amazon
• Archive-it from Internet Archive
• Archive.org Datasets
• CMU JASA data archive
• CMU StatLab collections
• Data.World
• Data360
• Datamob.org
• Google
• Infochimps
• KDNuggets Data Collections
• Microsoft Azure Data Market Free DataSets
• Microsoft Data Science for Research
• Numbray
• Open Library Data Dumps
• Reddit Datasets
• RevolutionAnalytics Collection
• Sample R data sets
• Stats4Stem R data sets
• StatSci.org
• The Washington Post List
• UCLA SOCR data collection
• UFO Reports
• Wikileaks 911 pager intercepts
• Yahoo Webscope
23.29 Software
• FLOSSmole data about free, libre, and open source software development
23.30 Sports
23.32 Transportation
Libraries
Machine learning libraries and frameworks forked from josephmisiti's awesome machine learning.
• APL
• C
• C++
• Common Lisp
• Clojure
• Elixir
• Erlang
• Go
• Haskell
• Java
• Javascript
• Julia
• Lua
• Matlab
• .NET
• Objective C
• OCaml
• PHP
• Python
• Ruby
• Rust
• R
• SAS
• Scala
• Swift
24.1 APL
24.2 C
• Darknet - Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install,
and supports CPU and GPU computation.
• Recommender - A C library for product recommendations/suggestions using collaborative filtering (CF).
• Hybrid Recommender System - A hybrid recommender system based upon scikit-learn algorithms.
Computer Vision
Speech Recognition
• HTK - The Hidden Markov Model Toolkit. HTK is a portable toolkit for building and manipulating hidden Markov models.
24.3 C++
Computer Vision
• DLib - DLib has C++ and Python interfaces for face detection and training general object detectors.
• EBLearn - Eblearn is an object-oriented C++ library that implements various machine learning models.
• OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android
and Mac OS.
• VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning library for volumes of
arbitrary dimensionality with Python bindings.
Natural Language Processing
• BLLIP Parser
• colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic lin-
guistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
• CRF++ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
• CRFsuite - An implementation of Conditional Random Fields (CRFs) for labeling sequential data.
• frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow
parser, morphological analyzer.
• libfolia - C++ library for the FoLiA format (https://github.com/LanguageMachines/libfolia)
• MeTA - MeTA: ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates mining big text data (https://github.com/meta-toolkit/meta)
• MIT Information Extraction Toolkit - C, C++, and Python tools for named entity recognition and relation ex-
traction
• ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports
FoLiA format.
Speech Recognition
• Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
Kaldi is intended for use by speech recognition researchers.
Sequence Analysis
• ToPS - This is an objected-oriented framework that facilitates the integration of probabilistic models for se-
quences over a user defined alphabet.
Gesture Detection
• grt - The Gesture Recognition Toolkit. GRT is a cross-platform, open-source, C++ machine learning library
designed for real-time gesture recognition.
24.5 Clojure
• Incanter - Incanter is a Clojure-based, R-like platform for statistical computing and graphics.
• PigPen - Map-Reduce for Clojure.
• Envision - Clojure Data Visualisation library, based on Statistiker and D3
24.6 Elixir
24.7 Erlang
24.8 Go
24.9 Haskell
24.10 Java
Natural Language Processing
• Apache cTAKES is an open-source natural language processing system for information extraction from elec-
tronic medical record clinical free-text.
• ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project
started at the Center for Computational Language and EducAtion Research, and is currently developed by the
Center for Language and Information Research at Emory University. This project is under the Apache 2 license.
• CogcompNLP - A collection of NLP libraries developed by the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities, which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc., and illinois-edison, a library for feature extraction from illinois-core-utilities data structures, among many other packages.
General-Purpose Machine Learning
• aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly.
• Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applica-
tions
• ELKI
• Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide
variety of networks, as well as support classes to normalize and process data for these neural networks. Encog
trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing
time. A GUI based workbench is also provided to help model and train neural networks.
• FlinkML in Apache Flink - Distributed machine learning library in Flink
• H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python,
Scala, REST/JSON.
• htm.java - General Machine Learning library using Numenta’s Cortical Learning Algorithm
• java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala
• Mahout - Distributed machine learning
• Meka.
• MLlib in Apache Spark - Distributed machine learning library in Spark
• Hydrosphere Mist - A service for deploying Apache Spark MLLib machine learning models as realtime, batch or reactive web services.
• Neuroph - Neuroph is a lightweight Java neural network framework
• ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for
real-time large-scale machine learning.
• SAMOA - SAMOA is a framework that includes distributed machine learning for data streams, with an interface to plug in different stream-processing platforms.
• RankLib - RankLib is a library of learning to rank algorithms
• rapaio - statistics, data mining and machine learning toolbox in Java
• RapidMiner - RapidMiner integration into Java code
• Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of
k classes.
• SmileMiner - Statistical Machine Intelligence & Learning Engine
• SystemML - A flexible, scalable machine learning (ML) language.
Speech Recognition
• CMU Sphinx - Open Source Toolkit For Speech Recognition purely based on Java speech recognition library.
Data Analysis / Data Visualization
• Flink - Open source platform for distributed stream and batch data processing.
• Hadoop - Hadoop/HDFS
• Spark - Spark is a fast and general engine for large-scale data processing.
• Storm - Storm is a distributed realtime computation system.
• Impala - Real-time Query for Hadoop
• DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and
data visualization.
• Dr. Michael Thomas Flanagan’s Java Scientific Library
Deep Learning
24.11 Javascript
Data Analysis / Data Visualization
• D3.js
• High Charts
• NVD3.js
• dc.js
• chartjs
• dimple
• amCharts
• D3xter - Straight forward plotting built on D3
• statkit - Statistics kit for JavaScript
• datakit - A lightweight framework for data analysis in JavaScript
• science.js - Scientific and statistical computing in JavaScript.
• Z3d - Easily make interactive 3d plots built on Three.js
• Sigma.js - JavaScript library dedicated to graph drawing.
• C3.js - Customizable library based on D3.js for easy chart drawing.
• Datamaps - Customizable SVG map/geo visualizations using D3.js.
• ZingChart - Library written on Vanilla JS for big data visualization.
• cheminfo - Platform for data visualization and analysis, using the visualizer project.
General-Purpose Machine Learning
• Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models. [Deep Learning]
• Clusterfck - Agglomerative hierarchical clustering implemented in Javascript for Node.js and the browser
• Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser
• Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm
• DN2A - Digital Neural Networks Architecture
• figue - K-means, fuzzy c-means and agglomerative clustering
• Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js
• Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser
• LDA.js - LDA topic modeling for node.js
• Learning.js - Javascript implementation of logistic regression/c4.5 decision tree
• Machine Learning - Machine learning library for Node.js
• machineJS - Automated machine learning, data formatting, ensembling, and hyperparameter optimization for
competitions and exploration- just give it a .csv file!
• mil-tokyo - List of several machine learning libraries
• Node-SVM - Support Vector Machine for nodejs
• Brain - Neural networks in JavaScript [Deprecated]
Misc
24.12 Julia
• DSP.
• JuliaCon Presentations - Presentations for JuliaCon
• SignalProcessing - Signal Processing tools for Julia
• Images - An image library for Julia
24.13 Lua
• Torch7
• cephes - Cephes mathematical functions library, wrapped for Torch. Provides and wraps the 180+ special
mathematical functions from the Cephes mathematical library, developed by Stephen L. Moshier. It is used,
among many other places, at the heart of SciPy.
• autograd - Autograd automatically differentiates native Torch code. Inspired by the original Python version.
• graph - Graph package for Torch
• randomkit - Numpy’s randomkit, wrapped for Torch
• signal - A signal processing toolbox for Torch-7. FFT, DCT, Hilbert, cepstrums, stft
• nn - Neural Network package for Torch
• torchnet - framework for torch which provides a set of abstractions aiming at encouraging code re-use as well
as encouraging modular programming
• nngraph - This package provides graphical computation for nn library in Torch7.
• nnx - A completely unstable and experimental package that extends Torch’s builtin nn library
• rnn - A Recurrent Neural Network library that extends Torch’s nn. RNNs, LSTMs, GRUs, BRNNs, BLSTMs,
etc.
• dpnn - Many useful features that aren’t part of the main nn package.
• dp - A deep learning library designed for streamlining research and development using the Torch7 distribution.
It emphasizes flexibility through the elegant use of object-oriented design patterns.
• optim - An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient, LBFGS, RProp and more.
• unsup.
• manifold - A package to manipulate manifolds
• svm - Torch-SVM library
• lbfgs - FFI Wrapper for liblbfgs
• vowpalwabbit - An old vowpalwabbit interface to torch.
• OpenGM - OpenGM is a C++ library for graphical modeling, and inference. The Lua bindings provide a simple
way of describing graphs, from Lua, and then optimizing them with OpenGM.
• Core torch7 demos repository:
  – linear-regression, logistic-regression
  – face detector (training and detection as separate demos)
  – mst-based-segmenter
  – train-a-digit-classifier
  – train-autoencoder
  – optical flow demo
  – train-on-housenumbers
  – train-on-cifar
  – tracking with deep nets
  – kinect demo
  – filter-bank visualization
  – saliency-networks
• Training a Convnet for the Galaxy-Zoo Kaggle challenge (CUDA demo)
• Music Tagging - Music Tagging scripts for torch7
• torch-datasets - Scripts to load several popular datasets, including BSR 500, CIFAR-10, COIL, Street View House Numbers, MNIST and NORB
• Atari2600 - Scripts to generate a dataset with static frames from the Arcade Learning Environment
24.14 Matlab
Computer Vision
• Contourlets - MATLAB source code that implements the contourlet transform and its utility functions.
• Shearlets - MATLAB code for shearlet transform
• Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to
represent images at different scales and different angles.
• Bandlets - MATLAB code for bandlet transform
• mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV library
24.15 .NET
Computer Vision
• OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications.
• Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run on Windows, Linux, Mac OS X, iOS, and Android.
• AForge.NET - Open source C# framework for developers and researchers in the fields of Computer Vision and
Artificial Intelligence. Development has now shifted to GitHub.
• Accord.NET - Together with AForge.NET, this library can provide image processing and computer vision al-
gorithms to Windows, Windows RT and Windows Phone. Some components are also available for Java and
Android.
Natural Language Processing
• Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available precompiled as a NuGet package.
General-Purpose Machine Learning
• Accord-Framework - The Accord.NET Framework is a complete framework for building machine learning, computer vision, computer audition, signal processing and statistical applications.
• Accord.MachineLearning - Support Vector Machines, Decision Trees, Naive Bayesian models, K-means, Gaus-
sian Mixture models and general algorithms such as Ransac, Cross-validation and Grid-Search for machine-
learning applications. This package is part of the Accord.NET Framework.
• DiffSharp for machine learning and optimization applications. Operations can be nested to any level, meaning
that you can compute exact higher-order derivatives and differentiate functions that are internally making use of
differentiation, for applications such as hyperparameter optimization.
• Vulpes - Deep belief and deep learning implementation written in F# and leverages CUDA GPU execution with
Alea.cuBase.
• Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide
variety of networks, as well as support classes to normalize and process data for these neural networks. Encog
trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing
time. A GUI based workbench is also provided to help model and train neural networks.
• Neural Network Designer - DBMS management system and designer for neural networks. The designer appli-
cation is developed using WPF, and is a user interface which allows you to design your neural network, query
the network, create and configure chat bots that are capable of asking questions and learning from your feed
back. The chat bots can even scrape the internet for information to return in their output as well as to use for
learning.
• Infer.NET - Infer.NET is a framework for running Bayesian inference in graphical models. One can use In-
fer.NET to solve many different kinds of machine learning problems, from standard problems like classifica-
tion, recommendation or clustering through to customised solutions to domain-specific problems. Infer.NET has
been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision,
and many others.
• numl - numl is a machine learning library intended to ease the use of using standard modeling techniques for
both prediction and clustering.
Data Analysis / Data Visualization
• Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide methods and algo-
rithms for numerical computations in science, engineering and every day use. Supports .Net 4.0, .Net 3.5 and
Mono on Windows, Linux and Mac; Silverlight 5, WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8
with PCL Portable Profiles 47 and 344; Android/iOS with Xamarin.
• Sho - An interactive environment for data analysis and scientific computing that enables fast and flexible prototyping. The environment includes powerful and efficient libraries for lin-
ear algebra as well as data visualization that can be used from any .NET language, as well as a feature-rich
interactive shell for rapid development.
24.16 Objective C
• YCML.
• MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X. MLPNeuralNet predicts new examples through a trained neural network. It is built on top of Apple's Accelerate Framework, using vectorized operations and hardware acceleration if available.
• MAChineLearning - An Objective-C multilayer perceptron library, with full support for training through back-
propagation. Implemented using vDSP and vecLib, it’s 20 times faster than its Java equivalent. Includes sample
code for use from Swift.
• BPN-NeuralNetwork - A back-propagation neural network. It can be used in product recommendation, user behavior analysis, data mining and data analysis.
• Multi-Perceptron-NeuralNetwork - A multilayer perceptron based on back-propagation, designed to support an unlimited number of hidden layers.
• KRHebbian-Algorithm - A self-learning (Hebbian) weight-adjustment algorithm for neural networks.
• KRKmeans-Algorithm - Implements the K-Means clustering and classification algorithm. It can be used in data mining and image compression.
• KRFuzzyCMeans-Algorithm - Implements the Fuzzy C-Means clustering / classification algorithm. It can be used in data mining and image compression.
24.17 OCaml
24.18 PHP
• PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural Network, Preprocessing,
Feature Extraction and much more in one library.
• PredictionBuilder - A library for machine learning that builds predictions using a linear regression.
• Rubix ML - A high-level machine learning and deep learning library for the PHP language.
24.19 Python
Computer Vision
Natural Language Processing
• NLTK - A leading platform for building Python programs to work with human language data.
• Pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.
• Quepy - A Python framework to transform natural language questions into queries in a database query language
• TextBlob - Provides a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
• YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.
• jieba - Chinese Words Segmentation Utilities.
• SnowNLP - A library for processing Chinese text.
• spammy - A library for email Spam filtering built on top of nltk
• loso - Another Chinese segmentation library.
• genius - A Chinese segment base on Conditional Random Field.
• KoNLPy - A Python package for Korean natural language processing.
• nut - Natural language Understanding Toolkit
• Rosetta
• BLLIP Parser
• PyNLPl - Python Natural Language Processing Library (https://github.com/proycon/pynlpl). General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably FoLiA, but also ARPA language models, Moses phrasetables, GIZA++ alignments.
• python-ucto
• python-frog
• python-zpar - Python bindings for ZPar (https://github.com/EducationalTestingService/python-zpar), a statistical part-of-speech tagger, constituency parser, and dependency parser for English.
• colibri-core - Python binding to a C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
• spaCy - Industrial strength NLP with Python and Cython.
• PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies.
• Distance - Levenshtein and Hamming distance computation
General-Purpose Machine Learning
• auto_ml - Automated machine learning for production and analytics. Lets you focus on the fun parts of ML, while outputting production-ready code and detailed analytics of your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep learning.
• machine learning - Automated build consisting of a web-interface and a set of programmatic-interfaces, with results stored in a NoSQL datastore (https://github.com/jeff1evesque/machine-learning).
• XGBoost Library
• Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python
• Featureforge - A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
• MLlib in Apache Spark - Distributed machine learning library in Spark
• Hydrosphere Mist - A service for deploying Apache Spark MLLib machine learning models as realtime, batch or reactive web services.
• scikit-learn - A Python module for machine learning built on top of SciPy.
• metric-learn - A Python module for metric learning.
• SimpleAI - Python implementation of many of the artificial intelligence algorithms described in the book "Artificial Intelligence, a Modern Approach". It focuses on providing an easy-to-use, well-documented and tested library.
• astroML - Machine Learning and Data Mining for Astronomy.
• graphlab-create - A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
• BigML - A library that contacts external servers.
• pattern - Web mining module for Python.
• NuPIC - Numenta Platform for Intelligent Computing.
• Pylearn2 - A Machine Learning library based on Theano (https://github.com/lisa-lab/pylearn2).
• keras - Modular neural network library based on Theano (https://github.com/fchollet/keras).
• Lasagne - Lightweight library to build and train neural networks in Theano.
• hebel - GPU-Accelerated Deep Learning Library in Python.
• Chainer - Flexible neural network framework
• Prophet - Fast and automated time series forecasting framework by Facebook.
• gensim - Topic Modelling for Humans.
• topik - Topic modelling toolkit
• Neural Networks and Deep Learning - Code samples for my book “Neural Networks and Deep Learning” [DEEP
LEARNING]
• Annoy - Approximate nearest neighbours implementation
• skflow - Simplified interface for TensorFlow, mimicking Scikit Learn.
• TPOT - Tool that automatically creates and optimizes machine learning pipelines using genetic programming.
Consider it your personal data science assistant, automating a tedious part of machine learning.
• pgmpy - A Python library for working with Probabilistic Graphical Models.
• DIGITS - A web application for training deep learning models.
• Orange - Open source data visualization and data analysis for novices and experts.
• MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware
Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.
• milk - Machine learning toolkit focused on supervised classification.
• TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
• REP - an IPython-based environment for conducting data-driven research in a consistent and reproducible way.
REP is not trying to substitute scikit-learn, but extends it and provides better user experience.
• rgf_python - Python bindings for the Regularized Greedy Forest (RGF) library.
• gym - OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms.
• skbayes - Python package for Bayesian Machine Learning with scikit-learn API
• fuku-ml - Simple machine learning library, including Perceptron, Regression, Support Vector Machine, Decision
Tree and more, it’s easy to use and easy to learn for beginners.
Data Analysis / Data Visualization
• SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
• NumPy - A fundamental package for scientific computing with Python.
• Numba - A Python compiler targeting LLVM, aimed at scientific Python, by the developers of Cython and NumPy.
• NetworkX - A high-productivity software for complex networks.
• igraph - binding to igraph library - General purpose graph library
• Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
• Open Mining
• PyMC - Markov Chain Monte Carlo sampling toolkit.
• zipline - A Pythonic algorithmic trading library.
• PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based
around NumPy, SciPy, IPython, and matplotlib.
• SymPy - A Python library for symbolic mathematics.
• statsmodels - Statistical modeling and econometrics in Python.
• astropy - A community Python library for Astronomy.
• matplotlib - A Python 2D plotting library.
• bokeh - Interactive Web Plotting for Python.
Neural networks
• Neural networks - NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks
that describe images with sentences.
• Neuron - Neural networks trained with gradient descent or the Levenberg–Marquardt algorithm.
• Data Driven Code - Very simple implementation of neural networks for dummies in python without using any
libraries, with detailed comments.
24.20 Ruby
• Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far
for Ruby
• Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It
includes a generic language-independent front end, a module for mapping language codes into language names,
and a module which contains various English-language utilities.
• Stemmer - Expose libstemmer_c to Ruby
• Ruby Wordnet - This library is a Ruby interface to WordNet
• Raspell - raspell is an interface binding for Ruby
• UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing
• Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets
Misc
24.21 Rust
• deeplearn-rs - deeplearn-rs provides simple networks that use matrix multiplication, addition, and ReLU under
the MIT license.
• rustlearn - a machine learning framework featuring logistic regression, support vector machines, decision trees
and random forests.
• rusty-machine - a pure-rust machine learning library.
• leaf - Open source framework for machine intelligence, sharing concepts from TensorFlow and Caffe (https://github.com/autumnai/leaf). Available under the MIT license. [Deprecated]
• RustNN - RustNN is a feedforward neural network library.
24.22 R
• e1071 - e1071: Misc functions of the Department of Statistics (e1071), TU Wien
• earth - earth: Multivariate Adaptive Regression Spline Models
• elasticnet - elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA
• ElemStatLearn - ElemStatLearn: Data sets, functions and examples from the book "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman
• evtree - evtree: Evolutionary Learning of Globally Optimal Trees
• forecast - forecast: Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models
• forecastHybrid - forecastHybrid: Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS,
and neural network models from the “forecast” package
• fpc - fpc: Flexible procedures for clustering
• frbs - frbs: Fuzzy Rule-based Systems for Classification and Regression Tasks
• GAMBoost - GAMBoost: Generalized linear and additive models by likelihood based boosting
• gamboostLSS - gamboostLSS: Boosting Methods for GAMLSS
• gbm - gbm: Generalized Boosted Regression Models
• glmnet - glmnet: Lasso and elastic-net regularized generalized linear models
• glmpath - glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model
• GMMBoost - GMMBoost: Likelihood-based Boosting for Generalized mixed models
• grplasso - grplasso: Fitting user specified models with Group Lasso penalty
• grpreg - grpreg: Regularization paths for regression models with grouped covariates
• h2o - A framework for fast, parallel, and distributed machine learning algorithms at scale – Deeplearning,
Random forests, GBM, KMeans, PCA, GLM
• hda - hda: Heteroscedastic Discriminant Analysis
• Introduction to Statistical Learning
• ipred - ipred: Improved Predictors
• kernlab - kernlab: Kernel-based Machine Learning Lab
• klaR - klaR: Classification and visualization
• lars - lars: Least Angle Regression, Lasso and Forward Stagewise
• lasso2 - lasso2: L1 constrained estimation aka ‘lasso’
• LiblineaR - LiblineaR: Linear Predictive Models Based On The Liblinear C/C++ Library
• LogicReg - LogicReg: Logic Regression
• Machine Learning For Hackers
• maptree - maptree: Mapping, pruning, and graphing tree models
• mboost - mboost: Model-Based Boosting
• medley - medley: Blending regression models, using a greedy stepwise approach
• mlr - mlr: Machine Learning in R
• mvpart - mvpart: Multivariate partitioning
• ncvreg - ncvreg: Regularization paths for SCAD- and MCP-penalized regression models
• nnet - nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
• oblique.tree - oblique.tree: Oblique Trees for Classification Data
• pamr - pamr: Pam: prediction analysis for microarrays
• party - party: A Laboratory for Recursive Partytioning
• partykit - partykit: A Toolkit for Recursive Partytioning
• penalized - penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model
• penalizedLDA - penalizedLDA: Penalized classification using Fisher’s linear discriminant
• penalizedSVM - penalizedSVM: Feature Selection SVM using penalty functions
• quantregForest - quantregForest: Quantile Regression Forests
• randomForest - randomForest: Breiman and Cutler’s random forests for classification and regression
• randomForestSRC
• rattle - rattle: Graphical user interface for data mining in R
• rda - rda: Shrunken Centroids Regularized Discriminant Analysis
• rdetools - rdetools: Relevant Dimension Estimation (RDE) in Feature Spaces
• REEMtree - REEMtree: Regression Trees with Random Effects for Longitudinal (Panel) Data
• relaxo - relaxo: Relaxed Lasso
• rgenoud - rgenoud: R version of GENetic Optimization Using Derivatives
• rgp - rgp: R genetic programming framework
• Rmalschains - Rmalschains: Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R
• rminer - rminer: Simpler use of data mining methods (e.g. NN and SVM) in classification and regression
• ROCR - ROCR: Visualizing the performance of scoring classifiers
• RoughSets - RoughSets: Data Analysis Using Rough Set and Fuzzy Rough Set Theories
• rpart - rpart: Recursive Partitioning and Regression Trees
• RPMM - RPMM: Recursively Partitioned Mixture Model
• RSNNS
• RWeka - RWeka: R/Weka interface
• RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression
• sda - sda: Shrinkage Discriminant Analysis and CAT Score Variable Selection
• SDDA - SDDA: Stepwise Diagonal Discriminant Analysis
• SuperLearner and subsemble - Multi-algorithm ensemble learning packages (https://github.com/ecpolley/SuperLearner).
• svmpath - svmpath: svmpath: the SVM Path algorithm
• tgp - tgp: Bayesian treed Gaussian process models
• tree - tree: Classification and regression trees
• varSelRF - varSelRF: Variable selection using random forests
• XGBoost.R - R binding for the eXtreme Gradient Boosting library
• Optunity - A library dedicated to automated hyperparameter optimization with a simple, lightweight API to
facilitate drop-in replacement of grid search. Optunity is written in Python but interfaces seamlessly to R.
• igraph - binding to igraph library - General purpose graph library
• MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware
Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript and more.
• TDSP-Utilities.
24.23 SAS
• Enterprise Miner - Data mining and machine learning that creates deployable models using a GUI or code.
• Factory Miner - Automatically creates deployable machine learning models across numerous market or customer
segments using a GUI.
• High Performance Data Mining - Data mining and machine learning that creates deployable models using a GUI
or code in an MPP environment, including Hadoop.
• High Performance Text Mining - Text mining using a GUI or code in an MPP environment, including Hadoop.
• enlighten-integration - Example code and materials that illustrate techniques for integrating SAS with other
analytics technologies in Java, PMML, Python and R.
• enlighten-deep - Example code and materials that illustrate using neural networks with several hidden layers in
SAS.
• dm-flow - Library of SAS Enterprise Miner process flow diagrams to help you learn by example about specific
data mining topics.
24.24 Scala
24.25 Swift
• Swift AI - Highly optimized artificial intelligence and machine learning library written in Swift.
• BrainCore - The iOS and OS X neural network framework
• swix - A bare bones library that includes a general matrix language and wraps some OpenCV for iOS develop-
ment.
• DeepLearningKit - An Open Source Deep Learning Framework for Apple's iOS, OS X and tvOS. It currently allows using deep convolutional neural network models trained in Caffe on Apple operating systems.
• AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear Regression, Support
Vector Machines, Neural Networks, PCA, KMeans, Genetic Algorithms, MDP, Mixture of Gaussians.
• MLKit - A simple Machine Learning Framework written in Swift. Currently features Simple Linear Regression,
Polynomial Regression, and Ridge Regression.
• Swift Brain - The first neural network / machine learning library written in Swift. This is a project for AI algorithms in Swift for iOS and OS X development. This project includes algorithms focused on Bayes theorem, neural networks, SVMs, matrices, etc.
Papers
• Machine Learning
• Deep Learning
– Understanding
– Optimization / Training Techniques
– Unsupervised / Generative Models
– Image Segmentation / Object Detection
– Image / Video
– Natural Language Processing
– Speech / Other
– Reinforcement Learning
– New papers
– Classic Papers
25.2.1 Understanding
25.2.2 Optimization / Training Techniques
• Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy [pdf]
• Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et
al. [pdf]
• Dropout: A simple way to prevent neural networks from overfitting (2014), N. Srivastava et al. [pdf]
• Adam: A method for stochastic optimization (2014), D. Kingma and J. Ba [pdf]
• Improving neural networks by preventing co-adaptation of feature detectors (2012), G. Hinton et al. [pdf]
• Random search for hyper-parameter optimization (2012) J. Bergstra and Y. Bengio [pdf]
25.2.4 Image Segmentation / Object Detection
• You only look once: Unified, real-time object detection (2016), J. Redmon et al. [pdf]
• Fully convolutional networks for semantic segmentation (2015), J. Long et al. [pdf]
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), S. Ren et al. [pdf]
• Fast R-CNN (2015), R. Girshick [pdf]
• Rich feature hierarchies for accurate object detection and semantic segmentation (2014), R. Girshick et al. [pdf]
• Semantic image segmentation with deep convolutional nets and fully connected CRFs, L. Chen et al. [pdf]
• Learning hierarchical features for scene labeling (2013), C. Farabet et al. [pdf]
25.2.5 Image / Video
• Image Super-Resolution Using Deep Convolutional Networks (2016), C. Dong et al. [pdf]
• A neural algorithm of artistic style (2015), L. Gatys et al. [pdf]
• Deep visual-semantic alignments for generating image descriptions (2015), A. Karpathy and L. Fei-Fei [pdf]
• Show, attend and tell: Neural image caption generation with visual attention (2015), K. Xu et al. [pdf]
• Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]
• Long-term recurrent convolutional networks for visual recognition and description (2015), J. Donahue et al.
[pdf]
• VQA: Visual question answering (2015), S. Antol et al. [pdf]
• DeepFace: Closing the gap to human-level performance in face verification (2014), Y. Taigman et al. [pdf]
• Large-scale video classification with convolutional neural networks (2014), A. Karpathy et al. [pdf]
• DeepPose: Human pose estimation via deep neural networks (2014), A. Toshev and C. Szegedy [pdf]
• Two-stream convolutional networks for action recognition in videos (2014), K. Simonyan et al. [pdf]
• 3D convolutional neural networks for human action recognition (2013), S. Ji et al. [pdf]
25.2.6 Natural Language Processing
• Neural Architectures for Named Entity Recognition (2016), G. Lample et al. [pdf]
• Exploring the limits of language modeling (2016), R. Jozefowicz et al. [pdf]
• Teaching machines to read and comprehend (2015), K. Hermann et al. [pdf]
• Effective approaches to attention-based neural machine translation (2015), M. Luong et al. [pdf]
• Conditional random fields as recurrent neural networks (2015), S. Zheng and S. Jayasumana. [pdf]
• Memory networks (2014), J. Weston et al. [pdf]
• Neural turing machines (2014), A. Graves et al. [pdf]
• Neural machine translation by jointly learning to align and translate (2014), D. Bahdanau et al. [pdf]
• Sequence to sequence learning with neural networks (2014), I. Sutskever et al. [pdf]
• Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho
et al. [pdf]
• A convolutional neural network for modeling sentences (2014), N. Kalchbrenner et al. [pdf]
• Convolutional neural networks for sentence classification (2014), Y. Kim [pdf]
• Glove: Global vectors for word representation (2014), J. Pennington et al. [pdf]
• Distributed representations of sentences and documents (2014), Q. Le and T. Mikolov [pdf]
• Distributed representations of words and phrases and their compositionality (2013), T. Mikolov et al. [pdf]
• Efficient estimation of word representations in vector space (2013), T. Mikolov et al. [pdf]
• Recursive deep models for semantic compositionality over a sentiment treebank (2013), R. Socher et al. [pdf]
• Generating sequences with recurrent neural networks (2013), A. Graves. [pdf]
25.2.7 Speech / Other
• End-to-end attention-based large vocabulary speech recognition (2016), D. Bahdanau et al. [pdf]
• Deep speech 2: End-to-end speech recognition in English and Mandarin (2015), D. Amodei et al. [pdf]
• Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]
• Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
(2012), G. Hinton et al. [pdf]
• Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G. Dahl et
al. [pdf]
• Acoustic modeling using deep belief networks (2012), A. Mohamed et al. [pdf]
25.2.10 Classic Papers
• An analysis of single-layer networks in unsupervised feature learning (2011), A. Coates et al. [pdf]
• Deep sparse rectifier neural networks (2011), X. Glorot et al. [pdf]
• Natural language processing (almost) from scratch (2011), R. Collobert et al. [pdf]
• Recurrent neural network based language model (2010), T. Mikolov et al. [pdf]
• Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising
criterion (2010), P. Vincent et al. [pdf]
• Learning mid-level features for recognition (2010), Y. Boureau [pdf]
• A practical guide to training restricted boltzmann machines (2010), G. Hinton [pdf]
• Understanding the difficulty of training deep feedforward neural networks (2010), X. Glorot and Y. Bengio [pdf]
• Why does unsupervised pre-training help deep learning (2010), D. Erhan et al. [pdf]
• Learning deep architectures for AI (2009), Y. Bengio. [pdf]
• Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009),
H. Lee et al. [pdf]
• Greedy layer-wise training of deep networks (2007), Y. Bengio et al. [pdf]
• A fast learning algorithm for deep belief nets (2006), G. Hinton et al. [pdf]
• Gradient-based learning applied to document recognition (1998), Y. LeCun et al. [pdf]
• Long short-term memory (1997), S. Hochreiter and J. Schmidhuber. [pdf]
Other Content
Books, blogs, courses and more forked from josephmisiti’s awesome machine learning
• Blogs
– Data Science
– Machine learning
– Math
• Books
– Machine learning
– Deep learning
– Probability & Statistics
– Linear Algebra
• Courses
• Podcasts
• Tutorials
26.1 Blogs
26.1.1 Data Science
• https://jeremykun.com/
• http://iamtrask.github.io/
• http://blog.explainmydata.com/
• http://andrewgelman.com/
• http://simplystatistics.org/
• http://www.evanmiller.org/
• http://jakevdp.github.io/
• http://blog.yhat.com/
• http://wesmckinney.com
• http://www.overkillanalytics.net/
• http://newton.cx/~peter/
• http://mbakker7.github.io/exploratory_computing_with_python/
• https://sebastianraschka.com/blog/index.html
• http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
• http://colah.github.io/
• http://www.thomasdimson.com/
• http://blog.smellthedata.com/
• https://sebastianraschka.com/
• http://dogdogfish.com/
• http://www.johnmyleswhite.com/
• http://drewconway.com/zia/
• http://bugra.github.io/
• http://opendata.cern.ch/
• https://alexanderetz.com/
• http://www.sumsar.net/
• https://www.countbayesie.com
• http://blog.kaggle.com/
• http://www.danvk.org/
• http://hunch.net/
• http://www.randalolson.com/blog/
• https://www.johndcook.com/blog/r_language_for_programmers/
• http://www.dataschool.io/
26.1.2 Machine learning
• OpenAI
• Distill
• Andrej Karpathy Blog
• Colah’s Blog
• WildML
• FastML
• TheMorningPaper
26.1.3 Math
• http://www.sumsar.net/
• http://allendowney.blogspot.ca/
• https://healthyalgorithms.com/
• https://petewarden.com/
• http://mrtz.org/blog/
26.2 Books
26.3 Courses
26.4 Podcasts
26.5 Tutorials
Contribute