Data Science Nigeria Machine and Deep Learning Study Guide
Data Science Nigeria is a non-profit organization run and managed by data scientists to
galvanize a data science knowledge revolution in Nigeria and beyond.
Our Approach
Data scientists in Nigeria and in the diaspora train and mentor Nigerians through face-to-face
and virtual coaching classes, offer project-based support, and run holiday boot camps funded
by individuals and corporate organizations. Data Science Nigeria is also leading the
PhD4Innovation Hub project, which is proactively driving the application of Big Data to
solving problems in areas such as financial inclusion, agriculture, health and social well-
being.
Our Successes
• Four Nigerian-centric data science products are being supported in the ecosystem
• High impact learning bootcamps, academic engagement and direct job placements
(full time, freelance and internships) for young Nigerian data scientists
• Strategic partnerships with leading firms including Microsoft, KPMG, the Nigerian
Contacts
Phone/WhatsApp: +2348140000853
Website: www.datasciencenigeria.org
Email: [email protected]
Twitter: Datasciencenig
Instagram: Datasciencenigeria
Facebook: facebook.com/datasciencenig
YouTube: https://goo.gl/Vcjjyp
LinkedIn: https://www.linkedin.com/company/datasciencenigeria/
Course Outline
Week 1 - Introduction to Machine Learning, Part 1
What is Machine Learning?
What Does It Actually Do, or Why Do We Need ML and AI?
Types of Machine Learning Algorithms
“A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P if its performance at tasks in T, as measured by P,
improves with experience E.”
Spam detection: Machine learning is expanding its reach to various domains, such as
spam detection. When Google Mail detects spam email automatically, it is the result of
machine learning techniques. Some other uses of machine learning are listed below.
Credit card fraud: Identifying 'unusual' activities on a credit card is often a machine
learning problem involving anomaly detection.
Facial Recognition: when Facebook automatically recognizes the faces of your friends in
a photo, a machine learning process is running in the background.
Customer segmentation: using the customer’s data gathered during a trial period of a
product helps to identify the customers who are most likely to subscribe to the paid
version of the product. This is a learning problem.
So, that was a quick introduction to machine learning and different types of machine
learning algorithms, and a few applications as well. In this course, we will be exploring
machine learning, deep learning and artificial intelligence. Python and its open source
technologies will be used as the primary programming language within this course.
Python is the world's fastest-growing programming language. Before 2007, there was no
built-in machine learning package in Python. David Cournapeau (also a contributor to
NumPy and SciPy) started Scikit-learn as part of a Google Summer of Code project. The
project now has many contributors and sponsors, including Google and the Python
Software Foundation. Scikit-learn provides a range of supervised and unsupervised
learning algorithms via a Python interface. Similarly, TensorFlow, also built by Google,
provides an open source machine learning framework with the power to handle deep
learning. We will be exploring all these packages and libraries one by one in this course,
starting from the ground up. First we will learn how to represent data in scikit-learn, and
then we will build highly optimized and efficient machine learning models.
4. Machine Learning Mastery is a very carefully laid out step-by-step guide to some
particular algorithms.
5. Andrew Ng's Course on Coursera, one of the best courses for machine learning
and deep learning in the world, explores all the math and theories behind these
fields.
This week we will be exploring the various open source libraries for data loading, data
manipulation, and exploratory data visualization.
The best all-in-one IDE for learning ML and data science with Python is the Jupyter
Notebook, which ships with the Anaconda distribution. Jupyter Notebook provides a highly
interactive notebook-style interface where inputs as well as outputs are displayed in the
same cell.
Jupyter Notebook
Here is a video link explaining the use and importance of Jupyter Notebook. To install
Jupyter Notebook, go to this link and download the version of Anaconda that matches
your operating system (OS) requirements (Windows/Mac/Linux). We will be using the
Python 3.6 version. Jupyter Notebook has many advantages.
Machine learning, maths and programming: to master machine learning, one has to be
good at both mathematics and programming.
Mathematics: to understand the machine learning algorithms and to choose the right
model, one needs to understand the math behind them. You do not need to understand
all the math, only some sub-branches:
linear algebra
probability theory
optimization
calculus
information theory and decision theory
Programming: programming skills are needed for the following tasks:
1. Using ML models
2. Building new models
3. Getting data from various sources
4. Cleaning the data
5. Choosing the right features and validating the data
Some programming languages are preferred over others for ML because they have a
larger number of libraries with most of the ML models already implemented. These
languages include:
Python
R (good but it has a slow run time)
MATLAB (good but costly and slow)
JULIA (Future best! Very fast, good, but limited libraries as it is new)
As mentioned, we will be using Python throughout this course. Some good books about
ML and deep learning are:
Coursera Courses:
Machine Learning
Deep Learning
Practice:
Kaggle
Analytics Vidhya
Driven Data
Papers:
arXiv.org e-Print archive
CVPR, NAACL
NIPS, ICLR, ICML etc
If you read and implement a lot of good papers (say 100) you will become an expert in
ML/DL. After that point you will be able to create your own algorithms and start
publishing your work.
Regression Analysis
This is the first lesson in which we will be specifically focusing on machine learning
algorithms and their practical implementation in Python using real world datasets. We
will start with Regression Analysis and Predictive Modelling.
We assume that you have basic knowledge of Python programming and have used the
basic libraries such as NumPy, SciPy, Pandas, Matplotlib and scikit-learn, as mentioned
in last week's notes. If you haven't, here are some links for you.
1. NumPy
2. SciPy
3. Pandas
4. Matplotlib
5. scikit-learn
First, we will have a quick introduction to building models in Python, and what better
way to start than one of the very basic models, linear regression? Linear regression will
be the first algorithm used and you will also learn more complex regression models.
Y = mX + b
In this equation,
Y is the dependent variable, or the variable we are trying to predict or estimate;
X is the independent variable, the variable we are using to make predictions;
m is the slope of the regression line, and it represents the effect X has on Y;
b is the intercept, the value of Y when X is zero.
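As a quick, hedged illustration of fitting this equation in Python (the sample data and variable names below are our own invention, not part of the course material), NumPy's least-squares routine can estimate m and b directly:

import numpy as np

# Made-up example data: X is the independent variable, Y the dependent variable.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with degree 1 returns the slope m and intercept b of the best-fit line.
m, b = np.polyfit(X, Y, 1)
print("Y = %.2f * X + %.2f" % (m, b))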
Simple Linear Regression (SLR) models also include the errors in the data (also known as
residuals). We won’t go into it too much now, but residuals are basically the differences
between the true value of Y and the predicted/estimated value of Y. It is important to
note that in a linear regression, we are trying to predict a continuous variable. In a
regression model, we are trying to minimize these errors by finding the “line of best
fit”, the regression line for which the errors are smallest. In other words, we are trying to
make the distances of the data points from the fitted line as close to zero as possible. This
is related to (or equivalent to) minimizing the mean squared error (MSE) or the sum
of squares of error (SSE), also called the residual sum of squares (RSS).
In most cases, we will have more than one independent variable, or multiple variables;
they can be as few as two independent variables and up to hundreds (or theoretically
even thousands) of variables. In those cases we will use a Multiple Linear Regression
model (MLR). The regression equation is pretty much the same as the simple regression
equation but with more variables:
Y = b0 + b1X1 + b2X2
Now we will perform a simple regression analysis on the Boston housing data by
exploring simple types of linear regression models. We will use the Boston Housing
dataset; this dataset contains information about the housing values in the suburbs of
Boston. This dataset was originally taken from the StatLib library maintained at Carnegie
Mellon University; it is now available on the UCI Machine Learning Repository. The UCI
machine learning repository contains many interesting datasets, and we encourage you
to explore it. We will use scikit-learn to import the Boston data as it contains a bunch
of useful datasets to practise with and we will also import a linear regression model
from scikit-learn. You may also code your own linear regression model as a function or
class in Python. It’s easy!
Prediction:
# Load the Boston housing data and split it into training and test sets.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

# Fit a linear regression model and predict prices for the test set.
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
expected = y_test

# Plot predicted vs. true prices; points near the dashed line are good predictions.
plt.figure(figsize=(15, 6))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
As we can see from the results above, the linear regression model is able to predict the
values reasonably well, fitting a straight line to the data.
Next, we will look at the various assumptions behind regression models. We will also
discuss what to do if these assumptions are violated: how do you build your linear
regression model then?
Regression is not just fitting a line to the predicted values or defining it as an equation
(y = m*x + b) like we just did. There is much more to it. Regression is considered to be
the simplest algorithm in machine learning. When we start playing with ML, most of us
start with regression, but it is not always understood well by new learners. It is
important, and your understanding of regression will reflect how well you understand
the math behind ML. Please note that machine learning is not just loading classes from
scikit-learn, fitting data and predicting targets. It is more than that.
So, how would you check whether your dataset fulfills all the regression assumptions?
To understand all the regression assumptions, here is a link. All the assumptions are well
explained. More Links for Learning:
8 ways to perform simple linear regression and measure their speed using Python
Exercises
This week, you have to test yourself in the Kaggle competition mentioned below. Have
fun!
Last week we discussed regression analysis, and we learnt about the assumptions, or so-
called limitations, of linear regression. Linear regression is considered the simplest
machine learning algorithm the world has ever seen, and yes, it is. We also discussed
how your model can give you poor predictions on real data if you don't obey the
assumptions of linear regression. Whatever you are going to predict, whether it is a stock
value, sales or some revenue, linear regression must be handled with care if you want
to get the best predictions from it. Linear regression assumes a linear relationship
between the features and the target. But real-world data is often non-linear. So, what
should we do? Should we try to bring non-linearity into the regression model, or check
the residuals and fitted values, keep applying transformations, and work harder and
harder to get the best predictive model using linear regression? This would be a pain and
it would take too much time.
Now, the question is, should that be considered the solution? Or is there another way
to deal with this, so that we can get a better predictive model without getting into these
assumptions of linear regression?
Some of us know about random forests and decision trees. They are very common. For
classification or regression, they often perform far better than other models.
This week we will learn about tree-based models such as decision trees, and ensemble tree-
based models including the Random Forest (RF), Gradient Boosted Tree (GBT), AdaBoost
Tree, and extreme gradient boosted tree, for regression analysis. Tree-based models have
proven themselves to be both reliable and effective and are now part of any modern
predictive modeler’s toolkit.
But there are some cases where a linear regression rests on more accurate assumptions
than tree-based models, such as the following:
2. When there are a very large number of features, especially with a very low signal
to noise ratio. Tree-based models have a little trouble modelling linear
combinations of a large number of features.
The point is, there are probably only a few cases in which linear models like SLR are
better than tree-based models or other non-linear models as these fit the data better
from the beginning without needing to use transformations.
Tree-based models are more forgiving in almost every way. We don’t need to scale the
data, and we don’t need to do any monotonic transformations (log, square root, etc).
We often don’t even need to remove outliers. We can throw in features, and it’ll
automatically partition the data if it aids the fit. We don’t have to spend any time
generating interaction terms as we do with linear models. And perhaps most importantly,
in most cases, tree-based models will probably be notably more accurate.
The bottom line is, we can spend hours playing with the data, generating features and
interaction variables and get a 77% R-squared; or, we can use “from sklearn.ensemble
import RandomForestRegressor” and in a few minutes get an 82% R-squared.
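As a rough sketch of that comparison (the dataset choice, random seeds and hyperparameters below are our own, and load_boston is only available in older scikit-learn versions), we can fit both models and compare their R-squared scores:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)

# Linear regression: fast, but assumes a linear relationship.
lr = LinearRegression().fit(X_train, y_train)
# Random forest: no scaling or transformations needed, captures non-linearity.
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Linear regression R^2:", lr.score(X_test, y_test))
print("Random forest R^2:", rf.score(X_test, y_test))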
Check out this link. Let me explain it with an example for clearer intuition. Linear
regression is a linear model, which means it works really nicely when
the data has a linear shape. But, when the data has a non-linear shape a linear model
cannot capture the non-linear features. So in this case, we can use the decision trees,
which do a better job of capturing the non-linearity in the data by dividing the space into
smaller sub-spaces depending on the questions asked.
Now, the question is, when do you use linear regression vs. decision trees? I suspect that
the Quora answer would do a better job than me of explaining the difference between
them and their applications. Let me quote that for you. Let’s suppose you are trying to
predict income. The predictor variables that are available are education, age, and city.
In a linear regression model, we have an equation with these three attributes. Fine.
You’d expect higher degrees of education, higher age and larger cities to be associated
with higher income. But what about a PhD who is 40 years old and living in Scranton,
Pennsylvania? Is that person likely to earn more than a person with a Bachelor of Science
degree who is 35 and living in the Upper West Side of New York City? Maybe not. Maybe
education totally loses its predictive power in a city like Scranton? Maybe age is a very
ineffective, weak variable in a city like New York? This is a case where decision trees are
handy. The tree can split by city and we get to use a different set of variables for each
city. Maybe age will be a strong second-level split variable in Scranton, but it might not
feature at all in the New York branch of the tree. Education may be a stronger variable
in New York. Decision trees, whether Random Forest (RF) or Gradient Boosting Model
(GBM), handle messier data and messier relationships better than regression models;
and there is seldom a dataset in the real world where relationships are not messy. No
wonder we will seldom see a linear regression model outperforming an RF or GBM. So,
this is the main idea behind tree (decision tree regression) and ensemble-based models
(forest regression/gradient boosting regression/extreme boosting regression).
In the GitHub link, you might have seen a number of other models and their comparisons
as well. These are all the regression models available in scikit-learn. As you can
see, GBM/RF perform the best.
Below are the links to almost all the regression techniques in scikit-learn:
There is a lot more we need to explore in regression analysis. Here are some links:
Last week we discussed regression modelling and the different types of regression
models. This week, we will compare regression analysis with classification analysis and
learn a few classification algorithms and when to implement them.
Let me explain this with the help of an example. In the last few lessons, we saw how
we can use regression to predict the price of a house based on its 'size' (square feet or
whatever unit) and, for example, its 'location', producing a numerical value for the
house’s price. This is regression.
Usually the difference lies in what's called the loss function. In both regression and
classification, the goal of the learning algorithm (linear model, support vector machine,
decision tree, etc.) is to minimize the loss function.
To understand it in a much better way, here are a few links we must dive deeply into:
Difference Between Classification and Regression in Machine Learning
Classification and Regression
Have you ever thought about how your email service is able to separate out something
as spam, or ham (not spam)?
The process behind it is to teach a model to identify the incoming email by training it
with millions of emails that have already been determined to be spam. To classify email
as spam, the following aspects are taken into consideration:
1. Whether the email contains spam related words like ‘lottery’, 'free' etc.
3. How often email with these words is received in this email account.
So in this case, after the system has been trained to identify emails that contain spam
or don’t contain spam, when new emails arrive in your inbox, each email will
automatically be classified as spam or not spam. Classification problems require items to
be divided into different categories based on past data. In a way, we’re solving a yes/no
problem. Classification can also involve more than two categories (multi-class or multi-label
classification), as discussed below.
Regression: With regression problems, the system attempts to predict a value for an
input based on past data. Unlike classification, we are predicting a value based on past
data, rather than classifying the data into different categories.
Let’s say that we wanted to predict whether it would rain, and if it does rain how much
rain we would get, maybe in centimeters, based on atmospheric variables such as
humidity, temperature, pressure, wind speed and wind direction.
Let's see some more common examples of using classification algorithms, which we will
be working with in our Python implementation.
Given a set of input features, predict whether a breast tumour is benign or malignant.
1. Binary classification: this is used when there are only two classes to predict,
usually 1 or 0 values.
2. Multi-class classification: this is used when there are more than two class
labels to predict. We call this a multi-class classification task. For example, predicting
three species of iris flower, or image classification with thousands of classes
(cat, dog, fish, car, and so on). See the short code sketch after this list.
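Here is a minimal classifier sketch in scikit-learn (our own example, not from the course notebook): a multi-class problem on the iris dataset. A binary problem such as benign/malignant tumours would follow exactly the same workflow with two classes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Logistic regression treats the three iris species as a multi-class problem.
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))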
Further Reading
This section provides more resources on the topic if you wish to go deeper.
How to Build a Machine Learning Classifier in Python with scikit-learn
Solving Multi-Label Classification Problems (case studies included)
Machine Learning Classification Strategy In Python
Python Machine Learning: Scikit-Learn Tutorial
Spot-Check Classification Machine Learning Algorithms in Python with scikit-learn
Exercises
For your practice this week, test yourself on the Kaggle competition mentioned below.
Titanic: Machine Learning from Disaster
Machine Learning Curriculum - Week 6
Supervised Learning
Machine learning algorithms are broadly divided into supervised and unsupervised
learning. Some of the well-known supervised machine learning algorithms are Support
Vector Machine (SVM), Linear Regression, Neural Network (NN) and Naive Bayes (NB).
In supervised learning, the training data is labelled, meaning that we already know the
target variable we are going to predict when we test the model.
Unsupervised Classification
In unsupervised learning the training data is not labelled and the system tries to learn
without a teacher. Some of the most important unsupervised techniques are clustering
(for example, k-means and hierarchical clustering) and association rule learning.
What Is Clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar in some way to each other
than to those in other groups (clusters). Clustering is a main task of exploratory data
mining, and a common technique used for statistical data analysis. Clustering is used in
many fields including machine learning, pattern recognition, image analysis, information
retrieval, bioinformatics, data compression, and computer graphics.
The scatter plot below is created from several "blobs" of different sizes and shapes,
showing the clusters that exist in the data.
We will discuss the K-means and hierarchical clustering algorithms.
K-means clustering
You might be wondering, ‘how do I decide the value of K in the first step?’
One of the methods, called the elbow method, can be used to decide on an optimal
number of clusters. To use it we run K-means clustering for a range of K values and
plot the percentage of variance explained on the Y-axis against “K” on the X-axis, as shown
in the figure below. In the example, adding more than 3 clusters does not noticeably
increase the variance explained.
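Here is a rough code sketch of the elbow method (the synthetic data and the choice of inertia, i.e. the within-cluster sum of squares, as the quantity to plot are our own assumptions): fit K-means for several values of K and look for the "elbow" where adding clusters stops helping much.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data drawn from three "blobs".
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(ks, inertias, 'o-')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')
plt.show()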
Unlike K-means clustering, hierarchical clustering starts by assigning every data point to
its own cluster and builds the hierarchy from there: it repeatedly finds the two nearest
clusters and merges them into one, as shown in the dendrogram below.
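For a quick, hedged sketch of this bottom-up process and the dendrogram it produces (synthetic data and parameter choices are our own), SciPy's hierarchy module can be used:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Agglomerative clustering: start with every point as its own cluster and
# repeatedly merge the two closest clusters (Ward linkage is used here).
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()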
Mean-Shift Clustering
Expectation–Maximization (EM) Clustering using Gaussian Mixture
Models (GMM)
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Anomaly detection algorithms detect observations that are significantly different from
most of what we've seen before. One classic example is detecting credit card fraud. How
can we automatically detect purchases that a legitimate credit card owner is very
unlikely to have made? Another example is in computer systems security. How can we
detect activity on a network that's unlikely to be caused be a legitimate user?
Anomaly detection is often done by building a probabilistic model of the data. This
means that you can see what the probability of observing every possible event is
according to your model. When you observe an event that has a sufficiently low
probability, the model will label it as anomalous.
Anomaly detection is utilized in a wide array of fields, such as the following use cases:
Types of Anomalies
Anomalies can be classified into four categories:
4. Change Points: This is unique to time series data and refers to points in time
where the typical pattern changes or evolves. Change points are not always considered
to be anomalies.
However, it is not easy to come up with a definition of normality that accounts for every
variation of normal patterns. Defining anomalies is harder still. Anomalies are rare
events, and it is not possible to have a prior knowledge of every type of anomaly.
Moreover, the definition of an anomaly varies across applications, although it is
commonly assumed that anomalies and normal points are generated from different
processes.
Another major obstacle in building and evaluating anomaly detection systems is the lack
of labeled datasets. Though anomaly detection has been a widely studied problem there
is still a lack of commonly agreed upon benchmark datasets. In many real-world
applications anomalies represent critical failures that are too costly and difficult to
obtain. In some domains it is sufficient to have tolerance levels, and any value outside
the tolerance intervals can be marked as an anomaly. In many cases labelling anomalies
is a time-consuming process and human experts with knowledge of the underlying
physical process are required to annotate anomalies. Anomaly detection for time series
presents its own unique challenges. This is mainly due to the issues inherent in time
series analysis, which is considered to be one of the ten most challenging problems in
data mining research.
For this week’s practice, build a recommendation system using the Kaggle dataset
below.
Sequential rule mining is a data mining technique that consists of discovering rules in
sequences. Sequential rule mining has many applications; for example, it is used for
analysing the behaviour of customers in supermarkets, users on a website, or
passengers at an airport.
Confidence is a measure of the reliability of the rule. A confidence of 0.5 in the above
example would mean that in 50% of the cases where Diaper and Gum were purchased,
the purchase also included Beer and Chips. For product recommendations, a 50%
confidence may be perfectly acceptable; but in a medical situation, this level may not be
high enough.
Lift is the ratio of the observed support to the level expected if the two rules were
independent. The basic rule of thumb is that a lift value close to 1 means the rules were
completely independent. Lift values > 1 are generally more “interesting” and could be
indicative of a useful rule pattern.
The Apriori algorithm is based on conditional probabilities and helps us determine the
likelihood of items being bought together based on a priori data. There are three
important parameters: support, confidence and lift. Suppose there is a set of
transactions with the rule item1 --> item2. The support for item1 is defined as
n(item1)/n(total transactions). Confidence, on the other hand, is defined as n(item1 &
item2)/n(item1). So, confidence tells us the strength of the association, and support tells
us the relevance of the rule, because we don’t want to include rules about items that are
seldom bought or, in other words, have low support. Lift is the confidence divided by the
support of item2. The higher the lift, the more significant the rule the Apriori algorithm
has found.
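A hedged sketch of these ideas in code, using the third-party mlxtend library (our choice of tool; the transactions below are invented purely for illustration), computes frequent itemsets and rules together with their support, confidence and lift:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions, invented for illustration only.
transactions = [
    ['diaper', 'gum', 'beer', 'chips'],
    ['diaper', 'gum', 'beer'],
    ['diaper', 'gum'],
    ['beer', 'chips'],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules ranked by lift.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric='lift', min_threshold=1.0)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])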
More Resources to explore:
Exercise:
For this week’s practice, enjoy applying the association mining algorithms in the Kaggle
competition below.
Collaborative Filtering
User-based collaborative filtering (UBCF) is a memory-based method that works on the
assumption that users with similar tastes in items will also rate items similarly.
Therefore, the missing ratings for a user can be predicted by finding other similar users
(a neighbourhood). Within the neighbourhood we can aggregate the neighbours’ ratings
of items unknown to the user, and use that as the basis for a prediction.
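A minimal sketch of that idea (the tiny ratings matrix below is made up): compute the cosine similarity between the target user and every other user, then predict an unknown rating as a similarity-weighted average of the neighbours' ratings.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are items; 0 means "not yet rated" (toy data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# Similarity of user 0 to every user (including itself).
sim = cosine_similarity(ratings)[0]

# Predict user 0's rating for item 2 as a similarity-weighted average
# over the other users who have rated that item.
others = [u for u in range(len(ratings)) if u != 0 and ratings[u, 2] > 0]
pred = sum(sim[u] * ratings[u, 2] for u in others) / sum(sim[u] for u in others)
print("Predicted rating of user 0 for item 2:", round(pred, 2))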
An inverted approach to nearest neighbours-based recommendations is item-based
collaborative filtering. Instead of finding the most similar users to each individual, an
algorithm assesses the similarities between items that are correlated in their ratings or
purchase profiles across all users.
Some additional starter articles to learning more about collaborative filtering can be
found here and here (http://recommender-systems.org/collaborative-filtering/).
Weaknesses: these algorithms do not work well with very sparse ratings
matrices. Additionally, they are computationally expensive, as the entire user
database needs to be processed to form recommendations. These
algorithms will not work from a cold start, since a new user has no historic data
profile or ratings for the algorithm to start from.
Data Requirements: this includes a user ratings profile containing items the user
has already rated/clicked/purchased. A "rating" can be defined however it fits
the business use case.
This is used to compute the unit-normalized TF-IDF vector for each item in the data set.
The model contains a mapping of item IDs to TF-IDF vectors, normalized to unit vectors,
for each item. The heart of the recommendation process is the scoring method of the
item scorer, which is TFIDF Item Scorer, scoring each item by using cosine similarity. The
score for an item is the cosine between that item's tag vector and the user's profile
vector.
In this variant, rather than just summing the vectors for all positively-rated items, a
weighted sum of the item vectors is computed over all rated items, with the weights
based on the user's ratings.
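Here is a hedged, simplified sketch of that scoring idea using scikit-learn's TF-IDF vectorizer (the item tags and ratings are invented): each item gets a unit-normalised TF-IDF vector, the user profile is a rating-weighted sum of the vectors of rated items, and every item is scored by its cosine similarity to that profile.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy item "tag documents", invented for illustration.
item_tags = {
    'item1': 'action thriller fast',
    'item2': 'action adventure',
    'item3': 'romance drama slow',
}
ids = list(item_tags)

# TfidfVectorizer L2-normalises each row by default, i.e. unit vectors.
vectors = TfidfVectorizer().fit_transform(item_tags.values()).toarray()

# The user's ratings for the items they have already seen (invented).
user_ratings = {'item1': 5.0, 'item3': 1.0}

# Rating-weighted sum of the rated items' vectors forms the user profile.
profile = sum(r * vectors[ids.index(i)] for i, r in user_ratings.items())

# Score every item by the cosine between its tag vector and the profile.
scores = cosine_similarity(vectors, profile.reshape(1, -1)).ravel()
for item, score in zip(ids, scores):
    print(item, round(score, 3))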
Exercises
For your practice for this week, build a recommendation system using the following
Kaggle datasets:
This week we will be getting into deep learning and we will start working with
TensorFlow, learning the basic workflow of using TensorFlow with a simple linear model.
After loading the so-called MNIST dataset with images of hand-written digits, we will
define and optimize a simple mathematical model in TensorFlow. The results will be
plotted and discussed as well. We expect that you are familiar with basic linear algebra;
otherwise here are some links for you:
Firstly, we will load the MNIST dataset, which is about 12 MB and will be downloaded
automatically using the following command:
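The command itself is not reproduced in these notes; in TensorFlow 1.x it typically looks like the sketch below (the download path and the one_hot flag are our assumptions):

from tensorflow.examples.tutorials.mnist import input_data

# Downloads the MNIST files on the first run and re-uses them afterwards.
data = input_data.read_data_sets("data/MNIST/", one_hot=True)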
The MNIST dataset has now been loaded and consists of about 70,000
images and associated labels (i.e. classifications of the images). The dataset is split into
3 mutually exclusive sub-sets. We will only use the training and test sets in this tutorial.
Size of:
- Training-set: 55000
- Test-set: 10000
- Validation-set: 5000
The labels are one-hot encoded, which means each label has been converted
from a single number to a vector whose length equals the number of possible classes.
All elements of the vector are zero except for the i'th element, which is one; for example,
the label 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. Read about one-hot encoding in this link.
Data dimensions
# MNIST images are 28 pixels in each dimension.
img_size = 28
# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size
# Tuple with the height and width of images, used to reshape arrays.
img_shape = (img_size, img_size)
# Number of classes, one class for each of the 10 digits.
num_classes = 10
Plot a few images to see if data is correct
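One possible sketch for that sanity check (assuming the data object loaded above, the variables defined under "Data dimensions", and matplotlib):

import matplotlib.pyplot as plt

# Show the first nine test images in a 3x3 grid with their true labels.
fig, axes = plt.subplots(3, 3)
for i, ax in enumerate(axes.flat):
    ax.imshow(data.test.images[i].reshape(img_shape), cmap='binary')
    ax.set_xlabel("True: %d" % data.test.labels[i].argmax())
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()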
The entire purpose of TensorFlow is to have a so-called computational graph that can
be executed much more efficiently than if the same calculations were to be performed
directly in Python. TensorFlow can be more efficient than NumPy because TensorFlow
knows the entire computation graph that must be executed, while NumPy only knows
the computation of a single mathematical operation at a time. A TensorFlow graph
consists of the following parts as listed below:
Placeholder variables
Placeholder variables serve as the inputs to the graph, which we may change each time
we execute the graph. First, we define the placeholder variable for the input images.
This allows us to change the images that are input to the TensorFlow graph. This is a so-
called tensor, which just means that it is a multi-dimensional vector or matrix. The data-
type is set to float32 and the shape is set to [None, img_size_flat], where None means
that the tensor may hold an arbitrary number of images with each image being a vector
of length img_size_flat.
Next, we have the placeholder variable for the true labels associated with the images
that were input in the placeholder variable x. The shape of this placeholder variable
is [None, num_classes]; this means it may hold an arbitrary number of labels and each
label is a vector of length num_classes, which is 10 in this case.
Finally, we have the placeholder variable for the true class of each image in the
placeholder variable x. These are integers and the dimensionality of this placeholder
variable is set to [None], which means the placeholder variable is a one-dimensional
vector of arbitrary length.
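Put together, the three placeholders described above might be written as follows (a sketch in TensorFlow 1.x syntax, matching the shapes given in the text):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, img_size_flat])     # input images
y_true = tf.placeholder(tf.float32, [None, num_classes])  # one-hot true labels
y_true_cls = tf.placeholder(tf.int64, [None])             # true classes as integers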
Variables to be optimized
Apart from the placeholder variables defined above, which serve to feed input data into
the model, there are also some model variables that must be changed by TensorFlow so
as to make the model perform better on the training data.
The first variable that must be optimized is called weights, and is defined here as
a TensorFlow variable that must be initialized with zeros and whose shape
is [img_size_flat, num_classes], so it is a 2-dimensional tensor (or matrix)
with img_size_flat rows and num_classes columns.
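In code this might look like the following one-line sketch (TensorFlow 1.x syntax):

weights = tf.Variable(tf.zeros([img_size_flat, num_classes]))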
The second variable that must be optimized is called biases, and is defined as a 1-
dimensional tensor (or vector) of length num_classes.
biases = tf.Variable(tf.zeros([num_classes]))
Model
This simple mathematical model multiplies the images in the placeholder variable x with
the weights and then adds the biases. The result is a matrix of shape [num_images,
num_classes] because x has shape [num_images, img_size_flat] and weights has
shape [img_size_flat, num_classes], so the multiplication of those two matrices is a
matrix with shape [num_images, num_classes] and then the biases vector is added to
each row of that matrix. Note that the name logits is typical TensorFlow terminology,
but other people may call the variable something else.
logits = tf.matmul(x, weights) + biases
Now logits is a matrix with num_images rows and num_classes columns, where the
element of the i'th row and j'th column is an estimate of how likely the i'th input image
is to be of the j'th class. However, these estimates are a bit rough and difficult to
interpret because the numbers may be very small or large, so we want to normalize
them so that each row of the logits matrix sums to one, and each element is limited
between zero and one. This is calculated using the so-called softmax function and the
result is stored in y_pred.
y_pred = tf.nn.softmax(logits)
The predicted class can be calculated from the y_pred matrix by taking the index of the
largest element in each row.
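In TensorFlow 1.x this row-wise argmax can be written as (a sketch):

y_pred_cls = tf.argmax(y_pred, axis=1)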
Cost-function to be optimized
To make the model better at classifying the input images, we must somehow change the
variables for weights and biases. The cross-entropy is a performance measure used in
classification. The goal of optimization is therefore to minimize the cross-entropy so it
gets as close to zero as possible by changing the weights and biases of the model.
TensorFlow has a built-in function for calculating the cross-entropy. Note that it uses the
values of the logits because it also calculates the softmax internally.
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_true)
We have now calculated the cross-entropy for each of the image classifications
so we have a measure of how well the model performs on each image individually. But
in order to use the cross-entropy to guide the optimization of the model's variables we
need a single scalar value, so we simply take the average of the cross-entropy for all the
image classifications.
cost = tf.reduce_mean(cross_entropy)
Optimization method
Now that we have a cost measure that must be minimized, we can then create an
optimizer. In this case it is the basic form of gradient descent where the step size is set
to 0.5. Note that optimization is not performed at this point. In fact, nothing is calculated
at all; we just add the optimizer-object to the TensorFlow graph for later execution.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
Performance measures
We need a few more performance measures to display the progress to the user. This is
a vector of booleans whether the predicted class equals the true class of each image.
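A sketch of these measures in TensorFlow 1.x, assuming the y_pred_cls and y_true_cls tensors defined earlier:

# Boolean vector: did we predict the correct class for each image?
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
# Cast the booleans to floats and average them to get the classification accuracy.
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))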
Conclusion
The model, after being trained for 1000 optimization iterations, with each iteration
using 100 images from the training set, gives an accuracy of about 91%. Refer to the
notebook for the full code to replicate the same results in your own work.
You have had your first look at how we can use TensorFlow, and maybe you are a bit
confused and could not understand all the terms very precisely.
Here is a link to all the keywords and the glossary used in deep learning. Read it and then
come back to the tutorial and go through it again to achieve a deeper understanding of
the material.
The notebook is available at the link; you can download it, run it and understand it. Here
are a few suggestions for exercises that will help you to improve your skills with
TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn
how to use it properly.
You may want to back-up this Notebook before you make any changes to it.
In the previous week, we showed that a simple linear model had about 91%
classification accuracy for recognizing hand-written digits in the MNIST data-set. This
week we will implement a simple convolutional neural network in TensorFlow that has
a classification accuracy of about 99%, or more if you complete some of the suggested
exercises. Convolutional networks work by moving small filters across the input image.
This means the filters are re-used for recognizing patterns throughout the entire input
image. This makes convolutional networks much more powerful than fully-connected
networks with the same number of variables. This in turn makes it faster to train
convolutional networks.
Here are a few links to get started with CNNs and explain the difference between
artificial neural networks (ANNs) and CNNs:
A Beginner's Guide to Understanding Convolutional Neural Networks
Convolutional Neural Networks (CNNs): An Illustrated Explanation
Convolutional Neural Network
The following figure shows how the data flows in a convolutional neural network.
The input image is processed in the first convolutional layer using the filter-weights. This results in 16 new images, one for each
filter in the convolutional layer. The images are also down-sampled so the image resolution is decreased from 28x28 to 14x14.
These 16 smaller images are then processed in the second convolutional layer. We need filter-weights for each of these 16
channels, and we need filter-weights for each output channel of this layer. There are 36 output channels so there are a total of
16 x 36 = 576 filters in the second convolutional layer. The resulting images are down-sampled again to 7x7 pixels.
The output of the second convolutional layer is 36 images of 7x7 pixels each. These are then flattened to a single vector of
length 7 x 7 x 36 = 1764, which is used as the input to a fully-connected layer with 128 neurons (or elements). This feeds into
another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image,
that is, which number is depicted in the image.
The convolutional filters are initially chosen at random, so the classification is done randomly. The error between the
predicted and true class of the input image is measured as the so-called cross-entropy. The optimizer then automatically
propagates this error back through the convolutional network using the chain rule of differentiation and updates the filter-
weights to improve the classification error. This is done iteratively thousands of times until the classification error is
sufficiently low.
Here are a few more links to understand how a CNN works. If you want to deep
dive into the methodology of a CNN model and its functioning, do give them a read.
The best explanation of Convolutional Neural Networks
How do Convolutional Neural Networks work?
Convolutional Layer
The above chart shows the basic idea of processing an image in the first convolutional
layer. The input image depicts the number 7, and four copies of the image are shown
here so we can see more clearly how the filter is being moved to different parts of the
image. For each position of the filter, the dot-product is being calculated between the
filter and the image pixels under the filter, which results in a single pixel in the output
image. So, moving the filter across the entire input image results in a new image being
generated.
Red filter-weights mean that the filter reacts positively to black pixels in the input image,
while blue filter-weights mean that the filter reacts negatively to black pixels. In this case
it appears that the filter recognizes the horizontal line of the digit 7, as can be seen from
its stronger reaction to that line in the output image.
The step size for moving the filter across the input is called the stride. There is a
stride for moving the filter horizontally (x-axis) and another stride for moving vertically
(y-axis). In the source code below the stride is set to 1 in both directions. This means the
filter starts in the upper left corner of the input image and is moved 1 pixel to the right
in each step. When the filter reaches the end of the image on the right side, the filter is
moved back to the left side and 1 pixel down the image. This continues until the filter
has reached the lower right corner of the input image and the entire output image has
been generated.
When the filter reaches the end of the right side as well as the bottom of the input
image, the image can be padded with zeroes (white pixels). This causes the output image
to be of the exact same dimension as the input image.
Note that the second convolutional layer is more complicated because it uses 16
input channels. We want a separate filter for each input channel, so we need 16 filters
instead of just one. Furthermore, we want 36 output channels from the second
convolutional layer, so in total we need 16 x 36 = 576 filters for the second convolutional
layer. It can be a bit challenging to understand how this works.
The configuration of the convolutional neural network is defined here for convenience,
so you can easily find and change these numbers and re-run the Notebook.
# Convolutional Layer 1.
filter_size1 = 5 # Convolution filters are 5 x 5 pixels.
num_filters1 = 16 # There are 16 of these filters.
# Convolutional Layer 2.
filter_size2 = 5 # Convolution filters are 5 x 5 pixels.
num_filters2 = 36 # There are 36 of these filters.
# Fully-connected layer.
fc_size = 128 # Number of neurons in fully-connected layer.
Steps for building and using CNN model on MNIST dataset
The following steps are explained in the notebook as well. Make sure that you
download the notebook for this week and replicate the steps yourself. Here is a
link to the notebook. Most of the steps are the same as in the previous week.
Conclusion
We have seen that a convolutional neural network works much better at recognizing
hand-written digits than the simple linear model we used last week. The convolutional
network has a classification accuracy of about 99%, or even more if you make some
adjustments, compared to only 91% for the simple linear model. However, the
convolutional network is also much more complicated to implement. Therefore, we
would like an easier way to program convolutional neural networks and we would also
like a better way of visualizing their inner workings. There are a lot of application
programming interfaces (APIs) for deep learning that make using and coding these
complex models much easier. One such API is Keras.
To code ANN/CNN/RNN models using Keras API is super easy. Go check it out yourself
using this link.
Do you get the exact same results if you run the Notebook multiple times without
changing any parameters? What are the sources of randomness?
Run another 10,000 optimization iterations. Are the results better?
Change the learning rate for the optimizer.
Change the configuration of the layers, such as the number of convolutional filters,
the size of those filters, the number of neurons in the fully-connected layer, etc.
Add a so-called drop-out layer after the fully-connected layer. Note that the drop-
out probability should be zero when calculating the classification accuracy, so you
will need a placeholder variable for this probability.
Change the order of ReLU and max-pooling in the convolutional layer. Does it
calculate the same thing? What is the fastest way of computing it? How many
calculations are saved? Does it also work for Sigmoid functions and average-pooling?
Add one or more convolutional and fully-connected layers. Does it help
performance?
What is the smallest possible configuration that still gives good results?
Try using ReLU in the last fully-connected layer. Does the performance change?
Why?
Try not using pooling in the convolutional layers. Does it change the classification
accuracy and training time?
Try using a 2x2 stride in the convolution instead of max-pooling. What is the
difference?
Remake the program yourself without looking too much at this source-code.
Explain to a friend how the program works.
There is a Kaggle competition that uses this dataset (check the scripts and forum
sections for sample code). You have to participate in this competition and see your
standing against the best data scientists in the world. Good luck!
Machine Learning Curriculum - Week 12
Recurrent neural networks are deep learning models with simple structures and
a built-in feedback mechanism; in other words, the output of a layer is added to
the next input and fed back into the same layer.
The Recurrent neural network is a specialized type of neural network that solves the
issue of maintaining context for sequential data -- such as weather data, stock prices,
genes, time series etc. At each step, the processing unit takes in an input and the current
state of the network and produces an output and a new state that is re-fed into the
network.
However, this model has some problems. It's very computationally expensive to
maintain the state for a large number of units, even more so over a long amount of time.
Additionally, recurrent networks are very sensitive to changes in their parameters. As
such, they are prone to problems during gradient descent optimization: the gradients
either grow exponentially (exploding gradients) or shrink to near zero and stabilize
(vanishing gradients), both problems that greatly harm a model's learning
capability.
This week we will cover only long short-term memory (LSTM) and its implementation
using TensorFlow. The implementation will be provided in a Jupyter notebook using the
same MNIST dataset that we have been using for the last three weeks on deep learning.
LSTM is one of the proposed solutions or upgrades to the recurrent neural network
model. It is an abstraction of how computer memory works. It is "bundled" with
whatever processing unit is implemented in the recurrent network, although outside of
its flow, and is responsible for keeping, reading, and outputting information for the
model. The way it works is simple: there is a linear unit, which is the information cell
itself, surrounded by three logistic gates responsible for maintaining the data. One gate
is for inputting data into the information cell, one is for outputting data from the
information cell, and the last one is to keep or forget data depending on the needs of the network.
Thanks to that, it not only solves the problem of keeping states, because the network
can choose to forget data whenever information is not needed; it also solves the
gradient problems, since the logistic gates have a very nice derivative.
the "input" or "write" gate, which handles the writing of data into the
information cell,
the "output" or "read" gate, which handles the sending of data back into the
recurrent network, and
the "keep" or "forget" gate, which handles the maintaining and modification of
the data stored in the information cell.
The three gates are the centerpiece of the LSTM unit. The gates, when activated by the
network, perform their respective functions. For example, the input gate will write
whatever data it is passed onto the information cell, the output gate will return
whatever data is in the information cell, and the keep gate will maintain the data in the
information cell. These gates are analog and multiplicative, and as such they can
modify the data based on the signals they are sent.
Building an LSTM with TensorFlow
MNIST Dataset
The function input_data.read_data_sets(...) loads the entire dataset and returns an
object tensorflow.contrib.learn.python.learn.datasets.mnist.DataSets
3. one output layer which converts a 128-dimensional output of the LSTM to 10-
dimensional output indicating a class label.
n_input = 28 # MNIST data input (img shape: 28*28)
n_steps = 28 # timesteps
n_hidden = 128 # hidden layer num of features
n_classes = 10 # MNIST total classes (0-9 digits)
learning_rate = 0.001
training_iters = 100000
batch_size = 100
display_step = 10
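A sketch of how the LSTM itself might be defined with these hyperparameters in TensorFlow 1.x (the placeholder and variable names are our assumption; the course notebook may differ slightly):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, n_steps, n_input])  # batches of 28-step sequences
y = tf.placeholder(tf.float32, [None, n_classes])

weights = {'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))}
biases = {'out': tf.Variable(tf.random_normal([n_classes]))}

# One LSTM cell with 128 hidden units, unrolled over the 28 time steps.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32)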
# The output of the RNN is a [100 x 28 x 128] tensor. We use a linear activation
# to map it to a [? x 10] matrix.
# Linear activation, using the RNN inner loop's last output:
# output [100 x 128] x weights [128 x 10] + biases [10]
output = tf.reshape(tf.split(outputs, 28, axis=1, num=None, name='split')[-1], [-1, 128])
return tf.matmul(output, weights['out']) + biases['out']
The whole implementation is available in the notebook here. Replicating it should give a
similar testing accuracy of about 96%.
Restricted Boltzmann Machine (RBM): RBMs are shallow neural nets that learn to
reconstruct data by themselves in an unsupervised fashion.
Simply, an RBM takes the inputs and translates them into a set of numbers that
represents them. Then these numbers can be translated back to reconstruct the inputs.
Through several forward and backward passes, the RBM will be trained, and a trained
RBM can reveal which features are the most important ones when detecting patterns.
First, let’s see what is different between discriminative and generative models.
Generative: using the example of cars, we can build a model of what sedan cars look
like. Then, looking at SUVs, we can build a separate model of what SUV cars look like.
Finally, to classify a new car, we can compare the new car to the sedan model, and
compare it to the SUV model to see whether the new car looks more like an SUV or a
sedan.
In the supervised task, we first form a model for P(x|y), where y is the label for
x. For example, if y indicates whether an example is an SUV (0) or a sedan (1),
then p(x|y = 0) models the distribution of SUVs’ features, and p(x|y = 1) models
the distribution of sedans’ features. If we manage to find P(x|y) and P(y), then
we can use Bayes rule to estimate P(y|x), because: p(y|x) = p(x|y)p(y)/p(x)
RBM layers
An RBM has two layers. The first layer of the RBM is called the visible or input layer.
MNIST images have 784 pixels, so the visible layer must have 784 input nodes. The
second layer is the hidden layer, which possesses i neurons in this case. Each hidden unit
has a binary state, which we’ll call it si, and turns either on or off (i.e., si = 1 or si = 0)
with a probability that is a logistic function of the inputs it receives from the other j
visible units, called for example, p (si = 0) which is also shown in the figure below.
Each node in the first layer also has a bias, which is shared among all visible units. The
second layer likewise has its own bias, shared among all hidden units.
So, for example, if we have 784 units in the visible layer, it will generate a probability
distribution over all the 784 possible visible vectors, i.e, p(v).
It would be really cool if a model, after training, can calculate the probability of a visible
layer, given the hidden layer values.
There are two phases in training an RBM: 1) the forward pass, and 2) the backward pass,
or reconstruction.
Phase 1 forward pass: Processing happens in each node in the hidden layer. That is, input
data from all the visible nodes are being passed to all the hidden nodes. This
computation begins by making stochastic decisions about whether to transmit that input
or not (i.e. to determine the state of each hidden layer). At the hidden layer's nodes, X is
multiplied by a W and added to h_bias. The result of those two operations is fed into
the sigmoid function, which produces the node’s output/state. As a result, one output
is produced for each hidden node. So, for each row in the training set, a tensor of
probabilities is generated, which in our case is of size [1x500]; over the whole training
set this gives 55,000 such vectors (h0 = [55000x500]).
Then, we take the tensor of probabilities (from a sigmoidal activation) and make samples
from all the distributions, h0. That is, we sample the activation vector from the
probability distribution of hidden layer values. Samples are used to estimate the
negative phase gradient; this will be explained later.
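A hedged sketch of this forward pass in TensorFlow 1.x (the tensor names and the 500-unit hidden layer follow the text; the sign/uniform sampling trick is one common way to draw binary states from the probabilities):

import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 784])   # visible layer (MNIST pixels)
W = tf.placeholder(tf.float32, [784, 500])    # weights between the two layers
h_bias = tf.placeholder(tf.float32, [500])    # hidden-layer bias

# Probability that each hidden unit turns on, given the visible units.
h0_prob = tf.nn.sigmoid(tf.matmul(X, W) + h_bias)

# Sample binary hidden states from those probabilities (the stochastic decision).
h0 = tf.nn.relu(tf.sign(h0_prob - tf.random_uniform(tf.shape(h0_prob))))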
Exercises
This week, you will build an RBM on MNIST datasets using the code given in the
notebook, and compare it with the simple model, CNN model and LSTM model we built
earlier.
Machine Learning Curriculum - Week 14
m^(−p/(2p+d))
where:
m: number of data points
d: dimensionality of the data
p: a parameter that depends on the model
As you can see, the amount of data required grows exponentially with the dimensionality
d. Returning to our example, we don't need to
use all of the 65,536 dimensions to classify an emotion. A human identifies emotions
according to some specific facial expressions, and some key features, like the shapes of
the mouth and eyebrows.
Autoencoder Structure
An autoencoder can be divided in two parts, the encoder and the decoder.
The encoder compresses the representation of an input. In this case we are going
to compress the face of our actor, consisting of 2000-dimensional data to only 30
dimensions, taking some steps between the compressions.
The decoder is a reflection of the encoder network. It works to recreate the input
as closely as possible. It has an important role during training, to force the autoencoder
to select the most important features in the compressed representation.
After the training has been done, you can use the encoded data as reliable
dimensionally-reduced data, applying it to any problem where dimensionality reduction
seems to fit.
This image was extracted from the Hinton paper comparing the two-dimensional
reduction for 500 digits of the MNIST, with PCA on the left and autoencoder on the right.
We can see that the autoencoder provided us with a better separation of data.
An autoencoder uses the loss function to properly train the network. The loss function
will calculate the differences between the output and the expected results. After that,
we can minimize this error by doing gradient descent. There is more than one type of
loss function; it depends on the type of data used.
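Since the course recommends Keras for quick model building, here is a hedged sketch of a small autoencoder in Keras (the layer sizes are arbitrary, not the 2000-to-30 example from the text), trained with a mean-squared-error loss as described above:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(784,))                         # e.g. a flattened 28x28 image
encoded = Dense(128, activation='relu')(inputs)
encoded = Dense(30, activation='relu')(encoded)      # compressed representation
decoded = Dense(128, activation='relu')(encoded)
decoded = Dense(784, activation='sigmoid')(decoded)  # reconstruction of the input

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # train to reconstruct the input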
Find the notebook for the implementation of the autoencoder at this link and replicate
the code yourself.
Autoencoders
Deep Learning: Autoencoders Fundamentals and types
Autoencoders — Bits and Bytes of Deep Learning
Autoencoders – Deep Learning Book
A Beginner’s guide to Deep Autoencoders
Building Autoencoders in Keras
Exercises
In this last week, we will provide you with the best machine learning and deep learning
resources to explore.
We hope you have found this course exciting and that you have learnt and explored a lot
through the contents and resources we have shared with you. Here are some awesome ML/DL
resources that you can look into to enhance your knowledge.