Chapter 1
CONTENTS
1.1 A brief history on Artificial Intelligence
1.2 Machine Learning - Learn from data
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.3 Machine Learning Algorithms
1.3.1 Parameters vs. Hyperparameters
1.3.2 Classification vs. Regression
1.3.3 Model-Based vs. Instance-Based Learning
1.3.4 Shallow vs. Deep Learning
1.4 How to use this book?
“We know the past but cannot control it. We control the future but cannot know it.” by Claude Shannon [Shannon, C. E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., 4:142–163.]
1.1 A brief history on Artificial Intelligence
The main challenge faced by the study of AI is to teach a computer how to solve tasks that appear trivial to humans but cannot be tackled by simply following a pre-assigned algorithm or a routine sequence of logical instructions. For instance, distinguishing a cat from a dog is obvious to a human, yet writing an algorithm that performs this task, taking all relevant aspects into account at once, would be very complicated. The most recent proposal is to handle problems of this kind with machine learning; indeed, with the availability of hand-held electronic devices such as smartphones and smart watches, collecting huge amounts of data on human behavior is far easier nowadays, and such data can be used to train machines to mimic how we solve different problems. Primitive models in statistical learning are mostly composed of only a few layers of complexity, and they therefore lack the ability to pick up the more subtle latent information embedded deeply in the ocean of data. To overcome this bottleneck, aided by the latest advances in computational power and the availability of labeled data, scholars have turned to strengthening the network approach, which leads us to the currently popular topic - Deep Learning.
Figure 1.1.1: Relations among AI, Machine Learning and Deep Learning
In 1951, Marvin Minsky built the Stochastic Neural Analog Reinforcement Calculator (SNARC). The machine, essentially a neural network consisting of 40 neurons, was the first to simulate the transmission of neural signals. In recognition of his contributions, Minsky received the Turing Award, the most prestigious prize in computer science, in 1969.
In 1955, Allen Newell, Herbert Simon, and Cliff Shaw [6] wrote a computer program called the Logic Theorist to mimic the problem-solving skills of humans. This program successfully proved 38 out of 52 theorems from Principia Mathematica by Whitehead and Russell (1927) [12].
As for the formal origin of AI, the first workshop on Artificial Intelligence, held at Dartmouth in the summer of 1956, is commonly regarded as the birth of the field; it was attended by representative scholars in information science and intelligence such as John McCarthy, Marvin Minsky, and Claude Shannon. The workshop covered topics including neural networks, natural language processing, abstraction, and creativity. After this series of talks, scientists and engineers have constantly dreamed of a hypothetical machine that can exhibit behavior at least as skillful and flexible as that of humans, can reason, and can possess the human soul and mind; researchers often refer to this collective vision and research program as General (Strong) Artificial Intelligence.
Fascinated by its seemingly unlimited potential, researchers helped AI flourish. During this time, some contemporaries optimistically foresaw that a machine completely driven by AI would come into being within 20 years. In 1963, MIT initiated the Project on Mathematics and Computation (Project MAC), with Minsky and McCarthy joining later, in which they promoted a series of research topics on image and speech recognition. From 1964 to 1966, Joseph Weizenbaum built the world's first natural language processing computer program; meanwhile, on the other side of the globe, Waseda University in Japan announced the invention of the first bipedal walking robot.
However, the appetite of scientists had yet to be satisfied. Criticism of AI began to rise in the 1970s; indeed, the rapidly growing demand for computational power could not be fulfilled at that time. In addition, the variety and complexity of demanding problems in image and natural language processing created severe hurdles given the technological conditions of the day. Upon reaching this bottleneck, public awareness and grant funding started to decline rapidly in the mid-1970s, and AI development then fell into decay.
With the commercial success of expert systems in the 1980s, the purpose of developing AI started to deviate from its original goal of obtaining general intelligence; instead, interest shifted to developing tailor-made systems that solve practical problems in specific target areas. In 1982, John Hopfield proposed a new network model, later called the Hopfield network, a kind of recurrent artificial neural network [4] that incorporates the mechanism of associative memory (the ability to learn and remember relationships between unrelated items). In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams jointly published the paper Learning representations by back-propagating errors [7], in which they demonstrated empirically that backward propagation can train a multi-layer neural network so that it learns appropriate internal representations of an arbitrary mapping from input to output.
During this new wave of enthusiasm for AI, Japan's Ministry of International Trade and Industry initiated a project to build a "fifth-generation computer" in 1982 [8]. It aimed to create a machine with supercomputer-like performance through large-scale parallel computation, in order to provide a platform for future developments in AI. However, after spending over 50 billion Japanese yen in 10 years, the project still could not meet its planned targets. In the late 1980s, negative impressions of AI started to grow in industry, as the field failed to live up to the tremendous investments that had been made, and AI once again faded out of the public's mind.
Stepping into the 21st century, rapid globalization and the development of the internet have significantly boosted the volume of available digital information. On the other hand, the computing capability of the Graphics Processing Unit (GPU), which first appeared in the 1990s and gained popularity over the following two decades, has been proliferating; for instance, the calculation speed of an NVIDIA Tesla V100 GPU (https://www.nvidia.com/en-gb/data-center/tesla-v100/) exceeds 10 trillion FLOPS (floating-point operations per second), surpassing the world's fastest supercomputer of 2001. (As an aside, the founders of NVIDIA first thought of "NV" as standing for "Next Version" and then added "invidia", the Latin word for envy; we would also like to express our gratitude once again to NVIDIA for supporting the joint institute with CUHK.)
With the rapid development of effective big-data collection and computing technology, AI has achieved major breakthroughs. The multi-layer convolutional neural network AlexNet, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and his Ph.D. advisor Geoffrey Hinton at the University of Toronto, won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), outperforming the first runner-up by a large margin. Henceforth, deep learning based on multi-layer neural networks has been applied to various areas. For instance, with advances in deep reinforcement learning, AlphaGo, recently developed by Google, has defeated several Go world champions [9]. All of these achievements have drawn public attention to the potential of deep learning and have brought about the resurgence of AI.
Figure 1.1.2: The robot "R2-D2" in the Star Wars series.
Figure 1.1.3: T-800 "Model 101" in the movie The Terminator.
1.2 Machine Learning - Learn from data
Machine learning originated from the early stages of artificial intelligence; it evolved gradually and brought new inspiration into different sub-branches such as pattern recognition and computational learning theory. It is an interdisciplinary subject that involves statistics, linear algebra, optimization and numerical analysis. According to variations in purposes and methodologies, machine learning is classified into supervised learning, unsupervised learning and reinforcement learning. Assume that
1. there are $N$ samples in the dataset, and
2. $\mathbf{x}_m$ represents an out-of-sample feature vector, i.e. one not in the training dataset, for $m > N$.
We start the learning procedure by choosing a suitable model. Common supervised learning models include logistic regression, generalized linear models, classification and regression trees, support vector machines (SVM), K-nearest neighbors (KNN), naive Bayes classifiers, and many common deep neural networks. The model is tested by comparing its predicted values against the actual labels, so that the model can be adjusted accordingly. The training process is repeated until sufficient accuracy is obtained. The learning is supervised by the feedback obtained from the values of the actual labels; once the training is finished, new data can be fed into the model for predictions. A minimal sketch of this workflow is given below.
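To make this workflow concrete, here is a minimal sketch assuming a synthetically generated dataset and scikit-learn's LogisticRegression as the chosen model; the data, the model choice, and all parameter values (test size, iteration limit, random seeds) are illustrative assumptions rather than prescriptions of this chapter:

# A minimal sketch of the supervised learning workflow described above,
# using scikit-learn on synthetic data (illustrative assumptions only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collect a labeled dataset {(x_n, y_n)}, n = 1, ..., N.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 2. Hold out part of the data to test the model against the actual labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3. Choose a model (here: logistic regression) and fit it to the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Compare predictions with the actual labels; adjust and retrain until accurate enough.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Once training is finished, new data can be fed in for prediction
#    (here we simply reuse one test row as a stand-in for an out-of-sample point).
x_new = X_test[:1]
print("predicted label for a new example:", model.predict(x_new))

Any of the other models listed above (SVM, KNN, naive Bayes, trees) could be swapped in with the same fit-then-predict pattern.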
More formally, a supervised learning problem starts from:
1. Dataset: A collection of labeled examples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where each $\mathbf{x}_n$ is a feature vector and $y_n$ is its label.
Example 1.2.1. Spam Detection: Suppose that we have 10,000 email messages, each labeled as "spam" or "not spam". However, these email messages cannot be used directly in a model, since the labels and the passages in the emails are not numbers! Hence, each email message has to be converted into a feature vector. One common way is called bag of words: say the bag (dictionary) contains 20,000 alphabetically sorted words; then
1. the first feature has a value of 1 if the email message contains the word "a", and 0 otherwise;
2. the second feature has a value of 1 if the email message contains the word "aaron", and 0 otherwise;
3. ...
4. the 20,000th feature has a value of 1 if the email message contains the word "zulu", and 0 otherwise.
That is, for the $n$th message,
$$x_n^{(1)} = \begin{cases} 1, & \text{if the } n\text{th message contains ``a''},\\ 0, & \text{otherwise}, \end{cases} \qquad \cdots \qquad x_n^{(20{,}000)} = \begin{cases} 1, & \text{if the } n\text{th message contains ``zulu''},\\ 0, & \text{otherwise}. \end{cases}$$
Similarly, the output labels have to be converted into numbers. For example,
$$y_n = \mathbf{1}\{\text{the } n\text{th message is spam}\} = \begin{cases} 1, & \text{if the } n\text{th message is spam},\\ 0, & \text{otherwise}, \end{cases}$$
where $\mathbf{1}\{\cdot\}$ is the indicator function. This example will be discussed further in connection with support vector machines, random forests, the naive Bayes classifier, and CIBer. A small sketch of the bag-of-words conversion is given below.
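As an illustration, the following sketch, assuming a tiny made-up set of messages and scikit-learn's CountVectorizer, converts a few toy emails into binary bag-of-words feature vectors and 0/1 labels; the messages and the resulting small vocabulary are hypothetical stand-ins for the 10,000 emails and 20,000-word dictionary of the example:

# A small sketch of the bag-of-words conversion in Example 1.2.1,
# using scikit-learn's CountVectorizer with binary indicators.
from sklearn.feature_extraction.text import CountVectorizer

emails = [                                  # toy messages, purely illustrative
    "win a free prize now",
    "meeting rescheduled to Monday",
    "free prize waiting, claim now",
]
labels = ["spam", "not spam", "spam"]

# binary=True gives 1 if a word occurs in a message and 0 otherwise,
# matching the indicator features x_n^{(j)} above.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails).toarray()

# Convert the text labels into numbers via the indicator 1{message is spam}.
y = [1 if label == "spam" else 0 for label in labels]

print(sorted(vectorizer.vocabulary_))       # the alphabetically sorted "bag"
print(X)                                    # one binary feature vector per message
print(y)                                    # numeric labels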
The performance of traditional unsupervised learning in feature extraction for complex data structures may not be too appealing; alternatively, deep learning has proven its strong unsupervised learning abilities, especially in the field of computer vision, or when there is some natural ordering, metric, or algebraic structure among the feature variables of the data points; in particular, this is achieved through the convolutional layers in a Convolutional Neural Network (CNN) and the feedback mechanism of backpropagation in the deep NN (latter) part of the CNN. Recently, some research also suggests semi-supervised learning, which falls in between supervised and unsupervised learning: it makes use of unlabeled data together with a small amount of labeled data, striking a balance between learning performance and the cost of obtaining labeled data.
1. Dataset: A collection of unlabeled examples $\{\mathbf{x}_n\}_{n=1}^{N}$.
2. Goal: Produce a model that transforms the feature vector $\mathbf{x}_n$ into a real-valued output $y_n$ or a vector output $\mathbf{y}_n$. For example, in the following cases, the model returns:
(a) Clustering: The identity of the cluster for each feature vector in the dataset, i.e.
$$y_n \in \{1, \cdots, C\},$$
where $C$ is the total number of clusters. K-means clustering is one such method; it partitions the data into $K$ subgroups, where $K$ is a hyperparameter; see Section ??.
(b) Dimension Reduction: A new feature matrix $\mathbf{Y} \in \mathbb{R}^{N_Y \times D_Y}$ that has a smaller dimension than the input feature matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$. Principal component analysis (PCA) reduces the dimension of the input feature matrix by looking for the dominant eigenvalues (and the corresponding eigenvectors) of the covariance matrix of the feature vectors; see Section ??.
(c) Outlier Detection: A real-valued number $y_\ell$ that indicates how $\mathbf{x}_\ell$ differs from a "typical" example in the dataset $\{\mathbf{x}_n\}_{n=1}^{N}$. The Mahalanobis distance [3] $D_n$ for the independent and identically distributed (iid) datum $\mathbf{x}_n$ satisfies, approximately,
$$D_n^2 = (\mathbf{x}_n - \bar{\mathbf{x}}_N)^\top S^{-1} (\mathbf{x}_n - \bar{\mathbf{x}}_N) \sim \chi_D^2, \qquad n = 1, 2, \cdots, N,$$
where $\bar{\mathbf{x}}_N$ and $S$ are respectively the sample mean and the sample covariance matrix of $\mathbf{x}_1, \cdots, \mathbf{x}_N$, and $\chi_D^2$ is the chi-squared distribution with $D$ degrees of freedom; in particular, if $N$ is large enough, these $D_n^2$'s, for $n = 1, \cdots, N$, are also approximately independent of each other. A short sketch illustrating these three tasks is given after this list.
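As a small illustration of the three tasks above, the following sketch runs K-means clustering, PCA, and Mahalanobis-distance outlier detection on a synthetically generated dataset; the data, the choice of K = 3 clusters, the two retained principal components, and the 99% chi-squared cutoff are all illustrative assumptions:

# A minimal sketch of clustering, dimension reduction and outlier detection
# on synthetic data (all parameter choices here are illustrative assumptions).
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))           # N = 500 unlabeled examples, D = 5 features

# (a) Clustering: assign each x_n a cluster identity y_n in {1, ..., C}, here C = K = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# (b) Dimension reduction: PCA keeps the directions of the dominant eigenvalues
#     of the covariance matrix, giving a smaller feature matrix Y.
Y = PCA(n_components=2).fit_transform(X)

# (c) Outlier detection: squared Mahalanobis distance D_n^2, approximately chi^2_D.
x_bar = X.mean(axis=0)                                  # sample mean
S_inv = np.linalg.inv(np.cov(X, rowvar=False))          # inverse sample covariance
diffs = X - x_bar
D2 = np.einsum("ij,jk,ik->i", diffs, S_inv, diffs)      # D_n^2 for every n
threshold = chi2.ppf(0.99, df=X.shape[1])               # 99% quantile of chi^2_D
outliers = np.where(D2 > threshold)[0]

print(cluster_ids[:10], Y.shape, len(outliers))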
A major usage of reinforcement learning (RL) is to mimic human behaviour, and with rapidly developing research and algorithms, RL-trained agents can often outdo humans. For instance, a robot can learn to get up by itself after falling in a simulated environment, not to mention the eye-catching Go match in 2017, in which the computer program AlphaGo developed by Google defeated Ke Jie, the world champion in Go. Other applications include machine translation (MT) and predictive text, etc.
There are three historical moments in the development of reinforcement learning. First, Sutton and Barto (1998) published the text Reinforcement Learning: An Introduction [10]. The book summarizes the development of different algorithms in reinforcement learning up to 1998. By that time, much emphasis was placed on Q-learning with tables. Concurrently, algorithms based on direct policy search had already been proposed; for instance, the algorithm REINFORCE proposed in Williams (1992) [13] directly updates the policy weights by evaluating the policy gradient. The second moment was in 2013, when the Deep Q Network was first suggested for game playing by DeepMind; see Mnih et al. (2013) [5]. The Deep Q Network integrates reinforcement learning and deep neural networks to form deep reinforcement learning. During 1998-2013, various policy-based algorithms were also developed. The third moment, and also the most compelling breakthrough in RL, has to be the development of AlphaGo by Google [9]. The RL-trained computer program earned two consecutive wins in Go matches over the world champions during 2016-2017.
1.4 How to use this book?
Unfortunately, in this book you will not find material about deep thinking in the sense of the book (see Figure 1.4.1) by Professor Kawakami of design studies.
Figure 1.4.1: Kyoto University’s Deep Thinking Method (Japanese) by Hiroshi Kawakami.
"Deep Thinking" by Kawakami is a book on thought that deepens one's thinking ability and cultivates one's skill in analyzing problems and then proposing solution strategies; it is more about scientific methods and training in philosophical argument, which will not be covered in the present book. Instead, we introduce various practically useful mathematical and statistical models behind a wide range of machine learners. In 1997, Garry Kimovich Kasparov, then the reigning world chess champion, lost a match to the IBM supercomputer "Deep Blue" under a limited time constraint. Twenty years later, in 2017, he published a book titled "Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins" (see Figure 1.4.2).
Figure 1.4.2: Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins by Garry
Kasparov.
In his book, Kasparov revealed his experience and strategies in playing against Deep Blue. Although there was plenty of criticism of artificial intelligence at that time, Kasparov believed that artificial intelligence could bring humans to another height, and he predicted the future development of artificial intelligence. In a similar spirit, we hope that our book can serve as a bridge for readers to understand the common existing machine and deep learners, and to foster the future development of artificial intelligence.
In addition, we will not introduce any material related to deep diving (see Figure 1.4.3), but we will say more about self-driving/autonomous driving ("deep" driving; see Figure 1.4.4) in due course.
Figure 1.4.3: Deep diving. Photo from http://divemagazine.co.uk/travel/7529-fresh-wrecks.
Figure 1.4.4: Deep driving. Photo by Alex Kendall, https://www.youtube.com/watch?v=CxanE_W46ts.
In particular, Tesla Inc., a U.S.-based company that builds electric cars, uses deep learning to develop an autopilot system. This autopilot system has already been equipped in the Tesla Model 3 (see Figure 1.4.5). However, this autopilot technology can only perform certain functions, including but not limited to accelerating, braking, and steering, and Tesla drivers still need to take control of the car. The U.S. National Highway Traffic Safety Administration defines a Level 5 self-driving car and a Level 2 driver-assistance system as follows:
“An automated driving system (ADS) on the vehicle can do all the driving in all circumstances.
The human occupants are just passengers and need never be involved in driving.” 6
“An advanced driver assistance system (ADAS) on the vehicle can itself actually control both
steering and braking/accelerating simultaneously under some circumstances. The human driver
must continue to pay full attention (monitor the driving environment) at all times and perform
the rest of the driving task.” 6
Therefore, Tesla's current autopilot system only meets the Level 2 requirement. There is still a long journey ahead for Tesla to improve its system; indeed, in March 2016, an incident was reported on Twitter in which Tesla's autopilot system mistakenly recognized the salt lines, laid down in advance of a massive snowstorm, as the normal broken white lane markings on the highway; see Figure 1.4.6.
6 Retrieved from https://www.nhtsa.gov/technology-innovation/automated-vehicles-safety
Figure 1.4.5: White Tesla Model 3. Photo from https://en.wikipedia.org/wiki/Tesla_Model_3.
Figure 1.4.6: Original picture on Twitter: salt lines confuse Tesla's autopilot system. Photo from https://twitter.com/amywebb/status/841292068488118273.
Bibliography
[1] Bachant, J. and Soloway, E. (1989). The engineering of XCON. Communications of the ACM, 32(3):311–319.
[2] Buchanan, B. G. and Feigenbaum, E. A. (1980). The Stanford Heuristic Programming Project: Goals and activities. AI Magazine, 1(1):25–25.
[3] Mahalanobis, P. C. (1936). On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, volume 2, pages 49–55.
[4] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558.
[5] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[6] Newell, A. and Simon, H. (1956). The logic theory machine – a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79.
[7] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
[8] Shapiro, E. Y. (1983). The fifth generation project – a trip report. Communications of the ACM, 26(9):637–641.
[9] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
[10] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[12] Whitehead, A. and Russell, B. (1927). Principia Mathematica. Number 1 in Cambridge Mathematical Library. Cambridge University Press.
[13] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.