Book Title: Data Science in Practice
Chapter Title: Machine Learning: A Concise Overview
Copyright Year: 2019
Copyright Holder: Springer International Publishing AG, part of Springer Nature
Corresponding Author: Denio Duarte
  Organization: Universidade Federal da Fronteira Sul, Chapecó, Brazil
  Organization: University of Skövde, Skövde, Sweden
  Email: [email protected]
  Email: [email protected]
Author: Niclas Ståhl
  Organization: University of Skövde, Skövde, Sweden
  Email: [email protected]
Abstract: Machine learning is a sub-field of computer science that aims to make computers learn. This is a simple view of the field, but since the first computer was built, we have wondered whether or not computers can learn as we do.
Chapter 3
Machine Learning: A Concise Overview
Machine learning is a sub-field of computer science that aims to make computers learn. This is a simple view of the field, but since the first computer was built, we have wondered whether or not computers can learn as we do. In 1959, Samuel [40] proposed some procedures to build an algorithm intended to make computers play checkers better than novice players. It was an audacious goal, especially at a time when the available hardware was very limited. It shows, however, that machine learning has been considered important since the first computers were introduced.

Nowadays, users demand that computers perform complex tasks and solve many kinds of new problems, while data are being produced by many devices (e.g., satellites, cell phones, and sensors). Researchers in many fields (e.g., statisticians, computer scientists, and engineers) have started the quest to make computers learn by proposing new techniques to meet these new user demands.

Data are the input of any machine learning system. Data contain examples from a given domain, and machine learning algorithms generalize the examples in the data to build mathematical models. The models can then be used to predict new outputs from new examples. The data used as input to train the model are called training data.

This chapter aims to present an overview of machine learning and to serve as a road map that guides interested readers in applying machine learning to everyday problems and in acquiring the skills to become a data scientist. It is organized as follows: next
when the training data have labels associated with every example. We present two common supervised learning techniques: regressors and classifiers. Some supervised algorithms for both techniques are also presented. The following section introduces another class of machine learning algorithms: unsupervised learning. The training data for unsupervised algorithms have no labels, so the learner aims at partitioning them into groups. Deep learning is a new trend in machine learning, and it is based on different architectures of artificial neural networks. We dedicate the entire Sect. 3.4 to these. Section 3.5 presents an overview of model assessment approaches. Model assessment is critical to validate the quality of a model's predicted outputs. Another important issue in machine learning is the number of attributes in the training dataset, which is called the dimensionality of the dataset. Section 3.6 presents some techniques to reduce the dimensionality in order to enhance the performance of the algorithms and to help visualize the data. The following section presents some final remarks about machine learning: data preprocessing (feature selection and scaling, missing values), bias, variance, overfitting, and underfitting. Finally, Sect. 3.8 concludes this chapter.
3.1 Introduction
Learning is a very complex process, and we cannot currently say that a computer can learn. With this in mind, our first definition of machine learning must be carefully reviewed. There is no general agreement about what learning is; for human beings, however, learning can be defined (i) functionally, as changes in behavior that result from experience, or (ii) mechanistically, as changes in the organism that result from experience [6].

Computers are mathematical machines; thus, we have to consider learning in terms of a computer program. Figure 3.1 depicts traditional programming (a) and machine learning (b). Notice that, while traditional programming is concerned with finding the right output given inputs and a program, machine learning is concerned with finding the right program (later we call it a model) given a set of inputs and a (possibly empty) set of outputs. The learned program can then propose new outputs given new inputs.
[Fig. 3.1: panels (a) traditional programming and (b) machine learning]
Based on Fig. 3.1, we can see how differently learning is understood for human beings and for computers. Machine learning algorithms try to create a model that represents the input in order to propose new outputs. This is the type of learning this chapter is concerned with.

Back to the definition of machine learning, we point out two definitions. The first was proposed by Samuel [40], who said that machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. Samuel's definition was one of the first to be proposed. Almost forty years later, Mitchell [32] proposed a more mathematical view of machine learning: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Example 3.1.1 Assume we want to learn whether or not a given email is spam. In this case, we give the algorithm a set of emails S_e divided into two subsets: not spam (S_nse) and spam (S_se). This step is the experience (E) we give to the algorithm. Based on E, our algorithm classifies emails as spam (and consequently as not spam). This step is related to the task (T). Finally, we want to know how well our algorithm performs in classifying spam emails. This step is related to the performance of our algorithm (P). Our goal is, then, to find a T with as good a P as possible. Of course, T and P depend on the quality of E; e.g., if we have a sufficiently informative S_e, our machine learning algorithm may classify all our input emails correctly.
Figure 3.2 slightly changes the representation of Fig. 3.1. Firstly, we call the input of our machine learning algorithm the Training Set (X). Depending on the task we want to accomplish, X may have a label for every example. So, X can be represented as {(x^(1), y^(1)), ..., (x^(m), y^(m))} or {x^(1), ..., x^(m)}, where m is the size of X (the training set), and y^(j) is the label of x^(j) (1 ≤ j ≤ m). The learning algorithm (we have thousands of options) takes X as input and builds a Model (also known as a hypothesis). After assessing our model (i.e., verifying how good our P is), we can feed it with new examples (X′) to obtain predicted outputs. Take into account that, depending on our target task, the predicted outputs can be classes (discrete values), continuous values, or clusters, among others. This is elaborated further in this chapter.
Based on what we have already seen, the machine learning process can be divided into four steps: (i) get the training set X, (ii) choose and implement a learning task based on X, (iii) build a model, and (iv) assess the model with new inputs. These four steps may be repeated until we reach a good P. Keep in mind that when we have a machine learning problem to solve, the first challenge we face is which machine learning algorithm to use. There are thousands available, and hundreds more are proposed each year [8]. The set of learning algorithms available for a given machine learning problem is called the hypothesis space. To avoid getting lost in this huge set of choices, the learning components can be divided into three [8]:
1. Representation: a task must be represented by some algorithm. We must know what kind of learning we are interested in, so that the hypothesis space is reduced based on the type of task.
2. Evaluation: the predicted outputs must be evaluated (assessed) to know how good the chosen representation is. Depending on the task, different evaluation functions can be used.
3. Optimization: based on the results of the evaluation component, optimization must be done. The aim of the learning algorithm is to maximize a given performance measure.
Table 3.1 presents some examples for each of the three components. For example, a Decision Tree may classify a training set into predefined classes. An evaluation component can be accuracy, i.e., how many examples are correctly classified. An entropy function can be used to measure the purity of the attributes within the tree, that is, the amount of information that would be needed to classify an example. Table 3.1 gives a rough idea of the options we have when we design a machine learning system. We cannot, however, arbitrarily pick one entry from each column to design our system: each representation has its own set of optimization and evaluation approaches. For example, we can use Greedy Recursive Partitioning for Decision Trees, and the model can be evaluated by Accuracy.
Although there is no simple recipe for choosing the best approach for each of the three components, the success (or failure) of a learner depends on how well the problem is defined, as well as on the quality of the training set. The former helps to find a representation that fits the problem better, and the latter can be considered

Table 3.2 Wind and temperature affecting pace
Wind speed (km/h)   Temperature (°C)   Pace (min)
10.5                12.3               3.5
8.9                 15.4               3.2
20.2                13.7               5.5
5.10                3.1                4.0
Example 3.1.2 Table 3.2 presents an extract of a dataset about running performance (column Pace: the number of minutes it takes to cover a kilometer) based on the wind speed and the temperature. Suppose we want to predict a pace based on new information about the weather. X can be seen as (<10.5, 12.3>, <8.9, 15.4>, <20.2, 13.7>, <5.10, 3.1>) and y as (3.5, 3.2, 5.5, 4.0). The size of X is 4 and the number of features is 2 (i.e., wind speed and temperature), and we want to predict a continuous value: the pace.
The number of features is usually known as the dimensionality of the dataset. The notion of dimensionality leads to a well-known problem in machine learning: the curse of dimensionality [21]. Considering a dataset as a set of points in a space, the curse of dimensionality can be stated as follows: (i) learning algorithms generally rely on interpolation to build models, (ii) interpolation is only possible if points are close to each other, (iii) if points are spread throughout a high dimensional space, the distance between them is large, and thus (iv) interpolation-based algorithms cannot build a model.

Besides, high dimensional datasets also lead to two further problems: increased computation cost and non-informative features. Suppose that we have, in our example, the following additional features: the price and the quality of the running shoes. Since quality and price are related to each other, we can easily discard the feature price without losing essential information for our machine learning system. Notice that if we have a dataset with 200 features, discarding or merging some of them would not be an easy task. We deal with the dimensionality problem later in Sect. 3.6.
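Point (iii) of the curse of dimensionality can be illustrated with a small experiment. The sketch below is not taken from the chapter: it assumes points drawn uniformly at random in the unit hypercube, and the function name `mean_pairwise_distance`, the sample size, and the seed are illustrative choices.

```python
import math
import random


def mean_pairwise_distance(num_points, dim, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(num_points)]
    total, pairs = 0.0, 0
    for i in range(num_points):
        for j in range(i + 1, num_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


# The same number of points becomes more spread out as the dimension grows.
for dim in (2, 10, 100):
    print(dim, round(mean_pairwise_distance(50, dim), 2))
```

With a fixed number of examples, the average distance grows with the number of features, which is why interpolation between neighbouring points becomes harder.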
Example 3.1.3 Given Example 3.1.2 and x′ = <11.2, 10.0>, we want to predict a new ŷ such that ŷ represents a valid pace value for the given wind speed and temperature. To make the prediction, we have to build a model that describes our training set (X) well enough. We show how to build a model using a simple linear regression (notice that, from Table 3.1, we are choosing a representation for our problem). A model can be

\[ \hat{y}_i = \theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} \]

where θ_j are the weights or parameters (sometimes denoted by W), x_ki represents the kth feature of the ith example in X, and ŷ_i is the predicted output of the ith example; we want ŷ_i ≈ y_i. θ_0 represents the bias of the model (aka intercept). The challenge is to find θ_0, θ_1, and θ_2 such that ŷ_i is as close as possible to y_i. Given the matrix X_{m×d} and the column vector Θ_{(d+1)×1}, we can implement our model using matrix multiplication. However, X_{4×2} and Θ_{3×1} are not dimension-compatible. In our representation, Θ always has one more row than X has columns, and we can solve this problem by adding a column of 1's to X. As 1 is the identity element under multiplication, multiplying θ_0 by 1 leaves its value unchanged. Now, we can represent our model as:
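\[
\begin{bmatrix} 1 & 10.5 & 12.3 \\ 1 & 8.9 & 15.4 \\ 1 & 20.2 & 13.7 \\ 1 & 5.10 & 3.1 \end{bmatrix}
\times
\begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}
=
\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \hat{y}_4 \end{bmatrix}
\]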
ŷ = [2.79, 1.37, 7.36, 1.93]. Our Θ values made a fair prediction for the first example (y_1 = 3.5), but they failed for the others. Finding good values for Θ is an optimization problem (our third component). If we use linear regression to represent our problem and an error function to evaluate it, we can use an optimization algorithm for finding the minimum of a function (e.g., Gradient Descent). Doing so, we get Θ^T = [3.4, 0.192, −0.132] and, then, ŷ = [3.8, 3.1, 5.5, 4.0]. Finally, to verify whether our model is performing well, we have to evaluate it (second component). For linear regression, we can use the Root Mean Squared Error (RMSE); here, the error is 0.16 (see Sect. 3.5 for details). The smaller the RMSE, the closer the predictions of our model are to the observed values. Given x′, we have 3.4 + 0.192 × 11.2 + (−0.132) × 10.0 ≈ 4.2, that is, when the wind speed is 11.2 and the temperature is 10.0, the predicted pace is 4.2.
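The numbers of Example 3.1.3 can be checked with a few lines of code. The following is a minimal sketch in plain Python: the data come from Table 3.2 and Θ from the text, while the helper names `predict` and `rmse` are illustrative and not part of the chapter. The new example is x′ = <11.2, 10.0> as stated at the beginning of the example.

```python
import math

# Training set from Table 3.2: (wind speed, temperature) -> pace
X = [[10.5, 12.3], [8.9, 15.4], [20.2, 13.7], [5.10, 3.1]]
y = [3.5, 3.2, 5.5, 4.0]
theta = [3.4, 0.192, -0.132]  # [bias, wind weight, temperature weight] from the text


def predict(theta, features):
    """Linear model: prepend the constant 1 for the bias, then take the dot product."""
    row = [1.0] + list(features)
    return sum(t * v for t, v in zip(theta, row))


def rmse(theta, X, y):
    """Root Mean Squared Error of the model on the examples (X, y)."""
    squared_errors = [(predict(theta, x) - target) ** 2 for x, target in zip(X, y)]
    return math.sqrt(sum(squared_errors) / len(y))


y_hat = [predict(theta, x) for x in X]
print([round(v, 1) for v in y_hat])            # [3.8, 3.1, 5.5, 4.0]
print(round(rmse(theta, X, y), 2))             # 0.16
print(round(predict(theta, [11.2, 10.0]), 2))  # 4.23
```

Prepending the constant 1 inside `predict` plays the same role as the column of 1's added to X in the matrix form.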
We can evaluate our model using the same set used for training. However, the best way to test our model is against unseen examples. Later in this chapter, we describe some strategies for training and testing machine learning models.

In this section, we presented an overview of machine learning. In the next sections, we describe approaches for implementing a range of types of algorithms to solve machine learning problems. We consider a broad classification of learning tasks (i.e., machine learning algorithms): supervised and unsupervised.¹
3.2 Supervised Learning
Supervised learning can be applied when the dataset contains a set of labels (possibly a single one) for every example. Therefore, the dataset is divided into X, the features of the examples, and y, the labels. The labels help a learning algorithm to build the predictive model and act as a guide for the learner.

Labels may be discrete (classes) or continuous (numeric values). Depending on the type of label, we can apply classifiers or regressors. Besides, labels make the evaluation of a model easier since we have the ground truth to compare with the predicted values. Table 3.2 shows a dataset that can be used for supervised algorithms. The label (Pace) represents continuous values, so the dataset can be used as input for supervised regression algorithms. If we change the pace to discrete values (e.g., fast, slow, normal, etc.), our problem becomes a classification problem. Note that any regression problem can be turned into a classification problem by binning the continuous target values. Therefore, the first step of machine learning system design is to analyze the dataset to identify which representation must be used. In the following, we describe both regression and classification.
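Binning a continuous target can be sketched in a few lines. The example below uses the Pace column of Table 3.2; the function name `bin_pace`, the cut-off of 3.5 minutes, and the numeric class codes are assumptions made for illustration.

```python
def bin_pace(pace, threshold=3.5):
    """Map a continuous pace (min/km) to a class code: 0 = fast, 1 = normal."""
    return 0 if pace <= threshold else 1


paces = [3.5, 3.2, 5.5, 4.0]            # label column of Table 3.2
classes = [bin_pace(p) for p in paces]  # discrete labels for a classifier
print(classes)                          # [0, 0, 1, 1]
```

The same features can then be fed to a classifier instead of a regressor, with `classes` as the new label vector.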
Regression is a type of supervised machine learning whose target variables are continuous values. Predicting currency exchange rates, temperatures, and the time when an event may occur are examples of regression problems since the predicted outputs are continuous values. In this section, we present some regression algorithms.
When we face a machine learning problem with continuous target variables, we have to choose a representation based on regression algorithms. Linear regression is the most common algorithm used in regression and serves as the basis for understanding other regression algorithms [42].

The mathematical definition of linear regression is given by the following equation:

\[ h_\Theta(x^{(i)}) = \theta_0 + \theta_1 \times x_1^{(i)} + \cdots + \theta_d \times x_d^{(i)} \tag{3.1} \]

where θ_j are the weights (θ_0 is the bias, and θ_k is the weight for the kth feature of x^(i), 1 ≤ k ≤ d, with d the number of features), and x_k^(i) is the kth feature of the ith example in X (the dataset).
The best way to understand how linear regression works is to plot a graph of the features against the label. However, most of the time the dataset is multidimensional, that is, it has more than one feature. Later in this chapter, we discuss dimensionality reduction, but, for the sake of simplicity, we consider only the feature Temperature from our dataset (Table 3.2). Table 3.3 shows the original dataset from Table 3.2 extended with some new examples to have more points in the graph. Figure 3.3 shows the new dataset plotted in a 2D graph (X × y).

In Fig. 3.3a, the line drawn by h_Θ(X) uses weights that do not represent the data points very well. In Fig. 3.3b, however, the average distance of the points to the line shows that h_Θ(X) describes the dataset better. We can verify whether or not our model fits the training dataset well by measuring the (Euclidean) distances between the points and the line. In Fig. 3.3, the red lines depict some of these distances. One way to evaluate our model is thus to calculate the average distance between the points and the line.
Example 3.2.1 Based on Fig. 3.3b, we have h_[2.3, 0.222]([10]) = 4.52, that is, when the temperature is 10 degrees, the runner will take 4.52 minutes to run 1 kilometer. For a temperature of 12.3, our model outputs a pace of 5.03, which is higher than the ground-truth value of 3.5 (see Table 3.3).
Table 3.3 Table 3.2 with just one feature and new examples
Temperature (°C)   Pace (min)
12.3               3.5
15.4               3.2
13.7               5.5
3.1                4.0
11.3               4.6
10.8               4.3
9.7                4.0
4.5                3.5
5.3                3.3
5.2                3.8
7.4                3.4
6.2                4.2
7.8                4.3
8.5                3.8
12.3               5.3
14.3               5.2
13.2               4.4
9.8                4.6
8.3                4.6

Fig. 3.3 Performance (the drawn line) of two regression models, (a) and (b), on the same dataset

So far, we have chosen the representation of our model (i.e., Linear Regression), and we still need to choose approaches to evaluate and optimize the model. In Sect. 3.5, we present some approaches for model evaluation. For now, if we use the Mean Squared Error, our model Θ = [2.3, 0.222] gets a score of 0.641. Since values closer to 0 are better, our model should be optimized. Optimization is the third component of our hypothesis space (see Table 3.1). Gradient Descent is a common solution to optimize (or train) a Linear Regression Model.
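The score quoted above can be reproduced directly from Table 3.3. The sketch below is only a check of the arithmetic; the names `h` and `mse` are illustrative helpers, not part of the chapter.

```python
# (temperature in °C, pace in min/km) pairs from Table 3.3
data = [
    (12.3, 3.5), (15.4, 3.2), (13.7, 5.5), (3.1, 4.0), (11.3, 4.6),
    (10.8, 4.3), (9.7, 4.0), (4.5, 3.5), (5.3, 3.3), (5.2, 3.8),
    (7.4, 3.4), (6.2, 4.2), (7.8, 4.3), (8.5, 3.8), (12.3, 5.3),
    (14.3, 5.2), (13.2, 4.4), (9.8, 4.6), (8.3, 4.6),
]
theta = [2.3, 0.222]  # [bias, temperature weight] of the model in Fig. 3.3b


def h(theta, temperature):
    """Single-feature linear model."""
    return theta[0] + theta[1] * temperature


def mse(theta, data):
    """Mean Squared Error over all (temperature, pace) pairs."""
    return sum((h(theta, t) - pace) ** 2 for t, pace in data) / len(data)


print(round(h(theta, 10), 2))      # 4.52
print(round(mse(theta, data), 3))  # 0.641
```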
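The cost function J(Θ) to be minimized is:

\[ J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2 \tag{3.2} \]

where m is the number of observations in the training dataset. Note that h_Θ(x^(i)) may be replaced by ŷ^(i). The mathematical definition of gradient descent is shown in Eq. (3.3).

\[ \Theta = \Theta - \alpha \frac{\partial}{\partial \Theta} J(\Theta) \tag{3.3} \]

where α is the learning rate. Substituting the derivative of Eq. (3.2) gives:

\[ \Theta = \Theta - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \tag{3.4} \]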
Gradient descent works as follows: (i) we initialize Θ randomly (for linear regression, all Θ can be initialized to 0),² (ii) we define a value for α,³ and (iii) we run gradient descent until it converges. We can use J(Θ) to stop the loop: when J(Θ) stabilizes, we consider that gradient descent has converged.
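The three steps can be sketched as follows for the single-feature dataset of Table 3.3. This is a minimal illustration rather than a reference implementation: the function names, the tolerance `tol` used to decide that J(Θ) has stabilized, and the iteration cap `max_iters` are assumptions, and the learning rate 0.01 is simply a small value that happens to converge on these unscaled temperatures.

```python
# (temperature in °C, pace in min/km) pairs from Table 3.3
data = [
    (12.3, 3.5), (15.4, 3.2), (13.7, 5.5), (3.1, 4.0), (11.3, 4.6),
    (10.8, 4.3), (9.7, 4.0), (4.5, 3.5), (5.3, 3.3), (5.2, 3.8),
    (7.4, 3.4), (6.2, 4.2), (7.8, 4.3), (8.5, 3.8), (12.3, 5.3),
    (14.3, 5.2), (13.2, 4.4), (9.8, 4.6), (8.3, 4.6),
]


def cost(theta, data):
    """J(Theta): half the mean of the squared errors."""
    m = len(data)
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in data) / (2 * m)


def gradient_descent(data, alpha=0.01, tol=1e-12, max_iters=100_000):
    """Batch gradient descent for a linear model with one feature."""
    theta = [0.0, 0.0]                      # (i) initialize all weights to 0
    m = len(data)
    previous = cost(theta, data)
    for _ in range(max_iters):              # (iii) iterate until J stabilizes
        errors = [theta[0] + theta[1] * x - y for x, y in data]
        grad_bias = sum(errors) / m         # the bias multiplies the constant 1
        grad_weight = sum(e * x for e, (x, _y) in zip(errors, data)) / m
        theta = [theta[0] - alpha * grad_bias, theta[1] - alpha * grad_weight]
        current = cost(theta, data)
        if abs(previous - current) < tol:   # J(Theta) no longer changes
            break
        previous = current
    return theta


theta = gradient_descent(data)              # (ii) alpha defaults to 0.01
print([round(t, 3) for t in theta], round(cost(theta, data), 4))
```

Because the temperatures are not scaled, the bias converges slowly; scaling the feature first would allow a larger learning rate and fewer iterations.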
Such parameters (called hyperparameters) are essential for building good machine learning systems. Ordinary linear regression needs only a good value for the learning rate, but this is not always the case. More sophisticated algorithms require us to choose values for several hyperparameters.

² For some representations, zero is not a good initial value. Random values from 0 to 1 work in most cases.
³ Select a small value for α, say 0.01, plot J(Θ) to see how gradient descent is converging, and increase α (e.g., by doubling its value) until the expected convergence is reached.
Sometimes the data points cannot be described by one straight line. In this case, we can still apply linear regression to the dataset; we just need to change our model. If the points are not linearly organized, we can apply polynomial models (in this case, linear regression may be called polynomial regression). Suppose we have a dataset with just one feature, and the data points follow a quadratic function. We could build a model as h_Θ(X) = θ_0 + θ_1 × x_1 + θ_2 × x_1². It is not hard to implement since we only insert a new column representing the squared feature. Therefore, the model becomes h_Θ(X) = θ_0 + θ_1 × x_1 + θ_2 × x_2, where x_2 = x_1². Any polynomial function can be used to build a non-linear model. The characteristics of the dataset are the guide for choosing the best one. However, the cost function must be convex for gradient descent to be guaranteed to find its global minimum (the cost function in Eq. 3.2 happens to be convex).
An alternative way of training a learning model (i.e., solving for the parameters Θ) is to use the normal equation. Equation (3.5) gives its mathematical definition.

\[ \Theta = (X^T \cdot X)^{-1} \cdot X^T \cdot y \tag{3.5} \]

The normal equation has two advantages over gradient descent: there is no learning rate and no iteration. However, the matrix X^T · X must be invertible, and the computational cost of multiplying and inverting matrices is high.⁴
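The normal equation can be sketched in plain Python. Two caveats apply: instead of forming the inverse explicitly, the sketch solves the equivalent linear system (XᵀX)Θ = Xᵀy by Gauss-Jordan elimination, which yields the same Θ when XᵀX is invertible; and the small dataset is hypothetical, generated from y = 1 + 2·x1 − 3·x2 so that the expected weights are known. All function names are illustrative.

```python
def solve(A, b):
    """Solve A·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    """Weights of a linear model obtained from (X^T X) Theta = X^T y."""
    Xb = [[1.0] + list(row) for row in X]       # add the column of 1's (bias)
    Xt = [list(col) for col in zip(*Xb)]        # transpose: rows are columns of Xb
    XtX = [[sum(a * b for a, b in zip(r, c)) for c in Xt] for r in Xt]
    Xty = [sum(v * t for v, t in zip(r, y)) for r in Xt]
    return solve(XtX, Xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
theta = normal_equation(X, y)
print([round(t, 6) for t in theta])
```

No learning rate and no loop over epochs are needed, but the elimination step grows cubically with the number of features.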
Recall that J(Θ) guides gradient descent in finding the best weights for the model. However, sometimes the model fits the training set well (J(Θ) ≈ 0) but fails to generalize to the test set (J(Θ) ≫ 0). This situation (aka overfitting) indicates that our model has been specialized for the training dataset, i.e., the noise in the dataset has been taken into account during the learning process. Overfitting may happen when the dataset is highly dimensional and some features are irrelevant in the training step.

There are several techniques to combat overfitting; the most popular is to add regularization to the (cost) function. Regularization penalizes the function using the weights and some other parameters (see Sect. 3.7.2 for more details).

The algorithm described here for regression problems represents a small part of the hypothesis space for solving this kind of problem, but it can be used as the basis for understanding other regression algorithms.
transformed into a classification dataset. We just have to discretize the values, that is, group the continuous values into classes. The label (Pace) in Table 3.2, for example, can be discretized, resulting in the new dataset shown in Table 3.4. The continuous values are transformed into discrete values as follows: (i) fast (Pace ≤ 3.5), and (ii) normal (Pace > 3.5). As we are working with mathematical representations, the classes of the labels should be represented as numerical values. Therefore, fast is represented by 0, and normal by 1 (column #Class).

Table 3.4 The dataset from Table 3.2 with Pace having discrete values
Wind speed (km/h)   Temperature (°C)   Pace     #Class
10.5                12.3               Fast     0
8.9                 15.4               Fast     0
20.2                13.7               Normal   1
5.10                3.1                Normal   1

Classes may be binary (as in our example) or multi-class (we could represent the Pace as fast, regular, normal, slow, among others). In the first case, we have y ∈ {0, 1}, and in the second case, y ∈ {0, 1, ..., k − 1} (where k is the number of classes). We focus this section on binary classification since it is more intuitive to understand. Besides, binary classifiers are applied in several situations: classifying emails as spam or not, detecting fraudulent transactions, and problems regarding winners or losers, among others. In addition, any multi-class problem may be solved by dividing it up into many binary classification problems.
Logistic regression is one of the simplest and most efficient classifiers. The intuition behind logistic regression is similar to that of linear regression. We extend Eq. (3.1) to output ŷ such that ŷ lies between 0 and 1. Equation (3.6) presents a sigmoid (or logistic) function that always returns values between 0 and 1, and the logistic regression can be written as h_Θ(x) = σ(z), with z given by the right-hand side of Eq. (3.1):

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \tag{3.6} \]

The sigmoid function behaves as follows: σ(z = 0) = 0.5, 0 < σ(z < 0) < 0.5, and 0.5 < σ(z > 0) < 1 (Fig. 3.4 depicts this behavior). We may interpret the output of Eq. (3.6) as the probability of y = 1 (or y = 0). Therefore, the binary classifier can output 1 or 0 based on a threshold defined by the user. If we want to treat both classes equally, the value returned by σ(z) can be rounded, so that we have y = 1 when σ(z) > 0.5, and y = 0 otherwise.
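The thresholding rule can be illustrated with a short sketch. It assumes the standard logistic function σ(z) = 1 / (1 + e^(−z)); the function names and the two-branch formulation (used only to avoid overflow for very negative z) are illustrative choices.

```python
import math


def sigmoid(z):
    """Logistic function; mathematically its value lies strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Binary decision: 1 when the estimated probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising the threshold above 0.5 makes the classifier more conservative about predicting class 1, which can be useful when false positives are costly.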
If we used the previous cost function in logistic regression, we would have a non-convex function, for which gradient descent is not guaranteed to find the global minimum: we would risk finding a local minimum instead (Fig. 3.5 depicts this situation: solid circles represent local minima and the open one represents the global minimum). To avoid this, we consider the cost for a single example as defined in Eq. (3.7):

\[ cost(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases} \tag{3.7} \]

Fig. 3.4 The behavior of a sigmoid function
Note that, when y = 1, the cost is 0 if ŷ is also equal to 1, and it increases as ŷ → 0. This is the behavior we want, since the penalty has to increase when the distance between y and ŷ increases. The same reasoning applies when y = 0. Note that Eq. (3.7) can be rewritten as y × (−log(ŷ)) + (1 − y) × (−log(1 − ŷ)). Thus, we can define

\[ J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\Theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\Theta(x^{(i)})) \right] \tag{3.8} \]

Notice that when y = 0, the first part of the summation (the left side of the addition) is canceled since y multiplies the log value, and the same reasoning can be applied to the right side when y = 1. The gradient descent remains the same as shown in Eq. (3.3).
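The cost of Eq. (3.8) is straightforward to compute once the predicted probabilities are available. The sketch below is illustrative: the function name `log_loss` and the clipping constant `eps`, which keeps the logarithm away from log(0), are assumptions and not part of the chapter, and the probabilities in the usage lines are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy cost J over m examples, following Eq. (3.8)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
print(round(log_loss(labels, [0.9, 0.1, 0.8, 0.2]), 3))  # confident and right: low cost
print(round(log_loss(labels, [0.1, 0.9, 0.2, 0.8]), 3))  # confident and wrong: high cost
```

For each example only one of the two terms is active, exactly as described above: the first when y = 1 and the second when y = 0.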
If we plot our dataset in a vector space, logistic regression draws a line that separates the data points based on their classes. This separation is called a decision boundary. However, sometimes the classes cannot be separated by a straight line, and in this case, one solution is to transform the feature matrix into a higher dimensional space by adding new features of higher degree. Another solution is to apply another algorithm to the problem, for example, Support Vector Machines (SVM) [15].

One-versus-all is one of the approaches to deal with multi-class classification. This approach learns one class at a time, and thus, a dataset with k classes has k sets of Θs. For example, in a dataset with three classes (0, 1, and 2), we first relabel the examples of the first class as 1 and all the others as 0 and train a binary classifier; we then do the same for the second class, and so on. To verify to which class an example e_i belongs, we apply every learned Θ to e_i, and the one with the highest output corresponds to the predicted class of e_i.
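The relabeling and prediction steps of one-versus-all can be sketched as follows. Training is omitted: the weight vectors in `thetas` are hypothetical values standing in for three already trained binary classifiers on a single feature, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relabel(y, positive_class):
    """Binary labels for one sub-problem: 1 for the chosen class, 0 for all others."""
    return [1 if label == positive_class else 0 for label in y]


def predict_one_vs_all(thetas, features):
    """Apply every binary model and return the index of the highest score."""
    row = [1.0] + list(features)  # constant 1 for the bias
    scores = [sigmoid(sum(t * v for t, v in zip(theta, row))) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


print(relabel([0, 1, 2, 1], positive_class=1))  # [0, 1, 0, 1]

# Hypothetical weights [bias, weight] of the classifiers for classes 0, 1, and 2
thetas = [[0.0, -1.0], [0.0, 0.0], [0.0, 1.0]]
print(predict_one_vs_all(thetas, [2.0]))   # 2
print(predict_one_vs_all(thetas, [-2.0]))  # 0
```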
Decision trees are another representation for solving classification problems. The model is based on decision rules implemented in the nodes of a tree. This model is more understandable for humans than logistic regression. See the schematic representation below:

              Rule
           /   ...   \
    Action_1   ...   Action_n
       ...              ...

Basically, every node represents a test to be performed on a single attribute, and a child node is accessed depending on the result of the test. The testing is repeated until a leaf node is reached, where the class is finally found. Figure 3.6 presents a decision tree for the dataset from Table 3.4. The tree covers all cases of the dataset, and we may conclude that a runner will have a normal pace under temperatures below 12.3.

The choice of the attributes for each rule and node plays a major role in the success of a decision tree. The criterion is based on the information gain of the attributes (features). The attribute that gives the greatest information gain becomes the root of the tree, and the internal nodes follow the ranking of the information gain. Entropy measures the (im)purity of an attribute regarding the classes, and it may be used to calculate the information gain (i.e., how well an attribute can describe a given class) [3]. Equation (3.9) gives the mathematical definition of the entropy of a subset X_i:

\[ H(X_i) = -p_i^+ \log_2(p_i^+) - p_i^- \log_2(p_i^-) \tag{3.9} \]
where p_i^+ is the probability that a randomly taken example in X_i is positive, which can be estimated by the relative frequency p_i^+ = n_i^+ / (n_i^+ + n_i^-); the same reasoning is used to calculate p_i^-.

The entropy for every value i of an attribute attr in X is combined as follows (considering that attr has K different values):

\[ H(X, attr) = \sum_{i=1}^{K} P(X_i) \times H(X_i) \]

where P(X_i) is the probability of an example belonging to X_i, which can be estimated by the relative size of subset X_i in X: P(X_i) = |X_i| / |X|. Finally, the mathematical definition of the information gain obtained by knowing attribute attr is given by Eq. (3.10).

\[ I(X, attr) = H(X) - H(X, attr) \tag{3.10} \]

where H(X) is the entropy of the whole dataset. If X is well balanced with respect to the classes, H(X) will be close to its maximum (≈1).

Fig. 3.6 A decision tree built from the dataset in Table 3.4
Example 3.2.2 Let X be the dataset from Table 3.4. The result from Eq. (3.9) is: H(X) = −(2/(2+2)) × log2(2/(2+2)) − (2/(2+2)) × log2(2/(2+2)) = 1. This means that X is well balanced, which is easy to see since X is composed of two positive and two negative examples. Let t_1 be the temperature greater than or equal to 12.3 and the class normal be a positive example; the entropy is H(X, t_1) = −(1/3) × log2(1/3) − (2/3) × log2(2/3) = 0.92. For temperatures below 12.3, the subset contains no negative example; since the logarithm of zero is not defined, we use the usual convention 0 × log2(0) = 0, and the entropy of this pure subset is 0. The total entropy of the attribute temperature is H(X, temperature) = (1/4) × 0 + (3/4) × 0.92 = 0.69. Finally, the information gain is I(X, temperature) = 1 − 0.69 = 0.31. If we calculate the information gain for a wind speed less than or equal to 10.5, we obtain the same result. That is, both attributes have the same information gain, and so either can be the root of the decision tree. Remember that the tree is organized by following the information gain of the attributes in descending order.
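Entropy and information gain for a binary problem can be sketched with two small functions. The names `entropy` and `information_gain` and the representation of a subset as a (positives, negatives) pair are illustrative; the sketch assumes the common convention 0 × log2(0) = 0, so that a subset containing a single class contributes no entropy.

```python
import math


def entropy(positives, negatives):
    """Binary entropy of a subset, Eq. (3.9), with the convention 0*log2(0) = 0."""
    total = positives + negatives
    h = 0.0
    for count in (positives, negatives):
        if count:  # skip empty classes instead of evaluating log2(0)
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(parent, subsets):
    """Gain of a split; parent and each subset are (positives, negatives) pairs."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in subsets)
    return entropy(*parent) - weighted


print(entropy(2, 2))            # 1.0
print(round(entropy(1, 2), 2))  # 0.92
```

A split is described by the class counts of each branch, e.g. `information_gain((2, 2), [(1, 2), (1, 0)])` for a four-example dataset divided into branches of three and one examples.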
The equations above assume that the attribute values are discrete, but this is not always the case: decision trees can be induced from numerical attributes as well. One approach to discretizing the values is the following: the values of an attribute attr are sorted, the range of each class for attr is calculated, and the ranges are ranked by information gain. Each range corresponds to a discrete value of attr. A range for an attribute attr can be [10, 20], i.e., attr ≥ 10 ∧ attr ≤ 20.
Decision trees are the basis for many other tree classifiers. One of the most effective is the random forest. Roughly, a random forest combines several decision trees into an ensemble. N random samples are selected from the dataset X, and each sample is used to build a decision tree with some samples of X. Therefore, a random forest uses N decision trees to build the best model for a given dataset.
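The ensemble idea can be sketched as follows. This is a deliberately simplified illustration rather than a full random forest: each "tree" is a one-split stump on a single numerical feature, the N samples are drawn with replacement (bootstrap), and the prediction is a majority vote. All names (`fit_stump`, `fit_forest`, `predict`) and the toy data are assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample):
    """Fit a one-split tree on (value, label) pairs by minimizing training errors."""
    best = None
    for threshold, _ in sample:
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(data, n_trees, seed=0):
    """Build n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # sampling with replacement
        forest.append(fit_stump(sample))
    return forest

def predict(forest, x):
    """Majority vote over the predictions of all trees."""
    return majority([tree(x) for tree in forest])

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
forest = fit_forest(data, n_trees=25)
print(predict(forest, 0), predict(forest, 13))
```

Practical random forests also restrict each split to a random subset of the features, which this sketch omits.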
Decision trees were originally proposed to deal with classification problems. However, there are several approaches to adapting decision tree algorithms to regression problems [26]. A regression tree is similar to a classification tree, except that the label y takes continuous numerical values and a regression model is fitted to each node to give the predicted values of y.
3.3 Unsupervised Learning
When the dataset has no labels, that is, there is no previous classification of the examples, we apply an unsupervised learning algorithm. The goal is to infer classes or groups from the dataset without the help of labels. In this case, the dataset has the form $X = \{x^{(1)}, \ldots, x^{(m)}\}$. Unsupervised learning is less objective than supervised learning, since there are no labels to guide the user in the analysis. The user must know the domain of the dataset to build useful models; otherwise, the results may not be understandable.
Although unsupervised learning is harder to model than supervised learning, the importance of such techniques is growing since there are more unlabeled data than labeled ones. Besides, many learning problems are unsupervised in nature: recommendation, classification of customer behavior on a website, and market segmentation, among others. Clustering is the most popular technique for unsupervised learning.
Clustering is about discovering semantically related groups in an unlabeled dataset. The number of groups (aka clusters) is defined by the user based on their knowledge of a dataset $X$. For example, let's say $X$ contains the heights and weights of people, and we want to separate them into 3 T-shirt sizes (e.g., S, M, and L). The dataset can be split into 3 clusters, and, based on the user's knowledge, each cluster represents the height × weight characteristic of one T-shirt size.
Data clustering has been used for [18]: (i) gaining insight into data, generating hypotheses, detecting anomalies, and identifying salient features; (ii) identifying the degree of similarity among forms or organisms; and (iii) organizing the data and summarizing it through cluster prototypes.
K-means is one of the most popular and easy-to-understand clustering algorithms [19]. The basic idea is to define k centroids that help to build the clusters. Every example in the dataset is associated with one of the k centroids. The dataset is seen as a set of data points in a plane, and the algorithm tries to group them into clusters by measuring the distance between a given data point and a centroid.
Data: X = {x(1), ..., x(m)}, K centroids μ1, ..., μK
Result: A set of K centroids for X
Initialize the centroids (either K ⊂ X or data points picked from the plane);
repeat
    for each x(i) in X do
        c(i) = argmin_k ||x(i) − μk||²;  // assign x(i) to the closest centroid
    end
    for each μk in K do
        // update the centroid with the average of the points associated with it
        μk = (1 / |Ck|) Σ_{x(j) ∈ Ck} x(j), where Ck = {x(j) : c(j) = k};
    end
until K converge;
Algorithm 1: K-means pseudo code
Algorithm 1 presents a K-means pseudo-code. The two internal loops are the main parts of the algorithm. The first one associates each data point (an example from X) with a given centroid. The second loop updates the centroids' positions by averaging the points associated with them. The main loop is repeated until the data points get
and target of each centroid), and finally, the data points are updated (d). Those steps are repeated until the centroids do not change anymore. The initial centroids play an essential role in the K-means algorithm, and the resulting clusters may differ depending on the initial centroids.
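A minimal Python sketch of Algorithm 1 follows. It assumes that points are tuples of numbers, that the initial centroids are passed in explicitly (so the run is reproducible), and that convergence means the centroids no longer change; the names `k_means` and `squared_distance` are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iterations=100):
    """Return the final centroids and the clusters (lists of points)."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # First inner loop: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: squared_distance(p, centroids[k]))
            clusters[nearest].append(p)
        # Second inner loop: move each centroid to the mean of its points.
        # A centroid with no assigned points is left where it is.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[k]
            for k, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```

Calling `k_means` with different initial centroids on less clearly separated data is a simple way to observe the sensitivity to initialization mentioned above.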
Although there are methods for selecting the so-called correct number of clusters (e.g., Silhouette and CH index methods [11]), the user's knowledge plays an essential role in defining the number of centroids (the number of clusters). For the domain expert, each cluster will have a semantic meaning in relation to the domain of the dataset.
The K-means algorithm is a partition-based clustering algorithm, i.e., the idea behind it is to regard the center of the data points as the center of the corresponding cluster. Another category of clustering algorithms is based on a hierarchy. This kind of algorithm builds a hierarchical relationship among the data to cluster them. In the beginning, each data point is a cluster by itself. The closest clusters are then merged.
Fig. 3.7 First steps of K-means algorithm
The merge operation builds a dendrogram representing the nested clusters. The dendrogram shows the pattern and similarities of the clusters. Dendrograms can be seen
Deep learning is a type of representation learning where the machine itself learns several internal representations from raw data to perform regression or classification [23]. This is in contrast to more classical machine learning algorithms, which often require carefully engineered features based on domain expertise [1]. Deep learning models are built in a layer-wise structure where each layer learns a set of hidden representations that in many cases cannot be understood by a human observer. The representations in each layer are non-linear compositions of the representations in the previous layer. This allows the model to learn very simple representations in the first layers, which are then combined into more and more complex and abstract representations in each subsequent layer. For example, when deep learning models are used on images, they often start by learning to detect edges and strokes [44]. These are then combined into simple objects, which in turn are combined into even more complex objects in each layer. Since each layer only learns from the representation of the previous layer, a general-purpose learning algorithm, such as backpropagation [22], can be used to train a given network.
3.4.1 Artificial Neural Networks
Most algorithms in deep learning are based on artificial neural networks [23]. In contrast to deep learning, the field of artificial neural networks has been around for some time. It all started in 1943 when McCulloch and Pitts, a neuroscientist and a mathematician, defined a mathematical model of how they believed a neuron worked in a biological brain [28]. The next step came in 1949 when Hebb came up with a rule that made it possible to train an artificial neuron to learn and subsequently recognize a set of given patterns [16]. In 1958 Rosenblatt, a psychologist, further generalised the work of McCulloch and Pitts and proposed a model of an artificial neuron called the perceptron [39]. The mathematical definition of a perceptron is given in Eq. (3.11), and a graphical representation is shown in Fig. 3.8.
$$y = f\left(\sum_i x_i w_i + b\right) \qquad (3.11)$$
The perceptron was then further analysed and developed by Minsky and Papert [30]. In their analysis, Minsky and Papert showed that a single perceptron is not sufficient to learn certain problems (e.g., nonlinear problems such as the XOR problem). Instead, they argued that multi-layered perceptrons were needed to solve such problems. However, it was not possible to train such networks at that time. This led to an AI winter, and very little research on neural networks was conducted for some time. This has changed over the years, thanks to the increase in computational power and improvements to the methodology, such as the introduction of the backpropagation algorithm, unsupervised pre-training [9], and the rectified linear unit [34]. These improvements have allowed researchers to build networks with many hidden layers, so-called deep neural networks [23]. In the following sections, we present several different architectures of artificial neural networks used in deep learning.
A feedforward neural network is an artificial neural network where information only moves in one direction; feedforward networks are thus acyclic, i.e., free of loops. The layout of a typical feedforward network is shown in Fig. 3.9. The most basic feedforward network is the perceptron [39], where the output is an activation function applied to the weighted sum of the inputs plus a bias. If the sigmoid function, described in Eq. (3.6), is used as the activation function, a single perceptron performs exactly the same task as logistic regression (see Sect. 3.2.2). A standard architecture of feedforward networks arranges multiple neurons in interconnected layers. Each neuron in any layer, except the final output layer, has directed connections to all neurons in the subsequent layer. Networks of this type are called multilayer perceptrons. As with the perceptron, the output that each neuron propagates to the next layer is an activation function applied to the weighted sum of all its inputs plus a bias. As long as the activation function is differentiable, it is possible to calculate how the output will change if any of the weights is changed, and thus the network can be optimized with gradient-based methods.
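The forward pass of a perceptron (Eq. 3.11) and of a multilayer perceptron can be sketched in a few lines. The weights below are hand-picked for illustration, not learned, and the function names are assumptions. They are chosen so that a network with one hidden layer of two neurons computes XOR, the problem a single perceptron cannot represent.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Eq. (3.11): activation applied to the weighted sum of the inputs plus a bias."""
    return activation(sum(x * w for x, w in zip(inputs, weights)) + bias)

def layer(inputs, weight_rows, biases):
    """Fully connected layer: every neuron sees all inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def mlp(inputs, layers):
    """Feed the inputs forward through each (weight_rows, biases) layer in turn."""
    for weight_rows, biases in layers:
        inputs = layer(inputs, weight_rows, biases)
    return inputs

# Hypothetical hand-picked weights: the hidden neurons approximate OR and NAND,
# and the output neuron approximates AND of the two, which yields XOR.
xor_net = [
    ([[20, 20], [-20, -20]], [-10, 30]),  # hidden layer
    ([[20, 20]], [-30]),                  # output layer
]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(mlp([a, b], xor_net)[0]))
```

In practice the weights would be found by a gradient-based method rather than set by hand.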
Fig. 3.9 The layout of a multilayer perceptron with two hidden layers, each having 5 neurons
Convolutional neural networks (CNNs) are a special type of neural network mainly used in image analysis [25], although some researchers have also used CNNs for natural language processing [24]. The main idea behind the CNN architecture is that basic features in a small area of an image can be analysed independently of their position and of the rest of the image. Thus an image can be split into many small patches. Each patch can then be analysed in the same way and independently of the other patches. The information from each patch can then be merged to create a more abstract representation of the image.
This scheme is implemented in a CNN using two different steps: a convolutional step and a sub-sampling step. In a convolutional step, a feedforward neural network is applied to all small patches of the image, generating several maps of hidden features. In the sub-sampling step, the size of the feature map is reduced. This is often done by reducing a neighborhood of features to a single value. The most common way to do this is to represent the neighborhood by either its maximum or its average value. These two steps are then combined into a deep structure with several layers.
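The two steps can be illustrated with a small sketch on a grayscale image stored as a list of rows. As a simplifying assumption, the "network" applied to each patch is a single linear filter (a hand-picked vertical-edge kernel) with no bias or activation, and sub-sampling is max pooling over non-overlapping 2 × 2 neighborhoods. The function names are illustrative.

```python
def convolve2d(image, kernel):
    """Apply the same kernel to every patch of the image (no padding, stride 1).

    As in most deep learning libraries, this is a sliding dot product
    (cross-correlation) rather than a flipped-kernel convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Reduce each non-overlapping size x size neighborhood to its maximum."""
    return [
        [max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(feature_map[0]) - size + 1, size)]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# 5 x 5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # hypothetical hand-picked filter

features = convolve2d(image, vertical_edge)
pooled = max_pool(features)
print(features[0], pooled)  # [0, 2, 0, 0] [[2, 0], [2, 0]]
```

The feature map responds only where the edge is, wherever that happens to be in the image, and pooling keeps that response while shrinking the map.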
It has been shown that a CNN learns to detect general and simple patterns, such as edges, lines, and dots, in the first layers [44]. The abstraction of the learned features increases with each layer. If, for example, the first layer detects edges and dots, the next layer may combine these edges and dots into simple patterns. These patterns may then be combined into more complex and abstract objects in the next layer. One of the main benefits of this approach is that the CNN learns translation-invariant features. Thus a CNN can learn general features about objects in an image
A recurrent neural network (RNN) is a type of artificial neural network with cyclical connections between neurons, unlike feedforward networks, which are acyclic [37]. This allows the network to keep an inner state, so that it can act on information from previous inputs and thus exhibit dynamic temporal behaviour. This makes RNNs well suited to the analysis of sequential data, such as text [29] and time series [5]. One big problem with recurrent neural networks, which also occurs in deep feedforward networks, is that the gradients in backpropagation tend to either vanish (go to zero) or explode (go to infinity) [36]. This has, however, been partially solved by the introduction of special network architectures, such as the long short-term memory (LSTM) [14] and the gated recurrent unit (GRU) [4].
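The role of the inner state can be illustrated with a single recurrent neuron. The update rule used here, $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$, is the simple (Elman-style) recurrence and is an assumption of this sketch, as are the weight values and the function name; it is not taken from the text.

```python
from math import tanh

def rnn_states(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Run one recurrent neuron over a sequence and return every hidden state."""
    h = 0.0  # initial inner state
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Both sequences end with the same input (0), but the final states differ
# because the network remembers what it saw earlier.
print(rnn_states([0, 0])[-1], rnn_states([1, 0])[-1])
```

A feedforward neuron given only the last input would produce the same output for both sequences.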
following the same distribution as the collected data. The second network, called the discriminator, tries to distinguish between the examples generated by the generator and the data sampled from the real data distribution.
The training of these two networks consists of two phases: the first trains the generator and the second trains the discriminator. In the first phase, the generator creates several examples and receives information about how the discriminator would judge them, and in which direction to change them so that they are more likely to pass as real data. In the second phase, several generated and real examples are presented to the discriminator, which classifies them as real or generated. The discriminator is then given the correct answers and information on how to change its settings to perform better when classifying future examples. This can be compared to the competition between a money counterfeiter and a bank. The task of the counterfeiter is to generate fake money, and the bank should be able to determine whether money is fake or not. If the counterfeiter gets better at creating fake money, the bank must take new measures to discover it, and if the bank gets better at discovering fake money, the counterfeiter must come up with better and more creative ways to create it. The hope when training a GAN is that the generating network and the discriminating network will reach a stalemate where both are good at their tasks. Successful applications of GANs include the generation of images of human faces [12], images of hotel rooms [38] and the generation of text
There are many representations for building models from data, and we have seen some of them in the previous sections. However, we need to evaluate the built model to check how well it performs on unseen examples, that is, how well it generalizes beyond the training dataset. The design of a machine learning system, as stated before, is composed of several steps: (i) the choice of a dataset as input, (ii) the choice of a representation for a learner, (iii) an approach to optimize the model, and, finally, (iv) an evaluation of the model. The evaluation (or assessment) must be done on a dataset not used for training. Basically, the original dataset is split into two subsets: the training set and the test set. A common way of dividing the dataset is 70% for training and 30% for testing. Note that the number of examples in each subset depends on the number of examples in the original dataset, and the selection of examples for each subset must be balanced (mainly in classification problems), that is, each subset must contain representative information about the domain to be modeled.
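One way to obtain a balanced 70/30 split is to split each class separately (a stratified split). The sketch below is illustrative: the function name, the `test_fraction` and `seed` parameters, and the toy data are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, test_fraction=0.3, seed=0):
    """Split (example, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)  # random selection within the class
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

examples = list(range(20))
labels = [0] * 10 + [1] * 10
train, test = stratified_split(examples, labels)
print(len(train), len(test))  # 14 6
```

A three-way split into training, validation, and test sets can be obtained by applying the same function again to the training part.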
Another important remark is that, during the training phase, we have to test our algorithm (or algorithms) using different hyperparameters (e.g., learning rate, node purity, number of clusters). In this scenario, we may also divide the training set into cross-validation sets (or split the original dataset into three subsets: training, validation, and test). The training data are then used to train several learning algorithms. The performance of the trained algorithms is evaluated on the validation set, and the best one is chosen to model our problem. Finally, the test data are used to evaluate the chosen model against new examples.
In the training set, we use a loss function (or another similar function) to verify whether or not our model is converging. When we are satisfied with the results in the training set, our model is run against the test set, and, depending on the representation used, we choose a metric to evaluate the predictions made by the model. A metric for classification is different from a metric for regression, and neither is the same as a metric for unsupervised approaches. In the following, we present some metrics for the representations discussed in the previous sections.
• R² score (aka coefficient of determination) is a number that indicates the proportion of the variance in the real output that is explained by the predicted output. It is calculated as follows: $R^2 = 1 - \frac{\sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{m} (y^{(i)} - \bar{y})^2}$, where $m$ is the size of the (test) dataset, $\hat{y}$ is the predicted value, and $\bar{y}$ is the average of the ground-truth values. The closer $R^2$ is to one,
• Mean absolute error (MAE) is similar to MSE, but it does not square the error. The absolute value of the difference is used instead. It is defined as follows: $MAE = \frac{1}{m} \sum_{i=1}^{m} |y^{(i)} - \hat{y}^{(i)}|$.
• Mean absolute percent error (MAPE) is another metric to evaluate regression models; it expresses the error in generic percentage terms: $MAPE = \left( \frac{1}{m} \sum_{i=1}^{m} \frac{|y^{(i)} - \hat{y}^{(i)}|}{|y^{(i)}|} \right) \times 100$
MAPE and MAE are less sensitive to occasional very large errors because they do not square the errors. Therefore, if we want the evaluation to be less dominated by a few big prediction errors, MAPE and MAE may be used. However, the metric usually regarded as one-size-fits-all is RMSE [43].
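The regression metrics discussed above can be computed directly from their definitions. In this sketch, the function names and the toy values are illustrative, and RMSE is taken to be the square root of the mean squared error.

```python
from math import sqrt

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root of the mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    """Mean absolute percent error; undefined if a ground-truth value is 0."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y_true = [2, 4, 6, 8]
y_pred = [3, 4, 5, 10]
print(mae(y_true, y_pred), round(rmse(y_true, y_pred), 3),
      round(mape(y_true, y_pred), 2), round(r2(y_true, y_pred), 2))
# 1.0 1.225 22.92 0.7
```

The single error of 2 contributes half of the absolute error but two thirds of the squared error, which is why RMSE reacts more strongly to large errors than MAE.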
The metrics for classification problems are a little easier to apply to the model than the regression ones. Roughly speaking, the metrics are based on counting how many predicted classes are equal to the observed ones. We use binary classification to present the metrics since it is more intuitive to understand. The same reasoning applies to multi-class prediction.
The simplest case for evaluating a classifier is when the classes are well balanced in the dataset (training and test). Accuracy is the metric for this scenario. The predicted classes are matched against the observed ones, and the number of matches is divided by the size of the dataset: $\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} == \hat{y}^{(i)})$, considering that false is 0 and true is 1.
In most cases, the classes in a dataset are skewed, that is, the number of examples per class is not balanced. For instance, in a dataset with examples of benign and malignant tumors, most of the examples may be labeled as benign tumors. If 96% of the tumors are labeled as benign, and the model outputs benign for every example, its accuracy will be 96% (a very good accuracy); however, we know that the model
A true positive (TP) means that the classifier correctly predicts a positive class, and a false positive (FP) means that the classifier predicts a negative class as a positive one. The same reasoning holds for negative classes: a true negative (TN) indicates that a negative class is predicted correctly, and a false negative (FN) indicates that a positive class is predicted as a negative one. We remark that the confusion matrix can be extended for
Precision, recall, and F1-score can be defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$F_1\text{-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
The precision metric is used when we want exactness, that is, when the examples our classifier labels as positive should really be positive. Recall, on the other hand, means completeness, that is, how many of the positive examples our classifier has found rather than missed. If we want to balance recall and precision, the F1-score gives the harmonic mean of the two.
Note that accuracy can also be calculated as follows: $\frac{TP + TN}{TP + FP + TN + FN}$.
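The classification metrics follow directly from the four counts of the confusion matrix. In the sketch below, the function names, the choice of 1 as the positive class, and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred, positive=1):
    # Assumes at least one predicted and one actual positive example;
    # otherwise precision or recall would divide by zero.
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))        # (3, 1, 5, 1)
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.8)
```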
Evaluating the clusters produced by a clustering algorithm is not an easy task since there are no true labels to compare with the clusters. The evaluation can be divided into two categories: internal and external. The internal category measures the quality using the training data, and the external category uses external data (a test set). However, external evaluation is not as accurate as the evaluation methods for supervised learning [10].
Silhouette Coefficient is an (internal) metric to measure the quality of the built clusters. It is a popular method that combines both cohesion (how close an object is to its own cluster) and separation (how far an object is from the other clusters). The silhouette coefficient for an individual object (example or data point) can be computed as follows: (i) given an example $e_i$, calculate its average distance to the other examples in the same cluster ($a_i$); (ii) calculate its average distance to the examples of each other cluster and keep the smallest of these averages ($b_i$); and (iii) the silhouette coefficient for $e_i$ is $s_i = \frac{b_i - a_i}{\max(b_i, a_i)}$. The average of all coefficients $s_i$ gives the coefficient of the whole clustering.
Note that the coefficient takes a value between −1 and 1. A negative value of $s_i$ is not desirable since it indicates that the average distance of $e_i$ to its own cluster is greater than its average distance to another cluster. On the other hand, a positive value of $s_i$ is desirable, and $s_i = 1$ indicates that the average distance $a_i$ is 0.
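The three steps can be sketched as follows. The sketch assumes Euclidean distance, takes $b_i$ as the smallest average distance to any other cluster, and assigns 0 to points that are alone in their cluster (a common convention, not stated in the text). The function name and toy data are illustrative.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Average silhouette coefficient of a clustering (needs at least 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: coefficient set to 0 by convention
            scores.append(0.0)
            continue
        a = mean(own)  # step (i): cohesion
        b = min(       # step (ii): separation from the nearest other cluster
            mean([dist(p, q) for q, l in zip(points, labels) if l == other])
            for other in set(labels) - {labels[i]}
        )
        scores.append((b - a) / max(a, b))  # step (iii)
    return mean(scores)

points = [(0,), (1,), (10,), (11,)]
good = silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # each cluster mixes near and far points
print(round(good, 2), round(bad, 2))     # 0.9 -0.45
```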
Rand Index (RI) is a metric for external evaluation. It compares the predicted clusters to the real clusters (manually assigned by an expert user), and it is similar to the accuracy metric for supervised algorithms. Here is how to calculate RI:
1. Let $X$ be a dataset, $C_p$ be a clustering (set of clusters) built by the clustering algorithm (predicted), and $C_r$ be a set of ground-truth clusters;
2. TP is the number of pairs of examples (data points) belonging to the same cluster in both $C_p$ and $C_r$;
3. TN is the number of pairs of examples (data points) belonging to different clusters in both $C_p$ and $C_r$;
4. FP is the number of pairs of examples (data points) belonging to the same cluster in $C_p$ but to different clusters in $C_r$;
5. FN is the number of pairs of examples (data points) belonging to different clusters in $C_p$ but to the same cluster in $C_r$;
6. Rand Index is calculated as follows: $RI = \frac{TP + TN}{TP + TN + FP + FN}$.
We remark that RI is calculated exactly as accuracy.
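The steps above can be sketched in Python. Following the standard definition of the Rand index, the four counts are taken over all pairs of examples; the function name and the toy cluster assignments are illustrative.

```python
from itertools import combinations

def rand_index(predicted, truth):
    """Rand index of two cluster assignments given as lists of cluster ids."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1  # together in both clusterings
        elif not same_pred and not same_true:
            tn += 1  # apart in both clusterings
        elif same_pred:
            fp += 1  # together in the prediction only
        else:
            fn += 1  # together in the ground truth only
    return (tp + tn) / (tp + tn + fp + fn)

print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
print(rand_index([1, 1, 0, 0], [0, 0, 1, 1]))  # 1.0
```

The second call shows that only the grouping matters, not the names given to the clusters.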
Metrics are essential tools to evaluate a machine learning system. Each one must be carefully studied to understand the behavior of the built model. Some metrics can be affected by noise in the data (e.g., outliers), while others may smooth the effects of noise.
Dimensionality reduction plays an essential role in machine learning. Its goal is to decrease the number of features of a dataset. As an example, let's suppose that we want to identify objects in a set of images. Each image has 100 × 100 pixels, thus we have 10,000 features. A dimensionality reduction approach can bring the number of features down to 1000. Dimensionality reduction may be applied to:
• Compress data in main and secondary storage.
• Speed up learning algorithms.
• Visualize the dataset in 2D or 3D.
The reduction may also merge the more correlated features into one (or more). For instance, a dataset can have a feature $f_1$ that represents the height in centimeters and another feature $f_2$ that also represents the height but in inches. Based on the correlation of $f_1$ and $f_2$, a dimensionality reduction technique may merge both features to form a new one.
Given a dataset $X = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ (possibly without labels $y^{(i)}$), dimensionality reduction aims to transform $x^{(i)} \in \mathbb{R}^d$ into $z^{(i)} \in \mathbb{R}^k$ (where $k < d$), resulting in the reduced dataset $\{(z^{(1)}, y^{(1)}), \ldots, (z^{(m)}, y^{(m)})\}$. Figure 3.10 shows a 3D dataset (three features, a) reduced to a 2D dataset (two features, b). The reduction was done using the PCA technique.
Fig. 3.10 A 3D dataset reduced to 2D dataset
3.6.1 Principal Component Analysis (PCA)
PCA reduces the dimensionality of a dataset by projecting the data points onto a lower-dimensional surface (e.g., a plane) while minimizing the projection error, i.e., the distance between the points and their projections. It can be described as follows: given a dataset with $d$ dimensions, find $k$ vectors $\mu^{(1)}, \ldots, \mu^{(k)}$ onto which to project the data, so as to minimize the squared projection error.
To find the vectors, the following steps must be done: (i) find the covariance matrix of the dataset (with the features centered at zero mean), $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \cdot x^{(i)T}$ ($\Sigma$ is a $d \times d$ matrix); (ii) calculate the eigenvectors⁶ $U$ of $\Sigma$ ($U$ is a $d \times d$ matrix); and (iii) reduce $X$ to $k$ dimensions based on $U$ as follows: $X_{reduced} = U[:, 1:k]^T \cdot X^T$. Note that the new dimensions are taken from $U$, whose first $k$ columns represent the vectors $\mu^{(i)}$. We can approximately reconstruct
butions with a random walk on neighborhood graphs to find the structure within the data [27]. t-SNE converts distances between data in the original space to probabilities. Another approach to nonlinear dimensionality reduction is an extension of PCA: nonlinear PCA. Nonlinear PCA is performed using a five-layer neural network [35] that captures the complex nonlinear relationship between the features.
The decision to use a linear or a nonlinear approach must be based on the characteristics of the dataset; however, it is sometimes not easy to identify whether features are linearly or nonlinearly correlated. A rule of thumb is to start with PCA. If it does not work well, use more sophisticated techniques.
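As a starting point, the PCA steps of Sect. 3.6.1 can be sketched for the special case k = 1. The sketch centers the data, builds the covariance matrix, and then uses power iteration to approximate the leading eigenvector. Power iteration is a stand-in chosen to keep the example dependency-free; it is not a method named in the text, and a numerical library would normally compute all eigenvectors at once. The function name and toy data are illustrative.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Return the leading eigenvector u of the covariance matrix and the 1-D projections."""
    m, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Step (i): covariance matrix (d x d) of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / m for b in range(d)] for a in range(d)]
    # Step (ii), restricted to one eigenvector: repeatedly multiply by cov and normalize.
    # Assumes the data are not all identical (non-zero covariance).
    u = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[a][b] * u[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    # Step (iii): project every centered example onto u.
    z = [sum(c * uc for c, uc in zip(r, u)) for r in centered]
    return u, z

rows = [[1, 2], [2, 4], [3, 6]]  # points lying exactly on the line y = 2x
u, z = first_principal_component(rows)
print([round(c, 3) for c in u], [round(c, 3) for c in z])
# [0.447, 0.894] [-2.236, 0.0, 2.236]
```

Because the toy points lie on a line, one dimension captures them without any projection error.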
The design of a machine learning system is composed of several steps: from choosing a domain (dataset) to building a model that accurately predicts new information about the given domain. This section closes the chapter by presenting some issues that are orthogonal to the subjects discussed so far. An essential step in machine learning is to prepare the dataset as a proper input for a learning algorithm. The quality of the dataset has a high impact on the performance of machine learning-based methods. Thus, in this section, we present some issues about preprocessing the data and checking the behavior of the learning algorithm.
Some characteristics of a dataset may have an adverse influence on the learning results or may even be unsuitable for certain classes of learning algorithms. Several learning algorithms deal only with numerical values, and, in this case, features that are not numerical must be either discarded or transformed into numerical values (discretization may be applied). For example, a feature that stores gender as m and f may have its values replaced by the numerical values 1 and 2, respectively. The contrary is also true: a numerical value can be transformed into a string to speed up some learning algorithms (e.g., decision trees).
A dataset can also have features on very different scales, that is, one feature can store values on the order of thousands and another on the order of tens. An example is the price of an apartment and its number of bathrooms. In this case, we can use a technique called feature scaling, that is, all features are scaled to the same range of values. The two most common techniques are standardization, $\frac{X - \bar{X}}{\sigma}$, and rescaling, $\frac{X - \min(X)}{\max(X) - \min(X)}$. The former makes all features have an average close to 0, and the latter makes all features take values between 0 and 1. The feature
Missing data is another problem that must be solved during the preprocessing stage.⁷ Three approaches may be used to address it: analyze only the available data, impute the missing values in the dataset, or use a learning algorithm that deals with missing data.
For the first case, examples or features are deleted from the dataset. This is recommended when there are not many missing values, the dataset size allows the removal of some examples or features, and the values are missing at random.
Missing value imputation aims to replace missing values with plausible ones. The new values are calculated using traditional statistical methods (e.g., the mean, the median, or the most frequent value) or more sophisticated approaches (e.g., Expectation Maximization [7], Shell Neighbor Imputation [45]). The third approach is to use a learning algorithm that integrates components to deal with missing values. These extensions of traditional learning algorithms probably use some of the statistical methods cited previously, e.g., [41].
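The two feature scaling formulas and the simplest imputation strategy (replacing a missing value with the mean of the feature) can be sketched for a single feature stored as a list. The function names are illustrative, missing values are represented by `None`, and standardization uses the population standard deviation; all three choices are assumptions of this sketch.

```python
from statistics import mean, pstdev

def standardize(values):
    """(x - mean) / standard deviation; the result has mean 0."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def rescale(values):
    """(x - min) / (max - min); the result lies between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

print(rescale([10, 20, 30]))                        # [0.0, 0.5, 1.0]
print([round(v, 3) for v in standardize([1, 2, 3])])  # [-1.225, 0.0, 1.225]
print(impute_mean([1.0, None, 3.0]))                # [1.0, 2.0, 3.0]
```

Both scaling functions divide by zero for a constant feature, which would need to be handled (or the feature dropped) in real data.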
In the dataset preprocessing step, we can also drop unnecessary features (aka attribute or feature selection) or create new features based on the existing ones (aka attribute or feature transformation). Imagine a dataset with characteristics of cars, from which we want to learn the safety of the cars (e.g., low, medium, high). A feature like the license plate will not be important for the learning algorithm. On the other hand, we can build new features from existing ones to improve the performance of the learning algorithm, e.g., based on the weight of the car and its horsepower, a new feature power-to-weight ratio can be created. Feature creation is very useful, for example, when we want to build a polynomial model for linear regression (as we saw in Sect. 3.2.1.1).
Feature selection is a widely used tool in dataset preprocessing to improve a learning algorithm's performance, and there are many approaches to accomplish it. Most of them identify the relevance of a feature in relation to the others. Based on the relevance, a subset of features can be extracted from the original dataset [33].
The preprocessing step plays an essential role in building good machine learning systems. There is no good rule of thumb to guide this step; the best thing to do is to test several approaches and assess the results. Another issue to consider is that an approach may fit one dataset very well but perform poorly on another one.
During the learning step, the model may suffer from some learning problems. The two most common ones are underfitting and overfitting. They are intimately related to the bias and the variance of the model. Bias and variance are used to identify issues in assessing the ability of a learning method to generalize.
⁷ For the sake of simplicity, we consider an invalid value (e.g., mixed characters and numerical values) for a feature as missing data too.
Roughly speaking, variance means how much structure from the dataset the model has learned, while bias means how much structure from the dataset the model has not learned. That is, bias is a learner's tendency to consistently learn the same wrong thing, and variance is the tendency to learn random things regardless of the real signal [8]. Intuitively, a biased model performs poorly in both the training set and the test set (as expected), while a model with high variance performs well in the training set but poorly in the test set. When a model has high variance, it has captured all the details of the training set (including noise and outliers) and thus cannot generalize to unseen examples.
Bias and variance can be identified by checking the performance of the model
in both the training and test sets. If a model performs well in the training set,
its performance in the test set should follow its performance in the training set.
The ideal scenario is low bias and low variance, that is, the model neither makes
strong assumptions about the dataset nor learns useless characteristics from it.
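The comparison of training and test performance described above can be sketched with polynomial models of increasing degree fitted to noisy samples of a sine curve. The target function, noise level, sample sizes, and degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def target(x):
    # Assumed "true" relationship behind the data.
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger noisy test set.
x_train = np.linspace(0.0, 1.0, 15)
y_train = target(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = target(x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# degree -> (training error, test error)
errors = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))

for degree, (train_err, test_err) in errors.items():
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Under these assumptions, the straight line (degree 1) should show a high error on both sets, which is the signature of bias, whereas the degree-12 model should fit the training points almost perfectly while doing worse on the test set, which is the signature of variance.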
The overfitting and underfitting problems are usually addressed using a (cross-)
validation dataset. Models and their parameters are trained on the training set,
ranked on the validation set, and the best model is evaluated on the test set.
Another widely used tool to combat over/underfitting is to add a regularization
term to the evaluation function. For example, we can add a regularization term to
the gradient descent equation (Eq. 3.4):
\[
\Theta = \Theta - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right)x^{(i)} + \lambda\sum_{j=1}^{m}\Theta_j\right] \tag{3.12}
\]
where λ represents the strength of the regularization. Clearly, if λ is equal to 0, then
there is no regularization (or penalty).
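A regularized update of this kind can be sketched for linear regression as follows. The sketch assumes the common L2 (ridge) reading in which each parameter Θj is shrunk by its own term λΘj and the bias parameter is left unpenalized; both choices, as well as the function name and the synthetic data, are assumptions rather than details stated in the chapter.

```python
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.1, lam=0.0, n_iters=2000):
    """Batch gradient descent for linear regression with an L2 penalty.

    X is assumed to contain a leading column of ones for the bias term.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residual = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        data_term = (X.T @ residual) / m  # (1/m) * sum over the examples
        penalty = lam * theta             # assumed per-parameter L2 term
        penalty[0] = 0.0                  # convention: do not penalize the bias
        theta = theta - alpha * (data_term + penalty)
    return theta

rng = np.random.default_rng(1)
m = 100
features = rng.normal(size=(m, 3))
X = np.column_stack([np.ones(m), features])
true_theta = np.array([1.0, 2.0, -3.0, 0.5])  # hypothetical ground truth
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta_plain = regularized_gradient_descent(X, y, lam=0.0)  # no penalty
theta_reg = regularized_gradient_descent(X, y, lam=1.0)    # penalized
```

With λ = 0 the procedure reduces to plain gradient descent, and with λ > 0 the non-bias parameters are pulled toward zero. In practice λ would itself be chosen on the validation set.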
Multi-label and multi-target⁸ algorithms are classifiers that learn a vector of values
from the observed data (examples). Most traditional supervised learning algorithms
are extended to deal with multi-label or multi-target values. We do not cover these
Even though there is a large body of research on machine learning, it remains a
young field with many under-explored research opportunities. In addition, it has a
lot of folk wisdom that can be hard to come by but helps its development [8, 20].
Besides, machine learning plays an important role in data science. The knowledge
that machine learning brings to building predictive models makes it an essential
tool for those wanting to extract information and knowledge from data.
References
1. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
2. Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.
3. Bratko, I., Michalski, R. S., & Kubat, M. (1999). Machine learning and data mining: Methods and applications.
4. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015). Gated feedback recurrent neural networks. In International Conference on Machine Learning (pp. 2067–2075).
5. Connor, J. T., Martin, R. D., & Atlas, L. E. (1994). Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5(2), 240–254.
6. De Houwer, J., Barnes-Holmes, D., & Moors, A. (2013). What is learning? On the nature and merits of a functional definition of learning. Psychonomic Bulletin & Review, 20(4), 631–642.
7. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
8. Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
9. Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 625–660.
10. Färber, I., Günnemann, S., Kriegel, H. P., Kröger, P., Müller, E., & Schubert, E., et al. (2010). On using class-labels in evaluation of clusterings. In Multiclust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD (p. 1).
11. Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27–39.
12. Gauthier, J. (2014). Conditional generative adversarial nets for convolutional face generation. In Class Project for Stanford CS231N: Convolutional neural networks for visual recognition (Vol. 2014, No. 5, p. 2). Winter Semester.
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., & Ozair, S., et al. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
14. Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850.
15. Gunn, S. R. (1998). Support vector machines for classification and regression. Technical report, Faculty of Engineering, Science and Mathematics–School of Electronics and Computer Science.
16. Hebb, D. (1949). The organization of behavior: A neuropsychological theory. Wiley.
17. Izenman, A. J. (2008). Modern multivariate statistical techniques. Regression, classification and manifold learning.
18. Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
19. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264–323.
20. Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260.
21. Keogh, E., & Mueen, A. (2010). Curse of dimensionality. US: Springer.
22. Le Cun, Y., Touresky, D., Hinton, G., & Sejnowski, T. (1988). A theoretical framework for back-propagation. In The connectionist models summer school (Vol. 1, pp. 21–28).
23. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
24. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
25. LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., & Denker, J., et al. (1995). Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, Perth, Australia (Vol. 60, pp. 53–60).
26. Loh, W. Y. (2011). Classification and regression trees. In Wiley interdisciplinary reviews: Data mining and knowledge discovery (Vol. 1, No. 1, pp. 14–23).
27. Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
28. McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133.
29. Mikolov, T., & Zweig, G. (2012). Context dependent recurrent neural network language model. SLT, 12, 234–239.
30. Minsky, M., & Papert, S. (1969). Perceptrons.
31. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv:1411.1784.
32. Mitchell, T. M. (1997). Machine learning (1st ed.). New York, NY, USA: McGraw-Hill Inc.
33. Molina, L. C., Belanche, L., & Nebot, A. (2002). Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 306–313).
34. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (pp. 807–814).
35. Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25–45.
36. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML, 3(28), 1310–1318.
37. Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19), 2229.
38. Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
39. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
40. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM
42. Weisberg, S. (2005). Applied linear regression (Vol. 528). Wiley.
43. Willmott, C. J. (1982). Some comments on the evaluation of model performance. Bulletin of the American Meteorological Society, 63(11), 1309–1313.
44. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818–833). Springer.
45. Zhang, S. (2011). Shell-neighbor method and its application in missing data imputation. Applied Intelligence, 35(1), 123–133.