Machine Learning
Gangadhar Shobha and Shanta Rangaswamy
R. V. College of Engineering, Bengaluru, India
Corresponding authors: e-mail: [email protected]; [email protected]
ABSTRACT
The objective of this chapter is to provide the reader with an overview of machine
learning concepts and different types of learning techniques which include supervised,
unsupervised, semi-supervised, and reinforcement learning. Learning algorithms dis-
cussed in this chapter help the reader to easily move from the equations of the book
to a computer program. Various metrics like accuracy, precision, confusion matrix,
recall, RMSE, and quantile of errors used to evaluate machine learning algorithms
are outlined in this chapter. At the end of this chapter we present various applications
of machine learning techniques followed by future trends and challenges.
Keywords: Learning model, Machine learning, Algorithms, Regression, Metrics,
Decision tree
1 INTRODUCTION TO MACHINE LEARNING
[Figure: machine learning shown at the intersection of statistics, pattern recognition, and artificial intelligence.]
Machine learning algorithms enable computers to build models from the available sample data and to automate the decision-making process based on data inputs and experience. These techniques identify patterns in data and provide various tools for data mining.
Today, nearly every technology user benefits from machine learning.
The technology of facial recognition allows social media to tag its users. Rec-
ommendation engines suggest which movies or television shows to watch next
based on user preferences. Optical Character Recognition (OCR) technology
converts images of text into machine-editable text. Self-driving cars, which depend on
machine learning to navigate, may soon be available to consumers.
Thus, machine learning is a continuously evolving and developing field,
wherein some known and unknown challenges need to be analyzed.
Machine learning is broadly classified as supervised, unsupervised, semi-
supervised, and reinforcement learning. A supervised learning model has two
major tasks to be performed, classification and regression. Classification is
about predicting a nominal class label, whereas regression is about predicting
the numeric value for the class label. Mathematically, building a regression
model is all about identifying the relationship between the class label and the
input predictors. Predictors are also called attributes. In statistical terms, the
predictors are called independent variables, while the class label is called the
dependent variable. A regression model is a representation of this relationship
between dependent and independent variables. Once this is learnt during the
training phase, any new data is plugged into the relationship curve to find the
prediction. This reduces the machine learning problem to solving a mathemati-
cal equation. The broad classification of machine learning is depicted in Fig. 2.
[Fig. 2: The broad classification of machine learning into supervised, unsupervised, semi-supervised, and reinforcement learning.]
2 TERMINOLOGIES
The statistical perspective of machine learning frames data in the context of a
hypothetical function (f ) that the machine learning algorithm aims to learn.
Given an input variable (input) the function answers the question as to what is
the predicted output variable (output).
Output = f(Input)
In database terms, a row of data describes an entity or an observation
about an entity, and the columns of a row are referred to as the attributes of that
observation; the rows are the instances. Variables are the properties or the kind of charac-
teristics of a certain event or object in a dataset. It could be dependent or inde-
pendent variables. Independent variables are the ones that can be manipulated
or changed by researchers or data analysts, and their effects are then measured
and compared. Independent variables are also known as predictors.
Independent variables are called so because they predict or forecast the values
of the dependent variable in the model. Dependent variables are the variables
that measure the effect of the independent variable(s) on the test units. The
dependent variable is also known as the predicted variable, because its values
are predicted or explained by the predictors, or independent variables.
Consider a scenario where a student’s score in an examination is a depen-
dent variable, as it may change depending on several factors, like how much
he has studied a particular subject, how much he slept the night before the
exam, or how hungry he was when he took the exam. Generally, when an ana-
lyst is looking for a relationship between two things, he is trying to figure out
what makes the dependent variable change the way it does.
Overfitting is a scenario in which the classifier has low training error but high
testing error. In case of overfitting, a classifier or a model tries to capture
every sample point instead of the general pattern underlying the samples. Suppose a classi-
fier model is built using decision tree approach. The size (length and width)
of the tree built would mainly depend on the number of features and the num-
ber of instances in the dataset. Too small or too large tree may not be favor-
able in terms of accuracy and the speed at which it reaches a class label. Also
the tree built may have a high accuracy on the training data, but very less
accuracy on the test data. This scenario is known as overfitting. Two approaches
to avoid overfitting are: (i) prepruning, that is, halting tree construction early by
not splitting a node if this would cause the goodness measure to fall below a
threshold (though it is difficult to choose an appropriate threshold), and
(ii) postpruning, that is, removing branches from a fully grown tree to obtain a
sequence of progressively pruned trees, as sketched below.
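As a hedged sketch (assuming scikit-learn and using its bundled breast-cancer dataset purely as a stand-in), prepruning can be approximated by capping the depth or leaf size of the tree, and postpruning by cost-complexity pruning of a fully grown tree:

```python
# Hypothetical sketch: pre- and postpruning a decision tree with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: halt tree construction early via depth/leaf-size thresholds.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow a full tree, then prune branches via cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print("prepruned test accuracy :", pre.score(X_test, y_test))
print("postpruned test accuracy:", post.score(X_test, y_test))
```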
Variance: When a model performs really well on the training dataset but
poorly on the cross-validation data, the model is said to have high variance. This
means the model is overfitting (the model is so complicated that it fits the
training data very well). Potential solutions to overcome variance are: get
more training data (if possible), use a smaller feature set or a simpler model
(by reducing the order of the polynomial or the number of layers in a neural
network), and increase the regularization parameter (that is, penalize large
feature weights). The k-nearest neighbors algorithm is an example of a
high-variance algorithm, while Linear Discriminant Analysis (LDA) is an
example of a low-variance algorithm.
Underfitting: Underfitting occurs when a model is too simple, is informed by
too few features, or is regularized too much, which makes the model too rigid
to learn from the dataset. Simple learners tend to have less variance in their
predictions but more bias toward wrong outcomes. Underfitting happens when
the learner has not found a solution that fits the observed data to an acceptable
level, for example, if the required learning time is long and the learning stage
is prematurely terminated, if the learner does not use a sufficient number of
iterations, or if the learner tries to fit a straight line to a training set whose
examples exhibit a quadratic nature.
A model that underfits the training data will miss important aspects of the
data, and this will negatively impact its performance in making accurate pre-
dictions on new data it has not seen during training.
Bias: A learning model is said to have bias when it performs poorly on the
training data as well as on the cross-validation data. Potential solutions to
overcome bias are: make a more complex (bigger) model (for example, a neural
network with more layers, or polynomial features), train on the dataset for
longer, or decrease the regularization parameter. Using more examples will not have
much influence on the model, as the model is already inadequate and underfits
(bias) the training data. Decision tree is an example of a low-bias algorithm,
and linear regression is an example for high-bias algorithm.
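The following is a small illustrative sketch (with synthetic quadratic data, not taken from the chapter) contrasting a high-bias underfit with a high-variance overfit:

```python
# Illustrative sketch (assumed synthetic data): a high-bias (underfit) and a
# high-variance (overfit) polynomial fit to the same quadratic ground truth.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, 20))
y_train = x_train ** 2 + rng.normal(0, 1, x_train.size)
x_test = np.sort(rng.uniform(-3, 3, 20))
y_test = x_test ** 2 + rng.normal(0, 1, x_test.size)

for degree, label in [(1, "high bias (underfit)"), (9, "high variance (overfit)")]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree} [{label}]: train MSE={train_err:.2f}, test MSE={test_err:.2f}")
```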
4.1 Accuracy
Accuracy is the proportion of correctly classified instances out of the total instances.
It is defined as the ratio of number of correct predictions to the total number
of predictions. However, if the test data is not balanced (that is, when most of
the instances or records belong to one of the classes), or one is more interested
in the performance on either one of the classes (biased), the accuracy metric
will not be able to capture the effectiveness of a classifier. For example, in
an employee income level classification scenario, a data analyst is testing
on some data where 99% of the instances represent employees who earn less
than or equal to 50K per annum. It is possible to achieve an accuracy of 0.99
by predicting the class "≤50K" for all instances. The classifier in this scenario
seems to be doing well, but in reality fails to classify any of the high-income
individuals (the top 1%) correctly, as the sketch below illustrates.
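A tiny sketch of this pitfall, using invented labels that mirror the 99%/1% split described above:

```python
# Sketch of the imbalanced-accuracy pitfall: a classifier that always predicts
# "<=50K" still scores 0.99 accuracy while missing every high-income individual.
import numpy as np

y_true = np.array(["<=50K"] * 990 + [">50K"] * 10)
y_pred = np.array(["<=50K"] * 1000)                      # majority-class predictor

accuracy = np.mean(y_true == y_pred)
recall_high_income = np.mean(y_pred[y_true == ">50K"] == ">50K")
print("accuracy:", accuracy)                             # 0.99
print("recall on the >50K class:", recall_high_income)   # 0.0
```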
The performance of a binary classifier is often summarized in a confusion matrix:

                     Predicted class
                     Positive    Negative
Actual class  True      TP          FN
              False     FP          TN

When the class label is nominal with two levels, the first level is typically chosen as the
negative class, while the second level is chosen as the positive class. If the
class labels are Boolean or integer in nature, then "1" or "true" labeled instances
are assigned as the positive class.
4.4 F Measure
In statistical analysis of binary classification, the F score (or F measure) is a
metric of a test’s accuracy. It takes into consideration the precision and the
recall of the test to compute its score. F measure is the harmonic average of
precision and recall. F measure reaches its best value at 1 (perfect precision
and recall) and worst at 0.
F = 2 / (1/recall + 1/precision)
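As a minimal sketch (assuming scikit-learn and a small set of hypothetical binary labels), the confusion matrix, precision, recall, and F measure can be computed as follows:

```python
# Minimal sketch (hypothetical binary labels) of the confusion matrix,
# precision, recall, and the F measure F = 2 / (1/recall + 1/precision).
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN)
f = 2 / (1 / recall + 1 / precision)           # harmonic mean of precision and recall

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision, "recall:", recall)
print("F measure:", f, "== f1_score:", f1_score(y_true, y_pred))
```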
When the problem is, for example, to predict the value of a stock in the coming
days given the past history of the company and the market, it can be treated as a
regression task. RMSE and quantiles of errors are the major evaluation metrics
for regression. Quantile plots are used to show a univariate data distribution.
RMSE: RMSE is defined as the square root of the average squared distance
between the actual scores and the predicted scores:

RMSE = sqrt( Σ_i (y_i − ŷ_i)^2 / n )

where y_i is the true score for the ith data point, ŷ_i is the predicted value, and n
is the number of data points.
RMSE measures the standard deviation of the predictions from the ground
truth. The RMSE is one way to measure the performance of a classifier; the
error rate (the number of misclassifications) is another. It is not recommended
to use RMSE as the sole means of understanding how well a classifier is
performing. RMSE indicates how far the model is from giving the right answer.
So, for a binary classifier, the square root of the mean, over all instances, of an
indicator of whether the classifier has gone wrong is a number on the scale 0 to 1,
indicating how well (closer to 0) or how badly (closer to 1) the classifier is
performing. RMSE is mostly used for regression problems; for classification,
classification accuracy is a more appropriate measure.
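A minimal sketch of the RMSE computation, using made-up true and predicted scores:

```python
# Minimal sketch of RMSE = sqrt( sum_i (y_i - yhat_i)^2 / n ) on invented scores.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", rmse)
```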
Quantiles of errors: RMSE comes with a disadvantage, as it takes the mean over
all the data points. If the dataset contains an outlier, that outlier will have a
major impact on the average value. The effect of large outliers during evaluation
can be reduced by using a robust metric called quantiles of errors, such as the
Median Absolute Percentage Error:

MAPE = median_i ( | (y_i − ŷ_i) / y_i | )
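A short sketch (again with made-up scores, the last prediction being a deliberate outlier) showing how the median absolute percentage error stays stable while RMSE is dominated by the outlier:

```python
# Sketch of the median absolute percentage error, which is robust to the
# outlier (100.0 vs 8.0) that would dominate RMSE. Data are hypothetical.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 8.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 100.0])   # last prediction is an outlier

mape = np.median(np.abs((y_true - y_pred) / y_true))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("median APE:", mape, "RMSE:", rmse)
```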
[Figure: k-fold cross-validation — in experiment i, fold i is held out as test data and the remaining folds are used for training; five experiments are shown.]

Error estimate: E = (1/k) Σ_{i=1}^{k} E_i
For example, if 1000 product reviews are collected, these are split into
k ¼ 10-folds. Therefore, each fold ¼ 100 reviews.
First fold ¼ review no. 1 to review no. 100, second fold ¼ review no. 101 to
review no. 200,
Third fold ¼ review no. 201 to review no. 300, and so on.
Therefore, in the first experiment, the first fold is used as test data and the
other nine folds are used for training: the first hundred reviews are fed as input
to the system, while the remaining 900 reviews are stored in the system database
with sentiment labels (+1, −1) assigned to them. Such experiments are conducted
10 times, selecting a different fold as test data in every experiment, and the
error estimates from each experiment are averaged, as sketched below.
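A rough sketch of this 10-fold procedure (assuming scikit-learn; the review features, labels, and the placeholder majority-vote learner are all invented for illustration):

```python
# Sketch of 10-fold cross-validation over 1000 hypothetical product reviews,
# averaging the per-fold error estimates: E = (1/k) * sum_i E_i.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # stand-in review features
y = rng.integers(0, 2, size=1000)        # stand-in sentiment labels

def train_and_error(X_tr, y_tr, X_te, y_te):
    # Placeholder learner: predict the majority sentiment of the training folds.
    majority = int(np.mean(y_tr) >= 0.5)
    return np.mean(np.full(y_te.shape, majority) != y_te)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    errors.append(train_and_error(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))

print("per-fold errors:", np.round(errors, 3))
print("cross-validated error estimate E:", np.mean(errors))
```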
The simple validation-set approach (a single split into training and validation
data) has two disadvantages compared with cross-validation. First, the validation
set error rate can be highly variable. Second, only a subset of the observations
(those in the training set) is used to fit the model. As machine learning methods
tend to perform worse when trained on fewer observations, the validation set
error rate may tend to overestimate the test error for the model fit on the
entire dataset.
5 REGRESSION ALGORITHMS
Machine learning algorithms can also be divided into parametric learning models
and nonparametric learning models. Algorithms that make strong assumptions
in the learning process and that simplify the function to a known form are
known as parametric machine learning algorithms. Linear regression and
logistic regression are examples of parametric machine learning algo-
rithms. Regression algorithms deal with modeling the relationship between
variables that are refined iteratively using a measure of error in the predictions
made by the model. Linear regression is an approach to model the relationship
between a scalar-dependent variable y and one or more explanatory variables
(or independent variables) denoted x. Linear and logistic regressions are the
major algorithms in predictive modeling.
Linear regression is a popular way of analyzing data described in a model
which is linear in nature. It is a process of finding the optimal fitting straight
line through the given data points. A mathematical representation
relates the response to the predictor variables. Linear regression is an
attempt to model the relationship between two variables by fitting a linear
equation to observed data, where one variable is considered to be an explan-
atory variable and the other as a dependent variable. For example, statisti-
cian may want to relate the weights of individuals to their heights using a
linear regression model.
A simple linear regression relates two variables (x and y) with a straight-line
equation, while a nonlinear regression fits a curve, as if every value of y is a
random variable. The objective of the model is to make the sum of squared
errors as small as possible. Linear regression is easier to use and interpret;
however, if a good fit with linear regression is not possible, then nonlinear
regression may be considered.
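A minimal sketch of fitting such a line by least squares (the height/weight numbers are invented for illustration):

```python
# Minimal sketch (invented height/weight data) of fitting a simple linear
# regression y = w*x + b by least squares, as described above.
import numpy as np

heights = np.array([150, 155, 160, 165, 170, 175, 180, 185], dtype=float)  # cm
weights = np.array([52, 55, 59, 62, 66, 70, 74, 79], dtype=float)          # kg

w, b = np.polyfit(heights, weights, deg=1)       # minimizes the sum of squared errors
predicted = w * 172 + b                          # plug a new height into the fitted line
print(f"weight ~= {w:.2f} * height + {b:.2f}; predicted weight at 172 cm: {predicted:.1f} kg")
```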
6 CLASSIFICATION ALGORITHMS
As discussed earlier, classification is a supervised mode of building a model.
Algorithms that do not make any strong assumptions or hypotheses about the
form of the mapping function are called nonparametric machine learning
algorithms. Nonparametric methods are usually more flexible and achieve
better accuracy, but they require large amounts of data and training time. SVMs,
neural networks, and decision trees are examples of nonparametric algorithms.
[Figure: example of linear regression — mobility plotted against density with the fitted line, and residuals plotted against fitted values.]

[Fig. 6: Example of logistic regression — Prob(y = 1) plotted against x, rising from 0 to 1 in an S-shaped curve.]
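A small sketch of logistic regression producing Prob(y = 1), as in Fig. 6 (assuming scikit-learn; the one-dimensional data are synthetic):

```python
# Sketch (synthetic one-dimensional data) of logistic regression estimating
# Prob(y = 1) for each value of x.
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(0, 11).reshape(-1, 1)
y = (x.ravel() >= 5).astype(int)          # class flips from 0 to 1 around x = 5

model = LogisticRegression().fit(x, y)
probs = model.predict_proba(x)[:, 1]      # estimated Prob(y = 1) for each x
for xi, p in zip(x.ravel(), probs):
    print(f"x = {xi:2d}  Prob(y = 1) = {p:.2f}")
```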
Drawing a decision tree from the available dataset involves choosing a splitting
attribute at each node. Each branch carries one possible value of the
corresponding attribute. The splitting attribute is the most informative attribute
among all the attributes. To select the most informative attribute, an algorithm
uses a quantity called entropy. The goodness of a split is determined by
information gain; the attribute with the maximum information gain is chosen for
the split, and the dataset is then split on the values of that attribute. Fig. 8
indicates a sample decision tree built from the dataset.
[Fig. 8: Sample decision tree — the root splits on Type (Car, SUV, Minivan); the Car branch splits on Doors (2, 4) and the SUV branch splits on Tires (Blackwall, Whitewall), with + and − class labels at the leaves.]
6.1.2 Entropy
The partitioning of the dataset into subsets should produce subsets whose
instances are homogeneous. Iterative Dichotomiser 3 (ID3) and C4.5 (the
successor of ID3) are specific decision tree algorithms that use entropy as an
attribute selection method. Entropy measures the homogeneity of the samples
and characterizes the purity of a dataset; it can also be interpreted as the amount
of information contained in an attribute. Fig. 9 represents the form of the entropy
function as the probability varies between 0 and 1.
[Fig. 9: The entropy function — Entropy(S) plotted against the probability of the positive class, from 0 to 1, peaking at 1 when that probability is 0.5.]
[Fig. 10: Branching on an attribute A1 — the parent set [29+, 35−] splits into [21+, 5−] and [8+, 30−].]

Entropy([29+, 35−]) = −(29/64) log2(29/64) − (35/64) log2(35/64) = 0.994
Entropy([21+, 5−]) = −(21/26) log2(21/26) − (5/26) log2(5/26) = 0.706
Entropy([8+, 30−]) = −(8/38) log2(8/38) − (30/38) log2(30/38) = 0.742

Gain(Sample dataset S, Attribute A1) = 0.994 − (26/64) · Entropy([21+, 5−]) − (38/64) · Entropy([8+, 30−]) = 0.266
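The arithmetic above can be checked with a few lines of Python; this is a sketch of the entropy and information-gain computation, not code from the chapter:

```python
# Sketch reproducing the entropy and information-gain arithmetic above
# for branching the sample set [29+, 35-] on attribute A1.
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

parent = entropy(29, 35)                        # 0.994
left, right = entropy(21, 5), entropy(8, 30)    # 0.706 and 0.742
gain = parent - (26 / 64) * left - (38 / 64) * right
print(f"Entropy([29+,35-]) = {parent:.3f}")
print(f"Gain(S, A1) = {gain:.3f}")              # 0.266
```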
For the sample dataset S : [5+, 9−] with entropy E = 0.940, branching on each candidate attribute gives the following information gains (Fig. 12 shows the branching on the number of doors):

Gain(S, Color) = 0.029 (values: Red, Green, Blue)
Gain(S, Doors) = 0.152 (values: 2, 4)
Gain(S, Type) = 0.200 (values: Car, SUV, Minivan)
Gain(S, Tires) = 0.048 (values: Blackwall, Whitewall)

The best attribute among Color, Type, Doors, and Tires is therefore Type, with a gain of 0.200.
Step 2: Assuming there are m classes, C1, C2, …, Cm, given a tuple O, the
classifier will predict that O belongs to the class having the highest posterior
probability conditioned on O. That is, the naive Bayesian classifier predicts
that tuple O belongs to the class Ci if and only if P(Ci | O) > P(Cj | O)
for 1 ≤ j ≤ m, j ≠ i.
In other words, we maximize P(Ci | O). The class Ci for which P(Ci | O) is
maximized is called the maximum posteriori hypothesis. Applying Bayes' theorem,

P(Ci | O) = P(O | Ci) P(Ci) / P(O)

Step 3: As P(O) is constant for all classes, only P(O | Ci) P(Ci) needs to be
maximized. If the class prior probabilities are not known, the classes are
assumed to be equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would
therefore maximize only P(O | Ci); otherwise, we maximize P(O | Ci) P(Ci). The
class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D|
is the number of training tuples of class Ci in D.
Step 4: Given a dataset with many attributes, it would be computationally
expensive to compute P(O | Ci) directly. To reduce this computation, the naive
assumption of class-conditional independence is made. This presumes that the
attribute values are conditionally independent of one another, given the class
label of the object (that is, there are no dependence relationships among the
attributes). Thus,

P(O | Ci) = P(o1 | Ci) × P(o2 | Ci) × ⋯ × P(on | Ci)

Step 5: To predict the class label of O, P(O | Ci) P(Ci) is evaluated for each
class Ci. The classifier predicts that the class label of tuple O is Ci if and
only if

P(O | Ci) P(Ci) > P(O | Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.

The probabilities P(o1 | Ci), P(o2 | Ci), …, P(on | Ci) can be estimated from
the training tuples.
For example, a fruit may be considered an apple if it is red, round, and
approximately 10 cm in diameter. A naive Bayes classifier considers each of
these features to contribute independently to the probability that the given fruit
is an apple, irrespective of any possible correlations between the color, shape,
and diameter.
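A minimal sketch of the posterior computation in steps 2-5, with invented fruit counts and simple Laplace smoothing (the counts and the smoothing are assumptions made for illustration):

```python
# Minimal sketch (invented fruit counts) of the naive Bayes computation
# P(Ci | O) proportional to P(Ci) * product_k P(o_k | Ci) described in steps 2-5.
training_counts = {
    # class: class count "n" and per-attribute-value counts
    "apple":  {"n": 8, "color=red": 6, "shape=round": 7, "size=10cm": 5},
    "cherry": {"n": 4, "color=red": 4, "shape=round": 4, "size=10cm": 0},
}
total = sum(c["n"] for c in training_counts.values())
observation = ["color=red", "shape=round", "size=10cm"]

scores = {}
for label, counts in training_counts.items():
    score = counts["n"] / total                           # prior P(Ci)
    for attr in observation:                              # conditional independence assumption
        score *= (counts[attr] + 1) / (counts["n"] + 2)   # Laplace-smoothed P(o_k | Ci)
    scores[label] = score

print(scores)
print("predicted class:", max(scores, key=scores.get))
```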
Given a training set of samples, each labeled with one of two labels, the SVM
training algorithm builds a model capable of assigning any new sample to the
label to which it belongs. An SVM model projects the samples of every label
into a vector space. SVM then tries to separate the projected points such that
there is a maximum distance between them. When a new sample is given to
the model, it is projected into the vector space, and the class/label to which
it belongs is predicted based upon which side of the separating line it falls.
decision surface is used to separate the classes and to maximize the margin
between the classes. This decision surface is known as optimal hyperplane
or just the hyperplane. The projected data points that lie closest to the
hyperplane, and that are involved in deciding where it is placed, are known as
support vectors. These support vectors are simply the coordinates of those data
points. The following section explains different types of SVMs.
[Fig. 19: Hard margin classifier — a separating hyperplane w·x + b = 0 in the X1–X2 plane, with the support vectors lying on the margins.]

[Fig. 20: Soft margin classifier — the hyperplane w·x + b = 0 with some points allowed to fall within or beyond the margin.]
[Fig. 21: Projecting linearly inseparable data into a higher dimension — points along x become separable after adding an x² dimension.]
[Figure: the example data points before and after projection into the higher-dimensional space.]
The example of the above data is projected as in Fig. 23 and two support
vectors are clearly identified.
The vectors are now used to solve for a1 and a2.
a1 Φ(V1) · Φ(V1) + a2 Φ(V2) · Φ(V1) = −1
a1 Φ(V1) · Φ(V2) + a2 Φ(V2) · Φ(V2) = +1

Computing the dot products gives:

3a1 + 5a2 = −1
5a1 + 9a2 = +1

Solving these equations gives the coefficients a1 = −7 and a2 = 4, and the
hyperplane equation y = w·x + b with w = (1, 1) and b = −3 (b is the bias).
To classify the point (4, 5) into its class:

f(4, 5) = −7 Φ(1, 1) · Φ(4, 5) + 4 Φ(2, 2) · Φ(4, 5)
        = −7 (1, 1, 1) · (0, 1, 1) + 4 (2, 2, 1) · (0, 1, 1) = −2

As the value is negative, the point belongs to class −1.
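The small linear system above can be verified numerically; this sketch assumes the two support vectors are represented in an augmented form (x1, x2, 1), which reproduces the dot products 3, 5, and 9 used above:

```python
# Sketch verifying the system 3*a1 + 5*a2 = -1 and 5*a1 + 9*a2 = +1, which
# gives a1 = -7, a2 = 4; w and b then follow as w_aug = a1*Phi(V1) + a2*Phi(V2)
# under the assumed augmented (x1, x2, 1) representation of the support vectors.
import numpy as np

A = np.array([[3.0, 5.0], [5.0, 9.0]])
rhs = np.array([-1.0, 1.0])
a1, a2 = np.linalg.solve(A, rhs)
print("a1, a2 =", a1, a2)                       # -7.0, 4.0

phi_v1 = np.array([1.0, 1.0, 1.0])              # assumed augmented support vectors
phi_v2 = np.array([2.0, 2.0, 1.0])
w_aug = a1 * phi_v1 + a2 * phi_v2               # components (w1, w2, b)
print("w =", w_aug[:2], "b =", w_aug[2])        # w = (1, 1), b = -3
```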
Similarly, it is not always possible to solve the nonlinearity problem within the
same number of dimensions. In such cases it may be required to project the data
into a different space using an equation such as:

f2(x1, x2) = ( x1, x2, (x1^2 + x2^2 − 5) / 3 )
7 CLUSTERING ALGORITHMS
Clustering is an unsupervised learning mode, to draw inferences from datasets
consisting of input data without class label or target values, or for exploratory
data analysis to find hidden patterns. It groups data instances that are similar
to each other in a cluster and data instances that are very dissimilar from each
other into different clusters. Cluster analysis is about forming clusters, or
organizing the data, such that the intracluster distances are small and the
intercluster distances are large.
In recent years, owing to the rapid increase in and accessibility of web resources
and documents, text clustering has become a significant area of research.
Applications of cluster analysis could be gene sequence study, market research,
or object recognition in an image, medicine, psychology, sociology, marketing,
biology, insurance, archeology, libraries, etc. For example, if a mobile cell
phone company wants to optimize the locations where they have to build the
towers, machine learning is used to recommend and approximate the number
of users relying on their towers. As a phone can connect to only one tower at
a time, the team uses clustering algorithms to design the best placement of cell
towers to optimize signal reception for groups, or clusters, of their customers.
Popular algorithms to prepare clusters include k-means and k-medoids,
hierarchical clustering, hidden Markov models, self-organizing maps, and
fuzzy c-means clustering. A diagrammatic representation of sample cluster-
ing is seen in Fig. 24.
In a partitional clustering, a set of data objects is divided into nonoverlapping
subsets (clusters) such that each data object is in exactly one subset; k-means
clustering is a partitional clustering algorithm. Hierarchical clustering produces
a set of nested clusters that are organized in the form of a tree. Divisive
and agglomerative clustering are the two types of hierarchical clustering; a
minimal partitional example follows.
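A minimal partitional-clustering sketch with k-means (assuming scikit-learn; the 2-D points are synthetic):

```python
# Sketch (synthetic 2-D points) of partitional clustering with k-means,
# where each point is assigned to exactly one cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),   # three synthetic groups
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:\n", kmeans.cluster_centers_)
```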
[Fig. 24: Clustering discovers patterns in the data.]

[Fig. 26: Hierarchical clustering — points X1–X5 are successively merged (agglomerative) from t = 0 to t = 4; reading the tree in the opposite direction gives divisive clustering.]
8 APPLICATIONS
Machine learning techniques are being applied in a wide range of applications
in order to solve a number of fascinating problems.
1. Contextualized experience goes beyond simple personalization, such as
knowing where the user is or what they are doing at a certain point in time.
An abundance of available data enables improved features and better
machine learning models to be created, generating higher levels of perfor-
mance and predictability, which ultimately leads to an improved user
experience.
2. With the rapid increase in devices and applications connected to the Inter-
net of Things, the sheer volume of data being generated will continue to
grow at an incredible rate. It is simply not possible for humans to analyze
and understand such quantities of data manually. Machine learning is help-
ing to aggregate all of this data from countless sources and touch points to
deliver powerful insights, spot actionable trends, and uncover user behav-
ior patterns.
3. Retail buyers are being fed live inventory updates and in many cases
enabling the autoreplenishment of stock as historical data predict the
future stock-level requirements and sales patterns.
4. Healthcare providers are receiving live updates from patients connected to
a variety of devices and again, through machine learning of historical data,
are predicting potential issues and making key decisions that are helping
save lives.
5. Financial service providers are pin pointing potential instances of fraud,
evaluating credit worthiness of applicants, generating sales and marketing
campaigns, and performing risk analysis, all with the help of machine
learning and AI-powered software and interfaces.
6. Statistical techniques can automatically create segments and groups that
make sense of the data.
7. Data analysts use machine learning techniques to identify the key influen-
cers on profitability.
9 CONCLUSION
Today, machine learning has progressed from research to mainstream and is a
motivational drive in an era of innovation. Today, industries need to think about
how machine learning can help them create a competitive advantage.
Some may use data to spot trends in the performance of their employees or of
their products in the market; this will help them predict and prepare for future
policies and outcomes. Others may use learning models to personalize their inventory,
creating a better user experience and promoting an increased level of involve-
ment with their existing and potential customers.
As the level of accessible data continues to grow, and the cost of storing
and maintaining it continues to drop day by day, more and more machine
learning solutions hosting pretrained models-as-a-service are making it easier
and more affordable for organizations to take advantage of them. From a development
point of view, this is enabling the quick movement of application prototypes
into production, which is exponentially increasing the growth of new applica-
tions and startups that are now entering and disrupting most markets and
industries out there.
FURTHER READING
Daelemans, W., Hoste, V., De Meulder, F., Naudts, B., 2003. Combined optimization
of feature selection and algorithm parameters in machine learning of language. In: European
Conference on Machine Learning, ECML 2003. LNCS, pp. 84–95.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J., 2008. LIBLINEAR: a library for
large linear classification. J. Mach. Learn. Res. 9, 1871–1874.
Ferri, C., Hernández-Orallo, J., Modroiu, R., 2009. An experimental comparison of performance
measures for classification. Pattern Recognit. Lett. 30, 27–38.
Han, J., Kamber, M., Pei, J., 2011. Data Mining: Concepts and Techniques, third ed. Elsevier,
ISBN 978-0-12-381479-1.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural net-
works. Science 313, 504–507.
Hotta, H., Kittaka, M., Hagiwara, M., 2010. Word vectorization using relations among words for
neural network. IEEJ Trans. Electron. Inf. Syst. 30, 75–82.
Hummel, J.E., Holyoak, K.J., 2003. A symbolic-connectionist theory of relational inference and
generalization. Psychol. Rev. 110 (2), 220.
Hummel, J.E., Holyoak, K.J., 2005. Relational reasoning in a neurally plausible cognitive archi-
tecture: an overview of the LISA project. Curr. Dir. Psychol. Sci. 14 (3), 153–157.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning:
With Applications in R. Springer, ISSN 1431-875X, ISBN 978-1-4614-7137-0, ISBN
978-1-4614-7138-7 (eBook). https://doi.org/10.1007/978-1-4614-7138-7.
Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J., 2006. Construction and use of linear regression
models for processor performance analysis. The Twelfth International Symposium on High-
Performance Computer Architecture. IEEE, pp. 99–108.
Li, L., Zhang, X., 2010. Study of data mining algorithm based on decision tree. In: IEEE International
Conference on Computer Design and Applications, Qinhuangdao, China, vol. 1, pp. V1–155.
Markert, H., Kaufmann, U., Kara Kayikci, Z., Palm, G., 2009. Neural associative memories for
the integration of language, vision and action in an autonomous agent. Neural Netw.
22 (2), 134–143.
Rousu, J., Saunders, C., Szedmak, S., 2006. Kernel-based learning of hierarchical multilabel clas-
sification models. J. Mach. Learn. Res. 7, 1601–1626.
Russell, S.J., Norving, P., 2010. Ensemble learning. In: Artificial Intelligence: A Modern
Approach, third ed. Pearson Education Inc., Upper Saddle River, NJ, pp. 761–766.
Shetty, J., Shobha, G., 2016. An ensemble of automatic algorithms for forecasting resource uti-
lization in cloud. In: Future Technologies Conference, San Francisco, USA, pp. 301–306.
Sokolova, M., Lapalme, G., 2009. A systematic analysis of performance measures for classifica-
tion tasks. Inf. Process. Manag. 45, 427–437.
Togneri, R., Naseem, I., 2010. Linear regression for face recognition. IEEE Trans. Pattern Anal.
Mach. Intell. 32 (11).
Mitchell, T.M., 1997. Machine Learning. McGraw-Hill, ISBN 0070428077.
Tsatsaronis, G., Varlamis, I., Vazirgiannis, M., 2008. Word sense disambiguation with semantic
networks. In: Text, Speech and Dialogue. Springer, pp. 219–226.
Tsatsaronis, G., Varlamis, I., Vazirgiannis, M., 2010. Text relatedness based on a word thesaurus.
J. Artif. Intell. Res. 37 (1), 1–40.
Yang, H., Xu, Z., Zhang, J., Cai, J., 2010. A constructing method of decision tree and classi-
fication rule extraction for incomplete information system. In: International Conference on
Computational Aspects of Social Networks, Taiyuan, pp. 49–52.
Zheng, A., 2015. Evaluating Machine Learning Models, a Beginner’s Guide to Key Concepts and
Pitfalls, first ed. O’Reilly Publications.