Chapter-4 - Intro To Machine Learning
[Figure: The five steps of machine learning: 1) collection of data, 2) preparation of data, 3) train model, 4) model evaluation, 5) prediction.]
• Collection of Data
The primary step of machine learning is the collection of data from the real-time domain in which the problem occurs. The collected data should be reliable and relevant so as to improve its quality [12].
• Preparation of Data
The first step in the preparation of data is data cleaning, which makes the data ready for analysis. Unwanted and error-prone data points are removed from the data set, all data is converted into a standard format, and the data is then partitioned into two parts, one for training and the other for performance evaluation [13].
• Model Training
The part of the dataset reserved for training is used for output value prediction. In the first iteration the predicted output may differ considerably from the expected desired value [14]. The epochs, or iterations, are repeated while adjusting the initial values, so that the prediction accuracy on the training data increases incrementally.
• Model Evaluation
The remaining data, which was not used for training the model, is used for performance evaluation [15]. Testing the model against this held-out data estimates how well the data model will provide effective solutions for real-time problems.
• Prediction
After training and evaluation of the data model are complete, the model is deployed in real-time environments and its accuracy is improved by parameter tuning. Once deployed, the model needs to learn from new data and predict the correct output to answer new questions.
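The sketch below strings these five steps together into one minimal scikit-learn pipeline; the synthetic data stands in for data collected from a real-time domain and is only an illustrative assumption, not the book's example.

# A minimal sketch of the five steps above with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Collection of data (synthetic stand-in for real collected data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 2. Preparation of data: partition into training and evaluation parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Model training
model = LogisticRegression().fit(X_train, y_train)

# 4. Model evaluation on the held-out data
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Prediction on a new, unseen record
print("prediction:", model.predict([[0.5, 1.2]]))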
Machine learning problems broadly fall into three types:
• Regression
• Classification
• Clustering
1) Regression
Regression is used when the output variable lies in a continuous space. It follows the curve-fitting methodology of mathematics: it fits the data to the equation of a curve and predicts the output value. Linear regression, neural networks and the perceptron are popular implementations of regression mechanisms. Many financial institutions, such as stock markets, try to predict the growth of the investments made by shareholders, and rental brokers use prediction of house prices in a given location to manage their real estate business.
2) Classification
Classification is the process of handling output variables which are discrete, and is meant for identifying the categories of data. Most classification algorithms process the data and divide it into categories; it is like finding different categories of curves that fit the data points. Labeling emails as spam in Gmail is one example of a classification problem, where different factors of an email are checked and the email is categorized as spam when some 80%-90% of the anomalies match. Naïve Bayes, K-Nearest Neighbors, support vector machines, neural networks and logistic regression are popular examples of classification algorithms.
3) Clustering
Grouping unlabeled data with similar features leads to the mechanism of clustering. Similarity functions are used to group data points with similar characteristics, while dissimilar features separate the different clusters; in this way unique patterns can be identified among data sets which are not labeled. K-means and agglomerative clustering are popular examples, and customer purchases can be categorized using clustering techniques.
Supervised learning models: Regression and Classification.
Unsupervised learning models: Clustering.
• Fraud Detection
Banking sectors implement machine learning algorithms to detect fraudulent transactions and ensure customer safety. Popular machine learning algorithms are used to train the system to identify transactions with suspicious features, so that fraudulent transaction patterns are detected in no time while the authorized customer performs normal transactions. Thus the huge amount of daily transactional data is used to train the machine learning model to detect frauds in time and provide customer safety in online banking services.
• Speech Recognition
Popular assistant implementations like Alexa, Siri and the Google Assistant work on many machine learning techniques for speech recognition.
Machine learning algorithms learn a mapping function from the input variables (X) to the output variable (Y):
Y = f(X)
1. Linear Regression
Most machine learning algorithms quantify the relationship between the input variable (x) and the output variable (y) with a specific function. In linear regression the equation y = f(x) = a + bx is used for establishing the relationship between x and y, where a and b are the coefficients that need to be estimated: 'a' represents the intercept and 'b' the slope of the straight line. Fig 4.1 shows the plotted values of random points (x, y) of a particular data set. The major objective is to construct a straight line which is nearest to all the random points; the error value is computed for each point against its y value.
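As an illustrative sketch (not the book's own example), the snippet below fits y = a + bx to synthetic random points using the least-squares estimates of the intercept and slope.

# Fitting a straight line y = a + b*x to synthetic points.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=50)   # true intercept a = 2.0, slope b = 1.5

# Least-squares estimates of the slope (b) and intercept (a)
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

y_pred = a + b * x
error = y - y_pred          # error of each point against its y value
print(f"intercept a = {a:.2f}, slope b = {b:.2f}, mean squared error = {np.mean(error**2):.3f}")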
2. Logistic Regression
Predictions made by linear regression are on continuous data, such as rainfall in cm for a given location, whereas predictions made by logistic regression are on discrete data, such as the number of students who passed or failed a given exam, by applying a transformation function.
Logistic regression is used for binary classification, where the data sets have y denoted by one of two classes, 0 or 1. Most event predictions have only two possibilities: either the event occurs, denoted by 1, or it does not, denoted by 0. For example, a patient predicted as sick is labeled 1 and otherwise 0 in the given data set.
The transformation function used for logistic regression is h(x) = 1/(1 + e^(-x)), which represents an S-shaped curve.
The output of logistic regression is a probability, and its value always ranges from 0 to 1. If the probability that a patient is sick is 0.98, the output is assigned to class 1. Thus the output value is generated by log-transforming the x value with the function h(x) = 1/(1 + e^(-x)), and binary classification is realized by applying a threshold to these probabilities.
[Figure: The logistic function h(x) = 1/(1 + e^(-x)), an S-shaped curve with h(x) rising from 0 to 1 as x goes from -8 to 8.]
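A minimal sketch of this transformation function and the thresholding step is shown below; the patient score of 4.0 is an invented value used only to reproduce the 0.98 example.

# The logistic (sigmoid) function and a 0/1 decision by threshold.
import numpy as np

def h(x):
    """Logistic function; its output always lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

# Example: a patient score of 4.0 gives probability ~0.98, so assign class 1 (sick)
score = 4.0
probability = h(score)
predicted_class = 1 if probability >= 0.5 else 0   # 0.5 is the usual threshold
print(f"probability = {probability:.2f}, class = {predicted_class}")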
3. CART
Classification and Regression Trees (CART) are one implementation of decision trees.
A classification and regression tree contains non-terminal (internal) nodes and terminal (leaf) nodes. One of the internal nodes acts as the root node, and all non-terminal nodes act as decision-making nodes on an input variable (x), splitting the node into two branches; this branching stops at the leaf nodes, which give the output variable (y). Thus these trees act as a sequence of decision rules that map the input variables to an output value.
[Figure: A decision tree with a root node (e.g. "Over 30 yrs.?"), internal nodes branching on Yes/No decisions, and leaf nodes giving the output.]
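The following sketch builds such a tree with scikit-learn's DecisionTreeClassifier; the age values and Yes/No labels are invented to mirror the "Over 30 yrs." split in the figure, not taken from the text.

# A minimal CART example: one input variable (age), binary output (No/Yes).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25], [32], [45], [19], [52], [28]]   # input variable x: age in years
y = [0, 1, 1, 0, 1, 0]                     # output variable y: 0 = No, 1 = Yes

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# The learned rules: the root node tests the age threshold, the leaves give y
print(export_text(tree, feature_names=["age"]))
print(tree.predict([[40]]))                # a 40-year-old falls in the "Yes" leaf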
4. Naïve Bayes
Bayes' theorem gives the probability of an event occurring in real time. The probability in Bayes' theorem is computed for a given hypothesis (h) from prior knowledge (d),
where:
• Pr(h|d) represents the posterior probability: the probability that hypothesis h is true given the data d, computed as
Pr(h|d) = [Pr(d1|h) Pr(d2|h) ... Pr(dn|h) Pr(h)] / Pr(d)
This algorithm is called 'naive' because it assumes that all the variables are independent of each other, which is a naive assumption to make in real-world examples.
Using the data in Table 4.1 above, what is the outcome if weather = 'sunny'?
To determine whether the outcome is play = 'yes' or play = 'no' given the value weather = 'sunny', calculate Pr(yes|sunny) and Pr(no|sunny) and choose the outcome with the higher probability.
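Since Table 4.1 is not reproduced here, the sketch below uses an assumed toy weather/play table to show how Pr(yes|sunny) and Pr(no|sunny) are compared with Bayes' rule; the counts are illustrative, not the book's data.

# Comparing Pr(yes|sunny) and Pr(no|sunny) from counts in a toy table.
weather = ["sunny", "sunny", "overcast", "rainy", "sunny", "rainy", "overcast"]
play    = ["no",    "no",    "yes",      "yes",   "yes",   "no",    "yes"]

def posterior(outcome, condition):
    """Pr(play = outcome | weather = condition) via Bayes' rule with counts."""
    n = len(play)
    p_outcome = play.count(outcome) / n                                   # Pr(h)
    p_cond_given_outcome = sum(1 for w, p in zip(weather, play)
                               if w == condition and p == outcome) / play.count(outcome)  # Pr(d|h)
    p_cond = weather.count(condition) / n                                 # Pr(d)
    return p_cond_given_outcome * p_outcome / p_cond                      # Pr(h|d)

print("Pr(yes|sunny) =", round(posterior("yes", "sunny"), 2))
print("Pr(no|sunny)  =", round(posterior("no", "sunny"), 2))
# Choose the outcome with the higher posterior probability.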
5. KNN
The K-Nearest Neighbors algorithm uses the entire data set as the training set.
The KNN algorithm works through the entire data set to find the k instances that are nearest (most similar) to the new record, then outputs their mean for a regression problem or their mode for a classification problem, with the value of k specified beforehand. Similarity is computed using measures such as Euclidean distance or Hamming distance.
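A small sketch of this procedure for a classification problem (Euclidean distance, mode of the k nearest labels) on invented training records:

# k-nearest neighbours: keep the whole data set, classify by the mode of the k closest labels.
from collections import Counter
import math

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([4.0, 4.2], "B"), ([3.8, 4.0], "B")]

def knn_predict(query, k=3):
    # distance of the query to every training record (KNN works through the entire data set)
    nearest = sorted(train, key=lambda rec: math.dist(query, rec[0]))
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]   # mode for classification

print(knn_predict([1.1, 0.9]))   # -> "A"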
6. Apriori

Rule: X -> Y

Support = frq(X, Y) / N

Confidence = frq(X, Y) / frq(X)

Lift = Support / (Supp(X) × Supp(Y))

Fig 4.4 Rule defining for support, confidence and lift formulae.
The Apriori algorithm is a good example for identifying the products which are purchased most frequently in combination from the available database of customer purchases. The association rule has the form X -> Y: if a customer purchases X, then he is likely to also purchase item Y.
Example: the association rule stating that a customer who purchases milk and sugar will also buy coffee powder can be given as {milk, sugar} -> coffee powder. Such association rules are generated whenever the support and confidence cross a threshold.
Fig 4.4 provides the support, confidence and lift formulae specified for X and Y. The support measure helps in pruning the number of candidate item sets when generating frequent item sets, as specified by the Apriori principle. The Apriori principle states that if an item set is frequent, then all of its subsets must also be frequent.
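The following sketch computes the support, confidence and lift of the rule {milk, sugar} -> coffee powder on a small invented list of transactions, following the formulae in Fig 4.4.

# Support, confidence and lift for the rule X -> Y on toy transactions.
transactions = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar", "coffee powder"},
    {"milk", "bread"},
    {"sugar", "coffee powder"},
    {"milk", "sugar"},
]
N = len(transactions)

X, Y = {"milk", "sugar"}, {"coffee powder"}
frq_XY = sum(1 for t in transactions if X <= t and Y <= t)   # transactions containing X and Y
frq_X = sum(1 for t in transactions if X <= t)
frq_Y = sum(1 for t in transactions if Y <= t)

support = frq_XY / N                                  # Support(X -> Y)
confidence = frq_XY / frq_X                           # Confidence(X -> Y)
lift = support / ((frq_X / N) * (frq_Y / N))          # Support / (Supp(X) * Supp(Y))
print(f"support = {support:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")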
7. K-means
The K-means algorithm groups similar data into clusters over multiple iterations. It computes the centroids of the k clusters and assigns each new data point to the cluster whose centroid is closest to it.
Working of the K-means algorithm:
Let us consider the value k = 3. In Fig 4.5 we see three clusters, to which each data point is assigned randomly at first. The centroid is then computed for each cluster; red, blue and green are treated as the centroids of the three clusters. Next, each data point is reassigned to its closest centroid: the top data points are assigned to the blue centroid, and the other nearest data points are grouped to the red and green centroids. The centroids are then computed for the new clusters (the old centroids are turned into gray stars, while the new centroids are shown as red, green and blue stars). Finally, the steps of reassigning data points to the nearest centroid and recomputing the centroids are repeated, with points switching from one cluster to another, until the centroids are the same in two consecutive steps, at which point the algorithm exits.
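A minimal sketch of this procedure with scikit-learn's KMeans on synthetic two-dimensional points (k = 3); the data is invented for illustration.

# k-means with k = 3 on three loose groups of 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("centroids:\n", kmeans.cluster_centers_)        # one centroid per cluster
print("cluster of a new point:", kmeans.predict([[4.8, 5.1]]))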
8. PCA
Principal Component Analysis (PCA) explores and visualizes the data using a smaller number of input variables. The reduction is achieved by re-expressing the data in a new coordinate system whose axes are called "principal components".
Each component is a linear combination of the original variables, and the components are orthogonal to one another. Orthogonality means that the correlation between components is zero, as shown in Fig 4.6.
The first principal component captures the direction in which the data varies the most. The second principal component is then computed from the variance in the data not captured by the first component, and the remaining principal components are constructed in the same way from the residual variance, each uncorrelated with the previous components.
[Fig 4.6: The first two principal components, PC 1 and PC 2, shown as orthogonal axes over the original variables Gene 1 and Gene 2.]
In a random forest, the parameter used for splitting at each node is searched over a random subset of the features, which provides a wide variety of features across split points. Thus bagging in random forests builds each tree from a random sample of records, and each split works with a further random sample of predictors.
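As a brief illustration of these two sources of randomness, the sketch below uses scikit-learn's RandomForestClassifier with bootstrap sampling of records and a random subset of predictors at each split; the data set is synthetic, not from the text.

# Random forest: bootstrap samples of records plus a random feature subset per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-sampled trees
    max_features="sqrt",   # random subset of predictors considered at each split
    bootstrap=True,        # random sample of records for each tree
    random_state=0,
).fit(X, y)

print("training accuracy:", forest.score(X, y))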
[Fig 4.7: Four panels (steps 1-4) showing boosting with decision stumps on the two input variables x1 and x2.]
The weak learners are finally combined in step 4, using the three decision stumps learned from the previous models and applying their three splitting rules.
First, start with one decision tree stump to make a decision on one input variable.
The sizes of the data points show that equal weights have been applied to classify them as a circle or a triangle. The decision stump has generated a horizontal line in the top half to classify these points. Step 1 of Fig 4.7 clearly shows that two circles are incorrectly predicted as triangles; hence we assign higher weights to these two circles and apply another decision stump.
Second, the splitting rule of the next decision stump is made on another input variable. Because the misclassified circles were assigned heavier weights, the second decision stump categorizes them correctly, separated by a vertical line on the left; but three small circles that did not carry the heavier weights are not handled by the current decision stump. Hence we assign higher weights to these three circles at the top and go for another stump.
Third, train another decision tree stump to make a decision on another input variable. The three circles misclassified by the second decision tree stump are given heavier weights, so a vertical line separates them from the rest of the circles and triangles, as shown in the figure.
Fourth, all the decision stumps from the previous models are combined to define a complex rule which classifies the data points correctly, improving on the previous weak learners.
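A brief sketch of this boosting procedure using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a decision stump); the two-variable data set is synthetic and only illustrative.

# Boosting three decision stumps; misclassified points are re-weighted between stumps,
# and the stumps are then combined into one classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

# the default weak learner in AdaBoostClassifier is a depth-1 decision tree (a stump)
boosted = AdaBoostClassifier(n_estimators=3, random_state=1).fit(X, y)

print("training accuracy:", boosted.score(X, y))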
Dimensionality Reduction
In machine learning, many factors are often considered when resolving a classification problem. The factors considered for classification are known as variables or features. The more features are considered, the more difficult it becomes to visualize the training set and to work on it. Many of these features are correlated, so the possibility of redundancy is high. Removing such redundant features from the given training data set is done using dimensionality reduction techniques.
Differences between the ranges of the initial variables matter: variables with a large range will dominate over small differences in variables with a small range, which provides us with a biased result. So transforming the data onto comparable scales is the better choice to prevent this issue.
The formula below can be used to standardize a data variable, by subtracting the mean from the variable value and dividing by its standard deviation:

z = (value − mean) / standard deviation
[Figure: Scree plot of the percentage of explained variance for principal components 1 through 10.]
Last Step: Recast the Data Along the Principal Components Axes
From all the above steps it is clear that after standardization you select principal components and form a new feature vector, but the given input data itself remains the same.
In this last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis). This is done by multiplying the transpose of the feature vector by the transpose of the original data set.
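The sketch below walks through these PCA steps on a synthetic data set: standardize the variables, take the eigenvectors of the covariance matrix as the feature vector, then recast the data along the principal component axes. The data and the choice of keeping two components are assumptions made for illustration.

# PCA by hand: standardize, eigen-decompose the covariance matrix, recast the data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                             [0.0, 1.0, 0.4],
                                             [0.0, 0.0, 0.2]])   # 100 samples, 3 correlated variables

# Standardize: z = (value - mean) / standard deviation
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigenvectors of the covariance matrix, sorted by explained variance
cov = np.cov(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]           # keep the top 2 principal components

# Recast the data: multiply the transposed feature vector by the transposed data
recast = (feature_vector.T @ z.T).T                    # shape (100, 2)
explained = eigenvalues[order] / eigenvalues.sum()
print("percentage of explained variance:", np.round(100 * explained, 1))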
References
1. Mitchell, Tom (1997). Machine Learning. New York: McGraw Hill. ISBN
0-07-042807-7. OCLC 36417892.
2. Hu, J.; Niu, H.; Carrasco, J.; Lennox, B.; Arvin, F., “Voronoi-Based Multi-
Robot Autonomous Exploration in Unknown Environments via Deep
Reinforcement Learning” IEEE Transactions on Vehicular Technology, 2020.
3. Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer,
ISBN 978-0-387-31073-2
4. Machine learning and pattern recognition “can be viewed as two facets of the
same field.”[4]: vii
5. Friedman, Jerome H. (1998). “Data Mining and Statistics: What’s the con-
nection?”. Computing Science and Statistics. 29 (1): 3–9.
6. “What is Machine Learning?”. www.ibm.com. Retrieved 2021-08-15.
7. Zhou, Victor (2019-12-20). “Machine Learning for Beginners: An
Introduction to Neural Networks”. Medium. Retrieved 2021-08-15.
8. Domingos 2015, Chapter 6, Chapter 7.
9. Ethem Alpaydin (2020). Introduction to Machine Learning (Fourth ed.).
MIT. pp. xix, 1–3, 13–18. ISBN 978-0262043793.
10. Samuel, Arthur (1959). “Some Studies in Machine Learning Using the Game
of Checkers”. IBM Journal of Research and Development. 3 (3): 210–229.
CiteSeerX 10.1.1.368.2254. doi:10.1147/rd.33.0210.
11. Prakash K.B. Content extraction studies using total distance algorithm,
2017, Proceedings of the 2016 2nd International Conference on Applied and
Theoretical Computing and Communication Technology, iCATccT 2016,
10.1109/ICATCCT.2016.7912085
12. Prakash K.B. Mining issues in traditional Indian web documents, 2015, Indian Journal of Science and Technology, 8(32), 10.17485/ijst/2015/v8i1/77056
13. Prakash K.B., Rajaraman A., Lakshmi M. Complexities in developing
multilingual on-line courses in the Indian context, 2017, Proceedings
of the 2017 International Conference On Big Data Analytics and
Computational Intelligence, ICBDACI 2017, 8070860, 339-342, 10.1109/
ICBDACI.2017.8070860
14. Prakash K.B., Kumar K.S., Rao S.U.M. Content extraction issues in online web education, 2017, Proceedings of the 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology, iCATccT 2016, 7912086, 680-685, 10.1109/ICATCCT.2016.7912086
15. Prakash K.B., Rajaraman A., Perumal T., Kolla P. Foundations to frontiers of big data analytics, 2016, Proceedings of the 2016 2nd International Conference on Contemporary