A Survey on Machine Learning Approaches and Its Techniques
Thomas Rincy N, Dr. Roopam Gupta
Abstract: With data and information becoming available at a tremendous rate, there is a need for machine learning approaches. Machine learning studies and constructs algorithms that make predictions on data; it builds a model from the inputs in order to make decisions or predictions. Machine learning algorithms assist in bridging the gap of understanding. In this survey we investigate different machine learning approaches and their techniques.

... data. In un-supervised learning the data is not labeled. Semi-supervised learning is a merger of labeled and unlabeled data. In reinforcement learning the software agent gathers information from its interaction with the environment in order to take actions that maximize the reward. Fig. 1 depicts the classification of machine learning systems.
Authorized licensed use limited to: Auckland University of Technology. Downloaded on June 03,2020 at 11:07:10 UTC from IEEE Xplore. Restrictions apply.
[Fig. 1. Classification of machine learning systems. Labels recoverable from the figure: Machine Learning, Neural Network, Linear Discriminant Analysis, Support Vector Machine.]

... minimal set of rules that is consistent with the training data. The main goal is to construct the smallest set of rules that is consistent with the training data. RIPPER [10] is a rule-based algorithm that generates rules through a process of repeated growing and pruning. Genetic algorithms (GAs) [11] are also applied to learn the set of rules. The ultimate aim of a genetic algorithm is to find high-quality chromosomes; the fitness of a chromosome is described by a function known as the fitness function [12].

III.III. Naïve-Bayesian classifier. Naive Bayesian classifiers [13] are probabilistic classifiers based on Bayes' theorem, with a strong (naive) assumption of independence among the features. Bayes' theorem can be stated in mathematical terms:

$$P(X \mid Y) = \frac{P(X)\,P(Y \mid X)}{P(Y)}$$

where X and Y are events, and P(X) and P(Y) are the prior probabilities of X and Y. P(X | Y) is the posterior probability, that is, the probability of observing the event X given that Y is true. P(Y | X) is known as the likelihood, the probability of observing the event Y given that X is true. The advantage of the Naive Bayes classifier is the short computational time required for training.

III.IV. k-Nearest Neighbor classifiers. k-NN [14] is a nonparametric technique used for regression and classification. The input of k-NN consists of the k closest training examples in the feature space, and the output depends on the task.
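The k-NN idea can be illustrated with a minimal sketch (not taken from the survey). It finds the k closest training examples and returns either a majority-vote class or, for regression, the mean of the neighbors' values. The function names, the Euclidean distance, the mean aggregation and the toy data are all illustrative assumptions.

```python
from collections import Counter
import math

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query (Euclidean)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbors (assumed)."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

# Hypothetical toy data: two well-separated classes, and a 1-D regression set.
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
           ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labeled, (1.1, 1.0), k=3))
print(knn_regress(valued, (2.1,), k=3))
```

Sorting the whole training set is the simplest way to find the neighbors; for large datasets a spatial index would normally replace it.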
SCEECS 2020
k-NN may be applied for either regression or classification purposes. In k-NN classification, the output is a class membership: the object is classified by a majority vote of its neighbors, being allocated to the class most common among its k nearest neighbors (k being a positive integer, typically small). When k = 1, the object is simply allocated to the class of its single nearest neighbor. In k-NN regression, the output is the property value for the object, commonly the average of the values of its k nearest neighbors.

III.V. Neural Network. The conceptual model of the neural network was proposed in 1943 by [15]. It consists of different cells; each cell receives data from other cells, processes the inputs, and passes the outputs to other cells. Since then, there has been intensive research to develop artificial neural networks (ANNs). A perceptron [16] is a neural network consisting of a single neuron that can receive more than one input to produce a single output. Perceptrons are used to classify linearly separable classes by finding a hyperplane in the m-dimensional feature space that separates the instances of the two classes. In a Radial Basis Function (RBF) network [17], every hidden unit applies a radial activation function, while each output unit computes a weighted sum of the hidden unit outputs. It is commonly known as a three-layer feedforward network.

III.VI. Linear Discriminant Analysis. A linear classifier [18] consists of a weight vector w and a bias b. Given an instance p, the predicted class label q is obtained according to:

$$q = \operatorname{sign}(w^{T} p + b)$$

With the help of the weight vector w, the instance space is mapped onto a one-dimensional space; afterwards, a point on the line is identified to separate the positive instances from the negative instances. Fisher's linear discriminant analysis [19] is a linear learning algorithm that finds the best w and b for separating the different classes. It brings instances of the same class close together by keeping the variance within each class small, and it keeps instances of distinct classes far apart by making the distance between the centers of distinct classes large.

III.VII. Support Vector Machines. SVMs [20] revolve around the margin on either side of a hyperplane that separates two data classes. To reduce an upper bound on the generalization error, the main idea is to create the largest possible distance between the separating hyperplane and the instances on either side of it. For linearly separable data, the main idea is to find an optimum hyperplane. The data points that lie on the margins of the optimum hyperplane are termed support vector points, and the solution is characterized as a linear combination of these points; all other data points are ignored. The number of features in the training data does not affect the complexity of an SVM. This is the primary reason SVMs are employed for learning tasks that have a large number of features relative to the number of training instances.

Training an SVM is performed by solving an n-dimensional quadratic programming (QP) problem, where n is the number of samples in the training data. Large SVM problems are hard to solve this way, since they involve large quadratic operations and numerical computation that make the algorithm slow in terms of processing time. Sequential Minimal Optimization (SMO) is an algorithm for training SVMs. SMO can solve the SVM quadratic problem without employing additional matrix storage and without invoking a numerical quadratic programming optimization step [21].

IV. UN-SUPERVISED LEARNING
In un-supervised learning the data is not labeled. We have the input variable (P), but there is no output variable. The representation is seen as a model of the data. The aim of un-supervised learning is to discover hidden structures in unlabeled data or to infer a model of the probability density of the input data. This section investigates the basic algorithms used in un-supervised learning.

IV.I. K-Means Clustering. The main idea of the K-means algorithm [22] is to partition the N observations in space into K clusters. Each observation belongs to the cluster with the nearest mean, and this mean serves as the model of the cluster. As a result, the data space is split into Voronoi cells. K-means is an iterative method that starts with a random selection of the k means $v_1, v_2, \ldots, v_k$. In each iteration the data points are grouped into k clusters according to the mean closest to each point, and each mean is then updated according to the points within its cluster. Grouping the data points according to the cluster means and updating the cluster means according to the grouped points continues until there is no change in the cluster means or in the assignment of points. A variant of K-means is termed K-medoids. In K-medoids, instead of taking the mean, the most centrally located data point of the cluster is used as the reference point of the corresponding cluster [23].

IV.II. Gaussian Mixture Model. Gaussian mixtures were popularized by Duda and Hart in their seminal text, Pattern Classification and Scene Analysis, in 1973 [24]. A Gaussian mixture is a function that consists of several Gaussians, each identified by k ∈ {1,…, K}, where K represents the number of clusters in the dataset. In a Gaussian mixture model (GMM) that consists of a mixture of M Gaussian distributions, each Gaussian is characterized by its mean and variance. The weight of each Gaussian is the third parameter associated with each Gaussian distribution in a Gaussian mixture model. When clustering is performed using a Gaussian mixture model, the goal is to find the model parameters.
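To make the three parameter groups concrete, here is a minimal one-dimensional sketch (an illustration, not code from the survey). The mixture density is the weighted sum of the component Gaussians, and the log-likelihood of a dataset is the quantity a fitting procedure would try to increase. The function names, the parameter values and the toy data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a single 1-D Gaussian at x."""
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / norm

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of component densities; weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(math.log(mixture_pdf(x, weights, means, variances))
               for x in data)

# Hypothetical two-component mixture and a small sample.
weights = [0.4, 0.6]
means = [0.0, 5.0]
variances = [1.0, 2.0]
data = [-0.3, 0.2, 4.5, 5.1, 6.0]

good = log_likelihood(data, weights, means, variances)
bad = log_likelihood(data, weights, [10.0, 20.0], variances)
print(good > bad)  # means placed near the data give a higher likelihood
```

Comparing the two log-likelihood values shows why the parameters matter: the same data are far more probable under means placed near the observations than under means placed far from them.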
These parameters, such as the mean and covariance of each distribution and the weights, are chosen so that the resulting model fits the data optimally. In a Gaussian mixture model, the likelihood of the data should be maximized so that the data are optimally fitted. This can be achieved by applying the iterative expectation-maximization (EM) algorithm [25].

IV.III. Hidden Markov model. A Hidden Markov Model (HMM) [26] is a parameterized distribution for sequences of observations. Basically, an HMM is a Markov process that is divided into two components, called the observable component and the unobservable or hidden component. That is, a hidden Markov model is a Markov process $(Z_k, Y_k)_{k \ge 0}$ on the state space $C \times D$, where we presume that we have a means of observing $Y_k$ but not $Z_k$. $Z_k$ is called the signal process and C the signal state space, while the observed component $Y_k$ is called the observation process and D the observation state space.

An HMM is sometimes called a doubly stochastic process. A Markovian stochastic process can be modeled by an HMM-based approach in which the actual states are not directly visible; these states are presumed to be unobserved or hidden. Instead, what can be observed is a state that is stochastically dependent on the unobserved state.

IV.IV. Principal Component Analysis. PCA [27] is an analytical procedure that converts correlated variables into linearly uncorrelated variables with the help of an orthogonal transformation. These uncorrelated variables are named principal components. PCA is a multivariate dimensionality reduction tool that extracts the features capturing most of the information in the given data and removes the features carrying the least information, without losing the crucial information in the data. When real data is collected, the random variables representing the data attributes are presumed to be highly correlated. The correlation between random variables can be found in the covariance matrix. The sum of the variances gives the overall variability.

V. SEMI-SUPERVISED LEARNING
Semi-supervised learning uses a combination of labeled and unlabeled data. The labeled data is very sparse, while there is an enormous amount of unlabeled data. The data is used to create an appropriate classification model. The goal of semi-supervised learning is to classify the unlabeled data with the help of the labeled data. This section explores some of the most familiar algorithms used in semi-supervised learning.

V.I. Generative models. A generative model [28] considers a model $p(u, v) = p(u \mid v)\,p(v)$, where $p(u \mid v)$ is known as the mixture distribution. The mixture components can be identified when a large amount of unlabeled data is available. The generative model is a model of the conditional probability of the observable value X, given a value Y. Let $\{P_\theta\}$ be a family of distributions indexed by a parameter vector θ. θ is identifiable only if $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$, up to a permutation of the mixture components. The expectation-maximization (EM) algorithm is applied to a multinomial mixture for the task of text classification [29].

V.II. Transductive Support Vector Machine. The TSVM [30] extends the Support Vector Machine (SVM) to unlabeled data. The idea is to label the unlabeled data so that a linear boundary has the maximal margin on both the labeled and the unlabeled data. Such a decision boundary has the smallest generalization error bound on the unlabeled data. The unlabeled data push the linear boundary away from the dense regions. With all the available approximate solutions to TSVM, it is interesting to understand just how valuable a globally optimal TSVM solution would be. A globally optimal solution on small datasets is found in [31]. Overall, an excellent accuracy is obtained on small datasets.

V.III. Graph based approaches. A graph structure is defined by a set of vertices V and a set of edges E; more formally, the structure can be written as G = (V, E). The graph is built from nodes and edges: the nodes represent the labeled and unlabeled patterns or samples, and the edges express the affinity between labeled and unlabeled data. The label information of each pattern is propagated to its adjoining patterns until a global optimum state is attained; in this way the labels of the labeled patterns spread to the adjoining points. Graph based techniques are a focus of interest among researchers due to their better performance. Blum et al. [33] pose semi-supervised learning as a graph mincut problem. Szummer et al. [34] perform a t-step Markov random walk on the graph.

V.IV. Self Training Methods. Self-training is a methodology applied in semi-supervised learning. The classifier is first trained on a small amount of labeled data and is then applied to classify the unlabeled data. The most confidently predicted unlabeled points, together with their predicted labels, are appended to the training dataset. The classifier is then trained again on the enlarged training dataset, and this procedure is repeated. The classifier thus uses its own predictions to teach itself. This methodology is called bootstrapping or self-teaching [35]. Various natural language processing tasks apply the methodology of self-teaching.

VI. REINFORCEMENT LEARNING
In reinforcement learning the software agent gathers information from its interaction with the environment in order to take actions that maximize the reward. The environment is formulated as a Markov decision process. In reinforcement learning there are no given input/output pairs. The software agent receives an input i, the present state s of the environment, and then determines an action a to generate as output. The action changes the state of the environment, and the value of this state transition is communicated to the software agent through a scalar reinforcement signal. After the action is chosen, reinforcement learning gives feedback to its software agent.
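A minimal sketch of this interaction loop, under stated assumptions: a hypothetical five-state corridor in which the agent starts at the left end and receives a scalar reward of 1 on reaching the right end, with tabular Q-learning (reviewed in VI.I) standing in as the learner. The environment, the hyperparameters and all names are illustrative and not taken from the survey.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal goal (assumed)
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Environment: return (next_state, scalar reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Partially random (epsilon-greedy) action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward + discounted best next value.
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

q = train()
# Greedy action in each non-terminal state after training.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct; it only sees the state, its own action and the scalar reward, and the greedy policy is read off the learned table afterwards.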
The software agent is told the reward and its subsequent state, but it is not told which action would be best in terms of long-term interest. The software agent needs to gather information about the states, actions, transitions and rewards in order to act optimally. This section reviews algorithms used in reinforcement learning.

VI.I. Q-Learning. Q-learning [36] is a type of model-free reinforcement learning. It can also be viewed as an approach to asynchronous dynamic programming (DP). Q-learning gives agents the ability to learn to act optimally in Markovian domains by experiencing the effects of their actions, without requiring them to build maps of the domain. For a finite Markov decision process, Q-learning finds an optimal policy in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state, given infinite exploration time and a partially random policy. Q-learning can thus identify an optimal action-selection policy.

VI.II. Deep Q-Networks. Deep Q-Networks (DQNs) [37] combine reinforcement learning with a deep network. DQNs consider a task in which the agent interacts with an environment through a sequence of observations, actions and rewards. The main aim of the agent is to select actions in a manner that maximizes the cumulative future reward. A DQN applies experience replay, which randomizes over the data, thereby eliminating correlations in the observation sequence and smoothing over changes in the data distribution. To reduce correlations with the target, an iterative update is applied in which the target values are only periodically updated.

VII. CONCLUSION
In this study, various machine learning approaches and their techniques were analyzed. The classification of machine learning approaches into supervised learning, un-supervised learning, semi-supervised learning and reinforcement learning, together with their various algorithms, is the important contribution of this study. In future we intend to develop a model based on machine learning techniques.

REFERENCES
[1] A. M. Turing, “Computing Machinery and Intelligence”. Mind 59, pp: 433-460, 1950.
[2] Phil Simon, “Too Big to Ignore: The Business Case for Big Data”. Wiley. ISBN 978-1-118-63817-0, 2013.
[3] Mohssen Mohammed, Muhammad Badruddin Khan, “Machine Learning Algorithms and Applications”. CRC Press, Taylor and Francis Group, 2017.
[4] Chih-Fong Tsai, Yu-Feng Hsu, “Intrusion detection by machine learning: A review”. Expert Systems with Applications. pp: 11994-12000. Elsevier, 2009.
[5] Myeongsu Kang, Noel Jordan Jameson, “Machine learning Fundamentals”. Prognostics and Health Management in Electronics: Fundamentals, Machine Learning, and Internet of Things. Wiley Online Library, 2018.
[6] Ross Quinlan, “Machine learning”. Vol. 1, no. 1, 1986.
[7] Quinlan, J. R, “Induction of Decision trees”. Machine Learning. Vol. 1, Issue 1. pp: 81-106, Springer, 1986.
[8] Quinlan, J. R, “C4.5: Programs for Machine Learning”. Morgan Kaufmann Publishers, 1993.
[9] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J, “Classification and Regression Trees”, Wadsworth, Belmont, CA. Republished by CRC Press, 1984.
[10] William Cohen, “RIPPER Fast Effective Rule Induction”, Proceedings of the 12th International Conference on Machine Learning, 1995.
[11] Eiben A.E, “Genetic algorithms with multi-parent recombination”. PPSN III: In Proc. International Conference on Evolutionary Computation. The Third Conference on Parallel Problem Solving from Nature: pp: 78-87, 1994.
[12] Colin R. Reeves, Jonathan E. Rowe, “Genetic Algorithms: Principles and Perspectives: A Guide to GA Theory”. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[13] Rish, Irina, “An empirical study of the naive Bayes classifier”. IJCAI Workshop on Empirical Methods in AI, 2001.
[14] Altman, N. S, “An introduction to kernel and nearest-neighbor nonparametric regression”. The American Statistician. 46 (3): 175-185, 1992.
[15] McCulloch, Warren; Walter Pitts, “A Logical Calculus of Ideas Immanent in Nervous Activity”. Bulletin of Mathematical Biophysics. 5 (4): 115-133, 1943.
[16] Freund, Y.; Schapire, R. E, “Large margin classification using the perceptron algorithm”. Machine Learning. 37 (3): 277-296, 1999.
[17] Buhmann, Martin Dietrich, “Radial basis functions: theory and implementations”. Cambridge University Press, 2003.
[18] Guo-Xun Yuan, Chia-Hua Ho, “Recent Advances of Large-Scale Linear Classification”. pp: 2584-2603, Proceedings of the IEEE, 2012.
[19] Fisher, R. A, “The Use of Multiple Measurements in Taxonomic Problems”. Annals of Eugenics. 7 (2): 179-188, 1936.
[20] Cortes, Corinna, Vapnik, Vladimir N, “Support-vector networks”. Machine Learning. 20 (3): 273-297, 1995.
[21] John C. Platt, “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”. pp: 61-74, Advances in Large Margin Classifiers, MIT Press, 1999.
[22] MacQueen J. B, “Some Methods for classification and Analysis of Multivariate Observations”. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. pp: 281-297, University of California Press, 1967.
[23] Kaufman, L. and Rousseeuw, P.J, “Clustering by means of Medoids”, in Statistical Data Analysis Based on the Norm and Related Methods, pp: 405-416, North-Holland, 1987.
[24] Duda, R. O. and Hart, P. E, “Pattern Classification and Scene Analysis”. John Wiley and Sons, Inc, 1973.
[25] Dempster, A.P, Laird N.M, Rubin, D.B, “Maximum Likelihood from Incomplete Data via the EM Algorithm”. pp: 1-38, Journal of the Royal Statistical Society, 1977.
[26] Baum, L. E.; Petrie, T, “Statistical Inference for Probabilistic Functions of Finite State Markov Chains”. pp: 1554-1563, The Annals of Mathematical Statistics, 1966.
[27] Pearson, K., “On Lines and Planes of Closest Fit to Systems of Points in Space”. pp: 559-572, Philosophical Magazine, 1901.
[28] Ng, Andrew and Jordan, Michael, “On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive bayes”. Advances in Neural Information Processing Systems, 2002.
[29] Kamal Nigam, Andrew K Mccallum, “Text classification from
labeled and unlabelled documents using EM”. Vol. 39, pp: 103-
134, Machine learning, Springer, 2000.
[30] Vapnik, V, Chervonenkis, A, “Theory of Pattern Recognition”.
Nauka, Moscow, 1974.
[31] Chapelle, O., Sindhwani, V., & Keerthi, S. S, “Branch and
bound for semisupervised support vector machines”. Advances
in Neural Information Processing Systems (NIPS), 2006.
[32] Xiaojin Zhu, “Semi-Supervised Learning Literature Survey”.
University of Wisconsin, Madison, 2005.
[33] Blum, A., & Chawla, S, “Learning from labeled and unlabeled
data using graph mincuts”. Proc. 18th International Conf. on
Machine Learning, 2001.
[34] Szummer, M., & Jaakkola, T, “Partially labeled classification
with Markov random walks”. Advances in Neural Information
Processing Systems, 2001.
[35] Chapelle Olivier, Schölkopf Bernhard, Zien, Alexander, “Semi-
supervised learning”, MIT Press, 2006.
[36] Christopher J. C. H. Watkins, Peter Dayan, “Q-Learning”.
Machine Learning, Vol. 8, pp: 279-292, Springer, 1992.
[37] Volodymyr Mnih, Koray Kavukcuoglu, “Playing Atari with
Deep Reinforcement Learning”, DeepMind Technologies. pp:
1-9, Toronto, 2013.