ML Research Paper
I. Introduction
Although the term machine learning has its origins in computer science, several vector
quantization methods were developed in telecommunications and signal processing for coding and
compression. In computer and data science, learning is accomplished based on examples (data
samples) and experience. A basic signal/data processing framework that includes pre-processing,
noise removal and segmentation is shown in Figure 1, where the signal is acquired from the sensor
and then processed, typically in a frame-by-frame or batch mode. Noise removal and feature
extraction follow next, and finally the classification stage, which provides either an estimate or a
decision, completes the process.
Figure 1: Basic signal processing framework including pre-processing, feature extraction and classification.
Typically, the feature extraction stage extracts compact, information-bearing parameters that
characterize the data. The classification stage then has to be trained by a machine learning
algorithm to recognize and classify the collection of features. The field of machine learning is vast
and its applications are expanding rapidly, especially with the emergence of fast mobile devices that
also have access to cloud computing. Compressing and extracting information from sensors and
big data have recently elevated interest in the area. Smart city projects, mobile health monitoring,
networked security, manufacturing, self-driving automobiles, surveillance, intelligent border control:
every application has its idiosyncrasies and requires customized features, adaptive learning, and data
fusion. Data compression and statistical signal and data analysis have a large role in transmitting and
interpreting data and producing meaningful analytics. Machine learning algorithms can be broadly
classified into three categories based on their properties, style of learning, and the way data are used
[13]: supervised, unsupervised, and semi-supervised algorithms. This type of classification is important
in identifying the role of the input data and the utility of the algorithms and learning models relative to
the applications.
II. SUPERVISED LEARNING
In supervised learning, “true” or “correct” labels of the input dataset are available. The
algorithm is “trained” using the labelled input dataset (training data) which means ground truth
samples are available for training. In the training process, the algorithm makes predictions
on the input data and improves its estimates using the ground truth, iterating
until it reaches a desired level of accuracy. In almost all machine learning
algorithms, we optimize a cost function or an objective function. The cost function is typically a
measure of the error between the ground truth and the algorithm estimates. By minimizing the
cost function, we train our model to produce estimates that are close to the correct values
(ground truth). Minimization of the cost function is usually achieved using the gradient descent
technique. Variants of gradient descent, such as stochastic gradient descent on a minibatch,
momentum-based gradient descent, and Nesterov accelerated gradient descent, have been used in
many machine learning training paradigms. Suppose we have 𝑚 training examples, each of them
labelled and represented as a pair (x, 𝑦), where x represents the input data and 𝑦 represents the
class label. The input data x can be 𝑛-dimensional, where each dimension corresponds to a feature
or a variable.
Supervised learning methods are used in various fields, including the identification of phytoplankton
species, the mapping of rainfall-induced landslides, and the classification of biomedical data. Machine
learning algorithms have also been integrated on embedded sensor systems for IoT applications. In the
following sub-sections, we
present supervised learning algorithms.
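Before doing so, the following is a minimal sketch of the batch gradient descent loop described above, written in Python/NumPy; the least-squares cost, the learning rate, and the toy data are our own illustrative assumptions rather than part of any specific method in this paper.

import numpy as np

def gradient_descent(X, y, lr=0.5, n_iters=5000):
    # Minimize the least-squares cost J(w) = (1/2m)||Xw - y||^2.
    # X is (m, n+1) with a leading column of ones for the bias; y is (m,).
    m = X.shape[0]
    w = np.zeros(X.shape[1])              # initial weights
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m      # gradient of the cost
        w -= lr * grad                    # step along the negative gradient
    return w

# Toy usage: recover y ~ 1 + 2x from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=100)
print(gradient_descent(X, y))             # approximately [1.0, 2.0]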
A. Linear Regression
Regression is a statistical technique of estimating the relationship between input and output
variables. It maps the input variables to a continuous function. A simple univariate linear
regression model is shown in Figure 2.
The training dataset consists of 𝑚 labelled training pairs (x, 𝑦) ∈ ℝ𝑛+1, where x is the independent
variable and 𝑦 is the dependent variable. The linear regression model assumes that the relationship
between the independent and dependent variables is linear and fits a straight line to the data
points. This relationship is expressed by a hypothesis function or a prediction function:

ℎ(x) = w0 + w1𝑥1 + w2𝑥2 + · · · + w𝑛𝑥𝑛 (1)

where 𝑥1, 𝑥2, . . . , 𝑥𝑛 are the features and w0, w1, . . . , w𝑛 are the weights of the model. The
approach can also be used to perform linear regression through slope filtering. Equation (1) is for a
multivariate linear regression model. The output is the linear sum of the weighted input features.
The weights are typically learned by a weighted least-squares minimization process. We can also
make use of quadratic, cubic, or higher-order polynomial terms to obtain a different hypothesis
function that fits quadratic, cubic, or polynomial curves, respectively, rather than a simple
straight line. Multivariate linear regression is used in several applications, including activity
recognition and classification, and steady-state visual evoked potential (SSVEP) recognition for BCI data.
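As a hedged illustration of the least-squares fit behind Equation (1), the sketch below (Python/NumPy; the toy data and the choice of a quadratic term are our own) solves the problem with a standard least-squares routine and shows how adding polynomial terms changes the hypothesis.

import numpy as np

# Toy data (illustrative): noisy samples of an underlying quadratic.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 50)
y = 0.5 + 1.5 * x - 2.0 * x**2 + 0.1 * rng.normal(size=x.size)

# Linear hypothesis h(x) = w0 + w1*x: design matrix [1, x].
X_lin = np.column_stack([np.ones_like(x), x])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# Quadratic hypothesis h(x) = w0 + w1*x + w2*x^2: add a squared feature.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
w_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

print("linear weights:   ", w_lin)
print("quadratic weights:", w_quad)   # approximately [0.5, 1.5, -2.0]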
B. Logistic Regression
The objective of the multivariate regression model is to determine a hypothesis function that outputs
a continuous value. Now, we present another class of supervised learning algorithms: Classification,
in which the objective is to obtain a discrete output. Logistic regression is a statistical way of
modelling a binomial outcome. As before, the input can have one or more features (or variables).
For binary logistic regression, the outcome can be 0 or 1, yielding a binary classification that separates
the positive class from the negative class. Logistic regression uses the sigmoid curve shown in Figure 3 to
output a probability value and thus perform the classification. The hypothesis function for
logistic regression is based on the sigmoid function
𝑆(𝑧) = 1 / (1 + 𝑒−𝑧) (3)
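A minimal sketch of binary logistic regression trained by gradient descent on the cross-entropy cost follows (Python/NumPy; the synthetic one-dimensional data and the learning rate are our own illustrative assumptions).

import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + e^(-z)), as in Equation (3).
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iters=5000):
    # Gradient descent on the cross-entropy cost; X is (m, n+1) with a
    # bias column, and y holds 0/1 class labels.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)                 # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the cross-entropy
    return w

# Toy usage: class 1 tends to have larger feature values.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])
X = np.column_stack([np.ones_like(x), x])
w = train_logistic(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)   # thresholded class decisions
print((preds == y).mean())                    # training accuracy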
D. Naïve Bayes
The Naïve Bayes classifier relies on Bayes' theorem,

𝑝(𝜔𝑐|x) = 𝑝(x|𝜔𝑐) 𝑝(𝜔𝑐) / 𝑝(x),

for each of the 𝐶 possible outcomes or 𝐶 classes. Here, 𝑝(𝜔𝑐|x) is the posterior probability that
a given feature vector x belongs to the 𝑐th class 𝜔𝑐, 𝑝(𝜔𝑐) is the prior probability of the class 𝜔𝑐
independent of the data, 𝑝(x|𝜔𝑐) is the likelihood, i.e., the probability of the predictor given the class,
and 𝑝(x) is the prior probability of the predictor, which acts as a normalizing factor. There are many variations of
the Naïve Bayes algorithm, some of which address its poor independence assumptions [54,55,56]. The Naïve Bayes
algorithm is used for text classification [57], credit scoring [58], emotion classification and
recognition [67], and the detection of epileptic seizures from EEG signals.
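As a concrete sketch, the code below implements a Gaussian Naïve Bayes classifier, a common variant in which each feature is modelled as a class-conditional Gaussian; the Gaussian likelihood and the toy data are our own illustrative assumptions.

import numpy as np

class GaussianNaiveBayes:
    # Assumes the features are conditionally independent Gaussians per class.
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu, self.var, self.prior = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.mu[c] = Xc.mean(axis=0)          # per-feature class means
            self.var[c] = Xc.var(axis=0) + 1e-9   # variances (smoothed)
            self.prior[c] = len(Xc) / len(X)      # class prior p(w_c)
        return self

    def predict(self, X):
        # Compare log p(x|w_c) + log p(w_c); p(x) is a common normalizer.
        scores = []
        for c in self.classes:
            log_lik = -0.5 * (np.log(2 * np.pi * self.var[c])
                              + (X - self.mu[c]) ** 2 / self.var[c]).sum(axis=1)
            scores.append(log_lik + np.log(self.prior[c]))
        return self.classes[np.argmax(scores, axis=0)]

# Toy usage with two well-separated 2-D Gaussian classes.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, y)
print((model.predict(X) == y).mean())   # training accuracy, close to 1.0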
E. k-Nearest Neighbors
The k-Nearest Neighbors (k-NN) algorithm is one of the simplest supervised machine learning algorithms.
k-NN can be used for the classification of input points into discrete outcomes. A simple k-NN model is shown in
Figure 5.
Figure 4: Maximum margin intuition; hyperplane A has maximum separation.
Figure 5: A simple k-NN model for different values of k.
k-NN can also be used for regression analysis, where the outcome of a dependent variable is
predicted from the input independent variables. In Figure 5, for k = 3, the test point (star) is
classified as belonging to class B, and for k = 6, the point is classified as belonging to class A. k-NN is a
non-probabilistic and non-parametric model, and hence it is a first choice for classification studies when
there is no prior knowledge about the distribution of the data. k-NN stores all the labelled input points in
order to classify any unknown sample, and this makes it computationally expensive. The classification is
based on a similarity measure (a distance metric): any unknown sample is classified by the majority vote
of its k nearest neighbors. The complexity increases as the dimensionality increases, and hence
dimensionality reduction techniques are applied before using k-NN to avoid the effects of the curse of
dimensionality. The k-NN classifier has been used for stress detection from physiological signals and for
the detection of epileptic seizures.
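The majority-vote rule can be sketched in a few lines (Python/NumPy; the Euclidean distance and the toy points are our own choices, arranged so that the predicted class can change with k, as in Figure 5).

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    # Classify x_query by majority vote among its k nearest training
    # points under the Euclidean distance.
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distances to all points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two clusters of labelled points and one query point.
X_train = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
query = np.array([2, 2])
print(knn_classify(X_train, y_train, query, k=3))   # vote among 3 neighbors
print(knn_classify(X_train, y_train, query, k=5))   # a larger k can flip the class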
III. UNSUPERVISED LEARNING
In the case of unsupervised algorithms, there are no explicit labels associated with the training
dataset. The objective is to draw inferences from the input data and then model the hidden or
underlying structure and distribution of the data, in order to learn more about it.
Clustering is the most common example of an unsupervised algorithm, and it is described below.
A. Clustering
Clustering deals with finding a structure or pattern in a collection of unlabeled data. For a
given dataset, a clustering algorithm groups the data into K clusters such that the
data points within each cluster are similar to each other and data points from different clusters are
dissimilar. As in the k-NN algorithm, we make use of a similarity or distance metric, and
different metrics such as the Euclidean, Mahalanobis, cosine, and Minkowski distances are used.
Although the Euclidean distance metric is used most often, it is not always a suitable metric for capturing
the quality of the clustering. The K-means algorithm is one of the simplest clustering algorithms
and is an intuitive, iterative algorithm. It clusters the data by separating them into K groups of
equal variance, minimizing the inertia or within-cluster sum-of-squares. However, the algorithm
requires the number of clusters to be specified before running. Each observation or data point is
assigned to the cluster with the nearest mean 𝝁(j), which is also referred to as the centroid of that
cluster. Thus, the K clusters can be specified by the K centroids.
The K-means clustering algorithm leads to a Voronoi tessellation of the feature space. The K-means
iterations stop (converge) when the means of the clusters no longer change. In
Figure 6, a converged K-means algorithm is shown. Clustering has several applications in many
fields. In biology, clustering has been used to determine groups of genes that have similar
functions and for the detection of brain tumors; other applications include cardiogram data
clustering, business and e-commerce analysis, information retrieval, image segmentation and
compression, the study of quantitative resolutions of nanoparticles, fault detection in solar PV
panels, and speech recognition.
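A minimal K-means sketch follows (Python/NumPy); the random initialization from data points and the toy two-blob data are our own choices, while the assignment/update steps and the convergence test on unchanged means follow the description above.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # Lloyd's iterations: assign each point to its nearest centroid, then
    # recompute each centroid as the mean of its cluster; stop when the
    # means no longer change. Empty clusters are not handled in this sketch.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # initial means
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                         # assignment step
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):         # converged
            break
        centroids = new_centroids                         # update step
    return centroids, labels

# Toy usage: two well-separated blobs in 2-D.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)   # approximately [0, 0] and [5, 5]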
B. Vector Quantization
In its simplest form, vector quantization organizes data into vectors and represents them by their
centroids. A K-means clustering algorithm is typically used to train the quantizer. The centroids form
the codewords, and all the codewords are stored in a codebook.
Figure 7: Uniform quantization of 2-dimensional data.
Figure 8: Vector quantization of 2-dimensional data.
Vector quantization is a lossy compression method and is used in several coding applications. As a result, the
compressed data has quantization errors that are inversely proportional to the local data density. This property
is shown in Figure 8 and compared with uniform quantization in Figure 7. The vector quantization technique is
used in various applications, including speech coding, emotion recognition, audio compression, large-scale
image classification, and image compression.
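Building directly on K-means, the sketch below (Python/NumPy; the codebook size and the toy data are our own choices) trains a codebook whose centroids serve as codewords, and encodes each vector as the index of its nearest codeword.

import numpy as np

def train_codebook(X, n_codewords, n_iters=50, seed=0):
    # Train a VQ codebook with K-means: the centroids are the codewords.
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), n_codewords, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(idx == k):                  # skip empty cells
                codebook[k] = X[idx == k].mean(axis=0)
    return codebook

def encode(X, codebook):
    # Lossy compression: each vector is replaced by the index of its
    # nearest codeword; decoding is a simple codebook lookup.
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy usage: quantize 2-D data with a 4-word codebook.
rng = np.random.default_rng(5)
X = rng.normal(0, 1, (200, 2))
cb = train_codebook(X, 4)
indices = encode(X, cb)                # transmitted indices
reconstructed = cb[indices]            # decoder output, with quantization error
print(np.mean(np.linalg.norm(X - reconstructed, axis=1)))   # average distortion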
IV. ARTIFICIAL NEURAL NETWORKS AND DEEP LEARNING
In this section, a brief introduction to the field of artificial neural networks is provided, with a
focus on deep learning methodologies and their applications. Artificial neural networks are widely
used in the areas of image classification and pattern recognition; they have proved to be highly
successful, achieving superior results in various fields including signal processing, computer
vision, speech processing, and natural language processing.
Deep learning is a branch of machine learning that has gained popularity quite recently and is
capable of learning multiple levels of abstraction. Although the inception of neural networks dates
back to the 1960s, deep learning has gained popularity since 2012 because of the great advances in
GPUs and the availability of large labelled datasets. In Figure 9, a simple artificial neural network with 4
hidden layers is shown. The last layer, namely the output layer, performs classification. The term
“deep learning” refers to several layers used to learn multiple levels of representation. Each
successive layer takes the output of the previous layer and feeds the result to the next layer.
Figure 9: Artificial Neural Network with four hidden layers.
Typical challenges in artificial neural networks include the initialization of the network parameters,
overfitting, and long training times. We now have various techniques to address these
problems: batch normalization, normalization propagation, weight normalization, and layer
normalization all help in accelerating the training of deep neural networks, while dropout helps in
reducing overfitting. There are several network architectures, including the one shown in Figure 9,
which consists of dot product layers (fully connected layers). A convolutional layer processes a
volume of activations rather than a vector and produces feature maps. It also makes use of a
subsampling layer or a max-pooling layer to reduce the size of the feature maps. Figure 10
shows an example of a convolutional neural network (CNN). Networks whose output depends
on present and past inputs, namely recurrent neural networks (RNNs), have also been used in
several applications.
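To make the layer types above concrete (convolution, max-pooling, and fully connected layers), the following is a small sketch in PyTorch; the framework choice and all layer sizes are our own illustrative assumptions, not an architecture prescribed by the text.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # Convolution -> ReLU -> max-pool stages produce feature maps; a fully
    # connected (dot product) layer then performs the classification.
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # subsampling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 32 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # output layer

    def forward(self, x):
        x = self.features(x)            # volumes of activations
        x = x.flatten(start_dim=1)      # flatten to one vector per example
        return self.classifier(x)       # class scores

# Toy usage on a batch of eight 28x28 single-channel images.
model = SmallCNN()
scores = model(torch.randn(8, 1, 28, 28))
print(scores.shape)                     # torch.Size([8, 10])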
CONCLUSION
This short machine learning survey paper supported the tutorial session of IISA 2017. The
paper covered supervised and unsupervised learning models. We also provided a brief
introduction to current deep learning methodologies and outlined several applications including
pattern recognition, anomaly detection, computer vision and speech processing. The paper
provides an extensive bibliography of machine learning algorithms and their applications.