A Quick Review of Machine Learning Algorithms: Susmita Ray
Feb 2019
Abstract—Machine learning is predominantly an area of Artificial Intelligence which has been a key component of digitalization solutions that have attracted major attention in the digital arena. In this paper the author intends to give a brief review of the machine learning algorithms that are most frequently used and are therefore the most popular ones. The author intends to highlight the merits and demerits of these algorithms from the application perspective, to aid an informed decision when selecting the appropriate learning algorithm to meet the specific requirements of an application.

Keywords—Gradient Descent, Logistic Regression, Support Vector Machine, K Nearest Neighbor, Artificial Neural Network, Decision Tree, Back Propagation Algorithm, Bayesian Learning, Naïve Bayes.

I. INTRODUCTION

A good starting point for this paper is the fundamental concept of Machine Learning. In Machine Learning a computer program is assigned to perform some tasks, and the machine is said to have learnt from its experience if its measurable performance in these tasks improves as it gains more and more experience in executing them. So the machine takes decisions and makes predictions / forecasts based on data. Take the example of a computer program that learns to detect / predict cancer from the medical investigation reports of a patient. Its performance will improve as it gathers more experience by analyzing the medical investigation reports of a wider population of patients. Its performance will be measured by the count of correct predictions and detections of cancer cases, as validated by an experienced oncologist. Machine Learning is applied in a wide variety of fields, namely: robotics, virtual personal assistants (like Google), computer games, pattern recognition, natural language processing, data mining, traffic prediction, online transportation networks (e.g. estimating surge price in peak hours by the Uber app), product recommendation, share market prediction, medical diagnosis, online fraud prediction, agriculture advisory, search engine result refining (e.g. Google search engine), bots (chatbots for online customer support), e-mail spam filtering, crime prediction through video surveillance systems, and social media services (face recognition in Facebook). Machine Learning generally deals with three types of problems, namely: classification, regression and clustering. Depending on the types and categories of training data available, one may need to select from the available techniques of “supervised learning”, “unsupervised learning”, “semi supervised learning” and “reinforcement learning” in order to apply the appropriate machine learning algorithm. In the next few sections, some of the most widely used machine learning algorithms are reviewed.

II. GRADIENT DESCENT ALGORITHM

Gradient Descent is an iterative method in which the objective is to minimize a cost function. It must be possible to compute the partial derivatives of the function, i.e. its slope or gradient. At each iteration the coefficients are updated by stepping in the direction of the negative gradient: each coefficient is reduced by the learning rate (step size) multiplied by the derivative, so that a local minimum can be reached after a number of iterations. The iterations are stopped when the method converges to a minimum value of the cost function, after which there is no further reduction in the cost. There are three different types of this method: “Stochastic Gradient Descent” (SGD), “Batch Gradient Descent” (BGD) and “Mini Batch Gradient Descent” (MBGD).

In BGD the error is computed for every example within the training dataset, but the model is updated only after all training examples have been evaluated. The main advantage of the BGD algorithm is computational efficiency. It produces a stable error gradient and a stable convergence. However, the algorithm has the disadvantage that the stable error gradient can sometimes result in a state of convergence that is not the best the model can achieve. Also, the algorithm requires the entire training dataset to be in memory and available to it.

In SGD the error is calculated for each training example within the dataset and the parameters are updated after every training example. This might make SGD faster than BGD for a specific problem. SGD has the advantage that the frequent updates give a detailed picture of the rate of improvement. However, the frequent updates are more computationally expensive than the BGD approach. The frequency of those updates can also result in noisy gradients, which may cause the error rate to jump around instead of decreasing slowly. An example application of SGD is evaluating the performance contribution of employees to the organization, which can help in creating an employee incentivisation scheme.
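
As a rough illustration of the three variants named in the Gradient Descent section, the following Python sketch fits a one-variable linear model by minimizing a mean squared error cost. The toy data, the function name gradient_descent and the hyperparameter values are assumptions made for this example and do not come from the paper. A single batch_size parameter selects the variant: the whole dataset per update (BGD), one example per update (SGD), or a small batch per update (MBGD).

```python
import random

def gradient_descent(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error.

    batch_size == len(xs) behaves as batch gradient descent (BGD),
    batch_size == 1 behaves as stochastic gradient descent (SGD),
    anything in between behaves as mini-batch gradient descent (MBGD).
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Partial derivatives of the mean squared error over this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free toy data generated from w = 2, b = 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2.0 * x + 1.0 for x in xs]

w_bgd, b_bgd = gradient_descent(xs, ys, batch_size=len(xs))
w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)
w_mbgd, b_mbgd = gradient_descent(xs, ys, batch_size=4)
```

On this small noise-free dataset all three settings should recover roughly the same coefficients; the differences discussed in the text (stability of the gradient, cost of frequent updates, memory use) only become visible on larger or noisier data.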
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on September 23,2020 at 10:01:15 UTC from IEEE Xplore. Restrictions apply.
The MBGD approach is obtained by combining the concepts of SGD and BGD. In this approach the training dataset is split into small batches and an update is performed for each of these batches. It therefore strikes a balance between the robustness of SGD and the efficiency of BGD. This algorithm can be used to train a neural network and so it is mostly used in deep learning. Gradient Descent optimization is used in the Backpropagation algorithm, wherein the gradient of the loss function is computed to adjust the weights of the neurons.

The Gradient Descent algorithm has the following disadvantage: if the learning rate is too high, the steps are so large that the algorithm can overshoot the true local minimum. If it is too low, gradient descent takes such small steps that it may not converge within a practical number of iterations. The learning rate can affect which minimum is reached and how quickly it is reached. A good practice is to use a changing learning rate that slows down as the error starts to decrease.

III. LINEAR REGRESSION ALGORITHM

Regression is a supervised learning approach. It can be used to model continuous variables and make predictions. Examples of applications of the linear regression algorithm are the following: prediction of real-estate prices, forecasting of sales, prediction of students’ exam scores, and forecasting of movements in the price of stocks on a stock exchange. In regression we have labeled datasets and the value of the output variable is determined by the values of the input variables, so it is a supervised learning approach. The simplest form of regression is linear regression, where an attempt is made to fit a straight line (or, in higher dimensions, a hyperplane) to the dataset; this is possible when the relationship between the variables of the dataset is linear.

Linear regression has the advantage that it is easy to understand, and it is also easy to avoid over-fitting through regularization. Also, SGD can be used to update linear models with new data. Linear Regression is a good fit if it is known that the relationship between the covariates and the response variable is linear. It shifts focus from statistical modeling to data analysis and preprocessing. Linear Regression is good for learning about the data analysis process. However, it is not a recommended method for most practical applications because it oversimplifies real world problems.

The disadvantage of Linear Regression is that it is not a good fit when one needs to deal with non-linear relationships. Handling complex patterns is difficult. Also, it is tough to add the right polynomial terms to the model. Linear Regression oversimplifies many real world problems: the covariates and response variables usually do not have a linear relationship. Hence fitting a regression line using ordinary least squares (OLS) will give a line with a high training residual sum of squares (RSS). In real world problems there may not be the relationship between the mean of the dependent variable and the independent variables which linear regression expects.

IV. MULTIVARIATE REGRESSION ANALYSIS

A simple linear regression model has a dependent variable guided by a single independent variable. However, real life problems are more complex. Generally one dependent variable depends on multiple factors. For example, the price of a house depends on many factors such as the neighborhood it is situated in, its area, the number of rooms, attached facilities, the distance to the nearest station / airport, the distance to the nearest shopping area, etc. In summary, in simple linear regression there is a one-to-one relationship between the input variable and the output variable, but in multiple linear regression there is a many-to-one relationship between a number of independent (input/predictor) variables and one dependent (output/response) variable. Adding more input variables does not mean the regression will be better or will offer better predictions. Multiple and simple linear regression have different use cases and one is not superior to the other. In some cases adding more input variables can make things worse, as it results in over-fitting. Moreover, as more input variables are added, relationships arise among them. So not only are the input variables potentially related to the output variable, they are also potentially related to each other; this is referred to as multicollinearity. The optimal scenario is for all of the input variables to be correlated with the output variable, but not with each other.

The multivariate technique has the following merits: it gives deep insight into the relationship between the set of independent variables and the dependent variables. It also gives insight into the relationships among the independent variables. This is achieved through multiple regression, tabulation techniques and partial correlation. It models complex real world problems in a practical and realistic way.

The multivariate technique has the following demerits: the complexity of this technique is high and it requires knowledge of and expertise in statistical techniques and statistical modeling. The sample size for statistical modeling needs to be large to get a higher confidence level in the analysis outcome. Also, it often becomes too difficult to do a meaningful analysis and interpretation of the outputs of the statistical model.

This regression analysis technique involving multiple variables can be used in property valuation, car evaluation, forecasting electricity demand, quality control, process optimization, quality assurance, process control, medical diagnosis, etc.

V. LOGISTIC REGRESSION

Logistic regression is used to deal with classification problems. It gives a binomial outcome, as it gives the probability that an
event will occur or not (in terms of 0 and 1) based on the values of the input variables. For example, predicting whether a tumor is malignant or benign, or whether an e-mail is spam or not, are instances which can be considered binomial outcomes of Logistic Regression. Logistic Regression can also have a multinomial outcome, e.g. prediction of the type of cuisine preferred: Chinese, Italian, Mexican, etc. There can be an ordinal outcome as well, such as a product rating from 1 to 5. So Logistic Regression deals with prediction of a target variable which is categorical, whereas Linear Regression deals with prediction of the values of a continuous variable, e.g. prediction of real estate prices over a span of 3 years.

Logistic Regression has the following advantages: simplicity of implementation, computational efficiency, efficiency from the training perspective, and ease of regularization. No scaling is required for input features. This algorithm is predominantly used to solve problems of industry scale. As the output of Logistic Regression is a probability score, applying it to a business problem requires specifying customized performance metrics so as to obtain a cutoff which can be used to classify the target. Also, logistic regression is not affected by small noise in the data or by multicollinearity. Logistic Regression has the following disadvantages: inability to solve non-linear problems as its decision surface is linear, proneness to over-fitting, and poor results unless all independent variables are identified. Some examples of practical applications of Logistic Regression are: predicting the risk of developing a given disease, cancer diagnosis, predicting mortality of injured patients, and, in engineering, predicting the probability of failure of a given process, system or product.

VI. DECISION TREE

Decision Tree is a supervised machine learning approach that solves classification and regression problems by repeatedly splitting data based on a certain parameter. The decisions are in the leaves and the data is split in the nodes. In a Classification Tree the decision variable is categorical (outcome in the form of Yes/No) and in a Regression Tree the decision variable is continuous. Decision Tree has the following advantages: it is suitable for regression as well as classification problems, it is easy to interpret, it handles categorical and quantitative values easily, it is capable of filling missing values in attributes with the most probable value, and it performs well owing to the efficiency of the tree traversal algorithm. Decision Tree might encounter the problem of over-fitting, for which Random Forest, which is based on an ensemble modeling approach, is the solution.

The disadvantages of Decision Tree are that it can be unstable, it may be difficult to control the size of the tree, it may be prone to sampling error, and it gives a locally optimal solution, not a globally optimal one. Decision Trees can be used in applications like predicting future use of library books and tumor prognosis problems.

VII. SUPPORT VECTOR MACHINE

Support Vector Machines (SVM) can handle both classification and regression problems. In this method a hyperplane needs to be defined, which is the decision boundary. When there is a set of objects belonging to different classes, a decision plane is needed to separate them. The objects may or may not be linearly separable; when they are not, complex mathematical functions called kernels are needed to separate the objects which are members of different classes. SVM aims at correctly classifying the objects based on examples in the training data set. The advantages of SVM are the following: it can handle both semi-structured and structured data, and it can handle complex functions if the appropriate kernel function can be derived. As generalization is adopted in SVM, there is less probability of over-fitting. It can scale up with high dimensional data. It does not get stuck in local optima.

The disadvantages of SVM are the following: its performance goes down with large data sets due to the increase in training time. It can be difficult to find an appropriate kernel function. SVM does not work well when the dataset is noisy. SVM does not provide probability estimates. Understanding the final SVM model is difficult. Support Vector Machine finds practical application in cancer diagnosis, credit card fraud detection, handwriting recognition, face detection, text classification, etc. So among the three approaches of Logistic Regression, Decision Tree and SVM, the first to attempt is logistic regression; next, decision trees (Random Forests) can be tried to see if there is significant improvement. When the number of observations and features is high, SVM can be tried out.

VIII. BAYESIAN LEARNING

In Bayesian Learning a prior probability distribution is selected and then updated to obtain a posterior distribution. Later on, when new observations become available, the previous posterior distribution can be used as a prior. Incomplete datasets can be handled by a Bayesian network. The method can prevent over-fitting of data. There is no need to remove contradictions from the data. Bayesian Learning has the following disadvantages: selection of the prior is difficult. The posterior distribution can be influenced by the prior to a great extent. If the selected prior is not correct, it will lead to wrong predictions. It can be computationally intensive. Bayesian Learning can be used for applications like medical diagnosis and disaster victim identification.
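
The prior-to-posterior update described in the Bayesian Learning section can be made concrete with a small Python sketch. It assumes a Beta prior over an unknown success probability and yes/no observations, a textbook conjugate pair chosen here only for illustration; the paper does not prescribe any particular model. The function names and the observation lists are hypothetical.

```python
def update_beta(prior, observations):
    """Return the posterior Beta(alpha, beta) after observing 0/1 outcomes."""
    alpha, beta = prior
    successes = sum(observations)
    failures = len(observations) - successes
    return (alpha + successes, beta + failures)

def posterior_mean(beta_params):
    """Mean of a Beta(alpha, beta) distribution."""
    alpha, beta = beta_params
    return alpha / (alpha + beta)

uniform_prior = (1, 1)              # no initial preference for any probability
first_batch = [1, 0, 1, 1]          # hypothetical observations
second_batch = [0, 1, 1, 1, 1, 0]   # hypothetical later observations

# The posterior from the first batch serves as the prior for the second batch.
posterior_1 = update_beta(uniform_prior, first_batch)
posterior_2 = update_beta(posterior_1, second_batch)

# Updating once with all the data gives the same posterior.
all_at_once = update_beta(uniform_prior, first_batch + second_batch)

# A strongly skewed (and here badly chosen) prior pulls the estimate away
# from what the data suggest.
skewed = update_beta((1, 20), first_batch + second_batch)
```

Comparing posterior_mean(posterior_2) with posterior_mean(skewed) shows how strongly the choice of prior can influence the result when data are scarce, which is the disadvantage the section points out.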
IX. NAÏVE BAYES

This algorithm is simple and is based on conditional probability. In this approach the model is a probability table, which is updated through the training data. The “probability table” is based on feature values, and one looks up the class probabilities in it to predict a new observation. The basic assumption is conditional independence, and that is why it is called “naive”. In a real world context the assumption that all input features are independent of one another can hardly hold true.

Naïve Bayes (NB) has the following advantages: implementation is easy, it gives good performance, it works with less training data, it scales linearly with the number of predictors and data points, it handles continuous and discrete data, it can handle binary and multi-class classification problems, and it makes probabilistic predictions. It is not sensitive to irrelevant features.

Naïve Bayes has the following disadvantages: models which are trained and tuned properly often outperform NB models, as the latter are too simple. If one of the features needs to be a “continuous variable” (like time), it is difficult to apply Naive Bayes directly; even though one can make “buckets” for “continuous variables”, this is not 100% correct. There is no true online variant of Naive Bayes, so all data need to be kept for retraining the model. It won’t scale when the number of classes is too high, like > 100K. Even for prediction it takes more runtime memory compared to SVM or simple logistic regression. It is computationally intensive, especially for models involving many variables.

Naïve Bayes can be used in applications such as recommendation systems and forecasting of cancer relapse or progression after radiotherapy.

X. K NEAREST NEIGHBOUR ALGORITHM

K Nearest Neighbor (KNN) Algorithm is a classification algorithm. It uses a database of data points grouped into several classes, and the algorithm tries to classify a sample data point given to it. KNN does not assume any underlying data distribution and so it is called non-parametric. The advantages of the KNN algorithm are the following: it is a simple technique that is easily implemented. Building the model is cheap. It is an extremely flexible classification scheme and well suited for multi-modal classes and for records with multiple class labels. Its error rate is asymptotically at most twice the Bayes error rate. It can sometimes be the best method: KNN outperformed SVM for protein function prediction using expression profiles.

The disadvantages of KNN are the following: classifying unknown records is relatively expensive, as it requires computing the distances to the k nearest neighbors. As the training set grows, the algorithm becomes computationally intensive. Noisy / irrelevant features will degrade accuracy.

It is a lazy learner; it computes distances over k neighbors. It does not do any generalization on the training data and keeps all of it. Large data sets therefore make its calculations expensive. Higher dimensional data will reduce the accuracy of regions. KNN can be used in recommendation systems, in medical diagnosis of multiple diseases showing similar symptoms, credit rating using feature similarity, handwriting detection, analysis done by financial institutions before sanctioning loans, video recognition, forecasting votes for different political parties, and image recognition.

XI. K MEANS CLUSTERING ALGORITHM

K Means Clustering Algorithm is frequently used for solving clustering problems. It is a form of unsupervised learning. It has the following advantages: it is computationally more efficient than hierarchical clustering when the number of variables is large. With globular clusters and small k it produces tighter clusters than hierarchical clustering. Ease of implementation and of interpretation of the clustering results are the attractions of this algorithm. The order of complexity of the algorithm is O(K*n*d), and so it is computationally efficient.

The disadvantages of the K-Means Clustering Algorithm are the following: the K value is not known in advance and is hard to predict. Performance suffers when clusters are not globular. Also, since different initial partitions result in different final clusters, initialization impacts performance. Performance degrades when the clusters in the input data differ in size and density. The uniform effect often produces clusters of relatively uniform size even if the input data have clusters of different sizes. The spherical assumption (i.e. that the joint distribution of features within each cluster is spherical) is hard to satisfy, as correlation between features breaks it and puts extra weight on correlated features. It is sensitive to outliers. It is sensitive to initial points and local optima, and there is no unique solution for a given K value, so one needs to run K-means many times (20-100 times) for a given K and then pick the result with the lowest J.

K Means Clustering algorithm can be used for document classification, customer segmentation, rideshare data analysis, automatic clustering of IT alerts, call record detail analysis and insurance fraud detection.
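
The advice in the K Means section to run the algorithm many times for a given K and keep the result with the lowest J can be sketched as follows. This is a minimal illustration only: it assumes that J denotes the within-cluster sum of squared distances, uses Lloyd's algorithm as the K-means procedure, and works on hypothetical two-dimensional toy points. None of the function names come from the paper.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iterations=50):
    """One run of Lloyd's algorithm from a random choice of initial centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # J: sum of squared distances from each point to its nearest centroid.
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return cost, centroids

def best_of_restarts(points, k, restarts=20, seed=0):
    """Run K-means several times and keep the run with the lowest J."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Two well separated, globular toy clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
best_cost, best_centroids = best_of_restarts(points, k=2)
```

On data this cleanly separated every restart should reach the same solution; the restarts matter on harder data, where different initial centroids can settle in different local optima.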
XII. BACK PROPAGATION ALGORITHM

This algorithm provides a very simple and efficient way to compute the gradient in a neural network, and one can use it in conjunction with stochastic gradient descent, which is also quite simple. There are more complex “quasi-Newton” techniques which make a better estimate of the gradient direction and step size, but they do not necessarily perform better than backprop with SGD. The Back Propagation Algorithm is used in deep learning. A Neural Network (NN) has its specific applications in different industry segments and it has its merits and demerits. NN is useful in scenarios where there are no well defined criteria or rules for finding an answer. It gives a solution, but it is difficult to explain how the solution is arrived at, and so it is like a black box. NN finds application in classification of credit ratings and in forecasting market dynamics in the financial sector. Some applications of NN in the marketing segment are: product classification, classification of customer segments (i.e. which customers will like and purchase which products), finding new markets for a specific product category, and associating relationships between customer and company. NN becomes instrumental in increasing the revenue of a business house and in increasing the percentage of responses to direct marketing. NN finds application in post offices for sorting letters/parcels based on area zip code / postal code. The merits of NN for which it is widely used in the industry segments mentioned above are the following: easy adaptation to new scenarios, fault tolerance, and the ability to handle noisy data.

The shortcomings of NN are the following: the training time of NN is very long, and to train the NN efficiently the sample sets need to be large. The Back Propagation Algorithm encounters the Moving Target Problem, which impacts its efficiency. There are a number of hidden layers in an Artificial Neural Network (ANN). Every unit within the network contributes to the overall performance of the network. But the complexity increases as all of the units are changing simultaneously and the units in an ANN layer are unable to communicate among themselves. All that an ANN unit can see are its inputs and the error signal which is propagated back to it from the ANN output. Each ANN unit tries to solve the problem defined by the error signal, and the complexity arises because this problem is changing all the time. As a result it takes a long time for these dynamics to settle down among all units in the ANN. Research has shown that as the number of hidden layers in an ANN increases, back propagation learning slows down at an exponential rate. The herd effect is one common manifestation of the moving-target problem. Other problems with Back Propagation Learning are network paralysis, local minima and slow convergence. The algorithm works by changing the weights so as to reduce the error, and this is how the “Local Minima” problem arises: if the error would have to go up as part of a more general fall, the algorithm will “get stuck” (as it cannot go uphill) and the error will stop reducing. During training, when the weights are adjusted to very large values, these large weights can force most of the units to operate at extreme values, in a region where the derivative of the activation function is very small, and this results in Network Paralysis. A multilayer ANN needs many repeated presentations of the input patterns, during which the weights are adjusted until the network settles down to an optimal solution.

XIII. CONCLUSION

In this paper an attempt was made to review the most frequently used machine learning algorithms for solving classification, regression and clustering problems. The advantages and disadvantages of these algorithms have been discussed, along with comparisons of different algorithms (wherever possible) in terms of performance, learning rate, etc. Along with that, examples of practical applications of these algorithms have been discussed. The types of machine learning techniques, namely supervised learning, unsupervised learning and semi supervised learning, have been discussed. It is expected that this will give readers the insight to make an informed decision in identifying the available options among machine learning algorithms and then selecting the appropriate algorithm for the specific problem solving context.