16 Comparison of Data Science Algorithms
Rule induction
Description: Models the relationship between input and output by deducing simple "IF/THEN" rules from a data set.
Model: A set of organized rules that contain an antecedent (inputs) and a consequent (output class).
Input: No restrictions. Accepts categorical, numeric, and binary inputs.
Output: Prediction of the target variable, which is categorical.
Pros: Model can be easily explained to business users. Easy to deploy in almost any tool or application.
Cons: Divides the data set in a rectilinear fashion.
Use cases: Manufacturing; applications where a description of the model is necessary.
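As a rough illustration: scikit-learn has no native rule-induction learner (such as RIPPER), so the sketch below extracts equivalent IF/THEN rules from a shallow decision tree as a stand-in; the data set and depth are assumptions for demonstration.

```python
# Rule-induction-style IF/THEN rules, approximated with a shallow decision
# tree (scikit-learn has no native RIPPER-style rule inducer).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned antecedent/consequent structure as readable rules.
print(export_text(tree, feature_names=load_iris().feature_names))
```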
k-Nearest neighbors
Description: A lazy learner where no model is generalized. Any new unknown data point is compared against similar known data points in the training set.
Model: The entire training data set is the model.
Input: No restrictions. However, the distance calculations work better with numeric data. Data needs to be normalized.
Output: Prediction of the target variable, which is categorical.
Pros: Requires very little time to build the model. Handles missing attributes in the unknown record gracefully. Works with nonlinear relationships. Good at handling data points belonging to different classes.
Cons: The deployment runtime and storage requirements will be expensive. Arbitrary selection of the value of k; different values of k yield different results. No description of the model.
Use cases: Image processing; applications where slower response time is acceptable.

Naïve Bayesian
Description: Predicts the output class based on Bayes' theorem by calculating class conditional probability and prior probability.
Model: A lookup table of probabilities and conditional probabilities for each attribute with an output class.
Input: No restrictions. However, the probability calculation works better with categorical attributes.
Output: Prediction of probability for all class values, along with the winning class.
Pros: Time required to deploy is minimal. Great algorithm for benchmarking. Strong statistical foundation.
Cons: The training data set needs to be a representative sample of the population and needs to have complete combinations of input and output. Attributes need to be independent.
Use cases: Spam detection; text mining.
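A minimal sketch of both classifiers side by side on a small numeric data set; the data set, split, and k = 5 are illustrative assumptions, not tuned choices.

```python
# k-NN (lazy learner) and Naive Bayes compared on the same data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalize: k-NN's distance calculations work better on comparable scales.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # k chosen arbitrarily
nb = GaussianNB().fit(X_train, y_train)  # Gaussian variant for numeric inputs

print("k-NN accuracy:", knn.score(X_test, y_test))
print("Naive Bayes accuracy:", nb.score(X_test, y_test))
```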
Linear regression
Description: The classical predictive model that expresses the relationship between inputs and an output parameter in the form of an equation.
Model: The model consists of coefficients for each input predictor and their statistical significance. A bias (intercept) may be optional.
Input: All attributes should be numeric.
Output: The label may be numeric or binominal.
Pros: The workhorse of most predictive modeling techniques. Easy to use and explain to nontechnical business users.
Cons: Cannot handle missing data. Categorical data are not directly usable and require transformation into numeric.
Use cases: Pretty much any scenario that requires predicting a continuous numeric value.
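A minimal sketch that fits the equation and reads back the coefficients and bias; the data set is an assumption, and note that scikit-learn does not report statistical significance (a statsmodels OLS fit would).

```python
# Linear regression: inspect the coefficients and intercept that make up
# the model "equation".
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)    # one weight per input predictor
print("intercept:", model.intercept_)  # the bias term
```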
Logistic regression
Description: Technically, this is a classification method, but structurally it is similar to linear regression.
Model: The model consists of coefficients for each input predictor that relate to the "logit." Transforming the logit into probabilities of occurrence (of each class) completes the model.
Input: All attributes should be numeric.
Output: The label may only be binominal.
Pros: One of the most common classification methods. Computationally efficient.
Cons: Cannot handle missing data. Not intuitive when dealing with a large number of predictors.
Use cases: Marketing scenarios (e.g., will click or not click); any general two-class problem.
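A minimal two-class sketch showing the logit-to-probability step via predict_proba; the data set is an assumption for demonstration.

```python
# Logistic regression: the fitted coefficients produce a logit, which is
# transformed into class probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

print("coefficients shape:", model.coef_.shape)  # one coefficient per predictor
print("P(class) for first record:", model.predict_proba(X[:1]))
```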
FP-Growth and Apriori
Description: Measures the strength of co-occurrence between one item and another.
Model: Simple, easy-to-understand rules like {Milk, Diaper} → {Beer}.
Input: Transactions format with items in the columns and transactions in the rows.
Output: List of relevant rules developed from the data set.
Pros: Unsupervised approach with minimal user inputs. Easy-to-understand rules.
Cons: Requires preprocessing if the input is in a different format.
Use cases: Recommendation engines, cross-selling, and content suggestions.
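A minimal sketch using the third-party mlxtend package (assumed installed); the transactions and thresholds below are illustrative assumptions.

```python
# Association rules: transactions in rows, items one-hot encoded to columns.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["Milk", "Diaper", "Beer"],
                ["Milk", "Diaper"],
                ["Milk", "Beer"],
                ["Diaper", "Beer"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```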
k-Means
Description: The data set is divided into k clusters by finding k centroids.
Model: The algorithm finds k centroids, and all the data points are assigned to the nearest centroid, which forms a cluster.
Input: No restrictions. However, the distance calculations work better with numeric data. Data should be normalized.
Output: The data set is appended by one of the k cluster labels.
Pros: Simple to implement. Can be used for dimension reduction.
Cons: Specification of k is arbitrary and may not find natural clusters. Sensitive to outliers.
Use cases: Customer segmentation; anomaly detection; applications where globular clustering is natural.
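A minimal sketch: normalize first, then cluster; k = 3 is an arbitrary assumption, echoing the con noted above.

```python
# k-Means: find k centroids and append cluster labels to the data set.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # distance-based: normalize first

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:", kmeans.cluster_centers_)
print("cluster labels appended to data set:", kmeans.labels_[:10])
```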
DBSCAN
Description: Identifies clusters as a high-density area surrounded by low-density areas.
Model: List of clusters and assigned data points. Default cluster 0 contains noise points.
Input: No restrictions. However, the distance calculations work better with numeric data. Data should be normalized.
Output: Cluster labels based on identified clusters.
Pros: Finds natural clusters of any shape. No need to mention the number of clusters.
Cons: Specification of density parameters. A bridge between two clusters can merge the clusters. Cannot cluster varying-density data sets.
Use cases: Applications where clusters are nonglobular shapes and the prior number of natural groupings is not known.
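A minimal sketch on a nonglobular data set; the density parameters (eps, min_samples) are assumptions, and note that scikit-learn marks noise points with label -1 rather than a default cluster 0.

```python
# DBSCAN: no cluster count needed, but density parameters must be given.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # nonglobular shapes
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("clusters found:", set(labels))  # label -1 marks noise points here
```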
Self-organizing maps
Description: A visual clustering technique with roots in neural networks and prototype clustering.
Model: A two-dimensional lattice where similar data points are arranged next to each other.
Input: No restrictions. However, the distance calculations work better with numeric data. Data should be normalized.
Output: No explicit clusters identified. Similar data points occupy either the same cell or are placed next to each other in the neighborhood.
Pros: A visual way to explain the clusters. Reduces multidimensional data to two dimensions.
Cons: The number of centroids (topology) is specified by the user. Does not find natural clusters in the data.
Use cases: Diverse applications including visual data exploration, content suggestions, and dimension reduction.
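A minimal sketch using the third-party MiniSom package (assumed installed); the 10x10 lattice, learning rate, and iteration count are user-chosen assumptions.

```python
# Self-organizing map: similar records land in the same or nearby cells.
from minisom import MiniSom
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # distance-based: normalize first

som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)  # 1000 training iterations

print("lattice cell for first record:", som.winner(X[0]))
```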
Distance-based outlier detection
Description: Outliers are identified based on the distance to the kth nearest neighbor.
Model: All data points are assigned a distance score based on the nearest neighbor.
Input: Accepts both numeric and categorical attributes. Normalization is required since distance is calculated.
Output: Every data point has a distance score. The higher the distance, the more likely the data point is an outlier.
Pros: Easy to implement. Works well with numeric attributes.
Cons: Specification of k is arbitrary.
Use cases: Fraud detection; preprocessing technique.
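A minimal sketch that scores each point by its distance to the kth nearest neighbor; the planted outlier and k = 5 are assumptions for demonstration.

```python
# Distance-based outlier score: distance to the k-th nearest neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one planted outlier
X = StandardScaler().fit_transform(X)  # normalize before distances

k = 5
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
score = dist[:, k]  # distance to the k-th neighbor (column 0 is the point itself)
print("most outlying record:", np.argmax(score))
```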
Density-based outlier detection
Description: Outliers are identified based on data points in low-density regions.
Model: All data points are assigned a density score based on the neighborhood.
Input: Accepts both numeric and categorical attributes. Normalization is required since density is calculated.
Output: Every data point has a density score. The lower the density, the more likely the data point is an outlier.
Pros: Easy to implement. Works well with numeric attributes.
Cons: Specification of the distance parameter by the user. Inability to identify varying-density regions.
Use cases: Fraud detection; preprocessing technique.
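A rough sketch of the idea: here the neighborhood density is approximated by the inverse of the average distance to the 10 nearest neighbors (a k-neighbor stand-in for the user-specified distance parameter).

```python
# Density score: low density (high average neighbor distance) flags outliers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[6.0, 6.0]]])  # one planted outlier

dist, _ = NearestNeighbors(n_neighbors=11).fit(X).kneighbors(X)
density = 1.0 / dist[:, 1:].mean(axis=1)  # skip column 0 (the point itself)
print("lowest-density record:", np.argmin(density))
```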
Local outlier factor
Description: Outliers are identified based on the calculation of relative density in the neighborhood.
Model: All data points are assigned a relative density score based on the neighborhood.
Input: Accepts both numeric and categorical attributes. Normalization is required since density is calculated.
Output: Every data point has a relative density score. The lower the relative density, the more likely the data point is an outlier.
Pros: Can handle the varying-density scenario.
Cons: Specification of the distance parameter by the user.
Use cases: Fraud detection; preprocessing technique.
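A minimal sketch with scikit-learn's LocalOutlierFactor on data of varying density; the two regions and the neighborhood size are assumptions for demonstration.

```python
# Local outlier factor: relative density in the neighborhood flags outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # dense region
               rng.normal(10, 4, (100, 2)),  # sparser region (varying density)
               [[5.0, 5.0]]])                # point between the two

lof = LocalOutlierFactor(n_neighbors=20)     # neighborhood size is a user choice
labels = lof.fit_predict(X)                  # -1 flags outliers
print("flagged outliers:", np.where(labels == -1)[0])
```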
Recurrent neural networks
Description: Just as conv nets are specialized for analyzing spatially correlated data, RNNs are specialized for sequentially correlated data.
Input: A sequence of any type (time series, text, speech, etc.).
Output: A sequence.
Pros: Unlike other types of neural networks, RNNs can process sequences and output other sequences.
Cons: RNNs suffer from vanishing (or exploding) gradients when sequences are long.
Use cases: Forecasting time series; natural language processing tasks such as machine translation and image captioning.
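A minimal NumPy sketch of a single recurrent step; the weight shapes and sequence length are assumptions. The same hidden-to-hidden weights are reused at every position, and repeated multiplication through them is also why gradients can vanish or explode over long sequences.

```python
# One recurrent cell, applied step by step along a toy sequence.
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))      # input -> hidden weights (assumed sizes)
W_hh = rng.normal(size=(4, 4))      # hidden -> hidden, reused across time

h = np.zeros(4)                     # initial hidden state
sequence = rng.normal(size=(5, 3))  # a toy sequence of 5 steps

for x_t in sequence:                # process the sequence one element at a time
    h = np.tanh(x_t @ W_xh + h @ W_hh)
print("final hidden state:", h)
```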
Matrix factorization
Description: Decomposes the user-item rating matrix into two matrices (P and Q) with latent factors, and fills the blank values in the ratings matrix by the dot product of P and Q.
Assumption: A user's preference for an item can be better explained by their preference for an item's character (inferred).
Input: User-item rating matrix.
Output: Completed ratings matrix.
Pros: More accurate than neighborhood-based collaborative filtering.
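A minimal sketch: learn P (users x factors) and Q (items x factors) by gradient descent on the observed ratings only; the toy matrix, factor count, and learning rates are assumptions.

```python
# Matrix factorization: fill blanks in R via the dot product of P and Q.
import numpy as np

R = np.array([[5, 3, 0, 1],   # user-item ratings; 0 = unknown
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

rng = np.random.default_rng(0)
k = 2                          # number of latent factors (assumed)
P = rng.normal(scale=0.1, size=(R.shape[0], k))
Q = rng.normal(scale=0.1, size=(R.shape[1], k))

lr, reg = 0.01, 0.02
for _ in range(5000):
    for u, i in zip(*R.nonzero()):        # only observed ratings contribute
        err = R[u, i] - P[u] @ Q[i]
        p_u = P[u].copy()                 # update both factors from old values
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])

print(np.round(P @ Q.T, 2))               # completed ratings matrix
```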
Content-based (supervised learning models)
Description: A personalized classification or regression model for every single user in the system. Learns a classifier based on a user's likes or dislikes of an item and its relationship with the item attributes.
Assumption: Every time a user prefers an item, it is a vote of preference for the item attributes.
Input: User-item rating matrix and item profile.
Output: Completed ratings matrix.
Pros: Every user has a separate model and could be independently customized. Hyper-personalization.
Cons: Storage and computational time.
Use cases: eCommerce, content, and connection recommendations.
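A minimal sketch of one user's model: a classifier trained on item attributes (the item profile) with that user's likes and dislikes as labels; the attribute flags and votes below are illustrative assumptions.

```python
# Content-based recommendation: one classifier per user, over item attributes.
import numpy as np
from sklearn.linear_model import LogisticRegression

item_profile = np.array([[1, 0, 1],   # e.g., genre flags per item (assumed)
                         [1, 1, 0],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
likes = np.array([1, 1, 0, 0, 1])     # one user's like/dislike votes

user_model = LogisticRegression().fit(item_profile, likes)

new_item = [[0, 1, 0]]                # an unseen item's attributes
print("P(user likes new item):", user_model.predict_proba(new_item)[0, 1])
```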
Decomposition
Description: Decompose the time series into trend, seasonality, and noise. Forecast the components.
Model: Models for the individual components.
Input: Historical value.
Output: Forecasted value.
Pros: Increased understanding of the time series by visualizing the components.
Cons: Accuracy depends on the models used for the components.
Use cases: Applications where the explanation of components is important.
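A minimal sketch with statsmodels on a synthetic monthly series; period=12 (monthly seasonality) and the additive model are assumptions.

```python
# Decompose a series into trend, seasonality, and residual (noise).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(48)
series = pd.Series(10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12),
                   index=pd.date_range("2020-01-01", periods=48, freq="MS"))

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())  # each component can be forecast separately
```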
Exponential smoothing
Description: The future value is the function of past observations.
Model: Learn the parameters of the smoothing equation from the historical data.
Input: Historical value.
Output: Forecasted value.
Pros: Applies to a wide range of time series with or without trend or seasonality.
Cons: Multiple seasonality in the data makes the models cumbersome.
Use cases: Cases where trend or seasonality is not evident.
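A minimal Holt-Winters sketch with statsmodels; the additive trend/seasonality and the synthetic series are assumptions, with the smoothing parameters learned from history.

```python
# Exponential smoothing: learn smoothing parameters, then forecast.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(48)
series = pd.Series(10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12),
                   index=pd.date_range("2020-01-01", periods=48, freq="MS"))

fit = ExponentialSmoothing(series, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
print(fit.forecast(6))  # next six forecasted values
```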
ARIMA (autoregressive integrated moving average)
Description: The future value is the function of autocorrelated past data points and the moving average of the predictions.
Model: Parameter values for (p,d,q), AR, and MA coefficients.
Input: Historical value.
Output: Forecasted value.
Pros: Forms a statistical baseline for model accuracy.
Cons: The optimal (p,d,q) value is unknown to begin with.
Use cases: Applies to almost all types of time series data.
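A minimal statsmodels sketch; order=(1, 1, 1) is an assumed starting point, since the optimal (p,d,q) is unknown up front and is usually tuned (e.g., by AIC).

```python
# ARIMA: fit an assumed (p, d, q) order and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.2, 1.0, 120)))  # trending toy series

fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(5))  # next five forecasted values
```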
PCA (principal component analysis, filter-based)
Description: Combines the most important attributes into a fewer number of transformed attributes.
Model: Each principal component is a function of the attributes in the data set.
Input: Numerical attributes.
Output: Numerical attributes (reduced set). Does not really require a label.
Pros: Efficient way to extract predictors that are uncorrelated to each other. Helps to apply the Pareto principle in identifying attributes with the highest variance.
Cons: Sensitive to scaling effects, i.e., requires normalization of attribute values before application. Focus on variance sometimes results in selecting noisy attributes.
Use cases: Most numeric-valued data sets that require dimension reduction.
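A minimal sketch: normalize first (PCA is sensitive to scaling), then keep the two components capturing the most variance; the data set and component count are assumptions.

```python
# PCA: reduce numeric attributes to a few uncorrelated components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # no label needed
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_)
```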
Info gain (filter-based)
Description: Selecting attributes based on relevance to the target or label.
Model: Similar to the decision tree model.
Input: No restrictions on variable type for predictors.
Output: Same as decision trees.
Pros: Same as decision trees.
Cons: Data sets require a label. Can only be applied on data sets with a nominal label.
Use cases: Applications for feature selection where the target variable is categorical or numeric.
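A minimal filter-style sketch: mutual information (closely related to information gain) scores each predictor's relevance to a nominal label, independent of any downstream model; the data set is an assumption.

```python
# Info-gain-style filter: score attribute relevance to the label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
print("relevance score per attribute:", scores.round(3))
```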
Chi-square (filter-based)
Description: Selecting attributes based on relevance to the target or label.
Model: Uses the chi-square test of independence to relate predictors to the label.
Input: Categorical (polynominal) attributes.
Pros: Extremely robust. A fast and efficient scheme to identify which categorical variables to select for a predictive model.
Cons: Data sets require a label. Can only be applied on data sets with a nominal label. Sometimes difficult to interpret.
Use cases: Applications for feature selection where all variables are categorical.
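A minimal sketch ranking predictors by the chi-square statistic against a nominal label; the data set (non-negative values, as chi2 requires) and k are assumptions.

```python
# Chi-square filter: rank predictors against a nominal label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # chi2 requires non-negative values
selector = SelectKBest(chi2, k=2).fit(X, y)
print("chi-square scores:", selector.scores_.round(2))
print("selected attribute indices:", selector.get_support(indices=True))
```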
Backward elimination (wrapper-based)
Description: Selecting attributes based on relevance to the target or label.
Model: Works in conjunction with modeling methods such as regression.
Input: All attributes should be numeric.
Output: The label may be numeric or binominal.
Pros: Multicollinearity problems can be avoided. Speeds up the training phase of the modeling process.
Cons: Need to begin with a full model, which can sometimes be computationally intensive.
Use cases: Data sets with few input variables where feature selection is required.
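A minimal wrapper-based sketch: start from the full model and eliminate the least useful predictors around a regression learner; the data set and the target of 5 kept attributes are assumptions.

```python
# Backward elimination wrapped around linear regression.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction="backward").fit(X, y)
print("kept attribute indices:", selector.get_support(indices=True))
```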