Data Science Classification Etc

The document discusses data science classification and algorithms. It describes supervised and unsupervised learning models, classification and regression techniques, and clustering. Common data science algorithms are also presented for tasks like classification, regression, anomaly detection, time series forecasting, clustering, and recommendation engines. Finally, it outlines the typical data science process and some common process frameworks.

Data Science Classification

Data Science and Visualization


22AD2202
Data Science Classification

Data Science and Visualization


[20AD2202]
Data Science Classification
• Data science problems can be broadly categorized into supervised or unsupervised learning models.
• Supervised techniques predict the value of the output variables based on a set of input variables.
• To do this, a model is developed from a training dataset where the values of input and output are previously known.
• The output variable that is being predicted is also called a class label or target variable.
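
As a minimal sketch of the supervised idea, the example below predicts a class label from a small training dataset whose inputs and labels are already known. The data, the attribute meanings, and the choice of a 1-nearest-neighbor rule are illustrative assumptions, not something taken from the slides.

```python
import math

# Hypothetical training dataset: inputs are (hours studied, hours slept),
# and the class label for each record is already known.
training_data = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 7.0), "pass"),
    ((8.0, 8.0), "pass"),
]

def predict(inputs, training_data):
    """Return the class label of the closest training record (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data, key=lambda record: math.dist(record[0], inputs)
    )
    return nearest_label

print(predict((7.0, 6.5), training_data))
print(predict((1.5, 4.5), training_data))
```

The "model" here is simply the stored training data plus a distance rule; other supervised learners generalize from the same kind of labeled records in different ways.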

Data Science Classification
• Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
• In unsupervised data science, there are no output variables to predict.
• The objective of this class of data science techniques is to find patterns in data based on the relationships between the data points themselves.
• An application can employ both supervised and unsupervised learners.
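
To illustrate finding patterns without any output variable, here is a small sketch of k-means on one-dimensional data. The spending values, the function name `k_means_1d`, and the choice of starting centroids are assumptions made for the example.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Group 1-D values around centroids; no labels are used."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Assignments are stable, so the grouping has converged.
        centroids = new_centroids
    return centroids, clusters

# Hypothetical monthly spend per customer; there is no target column.
spend = [12.0, 15.0, 14.0, 95.0, 102.0, 99.0]
centroids, clusters = k_means_1d(spend, centroids=[spend[0], spend[3]])
print(clusters)
```

The two groups emerge purely from how close the data points are to each other, which is the relationship the slide refers to.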

Data Science Classification
• Classification and regression techniques predict a target variable based on input variables.
• The prediction is based on a generalized model built from a previously known dataset.
• In classification tasks, the output variable is categorical; in regression tasks, the output variable is numeric.
• Deep learning uses more sophisticated artificial neural networks and is increasingly used for classification and regression problems.
• Clustering is the process of identifying the natural groupings in a dataset.
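
A regression task with a numeric output can be sketched with ordinary least squares on a single input variable. The house-size data and the helper name `fit_line` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical known dataset: house size (in units of 100 square meters) and price.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 5.0 + intercept  # numeric prediction for an unseen size
print(slope, intercept, predicted)
```

The fitted line is the generalized model; applying it to a new input yields a number rather than a class.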

Data Science Algorithms
• An algorithm is a logical step-by-step procedure for solving a problem; in data science, it is the blueprint for how a particular data problem is solved.
• Many of the learning algorithms are recursive or iterative, where a set of steps is repeated many times until a limiting condition is met.
• Some algorithms also contain a random variable as an input and are aptly called randomized algorithms.
• A classification task can be solved using many different learning algorithms such as decision trees, artificial neural networks, k-NN, and even some regression algorithms.
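
The two properties mentioned above, repeating steps until a limiting condition is met and taking a random input, can be seen together in a small gradient descent sketch. The function name, learning rate, tolerance, and seed are assumptions for illustration.

```python
import random

def estimate_center(values, learning_rate=0.1, tolerance=1e-9, max_steps=10_000, seed=0):
    """Find the value that minimizes mean squared error to `values`."""
    rng = random.Random(seed)
    # Randomized input: the starting guess is drawn at random.
    estimate = rng.uniform(min(values), max(values))
    for step in range(max_steps):
        # Gradient of the mean squared error with respect to the estimate.
        gradient = sum(2 * (estimate - v) for v in values) / len(values)
        new_estimate = estimate - learning_rate * gradient
        # Limiting condition: stop once the update becomes negligible.
        if abs(new_estimate - estimate) < tolerance:
            return new_estimate, step + 1
        estimate = new_estimate
    return estimate, max_steps

center, steps = estimate_center([2.0, 4.0, 6.0, 8.0])
print(center, steps)
```

Whatever the random starting point, the loop settles close to the mean of the values, which is the minimizer of the squared error.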

Data Science Algorithms
• The choice of which algorithm to use depends on the type of dataset, objective, structure of the data, presence of outliers, available computational power, number of records, number of attributes, and so on.
• It is up to the data science practitioner to decide which algorithm(s) to use by evaluating the performance of multiple algorithms.
• There have been hundreds of algorithms developed in the last few decades to solve data science problems.
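
Evaluating the performance of multiple algorithms can be as simple as scoring each one on records held back from training. The sketch below compares a majority-class baseline with a 1-nearest-neighbor rule; the data and both function names are hypothetical.

```python
import math
from collections import Counter

def majority_class(train, inputs):
    """Baseline: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def nearest_neighbor(train, inputs):
    """1-nearest neighbor: copy the label of the closest training record."""
    return min(train, key=lambda record: math.dist(record[0], inputs))[1]

def accuracy(algorithm, train, test):
    """Share of held-out records whose label the algorithm predicts correctly."""
    correct = sum(algorithm(train, inputs) == label for inputs, label in test)
    return correct / len(test)

# Hypothetical labeled records, split into a training set and a held-out test set.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((8, 8), "b"), ((9, 8), "b")]
test = [((2, 2), "a"), ((8, 9), "b"), ((9, 9), "b"), ((1, 3), "a")]

scores = {
    algorithm.__name__: accuracy(algorithm, train, test)
    for algorithm in (majority_class, nearest_neighbor)
}
best = max(scores, key=scores.get)
print(scores, best)
```

On real data the comparison would use larger samples and often several metrics, but the principle of judging candidates on the same unseen records is the same.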

Data Science Algorithms
• Data science algorithms can be implemented by custom-developed computer programs in almost any computer language.
• This is obviously a time-consuming task.
• To keep the focus on the data and the algorithms, practitioners can leverage data science or statistical programming tools such as R, RapidMiner, Python, and SAS Enterprise Miner, which implement these algorithms with ease.
• These data science tools offer a library of algorithms as functions, which can be interfaced through programming code or configured through graphical user interfaces.
Data Science Algorithms
| Tasks | Description | Algorithms | Examples |
|---|---|---|---|
| Classification | Predict if a data point belongs to one of the predefined classes. The prediction will be based on learning from a known dataset. | Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors | Assigning voters into known buckets by political parties, e.g., soccer moms. Bucketing new customers into one of the known customer groups. |
| Regression | Predict the numeric target label of a data point. The prediction will be based on learning from a known dataset. | Linear regression, logistic regression | Predicting the unemployment rate for the next year. Estimating insurance premium. |
| Anomaly detection | Predict if a data point is an outlier compared to other data points in the dataset. | Distance-based, density-based, LOF | Detecting fraudulent credit card transactions and network intrusion. |
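
The distance-based approach to anomaly detection can be sketched by scoring each point with the distance to its k-th nearest neighbor, so that isolated points get high scores. The transaction data, the value of `k`, and the function name are assumptions for this example.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, point in enumerate(points):
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores

# Hypothetical transactions as (amount, items); the last one is far from the rest.
transactions = [(10, 1), (11, 1), (10, 2), (12, 2), (11, 3), (95, 40)]

scores = knn_distance_scores(transactions, k=2)
outlier_index = max(range(len(scores)), key=scores.__getitem__)
print(transactions[outlier_index])
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.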

| Tasks | Description | Algorithms | Examples |
|---|---|---|---|
| Time series forecasting | Predict the value of the target variable for a future timeframe based on historical values. | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated. |
| Clustering | Identify natural clusters within the dataset based on inherent properties within the dataset. | k-Means, density-based clustering (e.g., DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data. |
| Association analysis | Identify relationships within an item set based on transaction data. | FP-growth algorithm, Apriori algorithm | Finding cross-selling opportunities for a retailer based on transaction purchase history. |
| Recommendation engines | Predict the preference of an item for a user. | Collaborative filtering, content-based filtering, hybrid recommenders | Finding the top recommended movies for a user. |
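
Of the forecasting algorithms listed, simple exponential smoothing is the easiest to sketch: each new observation is blended with the running level, and the final level serves as the one-step-ahead forecast. The sales figures and the smoothing factor `alpha` are hypothetical.

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = history[0]
    for observation in history[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * observation + (1 - alpha) * level
    return level

# Hypothetical monthly sales history.
monthly_sales = [100.0, 110.0, 105.0, 120.0]
forecast = exponential_smoothing_forecast(monthly_sales, alpha=0.5)
print(forecast)
```

A larger `alpha` makes the forecast react faster to recent values; a smaller one smooths more heavily. This basic form does not model trend or seasonality.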

Data Science process
• The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
• The standard data science process involves:
  (1) understanding the problem,
  (2) preparing the data samples,
  (3) developing the model,
  (4) applying the model on a dataset to see how the model may work in the real world, and
  (5) deploying and maintaining the models.

Data Science Process Frameworks
• CRISP-DM: Cross Industry Standard Process for Data Mining
• SEMMA: Sample, Explore, Modify, Model, and Assess
• DMAIC: Define, Measure, Analyze, Improve, and Control

Cross Industry Standard Process for
Data Mining (CRISP-DM)

Data Science process

Prior knowledge
• Prior knowledge refers to information that is already known about a subject.
• The data science problem doesn’t emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known.
• The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem.
Objective
• The data science process starts with a need for analysis, a question, or a business objective.
• This is possibly the most important step in the data science process.
• Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
• Because the process is iterative, it is common to go back to previous data science process steps and revise the assumptions, approach, and tactics.

Subject Area
• The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes.
• The problem is that it uncovers a lot of patterns, and many of them are not meaningful.
• False or spurious signals are a major problem in the data science process.
• It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to answering the objective question.
• Hence, it is essential to know the subject matter, the context, and the business process generating the data.
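
How easily spurious signals appear can be shown with a small simulation: when many attributes are compared against a target on only a few records, some of them correlate strongly by chance alone. Everything below is synthetic noise generated with a fixed seed; the record and attribute counts are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_records, n_attributes = 10, 200

target = [rng.gauss(0, 1) for _ in range(n_records)]
# Every attribute is pure noise, so any correlation with the target is spurious.
attributes = [[rng.gauss(0, 1) for _ in range(n_records)] for _ in range(n_attributes)]

strongest = max(abs(pearson(column, target)) for column in attributes)
print(f"strongest absolute correlation among noise attributes: {strongest:.2f}")
```

With this few records and this many candidate attributes, the strongest correlation is typically well above 0.5 even though no real relationship exists, which is why subject matter knowledge is needed to decide which patterns to trust.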


Data
• Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
• This part of the step surveys all the data available to answer the business question and narrows down the new data that needs to be sourced.
• There is quite a range of factors to consider: the quality, quantity, and availability of the data, gaps in the data, and whether a lack of data compels the practitioner to change the business question.
• The objective of this step is to come up with a dataset to answer the business question through the data science process.
• It is critical to recognize that an inferred model is only as good as the data used to create it.
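
One of the factors above, gaps in the data, is easy to quantify early on. The sketch below reports the share of missing values per attribute; the records, the attribute names, and the use of `None` to mark a gap are assumptions for the example.

```python
# Hypothetical records; None marks a gap in the data.
records = [
    {"customer_id": 1, "age": 34, "income": 52000},
    {"customer_id": 2, "age": None, "income": 61000},
    {"customer_id": 3, "age": 45, "income": None},
    {"customer_id": 4, "age": None, "income": 48000},
]

def missing_share(records):
    """Fraction of records with a missing value, per attribute."""
    attributes = records[0].keys()
    return {
        name: sum(record[name] is None for record in records) / len(records)
        for name in attributes
    }

gaps = missing_share(records)
print(gaps)
```

A high share of missing values in an attribute may mean sourcing new data or, as the slide notes, revisiting the business question itself.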
• A dataset (example set) is a collection of data with a defined structure, sometimes called a “data frame”.
• A data point (record, object, or example) is a single instance in the dataset. Each instance has the same structure as the dataset.
• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset.
• Attributes can be numeric, categorical, date-time, text, or Boolean data types.
• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
• Identifiers are special attributes that are used for locating or providing context to individual records.
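
These terms can be mapped onto a small, hypothetical loan dataset. The class name `LoanApplication`, its fields, and the helper `split_record` are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class LoanApplication:
    """One data point (record); every record shares this structure."""
    application_id: int    # identifier: locates the record, not used for prediction
    credit_score: float    # numeric attribute
    employment_type: str   # categorical attribute
    has_mortgage: bool     # Boolean attribute
    approved: str          # label: the attribute to be predicted

# The dataset is the collection of records with the defined structure.
dataset = [
    LoanApplication(1001, 720.0, "salaried", True, "yes"),
    LoanApplication(1002, 580.0, "self-employed", False, "no"),
]

def split_record(record):
    """Separate the input attributes from the label, dropping the identifier."""
    inputs = (record.credit_score, record.employment_type, record.has_mortgage)
    return inputs, record.approved

inputs, label = split_record(dataset[0])
print(inputs, label)
```

Keeping identifiers out of the inputs matters because they carry no predictive information and a model could otherwise memorize them.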
