Data Science Classification Etc

The document discusses data science classification and algorithms. It describes supervised and unsupervised learning models, classification and regression techniques, and clustering. Common data science algorithms are also presented for tasks like classification, regression, anomaly detection, time series forecasting, clustering, and recommendation engines. Finally, it outlines the typical data science process and some common process frameworks.

Uploaded by

Elangovan GuruvaReddy

Data Science Classification

Data Science and Visualization

22AD2202
Data Science Classification
• Data science techniques can be broadly classified into supervised or unsupervised learning models.
• Supervised techniques predict the value of the output variables based on a set of input variables.
• To do this, a model is developed from a training dataset where the values of input and output are previously known.
• The output variable that is being predicted is also called a class label or target variable.
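As a rough illustration of the supervised setup described above, the sketch below builds a trivial nearest-neighbor "model" from a training dataset whose inputs and class labels are already known, then predicts the label of a new data point. The data values, the two input variables (height and weight), and the function name `predict_label` are invented for this example and do not come from the slides.

```python
import math

# Training dataset: inputs (height_cm, weight_kg) with known class labels.
# All values are invented for illustration.
training_data = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def predict_label(inputs, training):
    """Return the class label of the closest training example (1-nearest neighbor)."""
    nearest = min(training, key=lambda example: math.dist(inputs, example[0]))
    return nearest[1]

# Predict the target variable for a data point that was not in the training set.
prediction = predict_label((178.0, 80.0), training_data)
print(prediction)
```

Nearest neighbor is used here only because it is short; any supervised learner follows the same pattern of learning from known input/output pairs and then predicting the output for new inputs.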
Data Science Classification
• Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
• In unsupervised data science, there are no output variables to predict.
• The objective of this class of data science techniques is to find patterns in data based on the relationships between the data points themselves.
• An application can employ both supervised and unsupervised learners.
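To make the idea of finding patterns without an output variable concrete, here is a minimal sketch that groups unlabeled numbers purely by how close they are to one another. It is a deliberately simplified grouping rule chosen for brevity, not a specific algorithm named in the slides; the data, the `max_gap` parameter, and the function name `group_by_gaps` are assumptions for illustration.

```python
def group_by_gaps(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new group."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)   # close to the previous value: same group
        else:
            groups.append([value])     # far from the previous value: new group
    return groups

# Unlabeled data: invented daily spending amounts, with no output variable.
spending = [12, 15, 11, 98, 102, 14, 95, 250]
groups = group_by_gaps(spending, max_gap=10)
print(groups)
```

No labels are supplied at any point; the groups emerge only from the distances between the data points.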
Data Science Classification
• Classification and regression techniques predict a target variable based on input variables.
• The prediction is based on a generalized model built from a previously known dataset.
• In regression tasks, the output variable is numeric; in classification tasks, it is categorical.
• Deep learning is a more sophisticated form of artificial neural network that is increasingly used for classification and regression problems.
• Clustering is the process of identifying the natural groupings in a dataset.
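A regression task with a numeric output can be sketched with ordinary least squares on a single input variable. The example below fits a line to a small known dataset and uses it to predict a value for an unseen input. The dataset (years of experience versus salary) and the function name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Previously known dataset (invented): years of experience -> salary in thousands.
experience = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_line(experience, salary)
predicted = slope * 6 + intercept  # numeric prediction for an unseen input
print(slope, intercept, predicted)
```

The fitted line is the "generalized model"; the prediction for 6 years of experience is a number rather than a class.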
Data Science Algorithms
• In data science, an algorithm is the blueprint for how a particular data problem is solved.
• Many learning algorithms are iterative: a set of steps is repeated many times until a limiting condition is met.
• Some algorithms also take a random variable as an input and are aptly called randomized algorithms.
• A classification task can be solved using many different learning algorithms such as decision trees, artificial neural networks, k-NN, and even some regression algorithms.
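The two properties mentioned above, repeating steps until a limiting condition is met and using a random input, can both be seen in a tiny one-dimensional k-means sketch. This is an illustrative simplification under assumed data; the function name `k_means_1d`, the `seed` parameter, and the iteration cap are choices made for this example.

```python
import random

def k_means_1d(points, k, seed=0, max_iterations=100):
    """Tiny 1-D k-means: repeat assign/update until the centroids stop changing."""
    rng = random.Random(seed)            # the random input of a randomized algorithm
    centroids = rng.sample(points, k)    # random starting centroids
    for _ in range(max_iterations):      # limiting condition 1: iteration cap
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep old centroid if cluster is empty
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # limiting condition 2: convergence
            break
        centroids = new_centroids
    return sorted(centroids)

centroids = k_means_1d([1, 2, 3, 10, 11, 12], k=2)
print(centroids)
```

Different seeds give different starting centroids, but on this well-separated toy data the loop settles on the same two cluster centers.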
Data Science Algorithms
• The choice of which algorithm to use depends on the type of dataset, the objective, the structure of the data, the presence of outliers, available computational power, number of records, number of attributes, and so on.
• It is up to the data science practitioner to decide which algorithm(s) to use by evaluating the performance of multiple algorithms.
• Hundreds of algorithms have been developed in the last few decades to solve data science problems.
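Evaluating the performance of multiple algorithms can be sketched as follows: two candidate models are scored on the same labeled data with leave-one-out evaluation, and the better one is selected. The dataset, the two candidate models, and the choice of leave-one-out accuracy as the metric are all assumptions made for this example.

```python
from collections import Counter

# Invented labeled dataset: (hours_studied, outcome)
data = [(1, "fail"), (2, "fail"), (3, "fail"), (4, "fail"),
        (6, "pass"), (7, "pass"), (8, "pass")]

def majority_class(train, x):
    """Baseline: always predict the most common label in the training data."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def nearest_neighbor(train, x):
    """1-NN: predict the label of the closest training record."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def leave_one_out_accuracy(model, records):
    """Hold out each record in turn, train on the rest, and score the prediction."""
    hits = 0
    for i, (x, label) in enumerate(records):
        train = records[:i] + records[i + 1:]
        hits += model(train, x) == label
    return hits / len(records)

scores = {
    "majority_class": leave_one_out_accuracy(majority_class, data),
    "nearest_neighbor": leave_one_out_accuracy(nearest_neighbor, data),
}
best = max(scores, key=scores.get)
print(scores, best)
```

In practice the metric and the evaluation scheme depend on the objective; accuracy on a seven-record toy set is only meant to show the comparison pattern.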
Data Science Algorithms
• Data science algorithms can be implemented by custom-developed computer programs in almost any computer language.
• This is a time-consuming task.
• To focus the appropriate amount of time on data and algorithms, practitioners can leverage data science or statistical programming tools such as R, RapidMiner, Python, and SAS Enterprise Miner, which implement these algorithms with ease.
• These tools offer a library of algorithms as functions, which can be accessed through programming code or configured through graphical user interfaces.
Data Science Algorithms
Tasks | Description | Algorithms | Examples
Classification | Predict if a data point belongs to one of the predefined classes. The prediction will be based on learning from a known dataset. | Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors | Assigning voters into known buckets by political parties, e.g., soccer moms; bucketing new customers into one of the known customer groups
Regression | Predict the numeric target label of a data point. The prediction will be based on learning from a known dataset. | Linear regression, logistic regression | Predicting the unemployment rate for the next year; estimating insurance premium
Anomaly detection | Predict if a data point is an outlier compared to other data points in the dataset | Distance-based, density-based, LOF | Detecting fraudulent credit card transactions and network intrusion
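The distance-based approach to anomaly detection listed in the table can be sketched by scoring each data point with the distance to its k-th nearest neighbor and flagging the point with the largest score. The transaction amounts, the value of `k`, and the function name `knn_distance_scores` are assumptions for illustration.

```python
def knn_distance_scores(values, k=2):
    """Outlier score of each value = distance to its k-th nearest neighbor."""
    scores = []
    for i, v in enumerate(values):
        distances = sorted(abs(v - other) for j, other in enumerate(values) if j != i)
        scores.append(distances[k - 1])
    return scores

# Invented transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 500, 18]
scores = knn_distance_scores(amounts, k=2)
outlier = amounts[scores.index(max(scores))]
print(outlier)
```

Points in a dense region have near neighbors and low scores, while an isolated point has a large score.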
Tasks | Description | Algorithms | Examples
Time series forecasting | Predict the value of the target variable for a future timeframe based on historical values | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated
Clustering | Identify natural clusters within the dataset based on inherent properties within the dataset | k-Means, density-based clustering (e.g., DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data
Association analysis | Identify relationships within an item set based on transaction data | FP-growth algorithm, Apriori algorithm | Finding cross-selling opportunities for a retailer based on transaction purchase history
Recommendation engines | Predict the preference of an item for a user | Collaborative filtering, content-based filtering, hybrid recommenders | Finding the top recommended movies for a user
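Of the algorithms in the table, simple exponential smoothing is short enough to sketch directly: each new observation is blended with the running level, and the final level serves as the forecast for the next period. The sales figures, the smoothing factor `alpha=0.5`, and initializing the level with the first observation are assumptions for this example.

```python
def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = history[0]                      # initialize with the first observation
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level

# Invented monthly sales figures.
sales = [100, 110, 120, 130]
forecast = exponential_smoothing_forecast(sales, alpha=0.5)
print(forecast)
```

A larger `alpha` weights recent observations more heavily; this simple form does not model trend, so it lags behind steadily growing series like the one above.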
Data Science process
• The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
• The standard data science process involves:
  (1) Understanding the problem,
  (2) Preparing the data samples,
  (3) Developing the model,
  (4) Applying the model on a dataset to see how the model may work in the real world,
  (5) Deploying and maintaining the models.
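Steps (2) to (4) above can be sketched in a few lines: prepare a sample, develop a model on part of it, and apply the model to held-out records to estimate how it may behave on unseen data. The temperature records, the 6/2 split, and the midpoint-threshold model are all invented for illustration.

```python
# (2) Prepare the data: invented records of (temperature_c, label).
records = [(30, "hot"), (12, "cold"), (28, "hot"), (8, "cold"),
           (33, "hot"), (10, "cold"), (27, "hot"), (14, "cold")]
train, test = records[:6], records[6:]   # hold out the last two records

# (3) Develop the model: a threshold midway between the two class means.
def mean(values):
    return sum(values) / len(values)

hot_mean = mean([t for t, label in train if label == "hot"])
cold_mean = mean([t for t, label in train if label == "cold"])
threshold = (hot_mean + cold_mean) / 2

def model(temperature):
    return "hot" if temperature > threshold else "cold"

# (4) Apply the model to data it has not seen to estimate real-world behavior.
correct = sum(model(t) == label for t, label in test)
test_accuracy = correct / len(test)
print(threshold, test_accuracy)
```

Steps (1) and (5), understanding the problem and deploying the model, happen outside the code and are not shown.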
Data Science Process Frameworks
• CRISP-DM: Cross Industry Standard Process for Data Mining
• SEMMA: Sample, Explore, Modify, Model, and Assess
• DMAIC: Define, Measure, Analyze, Improve, and Control
Cross Industry Standard Process for Data Mining (CRISP-DM)

Data Science process

Prior knowledge
• Prior knowledge refers to information that is already known about a subject.
• A data science problem doesn't emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known.
• The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed in order to solve the problem.
Objective
• The data science process starts with a need for analysis, a question, or a business objective.
• This is possibly the most important step in the data science process.
• Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
• Because the process is iterative, it is common to go back to previous data science process steps and revise the assumptions, approach, and tactics.
Subject Area
• The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes.
• The problem is that it uncovers a lot of patterns.
• False or spurious signals are a major problem in the data science process.
• It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to answering the objective question.
• Hence, it is essential to know the subject matter, the context, and the business process generating the data.
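The risk of spurious signals can be demonstrated with pure noise. The sketch below generates a random target and 200 random attributes that are unrelated by construction, then reports the strongest correlation found among them. The record and attribute counts, the seed, and the use of Pearson correlation as the measure of a "pattern" are assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_records, n_attributes = 10, 200

# A target and 200 attributes that are all pure noise, unrelated by construction.
target = [rng.random() for _ in range(n_records)]
attributes = [[rng.random() for _ in range(n_records)] for _ in range(n_attributes)]

# Search every attribute for the strongest apparent relationship with the target.
strongest = max(abs(pearson(a, target)) for a in attributes)
print(round(strongest, 2))
```

With few records and many attributes, some attribute will look strongly related to the target by chance alone, which is why subject matter knowledge is needed to decide which patterns are worth accepting.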
Data
• Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
• This part of the step surveys all the data available to answer the business question and narrows down the new data that needs to be sourced.
• There is quite a range of factors to consider: quality of the data, quantity of data, availability of data, gaps in data, whether a lack of data compels the practitioner to change the business question, etc.
• The objective of this step is to come up with a dataset to answer the business question through the data science process.
• It is critical to recognize that an inferred model is only as good as the data used to create it.
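One of the factors listed above, gaps in data, is easy to check programmatically. The following minimal sketch counts missing values per attribute and finds how many records are complete. The records, the attribute names, and the convention of using `None` for a missing value are assumptions for this example.

```python
# Invented records; None marks a value that was never captured.
records = [
    {"customer_id": 1, "age": 34, "income": 52000},
    {"customer_id": 2, "age": None, "income": 61000},
    {"customer_id": 3, "age": 45, "income": None},
    {"customer_id": 4, "age": None, "income": 48000},
]

def missing_counts(rows):
    """Count missing (None) values per attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

gaps = missing_counts(records)
complete_rows = [r for r in records if None not in r.values()]
print(gaps, len(complete_rows))
```

A profile like this helps decide whether the available data can answer the business question or whether new data needs to be sourced.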
• A dataset (example set) is a collection of data with a defined structure, sometimes called a "data frame".
• A data point (record, object, or example) is a single instance in the dataset. Each instance has the same structure as the dataset.
• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset.
• Attributes can be numeric, categorical, date-time, text, or Boolean data types.
• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
• Identifiers are special attributes that are used for locating or providing context to individual records.
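The terms above map directly onto a small in-memory dataset. In the sketch below, each dictionary is a data point, the keys are attributes, one key is treated as the identifier, and one as the label. The borrower records and the column names are invented for illustration.

```python
# A dataset: a collection of data points that share one structure (invented values).
dataset = [
    {"borrower_id": "B01", "credit_score": 500, "interest_rate": 7.31},
    {"borrower_id": "B02", "credit_score": 600, "interest_rate": 6.70},
    {"borrower_id": "B03", "credit_score": 700, "interest_rate": 5.95},
]

IDENTIFIER = "borrower_id"      # locates a record; not used for prediction
LABEL = "interest_rate"         # the special attribute to be predicted
attributes = [name for name in dataset[0] if name not in (IDENTIFIER, LABEL)]

# Split each data point into its input attributes and its label.
inputs = [[row[name] for name in attributes] for row in dataset]
labels = [row[LABEL] for row in dataset]
print(attributes, inputs, labels)
```

Keeping the identifier out of the inputs matters because it carries no predictive meaning and only serves to look records up.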
