03 Data Science Process - Fall 23-24

Chapter 2 outlines the data science process, emphasizing the importance of understanding the problem objective, subject area, and data types. It discusses data preparation techniques such as handling missing values, ensuring data quality, and feature selection, as well as modeling methods including supervised and unsupervised learning. The chapter concludes with the significance of splitting datasets into training and test sets for model evaluation.


Chapter 2

Data Science Process


CRISP Data Mining Framework
Data Science Process
Prior Knowledge
Gaining information on:
 Objective of the problem
 Subject area of the problem
 Data
Objective of the problem

The data science process starts with a need for analysis, a question, or a business objective. This is
possibly the most important step in the data science process. Without a well-defined statement of the
problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Subject area of the problem

The process of data science uncovers hidden patterns in the dataset by exposing relationships between
attributes. But the problem is that it uncovers a lot of patterns. False or spurious signals are a major
problem in the data science process. It is up to the practitioner to sift through the exposed patterns and
accept the ones that are valid and relevant to answering the objective question. Hence, it is essential to
know the subject matter, the context, and the business process generating the data.
Data

Understanding how the data is collected, stored, transformed, reported, and used is essential to the data
science process. There is a range of factors to consider, such as the quality and quantity of the data. The
objective of this step is to come up with a dataset to answer the business question through the data
science process. For the following example, a sample dataset of ten data points with three attributes has
been put together: identifier, credit score, and interest rate. First, some of the terminology used in the
data science process is discussed.
Data Types
• Two types of data: Labelled Data & Unlabelled Data

• Labelled data

• Data with a specially designated attribute, where the aim is to use the given data to predict the value of
that attribute for instances that have not yet been seen. Data of this kind is called labelled.

• ID=1000027, clump=7, UnifShape=11, MargAdh=H, SingEpiSize=1, BareNuc=15, BlandChrome=4,
NormNucl=5, Mit=1, THEN Class = ?
Data Types
• Unlabelled data

• Data that does not have any specially designated attribute is called unlabelled.
• Here the aim is simply to extract the most information we can from the data available.
Learning Methods

• Supervised Learning
• Data mining using labelled data is known as supervised learning.

• Classification

• If the designated attribute is categorical, the task is called classification.

• Classification is one form of prediction, where the value to be predicted is a label.

• A hospital may want to classify medical patients into those who are at high, medium or low risk

• We may wish to classify a student project as distinction, merit, pass or fail

• Nearest Neighbour Matching, Classification Rules, Classification Tree, …
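
As a minimal sketch of the classification idea above (assuming scikit-learn is available; the attribute names and values are invented for the hospital risk example, not taken from the chapter):

```python
# Toy labelled data for the hospital risk example: two numeric attributes
# (age, systolic blood pressure) and a categorical class label (risk level).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[25, 120], [40, 135], [60, 160], [35, 128], [70, 170], [50, 150]]
y_train = ["low", "medium", "high", "low", "high", "medium"]

# Fit a nearest-neighbour classifier (k = 3) on the labelled training data.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict the class label for a previously unseen patient.
print(knn.predict([[45, 140]]))
```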


Learning Methods

• Numerical Prediction (Regression)

• If the designated attribute is numerical, the task is called numerical prediction (regression).

• Numerical prediction (often called regression) is another form of prediction. In this case we wish to
predict a numerical value, such as a company’s profits or a share price.

• A very popular way of doing this is to use a Neural Network.
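
A minimal sketch of numerical prediction with a small neural network (assuming scikit-learn; the single input attribute and the target values are invented for illustration):

```python
# Predict a numeric value (e.g. a share-price-like quantity) from one input
# attribute using a small feed-forward neural network.
from sklearn.neural_network import MLPRegressor

X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # input attribute
y_train = [10.2, 19.8, 30.1, 39.7, 50.3]        # numeric target to predict

model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
model.fit(X_train, y_train)

print(model.predict([[6.0]]))   # predicted numeric value for a new instance
```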


Learning Methods

• Unsupervised Learning
• Data mining using unlabelled data is known as unsupervised learning.

• Association Rules

• Sometimes we wish to use a training set to find any relationship that exists amongst the values of variables,
generally in the form of rules known as association rules.

• APRIORI

• Market Basket Analysis

• Clustering

• Clustering algorithms examine data to find groups of items that are similar.

• K-Means Clustering, Agglomerative Hierarchical Clustering
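
A minimal K-Means clustering sketch on unlabelled data (assuming scikit-learn; the points are invented and deliberately form two obvious groups):

```python
from sklearn.cluster import KMeans

# Six unlabelled data points forming two obvious groups.
X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assignment for each data point
print(kmeans.cluster_centers_)  # the two cluster centres
```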


A dataset (example set) is a collection of data with a defined structure. Table 2.1 shows a dataset. It
has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure
is also sometimes referred to as a “data frame”.

A data point (record, object or example) is a single instance in the dataset. Each row in Table 2.1 is a
data point. Each data point has the same structure as the dataset.

An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each
column in Table 2.1 is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean
data types. In this example, both the credit score and the interest rate are numeric attributes.

A label (class label, output, prediction, target, or response) is the special attribute to be predicted based
on all the input attributes. In Table 2.1, the interest rate is the output variable.
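
To make the terminology concrete, the sketch below builds a small data frame with the same structure as Table 2.1 (pandas assumed; the numbers are placeholders, not the actual values from the table):

```python
import pandas as pd

# Three attributes: an identifier, a credit score (input) and an
# interest rate (the label to be predicted).
df = pd.DataFrame({
    "borrower_id":   [1, 2, 3],
    "credit_score":  [500, 600, 700],
    "interest_rate": [9.5, 7.2, 5.1],
})

print(df.shape)    # (rows, columns): data points x attributes
print(df.dtypes)   # the data type of each attribute
```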
Data Preparation

 Data Exploration
 Data quality
 Missing values
 Noisy values
 Invalid values
 Data types and Conversion
 Transformation
 Outliers
 Feature selection
 Sampling
Data Exploration

Data preparation starts with an in-depth exploration of the data and gaining a
better understanding of the dataset. Data exploration, also known as exploratory
data analysis, provides a set of simple tools to achieve basic understanding of the
data. Data exploration approaches involve computing descriptive statistics and
visualization of data.
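
A minimal data exploration sketch (pandas and matplotlib assumed; the toy credit-score data continues the running example):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "credit_score":  [500, 600, 700, 650, 550],
    "interest_rate": [9.5, 7.2, 5.1, 6.0, 8.3],
})

# Descriptive statistics: count, mean, standard deviation, min, quartiles, max.
print(df.describe())

# A simple visualization of the relationship between the two attributes.
df.plot.scatter(x="credit_score", y="interest_rate")
plt.show()
```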
Data Quality

Data quality is the measure of how well suited a dataset is to serve its specific
purpose. Measures of data quality are based on data correctness, data freshness, and
data completeness.

Data Correctness: How accurately the data value describes real-world facts. For example, a
B2B sales rep wishes to look at a prospect company’s number of employees. If they accidentally grab
the wrong company from their database, because its name and location are similar to another
organization’s, they will report a wrong number, be misinformed, and potentially lose an opportunity to
sell to a qualified prospect. In this case, the rep used incorrect data. The data correctness metric is
usually measured with classification metrics such as precision: the proportion of correct data points
relative to incorrect data points. There are many potential root causes of correctness issues, such as
collection noise, faulty data transformations, outdated data, or an incorrect schema description.
Data Freshness: This refers to how relevant the data is to describe the current state of an
entity, and takes into consideration the timeliness of the data and how frequently it is updated. This is a
tricky measurement as “freshness” ranges from data that is updated in real-time to data that is updated
annually. Each business use case will differ in its data freshness thresholds and requirements. For
example, data that doesn’t change frequently, like a person or institution name, would not require the
same freshness as stock market or Twitter trends. In any case, data must be up to date; if it is not, it could
mislead a decision. This metric is typically measured in units of time.
Data Completeness: A measure which describes how whole and complete a data
asset is. Completeness is especially important when you want to attach new attributes to your existing
data. In cases where you have low coverage, you would get limited support for the different attributes
that you enrich, and the data becomes less useful. Coverage is also important if you want to extract
insights from your dataset.
Missing Values

In many real-world datasets, data values are not recorded for all attributes. This can happen for several
reasons: some attributes are not applicable for some instances, a malfunction of the equipment used to
record the data, a data collection form to which additional fields were added after some data had been
collected, or information that could not be obtained, e.g. about a hospital patient.

The k-nearest neighbor (k-NN) algorithm for classification tasks is often robust to missing values. Neural
network models for classification tasks do not perform well with missing attributes, and thus, the data
preparation step is essential for developing neural network models.
Methods to Handle Missing Values

• Discard Instances

• Replace by Most Frequent/Average Value


Discard Instances

• This is the simplest strategy: delete all instances where there is at least one missing value and use
the remainder.

• It has the advantage of not introducing any data errors. Its disadvantage is that discarding
data may damage the reliability of the results derived from the data.
Replace by Most Frequent/Average Value

• A less cautious strategy is to estimate each of the missing values using the values that are present
in the dataset.

• A straightforward but effective way of doing this for a categorical attribute is to use its most
frequently occurring (non-missing) value

• In the case of continuous attributes, it is likely that no specific numerical value will occur more
than a small number of times. In this case the estimate used is generally the average value.
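
Both strategies can be sketched with pandas on a toy dataset (the column names and values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_score": [500, np.nan, 700, 650],          # continuous attribute
    "rating":       ["good", "poor", None, "good"],   # categorical attribute
})

# Strategy 1: discard every instance with at least one missing value.
df_discarded = df.dropna()

# Strategy 2: replace missing values -- the average for a continuous
# attribute, the most frequent (non-missing) value for a categorical one.
df_imputed = df.copy()
df_imputed["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())
df_imputed["rating"] = df["rating"].fillna(df["rating"].mode()[0])

print(df_discarded)
print(df_imputed)
```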
Noisy Values

• A noisy value is one that is valid for the dataset but is incorrectly recorded.

• The number 69.72 may accidentally be entered as 6.972, or a categorical attribute value such as
brown may accidentally be recorded as another of the possible values, such as blue.
Invalid Values

• 69.7X for 6.972 or bbrown for brown

• An invalid value can easily be detected and either corrected or rejected.
Data types and Conversion

The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer
numeric (credit score), or categorical. For example, the credit score can be expressed as categorical
values (poor, good, excellent) or as a numeric score. Different data science algorithms impose different
restrictions on the attribute data types.
Transformation

In some data science algorithms like k-NN, the input attributes are expected to be numeric and
normalized, because the algorithm compares the values of different attributes and calculates distance
between the data points. Without normalization, an attribute with large values could dominate the
distance results. To overcome this problem, we generally normalize the values of continuous attributes.

The idea is to make the values of each attribute run from 0 to 1. In general, if the lowest value of
attribute A is min and the highest value is max, we convert each value of A, say a, to (a − min)/(max −
min).
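
A minimal sketch of this min-max normalization with pandas (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "credit_score":  [500, 600, 700, 650],
    "interest_rate": [9.5, 7.2, 5.1, 6.0],
})

# (a - min) / (max - min), applied column by column, so every
# attribute now runs from 0 to 1.
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```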
Outliers

Outliers are those data points that are significantly different from the rest of the dataset. They are often
abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or
erroneous observations. Detecting outliers may be the primary purpose of some data science
applications, like fake email detection, fraud or intrusion detection.
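
One common way to flag outliers (a judgment-call sketch, not prescribed by the chapter) is the interquartile-range rule used for box plots; a minimal example with NumPy and invented values:

```python
import numpy as np

values = np.array([9.5, 7.2, 5.1, 6.0, 8.3, 45.0])   # 45.0 is clearly unusual

# Flag points that fall more than 1.5 * IQR outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(values[mask])   # the detected outlier(s)
```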
Feature Selection

Reducing the number of attributes, without significant loss in the performance of the model, is called
feature selection.

Many data science problems involve a dataset with hundreds to thousands of attributes. In text mining
applications, every distinct word in a document forms a distinct attribute in the dataset. Not all the
attributes are equally important or useful in predicting the target. The presence of some attributes might
be counterproductive. Some of the attributes may be highly correlated with each other, like annual
income and taxes paid. A large number of attributes in the dataset significantly increases the complexity
of a model and may degrade the performance of the model due to the curse of dimensionality.
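
As one simple illustration of this idea (not the chapter's prescribed method), a highly correlated pair of attributes such as annual income and taxes paid can be detected and one of the pair dropped; the data and the 0.95 threshold below are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [40_000, 55_000, 80_000, 120_000],
    "taxes_paid":    [6_000, 8_300, 12_100, 18_200],   # almost a fixed fraction of income
    "credit_score":  [700, 520, 680, 610],
})

# Absolute pairwise correlations, keeping only the upper triangle so each
# pair of attributes is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any attribute that is very highly correlated with an earlier one.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
print(df.drop(columns=to_drop))
```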
Data Sampling

Sampling is a process of selecting a subset of records as a representation of the original dataset for use in
data analysis or modeling. The sample data serve as a representative of the original dataset with similar
properties, such as a similar mean. Sampling reduces the amount of data that need to be processed and
speeds up the model-building process. In most cases, to gain insights, extract the information, and
to build representative predictive models, it is sufficient to work with samples. Theoretically, the error
introduced by sampling impacts the relevancy of the model, but its benefits far outweigh the risks.
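
A minimal random-sampling sketch with pandas (the 30% fraction and the seed are arbitrary choices):

```python
import pandas as pd

df = pd.DataFrame({"credit_score": range(500, 700, 10)})   # 20 rows of toy data

# Keep a 30% simple random sample of the rows.
sample = df.sample(frac=0.3, random_state=42)

# The sample should have roughly the same properties (e.g. mean) as the full data.
print(len(df), len(sample))
print(df["credit_score"].mean(), sample["credit_score"].mean())
```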
MODELING

A model is the abstract representation of the data and the relationships in a given dataset.
MODELING

Splitting Training and Test data sets: The modeling step creates a representative model inferred from the
data. The dataset used to create the model, with known attributes and target, is called the training
dataset.

The validity of the created model will also need to be checked with another known dataset called the
test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a
training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for
training and one-third as the test dataset.
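
A minimal sketch of the two-thirds / one-third split using scikit-learn (the toy data are invented):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(30)]       # input attributes of the known dataset
y = [i % 2 for i in range(30)]     # known target labels

# Hold out one-third of the data as the test (validation) dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

print(len(X_train), len(X_test))   # 20 training instances, 10 test instances
```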
Training Set and Test Set
MODELING

Splitting Training and Test data sets


MODELING

Evaluation of test dataset


Application

 Product readiness
 Technical integration
 Model response time
 Remodeling
 Assimilation

Knowledge

 Posterior knowledge
