Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
I. Data Science Methodology
Methodology
• Data collection & curation (storing)
• Data cleaning
• Data processing
II. Data Preparation
Data Preparation
[Diagram: Data Preparation comprises Data Quality/Understanding, Data Cleaning, and Features Setting]
Data Preparation
• Data preparation is difficult because it differs from dataset to dataset and is specific to each project (it must be customized), yet it is critical.
• The objectives are to make sure the dataset is accurate, complete, and relevant.
• Nevertheless, there are common processes that are implemented across many projects.
II-A. Data Quality/Understanding
DATA QUALITY / UNDERSTANDING
II-B. Data Acquiring/Extraction
Data Acquiring/Extraction
• Collecting new data first-hand for the project (primary data), e.g.:
o Questionnaire
o Observation
o Interview
o Focus Groups
o Experiments
o Sensor
Data Acquiring/Extraction
• Using data that is readily available or collected by someone else (secondary data). Such data can be found on the internet, in libraries, from engineers/users, or in documents within the organization.
• It can also be obtained from online repositories such as Kaggle, GitHub, Data Hub, and Gapminder.
• E.g.:
o Published data
o Government publications
o Public records
o Historical and statistical documents
o Business documents
o Technical and trade journals
Data Acquiring/Extraction
• E.g., data is acquired every 15 minutes from a server and contains data points for various sensor tags on different pieces of equipment.
Data Acquiring/Extraction
CASE STUDY
Assume:
• The acquired data is in ex_acquired.csv – containing data of 20 sensor tags for 3 time frames.
• The 20 tags belong to 2 pieces of compressor equipment – Comp1 and Comp2; this mapping is contained in ex_eqlist.csv.
• We are to extract the acquired data and save it into 2 separate files, one per equipment.
[Screenshots: ex_acquired.csv and ex_eqlist.csv]
Data Acquiring/Extraction
CASE STUDY
[Screenshots: the extracted output files Compressor_1.csv and Compressor_2.csv]
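A minimal pandas sketch of this extraction step. The column names ("Tag ID", "Time", "Value" in ex_acquired.csv; "Tag ID", "Equipment" in ex_eqlist.csv) are assumptions, since the slide only shows the files as screenshots; adjust them to the actual headers.

```python
import pandas as pd

acquired = pd.read_csv("ex_acquired.csv")  # assumed columns: Tag ID, Time, Value
eqlist = pd.read_csv("ex_eqlist.csv")      # assumed columns: Tag ID, Equipment

# Attach the equipment name to every reading via the tag-to-equipment list.
merged = acquired.merge(eqlist, on="Tag ID", how="left")

# Write one file per equipment, e.g. Compressor_1.csv and Compressor_2.csv.
for equipment, group in merged.groupby("Equipment"):
    group.drop(columns="Equipment").to_csv(f"{equipment}.csv", index=False)
```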
II-C. Data Cleaning
Data Cleaning
[Flowchart: Start → check values against the non-numeric values list → convert non-numeric values → cleaned data → End]
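One way to implement this flowchart in pandas, sketched with hypothetical bad values ("Bad" stands in for whatever entries appear in the project's non-numeric values list):

```python
import pandas as pd

raw = pd.Series(["3.2", "4.1", "Bad", "5.0"])  # hypothetical raw tag readings

values = pd.to_numeric(raw, errors="coerce")   # non-numeric entries become NaN
non_numeric = raw[values.isna()]               # the "non-numeric values list"
cleaned = values.dropna()                      # the cleaned data
```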
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset. When combining multiple data sources, there are many opportunities for
data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to prescribe the exact steps in the data
cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a
template for your data cleaning process so you know you are doing it the right way every time.
Data Cleaning
Step 1: Remove duplicate or irrelevant observations
• Remove unwanted data from your dataset, including duplicate or irrelevant data.
• Duplicate observations will happen most often during data collection. When you combine datasets from multiple places, scrape data, or receive data from clients or multiple departments, there are many opportunities to create duplicate data.
• Irrelevant observations are those that do not fit into the specific problem you are trying to analyze.
• E.g., if you want to analyze data about millennial customers but your dataset includes older generations, you might remove those irrelevant observations. This makes analysis more efficient and minimizes distraction from your primary target, as well as producing a more manageable and more performant dataset (a minimal sketch follows).
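A sketch of both removals in pandas; the file name, the birth_year column, and the generation bounds are illustrative assumptions, not part of the slide:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# Step 1a: drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

# Step 1b: drop irrelevant observations, e.g. keep only millennial
# customers (column name and year bounds are assumptions).
df = df[df["birth_year"].between(1981, 1996)]
```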
Data Cleaning
Step 2: Fix structural errors
• Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes.
• E.g., you may find “N/A” and “Not Applicable” appearing as separate categories, but they should be analyzed as the same category (see the sketch below).
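A small pandas sketch of fixing such structural errors; the status column and its labels are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "not applicable", " Active", "active"]})

# Normalize whitespace and capitalization, then merge variant labels
# so "N/A" and "not applicable" are analyzed as one category.
df["status"] = (
    df["status"].str.strip().str.lower().replace({"n/a": "not applicable"})
)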
Data Cleaning
Step 3: Filter unwanted outliers
• Often there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing.
• If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the performance of the data you are working with.
• However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect.
• This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for
analysis or is a mistake, consider removing it.
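The slide does not prescribe a detection rule; one common choice (an assumption here, not the slide's method) is the 1.5 × IQR rule, sketched on made-up readings:

```python
import pandas as pd

def filter_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[s.between(q1 - k * iqr, q3 + k * iqr)]

readings = pd.Series([4.7, 4.25, 3.8, 3.6, 98.0])  # 98.0 is a suspect value
print(filter_iqr_outliers(readings))               # drops 98.0
```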
Data Cleaning
Step 4: Handle missing data
• You can’t ignore missing data because many algorithms will not accept missing values. There are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.
• As a first option, you can drop observations that have missing values; doing this drops or loses information, so be mindful of that before you remove them.
• As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data, because you may be operating from assumptions rather than actual observations. Common imputation strategies (sketched in the code below):
o Mean – replace missing values with the mean value
o Median – replace missing values with the median value
o Interpolation – take the points before the missing value and after the missing value, then connect
the points with values in between
o K-Nearest Neighbors
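A sketch of these four options in pandas/scikit-learn; the file and column names are assumptions, and in practice you would pick one strategy per column:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("sensors.csv")  # hypothetical dataset with missing values
col = "Value"                    # assumed numeric column with gaps

df[col] = df[col].fillna(df[col].mean())          # mean
# df[col] = df[col].fillna(df[col].median())      # median
# df[col] = df[col].interpolate(method="linear")  # connect neighboring points

# K-Nearest Neighbors: estimate each gap from the k most similar rows,
# using several numeric columns at once (column names are assumptions).
num_cols = ["Factor1", "Factor2", "Factor3"]
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
```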
Data Cleaning
At the end of the data cleaning process, you should be able to answer these questions as part of basic validation: Does the data make sense? Does it follow the appropriate rules for its field? Does it prove or disprove your working theory, or bring any insight to light?
Note: False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize
your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality
data in your organization. To do this, you should document the tools you might use to create this culture
and what data quality means to you.
Data Cleaning
CASE STUDY
• Certain tag values contain bad values which need to be removed.
• [Figure: examples of bad values in the acquired data]
• Purpose: to transform the dataset’s dimensions to follow the format required for modeling.
• E.g., cleaned data sets are organized in a 3-columns × n-rows format, the 3 columns being Tag ID, Time, and Value; i.e., each row is one tag with its time and value. The modeling process, however, expects the Tag IDs as columns, with samples (rows) listed by Time.
• The data sets must therefore be transformed: a pivot is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.
• The pivot requires a lot of computational power since the data sets are all huge. The data sets are therefore split into several pieces so that each small part can be computed in reasonable time; finally, the pieces are combined to be used in modeling (a pivot sketch follows below).
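A pandas sketch of the pivot and of the split-then-combine idea, assuming the cleaned file and its three columns as described above (file name and chunk size are assumptions):

```python
import pandas as pd

cleaned = pd.read_csv("cleaned.csv")  # columns: Tag ID, Time, Value

# Pivot: Tag IDs, originally listed down the rows, become the columns;
# each row is now one Time sample across all tags.
wide = cleaned.pivot_table(index="Time", columns="Tag ID", values="Value")

# For huge data sets, pivot in pieces and recombine, as the slide suggests:
pieces = [
    chunk.pivot_table(index="Time", columns="Tag ID", values="Value")
    for chunk in pd.read_csv("cleaned.csv", chunksize=1_000_000)
]
wide = pd.concat(pieces).groupby(level=0).first()  # merge rows split across chunks
```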
II-D. Data Transformation
Data Transformation
• Purpose: To transform dataset’s dimension to follow the required format for modeling
II-E. Feature Setting
Feature Setting
• The data features that we use to train our machine learning models have a huge influence on the
performance that can be achieved.
• Feature selection → a process to select those features in data that contribute most to the prediction
variable or output.
Source: https://machinelearningmastery.com/feature-selection-machine-learning-python/
Feature Setting
Example dataset – a corrosion rate target with three candidate factors (note that Factor2 is constant at 0.2 in every row shown, so it carries no predictive information):
Row Corrosion Rate Factor1 Factor2 Factor3
1 0.575687 4.7 0.2 6.4
2 0.617291 4.25 0.2 6.45
…
…
…
98 0.205765 3.8 0.2 6.41
99 0.090778 3.6875 0.2 6.82
100 0.099716 3.575 0.2 6.751429
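In the spirit of the linked article, a scikit-learn sketch of univariate feature selection on this dataset; the file name corrosion.csv is an assumption:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

df = pd.read_csv("corrosion.csv")  # hypothetical file with the columns above
X = df[["Factor1", "Factor2", "Factor3"]]
y = df["Corrosion Rate"]

# Score each factor against the target and keep the 2 most predictive.
# A constant column like Factor2 has no correlation with the target
# (scikit-learn will warn about it), so it is the one dropped.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(list(X.columns[selector.get_support()]))  # expected: Factor1, Factor3
```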
Feature Setting
Example
• Training set (70%) – predictors/features X_train: Factor1, Factor3 (rows 1–70); targets/labels y_train: Corrosion Rate (rows 1–70)
• Testing set (30%) – predictors/features X_test: Factor1, Factor3 (rows 71–100); targets/labels y_test: Corrosion Rate (rows 71–100)
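A minimal scikit-learn sketch reproducing this 70/30 split, continuing the hypothetical df from the previous sketch; shuffle=False keeps the slide's row order (rows 1–70 train, 71–100 test):

```python
from sklearn.model_selection import train_test_split

X = df[["Factor1", "Factor3"]]  # the selected features
y = df["Corrosion Rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)
```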
Summary
Next…
❖ Data cleaning hands-on
❖ Feature setting hands-on