
Introduction to Data Science

Data Science Methodology & Data Preparation

DR SHUHAIDA MOHAMED SHUHIDAN


Jan 2025
CREDIT: Ts. Dr. Nurul Aida Osman
Learning Outcomes

At the end of this session, you will be able to:


• Explain data science methodology
• List the processes in data preparation/cleaning

2
I. Data Science Methodology
Methodology

• Data collection & curation (storing)
• Data cleaning
• Data processing
• Evaluation and deployment

4
II. Data Preparation
Data Preparation

[Diagram: Data Preparation at the centre, surrounded by its five components: Data Quality/Understanding, Data Acquiring/Extraction, Data Cleaning, Data Transformation, and Features Setting]
6
Data Preparation

• Data preparation is difficult because it differs from dataset to dataset and is specific to each project (i.e., it must be customized), yet it is critical.

• The objectives are to ensure the dataset is accurate, complete, and relevant.

• Practitioners broadly agree that:

o Garbage in, garbage out.
o 70%-80% of a machine learning project's time is spent on data preparation.

• Nevertheless, there are common processes implemented across projects:

o Data quality/understanding
o Data acquiring/extraction
o Data cleaning
o Data transformation
o Features setting for modelling

7
II-A. Data Quality/Understanding
DATA QUALITY / UNDERSTANDING

• A very important phase in machine learning project development.

• Normally conducted in the first stage of a project.

• Among the activities are:

o Understanding the infra/network/system setup
o Understanding the data and its source
o Determining bad values, non-numeric values, etc.
o Knowing the behavior of the equipment
o Tags, number of tags, related equipment/systems
o Failure reports, FFN
o Data uniqueness, completeness

9
II-B. Data Acquiring/Extraction
Data Acquiring/Extraction

• Also known as data collection.

• The process of collecting/gathering information on variables of interest, in an established systematic method that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Primary Data Acquiring

• Used for a unique problem for which no related work has been conducted in the past.

• E.g.:
o Questionnaires
o Observations
o Interviews
o Focus groups
o Experiments
o Sensors
11
Data Acquiring/Extraction

Secondary Data Acquiring

• Using data that is readily available or was collected by someone else. Such data can be found on the internet, in libraries, or from engineers/users and documents within the organization.

• Also obtained from online repositories such as Kaggle, GitHub, Data Hub, and Gapminder.

• E.g.:
o Published data
o Government publications
o Public records
o Historical and statistical documents
o Business documents
o Technical and trade journals

12
Data Acquiring/Extraction

• E.g., data is acquired every 15 minutes from a server and contains data points of various sensor tags belonging to different equipment.

• Tags and values need to be extracted by equipment.

• The extracted data is saved accordingly in respective files (e.g., CSV).

13
Data Acquiring/Extraction
CASE STUDY
Assume:
• The acquired data is in ex_acquired.csv, containing data for 20 sensor tags over 3 time frames.
• The 20 tags belong to 2 compressors, Comp1 and Comp2 → this mapping is contained in ex_eqlist.csv.
• We are to extract the acquired data and save it into 2 separate files by equipment (a code sketch follows below).
[Screenshots: ex_acquired.csv and ex_eqlist.csv]

14
Data Acquiring/Extraction
CASE STUDY

[Screenshots: the extracted output files Compressor_1.csv and Compressor_2.csv]
15
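The following is a minimal sketch of this extraction step in Python with pandas. It assumes ex_acquired.csv holds one row per reading with columns "Tag", "Time", and "Value", and that ex_eqlist.csv maps each tag to its equipment with columns "Tag" and "Equipment"; these column names are illustrative assumptions, not taken from the slides.

```python
# Hypothetical column names: "Tag", "Time", "Value", "Equipment".
import pandas as pd

acquired = pd.read_csv("ex_acquired.csv")  # 20 sensor tags x 3 time frames
eqlist = pd.read_csv("ex_eqlist.csv")      # tag-to-equipment mapping

# Attach the equipment name to every reading via the tag list.
merged = acquired.merge(eqlist, on="Tag", how="left")

# Write one CSV per equipment, e.g. Compressor_1.csv and Compressor_2.csv
# (assuming the mapping uses those equipment names).
for equipment, readings in merged.groupby("Equipment"):
    readings.drop(columns="Equipment").to_csv(f"{equipment}.csv", index=False)
```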
II-C. Data Cleaning
Data Cleaning

Start

Search bad Bad value


Data set value list

Found YES Remove bad


bad value
value?

NO
Convert non-
Non-numeric numeric
values list values

Cleaned data

End

17
Data Cleaning

What is data cleaning?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset. When combining multiple data sources, there are many opportunities for
data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to prescribe the exact steps in the data
cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a
template for your data cleaning process so you know you are doing it the right way every time.

18
Data Cleaning

Steps in Data Cleaning

• Step 1: Remove duplicate or irrelevant observations.


• Step 2: Fix structural errors
• Step 3: Filter unwanted outliers
• Step 4: Handle missing data
• Step 5: Validate and QA

19
Data Cleaning
Step 1: Remove duplicate or irrelevant observations
• Remove unwanted data from your dataset, including duplicate or irrelevant data.
• Duplicate observations will happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are many opportunities to create duplicate data.
• Irrelevant observations are those that do not fit into the specific problem you are trying to analyze.
• E.g., if you want to analyze data regarding millennial customers, but your dataset includes older generations,
you might remove those irrelevant observations. This can make analysis more efficient and minimize
distraction from your primary target—as well as creating a more manageable and more performant dataset.

20
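A small sketch of Step 1 with pandas, on a toy customer table; the "birth_year" column and the millennial cut-off year are hypothetical illustrations, not from the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["ana", "ana", "ben", "carol"],
    "birth_year": [1990, 1990, 1965, 1995],
})

df = df.drop_duplicates()          # remove duplicate observations
df = df[df["birth_year"] >= 1981]  # drop irrelevant (non-millennial) rows
print(df)                          # ana and carol remain
```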
Data Cleaning

Step 2: Fix structural errors


• Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or
incorrect capitalization.
• These inconsistencies can cause mislabeled categories or classes.
• For example, you may find “N/A” and “Not Applicable” both appear, but they should be analyzed as the same
category.

21
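A minimal sketch of Step 2, normalizing inconsistent labels so that "N/A" and "Not Applicable" land in one category; the example values are illustrative.

```python
import pandas as pd

labels = pd.Series(["N/A", "Not Applicable", "Yes", "YES ", "yes"])

labels = labels.str.strip().str.lower()             # fix stray spaces and capitalization
labels = labels.replace({"n/a": "not applicable"})  # merge equivalent categories
print(labels.value_counts())                        # two categories instead of five
```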
Data Cleaning
Step 3: Filter unwanted outliers

• Often, there will be one-off observations where, at a glance, they do not appear to fit within the data you
are analyzing.
• If you have a legitimate reason to remove an outlier, like improper data entry, doing so will help the performance of the data you are working with.
• However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect.
• This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for
analysis or is a mistake, consider removing it.

22
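The slides do not prescribe a specific outlier rule, so as one common choice this sketch flags points with the 1.5 x IQR rule and inspects them before anything is removed.

```python
import pandas as pd

values = pd.Series([4.7, 4.25, 3.8, 3.69, 3.58, 42.0])  # 42.0 looks suspect

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
within = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(values[~within])     # inspect flagged outliers first; they may be legitimate
filtered = values[within]  # drop them only with a valid reason
```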
Data Cleaning

Step 4: Handle missing data

• You can’t ignore missing data because many algorithms will not accept missing values. There are a
couple of ways to deal with missing data. Neither is optimal, but both can be considered.
• As a first option, you can drop observations that have missing values, but doing this will drop or lose
information, so be mindful of this before you remove it.
• As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions rather than actual observations. Common imputation methods include:
o Mean – replace missing values with the mean value
o Median – replace missing values with the median value
o Interpolation – take the points before and after the missing value, then connect them with values in between
o K-Nearest Neighbors – estimate each missing value from the most similar observations

23
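A short sketch of the options above (drop, mean, median, interpolation, KNN) using pandas and scikit-learn; the series values are illustrative.

```python
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([1.0, 2.0, None, 4.0, 5.0])

dropped = s.dropna()                  # option 1: drop observations with missing values
mean_filled = s.fillna(s.mean())      # mean imputation
median_filled = s.fillna(s.median())  # median imputation
interpolated = s.interpolate()        # linear interpolation between neighbours

# KNN imputation works on tabular data: each missing entry is estimated
# from the most similar rows.
df = pd.DataFrame({"a": [1.0, 2.0, None, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```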
Data Cleaning

Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of basic
validation:

• Does the data make sense?


• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?

Note: False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize
your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality
data in your organization. To do this, you should document the tools you might use to create this culture
and what data quality means to you.

24
Data Cleaning
CASE STUDY
• Certain tag values contain bad values which need to be removed.

• The definition of bad values is based on the project.

• E.g.:

Tag Value     Definition   Value to be assigned
Alarm         Good value   1
Bad           Bad value    Garbage
Calc Failed   Bad value    Garbage
Configure     Bad value    Garbage
Connected     Good value   1
FALSE         Good value   0
FAULT         Good value   0
Good          Good value   1
I/O Timeout   Bad value    Garbage
Intf Shut     Bad value    Garbage
Normal        Good value   1
Not Connect   Bad value    Garbage
Out of Serv   Bad value    Garbage
Pt Created    Bad value    Garbage
Scan Off      Bad value    Garbage
TRUE          Good value   1

NOTE:
• Bad values are removed.
• The remaining values in non-numerical form are converted into numeric values, e.g., 1 or 0.
25
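A minimal sketch of this cleaning rule: map the statuses from the table to 1/0 or to NaN (bad values), then drop the NaNs. The dictionary mirrors the table above; the DataFrame and its "Value" column are hypothetical.

```python
import numpy as np
import pandas as pd

value_map = {
    "Alarm": 1, "Connected": 1, "Good": 1, "Normal": 1, "TRUE": 1,
    "FALSE": 0, "FAULT": 0,
    # bad values become NaN so they can be removed
    "Bad": np.nan, "Calc Failed": np.nan, "Configure": np.nan,
    "I/O Timeout": np.nan, "Intf Shut": np.nan, "Not Connect": np.nan,
    "Out of Serv": np.nan, "Pt Created": np.nan, "Scan Off": np.nan,
}

df = pd.DataFrame({"Value": ["Good", "Bad", "TRUE", "Scan Off", "Normal"]})
df["Value"] = df["Value"].replace(value_map)  # numeric readings pass through unchanged
df = df.dropna(subset=["Value"])              # bad values are removed
```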
II-D. Data Transformation
DATA TRANSFORMATION

• Purpose: To transform the dataset's dimensions into the format required for modeling.

• E.g., cleaned data sets comprise data organized in a 3 columns x n rows format, where the 3 columns are Tag ID, Time, and Value; that is, each row holds one tag with its time and value. This format requires transformation because the modeling process expects Tag ID as the columns, with the samples (rows) listed by Time.

• The data sets therefore have to be transformed: a pivot process is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.

• The pivot process requires a lot of computational power since the data sets are huge. The data sets are therefore split into several pieces so that each small part can be computed in reasonable time; finally, the pieces are recombined for use in modeling.

• Other activities that may be carried out:

o Imputation
o Aggregation/clustering

27
DATA TRANSFORMATION

• Purpose: To transform the dataset's dimensions into the format required for modeling.

28
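A minimal sketch of the pivot described above, turning the long format (Tag ID, Time, Value) into the wide format (one column per Tag ID, one row per Time); the tag names and values are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Tag ID": ["T1", "T2", "T1", "T2"],
    "Time":   ["08:00", "08:00", "08:15", "08:15"],
    "Value":  [4.7, 6.4, 4.25, 6.45],
})

wide_df = long_df.pivot(index="Time", columns="Tag ID", values="Value")
print(wide_df)
# Tag ID    T1    T2
# Time
# 08:00   4.70  6.40
# 08:15   4.25  6.45
```

For very large data sets, the same pivot can be applied to smaller chunks that are then concatenated, as the slide describes.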
II-E. Features Setting
Feature Setting

• The data features that we use to train our machine learning models have a huge influence on the
performance that can be achieved.

• Irrelevant or partially relevant features can negatively impact model performance.

• Feature selection → a process to select those features in data that contribute most to the prediction
variable or output.

• Three benefits of performing feature selection before modeling are:


o Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
o Improves Accuracy: Less misleading data means modeling accuracy improves.
o Reduces Training Time: Less data means that algorithms train faster.

Source: https://machinelearningmastery.com/feature-selection-machine-learning-python/
30
Feature Setting
Before feature selection:

Row   Corrosion Rate   Factor1   Factor2   Factor3
1     0.575687         4.7       0.2       6.4
2     0.617291         4.25      0.2       6.45
…
98    0.205765         3.8       0.2       6.41
99    0.090778         3.6875    0.2       6.82
100   0.099716         3.575     0.2       6.751429

After feature selection (Factor2 is constant at 0.2 in the rows shown, so it carries no predictive information and is dropped):

Row   Corrosion Rate   Factor1   Factor3
1     0.575687         4.7       6.4
2     0.617291         4.25      6.45
…
98    0.205765         3.8       6.41
99    0.090778         3.6875    6.82
100   0.099716         3.575     6.751429

31
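One way to arrive at the reduced table (an assumption; the slides do not name a specific method) is a variance filter that drops zero-variance features such as the constant Factor2. The five sample rows below are taken from the table above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({
    "Factor1": [4.7, 4.25, 3.8, 3.6875, 3.575],
    "Factor2": [0.2, 0.2, 0.2, 0.2, 0.2],  # constant -> zero variance
    "Factor3": [6.4, 6.45, 6.41, 6.82, 6.751429],
})

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
selector.fit(X)
selected = X.loc[:, selector.get_support()]  # keeps Factor1 and Factor3
print(selected.columns.tolist())             # ['Factor1', 'Factor3']
```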
Feature Setting

               Predictors / Features   Targets / Labels
Training set   X_train                 y_train            70%
Testing set    X_test                  y_test             30%

32
Feature Setting
Example
               Predictors / Features       Targets / Labels
Training set   X_train:                    y_train:                 70%
               Factor1, Factor3            Corrosion Rate
               (rows 1–70)                 (rows 1–70)
Testing set    X_test:                     y_test:                  30%
               Factor1, Factor3            Corrosion Rate
               (rows 71–100)               (rows 71–100)

33
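A minimal sketch of this split with scikit-learn, assuming the 100-row corrosion dataset is loaded from a hypothetical corrosion.csv with the columns shown earlier; shuffle=False reproduces the slides' sequential split (rows 1–70 for training, rows 71–100 for testing).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("corrosion.csv")  # hypothetical file with the table's columns

X = df[["Factor1", "Factor3"]]     # predictors / features
y = df["Corrosion Rate"]           # targets / labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)
```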
Summary

You have learned…


✓ Data science methodology
✓ Data preparation processes

Next…
❖ Data cleaning hands-on
❖ Feature setting hands-on

34
