Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
I. Data Science Methodology
Methodology
• Data collection & curation (storing)
• Data cleaning
• Data processing
II. Data Preparation
Data Preparation
[Diagram: Data Preparation comprises Data Quality/Understanding, Data Cleaning, and Features Setting]
Data Preparation
• Data preparation is difficult because it differs from dataset to dataset and is specific to each project (it must be customized), yet it is critical.
• The objectives are to make sure the dataset is accurate, complete, and relevant.
• Nevertheless, there are common processes that are implemented across many projects.
II-A. Data Quality/Understanding
DATA QUALITY / UNDERSTANDING
II-B. Data Acquiring/Extraction
Data Acquiring/Extraction
• Collecting new data first-hand for the project (primary data), e.g.:
o Questionnaire
o Observation
o Interview
o Focus Groups
o Experiments
o Sensor
Data Acquiring/Extraction
• Using data that is readily available or collected by someone else (secondary data). Such data can be found on the internet, in libraries, from engineers/users, or in documents within the organization.
• It can also be obtained from online repositories such as Kaggle, GitHub, Data Hub, and Gapminder.
• E.g.:
o Published data
o Government publications
o Public records
o Historical and statistical documents
o Business documents
o Technical and trade journals
Data Acquiring/Extraction
• E.g., data is acquired every 15 minutes from a server and contains data points for various sensor tags on different pieces of equipment.
Data Acquiring/Extraction
CASE STUDY
Assume:
• The acquired data is in ex_acquired.csv – containing data of 20 sensor tags for 3 time frames.
• The 20 tags belong to 2 pieces of compressor equipment – Comp1 and Comp2; this mapping is contained in ex_eqlist.csv.
• We are to extract the acquired data and save it into 2 separate files, one per equipment.
[Screenshots: ex_acquired.csv and ex_eqlist.csv]
Data Acquiring/Extraction
CASE STUDY
[Screenshots: the extracted output files Compressor_1.csv and Compressor_2.csv]
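A minimal pandas sketch of this extraction step. The column names ("Tag ID", "Time", "Value" in ex_acquired.csv; "Tag ID", "Equipment" in ex_eqlist.csv) are assumptions, since the slide only shows the files as screenshots; adjust them to the actual headers.

```python
import pandas as pd

acquired = pd.read_csv("ex_acquired.csv")  # assumed columns: Tag ID, Time, Value
eqlist = pd.read_csv("ex_eqlist.csv")      # assumed columns: Tag ID, Equipment

# Attach the equipment name to every reading via the tag-to-equipment list.
merged = acquired.merge(eqlist, on="Tag ID", how="left")

# Write one file per equipment, e.g. Compressor_1.csv and Compressor_2.csv.
for equipment, group in merged.groupby("Equipment"):
    group.drop(columns="Equipment").to_csv(f"{equipment}.csv", index=False)
```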
II-C. Data Cleaning
Data Cleaning
[Flowchart: Start → check values against the non-numeric values list → convert non-numeric values → cleaned data → End]
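One way to implement this flowchart in pandas, sketched with hypothetical bad values ("Bad" stands in for whatever entries appear in the project's non-numeric values list):

```python
import pandas as pd

raw = pd.Series(["3.2", "4.1", "Bad", "5.0"])  # hypothetical raw tag readings

values = pd.to_numeric(raw, errors="coerce")   # non-numeric entries become NaN
non_numeric = raw[values.isna()]               # the "non-numeric values list"
cleaned = values.dropna()                      # the cleaned data
```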
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset. When combining multiple data sources, there are many opportunities for
data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to prescribe the exact steps in the data
cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a
template for your data cleaning process so you know you are doing it the right way every time.
Data Cleaning
Step 1: Remove duplicate or irrelevant observations
• Remove unwanted data from your dataset, including duplicate or irrelevant data.
• Duplicate observations will happen most often during data collection. When you combine datasets from multiple places, scrape data, or receive data from clients or multiple departments, there are many opportunities to create duplicate data.
• Irrelevant observations are those that do not fit into the specific problem you are trying to analyze.
• E.g., if you want to analyze data about millennial customers but your dataset includes older generations, you might remove those irrelevant observations. This makes analysis more efficient and minimizes distraction from your primary target, as well as producing a more manageable and more performant dataset (a minimal sketch follows).
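A sketch of both removals in pandas; the file name, the birth_year column, and the generation bounds are illustrative assumptions, not part of the slide:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# Step 1a: drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

# Step 1b: drop irrelevant observations, e.g. keep only millennial
# customers (column name and year bounds are assumptions).
df = df[df["birth_year"].between(1981, 1996)]
```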
Data Cleaning
Step 2: Fix structural errors
• Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes.
• E.g., you may find “N/A” and “Not Applicable” appearing as separate categories, but they should be analyzed as the same category (see the sketch below).
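A small pandas sketch of fixing such structural errors; the status column and its labels are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "not applicable", " Active", "active"]})

# Normalize whitespace and capitalization, then merge variant labels
# so "N/A" and "not applicable" are analyzed as one category.
df["status"] = (
    df["status"].str.strip().str.lower().replace({"n/a": "not applicable"})
)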
Data Cleaning
Step 3: Filter unwanted outliers
• Often there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing.
• If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the performance of the data you are working with.
• However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect.
• This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for
analysis or is a mistake, consider removing it.
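The slide does not prescribe a detection rule; one common choice (an assumption here, not the slide's method) is the 1.5 × IQR rule, sketched on made-up readings:

```python
import pandas as pd

def filter_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[s.between(q1 - k * iqr, q3 + k * iqr)]

readings = pd.Series([4.7, 4.25, 3.8, 3.6, 98.0])  # 98.0 is a suspect value
print(filter_iqr_outliers(readings))               # drops 98.0
```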
Data Cleaning
Step 4: Handle missing data
• You can’t ignore missing data because many algorithms will not accept missing values. There are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.
• As a first option, you can drop observations that have missing values; doing this drops or loses information, so be mindful of that before you remove them.
• As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data, because you may be operating from assumptions rather than actual observations. Common imputation strategies (sketched in the code below):
o Mean – replace missing values with the mean value
o Median – replace missing values with the median value
o Interpolation – take the points before the missing value and after the missing value, then connect
the points with values in between
o K-Nearest Neighbors
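A sketch of these four options in pandas/scikit-learn; the file and column names are assumptions, and in practice you would pick one strategy per column:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("sensors.csv")  # hypothetical dataset with missing values
col = "Value"                    # assumed numeric column with gaps

df[col] = df[col].fillna(df[col].mean())          # mean
# df[col] = df[col].fillna(df[col].median())      # median
# df[col] = df[col].interpolate(method="linear")  # connect neighboring points

# K-Nearest Neighbors: estimate each gap from the k most similar rows,
# using several numeric columns at once (column names are assumptions).
num_cols = ["Factor1", "Factor2", "Factor3"]
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
```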
Data Cleaning
At the end of the data cleaning process, you should be able to answer these questions as part of basic validation: Does the data make sense? Does it follow the appropriate rules for its field? Does it prove or disprove your working theory, or bring any insight to light?
Note: False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize
your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality
data in your organization. To do this, you should document the tools you might use to create this culture
and what data quality means to you.
Data Cleaning
CASE STUDY
• Certain tag values contain bad values which need to be removed.
• [Figure: examples of bad values in the acquired data]
• Purpose: to transform the dataset’s dimensions to follow the format required for modeling.
• E.g., cleaned data sets are organized in a 3-columns × n-rows format, the 3 columns being Tag ID, Time, and Value; i.e., each row is one tag with its time and value. The modeling process, however, expects the Tag IDs as columns, with samples (rows) listed by Time.
• The data sets must therefore be transformed: a pivot is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.
• The pivot requires a lot of computational power since the data sets are all huge. The data sets are therefore split into several pieces so that each small part can be computed in reasonable time; finally, the pieces are combined to be used in modeling (a pivot sketch follows below).
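A pandas sketch of the pivot and of the split-then-combine idea, assuming the cleaned file and its three columns as described above (file name and chunk size are assumptions):

```python
import pandas as pd

cleaned = pd.read_csv("cleaned.csv")  # columns: Tag ID, Time, Value

# Pivot: Tag IDs, originally listed down the rows, become the columns;
# each row is now one Time sample across all tags.
wide = cleaned.pivot_table(index="Time", columns="Tag ID", values="Value")

# For huge data sets, pivot in pieces and recombine, as the slide suggests:
pieces = [
    chunk.pivot_table(index="Time", columns="Tag ID", values="Value")
    for chunk in pd.read_csv("cleaned.csv", chunksize=1_000_000)
]
wide = pd.concat(pieces).groupby(level=0).first()  # merge rows split across chunks
```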
II-D. Data Transformation
Data Transformation
• Purpose: To transform dataset’s dimension to follow the required format for modeling
II-E. Feature Setting
Feature Setting
• The data features that we use to train our machine learning models have a huge influence on the
performance that can be achieved.
• Feature selection → a process to select those features in data that contribute most to the prediction
variable or output.
Source: https://machinelearningmastery.com/feature-selection-machine-learning-python/
Feature Setting
Example dataset – a corrosion rate target with three candidate factors (note that Factor2 is constant at 0.2 in every row shown, so it carries no predictive information):
Row Corrosion Rate Factor1 Factor2 Factor3
1 0.575687 4.7 0.2 6.4
2 0.617291 4.25 0.2 6.45
…
…
…
98 0.205765 3.8 0.2 6.41
99 0.090778 3.6875 0.2 6.82
100 0.099716 3.575 0.2 6.751429
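In the spirit of the linked article, a scikit-learn sketch of univariate feature selection on this dataset; the file name corrosion.csv is an assumption:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

df = pd.read_csv("corrosion.csv")  # hypothetical file with the columns above
X = df[["Factor1", "Factor2", "Factor3"]]
y = df["Corrosion Rate"]

# Score each factor against the target and keep the 2 most predictive.
# A constant column like Factor2 has no correlation with the target
# (scikit-learn will warn about it), so it is the one dropped.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(list(X.columns[selector.get_support()]))  # expected: Factor1, Factor3
```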
Feature Setting
Example
• Training set (70%) – predictors/features X_train: Factor1, Factor3 (rows 1–70); targets/labels y_train: Corrosion Rate (rows 1–70)
• Testing set (30%) – predictors/features X_test: Factor1, Factor3 (rows 71–100); targets/labels y_test: Corrosion Rate (rows 71–100)
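A minimal scikit-learn sketch reproducing this 70/30 split, continuing the hypothetical df from the previous sketch; shuffle=False keeps the slide's row order (rows 1–70 train, 71–100 test):

```python
from sklearn.model_selection import train_test_split

X = df[["Factor1", "Factor3"]]  # the selected features
y = df["Corrosion Rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)
```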
Summary
Next…
❖ Data cleaning hands-on
❖ Feature setting hands-on