
4.1 - Data Preprocessing


Data Preprocessing

CC19 – Data Mining


Agenda
• Definition of Data Preprocessing
• Types of Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
• Steps of Data Preprocessing
Defining Data Preprocessing
• Data preprocessing is a key step in data mining that involves
modifying data to prepare it for analysis.
• This allows the data to better fit different data mining analysis
techniques and tools.
• Different techniques can be utilized depending on the type of data
being analyzed.
Defining Data Preprocessing
• Preparing data is important to ensure that large datasets can be
processed more easily.
• While more data is available for analysis compared to before, a lot of
that data is “dirty”.
• The data collected via data collection techniques can also be
inconsistent in format and quality.
Defining Data Preprocessing
• Many data mining techniques rely on data that is complete and
noise-free.
• Unfortunately for us, real-world data is rarely clean or complete.
• These are further reasons to preprocess data so that it is usable by
data mining tools.
Types of Data Preprocessing
• Listed below are common techniques for data preprocessing:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
Types of Data Preprocessing – Data Cleaning

• Data cleaning involves correcting bad data, filtering out incorrect data, or
reducing unnecessary detail in the data.
• It is a general technique that is commonly used together with other techniques.
• Treatment of missing and noisy data is also included here.
Types of Data Preprocessing – Data Cleaning

• Data cleaning involves identifying and correcting errors and
inconsistencies in the data.
• These errors can involve missing values, outliers, and duplicates.
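The three kinds of errors named above can be handled in a few lines of code. The following is a minimal sketch, not a definitive method: the `id` and `age` fields, the median fill, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
from statistics import median, quantiles

def clean_ages(records):
    """Remove duplicates, flag outliers, and fill missing ages (illustrative only)."""
    # Step 1: drop exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in records:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # Step 2: flag outliers among the observed ages with the 1.5 * IQR rule.
    observed = [r["age"] for r in unique if r["age"] is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    outlier_ids = [r["id"] for r in unique
                   if r["age"] is not None and not low <= r["age"] <= high]

    # Step 3: fill missing ages (None) with the median of the observed ages.
    fill = median(observed)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique, outlier_ids

# Hypothetical sample data: one duplicate row, one missing age, one implausible age.
records = [
    {"id": 1, "age": 23}, {"id": 2, "age": 25}, {"id": 2, "age": 25},
    {"id": 3, "age": None}, {"id": 4, "age": 27}, {"id": 5, "age": 29},
    {"id": 6, "age": 31}, {"id": 7, "age": 240},
]
cleaned, outlier_ids = clean_ages(records)
```

Here the outliers are only flagged, not deleted, because whether an extreme value is an error or a genuine observation usually needs a human decision.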
Types of Data Preprocessing – Data Integration

• Data integration involves merging data from multiple data sources.
• This should include steps to reduce redundancies and inconsistencies
in your data set.
• Techniques involved here include identification and unification of
variables and domains.
Types of Data Preprocessing – Data Integration

• Data integration can be challenging as it requires combining data
from different sources with different formats, structures, and semantics.
• Techniques used here can include record linkage and data fusion.
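As a rough illustration of record linkage, the sketch below merges two hypothetical sources that name their fields differently and format customer names differently. The source names (`crm_rows`, `billing_rows`), the field names, and the name-based linkage key are all assumptions; real record linkage usually needs fuzzier matching than this.

```python
def normalize_name(name):
    """Build a crude linkage key: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def integrate(crm_rows, billing_rows):
    """Merge two sources into one unified schema: name, city, total_spent."""
    merged = {}
    for row in crm_rows:
        key = normalize_name(row["customer_name"])
        merged[key] = {"name": row["customer_name"], "city": row["city"],
                       "total_spent": None}
    for row in billing_rows:
        key = normalize_name(row["Name"])
        # Reuse the existing record when the keys match, otherwise create one.
        entry = merged.setdefault(
            key, {"name": row["Name"], "city": None, "total_spent": None})
        # Unify units: the billing source stores cents, the target uses currency units.
        entry["total_spent"] = row["amount_cents"] / 100
    return list(merged.values())

# Hypothetical inputs with different field names and name formatting.
crm_rows = [{"customer_name": "Ana Cruz", "city": "Cebu"},
            {"customer_name": "Ben Lim", "city": "Davao"}]
billing_rows = [{"Name": "ANA  CRUZ.", "amount_cents": 12550},
                {"Name": "Carla Tan", "amount_cents": 9900}]
customers = integrate(crm_rows, billing_rows)
```

Records found in only one source are kept with `None` in the fields the other source would have supplied, so the gaps stay visible for the cleaning step.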
Types of Data Preprocessing – Data Transformation

• Data transformation involves converting data into a form that makes the
mining process more efficient.
• It typically consists of several tasks, which depend on
the type of data being transformed.
• Some data transformation techniques might not work if the data is
incompatible with them.
Types of Data Preprocessing – Data Transformation

• Data transformation techniques include smoothing, feature
construction, aggregation, and summarization.
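Two of the techniques listed above, smoothing and aggregation, can be sketched as follows. The centred moving average is only one of several smoothing methods, and the function names and the `month`/`sales` fields are hypothetical.

```python
from collections import defaultdict

def moving_average(values, window=3):
    """Smooth a sequence with a centred moving average (shorter window at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def aggregate_sum(rows, key, field):
    """Aggregate rows by summing `field` for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return dict(totals)

# The spike at 50 is damped by its neighbours after smoothing.
readings = moving_average([10, 12, 50, 14, 16])

# Individual sales rows are summarized into one total per month.
monthly = aggregate_sum(
    [{"month": "Jan", "sales": 100}, {"month": "Jan", "sales": 50},
     {"month": "Feb", "sales": 70}],
    key="month", field="sales")
```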
Types of Data Preprocessing – Data Normalization

• Data normalization involves scaling data to a common range.
• Normalizing the data attempts to give all attributes equal weight, which
makes them easier to compare and analyze.
• This is done because the measurement units of the attributes can
affect the data analysis.
Types of Data Preprocessing – Data Normalization

• All attributes in the data mining process should be expressed in
the same measurement units and should use a common scale or range.
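Two widely used ways of bringing attributes onto a common scale are min-max scaling and z-score standardization. The slides do not say which method is intended, so the sketch below shows both as assumptions; the function names are illustrative.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Income in currency units and age in years end up on comparable scales.
income_scaled = min_max([20000, 50000, 80000])
age_scaled = min_max([18, 40, 62])
```

Min-max scaling is sensitive to extreme values because a single outlier stretches the range, so it is usually applied after outliers have been dealt with.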
Types of Data Preprocessing – Data Reduction

• Data reduction comprises techniques that obtain a reduced
representation of the original data.
• The reduced data maintains the essential structure and integrity of
the original data but is smaller in size.
• This is done because many data mining algorithms slow down
considerably as the amount of data grows.
Types of Data Preprocessing – Data Reduction

• There are three common types of data reduction methods:
• Feature selection
• Instance selection
• Discretization
Types of Data Preprocessing – Data Reduction

Feature Selection
• This reduces the data by removing irrelevant or redundant features.
• It aims to find a minimum set of attributes that still supports the
mining task.
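One simple way to remove irrelevant and redundant features is a filter that drops constant columns and columns that are almost perfectly correlated with one already kept. This is a sketch of that heuristic only; the 0.95 threshold and the column names are assumptions, and many other feature selection methods exist.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_features(columns, max_corr=0.95):
    """Keep columns that vary and are not near-duplicates of a kept column."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) == 0:
            continue  # irrelevant: a constant column cannot distinguish instances
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # redundant: almost the same information as a kept column
        kept.append(name)
    return kept

# Hypothetical dataset: height is recorded twice in different units,
# and every row has the same country code.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "country_code": [1, 1, 1, 1],
    "age": [30, 22, 41, 35],
}
selected = select_features(columns)
```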
Types of Data Preprocessing – Data Reduction

Instance Selection
• This chooses a subset of the total available data that can still
achieve the original purpose of the data mining task.
• It works in a similar manner to statistical sampling methods.
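Since instance selection resembles statistical sampling, a stratified random sample is a reasonable first sketch. The function below is an assumption for illustration, not a specific instance selection algorithm; it keeps roughly the same fraction of rows from each class so that rare classes are not lost.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Randomly keep about `fraction` of the rows within each label group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so no class disappears entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced dataset: 80 "no" rows and 20 "yes" rows.
rows = ([{"id": i, "churned": "no"} for i in range(80)]
        + [{"id": 80 + i, "churned": "yes"} for i in range(20)])
subset = stratified_sample(rows, label_key="churned", fraction=0.1)
```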
Types of Data Preprocessing – Data Reduction

Discretization
• This transforms quantitative (numerical) data into qualitative
(nominal) data.
• The numerical range is divided into intervals, and each interval is
then associated with a discrete value.
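The interval-to-value association can be illustrated with equal-width binning, which is only one of several discretization strategies. The function name and the age labels below are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single interval
    codes = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        codes.append(min(index, n_bins - 1))
    return codes

ages = [18, 25, 33, 47, 52, 60, 78]
codes = equal_width_bins(ages, n_bins=3)   # intervals of width 20 starting at 18

# Each discrete code can then be given a nominal label.
labels = ["young", "middle", "senior"]
age_groups = [labels[c] for c in codes]
```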
Types of Data Preprocessing
• To summarize how these data preprocessing tools work:
• Data Cleaning – How do I clean up the data?
• Data Integration – How do I incorporate and adjust data?
• Data Transformation – How do I provide accurate data?
• Data Normalization – How do I unify and scale data?
• Data Reduction – How do I select the best features of my data?
Steps in Data Preprocessing
• These are the general steps to consider when doing data preprocessing:
• Assess your Data Quality
• Clean your Data
• Transform your Data
• Reduce your Data
• Further Process your Data
Steps in Data Preprocessing
Assess your Data Quality
• Start by looking at your data to get an idea of its overall quality.
• This is where you look at your data collection results and determine
what issues your data may have.
• Once you have identified issues, you then need to determine which
data preprocessing techniques to use.
Steps in Data Preprocessing
Assess your Data Quality
• These are common issues you might need to look at in your data:
• Mismatched Data Types
• Mixed Data Values
• Outliers
• Missing Data
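A quick profile of each column can surface several of these issues before any cleaning is done. The sketch below is an assumption about how such a check might look: it counts missing entries and records which Python types appear in each column, so a column holding both `int` and `str` values stands out as a mismatched data type.

```python
def profile(rows):
    """Count missing values and collect the value types seen in each column."""
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"missing": 0, "types": set()})
            if value is None or value == "":
                stats["missing"] += 1
            else:
                stats["types"].add(type(value).__name__)
    return report

# Hypothetical survey rows: age is stored as both a number and text, one age
# is missing, and the same country is written in two different ways.
rows = [
    {"age": 31, "country": "PH"},
    {"age": "31", "country": "Philippines"},
    {"age": None, "country": "PH"},
]
report = profile(rows)
mixed_type_columns = [c for c, s in report.items() if len(s["types"]) > 1]
```

Mixed data values such as "PH" and "Philippines" share the same type, so they do not show up here; listing the distinct values of each column is a simple complementary check.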
Steps in Data Preprocessing
Clean your Data
• In general, data cleaning should be your first preprocessing method.
• This is because it removes useless, unrelated, corrupted, or incorrect
data that can interfere with other steps.
• This can be done manually, for example by deleting files, or automated
with code or tools.
Steps in Data Preprocessing
Transform your Data
• This is where your data is transformed into a format suitable for
your data analysis tools.
• How you transform your data will depend on what tool you are using
and what analysis you will perform.
• This involves steps such as normalization to further enhance the data.
Steps in Data Preprocessing
Reduce your Data
• You will then want to reduce the size of your overall dataset as
needed to make analysis easier.
• This may not be needed for small datasets but becomes important for
larger datasets.
• This helps ensure that your data analysis does not become too slow or
infeasible.
Steps in Data Preprocessing
Further Process your Data
• You will need to determine if your current data preprocessing
steps are sufficient.
• This is typically done after data analysis to check if the data
preprocessing enhanced the results.
• You can add or remove preprocessing methods if you find that they are
not effective for your dataset.
