03_Data_Preprocessing

Rubén Sánchez Corcuera

[email protected]

Discussion (5-10 mins): Why would we need data preprocessing? Which problems do you think we may encounter with real-world data?

01 An Overview of Data Preprocessing

Data Preprocessing
■ Real world datasets are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
■ Low-quality data will lead to low-quality results → Garbage in, garbage out (GIGO).
■ There are several data preprocessing techniques:
● Data cleaning can be applied to remove noise and correct inconsistencies in data.
● Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
● Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
● Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
■ These techniques are not mutually exclusive; they may work together.
● E.g., data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.

Data Quality: Why Preprocess the Data?
■ Data have quality if they satisfy the requirements of the intended use.
■ There are many factors comprising data quality, including
● Accuracy
● Completeness
● Consistency
● Timeliness
● Believability
● Interpretability
Let's see it with an example.

Data Quality: Why Preprocess the Data? (Example)
Imagine that you work for a company that sells products. You carefully inspect the company's database, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.
In other words, the data you wish to analyze are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
■ We can see how we are missing three important attributes defining data quality: accuracy, completeness, and consistency.
● Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
Inaccurate data
■ This is summarized as having incorrect values:
● The data collection instruments used may be faulty.
● There may have been human or computer errors occurring at data entry.
● Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday) → disguised missing data.
● Errors in data transmission can also occur.
● There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption.
● Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date).
● Duplicate tuples also require data cleaning.

Incomplete data
■ Incomplete or missing information
● Attributes of interest may not always be available, such as customer information for sales transaction data.
● Other data may not be included simply because they were not considered important at the time of entry.
● Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
● Data that were inconsistent with other recorded data may have been deleted.
● The recording of the data history or modifications may have been overlooked.
● Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Inconsistent data
■ Data that contains discrepancies
● Data stored in different formats across the dataset
● Conflicting values for the same entity, e.g., different addresses for a patient in the system
● Variability in units of measurement: some data points in kilometers and some in meters
● Non-standardized naming conventions
● Different spelling or capitalization conventions for the data (New York vs new york)

Real world is not safe for data
Major Tasks in Data Preprocessing
■ Three main tasks in Data Preprocessing:
● Data cleaning
● Data integration
● Data reduction

■ Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
● If users believe the data are dirty, they are unlikely to trust the results of any analysis that has been applied.
● Dirty data can cause confusion for the analysis procedure, resulting in unreliable output.
■ Although a lot of algorithms have mechanisms for dealing with incomplete or noisy data, they are not always robust.

Major Tasks in Data Preprocessing
■ Data integration involves integrating data from various sources. E.g., you have a dataset from a bike sharing company and you integrate it with the weather data from each day to try to predict demand.
■ Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
● In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data.
● In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models or nonparametric models.

02 Data cleaning
Data Cleaning: Missing Values
There are multiple methods to deal with missing values:
01 Ignore the tuple
02 Fill in the missing value manually
03 Use a global constant to fill in the missing value
04 Use a measure of central tendency for the attribute
05 Use the attribute mean or median for all samples of a class
06 Use the most probable value to fill in the missing value

1. Ignore the tuple: This is usually done when the class label is missing (assuming the task is classification).
a. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
b. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

Data Cleaning: Missing Values
2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "NONE" or -∞.
a. If missing values are replaced by, say, "NONE," then the model may mistakenly think that they form an interesting concept, since they all have a value in common.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
a. For normal (symmetric) data distributions, the mean can be used.
b. For skewed data distributions, the median should be employed.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
a. For example, if classifying customers according to blood pressure, we may replace the missing value with the mean (median if the distribution is skewed) age value for customers in the same blood pressure category as that of the given tuple.
Data Cleaning: Missing Values
6. Use the most probable value to fill in the missing value:
a. This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. E.g., using the other patients' attributes in your data set, you may construct a decision tree to predict the missing values for age.

■ Methods 3 to 6 bias the data → the filled-in value may not be correct.
■ Method 6 is a popular strategy:
● In comparison to the other methods, it uses the most information from the present data to predict missing values.
● By considering the other attributes' values in its estimation of the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
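A minimal sketch of two of these strategies with scikit-learn (the toy age/income values are made up for illustration): SimpleImputer fills with a column median (method 4), and KNNImputer approximates method 6 by estimating the missing entry from the most similar rows, in place of a decision tree or Bayesian model.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[25.0, 50000.0],
              [32.0, np.nan],      # missing income
              [np.nan, 61000.0],   # missing age
              [41.0, 58000.0]])

# Method 4: fill each missing entry with the column median (robust to skew).
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Method 6 (approximation): estimate each missing entry from the 2 most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)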

Data Cleaning: Noisy data
■ Noise is a random error or variance in a measured variable.
● Some basic statistical description techniques and methods of data visualization (e.g., box plots and scatter plots) can be used to identify outliers, which may represent noise.
● Outliers can also be detected with clustering techniques.
■ We can use smoothing techniques to try to deal with noise:
● Binning
● Regression
■ Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
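As a hedged illustration of both views, the sketch below flags outliers with the usual box-plot 1.5×IQR rule and with DBSCAN clustering, which labels points that belong to no cluster as -1; the values are invented.

import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 55.0, 10.7, 9.9])

# Box-plot style rule: flag values far outside the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Clustering view: DBSCAN assigns label -1 to points outside every cluster.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(values.reshape(-1, 1))
cluster_outliers = values[labels == -1]

print(iqr_outliers)       # [55.]
print(cluster_outliers)   # [55.]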
Data Cleaning: Noisy data
■ Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
● The sorted values are distributed into a number of "buckets" or bins.
■ Because binning methods consult the neighborhood of values, they perform local smoothing.

In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
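A minimal sketch of smoothing by bin means in NumPy; the price values are illustrative stand-ins for the slide's figure, not taken from it.

import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bin_size = 3
bins = prices.reshape(-1, bin_size)   # each row is one equal-frequency bin of 3 sorted values

# Smoothing by bin means: replace every value in a bin with that bin's mean.
smoothed = np.repeat(bins.mean(axis=1, keepdims=True), bin_size, axis=1).ravel()
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]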

Data Cleaning: Noisy data
■ Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

03 Data integration
Data integration
■ Data integration aims to consolidate data from different sources.
● It is a complex process that is out of scope for this class.
■ Data integration deals with several problems:
● Entity identification: identify that entities in different datasets are the same.
● Redundancy reduction: avoid redundancies when integrating datasets.
● Correlation analysis: identify correlations and discard duplicated data.
● Data value conflict detection and resolution: identify conflicts in the data while doing the integration.

04 Data reduction

Data reduction
■ Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
■ Processing the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
● Remember our discussion about the curse of dimensionality.
■ Data reduction strategies:
● Dimensionality reduction
● Numerosity reduction
● Data compression

Data reduction: Dimensionality reduction
■ Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.
● Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
● Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Data reduction: Numerosity reduction
■ Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
● For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples.
● Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
■ Not all these methods are directly applicable to machine learning. We will study regression and clustering as independent tasks during the semester.

Data reduction: Data compression
■ In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
● If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
● If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
■ Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
■ We are not going to apply data compression in this class. When you do image processing, you will see that it is very common to reduce the resolution and number of channels of an image.

Dimensionality reduction: Wavelets
■ The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
■ Very useful when working with sensor data.
■ The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. For example, all wavelet coefficients larger than some user-specified threshold can be retained.

Example of DWT
[Figure: example of a discrete wavelet transform]
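A minimal sketch of this truncation idea using the PyWavelets package (pywt, assumed available; the noisy sine signal is invented): compute a multilevel DWT, zero out the weaker coefficients, and reconstruct an approximation from the strongest ones.

import numpy as np
import pywt

signal = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * np.random.randn(64)

coeffs = pywt.wavedec(signal, "haar", level=3)   # multilevel DWT of the data vector
flat, slices = pywt.coeffs_to_array(coeffs)

# Keep only the strongest 25% of the wavelet coefficients, zeroing the rest.
threshold = np.quantile(np.abs(flat), 0.75)
flat[np.abs(flat) < threshold] = 0.0

approx = pywt.waverec(pywt.array_to_coeffs(flat, slices, output_format="wavedec"), "haar")
print(np.abs(signal - approx).max())   # reconstruction error of the truncated transform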
Dimensionality reduction: Principal Components Analysis
■ Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions.
■ Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n.
■ The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

Dimensionality reduction: PCA
■ Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA "combines" the essence of attributes by creating an alternative, smaller set of variables.
■ The initial data can then be projected onto this smaller set.
■ PCA often reveals relationships that were not previously suspected and thereby allows new interpretations.

Dimensionality reduction: PCA
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
● As you will see in our examples, we are going to standardize them.
2. PCA computes k vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.

[Figure: transforming the dataset from 4 to 2 dimensions using PCA]
Dimensionality reduction: PCA
3. The principal components are sorted in order of decreasing "significance" or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on.
4. Because the components are sorted in decreasing order of "significance," the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.

■ PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
■ Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions.
■ Principal components may be used as inputs to multiple regression and cluster analysis.
■ In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
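A minimal sketch of this procedure with scikit-learn, using the iris data as a stand-in for the 4-to-2-dimension example in the slides: standardize first (step 1), then fit PCA and keep only the 2 strongest components (steps 2-4).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                         # 150 samples x 4 attributes

X_std = StandardScaler().fit_transform(X)    # step 1: put all attributes on the same scale
pca = PCA(n_components=2)                    # steps 2-4: compute components, keep the 2 strongest
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)         # variance captured by each kept component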

Dimensionality reduction: Feature subset selection
■ Datasets for analysis may contain hundreds of features, many of which may be irrelevant to the mining task or redundant.
● E.g., imagine a medical dataset with the clinical history, personal data, -omics data, images from the radiographies, tomographies…, demographic data of the target population, epidemiologic data…
■ Although it may be possible for a domain expert to pick out some of the useful features, this can be a difficult and time-consuming task, especially when the data's behavior is not well known.

■ Leaving out relevant features or keeping irrelevant features may be detrimental, causing confusion for the mining algorithm employed.
■ This can result in discovered patterns of poor quality.
■ In addition, the added volume of irrelevant or redundant features can slow down the mining process.
Dimensionality reduction: Feature subset selection
■ Feature subset selection reduces the data set size by removing irrelevant or redundant features (also called attributes or dimensions).
■ The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
■ Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

■ For n features, there are 2^n possible subsets.
● An exhaustive search for the optimal subset of features can be prohibitively expensive, especially as n and the number of data classes increase.
■ For this reason, heuristic methods that explore a reduced search space are commonly used for feature subset selection.
■ These methods are typically greedy in that, while searching through feature space, they always make what looks to be the best choice at the time.
■ Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
■ Such greedy methods are effective in practice and may come close to estimating an optimal solution.

Dimensionality reduction: Feature subset selection
■ As we will see in the exercises, there are multiple different methods to do this (see the sketch below).
■ In some cases, we may even want to create new features based on others. Such feature construction can help improve accuracy and understanding of structure in high dimensional data.
● E.g., we may wish to add the feature area based on the features height and width.
■ By combining attributes, feature construction can discover missing information about the relationships between features that can be useful for knowledge discovery.

[Figure: selecting 2 of the features]
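A minimal sketch of both ideas with scikit-learn and pandas: SelectKBest keeps the 2 iris features most associated with the class (one possible selection method among many), and a derived area column illustrates feature construction; the height/width values are invented.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature subset selection: keep the 2 features most associated with the class,
# scored here with an ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())     # boolean mask of the kept features

# Feature construction: derive a new attribute from existing ones.
df = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 1.0, 5.0]})
df["area"] = df["height"] * df["width"]
print(df)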
05 Data Transformation and Data Discretization

Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for analysis. Strategies for data transformation include the following:
■ Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
■ Feature construction, where new attributes are constructed and added from the given set of attributes to help the mining process.
■ Aggregation, where summary or aggregation operations are applied to the data. E.g., the daily sales data may be aggregated so as to compute monthly and annual total amounts.
■ Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

Data Transformation and Data Discretization
■ Standardization, where individual features are transformed to look like standard normally distributed data: Gaussian with zero mean and unit variance.
■ Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
■ Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
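A minimal sketch of discretization with scikit-learn's KBinsDiscretizer, turning raw ages (invented values) into three ordinal interval labels, roughly in the spirit of youth/adult/senior.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[6.0], [15.0], [23.0], [37.0], [48.0], [72.0]])

# Three equal-width bins, encoded as ordinal labels 0, 1, 2.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = discretizer.fit_transform(ages)

print(age_bins.ravel())          # interval label for each age
print(discretizer.bin_edges_)    # the learned interval boundaries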
Normalization
■ The measurement unit used can affect the data analysis.
● E.g., changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
■ In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight."
■ To help avoid dependence on the choice of measurement units, the data should be normalized. This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0].
■ Normalizing the data attempts to give all attributes an equal weight.
■ Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering.
● If using the neural network backpropagation algorithm for classification, normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase.
■ For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
■ It is also useful when given no prior knowledge of the data.

Standardization
■ Standardization of datasets is a common requirement for many machine learning estimators.
● They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
■ In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
■ For instance, many elements used in the objective function of a learning algorithm may assume that all features are centered around zero or have variance in the same order.
■ If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
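A minimal sketch contrasting the two with scikit-learn on an invented age/income table: MinMaxScaler rescales each column into [0, 1] (normalization), while StandardScaler centers each column and divides by its standard deviation (standardization).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[30.0, 20000.0],
              [45.0, 98000.0],
              [28.0, 54000.0]])   # columns: age, income (very different ranges)

X_minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(X)   # each column scaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)                       # zero mean, unit variance per column

print(X_minmax)
print(X_standard)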

Task specific transformations
■ There are several task specific transformations: embeddings or TF-IDF for NLP, optical flow for video processing, GPS coordinates to semantic locations, clinical terms to vocabulary codes, …
[Figure: mapping a health record to vocabulary codes with Athena from OHDSI]

Further reading
■ Chapters 2 & 3 in [Han and Kamber, 2006]
■ Scikit-learn offers several tools to preprocess data; you can see the documentation here:
● https://scikit-learn.org/stable/modules/preprocessing.html
Extra material:
■ https://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
Exercises
■ Let's take a look at PCA, feature selection and the use of standardization in scikit-learn with the colabs that you have in ALUD.
■ We will see how to apply them and we will do some exercises.

Do you have any questions?
[email protected]

Thanks!
