03_Data_Preprocessing

Rubén Sánchez Corcuera

[email protected]

Discussion (5-10 mins): Why would we need data preprocessing? Which problems do you think we may encounter with real-world data?

01 An Overview of Data Preprocessing

Data Preprocessing
■ Real world datasets are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
■ Low-quality data will lead to low-quality results → Garbage in, garbage out (GIGO).
■ There are several data preprocessing techniques:
● Data cleaning can be applied to remove noise and correct inconsistencies in data.
● Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
● Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
● Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
■ These techniques are not mutually exclusive; they may work together.
● E.g., data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.

Data Quality: Why Preprocess the Data?
■ Data have quality if they satisfy the requirements of the intended use.
■ There are many factors comprising data quality, including
● Accuracy
● Completeness
● Consistency
● Timeliness
● Believability
● Interpretability
Let's see it with an example.

Data Quality: Why Preprocess the Data? (Example)
Imagine that you work for a company that sells products. You carefully inspect the company's database, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.
In other words, the data you wish to analyze are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
■ We can see how we are missing three important attributes defining data quality: accuracy, completeness, and consistency.
● Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
Inaccurate data
■ This is summarized as having incorrect values:
● The data collection instruments used may be faulty.
● There may have been human or computer errors occurring at data entry.
● Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday) → disguised missing data.
● Errors in data transmission can also occur.
● There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption.
● Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date).
● Duplicate tuples also require data cleaning.

Incomplete data
■ Incomplete or missing information
● Attributes of interest may not always be available, such as customer information for sales transaction data.
● Other data may not be included simply because they were not considered important at the time of entry.
● Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
● Data that were inconsistent with other recorded data may have been deleted.
● The recording of the data history or modifications may have been overlooked.
● Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Inconsistent data
■ Data that contains discrepancies
● Data stored in different formats across the dataset
● Conflicting values for the same entity, e.g., different addresses for a patient in the system
● Variability in units of measurement: some data points in kilometers and some in meters
● Non-standardized naming conventions
● Different spelling or capitalization conventions for the data (New York vs new york)

Real world is not safe for data
Major Tasks in Data Preprocessing
■ Three main tasks in Data Preprocessing:
● Data cleaning
● Data integration
● Data reduction

■ Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
● If users believe the data are dirty, they are unlikely to trust the results of any analysis that has been applied.
● Dirty data can cause confusion for the analysis procedure, resulting in unreliable output.
■ Although a lot of algorithms have mechanisms for dealing with incomplete or noisy data, they are not always robust.

Major Tasks in Data Preprocessing
■ Data integration involves integrating data from various sources. E.g., you have a dataset from a bike sharing company and you integrate it with the weather data from each day to try to predict demand.
■ Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
● In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data.
● In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models or nonparametric models.

02 Data cleaning
Data Cleaning: Missing Values
There are multiple methods to deal with missing values:
01 Ignore the tuple
02 Fill in the missing value manually
03 Use a global constant to fill in the missing value
04 Use a measure of central tendency for the attribute
05 Use the attribute mean or median for all samples of a class
06 Use the most probable value to fill in the missing value

1. Ignore the tuple: This is usually done when the class label is missing (assuming the task is classification).
a. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
b. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

Data Cleaning: Missing Values
2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "NONE" or -∞.
a. If missing values are replaced by, say, "NONE," then the model may mistakenly think that they form an interesting concept, since they all have a value in common.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
a. For normal (symmetric) data distributions, the mean can be used.
b. For skewed data distributions, the median should be employed.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
a. For example, if classifying customers according to blood pressure, we may replace the missing value with the mean (median if the distribution is skewed) age value for customers in the same blood pressure category as that of the given tuple.
Data Cleaning: Missing Values
6. Use the most probable value to fill in the missing value:
a. This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. E.g., using the other patients' attributes in your data set, you may construct a decision tree to predict the missing values for age.

■ Methods 3 to 6 bias the data → the filled-in value may not be correct.
■ Method 6 is a popular strategy:
● In comparison to the other methods, it uses the most information from the present data to predict missing values.
● By considering the other attributes' values in its estimation of the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
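A minimal sketch of two of these strategies with scikit-learn (the toy age/income values are made up for illustration): SimpleImputer fills with a column median (method 4), and KNNImputer approximates method 6 by estimating the missing entry from the most similar rows, in place of a decision tree or Bayesian model.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[25.0, 50000.0],
              [32.0, np.nan],      # missing income
              [np.nan, 61000.0],   # missing age
              [41.0, 58000.0]])

# Method 4: fill each missing entry with the column median (robust to skew).
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Method 6 (approximation): estimate each missing entry from the 2 most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)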

Data Cleaning: Noisy data
■ Noise is a random error or variance in a measured variable.
● Some basic statistical description techniques and methods of data visualization (e.g., box plots and scatter plots) can be used to identify outliers, which may represent noise.
● Outliers can also be detected with clustering techniques.
■ We can use smoothing techniques to try to deal with noise:
● Binning
● Regression
■ Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
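As a hedged illustration of both views, the sketch below flags outliers with the usual box-plot 1.5×IQR rule and with DBSCAN clustering, which labels points that belong to no cluster as -1; the values are invented.

import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 55.0, 10.7, 9.9])

# Box-plot style rule: flag values far outside the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Clustering view: DBSCAN assigns label -1 to points outside every cluster.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(values.reshape(-1, 1))
cluster_outliers = values[labels == -1]

print(iqr_outliers)       # [55.]
print(cluster_outliers)   # [55.]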
Data Cleaning: Noisy data
■ Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
● The sorted values are distributed into a number of "buckets" or bins.
■ Because binning methods consult the neighborhood of values, they perform local smoothing.

In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
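A minimal sketch of smoothing by bin means in NumPy; the price values are illustrative stand-ins for the slide's figure, not taken from it.

import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bin_size = 3
bins = prices.reshape(-1, bin_size)   # each row is one equal-frequency bin of 3 sorted values

# Smoothing by bin means: replace every value in a bin with that bin's mean.
smoothed = np.repeat(bins.mean(axis=1, keepdims=True), bin_size, axis=1).ravel()
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]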

Data Cleaning: Noisy data
■ Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.

03 Data integration
Data integration
■ Data integration aims to consolidate data from different sources.
● It is a complex process that is out of scope for this class.
■ Data integration deals with several problems:
● Entity identification: identify that entities in different datasets are the same.
● Redundancy reduction: avoid redundancies when integrating datasets.
● Correlation analysis: identify correlations and discard duplicated data.
● Data value conflict detection and resolution: identify conflicts in the data while doing the integration.

04 Data reduction

Data reduction
■ Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
■ Processing the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
● Remember our discussion about the curse of dimensionality.
■ Data reduction strategies:
● Dimensionality reduction
● Numerosity reduction
● Data compression

Data reduction: Dimensionality reduction
■ Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.
● Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
● Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Data reduction: Numerosity reduction
■ Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
● For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples.
● Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
■ Not all these methods are directly applicable to machine learning. We will study regression and clustering as independent tasks during the semester.

Data reduction: Data compression
■ In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
● If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
● If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
■ Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
■ We are not going to apply data compression in this class. When you do image processing, you will see that it is very common to reduce the resolution and number of channels of an image.

Dimensionality reduction: Wavelets
■ The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
■ Very useful when working with sensor data.
■ The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. For example, all wavelet coefficients larger than some user-specified threshold can be retained.

Example of DWT
[Figure: example of a discrete wavelet transform]
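A minimal sketch of this truncation idea using the PyWavelets package (pywt, assumed available; the noisy sine signal is invented): compute a multilevel DWT, zero out the weaker coefficients, and reconstruct an approximation from the strongest ones.

import numpy as np
import pywt

signal = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * np.random.randn(64)

coeffs = pywt.wavedec(signal, "haar", level=3)   # multilevel DWT of the data vector
flat, slices = pywt.coeffs_to_array(coeffs)

# Keep only the strongest 25% of the wavelet coefficients, zeroing the rest.
threshold = np.quantile(np.abs(flat), 0.75)
flat[np.abs(flat) < threshold] = 0.0

approx = pywt.waverec(pywt.array_to_coeffs(flat, slices, output_format="wavedec"), "haar")
print(np.abs(signal - approx).max())   # reconstruction error of the truncated transform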
Dimensionality reduction: Principal Components Analysis
■ Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions.
■ Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n.
■ The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

Dimensionality reduction: PCA
■ Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA "combines" the essence of attributes by creating an alternative, smaller set of variables.
■ The initial data can then be projected onto this smaller set.
■ PCA often reveals relationships that were not previously suspected and thereby allows new interpretations.

Dimensionality reduction: PCA
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with large domains will not dominate attributes with smaller domains.
● As you will see in our examples, we are going to standardize them.
2. PCA computes k vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.

[Figure: transforming the dataset from 4 to 2 dimensions using PCA]
Dimensionality reduction: PCA
3. The principal components are sorted in order of decreasing "significance" or strength. The principal components essentially serve as a new set of axes for the data, providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on.
4. Because the components are sorted in decreasing order of "significance," the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.

■ PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data.
■ Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions.
■ Principal components may be used as inputs to multiple regression and cluster analysis.
■ In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
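A minimal sketch of this procedure with scikit-learn, using the iris data as a stand-in for the 4-to-2-dimension example in the slides: standardize first (step 1), then fit PCA and keep only the 2 strongest components (steps 2-4).

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                         # 150 samples x 4 attributes

X_std = StandardScaler().fit_transform(X)    # step 1: put all attributes on the same scale
pca = PCA(n_components=2)                    # steps 2-4: compute components, keep the 2 strongest
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)         # variance captured by each kept component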

Dimensionality reduction: Feature subset selection
■ Datasets for analysis may contain hundreds of features, many of which may be irrelevant to the mining task or redundant.
● E.g., imagine a medical dataset with the clinical history, personal data, -omics data, images from the radiographies, tomographies…, demographic data of the target population, epidemiologic data…
■ Although it may be possible for a domain expert to pick out some of the useful features, this can be a difficult and time-consuming task, especially when the data's behavior is not well known.

■ Leaving out relevant features or keeping irrelevant features may be detrimental, causing confusion for the mining algorithm employed.
■ This can result in discovered patterns of poor quality.
■ In addition, the added volume of irrelevant or redundant features can slow down the mining process.
Dimensionality reduction: Feature subset selection
■ Feature subset selection reduces the data set size by removing irrelevant or redundant features (also called attributes or dimensions).
■ The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
■ Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

■ For n features, there are 2^n possible subsets.
● An exhaustive search for the optimal subset of features can be prohibitively expensive, especially as n and the number of data classes increase.
■ For this reason, heuristic methods that explore a reduced search space are commonly used for feature subset selection.
■ These methods are typically greedy in that, while searching through feature space, they always make what looks to be the best choice at the time.
■ Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
■ Such greedy methods are effective in practice and may come close to estimating an optimal solution.

Dimensionality reduction: Feature subset selection
■ As we will see in the exercises, there are multiple different methods to do this (see the sketch below).
■ In some cases, we may even want to create new features based on others. Such feature construction can help improve accuracy and understanding of structure in high dimensional data.
● E.g., we may wish to add the feature area based on the features height and width.
■ By combining attributes, feature construction can discover missing information about the relationships between features that can be useful for knowledge discovery.

[Figure: selecting 2 of the features]
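A minimal sketch of both ideas with scikit-learn and pandas: SelectKBest keeps the 2 iris features most associated with the class (one possible selection method among many), and a derived area column illustrates feature construction; the height/width values are invented.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature subset selection: keep the 2 features most associated with the class,
# scored here with an ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())     # boolean mask of the kept features

# Feature construction: derive a new attribute from existing ones.
df = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 1.0, 5.0]})
df["area"] = df["height"] * df["width"]
print(df)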
05 Data Transformation and Data Discretization

Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for analysis. Strategies for data transformation include the following:
■ Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
■ Feature construction, where new attributes are constructed and added from the given set of attributes to help the mining process.
■ Aggregation, where summary or aggregation operations are applied to the data. E.g., the daily sales data may be aggregated so as to compute monthly and annual total amounts.
■ Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.

Data Transformation and Data Discretization
■ Standardization, where individual features are transformed to look like standard normally distributed data: Gaussian with zero mean and unit variance.
■ Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
■ Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
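A minimal sketch of discretization with scikit-learn's KBinsDiscretizer, turning raw ages (invented values) into three ordinal interval labels, roughly in the spirit of youth/adult/senior.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[6.0], [15.0], [23.0], [37.0], [48.0], [72.0]])

# Three equal-width bins, encoded as ordinal labels 0, 1, 2.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = discretizer.fit_transform(ages)

print(age_bins.ravel())          # interval label for each age
print(discretizer.bin_edges_)    # the learned interval boundaries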
Normalization
■ The measurement unit used can affect the data analysis.
● E.g., changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
■ In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight."
■ To help avoid dependence on the choice of measurement units, the data should be normalized. This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0].
■ Normalizing the data attempts to give all attributes an equal weight.
■ Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering.
● If using the neural network backpropagation algorithm for classification, normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase.
■ For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
■ It is also useful when given no prior knowledge of the data.

Standardization
■ Standardization of datasets is a common requirement for many machine learning estimators.
● They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
■ In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
■ For instance, many elements used in the objective function of a learning algorithm may assume that all features are centered around zero or have variance in the same order.
■ If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
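A minimal sketch contrasting the two with scikit-learn on an invented age/income table: MinMaxScaler rescales each column into [0, 1] (normalization), while StandardScaler centers each column and divides by its standard deviation (standardization).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[30.0, 20000.0],
              [45.0, 98000.0],
              [28.0, 54000.0]])   # columns: age, income (very different ranges)

X_minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(X)   # each column scaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)                       # zero mean, unit variance per column

print(X_minmax)
print(X_standard)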

Task specific transformations
■ There are several task specific transformations: embeddings or TF-IDF for NLP, optical flow for video processing, GPS coordinates to semantic locations, clinical terms to vocabulary codes, …
[Figure: mapping a health record to vocabulary codes with Athena from OHDSI]

Further reading
■ Chapters 2 & 3 in [Han and Kamber, 2006]
■ Scikit-learn offers several tools to preprocess data; you can see the documentation here:
● https://scikit-learn.org/stable/modules/preprocessing.html
Extra material:
■ https://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
Exercises
■ Let's take a look at PCA, feature selection and the use of standardization in scikit-learn with the colabs that you have in ALUD.
■ We will see how to apply them and we will do some exercises.

Do you have any questions?
[email protected]

Thanks!
