03_Data_Preprocessing
Data Preprocessing

01
An Overview of Data Preprocessing

■ Real-world data tend to be noisy, missing, and inconsistent due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
■ Low-quality data will lead to low-quality results → Garbage in, garbage out (GIGO)
Data Preprocessing
■ There are several data preprocessing techniques:
● Data cleaning can be applied to remove noise and correct inconsistencies in data.
● Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
● Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
● Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
■ These techniques are not mutually exclusive; they may work together.
● E.g., data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.

Data Quality: Why Preprocess the Data?
■ Data have quality if they satisfy the requirements of the intended use.
■ There are many factors comprising data quality, including
● Accuracy
● Completeness
● Consistency
● Timeliness
● Believability
● Interpretability
■ Let's see it with an example.
Data Quality: Why Preprocess the Data? (Example)
Imagine that you work for a company that sells products. You carefully inspect the company's database, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.
You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.
In other words, the data you wish to analyze are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).

Data Quality: Why Preprocess the Data? (Example)
■ We can see how we are missing three important attributes defining data quality: accuracy, completeness, and consistency.
● Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
Inaccurate data
■ This is summarized as having incorrect values:
● The data collection instruments used may be faulty.
● There may have been human or computer errors occurring at data entry.
● Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday) → disguised missing data.
● Errors in data transmission can also occur.
● There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption.
● Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date).
● Duplicate tuples also require data cleaning.

Incomplete data
■ Incomplete or missing information
● Attributes of interest may not always be available, such as customer information for sales transaction data.
● Other data may not be included simply because they were not considered important at the time of entry.
● Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
● Data that were inconsistent with other recorded data may have been deleted.
● The recording of the data history or modifications may have been overlooked.
● Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Inconsistent data
■ Data that contain discrepancies
● Data stored in different formats across the dataset
● Conflicting values for the same entity, e.g., different addresses for a patient in the system
● Variability in units of measurement, e.g., some data points in kilometers and some in miles
Major Tasks in Data Preprocessing
■ Three main tasks in Data Preprocessing:
● Data cleaning
● Data integration
● Data reduction

Major Tasks in Data Preprocessing
■ Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
● If users believe the data are dirty, they are unlikely to trust the results of any analysis that has been applied.
● Dirty data can cause confusion for the analysis procedure, resulting in unreliable output.
■ Although a lot of algorithms have mechanisms for dealing with incomplete or noisy data, they are not always robust.
■ Data integration involves integrating data from various sources. E.g., you have a dataset from a bike sharing company and you integrate it with the weather data from each day to try to predict demand.
■ Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
● In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data.
● In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models or nonparametric models.

02
Data cleaning
Data Cleaning: Missing Values
01 Ignore the tuple
02 Fill in the missing value manually
03 Use a global constant to fill in the missing value
04 Use a measure of central tendency for the attribute
05 Use the attribute mean or median for all samples of a class
06 Use the most probable value to fill in the missing value

Data Cleaning: Missing Values
Notes on ignoring the tuple (01):
a. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
b. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.
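As a rough illustration (not from the slides), the sketch below applies strategies 03-05 with pandas and scikit-learn's SimpleImputer; the DataFrame, column names, and values are invented for this example.

```python
# A minimal sketch of filling strategies 03-05; data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 47000, np.nan],
    "class":  ["A", "A", "B", "B", "A"],
})

# 03: fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 04: fill with a measure of central tendency (here, the median)
df["income_median"] = df["income"].fillna(df["income"].median())

# 05: fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

# 04 again, using scikit-learn's SimpleImputer (strategy="mean" or "median")
imputer = SimpleImputer(strategy="mean")
df["income_sklearn"] = imputer.fit_transform(df[["income"]]).ravel()
```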
Data Cleaning: Noisy data
■ Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
● The sorted values are distributed into a number of "buckets" or bins.
■ Because binning methods consult the neighborhood of values, they perform local smoothing.

Data Cleaning: Noisy data
In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
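As a hedged sketch of the smoothing methods just described, the snippet below sorts a few illustrative price values (made up, not the ones from the slide's figure), partitions them into equal-frequency bins of size 3, and applies smoothing by bin means and by bin boundaries.

```python
# Equal-frequency binning followed by smoothing; the price values are illustrative.
import numpy as np

prices = np.array([15, 21, 24, 8, 34, 21, 25, 28, 4])

sorted_prices = np.sort(prices)       # sort first
bins = sorted_prices.reshape(-1, 3)   # equal-frequency bins of size 3

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: replace each value by the closest of min/max
low = bins.min(axis=1, keepdims=True)
high = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - low <= high - bins, low, high)

print(bins)       # [[ 4  8 15] [21 21 24] [25 28 34]]
print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)  # [[ 4  4 15] [21 21 24] [25 25 34]]
```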
Data Cleaning: Noisy data
■ Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
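A brief sketch of regression-based smoothing with scikit-learn; the two attributes x and y below are synthetic and only meant to illustrate fitting the "best" line and replacing noisy values with the fitted ones.

```python
# A rough sketch of smoothing one attribute with linear regression; x and y are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)           # predictor attribute
y = 3.0 * x.ravel() + 10.0 + rng.normal(0.0, 5.0, 50)   # noisy attribute

model = LinearRegression().fit(x, y)   # find the "best" line
y_smoothed = model.predict(x)          # values now conform to that line
```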
03
Data integration
Data integration
■ Data integration aims to consolidate data from different sources
● It is a complex process that is out of scope for this class.
■ Data integration deals with several problems:
● Entity identification: identify that entities in different datasets are the same.
● Redundancy reduction: avoid redundancies when integrating datasets.
● Correlation analysis: identify correlations and discard duplicated data.
● Data value conflict detection and resolution: identify conflicts in the data while doing the integration.
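Returning to the earlier bike-sharing example, a minimal sketch of integrating two sources with pandas could look like the following; the DataFrames and column names are assumptions made purely for illustration.

```python
# Joining daily rental counts with daily weather on a shared date key (toy data).
import pandas as pd

rentals = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-02"],
    "rentals": [531, 602],
})
weather = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-02"],
    "temp_c": [18.2, 21.5],
    "rain_mm": [0.0, 4.3],
})

# Entity identification here is trivial (a shared "date" key); real integration
# often needs fuzzy matching, redundancy checks, and conflict resolution.
combined = rentals.merge(weather, on="date", how="left")
print(combined)
```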
04
Data reduction
Data reduction: Numerosity reduction
■ Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.
● For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples.
● Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
■ Not all these methods are directly applicable to machine learning. We will study regression and clustering as independent tasks during the semester.

Data reduction: Data compression
■ In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
● If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
● If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
■ Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
■ We are not going to apply data compression in this class. When you do image processing, you will see that it is very common to reduce the resolution and number of channels of the image.
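As a small sketch of two of the nonparametric techniques named above (sampling and histograms), using synthetic data:

```python
# Two nonparametric numerosity-reduction ideas: random sampling and a histogram summary.
import numpy as np

rng = np.random.default_rng(42)
values = rng.exponential(scale=100.0, size=100_000)   # synthetic "full" data

# Sampling: keep a small random subset instead of all tuples
sample = rng.choice(values, size=1_000, replace=False)

# Histogram: store only bin edges and counts as a compact summary
counts, edges = np.histogram(values, bins=20)

print(sample.mean(), values.mean())   # the sample approximates the full data
print(counts[:5], edges[:3])
```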
Dimensionality reduction: Principal Components Analysis
■ Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions.
■ Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
■ The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

Dimensionality reduction: PCA
■ Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA "combines" the essence of attributes by creating an alternative, smaller set of variables.
■ The initial data can then be projected onto this smaller set.
■ PCA often reveals relationships that were not previously suspected and thereby allows new interpretations.
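A short sketch of PCA with scikit-learn, projecting 4-dimensional data onto k = 2 principal components; the choice of the iris dataset and the prior standardization step are illustrative assumptions, not part of the slides.

```python
# Projecting n = 4 attributes onto k = 2 orthogonal principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)           # 150 tuples, n = 4 attributes
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                   # keep k = 2 components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)        # share of variance per component
```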
Dimensionality reduction: Feature subset selection
■ Feature subset selection reduces the data set size by removing irrelevant or redundant features (also called attributes or dimensions).
■ The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
■ Mining on a reduced set of attributes has an additional benefit: It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

Dimensionality reduction: Feature subset selection
■ For n features, there are 2^n possible subsets.
● An exhaustive search for the optimal subset of features can be prohibitively expensive, especially as n and the number of data classes increase.
■ For this reason, heuristic methods that explore a reduced search space are commonly used for feature subset selection.
■ These methods are typically greedy in that, while searching through feature space, they always make what looks to be the best choice at the time.
■ Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
■ Such greedy methods are effective in practice and may come close to estimating an optimal solution.
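One way to run such a greedy search in practice is scikit-learn's SequentialFeatureSelector (forward selection); the dataset and the k-NN estimator below are arbitrary choices made for illustration.

```python
# Greedy (forward) feature subset selection: add one feature at a time,
# keeping the choice that looks best under cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # 30 candidate features

knn = KNeighborsClassifier(n_neighbors=5)    # in practice, scale features first
selector = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction="forward", cv=5)
selector.fit(X, y)

print(selector.get_support(indices=True))    # indices of the 5 kept features
```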
Dimensionality reduction: Feature subset selection
■ As we will see in the exercises, there are multiple different methods to do this.
■ In some cases, we may even want to create new features based on others. Such feature construction can help improve accuracy and understanding of structure in high-dimensional data.
● E.g., we may wish to add the feature area based on the features height and width.
■ By combining attributes, feature construction can discover missing information about the relationships between features that can be useful for knowledge discovery.
[Figure: Selecting 2 of the features]
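The area example from the slide, sketched with pandas (the values are invented):

```python
# Feature construction: derive a new attribute "area" from "height" and "width".
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 2.0, 5.0]})
df["area"] = df["height"] * df["width"]   # new attribute combines two others
print(df)
```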
05
Data Transformation and Data Discretization

Data Transformation and Data Discretization
In data transformation, the data are transformed or consolidated into forms appropriate for analyzing. Strategies for data transformation include the following:
■ Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
■ Feature construction, where new attributes are constructed and added from the given set of attributes to help the mining process.
■ Aggregation, where summary or aggregation operations are applied to the data. E.g., the daily sales data may be aggregated so as to compute monthly and annual total amounts (see the sketch after this list).
■ Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
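A tiny sketch of the aggregation strategy, rolling invented daily sales up to monthly totals with pandas:

```python
# Aggregation: summarize daily sales as monthly totals (toy data).
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```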
Normalization
■ Normalizing the data attempts to give all attributes an equal weight.
■ Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering.
● If using the neural network backpropagation algorithm for classification, normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase.
■ For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
■ It is also useful when given no prior knowledge of the data.

Standardization
■ Standardization of datasets is a common requirement for many machine learning estimators.
● They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
■ In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
■ For instance, many elements used in the objective function of a learning algorithm may assume that all features are centered around zero or have variance in the same order.
■ If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
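Both operations are available in the scikit-learn preprocessing module referenced at the end of the deck; the toy matrix below (an income column and a binary column) is invented for illustration.

```python
# Min-max normalization vs. standardization on a small toy matrix.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50_000.0, 0.0],
              [80_000.0, 1.0],
              [65_000.0, 0.0]])   # e.g., income and a binary attribute

X_minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(X)
X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 per column
```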
■ There are several task-specific transformations: embeddings or TF-IDF for NLP, optical flow for video processing, GPS coordinates to semantic locations, clinical terms to vocabulary codes, …
[Figure: Health record]

■ Chapters 2 & 3 in [Han and Kamber, 2006]
■ Scikit-learn offers several tools to preprocess data; you can see the documentation here:
● https://scikit-learn.org/stable/modules/preprocessing.html
■ Extra material:
● https://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
Thanks!