
Chapter 2: Data Exploration, Preprocessing and Visualization

Contents

● Non-graphical and graphical methods for exploring data
● Data cleaning: Missing values, noisy data, inconsistent data
● Data integration
● Data transformation
● Data reduction
● Discretization and generating concept hierarchies
● Data visualization tools and techniques
Getting to know your data
Before jumping into mining, we need to familiarize ourselves with the data.

● What are the types of attributes that make up your data?
● What kind of values does each attribute have?
● Which attributes are discrete, and which are continuous-valued?
● What do the data look like?
● How are the values distributed?
● Are there ways we can visualize the data to get a better sense of it all?
● Can we spot any outliers?
● Can we measure the similarity of some data objects with respect to others? etc.
Exploratory Data Analysis (EDA)
An approach for data analysis that employs a variety of techniques for

● Better understanding the data
● Detection of mistakes
● Checking of assumptions
● Preliminary selection of appropriate models
● Determining relationships among the explanatory variables, and
● Assessing the direction and rough size of relationships between explanatory and outcome variables
Exploratory data analysis (EDA)
EDA can be graphical or non-graphical.

Non-graphical methods generally involve calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way.
Non-graphical methods: Univariate data
● Categorical data:
A simple tabulation of the frequency of each category

Source: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf

Non-graphical methods: Univariate data
● Quantitative data: Characteristics of the population distribution of the variable, estimated from the observed sample
○ Central tendency: mean, median, mode etc.
○ Spread (an indicator of how far away from the center we are still likely to find data
values): variance, standard deviation, interquartile range (IQR)
○ Modality (number of peaks in the pdf)
○ Shape: Skewness (measure of asymmetry), Kurtosis (measure of peakedness)
○ Outliers (an observation that lies "far away" from other values)
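
A minimal sketch of computing these summary statistics with pandas (the Series `x` below is hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample of one quantitative attribute
x = pd.Series([12, 15, 15, 18, 21, 22, 24, 30, 95])

print(x.mean(), x.median(), x.mode().tolist())                 # central tendency
print(x.var(), x.std(), x.quantile(0.75) - x.quantile(0.25))   # spread: variance, std, IQR
print(x.skew(), x.kurt())                                      # shape: skewness and (excess) kurtosis
```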

Non-graphical methods: Bivariate data
● Categorical data:
○ Cross-tabulation
○ Measures of association: Chi-squared test, Cramér's V, etc.
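
A minimal sketch of a cross-tabulation, a chi-squared test, and Cramér's V, assuming pandas and SciPy are available (the data frame and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical bivariate categorical data
df = pd.DataFrame({"gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
                   "preferred": ["A", "B", "B", "A", "A", "A", "B", "B"]})

table = pd.crosstab(df["gender"], df["preferred"])   # cross-tabulation
chi2, p, dof, expected = chi2_contingency(table)     # chi-squared test of independence

# Cramér's V derived from the chi-squared statistic
n = table.values.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(table, chi2, p, v, sep="\n")
```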

Non-graphical methods: Bivariate data
● Quantitative data:
○ Covariance: a measure of how much two variables “co-vary”
■ Positive covariance: when one measurement is above its mean, the other will probably also be above its mean, and vice versa
■ Negative covariance: when one variable is above its mean, the other tends to be below its mean
■ Covariance near zero: no linear co-variation between the two variables (they may still be related non-linearly)
○ Correlation:
Statistical relationship between two variables
Commonly used correlation coefficient: Pearson’s correlation coefficient
■ -1 = data lie on a perfect straight line with a negative slope
■ 0 = no linear relationship between the variables
■ 1 = data lie on a perfect straight line with a positive slope
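
A minimal sketch with pandas, covering both covariance and Pearson's correlation (the paired measurements are hypothetical):

```python
import pandas as pd

# Hypothetical paired measurements of two quantitative variables
df = pd.DataFrame({"height": [150, 160, 165, 172, 180],
                   "weight": [50, 58, 63, 70, 80]})

print(df["height"].cov(df["weight"]))    # sample covariance
print(df["height"].corr(df["weight"]))   # Pearson's correlation coefficient
```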

Non-graphical methods: Multivariate data
● Covariance and correlation matrices:
Pairwise covariances and/or correlations assembled into a matrix
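
With pandas, the pairwise matrices can be obtained directly (the data frame below is hypothetical):

```python
import pandas as pd

# Hypothetical multivariate data
df = pd.DataFrame({"income":       [30, 45, 60, 75, 90],
                   "credit_limit": [10, 15, 22, 30, 40],
                   "age":          [25, 52, 33, 61, 40]})

print(df.cov())    # pairwise covariances assembled into a matrix
print(df.corr())   # pairwise Pearson correlations assembled into a matrix
```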

Graphical methods: Univariate data

● Bar chart
● Histogram
● Density plot
● Box and whiskers plot
● Time-series plot
Bar plot
Bar plots display the distribution (frequencies) of a
categorical variable through vertical or horizontal
bars.

Histogram
Histograms are constructed by binning the
data and counting the number of
observations in each bin

They are used to visualize the shape of the distribution.

Histogram
Bar plot vs histogram

● Bar plots use categorical data
● Histograms use continuous data grouped into bins, so each bar represents a range of values
● Normally, bars in bar plots do not touch each other
● Bars in histograms touch each other to indicate that adjacent bins cover contiguous ranges of a continuous variable
Density plot
Density plots can be thought of as plots of
smoothed histograms

Density plot
The smoothness is controlled by a bandwidth
parameter that is analogous to the histogram
binwidth
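
A minimal plotting sketch, assuming matplotlib and SciPy are installed (the sample data are hypothetical; `bw_method` plays the role of the bandwidth):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical quantitative attribute
x = pd.Series(np.random.default_rng(0).normal(loc=50, scale=10, size=500))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
x.plot.hist(bins=20, ax=axes[0], title="Histogram")          # bin the data and count per bin
x.plot.kde(bw_method=0.3, ax=axes[1], title="Density plot")  # bandwidth controls the smoothness
plt.tight_layout()
plt.show()
```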

Box and whiskers plot
Presents information about the central
tendency, symmetry, and skew as well as
outliers

Box and whiskers plot

Source: https://sites.google.com/site/davidsstatistics/home/notched-box-plots https://en.wikipedia.org/wiki/Box_plot

Time series plot

A time-series plot shows the values of a variable observed at successive points in time t.
Graphical methods: Bivariate data

● Scatter plot
● Regression
● Box and whiskers plot
● Bar chart
Scatter plot
Uses Cartesian coordinates to show the relationship between two variables of a set
of data

Box and whiskers plot

Bar plot
Stacked bar plot

Bar plot
Grouped bar plot

Graphical methods: Multivariate data

● Scatter plot matrix
● Bubble plot
● Line chart
Scatter plot matrix
Can be used to roughly determine if there
is a linear correlation between multiple
variables
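
A minimal sketch using pandas' built-in scatter-matrix helper (hypothetical data; `y` is constructed to be linearly related to `x`, while `z` is not):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical multivariate data
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)   # linearly related to x
df["z"] = rng.normal(size=200)                            # unrelated to x and y

scatter_matrix(df, diagonal="kde")   # off-diagonal panels hint at pairwise linear correlation
plt.show()
```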

Bubble plot
Bubble plots can display the
relationship between three
quantitative variables

A bubble plot is basically a 2D scatter plot that uses the size of the plotted point to represent the value of the third variable.

Line chart
Connecting the points in a scatter plot, moving from left to right, gives a line plot. Multiple lines can be drawn in the same plot.

Graphical methods
Quantitative
● Univariate: Histograms, Density plots, Box and whiskers plots, Time-series plots
● Bivariate: Scatterplots
● Multivariate: Scatterplot matrix, Bubble plots (3 variables)

Categorical
● Univariate: Pie charts, Bar graphs

Quantitative & Categorical
● Univariate: Box and whiskers plots
● Bi- or multi-variate: Bar graphs
Why preprocess the data?
In many practical situations

● Data contains too many attributes, and some of them are clearly irrelevant and
redundant
● Data is incomplete (some values are missing)
● Data is inaccurate or noisy (contains errors, or values that deviate from the
expected)
● Data is inconsistent (e.g., containing discrepancies in the department codes)

Garbage-In-Garbage-Out (GIGO): Low-quality data will lead to low-quality mining results.
Data quality
● Accuracy - whether the dataset contains errors, or values that deviate from the
expected
● Completeness - whether attribute values or certain attributes of interest are lacking, or only aggregate data are available
● Consistency - containing discrepancies between the attribute values
● Timeliness - how timely the data are updated
● Believability - how much the data are trusted by users
● Interpretability - how easy the data are understood
Why is real-world data dirty?
● Inaccurate (or noisy) data may come from
○ Faulty data collection instruments or methods
○ Human or computer errors while entering data
○ Incorrect data purposely submitted by users (aka disguised missing data)
○ Technological limitations
○ Inconsistencies in naming conventions or inconsistent formats for input fields
○ etc.
● Inconsistent data may come from
○ Integration of data from various sources
○ Modification of linked data
○ etc.
● Incomplete data may come from
○ Unavailability of attributes of interest
○ Recording missed due to equipment malfunctions
○ Deletion of inconsistent data
○ Data not submitted in a timely fashion
○ etc.
Major Tasks in Data Preprocessing
● Data cleaning
○ Fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies,
detect and remove redundancies
● Data integration
○ Include data from multiple sources (e.g., multiple databases, data cubes, files etc.)
● Data reduction
○ Obtain a reduced representation of the data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results
● Data transformation and data discretization
○ Normalize data values, replace raw data values by ranges or higher conceptual levels (e.g.,
replacing age by higher-level concepts, such as youth, adult or senior) etc.
Data cleaning
● Tasks
○ Handle missing values,
○ Smooth noisy data,
○ Identify or remove outliers,
○ Resolve inconsistencies,
○ Detect and remove redundancies
Missing data
● Data may not always be available
● Incomplete data may come from
○ Unavailability of attributes of interest
○ Recording missed due to equipment malfunctions
○ Deletion of inconsistent data
○ Data not submitted in a timely fashion
○ etc.
● Missing data may need to be inferred
Handling missing values
● Ignore the tuple
○ Usually done when the class label is missing
○ Poor when the percentage of missing values per attribute varies considerably
● Use a global constant (e.g., NA, Unknown, -∞ etc.) to fill in the missing value
○ This method is simple but not foolproof as the mining program may mistakenly think that they
form an interesting concept
● Fill in the missing values manually
○ Time consuming
○ May not be feasible for a large data set
● Fill in the missing values automatically
○ Using random values
○ Using a measure of central tendency for the attribute (e.g., mean, median)
○ Using the attribute mean or median for all samples belonging to the same class
○ Using the most probable value (e.g., with regression, decision tree, Bayesian inference etc.)
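
A minimal sketch of common automatic fills with pandas (the data frame, column names, and fill constants are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({"age":    [23, np.nan, 45, 31, np.nan],
                   "income": [48000, 52000, np.nan, 61000, 39000],
                   "class":  ["yes", "no", "yes", "no", "yes"]})

dropped  = df.dropna()                              # ignore (drop) incomplete tuples
constant = df.fillna({"age": -1, "income": -1})     # fill with a global constant
by_mean  = df.fillna(df[["age", "income"]].mean())  # fill with the attribute mean

# Fill with the attribute mean of all samples belonging to the same class
by_class = df.copy()
by_class[["age", "income"]] = (df.groupby("class")[["age", "income"]]
                                 .transform(lambda s: s.fillna(s.mean())))
```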
Noisy data
● Noise is a random error or variance in a measured variable.
● Inaccurate (or noisy) data may come from
○ Faulty data collection instruments or methods
○ Human or computer errors while entering data
○ Incorrect data purposely submitted by users (aka disguised missing data)
○ Technological limitations
○ Inconsistencies in naming conventions or inconsistent formats for input fields
○ etc.
Handling noisy data
● Binning
○ First sort the data, and distribute the sorted
values into a number of buckets or bins
(equal-frequency bins or equal-width bins)
○ Then replace each value in a bin by the mean of
the bin (smoothing by bin means), or by the
median of the bin (smoothing by bin medians), or
by the closest boundary value (smoothing by bin
boundaries)
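
A minimal sketch of smoothing by bin means and medians with pandas, using equal-frequency bins (the attribute values and the choice of 3 bins are hypothetical):

```python
import pandas as pd

# Hypothetical sorted attribute values
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(values, q=3)                            # equal-frequency (equal-depth) bins
by_mean   = values.groupby(bins).transform("mean")     # smoothing by bin means
by_median = values.groupby(bins).transform("median")   # smoothing by bin medians
print(pd.DataFrame({"value": values, "bin": bins,
                    "by_mean": by_mean, "by_median": by_median}))
```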
Handling noisy data
● Regression
○ Fitting the data into regression functions
○ In (simple) linear regression, the data are modeled
to fit a straight line.
● Clustering / Outlier analysis
○ Grouping similar values into groups or clusters.
Values that fall outside of the set of clusters may
be considered outliers
Handling noisy data
● Concept hierarchy
○ A concept hierarchy is a series of mappings from a set of
low-level concepts to higher-level, more general concepts, e.g.,
country > province > city > street
○ Replacing data with higher-level concepts, e.g., a concept
hierarchy of price may map real values into inexpensive,
moderately_priced, and expensive.
● Combined human and computer inspection
○ Detect suspicious values and check by human
○ Correct the data using external references
Data Integration
● Combine data from multiple sources
Data Integration
● Challenges
○ Entity identification problem
○ Redundancies
○ Tuple duplication
○ Data value conflict
Data Integration: Challenges
● Entity identification problem
○ How can equivalent real-world entities from multiple sources be matched up?
○ Schema integration
■ How to be sure that customer_id in one database and cust_number in another refer to the
same attribute?
■ Data codes for pay_type in one database are “H” and “S” but 1 and 2 in another.
○ Object matching
■ Bill Clinton = William Clinton
○ The metadata of the attributes (name, meaning, data type, range of values permitted for the
attributes, and null rules for handling blank, zero, and null values etc.) can be used to match
attributes.
Data Integration: Challenges
Redundancy

● Data integration may result in duplicate attributes or duplicate tuples.


● Redundant data may occur
○ Due to inconsistencies in attribute naming across multiple sources
○ Due to the use of denormalized tables
○ When an attribute can be derived from another attribute or a set of attributes
Handling redundant data
● Correlation analysis can detect attribute redundancy
○ Given two attributes, correlation analysis can measure how strongly one attribute implies the
other, based on the available data
○ Numerical data: Pearson’s correlation coefficient
○ Categorical data: Chi-square test
● Different probabilistic approaches and machine learning techniques can be used
to detect duplicate tuples
● Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
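
As a minimal sketch (hypothetical attributes), a Pearson correlation close to ±1 between two numerical attributes flags one of them as a removal candidate; the chi-squared sketch shown earlier plays the same role for categorical attributes:

```python
import pandas as pd

# Hypothetical integrated data with a derivable (redundant) attribute
df = pd.DataFrame({"annual_income":  [30000, 45000, 60000, 75000, 90000],
                   "monthly_income": [2500, 3750, 5000, 6250, 7500]})

# Pearson correlation of 1.0 here: monthly_income is derivable from annual_income
print(df["annual_income"].corr(df["monthly_income"]))
```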
Data Integration: Challenges
Data value conflict

● For the same real-world entity, attribute values from different sources may differ
● Possible reasons: different representations, different scales, e.g., metric vs.
British units, grading schemes in different countries/universities etc.
● Metadata information can be used to detect such discrepancies
● Data scrubbing tools and data auditing tools can aid in the discrepancy
detection step
Data reduction
● Complex data analysis and mining on huge amounts of data can take a long time
● Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity
of the original data
Data reduction
● Data reduction strategies
○ Dimensionality reduction: Reducing the number of random variables or attributes under
consideration
○ Numerosity reduction: Replacing the original data volume by alternative, smaller forms of data
representation
○ Data compression: Applying transformations to obtain a reduced or “compressed” representation
of the original data
○ Data Aggregation: Combining multiple data points into a single data point by applying a
summarization function
○ Data Generalization: Replacing a data point with a more general data point that still preserves
the important information
Dimensionality reduction
● Reducing the number of random variables or attributes under consideration
● Some methods:
○ Attribute subset selection / feature selection
○ Principal Component Analysis
○ Wavelet Transforms
Attribute subset selection
● Reduces the data set size by removing irrelevant or
redundant attributes (or dimensions)
● The goal is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
● For n attributes, there are 2^n possible subsets.
○ An exhaustive search for the optimal subset of attributes (brute-force approach) may not be
feasible
○ Heuristic/greedy methods that explore a reduced search space are commonly used for attribute
subset selection.
○ Stepwise forward selection, stepwise backward selection, decision tree induction etc.
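
A minimal sketch of greedy stepwise forward selection, assuming scikit-learn (0.24 or later) is available; the estimator, data set, and number of attributes to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add one attribute at a time, keeping the one that most improves
# cross-validated accuracy, until the requested subset size is reached
selector = SequentialFeatureSelector(KNeighborsClassifier(),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes
```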
Attribute subset selection
Principal component analysis
● Transforms the data from d-dimensional
space into a new coordinate system of
dimension p, where p ≤ d
● Unlike attribute subset selection, which
reduces the attribute set size by retaining a
subset of the initial set of attributes, PCA
“combines” the essence of attributes by
creating an alternative, smaller set of
variables. The initial data can then be
projected onto this smaller set.
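
A minimal PCA sketch with scikit-learn (the data set and the choice of p = 2 components are illustrative; standardizing first is a common but optional step):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # d = 4 original attributes

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to attribute scales
pca = PCA(n_components=2)                  # keep p = 2 principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                     # (150, 2): data projected onto the new variables
print(pca.explained_variance_ratio_)       # variance retained by each component
```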
Wavelet transforms
● The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X′, of wavelet coefficients. The two vectors are of the same length.
● A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.
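
A minimal sketch assuming the PyWavelets package is installed; the data vector, wavelet choice, and thresholding rule are illustrative:

```python
import numpy as np
import pywt

# Hypothetical data vector X
X = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

cA, cD = pywt.dwt(X, "haar")            # one-level DWT: approximation and detail coefficients
coeffs = np.concatenate([cA, cD])       # same total length as X

# Keep only the strongest coefficients as a compressed approximation
threshold = np.quantile(np.abs(coeffs), 0.5)
compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
print(compressed)
```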
Numerosity reduction
● Replacing the original data volume by alternative, smaller forms of data
representation
● These techniques may be parametric or non-parametric
○ Parametric: Regression and log-linear models
○ Non-parametric: Sampling, histograms, clustering
Sampling
● Allows a large data set to be represented by a much smaller random data
sample (or subset)
● Types
○ Simple random sampling with replacement (SRSWR)
○ Simple random sampling without replacement (SRSWOR)
○ Cluster sampling
○ Stratified sampling
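
A minimal sketch of these sampling types with pandas (the data frame, class proportions, and sample sizes are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with a class attribute
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=1000),
                   "class": rng.choice(["A", "B"], size=1000, p=[0.8, 0.2])})

srswor = df.sample(n=100, replace=False, random_state=0)   # simple random sampling without replacement
srswr  = df.sample(n=100, replace=True,  random_state=0)   # simple random sampling with replacement

# Stratified sampling: draw 10% from each class so class proportions are preserved
stratified = df.groupby("class", group_keys=False).sample(frac=0.1, random_state=0)
```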
Sampling
Clustering
● Partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters
● In data reduction, the cluster representations of the data are used to replace the
actual data
Data compression
● Applying transformations to obtain a reduced or “compressed” representation of
the original data
● If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
● If we can reconstruct only an approximation of the original data, then the data
reduction is called lossy. Examples: PCA, DWT
Data transformation
● Transforming or consolidating data into forms appropriate for mining
● Strategies
○ Smoothing: removing noise from data
○ Attribute construction: constructing new attributes from the given set of attributes
○ Aggregation: summarizing or aggregating the data
○ Normalization: scaling the attribute data so that they fall within a smaller range
○ Discretization: replacing numerical attribute values with interval labels or conceptual labels
○ Concept hierarchy generation for nominal data: constructing hierarchies of concepts by
generalizing concepts to higher level
Normalization
● Transforming the data to fall within a smaller or common range such as [−1,1]
or [0.0, 1.0].
● Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
● For distance-based methods, normalization helps prevent attributes with
initially large ranges (e.g., income) from outweighing attributes with initially
smaller ranges (e.g., binary attributes).
● Some methods: min-max normalization, z-score normalization, normalization by
decimal scaling etc.
Min-max normalization
Performs a linear transformation on the original data.

Suppose that minA and maxA are the minimum and maximum values of an attribute,
A. Min-max normalization maps a value, vi, of A to vi' in the range
[new_minA, new_maxA] by computing

vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Min-max normalization
Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively. We would like to map income to the range
[0.0,1.0]. By min-max normalization, what will be the new value for $73,600?
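
Working through the formula above (a straightforward substitution):

vi' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716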
Z-score normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A.

A value, vi, of A is normalized to vi′ by computing

vi′ = (vi − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Z-score normalization is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers that dominate the min-max normalization.
Z-score normalization
Suppose that the mean and standard deviation of the values for the attribute income
are $54,000 and $16,000, respectively.

With z-score normalization, what will be the new value for $73,600?
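
Working through the formula above:

vi′ = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225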
Normalizing by decimal scaling
Normalizes by moving the decimal point of values of attribute A.

A value, vi, of A is normalized to vi′ by computing

vi′ = vi / 10^j

where j is the smallest integer such that max(|vi′|) < 1.

Example: Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
Discretization
● Dividing the range of a continuous attribute into intervals
● Interval labels can then be used to replace actual data values
● Some classification algorithms only accept categorical attributes
● Types:
○ Supervised (using class information) vs unsupervised (without using class information)
○ Top-down / splitting or bottom-up / merging
Discretization
● Unsupervised discretization
○ Equal-width or equal-frequency
● Supervised discretization
○ Clustering, decision tree
○ Entropy-based discretization
○ Chi square discretization
○ etc.
Unsupervised discretization
● Does not use class information
● Equal-width discretization
○ First, find the minimum and maximum values for the continuous attribute
○ Then, divide the range of the attribute values into a user-specified number of equal-width intervals

[Figure: original data and its equal-width discretization]
Unsupervised discretization
● Equal-frequency discretization
○ Sort the values of the attribute in ascending order
○ Find the number of all possible values for the attribute
○ Then, divide the attribute values into the user-specified number of intervals such that each
interval contains the same number of sorted sequential values

[Figure: original data and its equal-frequency discretization]
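
A minimal sketch with pandas, where `cut` gives equal-width bins and `qcut` gives (approximately) equal-frequency bins (the age values and the choice of 3 intervals are hypothetical):

```python
import pandas as pd

# Hypothetical continuous attribute
ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3)    # equal-width intervals: same value range per bin
equal_freq  = pd.qcut(ages, q=3)      # equal-frequency intervals: roughly the same count per bin
print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```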
Supervised discretization
Uses class information
Concept hierarchy generation for nominal data
● Nominal attributes have a finite (but possibly large) number of distinct values,
with no ordering among the values. Examples: geographic_location,
job_category etc.
● The concept hierarchies can be used to transform the data into multiple levels of
granularity.
Concept hierarchy generation for nominal data
● Specification of a partial/total ordering of attributes explicitly at the schema
level by users or experts
○ street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping
○ {Urbana, Champaign, Chicago} ⊂ Illinois
● Specification of only a partial set of attributes
○ e.g., only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the
number of distinct values
○ e.g., for a set of attributes: {street, city, state, country}
Automatic concept hierarchy generation
● Some hierarchies can be automatically
generated based on the analysis of the
number of distinct values per attribute in the
data set
● The attribute with the most distinct values is
placed at the lowest level of the hierarchy
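
A minimal sketch of the distinct-value heuristic with pandas (the location data below are hypothetical):

```python
import pandas as pd

# Hypothetical location attributes
df = pd.DataFrame({"country": ["US", "US", "UK", "US"],
                   "state":   ["IL", "CA", "ENG", "IL"],
                   "city":    ["Chicago", "LA", "London", "Urbana"],
                   "street":  ["Main St", "1st Ave", "Baker St", "Green St"]})

# Order attributes by their number of distinct values:
# fewest distinct values at the top of the hierarchy, most at the bottom
print(df.nunique().sort_values())   # country < state < city < street
```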
Data Visualization
● Aims to communicate data clearly and effectively
through graphical representation
● Aids in discovering data relationships that are
otherwise not easily observable by looking at the
raw data
A famous example: John Snow's cholera map. Mapping cholera cases during an
outbreak in London led to the discovery that cholera was water-borne, as the
cases clustered around a shared water pump.
Goals of Data Visualization
Visualization should

● Make large datasets coherent (i.e., present huge amounts of information compactly)
● Present information from various viewpoints
● Present information at several levels of detail
● Support visual comparisons
● Tell stories about the data
Data Visualization Techniques
● Pixel-Oriented Visualization Techniques
● Geometric Projection Visualization Techniques
● Icon-Based Visualization Techniques
● Hierarchical Visualization Techniques
Pixel-oriented Visualization
Idea: Use a pixel where the color of the pixel reflects the dimension’s value.

For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one for each dimension.

The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.

The colors of the pixels reflect the corresponding values (e.g., the smaller the value, the lighter the shading).
Pixel-oriented Visualization
Example:
Consider a dataset containing: income, credit_limit, transaction_volume, age
Sorting the data in an increasing order of income, and using pixel-based visualization, we
can observe that
● credit limit increases as income
increases
● customers whose income is in
the middle range are more likely
to purchase more
● there is no clear correlation
between income and age.
Pixel-oriented Visualization
Windows do not have to be rectangular.
Geometric Projection Visualization
Pixel-oriented visualization techniques do not show whether there is a dense area in a multidimensional subspace.

Geometric projection techniques help users find interesting projections of multidimensional data sets.

A scatter plot displays 2-D data points using Cartesian coordinates.

Third dimension? Use different colors or shapes.


Geometric Projection Visualization
A scatter-plot matrix is an n × n grid of 2-D scatter plots that provides a
visualization of each dimension with every other dimension
Geometric Projection Visualization
The parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel to one of the display axes.

A data record is represented by a polygonal line that intersects each axis at the point corresponding to the associated dimension value.
Geometric Projection Visualization
Radar plot
Icon-based Visualization
Idea: Use small icons to represent
multidimensional data values.
Chernoff faces
Described by facial characteristic parameters:
head eccentricity, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, eye
spacing, eye size, mouth length and degree of
mouth opening etc.
Icon-based Visualization
Examples:
Hierarchical Visualization Techniques
Idea: Partition all dimensions into subsets (i.e., subspaces), and visualize the
subspaces in a hierarchical manner.

Tree-maps display hierarchical data as a set of nested rectangles.
Visualizing Complex Data and Relations
Tag/word clouds to visualize statistics of tags/words.

The importance of a tag/word is indicated by font size or color.


Visualizing Complex Data and Relations
Graphs/networks/arc diagrams to visualize relationships
Visualizing Complex Data and Relations
Chord diagrams to visualize flows or connections between several entities
Visualizing Complex Data and Relations
Choropleth maps / bubble maps for visualizing geospatial data
