
Chapter 2: Data Exploration, Preprocessing and Visualization

Contents

● Non-graphical and graphical methods for exploring data
● Data cleaning: Missing values, noisy data, inconsistent data
● Data integration
● Data transformation
● Data reduction
● Discretization and generating concept hierarchies
● Data visualization tools and techniques
Getting to know your data
Before jumping into mining, we need to familiarize ourselves with the data.

● What are the types of attributes that make up your data?
● What kind of values does each attribute have?
● Which attributes are discrete, and which are continuous-valued?
● What do the data look like?
● How are the values distributed?
● Are there ways we can visualize the data to get a better sense of it all?
● Can we spot any outliers?
● Can we measure the similarity of some data objects with respect to others? etc.
Exploratory Data Analysis (EDA)
An approach for data analysis that employs a variety of techniques for

● Better understanding the data
● Detection of mistakes
● Checking of assumptions
● Preliminary selection of appropriate models
● Determining relationships among the explanatory variables, and
● Assessing the direction and rough size of relationships between explanatory and outcome variables
Exploratory data analysis (EDA)
EDA can be graphical or non-graphical.

Non-graphical methods generally involve calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way.
Non-graphical methods: Univariate data
● Categorical data:
A simple tabulation of the frequency of each category

Source: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf

Non-graphical methods: Univariate data
● Quantitative data: Characteristics of the population distribution of the variable, estimated from the observed sample
○ Central tendency: mean, median, mode etc.
○ Spread (an indicator of how far away from the center we are still likely to find data
values): variance, standard deviation, interquartile range (IQR)
○ Modality (number of peaks in the pdf)
○ Shape: Skewness (measure of asymmetry), Kurtosis (measure of peakedness)
○ Outliers (an observation that lies "far away" from other values)
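
A minimal sketch of computing these summary statistics with pandas (the Series `x` below is hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample of one quantitative attribute
x = pd.Series([12, 15, 15, 18, 21, 22, 24, 30, 95])

print(x.mean(), x.median(), x.mode().tolist())                 # central tendency
print(x.var(), x.std(), x.quantile(0.75) - x.quantile(0.25))   # spread: variance, std, IQR
print(x.skew(), x.kurt())                                      # shape: skewness and (excess) kurtosis
```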

Non-graphical methods: Bivariate data
● Categorical data:
○ Cross-tabulation
○ Measures of association: Chi-squared test, Cramér's V, etc.
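
A minimal sketch of a cross-tabulation, a chi-squared test, and Cramér's V, assuming pandas and SciPy are available (the data frame and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical bivariate categorical data
df = pd.DataFrame({"gender":    ["M", "F", "F", "M", "F", "M", "F", "M"],
                   "preferred": ["A", "B", "B", "A", "A", "A", "B", "B"]})

table = pd.crosstab(df["gender"], df["preferred"])   # cross-tabulation
chi2, p, dof, expected = chi2_contingency(table)     # chi-squared test of independence

# Cramér's V derived from the chi-squared statistic
n = table.values.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(table, chi2, p, v, sep="\n")
```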

Non-graphical methods: Bivariate data
● Quantitative data:
○ Covariance: a measure of how much two variables “co-vary”
■ Positive covariance: when one measurement is above its mean, the other will probably also be above its mean, and vice versa
■ Negative covariance: when one variable is above its mean, the other tends to be below its mean
■ Covariance near zero: no linear co-variation between the two variables (they may still be related non-linearly)
○ Correlation:
Statistical relationship between two variables
Commonly used correlation coefficient: Pearson’s correlation coefficient
■ -1 = data lie on a perfect straight line with a negative slope
■ 0 = no linear relationship between the variables
■ 1 = data lie on a perfect straight line with a positive slope
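
A minimal sketch with pandas, covering both covariance and Pearson's correlation (the paired measurements are hypothetical):

```python
import pandas as pd

# Hypothetical paired measurements of two quantitative variables
df = pd.DataFrame({"height": [150, 160, 165, 172, 180],
                   "weight": [50, 58, 63, 70, 80]})

print(df["height"].cov(df["weight"]))    # sample covariance
print(df["height"].corr(df["weight"]))   # Pearson's correlation coefficient
```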

Non-graphical methods: Multivariate data
● Covariance and correlation matrices:
Pairwise covariances and/or correlations assembled into a matrix
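
With pandas, the pairwise matrices can be obtained directly (the data frame below is hypothetical):

```python
import pandas as pd

# Hypothetical multivariate data
df = pd.DataFrame({"income":       [30, 45, 60, 75, 90],
                   "credit_limit": [10, 15, 22, 30, 40],
                   "age":          [25, 52, 33, 61, 40]})

print(df.cov())    # pairwise covariances assembled into a matrix
print(df.corr())   # pairwise Pearson correlations assembled into a matrix
```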

Graphical methods: Univariate data

● Bar chart
● Histogram
● Density plot
● Box and whiskers plot
● Time-series plot
Bar plot
Bar plots display the distribution (frequencies) of a
categorical variable through vertical or horizontal
bars.

Histogram
Histograms are constructed by binning the
data and counting the number of
observations in each bin

They are used to visualize the shape of the distribution.

Histogram
Bar plot vs histogram

● Bar plots use categorical data
● Histograms use continuous data grouped into bins, so each bar represents a range of values
● Normally, bars in bar plots do not touch each other
● Bars in histograms touch each other to indicate that adjacent bins cover contiguous ranges of a continuous variable
Density plot
Density plots can be thought of as plots of
smoothed histograms

Density plot
The smoothness is controlled by a bandwidth
parameter that is analogous to the histogram
binwidth
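
A minimal plotting sketch, assuming matplotlib and SciPy are installed (the sample data are hypothetical; `bw_method` plays the role of the bandwidth):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical quantitative attribute
x = pd.Series(np.random.default_rng(0).normal(loc=50, scale=10, size=500))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
x.plot.hist(bins=20, ax=axes[0], title="Histogram")          # bin the data and count per bin
x.plot.kde(bw_method=0.3, ax=axes[1], title="Density plot")  # bandwidth controls the smoothness
plt.tight_layout()
plt.show()
```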

Box and whiskers plot
Presents information about the central
tendency, symmetry, and skew as well as
outliers

Box and whiskers plot

Source: https://sites.google.com/site/davidsstatistics/home/notched-box-plots https://en.wikipedia.org/wiki/Box_plot

Time series plot

A time-series plot shows the values of a variable observed at successive points in time t.
Graphical methods: Bivariate data

● Scatter plot
● Regression
● Box and whiskers plot
● Bar chart
Scatter plot
Uses Cartesian coordinates to show the relationship between two variables of a set
of data

Box and whiskers plot

Bar plot
Stacked bar plot

Bar plot
Grouped bar plot

Graphical methods: Multivariate data

● Scatter plot matrix
● Bubble plot
● Line chart
Scatter plot matrix
Can be used to roughly determine if there
is a linear correlation between multiple
variables
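
A minimal sketch using pandas' built-in scatter-matrix helper (hypothetical data; `y` is constructed to be linearly related to `x`, while `z` is not):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical multivariate data
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)   # linearly related to x
df["z"] = rng.normal(size=200)                            # unrelated to x and y

scatter_matrix(df, diagonal="kde")   # off-diagonal panels hint at pairwise linear correlation
plt.show()
```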

Bubble plot
Bubble plots can display the
relationship between three
quantitative variables

A bubble plot is basically a 2D scatter plot that uses the size of the plotted point to represent the value of the third variable.

Line chart
Connecting the points in a scatter plot, moving from left to right, gives a line plot. Multiple lines can be drawn in the same plot.

Graphical methods
Quantitative
● Univariate: Histograms, Density plots, Box and whiskers plots, Time-series plots
● Bivariate: Scatterplots
● Multivariate: Scatterplot matrix, Bubble plots (3 variables)

Categorical
● Univariate: Pie charts, Bar graphs

Quantitative & Categorical
● Univariate: Box and whiskers plots
● Bi- or multi-variate: Bar graphs
Why preprocess the data?
In many practical situations

● Data contains too many attributes, and some of them are clearly irrelevant and
redundant
● Data is incomplete (some values are missing)
● Data is inaccurate or noisy (contains errors, or values that deviate from the
expected)
● Data is inconsistent (e.g., containing discrepancies in the department codes)

Garbage-In-Garbage-Out (GIGO): Low-quality data will lead to low-quality mining results.
Data quality
● Accuracy - whether the dataset contains errors, or values that deviate from the
expected
● Completeness - whether attribute values or certain attributes of interest are lacking, or only aggregate data are available
● Consistency - containing discrepancies between the attribute values
● Timeliness - how timely the data are updated
● Believability - how much the data are trusted by users
● Interpretability - how easy the data are understood
Why is real-world data dirty?
● Inaccurate (or noisy) data may come from
○ Faulty data collection instruments or methods
○ Human or computer errors while entering data
○ Incorrect data purposely submitted by users (aka disguised missing data)
○ Technological limitations
○ Inconsistencies in naming conventions or inconsistent formats for input fields
○ etc.
● Inconsistent data may come from
○ Integration of data from various sources
○ Modification of linked data
○ etc.
● Incomplete data may come from
○ Unavailability of attributes of interest
○ Recording missed due to equipment malfunctions
○ Deletion of inconsistent data
○ Data not submitted in a timely fashion
○ etc.
Major Tasks in Data Preprocessing
● Data cleaning
○ Fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies,
detect and remove redundancies
● Data integration
○ Include data from multiple sources (e.g., multiple databases, data cubes, files etc.)
● Data reduction
○ Obtain a reduced representation of the data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results
● Data transformation and data discretization
○ Normalize data values, replace raw data values by ranges or higher conceptual levels (e.g.,
replacing age by higher-level concepts, such as youth, adult or senior) etc.
Data cleaning
● Tasks
○ Handle missing values,
○ Smooth noisy data,
○ Identify or remove outliers,
○ Resolve inconsistencies,
○ Detect and remove redundancies
Missing data
● Data may not always be available
● Incomplete data may come from
○ Unavailability of attributes of interest
○ Recording missed due to equipment malfunctions
○ Deletion of inconsistent data
○ Data not submitted in a timely fashion
○ etc.
● Missing data may need to be inferred
Handling missing values
● Ignore the tuple
○ Usually done when the class label is missing
○ Poor when the percentage of missing values per attribute varies considerably
● Use a global constant (e.g., NA, Unknown, -∞ etc.) to fill in the missing value
○ This method is simple but not foolproof as the mining program may mistakenly think that they
form an interesting concept
● Fill in the missing values manually
○ Time consuming
○ May not be feasible for a large data set
● Fill in the missing values automatically
○ Using random values
○ Using a measure of central tendency for the attribute (e.g., mean, median)
○ Using the attribute mean or median for all samples belonging to the same class
○ Using the most probable value (e.g., with regression, decision tree, Bayesian inference etc.)
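
A minimal sketch of common automatic fills with pandas (the data frame, column names, and fill constants are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({"age":    [23, np.nan, 45, 31, np.nan],
                   "income": [48000, 52000, np.nan, 61000, 39000],
                   "class":  ["yes", "no", "yes", "no", "yes"]})

dropped  = df.dropna()                              # ignore (drop) incomplete tuples
constant = df.fillna({"age": -1, "income": -1})     # fill with a global constant
by_mean  = df.fillna(df[["age", "income"]].mean())  # fill with the attribute mean

# Fill with the attribute mean of all samples belonging to the same class
by_class = df.copy()
by_class[["age", "income"]] = (df.groupby("class")[["age", "income"]]
                                 .transform(lambda s: s.fillna(s.mean())))
```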
Noisy data
● Noise is a random error or variance in a measured variable.
● Inaccurate (or noisy) data may come from
○ Faulty data collection instruments or methods
○ Human or computer errors while entering data
○ Incorrect data purposely submitted by users (aka disguised missing data)
○ Technological limitations
○ Inconsistencies in naming conventions or inconsistent formats for input fields
○ etc.
Handling noisy data
● Binning
○ First sort the data, and distribute the sorted
values into a number of buckets or bins
(equal-frequency bins or equal-width bins)
○ Then replace each value in a bin by the mean of
the bin (smoothing by bin means), or by the
median of the bin (smoothing by bin medians), or
by the closest boundary value (smoothing by bin
boundaries)
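
A minimal sketch of smoothing by bin means and medians with pandas, using equal-frequency bins (the attribute values and the choice of 3 bins are hypothetical):

```python
import pandas as pd

# Hypothetical sorted attribute values
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(values, q=3)                            # equal-frequency (equal-depth) bins
by_mean   = values.groupby(bins).transform("mean")     # smoothing by bin means
by_median = values.groupby(bins).transform("median")   # smoothing by bin medians
print(pd.DataFrame({"value": values, "bin": bins,
                    "by_mean": by_mean, "by_median": by_median}))
```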
Handling noisy data
● Regression
○ Fitting the data into regression functions
○ In (simple) linear regression, the data are modeled
to fit a straight line.
● Clustering / Outlier analysis
○ Grouping similar values into groups or clusters.
Values that fall outside of the set of clusters may
be considered outliers
Handling noisy data
● Concept hierarchy
○ A concept hierarchy is a series of mappings from a set of
low-level concepts to higher-level, more general concepts, e.g.,
country > province > city > street
○ Replacing data with higher-level concepts, e.g., a concept
hierarchy of price may map real values into inexpensive,
moderately_priced, and expensive.
● Combined human and computer inspection
○ Detect suspicious values and check by human
○ Correct the data using external references
Data Integration
● Combine data from multiple sources
Data Integration
● Challenges
○ Entity identification problem
○ Redundancies
○ Tuple duplication
○ Data value conflict
Data Integration: Challenges
● Entity identification problem
○ How can equivalent real-world entities from multiple sources be matched up?
○ Schema integration
■ How to be sure that customer_id in one database and cust_number in another refer to the
same attribute?
■ Data codes for pay_type in one database are “H” and “S” but 1 and 2 in another.
○ Object matching
■ Bill Clinton = William Clinton
○ The metadata of the attributes (name, meaning, data type, range of values permitted for the
attributes, and null rules for handling blank, zero, and null values etc.) can be used to match
attributes.
Data Integration: Challenges
Redundancy

● Data integration may result in duplicate attributes or duplicate tuples.


● Redundant data may occur
○ Due to inconsistencies in attribute naming across multiple sources
○ Due to the use of denormalized tables
○ When an attribute can be derived from another attribute or a set of attributes
Handling redundant data
● Correlation analysis can detect attribute redundancy
○ Given two attributes, correlation analysis can measure how strongly one attribute implies the
other, based on the available data
○ Numerical data: Pearson’s correlation coefficient
○ Categorical data: Chi-square test
● Different probabilistic approaches and machine learning techniques can be used
to detect duplicate tuples
● Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
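
As a minimal sketch (hypothetical attributes), a Pearson correlation close to ±1 between two numerical attributes flags one of them as a removal candidate; the chi-squared sketch shown earlier plays the same role for categorical attributes:

```python
import pandas as pd

# Hypothetical integrated data with a derivable (redundant) attribute
df = pd.DataFrame({"annual_income":  [30000, 45000, 60000, 75000, 90000],
                   "monthly_income": [2500, 3750, 5000, 6250, 7500]})

# Pearson correlation of 1.0 here: monthly_income is derivable from annual_income
print(df["annual_income"].corr(df["monthly_income"]))
```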
Data Integration: Challenges
Data value conflict

● For the same real-world entity, attribute values from different sources may differ
● Possible reasons: different representations, different scales, e.g., metric vs.
British units, grading schemes in different countries/universities etc.
● Metadata information can be used to detect such discrepancies
● Data scrubbing tools and data auditing tools can aid in the discrepancy
detection step
Data reduction
● Complex data analysis and mining on huge amounts of data can take a long time
● Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity
of the original data
Data reduction
● Data reduction strategies
○ Dimensionality reduction: Reducing the number of random variables or attributes under
consideration
○ Numerosity reduction: Replacing the original data volume by alternative, smaller forms of data
representation
○ Data compression: Applying transformations to obtain a reduced or “compressed” representation
of the original data
○ Data Aggregation: Combining multiple data points into a single data point by applying a
summarization function
○ Data Generalization: Replacing a data point with a more general data point that still preserves
the important information
Dimensionality reduction
● Reducing the number of random variables or attributes under consideration
● Some methods:
○ Attribute subset selection / feature selection
○ Principal Component Analysis
○ Wavelet Transforms
Attribute subset selection
● Reduces the data set size by removing irrelevant or
redundant attributes (or dimensions)
● The goal is to find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
● For n attributes, there are 2^n possible subsets.
○ An exhaustive search for the optimal subset of attributes (brute-force approach) may not be
feasible
○ Heuristic/greedy methods that explore a reduced search space are commonly used for attribute
subset selection.
○ Stepwise forward selection, stepwise backward selection, decision tree induction etc.
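
A minimal sketch of greedy stepwise forward selection, assuming scikit-learn (0.24 or later) is available; the estimator, data set, and number of attributes to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add one attribute at a time, keeping the one that most improves
# cross-validated accuracy, until the requested subset size is reached
selector = SequentialFeatureSelector(KNeighborsClassifier(),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes
```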
Attribute subset selection
Principal component analysis
● Transforms the data from d-dimensional
space into a new coordinate system of
dimension p, where p ≤ d
● Unlike attribute subset selection, which
reduces the attribute set size by retaining a
subset of the initial set of attributes, PCA
“combines” the essence of attributes by
creating an alternative, smaller set of
variables. The initial data can then be
projected onto this smaller set.
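
A minimal PCA sketch with scikit-learn (the data set and the choice of p = 2 components are illustrative; standardizing first is a common but optional step):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # d = 4 original attributes

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to attribute scales
pca = PCA(n_components=2)                  # keep p = 2 principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                     # (150, 2): data projected onto the new variables
print(pca.explained_variance_ratio_)       # variance retained by each component
```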
Wavelet transforms
● The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X′, of wavelet coefficients. The two vectors are of the same length.
● A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.
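
A minimal sketch assuming the PyWavelets package is installed; the data vector, wavelet choice, and thresholding rule are illustrative:

```python
import numpy as np
import pywt

# Hypothetical data vector X
X = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

cA, cD = pywt.dwt(X, "haar")            # one-level DWT: approximation and detail coefficients
coeffs = np.concatenate([cA, cD])       # same total length as X

# Keep only the strongest coefficients as a compressed approximation
threshold = np.quantile(np.abs(coeffs), 0.5)
compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
print(compressed)
```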
Numerosity reduction
● Replacing the original data volume by alternative, smaller forms of data
representation
● These techniques may be parametric or non-parametric
○ Parametric: Regression and log-linear models
○ Non-parametric: Sampling, histograms, clustering
Sampling
● Allows a large data set to be represented by a much smaller random data
sample (or subset)
● Types
○ Simple random sampling with replacement (SRSWR)
○ Simple random sampling without replacement (SRSWOR)
○ Cluster sampling
○ Stratified sampling
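
A minimal sketch of these sampling types with pandas (the data frame, class proportions, and sample sizes are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with a class attribute
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=1000),
                   "class": rng.choice(["A", "B"], size=1000, p=[0.8, 0.2])})

srswor = df.sample(n=100, replace=False, random_state=0)   # simple random sampling without replacement
srswr  = df.sample(n=100, replace=True,  random_state=0)   # simple random sampling with replacement

# Stratified sampling: draw 10% from each class so class proportions are preserved
stratified = df.groupby("class", group_keys=False).sample(frac=0.1, random_state=0)
```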
Sampling
Clustering
● Partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters
● In data reduction, the cluster representations of the data are used to replace the
actual data
Data compression
● Applying transformations to obtain a reduced or “compressed” representation of
the original data
● If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
● If we can reconstruct only an approximation of the original data, then the data
reduction is called lossy. Examples: PCA, DWT
Data transformation
● Transforming or consolidating data into forms appropriate for mining
● Strategies
○ Smoothing: removing noise from data
○ Attribute construction: constructing new attributes from the given set of attributes
○ Aggregation: summarizing or aggregating the data
○ Normalization: scaling the attribute data so that they fall within a smaller range
○ Discretization: replacing numerical attribute values with interval labels or conceptual labels
○ Concept hierarchy generation for nominal data: constructing hierarchies of concepts by
generalizing concepts to higher level
Normalization
● Transforming the data to fall within a smaller or common range such as [−1,1]
or [0.0, 1.0].
● Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
● For distance-based methods, normalization helps prevent attributes with
initially large ranges (e.g., income) from outweighing attributes with initially
smaller ranges (e.g., binary attributes).
● Some methods: min-max normalization, z-score normalization, normalization by
decimal scaling etc.
Min-max normalization
Performs a linear transformation on the original data.

Suppose that minA and maxA are the minimum and maximum values of an attribute,
A. Min-max normalization maps a value, vi, of A to vi' in the range
[new_minA, new_maxA] by computing

vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Min-max normalization
Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively. We would like to map income to the range
[0.0,1.0]. By min-max normalization, what will be the new value for $73,600?
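
Working through the formula above (a straightforward substitution):

vi' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716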
Z-score normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A.

A value, vi, of A is normalized to vi′ by computing

vi′ = (vi − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Z-score normalization is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers that dominate the min-max normalization.
Z-score normalization
Suppose that the mean and standard deviation of the values for the attribute income
are $54,000 and $16,000, respectively.

With z-score normalization, what will be the new value for $73,600?
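
Working through the formula above:

vi′ = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225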
Normalizing by decimal scaling
Normalizes by moving the decimal point of values of attribute A.

A value, vi, of A is normalized to vi′ by computing

vi′ = vi / 10^j

where j is the smallest integer such that max(|vi′|) < 1.

Example: Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
Discretization
● Dividing the range of a continuous attribute into intervals
● Interval labels can then be used to replace actual data values
● Some classification algorithms only accept categorical attributes
● Types:
○ Supervised (using class information) vs unsupervised (without using class information)
○ Top-down / splitting or bottom-up / merging
Discretization
● Unsupervised discretization
○ Equal-width or equal-frequency
● Supervised discretization
○ Clustering, decision tree
○ Entropy-based discretization
○ Chi square discretization
○ etc.
Unsupervised discretization
● Does not use class information
● Equal-width discretization
○ First, find the minimum and maximum values for the continuous attribute
○ Then, divide the range of the attribute values into a user-specified number of equal-width intervals

[Figure: original data and its equal-width discretization]
Unsupervised discretization
● Equal-frequency discretization
○ Sort the values of the attribute in ascending order
○ Find the number of all possible values for the attribute
○ Then, divide the attribute values into the user-specified number of intervals such that each
interval contains the same number of sorted sequential values

[Figure: original data and its equal-frequency discretization]
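
A minimal sketch with pandas, where `cut` gives equal-width bins and `qcut` gives (approximately) equal-frequency bins (the age values and the choice of 3 intervals are hypothetical):

```python
import pandas as pd

# Hypothetical continuous attribute
ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3)    # equal-width intervals: same value range per bin
equal_freq  = pd.qcut(ages, q=3)      # equal-frequency intervals: roughly the same count per bin
print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```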
Supervised discretization
Uses class information
Concept hierarchy generation for nominal data
● Nominal attributes have a finite (but possibly large) number of distinct values,
with no ordering among the values. Examples: geographic_location,
job_category etc.
● The concept hierarchies can be used to transform the data into multiple levels of
granularity.
Concept hierarchy generation for nominal data
● Specification of a partial/total ordering of attributes explicitly at the schema
level by users or experts
○ street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping
○ {Urbana, Champaign, Chicago} ⊂ Illinois
● Specification of only a partial set of attributes
○ e.g., only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the
number of distinct values
○ e.g., for a set of attributes: {street, city, state, country}
Automatic concept hierarchy generation
● Some hierarchies can be automatically
generated based on the analysis of the
number of distinct values per attribute in the
data set
● The attribute with the most distinct values is
placed at the lowest level of the hierarchy
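
A minimal sketch of the distinct-value heuristic with pandas (the location data below are hypothetical):

```python
import pandas as pd

# Hypothetical location attributes
df = pd.DataFrame({"country": ["US", "US", "UK", "US"],
                   "state":   ["IL", "CA", "ENG", "IL"],
                   "city":    ["Chicago", "LA", "London", "Urbana"],
                   "street":  ["Main St", "1st Ave", "Baker St", "Green St"]})

# Order attributes by their number of distinct values:
# fewest distinct values at the top of the hierarchy, most at the bottom
print(df.nunique().sort_values())   # country < state < city < street
```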
Data Visualization
● Aims to communicate data clearly and effectively
through graphical representation
● Aids in discovering data relationships that are
otherwise not easily observable by looking at the
raw data
A famous example: John Snow's cholera map. Mapping cholera cases during an
outbreak in London led to the discovery that cholera was water-borne, as the
cases clustered around a shared water pump.
Goals of Data Visualization
Visualization should

● Make large datasets coherent (i.e., present huge amounts of information compactly)
● Present information from various viewpoints
● Present information at several levels of detail
● Support visual comparisons
● Tell stories about the data
Data Visualization Techniques
● Pixel-Oriented Visualization Techniques
● Geometric Projection Visualization Techniques
● Icon-Based Visualization Techniques
● Hierarchical Visualization Techniques
Pixel-oriented Visualization
Idea: Use a pixel where the color of the pixel reflects the dimension’s value.

For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one for each dimension.

The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.

The colors of the pixels reflect the corresponding values (e.g., the smaller the value, the lighter the shading).
Pixel-oriented Visualization
Example:
Consider a dataset containing: income, credit_limit, transaction_volume, age
Sorting the data in an increasing order of income, and using pixel-based visualization, we
can observe that
● credit limit increases as income
increases
● customers whose income is in
the middle range are more likely
to purchase more
● there is no clear correlation
between income and age.
Pixel-oriented Visualization
Windows do not have to be rectangular.
Geometric Projection Visualization
Pixel-oriented visualization techniques do not show whether there is a dense area in a multidimensional subspace.

Geometric projection techniques help users find interesting projections of multidimensional data sets.

A scatter plot displays 2-D data points using Cartesian coordinates.

Third dimension? Use different colors or shapes.


Geometric Projection Visualization
A scatter-plot matrix is an n × n grid of 2-D scatter plots that provides a
visualization of each dimension with every other dimension
Geometric Projection Visualization
The parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel to one of the display axes.

A data record is represented by a polygonal line that intersects each axis at the point corresponding to the associated dimension value.
Geometric Projection Visualization
Radar plot
Icon-based Visualization
Idea: Use small icons to represent
multidimensional data values.
Chernoff faces
Described by facial characteristic parameters:
head eccentricity, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, eye
spacing, eye size, mouth length and degree of
mouth opening etc.
Icon-based Visualization
Examples:
Hierarchical Visualization Techniques
Idea: Partition all dimensions into subsets (i.e., subspaces), and visualize the
subspaces in a hierarchical manner.

Tree-maps display hierarchical data as a set of nested rectangles.
Visualizing Complex Data and Relations
Tag/word clouds to visualize statistics of tags/words.

The importance of a tag/word is indicated by font size or color.


Visualizing Complex Data and Relations
Graphs/networks/arc diagrams to visualize relationships
Visualizing Complex Data and Relations
Chord diagrams to visualize flows or connections between several entities
Visualizing Complex Data and Relations
Choropleth maps / bubble maps for visualizing geospatial data
