Data Mining Notes C3

This document summarizes key concepts from Chapter 3 of an introduction to data mining textbook. It discusses summary statistics such as frequency, mode, percentiles, mean, median, range and variance that can describe a dataset. It also covers common visualization techniques including histograms, box plots, scatter plots and heatmaps to explore univariate and multivariate data. Visualization is important for representing data objects and relationships through graphical elements.

Uploaded by

wuziqi

Notes on Introduction to Data Mining:

Chapter 3 Data Exploration

wuziqing
5th November 2020

1 Summary Statistics
Summary statistics are quantities, such as the mean and standard deviation,
that capture various characteristics of a potentially large set of values with a
single number or a small set of numbers.

1.1 Frequencies and Mode

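Frequency and mode are mostly useful for categorical attributes. The
frequency of a value v_i is the fraction of objects that take that value:
\[ \mathrm{frequency}(v_i) = \frac{\text{number of objects with value } v_i}{\text{total number of objects}} \tag{1} \]
The mode of a categorical attribute is the value with the highest frequency.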
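
As a minimal sketch of Equation (1), the frequencies and the mode of a categorical attribute can be computed with Python's standard library. The function names and the sample attribute below are hypothetical and are used only for illustration.

```python
from collections import Counter


def frequencies(values):
    """Return {value: relative frequency} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the value with the highest frequency (first seen wins a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical categorical attribute: eye colour of six objects.
colours = ["brown", "blue", "brown", "green", "brown", "blue"]
freq = frequencies(colours)
print(freq)
print(mode(colours))
```

The frequencies always sum to 1, which is a quick sanity check on the counts.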

1.2 Percentiles
Percentiles are useful for showing the distribution of an ordered attribute.
The pth percentile x_p is a value of x such that p% of the observed values of
x are less than x_p. For example, x_{50%} = 3.0 means that 50% of the observed
values of x are less than 3.0.
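
Several conventions exist for computing a percentile from a finite sample. The sketch below assumes the nearest-rank convention, which is only one of them and is not necessarily the one used by the textbook. The function name is hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty list.

    Assumption: the percentile is the smallest observed value such that
    at least p% of the observations are less than or equal to it.
    """
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_nearest_rank(data, 25))
print(percentile_nearest_rank(data, 50))
```

Other conventions interpolate between neighbouring values, so results from different tools can differ slightly on small samples.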

1.3 Mean and Median

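The mean and the median measure the location of a continuous attribute.
\[ \mathrm{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{2} \]
\[ \mathrm{median}(x) = \begin{cases} x_{(r+1)} & \text{if } m \text{ is odd, } m = 2r + 1 \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & \text{if } m \text{ is even, } m = 2r \end{cases} \tag{3} \]
Here x_{(1)}, ..., x_{(m)} are the m values sorted in ascending order.
The median gives the middle value of an attribute and is robust to outliers,
while the mean can be strongly affected by them.
A trimmed mean with a specified percentage p can be used to reduce the
effect of outliers on the mean. The top and bottom (p/2)% of the data are
thrown out before calculating the mean.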
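
The following sketch illustrates Equations (2) and (3) and the trimmed mean on a small sample with one outlier. The function names and the data are hypothetical. Rounding the number of trimmed values down is an assumption, since the notes do not say how to handle fractional counts.

```python
def mean(values):
    """Arithmetic mean, Equation (2)."""
    return sum(values) / len(values)


def median(values):
    """Median, Equation (3), computed on the sorted values."""
    ordered = sorted(values)
    m = len(ordered)
    r = m // 2
    if m % 2 == 1:
        # m = 2r + 1: the middle value x_(r+1), which is index r when 0-based.
        return ordered[r]
    # m = 2r: the average of x_(r) and x_(r+1).
    return (ordered[r - 1] + ordered[r]) / 2


def trimmed_mean(values, p):
    """Mean after discarding the top and bottom (p/2)% of the sorted values."""
    if not 0 <= p < 100:
        raise ValueError("p must be in [0, 100)")
    ordered = sorted(values)
    k = int(len(ordered) * p / 200)  # values dropped at each end, rounded down
    return mean(ordered[k:len(ordered) - k])


data = [1, 2, 3, 4, 100]  # 100 is an outlier
print(mean(data), median(data), trimmed_mean(data, 40))
```

On this sample the outlier pulls the mean far above the median, while the trimmed mean with p = 40 drops one value from each end and agrees with the median.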

1.4 Range and Variance
The range and the variance measure the spread of an attribute.
\[ \mathrm{range}(x) = \max(x) - \min(x) \tag{4} \]
\[ \mathrm{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \tag{5} \]
The variance is particularly sensitive to outliers, because the difference
between each value and the mean is squared.
More robust measures of spread are:
1. Absolute average deviation:
\[ \mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}| \tag{6} \]

2. Median absolute deviation:
\[ \mathrm{MAD}(x) = \mathrm{median}\left(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\right) \tag{7} \]

3. Interquartile range:
\[ \mathrm{IQR}(x) = x_{75\%} - x_{25\%} \tag{8} \]
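
A short sketch of the spread measures in Equations (4) to (8). The function names and the data are hypothetical. The quartiles come from `statistics.quantiles`, whose default interpolation method is an assumption here and may differ from the percentile definition used elsewhere in these notes.

```python
import statistics


def value_range(values):
    """Equation (4): largest value minus smallest value."""
    return max(values) - min(values)


def variance(values):
    """Equation (5): sample variance with the m - 1 denominator."""
    m = len(values)
    x_bar = sum(values) / m
    return sum((x - x_bar) ** 2 for x in values) / (m - 1)


def aad(values):
    """Equation (6): average absolute deviation from the mean."""
    x_bar = sum(values) / len(values)
    return sum(abs(x - x_bar) for x in values) / len(values)


def mad(values):
    """Equation (7): median of the absolute deviations from the mean."""
    x_bar = sum(values) / len(values)
    return statistics.median(abs(x - x_bar) for x in values)


def iqr(values):
    """Equation (8): 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(value_range(data), variance(data), aad(data), mad(data), iqr(data))
```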

1.5 Multivariate Statistics

For data with multiple continuous attributes, the location of the data can be
calculated separately for each attribute:
\[ \bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n) \tag{9} \]
The spread of the data can be described by the covariance matrix S. The
covariance of two attributes measures the degree to which the two attributes
vary together, and its value depends on the magnitudes of the attributes.
\[ s_{ij} = \mathrm{covariance}(x_i, x_j) = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j) \tag{10} \]
Here x_{ki} is the value of the ith attribute for the kth object.

It should be noted that covariance(x_i, x_i) = variance(x_i).

Based on the covariance, we can calculate the degree to which two attributes
vary together independent of their magnitudes, i.e., how strongly they are
linearly related. The correlation matrix R is calculated by:
\[ r_{ij} = \mathrm{correlation}(x_i, x_j) = \frac{\mathrm{covariance}(x_i, x_j)}{s_i s_j} \tag{11} \]
where s_i and s_j are the standard deviations of x_i and x_j. r_ij lies in
[−1, 1], where 0 indicates no linear relationship, and 1 and −1 indicate a
perfect positive and a perfect negative linear relationship, respectively. It
should be noted that correlation(x_i, x_i) = 1.
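
Equations (10) and (11) can be sketched directly in Python. The function names and the three sample attributes are hypothetical. Each attribute is given as a list with one value per object.

```python
import math


def covariance(xs, ys):
    """Equation (10): sample covariance of two attributes."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (m - 1)


def correlation(xs, ys):
    """Equation (11): covariance divided by both standard deviations."""
    s_x = math.sqrt(covariance(xs, xs))  # covariance(x, x) is the variance
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)


def covariance_matrix(attributes):
    """Matrix S with entries s_ij for a list of attributes."""
    return [[covariance(a, b) for b in attributes] for a in attributes]


x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]   # x1 scaled by 2: perfect positive relationship
x3 = [5, 4, 3, 2, 1]    # x1 reversed: perfect negative relationship
print(covariance_matrix([x1, x2, x3]))
print(correlation(x1, x2), correlation(x1, x3))
```

Scaling x1 by 2 doubles its covariance with x1 but leaves the correlation at 1, which illustrates why correlation is independent of magnitude.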

2 Visualization
2.1 General Concepts: Representing Data as Graphical Elements
When mapping a data object to a graphical element, the general considerations
are:
1. If the object has only a single attribute, the attribute is mapped
according to its type. Ordered attributes (continuous or ordinal) should
be mapped to ordered graphical features, such as position along an axis.
For categorical attributes, each category should be mapped to a distinct
representation, such as a color or a position.
2. If the object has multiple attributes, it can be represented as a
row or column of a table, or as a line in a graph.
3. An object with multiple attributes can also be represented as a point in
a 2D or 3D diagram if it has 2 or 3 attributes. If it has more, we may
need to select a subset of the attributes.
4. Relationships between data objects can usually be shown by standard graph
representations, such as nodes and edges. Sometimes the relationship is
shown implicitly, for example as distance in a 2D coordinate system or
on a real-world map.

2.2 Plot Examples

Some common plotting techniques are described in this section.

1. Stem and Leaf Plots: A stem and leaf plot shows the distribution of
one-dimensional integer or continuous data, as shown in Figure 1. However,
it scales poorly and works well only for small integers.

Stem | Leaf
   1 | 1 1 2 3 3 4 4
   1 | 5 6 6 8
   2 | 0 3
   2 | 7 8
   3 |
   3 | 5 7 8 8
   4 | 0 0 0 1 2 4 4 4
   4 | 5 5 6 7 7 7 8 8 9

Figure 1: Stem and leaf plot (1|1 = 1.1)

2. Histogram: A histogram shows the distribution of the values of a single
attribute by dividing the objects into bins and showing the number of
objects that fall into each bin. For a categorical attribute, each category
can be a bin. For a continuous attribute, the range of values is divided
into intervals of some chosen width, and each interval is a bin, as shown
in Figure 2.
A Pareto histogram sorts the category bins by their counts, in decreasing
order.
A 3-D histogram can be used to show the joint counts of two attributes.

Figure 2: Histogram (x-axis: attribute value; y-axis: object count)

3. Box Plots: A box plot is another method for displaying the distribution
of a single continuous attribute. It shows the 10th, 25th, 50th, 75th and
90th percentiles of the data. An example is shown in Figure 3.
4. Pie Chart: A pie chart shows the distribution of a categorical attribute
as the percentage of the data that each category accounts for. An example
is shown in Figure 4.
5. Percentile Plots and Empirical Cumulative Distribution Functions
(ECDF): These plots show the distribution of an ordered attribute more
quantitatively.
A cumulative distribution function (CDF) gives, for each value x, the
proportion of the distribution that is less than x.
An empirical cumulative distribution function (ECDF) gives, for each
observed value, the fraction of points that are less than that value. Since
the number of observed values is finite, the ECDF is a step function.
An example of an ECDF is shown in Figure 5.
A percentile plot, on the other hand, draws P(x), the value of the xth
percentile, for all percentiles.

Figure 3: Box Plot (boxes labeled Index 0, Index 1 and Index 2)

Figure 4: Pie Chart (slices of 10%, 20%, 30% and 40%)

Figure 5: Empirical Cumulative Distribution Function

6. Scatter Plots: A scatter plot draws each data object as a point in a 2-D
or 3-D coordinate system, using the values of two or three attributes as
the coordinates. It can show the relationship between data objects.
If the data has a categorical attribute, it can be shown by drawing the
points in different styles. An example is shown in Figure 6.

Figure 6: Scatter Plot

7. Contour Plots: A contour plot is useful for displaying continuous values
associated with geographical locations, such as temperature or pressure.
An example is shown in Figure 7.

Figure 7: Contour Plot

8. Surface Plots: A surface plot can be used to plot geographical
information or mathematical functions. An example is shown in Figure 8.
9. Vector Field Plots: A vector field plot can be used to show quantities
that have both a magnitude and a direction, such as wind flow or a
gradient. An example is shown in Figure 9.

10. Heatmap: We can visualize an m × n matrix by treating each value as a
point and setting the brightness or color of the point according to its
value. An example of a heatmap is shown in Figure 10.
11. Parallel Coordinates: We can give each attribute its own position on
the x axis and plot each data object as a line through its attribute
values. This can sometimes reveal interesting patterns between attributes
or between different classes of data objects. An example is shown in
Figure 11.
12. Radar Plot: A radar plot can be used to represent the multi-dimensional
data of a single data object. Each coordinate is the value of one
attribute. An example is shown in Figure 12.

Figure 8: Surface Plot

Figure 9: Vector Field Plot

Figure 10: Heatmap for data matrix

Figure 11: Parallel Coordinates

Figure 12: Radar Plot
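
The numbers behind two of the plots above, the histogram and the ECDF, can be computed without a plotting library. The sketch below is only an illustration: the function names and data are hypothetical, the bins are assumed to have equal width, and the ECDF uses the common "less than or equal to" convention rather than the strict "less than" wording used above.

```python
from bisect import bisect_right


def histogram_counts(values, low, high, n_bins):
    """Count the values that fall into n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def ecdf(values):
    """Return a step function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    m = len(ordered)

    def f(x):
        return bisect_right(ordered, x) / m

    return f


data = [1, 2, 2, 3, 5]
counts = histogram_counts(data, low=0, high=6, n_bins=3)
F = ecdf(data)
print(counts)
print([F(x) for x in (0, 2, 4, 5)])
```

The bin counts are the bar heights of the histogram, and evaluating F at the sorted observed values gives the corner points of the ECDF step function.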

3 Multidimensional Data Analysis
If we view the data as a multidimensional array, we can aggregate it in
various ways and analyze the results.
As an example, assume transaction data with date, product, store and
transaction amount. We can perform the following operations on the
multidimensional array:

1. Select a target quantity that we want to investigate. It could be the
total transaction amount, the number of distinct products or stores
involved, etc.

2. Select relevant attributes/dimensions. For example, we may want to
see the transaction amount for each store, or the transaction amount on
each date for each store.
3. For the remaining attributes, perform dimension reduction using
aggregation. For example, to see the transaction amount on each date for
each product, we sum the transactions of all stores for each product on
each date.
If we aggregate over k of the n dimensions, we obtain data with n − k
dimensions. If we keep only 2 dimensions, as in the previous example, the
operation is called pivoting.

4. Sometimes we want to select the data with a specified value for an
attribute (slicing), or with values in a specified range (dicing).
5. Sometimes we want to aggregate over rows. For example, instead of daily
sales, we may want weekly or monthly sales. This row aggregation is called
roll-up. Conversely, we can split aggregated data into more detailed data,
for example from monthly data to daily data. This is called drill-down.
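
The operations above can be sketched on a tiny in-memory data set. All records, dimension names and function names below are hypothetical. A real analysis would more likely use a database or a data-frame library, which this sketch deliberately avoids.

```python
from collections import defaultdict

# Hypothetical transactions: (date, product, store, amount).
transactions = [
    ("2020-11-01", "apple", "north", 10.0),
    ("2020-11-01", "apple", "south", 5.0),
    ("2020-11-02", "pear", "north", 7.0),
    ("2020-12-01", "apple", "north", 3.0),
]
DIMS = {"date": 0, "product": 1, "store": 2}  # position of each dimension


def aggregate(records, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[DIMS[d]] for d in keep)
        totals[key] += record[3]
    return dict(totals)


def slice_by(records, dim, value):
    """Slicing: keep the records with one specified value for a dimension."""
    return [r for r in records if r[DIMS[dim]] == value]


def dice_by(records, dim, low, high):
    """Dicing: keep the records whose value for a dimension is in [low, high]."""
    return [r for r in records if low <= r[DIMS[dim]] <= high]


def roll_up_to_month(records):
    """Roll-up: replace each ISO date by its month (YYYY-MM)."""
    return [(date[:7], product, store, amount)
            for date, product, store, amount in records]


# Pivot: keep date and product, aggregate over store.
by_date_product = aggregate(transactions, ["date", "product"])
by_store = aggregate(transactions, ["store"])
monthly = aggregate(roll_up_to_month(transactions), ["date"])
print(by_date_product)
print(by_store)
print(monthly)
```

Drill-down is the reverse of the roll-up step and requires the detailed records to still be available, since the monthly totals alone cannot be split back into days.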
