Datalec 1
— Chapter 2 —
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Terms: timeout, season, coach, game, score, team, ball, lost, play, win
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 0 7 0 2 1 0 0 3 0 0
Transaction data
Graph and network
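As a rough illustration of how a text document becomes a term-frequency vector, the sketch below counts how often each vocabulary term occurs. The function name `term_frequency_vector`, the vocabulary order, and the sample sentence are assumptions made for this example; they are not taken from the slide's table.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Vocabulary order chosen for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

# Hypothetical document text.
document = "team play team score game game game"
vector = term_frequency_vector(document, vocabulary)
print(vector)
```

Each position of the vector corresponds to one term, so most entries are zero for a short document, which is the sparsity the slide refers to.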
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
Data Objects
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal:
Nominal means “relating to names.”
The values of a nominal attribute are symbols or “names of things”.
Each value represents some kind of category, code, or state.
So nominal attributes are also referred to as categorical.
The values do not have any meaningful order.
Hair_color = { black, brown, grey, red, white}
Occupation = {teacher, dentist, programmer, farmer }
It is possible to represent such symbols or names with numbers.
With hair color, we can assign a code of 0 for black, 1 for brown, and so on.
Another example is customer ID, with possible values that are all numeric.
In such cases, the numbers are not intended to be used quantitatively.
Mathematical operations on values of nominal attributes are not meaningful.
Even though a nominal attribute may have integers as values, it is not
considered a numeric attribute because the integers are not meant to be used
quantitatively.
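The coding idea above can be sketched in a few lines. The code table and the list of observations are hypothetical; the point is only that counting and equality tests make sense for nominal codes, while arithmetic on them does not.

```python
from collections import Counter

# Hypothetical coding scheme: the integers are labels, not quantities.
hair_color_codes = {"black": 0, "brown": 1, "grey": 2, "red": 3, "white": 4}
code_to_hair_color = {code: name for name, code in hair_color_codes.items()}

# Hypothetical observations of the nominal attribute Hair_color.
observations = ["brown", "black", "red", "brown", "white"]
encoded = [hair_color_codes[value] for value in observations]
decoded = [code_to_hair_color[code] for code in encoded]

# Meaningful for nominal data: frequency counts and the most frequent value.
counts = Counter(observations)
most_common_color = counts.most_common(1)[0][0]

# Computable but NOT meaningful: the average of the codes is not a hair color.
mean_code = sum(encoded) / len(encoded)

print(encoded, most_common_color, mean_code)
```

Any other assignment of integers to the colors would work equally well, which is why the numeric value of `mean_code` carries no information.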
Attribute Types
Binary
Nominal attribute with only 2 states (0 and 1)
Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Symmetric binary:
Its states are equally valuable and carry the same weight.
There is no preference on which outcome should be coded as 0 or 1.
e.g., gender
Asymmetric binary:
The outcomes of the states are not equally important.
By convention, we code the most important outcome, which is usually the
rarest one, by 1 (e.g., HIV positive) and the other by 0.
e.g., medical test (positive vs. negative)
Attribute Types
Ordinal
An attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is
not known.
Size = {small, medium, large}
Grade = {A+, A, A-, B+, and so on}
Ordinal attributes are useful for registering subjective assessments of
qualities that cannot be measured objectively.
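A minimal sketch of working with an ordinal attribute: the values can be ranked and compared, but the rank numbers say nothing about how far apart the values are. The rank table `size_rank`, the helper `is_larger`, and the sample list are assumptions for illustration.

```python
# Hypothetical rank table: only the order matters, not the numeric gaps.
size_rank = {"small": 0, "medium": 1, "large": 2}


def is_larger(a, b):
    """Return True if size a ranks above size b."""
    return size_rank[a] > size_rank[b]


# Hypothetical observations of the ordinal attribute Size.
sizes = ["large", "small", "medium", "small", "large"]

# Sorting and taking the largest value are meaningful for ordinal data.
sorted_sizes = sorted(sizes, key=lambda s: size_rank[s])
largest = max(sizes, key=lambda s: size_rank[s])

print(sorted_sizes, largest, is_larger("large", "medium"))
```

Subtracting ranks (for example, large minus small) would produce a number, but that number has no defined meaning because the magnitude between successive values is unknown.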
Numeric Attribute Types
A numeric attribute is quantitative.
It is a measurable quantity, represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
Interval-scaled
Measured on a scale of equal-sized units.
Numeric Attribute Types
Calendar dates are another example. For instance, the
years 2002 and 2010 are eight years apart.
Temperatures in Celsius and Fahrenheit do not have a
true zero-point; that is, neither 0°C nor 0°F indicates “no
temperature.”
Ratio-scaled
Inherent zero-point
We can speak of a value as being a multiple (or ratio)
of another value (e.g., 10 K is twice as high as
5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities
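The difference between the two scales can be checked numerically. In this sketch the two temperatures are hypothetical; the conversion uses the standard offset of 273.15 between Celsius and Kelvin.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Hypothetical readings.
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale and agree across both scales.
difference_c = c_high - c_low
difference_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios are only meaningful on the ratio scale (Kelvin, with a true zero).
naive_ratio = c_high / c_low                                        # misleading
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference_c, round(difference_k, 6), naive_ratio, round(kelvin_ratio, 4))
```

The Celsius ratio suggests that 20°C is "twice as hot" as 10°C, while the Kelvin ratio shows the actual increase is only a few percent.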
Discrete vs. Continuous Attributes
Classification algorithms often treat attributes as
being either discrete or continuous.
Discrete Attribute
Has only a finite or countably infinite set of values
e.g., the set of words in a collection of documents
Sometimes represented as integer variables
Continuous Attribute
Has real numbers as attribute values
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Various ways to measure the central tendency of data.
We have some attribute X, like salary, which has been
recorded for a set of objects.
Let $x_1, x_2, \dots, x_N$ be the set of N observed values or
observations for X.
These values may also be referred to as the data set.
If we were to plot the observations for salary, where would
most of the values fall?
This gives us an idea of the central tendency of the data.
Measures of central tendency include the mean, median,
mode, and midrange.
MEAN
The most common and effective numeric measure of
the “center” of a set of data is the (arithmetic) mean.
Let $x_1, x_2, \dots, x_N$ be a set of N values or observations,
such as for some numeric attribute X, like salary.
The mean of this set of values is
\[
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}
\]
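The mean can be computed directly from its definition. This sketch uses the twelve salary values (in thousands of dollars) listed in the mode example of this section; the function name `mean` is just an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Salary values in thousands of dollars, from the mode example in this section.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
salary_mean = mean(salaries)
print(salary_mean)  # 58.0, i.e., $58,000
```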
MEAN
Sometimes, each value $x_i$ in a set may be associated with a
weight $w_i$ for $i = 1, \dots, N$.
The weights reflect the significance, importance, or occurrence
frequency attached to their respective values.
In this case, we can compute
\[
\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}
        = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}
\]
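A short sketch of the weighted mean, using occurrence frequencies as weights. The distinct salary values and their counts are derived from the salary list in the mode example of this section; the function name `weighted_mean` is an assumption for illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Distinct salary values (in thousands) and how often each one occurs.
distinct_salaries = [30, 36, 47, 50, 52, 56, 60, 63, 70, 110]
frequencies = [1, 1, 1, 1, 2, 1, 1, 1, 2, 1]

print(weighted_mean(distinct_salaries, frequencies))  # 58.0
```

With frequencies as weights, the result equals the ordinary mean of the full list, which is a quick sanity check on the formula.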
MEDIAN
When the data are grouped into intervals and only the frequency of each
interval is known, we can approximate the median of the entire data set
(e.g., the median salary) by interpolation using the formula
\[
\text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
\]
where $L_1$ is the lower boundary of the median interval,
$N$ is the number of values in the entire data set,
$\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of all of the intervals that are lower than
the median interval,
$\text{freq}_{\text{median}}$ is the frequency of the median interval, and
width is the width of the median interval.
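The interpolation formula can be sketched as follows. The function name `grouped_median`, the `(lower, upper, freq)` tuple layout, and the sample intervals are assumptions for illustration; the intervals are expected to be sorted by lower boundary and non-overlapping.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, freq) intervals by interpolation."""
    total = sum(freq for _, _, freq in intervals)  # N
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    cumulative = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            # This is the median interval: L1 = lower, width = upper - lower.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical grouped data: 20 values spread over three intervals.
intervals = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_median(intervals))  # 15.0
```

Here N/2 = 10 falls in the interval 10 to 20, five values lie below it, and the estimate is 10 + ((10 - 5) / 10) * 10 = 15.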
MODE
The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in
the set.
Therefore, it can be determined for qualitative and quantitative
attributes.
It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
A data set with two or more modes is multimodal.
If each data value occurs only once, then there is no mode.
Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, 110.
The two modes are $52,000 and $70,000.
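A minimal sketch of finding all modes of a data set, following the convention stated above that a data set in which every value occurs only once has no mode. The function name `modes` is an assumption for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs only once: no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


# Salary values in thousands of dollars, from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(salaries))  # [52, 70], so the data set is bimodal
```

Because the function only counts occurrences, it also works for qualitative attributes such as a list of hair colors.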
MIDRANGE
The midrange can also be used to assess the central tendency of a
numeric data set.
It is the average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate functions,
max() and min().
The midrange of the salary data in the mode example is
( 30,000 + 110,000 ) / 2 = $70,000.
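The midrange computation is short enough to sketch directly. This version uses Python's built-in `min()` and `max()` rather than the SQL aggregates mentioned above; the function name `midrange` is an illustrative choice.

```python
def midrange(values):
    """Average of the smallest and largest values in the data set."""
    if not values:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


# Salary values in thousands of dollars, from the mode example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(midrange(salaries))  # 70.0, i.e., $70,000
```

Since only the two extreme values are used, a single outlier such as 110 shifts the midrange noticeably.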
In a unimodal frequency curve with a perfectly symmetric data
distribution, the mean, median, and mode are all at the same center
value.
Data in most real applications are not symmetric.
They may instead be either positively skewed, where the mode
occurs at a value that is smaller than the median, or negatively
skewed, where the mode occurs at a value greater than the median.
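The rule above can be turned into a rough check that compares the mode with the median. This is only a sketch for unimodal data; the function name `skew_direction` and both sample lists are hypothetical.

```python
from statistics import median, mode


def skew_direction(values):
    """Classify skew by comparing the mode with the median (unimodal data assumed)."""
    data_mode, data_median = mode(values), median(values)
    if data_mode < data_median:
        return "positive"
    if data_mode > data_median:
        return "negative"
    return "symmetric or undetermined"


# Hypothetical samples: a long right tail and its mirror image.
right_tailed = [1, 2, 2, 3, 4, 5, 9]
left_tailed = [0, 4, 5, 6, 7, 7, 8]

print(skew_direction(right_tailed))  # positive
print(skew_direction(left_tailed))   # negative
```

A mode equal to the median does not prove symmetry on its own, which is why the last branch is labeled as undetermined.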
Symmetric vs. Skewed Data