Lect 2 DM Converted 1
Data Mining
Lecture # 2
Data & Its Visualization
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
Example (columns are attributes; rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
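A minimal sketch of the rows-as-objects, columns-as-attributes idea, using three rows of the Tid/Refund table from this lecture. The snake_case attribute names and the choice of a list of dicts are assumptions made for illustration only.

```python
# Column names play the role of attributes (names are illustrative).
attributes = ("tid", "refund", "marital_status", "taxable_income_k", "cheat")

# Each tuple is one row of the table, i.e. one data object.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
]

# Pair every row with the attribute names to get one dict per data object.
objects = [dict(zip(attributes, row)) for row in rows]

# Reading one attribute across all objects gives one column of the table.
incomes = [obj["taxable_income_k"] for obj in objects]

print(objects[1])
print(incomes)
```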
Attribute Values
Attribute values are numbers or symbols
assigned to an attribute for a particular object
Difference Between Ratio and Interval
Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?
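One way to explore the question numerically: take the two Celsius readings, express the same physical temperatures in Fahrenheit and Kelvin, and compare the ratios. The helper names below are illustrative. If "twice" were a physical property of the pair 10 °C and 5 °C, the ratio would survive a change of units; it does not. Only a scale with a true zero, such as Kelvin, gives ratios of its own values a physical meaning.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin."""
    return c + 273.15


warm_c, cool_c = 10.0, 5.0

ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# The same two temperatures give three different "ratios".
print(ratio_celsius, ratio_fahrenheit, ratio_kelvin)

# Differences, by contrast, stay consistent up to the unit factor (interval property).
diff_celsius = warm_c - cool_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)
```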
Qualitative
– Nominal: values only distinguish one object from another (=, ≠). Examples: eye color, sex: {male, female}
– Ordinal: ordinal attribute values also order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation
Qualitative
– Ordinal: an order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Quantitative (Numeric)
– Ratio: new_value = a * old_value. Length can be measured in meters or feet.
This categorization of attributes is due to S. S. Stevens.
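The two transformations can be checked with a short sketch. The helper `preserves_order` and the constant `FEET_PER_METER` are names introduced here for illustration; the recoding {1, 2, 3} to {0.5, 1, 10} is the one given in the text.

```python
def preserves_order(values, transform):
    """Return True if transform keeps every pairwise '<' relation unchanged."""
    mapped = [transform(v) for v in values]
    pairs = list(zip(values, mapped))
    return all((a < b) == (fa < fb) for a, fa in pairs for b, fb in pairs)


# Ordinal: good/better/best coded as 1, 2, 3, then recoded monotonically.
codes = [1, 2, 3]
recode = {1: 0.5, 2: 1, 3: 10}
order_kept = preserves_order(codes, recode.get)

# The gaps between codes are NOT kept, which is fine for ordinal data.
gaps_before = [codes[1] - codes[0], codes[2] - codes[1]]
gaps_after = [recode[2] - recode[1], recode[3] - recode[2]]

# Ratio: changing units multiplies by a constant, so ratios are unchanged.
FEET_PER_METER = 3.28084  # approximate conversion factor
lengths_m = [2.0, 4.0]
lengths_ft = [x * FEET_PER_METER for x in lengths_m]
ratio_m = lengths_m[1] / lengths_m[0]
ratio_ft = lengths_ft[1] / lengths_ft[0]

print(order_kept, gaps_before, gaps_after, ratio_m, ratio_ft)
```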
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
Document Data
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
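A minimal sketch of how a document-term table like this can be produced: each document becomes a vector of term counts over a fixed vocabulary. The vocabulary reuses the ten terms from the table, in an assumed column order. The two sample texts and the whitespace tokenization are hypothetical and are not the documents behind the table.

```python
from collections import Counter

# The ten terms from the table; the column order is an assumption.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in a text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocabulary]


# Hypothetical documents, used only to exercise the function.
documents = {
    "doc_a": "team play game play game season",
    "doc_b": "coach coach ball lost",
}

matrix = {name: term_vector(text) for name, text in documents.items()}

for name, vector in matrix.items():
    print(name, vector)
```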
Transaction Data
A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set
of products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
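The table above maps naturally onto sets: one set of items per transaction. This is a sketch under that assumption, with variable names chosen for illustration.

```python
from collections import Counter

# TID -> set of items, copied from the table above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# In how many transactions does each item appear?
item_counts = Counter(item for items in transactions.values() for item in items)

# Which transactions contain both Diaper and Milk? (subset test on sets)
with_diaper_and_milk = [tid for tid, items in transactions.items()
                        if {"Diaper", "Milk"} <= items]

print(item_counts)
print(with_diaper_and_milk)
```

Sets fit here because a transaction records which items were bought, not their order.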
Graph Data
Examples: Generic graph, a molecule, and webpages
Ordered Data
– Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
– Spatio-Temporal Data
Median:
Mode
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure panels: symmetric, positively skewed, negatively skewed]
October 7, 2019
Example
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes
6    No      NULL            60K             No
Mean: 1090K
Trimmed mean (remove min, max): 105K
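A sketch of how one extreme value pulls the mean while the median and the trimmed mean stay put. It uses only the six Taxable Income values visible in this example. The Mean and Trimmed mean figures quoted with the table may have been computed from a longer table, so the numbers printed here need not match them. `trimmed_mean` is a helper written for this sketch.

```python
from statistics import mean, median

# Taxable Income (in K) for the six rows shown in the example.
incomes_k = [125, 100, 70, 120, 10000, 60]


def trimmed_mean(values, drop_each_end=1):
    """Mean after dropping the smallest and largest drop_each_end values."""
    ordered = sorted(values)
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    return mean(kept)


plain_mean = mean(incomes_k)        # dominated by the 10000K entry
middle = median(incomes_k)          # unaffected by how extreme the max is
trimmed = trimmed_mean(incomes_k)   # min (60) and max (10000) removed

print(plain_mean, middle, trimmed)
```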
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
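One reading of "algebraic, scalable computation" is that the variance can be obtained in a single pass from three running totals: the count, the sum, and the sum of squares. The sketch below assumes that reading and checks it against Python's `statistics` module. The function name is illustrative, and the data are the eight Taxable Income values (in K) from the record-data table earlier in the lecture.

```python
import math
import statistics


def variance_one_pass(values, sample=True):
    """Variance from running totals: n, sum(x), and sum(x^2)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # a single pass; the totals could also be merged across chunks
        n += 1
        total += x
        total_sq += x * x
    sum_sq_dev = total_sq - total * total / n  # equals sum((x - mean)^2)
    return sum_sq_dev / (n - 1 if sample else n)


incomes_k = [125, 100, 70, 120, 95, 60, 220, 85]

sample_var = variance_one_pass(incomes_k, sample=True)       # s^2, divides by n - 1
population_var = variance_one_pass(incomes_k, sample=False)  # sigma^2, divides by n
sample_std = math.sqrt(sample_var)                           # s

print(sample_var, population_var, sample_std)
```

The sum-of-squares form can lose precision when the mean is very large relative to the spread, so library routines are usually the safer choice when that matters.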
Boxplot Analysis
Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box is
IQR
– The median is marked by a line within the
box
– Whiskers: two lines outside the box
extended to Minimum and Maximum
– Outliers: points beyond a specified
outlier threshold, plotted individually
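A sketch of the numbers behind a boxplot: the five-number summary and the points beyond a 1.5 x IQR threshold. Quartile conventions differ between tools. This version uses `statistics.quantiles` with `method="inclusive"`, which is one choice among several. The data are the eight Taxable Income values (in K) from the record-data table earlier in the lecture.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


incomes_k = [125, 100, 70, 120, 95, 60, 220, 85]

summary = five_number_summary(incomes_k)
outliers = iqr_outliers(incomes_k)

print(summary)   # min, Q1, median, Q3, max
print(outliers)  # points a boxplot would draw individually
```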
Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents frequencies
Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
Example Dataset
Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
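A minimal sketch of the tabulation step: group values into equal-width bins, count them, and draw each count as a bar of `#` characters. The bin width of 50 and the function name are arbitrary choices for illustration. The data are the eight Taxable Income values (in K) from the record-data table earlier in the lecture.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower edge to the number of values falling in it."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


incomes_k = [125, 100, 70, 120, 95, 60, 220, 85]
BIN_WIDTH = 50

bins = histogram(incomes_k, BIN_WIDTH)

# Text rendering: one bar per non-empty bin (empty bins are not listed).
for start, count in bins.items():
    print(f"{start:>4}-{start + BIN_WIDTH - 1:<4} {'#' * count}")
```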
Boxplots
The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
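The point can be reproduced with two small made-up datasets. Both values lists below are hypothetical and were chosen so that their five-number summaries coincide under the inclusive quartile convention of `statistics.quantiles`, while their frequency tables differ.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


# Hypothetical data: one evenly spread, one piled up at 2 and 6.
spread_out = [0, 1, 2, 3, 4, 5, 6, 7, 8]
clustered = [0, 2, 2, 2, 4, 6, 6, 6, 8]

summary_a = five_number_summary(spread_out)
summary_b = five_number_summary(clustered)

# Frequency of each value: the raw material for a histogram.
freq_a = Counter(spread_out)
freq_b = Counter(clustered)

print(summary_a == summary_b)  # the boxplots would look the same
print(freq_a == freq_b)        # the histograms would not
```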