Lect 2 DM Converted 1

This document provides an overview of data and its visualization. It defines data as a collection of data objects and their attributes. An attribute is a property or characteristic of an object. Data sets are made up of data objects, which represent entities. Data objects are described by attributes. There are different types of attributes including nominal, ordinal, interval, and ratio attributes. The document also discusses properties of attribute values, differences between attribute types, transformations of attribute types, discrete vs continuous attributes, and different types of data sets such as record data, data matrices, and graph data.

Uploaded by

Manahil Noor

Data Mining

Lecture # 2
Data & Its
Visualization

What is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Rows are objects; columns are attributes:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attribute Values
 Attribute values are numbers or symbols assigned to an attribute for a particular object
 Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
 Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
 Example: Attribute values for ID and age are integers

Types of Attributes
 There are different types of attributes
– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
 The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
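The cumulative nature of these properties can be sketched in a few lines of Python. This is only an illustration: the names AttributeType and allowed_operations are hypothetical, and the operator strings simply mirror the symbols listed on the slide.

```python
from enum import IntEnum


class AttributeType(IntEnum):
    """Attribute types ordered from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations that each level adds on top of the levels below it.
_NEW_OPERATIONS = {
    AttributeType.NOMINAL: {"=", "!="},   # distinctness
    AttributeType.ORDINAL: {"<", ">"},    # order
    AttributeType.INTERVAL: {"+", "-"},   # meaningful differences
    AttributeType.RATIO: {"*", "/"},      # meaningful ratios
}


def allowed_operations(attr_type):
    """Return every operation that is meaningful for the given type."""
    ops = set()
    for level, new_ops in _NEW_OPERATIONS.items():
        if level <= attr_type:
            ops |= new_ops
    return ops


print(sorted(allowed_operations(AttributeType.ORDINAL)))
```

An ordinal attribute therefore supports comparison for equality and order, but not subtraction or division.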
Difference Between Ratio and Interval
 Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?
 Consider measuring the height above average
– If Bill’s height is three inches above average and Bob’s height is six inches above average, then would we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?
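One way to explore the temperature question is to convert each reading to Kelvin, where zero is a true physical zero, and compare the ratios. The sketch below uses the standard conversion formulas; the function and variable names are chosen here for illustration.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15


def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15


# Ratio of the physical temperatures behind the readings "10" and "5".
ratio_celsius = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)
ratio_fahrenheit = fahrenheit_to_kelvin(10.0) / fahrenheit_to_kelvin(5.0)
ratio_kelvin = 10.0 / 5.0

print(round(ratio_celsius, 3), round(ratio_fahrenheit, 3), ratio_kelvin)
```

Only on the Kelvin scale does the numeric ratio of 2 match the physical ratio; on the Celsius and Fahrenheit scales the two temperatures differ by only about 1 to 2 percent.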
Attribute Type: Description, Examples, Operations
Categorical (Qualitative)
 Nominal
– Description: Nominal attribute values only distinguish. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy
 Ordinal
– Description: Ordinal attribute values also order objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
– Operations: median, percentiles, rank correlation
Numeric (Quantitative)
 Interval
– Description: For interval attributes, differences between values are meaningful. (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation
 Ratio
– Description: For ratio variables, both differences and ratios are meaningful.
– Examples: temperature in Kelvin, monetary quantities
– Operations: geometric mean, harmonic mean
This categorization of attributes is due to S. S. Stevens
Attribute Type: Transformation, Comments
Categorical (Qualitative)
 Nominal
– Transformation: Any permutation of values
– Comments: If all employee ID numbers were reassigned, would it make any difference?
 Ordinal
– Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
– Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Numeric (Quantitative)
 Interval
– Transformation: new_value = a * old_value + b where a and b are constants
– Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
 Ratio
– Transformation: new_value = a * old_value
– Comments: Length can be measured in meters or feet.
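The ordinal case can be checked with a small experiment. The sketch below encodes the same made-up ratings with the two codings mentioned above, {1, 2, 3} and {0.5, 1, 10}; the ratings list and the helper median_label are assumptions for illustration only.

```python
from statistics import mean, median

# Hypothetical ratings on the ordinal scale good < better < best.
ratings = ["good", "best", "better", "better", "good"]

coding_a = {"good": 1, "better": 2, "best": 3}
coding_b = {"good": 0.5, "better": 1, "best": 10}


def median_label(coding):
    """Encode the ratings, take the median, and map it back to a label."""
    values = [coding[r] for r in ratings]
    inverse = {value: label for label, value in coding.items()}
    return inverse[median(values)]


median_a = median_label(coding_a)
median_b = median_label(coding_b)
mean_a = mean([coding_a[r] for r in ratings])
mean_b = mean([coding_b[r] for r in ratings])

print(median_a, median_b)  # same label under both codings
print(mean_a, mean_b)      # the mean depends on the arbitrary coding
```

The median survives the monotonic recoding, while the mean falls below "better" under one coding and above it under the other, which is why the mean is not a meaningful operation for ordinal attributes.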
Discrete and Continuous Attributes
 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
 Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
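As a rough sketch, record data maps naturally onto a list of dictionaries that all share the same keys. The snippet uses the first five rows of the table above, with income stored in thousands; the variable names are illustrative assumptions.

```python
# The fixed set of attributes shared by every record.
ATTRIBUTES = ("Tid", "Refund", "Marital Status", "Taxable Income", "Cheat")

# First five rows of the table; income is in thousands (K).
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
]

# Each record (data object) pairs the attribute names with one row of values.
records = [dict(zip(ATTRIBUTES, row)) for row in rows]

# Every record must expose exactly the same attributes, in the same order.
same_schema = all(tuple(record) == ATTRIBUTES for record in records)

# Attribute access by name, e.g. the Tids of records flagged as Cheat.
cheaters = [record["Tid"] for record in records if record["Cheat"] == "Yes"]
print(same_schema, cheaters)
```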
Data Matrix
 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
 Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
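Treating objects as points makes geometric operations available. The sketch below stores the two rows above as a plain list of lists and measures the Euclidean distance between them with math.dist (Python 3.8 or later); Euclidean distance is just one possible choice and is not prescribed by the slide.

```python
import math

columns = ["Projection of x Load", "Projection of y Load",
           "Distance", "Load", "Thickness"]

# m rows (objects) by n columns (attributes).
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m, n = len(matrix), len(matrix[0])

# Each row is a point in n-dimensional space.
distance = math.dist(matrix[0], matrix[1])
print(m, n, round(distance, 3))
```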
Document Data
 Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
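A term vector of this kind can be built with collections.Counter. This is a minimal sketch: the sample sentence, the short vocabulary, and the whitespace tokenizer are all simplifying assumptions.

```python
from collections import Counter

# A short, fixed vocabulary; each term is one component of the vector.
vocabulary = ["team", "coach", "play", "ball", "score"]


def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


document = "the coach said the team must play the ball and play as a team"
vector = term_vector(document, vocabulary)
print(vector)
```

Terms that never occur get a zero, which is why document-term matrices tend to be sparse.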
Transaction Data
 A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
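The table above can be held as a mapping from TID to a set of items. The sketch below also converts it to a binary record form with one column per item, which is one common way to treat transactions as ordinary record data; the variable names are illustrative.

```python
from collections import Counter

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# How many transactions contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

# One binary attribute per item turns each transaction into a fixed-length record.
all_items = sorted(item_counts)
binary_rows = {
    tid: [int(item in items) for item in all_items]
    for tid, items in transactions.items()
}

print(all_items)
print(binary_rows[2])
```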
Graph Data
 Examples: Generic graph, a molecule, and webpages
[Figure: a generic graph; Benzene Molecule: C6H6]


Ordered Data
 Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC

CGCAGGGCCCGCCCCGCGCCGTC

GAGAAGGGCCCGCCTGGCGGGCG

GGGGGAGGCGGGGCCGCCCGAGC

CCAACCGAGTCCGACCAGGTGCC

CCCTCTGCTCGGCCTAGACCTGA
Ordered Data

 Spatio-Temporal Data

Basic Statistical Descriptions of Data
 Motivation
– To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
– Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum x}{N}$
– Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data)
 Mode
– Value that occurs most frequently in the data
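These measures can be tried out with Python's standard statistics module. The values and weights below are made up for illustration, and weighted_mean is a hypothetical helper written for this sketch.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


values = [60, 70, 70, 85, 90, 100]
weights = [1, 1, 1, 1, 1, 5]  # the last value counts five times as much

sample_mean = mean(values)          # sum of values divided by n
sample_median = median(values)      # even n: average of the two middle values
sample_modes = multimode(values)    # list of the most frequent value(s)
sample_wmean = weighted_mean(values, weights)

print(round(sample_mean, 2), sample_median, sample_modes, sample_wmean)
```

multimode is used instead of mode because a data set may have more than one most frequent value.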
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]
October 7, 2019
Example

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes
6    No      NULL            60K             No

Mean: 1090K
Trimmed mean (remove min, max): 105K
Median: (90+100)/2 = 95K
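The effect of a single extreme value can be reproduced with a short sketch. It uses only the six income values listed in the table above, which is an assumption: the figures quoted on the slide appear to be based on additional rows, so the numbers here differ from them. The helper trimmed_mean is hypothetical and drops exactly one minimum and one maximum.

```python
from statistics import mean, median

# Taxable incomes in thousands (K), taken from the six listed rows.
incomes_k = [125, 100, 70, 120, 10000, 60]


def trimmed_mean(values):
    """Mean after removing the single smallest and single largest value."""
    ordered = sorted(values)
    return mean(ordered[1:-1])


plain_mean = mean(incomes_k)
robust_median = median(incomes_k)
robust_trimmed = trimmed_mean(incomes_k)

print(round(plain_mean, 1), robust_median, robust_trimmed)
```

The plain mean is dragged far above every typical income by the 10000K entry, while the median and the trimmed mean stay close to the bulk of the data.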
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$, where $\bar{x}$ is the sample mean and $\mu$ the population mean
– Standard deviation s (or σ) is the square root of variance s² (or σ²)
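A minimal sketch of these dispersion measures, using the ten taxable incomes from the record data table (in thousands). Quartile conventions vary between textbooks and tools; this example assumes the default "exclusive" method of statistics.quantiles, so other software may report slightly different Q1 and Q3.

```python
import math
from statistics import pvariance, quantiles, stdev, variance

incomes_k = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles and inter-quartile range.
q1, q2, q3 = quantiles(incomes_k, n=4)
iqr = q3 - q1

# The 1.5 x IQR rule: flag values far below Q1 or far above Q3.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes_k if x < low_fence or x > high_fence]

# Sample variance divides by n - 1; population variance divides by N.
s2 = variance(incomes_k)
sigma2 = pvariance(incomes_k)
s = stdev(incomes_k)

print(q1, q2, q3, iqr, outliers)
print(round(s2, 2), round(sigma2, 2), round(s, 2))
```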


Boxplot Analysis
 Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
 Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to Minimum and Maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
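The five numbers that a boxplot draws can be computed directly. The function five_number_summary below is a hypothetical helper, and it assumes the default quartile method of statistics.quantiles; plotting libraries may use a slightly different quartile definition.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

The box spans the second and fourth entries of the tuple, the line inside it sits at the third, and the whiskers reach toward the first and last.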

Properties of Normal Distribution Curve
 The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard
deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
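These percentages can be checked numerically with statistics.NormalDist from the Python standard library (3.8 or later). The sketch uses the standard normal distribution; the same coverage holds for any μ and σ.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within = {k: round(coverage(k), 4) for k in (1, 2, 3)}
print(within)
```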
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary
 Histogram: x-axis shows values, y-axis represents frequencies
 Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
Example Dataset
Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars

 It shows what proportion of cases fall into each of several categories


 Differs from a bar chart in that it is the area of the bar that denotes
the value, not the height as in bar charts, a crucial distinction when
the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals
of some variable. The categories (bars) must be adjacent
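The area-versus-height point can be illustrated with unequal bin widths. In this sketch the data and bin edges are made up, and histogram is a hypothetical helper; bar heights are computed as densities so that each bar's area equals the proportion of cases in its bin.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count values per bin; bins are [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    return counts


values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]  # the last bin is twice as wide as the others

counts = histogram(values, edges)
widths = [right - left for left, right in zip(edges, edges[1:])]

# Height = proportion / width, so that area = proportion of cases.
heights = [c / (len(values) * w) for c, w in zip(counts, widths)]
areas = [h * w for h, w in zip(heights, widths)]

print(counts, heights, areas)
```

The last bin holds more cases than the middle one, yet its bar is lower because it is wider; only the areas reflect the proportions.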
Histograms Often Tell More than Boxplots
 The two histograms shown on the left may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions
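A small constructed example makes this concrete. The two data sets below are made up for illustration and were chosen so that their five-number summaries coincide under the default quartile method of statistics.quantiles; the helper names are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


def bin_counts(values, edges):
    """Count values in each half-open bin [edges[i], edges[i+1])."""
    return [sum(1 for v in values if lo <= v < hi)
            for lo, hi in zip(edges, edges[1:])]


two_peaks = [0, 2, 2, 3, 5, 7, 8, 8, 10]   # clusters near 2 and near 8
one_peak = [0, 1, 3, 5, 5, 5, 7, 9, 10]    # cluster in the middle

edges = [0, 4, 7, 11]
summary_a = five_number_summary(two_peaks)
summary_b = five_number_summary(one_peak)
hist_a = bin_counts(two_peaks, edges)
hist_b = bin_counts(one_peak, edges)

print(summary_a, summary_b)  # identical summaries, hence identical boxplots
print(hist_a, hist_b)        # different shapes
```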
