
Week 1B - Data

1604C331 Data Mining

Week 1B:
Data

Odd Semester 2024-2025


20102620240829
Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Types of Data

What is Data/Dataset?
• Data/Dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples of attributes:
• eye color of a person
• temperature
– An attribute is also known as a variable, field, characteristic, dimension, or feature.
• An object is described by a collection of attributes (attribute vector or feature vector).
– Examples of objects:
• in a sales database: customer, store item, sales
• in a medical database: patient
• in a university database: student, professor, course
– An object is also known as a record, point, case, sample, entity, or instance.
• The distribution of data involving 1 attribute is called univariate.
A bivariate distribution involves 2 attributes, …

Example dataset (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
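The definitions above can be made concrete with the Tid/Refund/Marital Status table from this slide. The following is a minimal sketch, assuming each object is stored as a Python dict; the variable names are illustrative and not part of the course material.

```python
from collections import Counter

# Column names (attributes) and rows (objects) copied from the slide's table.
columns = ["Tid", "Refund", "Marital Status", "Taxable Income (K)", "Cheat"]
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each object (record) maps attribute names to attribute values.
records = [dict(zip(columns, row)) for row in rows]

# Attribute vector (feature vector) of the first object, without its identifier.
feature_vector = [value for name, value in records[0].items() if name != "Tid"]

# Univariate distribution: counts over one attribute.
marital_counts = Counter(r["Marital Status"] for r in records)

# Bivariate distribution: counts over a pair of attributes.
refund_cheat_counts = Counter((r["Refund"], r["Cheat"]) for r in records)

print(feature_vector)
print(marital_counts)
print(refund_cheat_counts)
```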
A sample dataset (student info)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object
• Same attribute can be mapped to different attribute values
– Examples: height can be measured in feet or meters
• Different attribute can be mapped to the same set of values
– Examples: attribute values for ID and age are integers.
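• The properties of an attribute can differ from the properties of the values used to represent it.
– Example: ID and age are both represented as integers, but an average age is meaningful while an average ID is not.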
Measurement of Length
• The way an attribute is measured may not match the properties of the attribute.
Properties of Attribute Values
• A useful (and simple) way to specify the type of
an attribute is to identify the properties of
numbers that correspond to underlying
properties of the attribute.
• Example:
– An attribute such as length has many of the
properties of numbers.
– It makes sense to compare and order objects by
length, as well as to talk about the differences and
ratios of length.
Attribute Types

• Each attribute type possesses all the properties and operations of the attribute types that precede it in the order nominal, ordinal, interval, ratio.
• The definition of the attribute types is cumulative: any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes.
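The cumulative definition can be sketched as a small lookup. The operator assigned to each level below (equality for nominal, ordering for ordinal, addition and subtraction for interval, multiplication and division for ratio) follows the usual textbook convention and is an assumption here, since the slide's table is not reproduced in the text.

```python
# Attribute types in cumulative order.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that become meaningful at each level (assumed textbook convention).
NEW_OPERATIONS = {
    "nominal": ["=", "!="],
    "ordinal": ["<", ">"],
    "interval": ["+", "-"],
    "ratio": ["*", "/"],
}


def valid_operations(attribute_type):
    """Return all operations valid for a type, including those it inherits."""
    last = LEVELS.index(attribute_type)
    operations = []
    for level in LEVELS[: last + 1]:
        operations.extend(NEW_OPERATIONS[level])
    return operations


print(valid_operations("ordinal"))
print(valid_operations("ratio"))
```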
Attributes by the number of values
• DISCRETE attribute (typically, nominal and ordinal attributes)
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes and assume only
2 values (true/false, yes/no, male/female, 0/1)
• CONTINUOUS attribute (typically, interval and ratio attributes)
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
• The outcomes of the states are not equally important. One state is
interpreted as more informative than the other state.

• Only presence (a non-zero attribute value) is regarded as important


– Words present in document
– Items present in customer transactions

• If we met a friend in the grocery store would we ever say the following?
“I see your purchases are very similar since we didn’t buy most of the same
things.”
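The grocery-store remark can be illustrated numerically. The sketch below compares two similarity measures on binary purchase vectors: the simple matching coefficient, which counts shared absences, and the Jaccard coefficient, which counts only shared presences. Neither measure is named on the slide, and the baskets are hypothetical.

```python
def simple_matching(a, b):
    """Fraction of positions where two binary vectors agree (0-0 matches count)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """Shared presences divided by positions where at least one vector is 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical baskets over a 10-item catalogue (1 = item purchased).
basket_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

smc = simple_matching(basket_a, basket_b)
jac = jaccard(basket_a, basket_b)
print(smc, jac)
```

The two shoppers have no item in common, yet simple matching reports 60% agreement because both skipped the same six items. For asymmetric attributes, a measure that ignores 0-0 matches is usually the better fit.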
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Example of graph data: Benzene Molecule, C6H6

Example of genetic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Example of transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
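Transaction data is often converted into a binary data matrix before mining. The sketch below does this for the five transactions listed on this slide; the matrix layout and the variable names are illustrative assumptions.

```python
# Transactions copied from the slide (TID -> items).
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

# One column per distinct item, in alphabetical order.
items = sorted({item for basket in transactions.values() for item in basket})

# One row per transaction: 1 if the item is present, 0 otherwise.
matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# Number of transactions that contain each item.
item_counts = {
    item: sum(row[j] for row in matrix.values()) for j, item in enumerate(items)
}

print(items)
for tid, row in matrix.items():
    print(tid, row)
print(item_counts)
```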
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
Statistics of Data

Basic Statistical Descriptions of Data
• Successful data preprocessing requires understanding the data first.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
std, outlier.
• Describe relations among multiple variables
– Numerical data: co-variance and correlation coefficient
– Nominal data: χ² (chi-square) correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
Measuring the central tendency (1)
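• Mean: n is the sample size and N is the population size.
– Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
– Population mean: $\mu = \frac{\sum x}{N}$
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$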

– Trimmed mean: chopping extreme values
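The three variants of the mean can be sketched in a few lines. The income values are the Taxable Income column (in thousands) of the example dataset; the scores, the weights, and the 10% trimming fraction are hypothetical choices for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values removed per end
    kept = ordered[k: len(ordered) - k]
    return sum(kept) / len(kept)


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

plain = mean(incomes)
trimmed = trimmed_mean(incomes, 0.10)                # drops 60 and 220
weighted = weighted_mean([70, 80, 90], [1, 1, 2])    # hypothetical scores and weights
print(plain, trimmed, weighted)
```

The single large income (220K) pulls the plain mean up to 104, while the trimmed mean is 95, which shows why trimming is used when extreme values are present.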


Measuring the central tendency (2)
• Median: middle value if odd number of values, or average of the middle
2 values otherwise.
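– Estimated by interpolation (for grouped data), giving an approximate median:
$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
where L1 is the low interval limit of the median interval, (Σ freq)_l is the sum of the frequencies before the median interval, freq_median is the frequency of the median interval, and width is the interval width (L2 – L1).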
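The interpolation formula for grouped data can be sketched as follows. The intervals and frequencies used here are hypothetical and are not the exercise data; the function name and the tuple layout are also assumptions.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    intervals: list of (low_limit, high_limit, frequency) tuples,
    sorted by low_limit and covering the data without gaps.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies before the current interval
    for low, high, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: L1 = low, width = high - low.
            return low + (n / 2 - cumulative) / freq * (high - low)
        cumulative += freq
    raise ValueError("no intervals given")


# Hypothetical grouped data: 20 values in three intervals.
example = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
approx_median = grouped_median(example)
print(approx_median)
```

Here n/2 = 10, the median interval is 10 to 20, and 5 values lie before it, so the estimate is 10 + (10 − 5)/10 × 10 = 15.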
Measuring the central tendency (3)
• Mode: value that occurs most frequently in the data
– Unimodal
• Empirical formula: mean − mode = 3 × (mean − median)

– Multi-modal: bimodal, trimodal
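A short sketch of finding modes and applying the empirical formula, using Python's standard `statistics` module. The sample values and the mean and median passed to the estimate are hypothetical, and the empirical relation gives only a rough estimate for moderately skewed unimodal data.

```python
from statistics import multimode


def modality(values):
    """Classify data by how many values share the highest frequency."""
    count = len(multimode(values))
    return {1: "unimodal", 2: "bimodal", 3: "trimodal"}.get(count, "multimodal")


def empirical_mode(mean_value, median_value):
    """Rearranged empirical formula: mode = mean - 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)


one_peak = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # 3 occurs most often
two_peaks = [1, 1, 2, 3, 3]              # 1 and 3 tie

print(multimode(one_peak), modality(one_peak))
print(multimode(two_peaks), modality(two_peaks))
print(empirical_mode(58.0, 54.0))        # hypothetical mean and median
```

Note that `multimode` returns every value when all values are distinct, so the classification is only meaningful when some values repeat.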


Symmetric vs Skewed Data
[Figure: three distributions labeled symmetric, negatively skewed, and positively skewed]
Measuring the dispersion of data (1)
Quartiles, outliers, and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)


• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles, median is marked, add
whiskers, and plot outliers individually.
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
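The five-number summary and the 1.5 × IQR outlier rule can be sketched with the standard `statistics` module. The data are the Taxable Income values (in thousands) from the example dataset. Quartile conventions differ between tools; this sketch assumes the default "exclusive" method of `statistics.quantiles`, so Excel or other libraries may report slightly different Q1 and Q3.

```python
from statistics import quantiles

incomes = sorted([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # in thousands

# Quartiles: the three cut points that split the data into four parts.
q1, median, q3 = quantiles(incomes, n=4)
iqr = q3 - q1

five_number_summary = (incomes[0], q1, median, q3, incomes[-1])

# Outlier fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(five_number_summary)
print(iqr, lower_fence, upper_fence)
print(outliers)
```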
Measuring the dispersion of data (2)
Variance and standard deviation (sample: s, population: σ)
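• Variance:
– Sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
– Population: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$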

• Standard deviation: s (or σ) is the square root of the variance s² (or σ²)


Note: The subtle difference of formulae for
sample vs. population
• n : the size of the sample
• N : the size of the population
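The difference between the sample and population formulas, and the equivalence of the two forms of each, can be checked numerically. This is a minimal sketch on hypothetical data; the function names are illustrative.

```python
from math import isclose, sqrt


def sample_variance(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Equivalent form: [sum(x_i^2) - (sum(x_i))^2 / n] / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum(x_i^2) / N - mu^2."""
    size = len(xs)
    mu = sum(xs) / size
    return sum(x * x for x in xs) / size - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

s2 = sample_variance(data)
sigma2 = population_variance(data)
print(s2, sqrt(s2))          # sample variance and standard deviation
print(sigma2, sqrt(sigma2))  # population variance and standard deviation
print(isclose(s2, sample_variance_shortcut(data)))
```

Dividing by n − 1 instead of N makes the sample variance slightly larger than the population variance computed on the same values.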
Boxplot Analysis
• Five-number summary of a distribution:
– minimum, Q1, median, Q3, maximum.
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles
– The height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to minimum and maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
Properties of Normal Distribution Curve
[Figure: normal distribution curve; the width of the curve represents data dispersion (spread), and its center represents central tendency]


Graphic Displays of Basic Statistical Descriptions

• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value xi is paired with fi, indicating that approximately 100 fi% of the data are ≤ xi
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variables. The categories (bars) must be adjacent.
[Figure: histogram; x-axis values from 10000 to 90000, y-axis frequencies from 0 to 40]
Histogram Often Tells More than Boxplot
• Two histograms shown on the right may have the same boxplot representation:
– the same values for: min, Q1, median, Q3, and max.
• But they have rather different data distributions
Quantile Plot
• Display all the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order
– fi indicates that approximately 100 fi% of the data are below or equal to
the value xi.
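The pairs plotted in a quantile plot can be computed directly. The slide does not state how fi is obtained; this sketch assumes the common convention fi = (i − 0.5) / n for the i-th smallest value, and the input values are hypothetical.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


pairs = quantile_pairs([30, 10, 20, 40])  # hypothetical data
for f, x in pairs:
    print(f"{100 * f:.1f}% of the data are <= {x} (approximately)")
```

Plotting x against f (for example with a plotting library) gives the quantile plot.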
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
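Positive, negative, and absent correlation can be quantified with a correlation coefficient. The sketch below uses Pearson's product-moment coefficient, which is one common choice and is assumed here; the data points are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y grows with x
falling = [10, 8, 6, 4, 2]   # y shrinks as x grows
unrelated = [2, 1, 3, 1, 2]  # no linear trend

r_pos = pearson(xs, rising)
r_neg = pearson(xs, falling)
r_zero = pearson(xs, unrelated)
print(round(r_pos, 3), round(r_neg, 3), round(r_zero, 3))
```

Values near +1 indicate positive correlation, values near −1 indicate negative correlation, and values near 0 indicate no linear relationship.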
Uncorrelated Data
Exercises

Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools like Excel or Python. Either way, explain your work in detail, step by step, until you reach the solution.
• Create a pdf file for your solution and submit it to ULS
• You can upload one additional file that you used for the computation (.xlsx or .ipynb) along with your .pdf file. Upload the files separately.
• Note: do not forget to put your Student ID and name on the first page of the pdf file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:

Compute an approximate median value for the data.


Basic Statistics Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
Question?
