Week 1B - Data


1604C331 Data Mining

Week 1B:
Data

Odd Semester 2024-2025


20102620240829
Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Types of Data

What is Data/Dataset?
• Data/Dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples of attributes:
• eye color of a person
• temperature
– An attribute is also known as a variable, field, characteristic, dimension, or feature.
• An object is described by a collection of attributes (attribute vector or feature vector).
– Examples of objects:
• in a sales database: customer, store item, sales
• in a medical database: patient
• in a university database: student, professor, course
– An object is also known as a record, point, case, sample, entity, or instance.

Example dataset (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• The distribution of data involving 1 attribute is called univariate.
A bivariate distribution involves 2 attributes, …
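As a minimal sketch of these definitions, the example table (Tid, Refund, Marital Status, Taxable Income, Cheat) can be stored as a list of records. The dict-per-object layout and the variable names are assumptions made for illustration; incomes are kept in thousands (K), as in the table.

```python
# Each object (row) is a dict that maps attribute names to attribute values.
# The values are copied from the example table; the layout is illustrative.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]
attributes = ["Tid", "Refund", "Marital Status", "Taxable Income (K)", "Cheat"]
records = [dict(zip(attributes, row)) for row in rows]

# Univariate view: the values of a single attribute.
incomes = [r["Taxable Income (K)"] for r in records]

# Bivariate view: pairs of values taken from two attributes.
status_vs_cheat = [(r["Marital Status"], r["Cheat"]) for r in records]

print(len(records), "objects,", len(attributes), "attributes")
print(incomes)
```

Each element of `records` plays the role of one attribute vector (feature vector) describing one object.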
A sample dataset (student info)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object
• The same attribute can be mapped to different attribute values
– Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
– Example: attribute values for ID and age are integers.
• The properties of an attribute can differ from the properties of the values
used to represent it.
Measurement of Length
• The way an attribute is measured may not match the attribute's
properties.
Properties of Attribute Values
• A useful (and simple) way to specify the type of
an attribute is to identify the properties of
numbers that correspond to underlying
properties of the attribute.
• Example:
– An attribute such as length has many of the
properties of numbers.
– It makes sense to compare and order objects by
length, as well as to talk about the differences and
ratios of length.
Attribute Types

• Each attribute type possesses all the properties and operations of the
attribute types that precede it in the order nominal, ordinal, interval, ratio.
• The definition of the attribute types is cumulative: any property or
operation that is valid for nominal, ordinal, and interval attributes is also
valid for ratio attributes.
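The cumulative definition can be sketched in a few lines. The operation sets below follow the usual textbook convention (distinctness for nominal, order for ordinal, differences for interval, ratios for ratio); the names `LEVELS`, `NEW_OPERATIONS`, and `valid_operations` are hypothetical helpers, not part of the slides.

```python
# Attribute types in cumulative order; each level adds new operations.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that become meaningful at each level (common textbook convention).
NEW_OPERATIONS = {
    "nominal": ["==", "!="],   # distinctness
    "ordinal": ["<", ">"],     # order
    "interval": ["+", "-"],    # differences are meaningful
    "ratio": ["*", "/"],       # ratios are meaningful
}

def valid_operations(attribute_type):
    """Return every operation valid for the type, including inherited ones."""
    last = LEVELS.index(attribute_type)
    operations = []
    for level in LEVELS[: last + 1]:
        operations.extend(NEW_OPERATIONS[level])
    return operations

for level in LEVELS:
    print(level, valid_operations(level))
```

A ratio attribute such as length therefore supports comparison, ordering, differences, and ratios, while a nominal attribute such as eye color supports only equality tests.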
Attributes by the number of values
• DISCRETE attribute (typically, nominal and ordinal attributes)
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes and assume only
2 values (true/false, yes/no, male/female, 0/1)
• CONTINUOUS attribute (typically, interval and ratio attributes)
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
• The outcomes of the states are not equally important. One state is
interpreted as more informative than the other state.

• Only presence (a non-zero attribute value) is regarded as important


– Words present in document
– Items present in customer transactions

• If we met a friend in the grocery store would we ever say the following?
“I see your purchases are very similar since we didn’t buy most of the same
things.”
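The grocery-store remark can be illustrated with a small sketch. It compares two similarity measures that the slides do not name: the simple matching coefficient, which counts shared absences, and the Jaccard coefficient, which ignores them. The catalog and both baskets are hypothetical.

```python
# Hypothetical store catalog and two baskets with no item in common.
catalog = ["Beer", "Bread", "Coke", "Diaper", "Milk",
           "Eggs", "Rice", "Soap", "Tea", "Salt"]
basket_a = {"Bread", "Milk"}
basket_b = {"Beer", "Diaper"}

def to_binary(basket):
    """Encode a basket as 0/1 values, one per catalog item."""
    return [1 if item in basket else 0 for item in catalog]

def simple_matching(x, y):
    """Fraction of positions that agree, counting 0-0 matches as agreement."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Shared presences divided by positions where either value is 1."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 0.0

x, y = to_binary(basket_a), to_binary(basket_b)
print("simple matching:", simple_matching(x, y))
print("jaccard:", jaccard(x, y))
```

The baskets share no purchased item, yet simple matching still reports a fairly high similarity because most catalog items are absent from both. A measure that counts only presences avoids this, which is the point of treating the attributes as asymmetric.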
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Example of graph data (molecular structure): Benzene molecule, C6H6

Example of genetic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Example of transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
Statistics of Data

Basic Statistical Descriptions of Data
• For data preprocessing to be successful, understand the data.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
standard deviation, outliers.
• Describe relations among multiple variables
– Numerical data: covariance and correlation coefficient
– Nominal data: χ² correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
Measuring the central tendency (1)
• Mean: n is the sample size and N is the population size.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

– Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

– Trimmed mean: chopping extreme values
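The three means can be sketched as follows. The function names are hypothetical, the incomes are the Taxable Income values (in thousands) from the example table, and the choice to trim `k` values from each end is an assumption, since the slide does not say how many extreme values to chop.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (assumed rule)."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    return mean(ordered[k: len(ordered) - k])

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

print(mean(incomes))
print(trimmed_mean(incomes, 1))
print(weighted_mean([10, 20], [1, 3]))
```

Trimming one value from each end removes 60 and 220, so the single large income of 220 no longer pulls the mean upward.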


Measuring the central tendency (2)
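• Median: middle value if odd number of values, or average of the middle
2 values otherwise.
– Estimated by interpolation (for grouped data):

$$\text{median} \approx L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

where L_1 is the low limit of the median interval, (Σ freq)_l is the sum of the
frequencies of the intervals before the median interval, freq_median is the
frequency of the median interval, and width is the interval width (L_2 – L_1).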
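A minimal sketch of the interpolation formula is shown below. The function name `grouped_median` and the three intervals are hypothetical; the intervals are made-up illustration data, not the data of any exercise.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by interpolation.

    intervals: list of (low, high, freq) tuples sorted by low limit.
    """
    n = sum(freq for _, _, freq in intervals)
    half = n / 2
    freq_before = 0  # sum of frequencies before the median interval
    for low, high, freq in intervals:
        if freq_before + freq >= half:
            width = high - low
            # median = L1 + (n/2 - sum_before) / freq_median * width
            return low + (half - freq_before) * width / freq
        freq_before += freq
    raise ValueError("intervals must contain at least one value")

# Hypothetical grouped data: (low limit, high limit, frequency).
example = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_median(example))
```

Here n = 10, so n/2 = 5 falls in the interval 10 to 20, which has 2 values before it and frequency 5, giving 10 + (5 − 2) / 5 × 10 = 16.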
Measuring the central tendency (3)
• Mode: value that occurs most frequently in the data
– Unimodal
• Empirical formula (for moderately skewed unimodal data):
mean − mode ≈ 3 × (mean − median)
– Multi-modal: bimodal, trimodal
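As a small illustration, the modes of a list can be found with `statistics.multimode` from the Python standard library (Python 3.8+). The helper `describe_modes` and the sample lists are hypothetical.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # all values tied for the highest frequency
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    # Note: if every value occurs once, every value is reported as a mode.
    return modes, labels.get(len(modes), "multi-modal")

print(describe_modes([1, 2, 2, 3]))
print(describe_modes([1, 2, 2, 3, 3, 4]))
```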


Symmetric vs Skewed Data
[Figure: symmetric, negatively skewed, and positively skewed distributions]
Measuring the dispersion of data (1)
Quartiles, outliers, and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)


• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles, median is marked, add
whiskers, and plot outliers individually.
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
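The quantities above can be computed with the standard library, as in the sketch below. Quartile conventions differ between textbooks and tools; using `statistics.quantiles` with `method="inclusive"` is an assumption here, so hand-computed answers may differ slightly. The incomes are the Taxable Income values (in thousands) from the example table.

```python
from statistics import quantiles

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

# Quartiles under the "inclusive" interpolation convention (an assumption).
q1, median, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max.
five_number_summary = (min(incomes), q1, median, q3, max(incomes))

# Usual outlier rule: more than 1.5 * IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(five_number_summary)
print("IQR:", iqr)
print("outliers:", outliers)
```

With this convention the income of 220 lies above the upper fence and would be plotted individually in a boxplot.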
Measuring the dispersion of data (2)
Variance and standard deviation (sample: s, population: σ)
• Variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

• Standard deviation: s (or σ) is the square root of the variance s² (or σ²)


Note: the formulae for sample and population differ subtly.
• n : the size of the sample
• N : the size of the population
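The difference between the sample and population formulae can be checked numerically. The sketch below implements both the definitional and the shortcut form of each variance; the function names are hypothetical and the incomes are the Taxable Income values (in thousands) from the example table.

```python
import math
import statistics

def sample_variance(xs):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """s^2 via sum of squares minus (sum)^2 / n, divided by n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def population_variance(xs):
    """sigma^2: squared deviations from the mean, divided by N."""
    N = len(xs)
    mu = sum(xs) / N
    return sum((x - mu) ** 2 for x in xs) / N

def population_variance_shortcut(xs):
    """sigma^2 via mean of squares minus squared mean."""
    N = len(xs)
    mu = sum(xs) / N
    return sum(x * x for x in xs) / N - mu ** 2

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

s2 = sample_variance(incomes)
sigma2 = population_variance(incomes)
print("sample variance:", s2, "std:", math.sqrt(s2))
print("population variance:", sigma2, "std:", math.sqrt(sigma2))
```

Dividing by n − 1 instead of N makes the sample variance slightly larger than the population variance computed from the same numbers.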
Boxplot Analysis
• Five-number summary of a distribution:
– minimum, Q1, median, Q3, maximum.
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles
– The height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to minimum and maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
Properties of Normal Distribution Curve
[Figure: normal distribution curve; the horizontal extent represents data dispersion (spread), and the center represents central tendency]


Graphic Displays of Basic Statistical Descriptions

• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value xi is paired with fi, indicating that approximately 100 fi% of the data are ≤ xi
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding
quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent.
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
Histogram Often Tells More than Boxplot
• Two histograms shown on the right may have the same boxplot
representation:
– the same values for: min, Q1, median, Q3, and max.
• But they have rather different data distributions
Quantile Plot
• Display all the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order,
fi indicates that approximately 100 fi% of the data are below or equal to
the value xi.
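The pairs (fi, xi) behind a quantile plot can be computed as in the sketch below. The slides do not give a formula for fi; the rule fi = (i − 0.5) / n is one common convention and is an assumption here, as are the function name and the sample values.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical data.
points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"about {100 * f:.1f}% of the data are <= {x}")
```

Plotting fi on the x-axis against xi on the y-axis gives the quantile plot.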
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
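What a scatter plot shows visually can be summarized with the covariance and the correlation coefficient. The sketch below uses the standard sample covariance and Pearson correlation formulas, which the slides mention but do not spell out; the function names and the data are hypothetical.

```python
import math

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

xs = [1, 2, 3, 4]
print(correlation(xs, [2, 4, 6, 8]))    # y rises with x
print(correlation(xs, [8, 6, 4, 2]))    # y falls as x rises
print(correlation(xs, [1, -1, -1, 1]))  # no linear relationship
```

A coefficient near +1 corresponds to positively correlated data, near −1 to negatively correlated data, and near 0 to data with no linear relationship.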
Exercises

Homework
• Do all the exercises.
• You can write the solutions on paper, or you can use tools such as Excel
or Python. Explain your work step by step, in detail, until you reach
the solution.
• Create a PDF file for your solutions and submit it to ULS.
• You can upload one more file that you used for the computation
(.xlsx or .ipynb) along with your .pdf file. Upload those files
separately.
• Note: do not forget to put your Student ID and name on the first page
of the PDF file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:

Compute an approximate median value for the data.


Basic Statistics Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
Question?

