Week 1B - Data
Informatics Engineering | Universitas Surabaya
What is Data/Dataset?
• A dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– An attribute is also known as a variable, field, characteristic, or dimension.
• Example (each row is an object, each column is an attribute):
Tid | Refund | Marital Status | Taxable Income | Cheat
4 | Yes | Married | 120K | No
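As a minimal sketch of the object/attribute idea, the single table row above can be held as a Python dictionary whose keys are the attributes. The variable names `record`, `dataset`, and `attributes` are illustrative choices, not part of the slides.

```python
# One data object (a row of the table); each key is an attribute.
record = {
    "Tid": 4,
    "Refund": "Yes",
    "Marital Status": "Married",
    "Taxable Income": "120K",
    "Cheat": "No",
}

# A dataset is a collection of such objects.
dataset = [record]

# The attribute names are shared by every object in the dataset.
attributes = list(record.keys())

print(attributes)
print(len(dataset), "object(s)")
```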
• If we met a friend in the grocery store, would we ever say the following?
“I see your purchases are very similar, since we didn’t buy most of the same things.”
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Example of a molecular structure: Benzene Molecule, C6H6
Example of genetic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Example of transaction data:
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
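The transaction data above can also be viewed as a data matrix. The following is a minimal sketch, assuming each transaction is stored as a set of item names. It builds a binary item matrix (1 if the item is in the basket, 0 otherwise) and counts how many transactions contain each item. The names `transactions`, `items`, `matrix`, and `item_counts` are illustrative.

```python
# Transactions from the table: TID -> set of purchased items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Column order of the matrix: all distinct items, sorted alphabetically.
items = sorted(set().union(*transactions.values()))

# One binary row per transaction.
matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# Number of transactions containing each item (column sums).
item_counts = {
    item: sum(row[col] for row in matrix.values())
    for col, item in enumerate(items)
}

print(items)
for tid, row in matrix.items():
    print(tid, row)
print(item_counts)
```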
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
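One common way to illustrate the curse of dimensionality (chosen here as an example, not taken from the slides) is to compare the volume of a ball of radius 0.5 with the unit hypercube that encloses it. As the number of dimensions d grows, the ratio shrinks toward zero, so most of the cube's volume lies in its corners and the data become sparse. The function name `ball_to_cube_ratio` is hypothetical.

```python
import math


def ball_to_cube_ratio(d):
    """Volume of the d-dimensional ball of radius 0.5 divided by the
    volume of the unit hypercube (which is 1) that encloses it."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d


for d in (1, 2, 3, 5, 10, 20):
    print(d, ball_to_cube_ratio(d))
```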
Statistics of Data
Basic Statistical Descriptions of Data
• Successful data preprocessing starts with understanding the data.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
std, outlier.
• Describe relations among multiple variables
– Numerical data: covariance and correlation coefficient
– Nominal data: χ² (chi-square) correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
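The measures listed above can be sketched with Python's standard `statistics` module. The values below are hypothetical, the "inclusive" quartile method is only one of several conventions, and the 1.5 × IQR outlier rule is a common convention assumed here rather than something the slides state. The helper `pearson` is an illustrative name. The χ² test for nominal data is not covered in this sketch.

```python
import statistics

# Hypothetical attribute values (not taken from the slides).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # all most-frequent values

# Dispersion: quartiles, IQR, five-number summary, variance, std.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
five_number = (min(values), q1, q2, q3, max(values))
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std = statistics.stdev(values)          # sample standard deviation

# Assumed convention: values more than 1.5 * IQR outside the quartiles.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


def pearson(xs, ys):
    """Sample covariance divided by the product of sample std deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


print(mean, median, modes)
print(five_number, iqr, outliers)
print(variance, std)
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```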
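Measuring the central tendency (1)
• Mean: n is the sample size and N is the population size.
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample mean), $\mu = \frac{\sum x}{N}$ (population mean)
– Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$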
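The two mean formulas can be sketched directly in Python. The function names `mean` and `weighted_mean` and the sample numbers are illustrative assumptions.

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


print(mean([1, 2, 3, 4]))
print(weighted_mean([10, 20], [3, 1]))
```

With equal weights, the weighted mean reduces to the ordinary mean.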
• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value x_i is paired with f_i, indicating that approximately 100·f_i% of the data are ≤ x_i
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
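The points behind a quantile plot and a q-q plot can be computed as in the sketch below. The choice f_i = (i − 0.5)/n is one common convention and is an assumption here, since the slides do not give a formula for f_i. The q-q helper assumes both samples have the same size, so sorted values can be paired directly. Both function names are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


def qq_points(a, b):
    """Pair the sorted values of two equally sized samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    return list(zip(sorted(a), sorted(b)))


for x, f in quantile_plot_points([30, 10, 20]):
    print(x, f)
print(qq_points([3, 1, 2], [30, 10, 20]))
```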
Histogram Analysis
[Figure: histogram]
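The frequencies shown on a histogram's y-axis can be obtained by counting values per bin. This is a minimal sketch assuming equal-width bins, with the last bin including its right edge. The function name `histogram_counts` and the sample values are hypothetical.

```python
def histogram_counts(values, low, width, bins):
    """Count values in `bins` equal-width bins starting at `low`.

    Bin k covers [low + k*width, low + (k+1)*width); the last bin
    also includes its right edge. Values outside the range are ignored.
    """
    counts = [0] * bins
    high = low + width * bins
    for v in values:
        if v == high:
            idx = bins - 1  # right edge belongs to the last bin
        else:
            idx = int((v - low) // width)
        if 0 <= idx < bins:
            counts[idx] += 1
    return counts


print(histogram_counts([1, 2, 2, 3, 5, 7, 8, 9, 10], low=0, width=5, bins=2))
```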
Homework
• Do all the exercises.
• You can write the solutions on paper, or use tools such as Excel or Python. Either way, explain your work step by step until you reach the solution.
• Create a PDF file of your solutions and submit it to ULS.
• You may also upload one additional file used for the computation (.xlsx or .ipynb) along with your PDF file. Upload the files separately.
• Note: do not forget to put your Student ID and name on the first page of the PDF file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:
a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
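For data grouped into intervals, the median is usually approximated by linear interpolation inside the interval that contains the middle observation: median ≈ L + ((N/2 − F) / f) × width, where L is the lower boundary of that interval, F is the total frequency of all lower intervals, and f is the interval's own frequency. The sketch below assumes this is the intended approach. The intervals in it are hypothetical placeholders, not the exercise's data, and `grouped_median` is an illustrative name.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data.

    `intervals` is a list of (lower, upper, frequency) tuples,
    sorted by lower boundary and covering contiguous ranges.
    """
    total = sum(freq for _, _, freq in intervals)
    half = total / 2
    cumulative = 0  # total frequency of the intervals already passed
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            # Interpolate linearly inside the median interval.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no observations")


# Hypothetical intervals: (lower, upper, frequency).
example = [(0, 10, 5), (10, 20, 15), (20, 30, 10)]
print(grouped_median(example))
```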
Questions?