Week 1B - Data

Uploaded by

Hafidz Nur shafwan


1604C331 Data Mining

Week 1B:
Data

Odd Semester 2024-2025

20102620240829
Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Types of Data

What is Data/Dataset?
• Data/Dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples of attributes:
• eye color of a person
• temperature
– An attribute is also known as a variable, field, characteristic, dimension,
or feature.
• An object is described by a collection of attributes (attribute
vector or feature vector).
– Examples of objects:
• in a sales database: customer, store item, sales
• in a medical database: patient
• in a university database: student, professor, course
– An object is also known as a record, point, case, sample, entity, or instance.
• The distribution of data involving 1 attribute is called univariate.
A bivariate distribution involves 2 attributes, …

Example dataset (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
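As a rough illustration of the object/attribute vocabulary, the sketch below stores the first three rows of the example table as Python dictionaries. The variable names are hypothetical, and reading "125K" as 125,000 is an assumption about the table's units.

```python
# Each dict is one data object (record); each key is one attribute.
# Assumption: "K" in the Taxable Income column means thousands.
dataset = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single",
     "Taxable Income": 125_000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married",
     "Taxable Income": 100_000, "Cheat": "No"},
    {"Tid": 3, "Refund": "No", "Marital Status": "Single",
     "Taxable Income": 70_000, "Cheat": "No"},
]

# The attribute names shared by all objects.
attributes = list(dataset[0].keys())

# Univariate view: the values of a single attribute.
incomes = [obj["Taxable Income"] for obj in dataset]

# Bivariate view: pairs of values from two attributes.
status_income = [(obj["Marital Status"], obj["Taxable Income"])
                 for obj in dataset]

print(attributes)
print(incomes)
print(status_income)
```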
A sample dataset (student info)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object
• The same attribute can be mapped to different attribute values
– Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
– Example: attribute values for ID and age are integers.
• The properties of an attribute can differ from the properties of the values
used to represent it.
Measurement of Length
• The way an attribute is measured may not match the attribute's
properties.
Properties of Attribute Values
• A useful (and simple) way to specify the type of
an attribute is to identify the properties of
numbers that correspond to underlying
properties of the attribute.
• Example:
– An attribute such as length has many of the
properties of numbers.
– It makes sense to compare and order objects by
length, as well as to talk about the differences and
ratios of length.
Attribute Types

• Each attribute type possesses all the properties and operations of the
attribute types that precede it in the order nominal, ordinal, interval, ratio.
• The definition of the attribute types is cumulative: any property
or operation that is valid for nominal, ordinal, and interval attributes
is also valid for ratio attributes.
Attributes by the number of values
• DISCRETE attribute (typically, nominal and ordinal attributes)
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes and assume only
2 values (true/false, yes/no, male/female, 0/1)
• CONTINUOUS attribute (typically, interval and ratio attributes)
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
• The outcomes of the states are not equally important. One state is
interpreted as more informative than the other state.

• Only presence (a non-zero attribute value) is regarded as important

– Words present in document
– Items present in customer transactions

• If we met a friend in the grocery store would we ever say the following?
“I see your purchases are very similar since we didn’t buy most of the same
things.”
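The grocery-store remark can be made concrete with a small sketch. It is an illustration rather than part of the original slides: the ten-item catalog and the two baskets are hypothetical, and the two measures (a simple matching score that counts shared absences, and a Jaccard-style score that ignores them) are swapped in only to show why absences are usually discounted for asymmetric attributes.

```python
def simple_matching(a, b):
    """Fraction of positions where both vectors agree (1/1 or 0/0)."""
    agree = sum(1 for x, y in zip(a, b) if x == y)
    return agree / len(a)


def jaccard(a, b):
    """Shared presences divided by positions where at least one is present."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical baskets over a 10-item catalog: 1 = item bought, 0 = not bought.
basket_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

smc = simple_matching(basket_a, basket_b)
jac = jaccard(basket_a, basket_b)
print(smc, jac)
```

The two shoppers have nothing in common, yet the matching score is high because it rewards the eight items neither of them bought; the presence-only score is zero.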
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Benzene Molecule: C6H6
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
Statistics of Data

Basic Statistical Descriptions of Data
• For data preprocessing to be successful, understand the data.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
std, outlier.
• Describe relations among multiple variables
– Numerical data: co-variance and correlation coefficient
– Nominal data: χ² (chi-square) correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
Measuring the central tendency (1)
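• Mean: n is sample size and N is population size.
  Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  Population mean: $\mu = \frac{\sum x}{N}$
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values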
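The three means can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference solution: the function names are hypothetical, the data are the Taxable Income values (in thousands) from the example table, and trimming a fixed count `k` from each end is one possible reading of "chopping extreme values".

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (assumed rule)."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# Taxable Income column of the example table, in thousands.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

print(mean(incomes))                       # plain mean
print(weighted_mean(incomes, [1] * 10))    # equal weights reproduce the mean
print(trimmed_mean(incomes, 1))            # drops 60 and 220
```

Dropping the single largest value (220) pulls the trimmed mean well below the plain mean, which is the reason trimming is used when extreme values are present.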

Measuring the central tendency (2)
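• Median: middle value if odd number of values, or average of the middle
2 values otherwise.
– Estimated by interpolation (for grouped data):
  $\text{median} \approx L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
  where $L_1$ is the low limit of the median interval, $(\sum \text{freq})_l$ is
  the sum of the frequencies before the median interval, $\text{freq}_{\text{median}}$
  is the frequency of the median interval, and width is the interval width ($L_2 - L_1$).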
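The interpolation formula for grouped data can be sketched as follows. The function name and the intervals in the usage example are hypothetical (they are not the data of the exercise later in these slides), and the sketch assumes the intervals are sorted and contiguous.

```python
def grouped_median(intervals, freqs):
    """Approximate median for grouped data by linear interpolation.

    intervals: list of (low, high) limits, sorted in increasing order
    freqs:     number of values falling in each interval
    """
    n = sum(freqs)
    if n <= 0:
        raise ValueError("need at least one observation")
    cumulative = 0  # sum of frequencies before the current interval
    for (low, high), freq in zip(intervals, freqs):
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            width = high - low
            return low + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequencies do not cover n/2")


# Hypothetical grouped data: 20 values spread over three intervals.
example_intervals = [(0, 10), (10, 20), (20, 30)]
example_freqs = [5, 10, 5]
approx_median = grouped_median(example_intervals, example_freqs)
print(approx_median)
```

In the example, n/2 = 10 and 5 values lie before the interval 10 to 20, so the estimate sits halfway through that interval.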
Measuring the central tendency (3)
• Mode: value that occurs most frequently in the data
– Unimodal
• Empirical formula: mean − mode = 3 × (mean − median)
– Multi-modal: bimodal, trimodal
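A quick way to check whether an attribute is unimodal or multi-modal is to list every most-frequent value. The sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later) on the Marital Status column of the example table; the helper name `modality` is hypothetical.

```python
from statistics import multimode


def modality(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # all values tied for the highest count
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multi-modal")


# Marital Status column of the example table (rows 1 to 10).
marital_status = ["Single", "Married", "Single", "Married", "Divorced",
                  "Married", "Divorced", "Single", "Married", "Single"]

modes, label = modality(marital_status)
print(modes, label)
```

Here "Single" and "Married" each occur four times, so this nominal attribute has two modes.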

Symmetric vs Skewed Data
[Figure: symmetric, positively skewed, and negatively skewed distributions]
Measuring the dispersion of data (1)
Quartiles, outliers, and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles, median is marked, add
whiskers, and plot outliers individually.
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
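The five-number summary and the outlier rule can be sketched with the standard library. Two assumptions are worth flagging: quartile conventions differ between tools, and this sketch uses `statistics.quantiles` with `method="inclusive"`, so other software may report slightly different Q1 and Q3; and outliers are taken to be values more than 1.5 × IQR beyond the quartiles. The data are the Taxable Income values (in thousands) from the example table.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values, factor=1.5):
    """Values lying more than factor * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - factor * iqr
    high_fence = q3 + factor * iqr
    return [v for v in values if v < low_fence or v > high_fence]


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

summary = five_number_summary(incomes)
outliers = iqr_outliers(incomes)
print(summary)
print(outliers)
```

With this convention the IQR is 37.5, the upper fence is 171.25, and only the income of 220 is flagged.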
Measuring the dispersion of data (2)
Variance and standard deviation (sample: s, population: σ)
• Variance:
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation: s (or σ) is the square root of the variance s² (or σ²)

Note: The subtle difference of formulae for
sample vs. population
• n : the size of the sample
• N : the size of the population
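The difference between the sample and population formulas is easy to see in code. The sketch below implements the definition and the shortcut form side by side; the function names and the small data set are hypothetical, chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def sample_variance(values):
    """s^2 = (1 / (n - 1)) * sum((x_i - mean)^2); divides by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equivalent form: (sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


def population_variance(values):
    """sigma^2 = (1 / N) * sum((x_i - mu)^2); divides by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

print(sample_variance(data), sqrt(sample_variance(data)))
print(sample_variance_shortcut(data))
print(population_variance(data), sqrt(population_variance(data)))
```

The squared deviations sum to 32, so the population variance is 32/8 = 4 while the sample variance is 32/7; the gap shrinks as the number of values grows.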
Boxplot Analysis
• Five-number summary of a distribution:
– minimum, Q1, median, Q3, maximum.
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles
– The height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to minimum and maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
Properties of Normal Distribution Curve
[Figure: normal distribution curve; the horizontal spread represents data
dispersion, and the center represents central tendency]
Graphic Displays of Basic Statistical Descriptions

• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value xi is paired with fi indicating that approximately 100 fi% of data are ≤
xi
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding
quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
Histogram Analysis
[Figure: example histogram; x-axis values from 10000 to 90000, y-axis
frequencies from 0 to 40]
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
variables. The categories (bars) must be adjacent.
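The area-versus-height distinction can be illustrated by tabulating counts over bins of unequal width. In the sketch below, the function names and bin edges are hypothetical, the last bin is assumed to include its upper edge, and the data are the Taxable Income values (in thousands) from the example table.

```python
def bin_counts(values, edges):
    """Count values in adjacent, non-overlapping bins [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for k in range(len(counts)):
            in_bin = edges[k] <= v < edges[k + 1]
            on_top_edge = k == last and v == edges[k + 1]
            if in_bin or on_top_edge:
                counts[k] += 1
                break
    return counts


def bar_heights(counts, edges):
    """Heights chosen so that each bar's area (height * width) is its count."""
    return [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
edges = [50, 100, 150, 250]  # the last bin is twice as wide as the others

counts = bin_counts(incomes, edges)
heights = bar_heights(counts, edges)
print(counts)
print(heights)
```

Because the last bin is wider, its bar is drawn lower than its count alone would suggest; multiplying each height by its bin width recovers the counts.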
Histogram Often Tells More than Boxplot
• Two histograms shown on the right may have the same boxplot
representation:
– the same values for: min, Q1, median, Q3, and max.
• But, they have rather different data distributions
Quantile Plot
• Display all the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order,
fi indicates that approximately 100 fi% of the data are below or equal to
the value xi.
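The points of a quantile plot can be computed without any plotting library. The slides do not say how fi is obtained, so the sketch below assumes one common textbook convention, fi = (i − 0.5) / n for the i-th smallest value; the function name is hypothetical.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical data; each tuple is (f_i, x_i), ready to plot as f on the
# horizontal axis and x on the vertical axis.
points = quantile_points([30, 10, 20, 40])
for f, x in points:
    print(f"{f:.3f}  {x}")
```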
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
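Positive, negative, and absent correlation can also be checked numerically with the correlation coefficient mentioned earlier for numerical data. The sketch below computes the Pearson coefficient from its definition; the function name and the three small data sets are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4]
r_positive = pearson(xs, [2, 4, 6, 8])     # y rises with x
r_negative = pearson(xs, [8, 6, 4, 2])     # y falls as x rises
r_none = pearson(xs, [1, -1, -1, 1])       # no linear trend
print(r_positive, r_negative, r_none)
```

Values near +1 and −1 correspond to the positively and negatively correlated scatter plots, and a value near 0 to the uncorrelated one.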
Exercises

Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools like Excel
or Python. Either way, explain your work step by step, in detail, until you
reach the solution.
• Create a pdf file for your solution and submit it to ULS
• You can upload one more file that you use to do the computation
(.xlsx or .ipynb) along with your .pdf file. Upload those files
separately.
• Note: do not forget to put your Student ID and name at the first page
of the pdf file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:

Compute an approximate median value for the data.

Basic Statistics Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
Question?
