
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 2

Chapter 2. Data, Measurements, and Data Preprocessing

Assistant Professor: Dr. Rasha Saleh
Agenda
➢Types of Datasets
➢Statistics of Data
Types of Data Sets: (1) Record Data
➢ Relational records
➢ Relational tables, highly structured
➢ Data matrix, e.g., numerical matrix, crosstabs

➢ Transaction data

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

➢ Document data: Term-frequency vector (matrix) of text documents

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6     0    2      0        2
Document 2     0     7      0     2     1      0     0    3      0        0
Document 3     0     1      0     0     1      2     2    0      3        0
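A term-frequency matrix like the one above can be built by counting how often each vocabulary term occurs in each document. The following is a minimal sketch, not the lecture's own code: the function name `term_frequency_matrix`, the toy documents, and the vocabulary are all hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def term_frequency_matrix(documents, vocabulary):
    """Return one row of term counts per document, with columns ordered as in vocabulary."""
    rows = []
    for text in documents:
        # Counter returns 0 for terms that never occur in the document.
        counts = Counter(text.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows


# Hypothetical toy documents and vocabulary, for illustration only.
docs = [
    "team play play score game",
    "coach ball lost lost",
]
vocab = ["team", "coach", "play", "ball", "score", "game", "lost"]

matrix = term_frequency_matrix(docs, vocab)
for row in matrix:
    print(row)
```

Each row is the term-frequency vector of one document, so documents of different lengths become vectors of the same dimension.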

Types of Data Sets: (2) Graphs and Networks
➢ Transportation network
➢ A system of infrastructure, vehicles, and operations that facilitate the movement of people, goods,
and services across different locations. This network is complex and typically includes roads,
railways, airways, waterways, and pipelines, as well as the vehicles and systems that use these
paths, such as cars, buses, trains, planes, ships, and trucks. Transportation networks are crucial
for economic development, urban planning, logistics, and overall connectivity.

➢ Molecular Structures
➢ Molecular structures refer to the arrangement of atoms within
a molecule, including the bonding patterns, spatial
positioning, and interactions that define the molecule's shape
and properties.

➢ Social or information networks
➢ Systems of interconnected individuals or entities that
exchange information, ideas, or resources, facilitating
communication, collaboration, and the flow of knowledge.
Types of Data Sets: (3) Ordered Data
➢ Video data: sequence of images
➢ A video dataset consisting of a sequence of images from a match can be
used for purposes such as analysing game tactics, player movements,
or event detection.

➢ Temporal data: time series
➢ E.g., frequency and magnitude of earthquakes over time

➢ Sequential Data: transaction sequences

➢ A series of actions or events (transactions) that occur in a specific order
over time. These transactions can happen in various domains, such as
e-commerce, banking, or retail, and are critical for understanding
patterns and behaviors and for making predictions.
➢ Genetic sequence data
➢ A genetic sequence is a linear order of nucleotide bases, such as:
•DNA sequence: ATCGTAGC...
•RNA sequence: AUCGAUGC...
Types of Data Sets: (4) Spatial, image and multimedia Data

➢ Spatial data: maps (heat maps)

➢ Image data: Geospatial images

➢ Video data

➢ Geospatial data: data that is associated with a specific location on the Earth’s surface.
Data Objects
➢Data sets are made up of data objects
➢A data object represents an entity

➢Examples:
➢sales database: customers, store items, sales
➢medical database: patients, treatments
➢university database: students, professors, courses
➢Also called samples, examples, instances, data points, objects, tuples
➢Data objects are described by attributes
➢Database rows → data objects; columns → attributes
Attributes
➢Attribute (or dimension, feature, variable)
➢A data field, representing a characteristic or feature of a data object.
➢E.g., customer_ID, name, address
➢Types:
➢Nominal (e.g., colors (red, blue), name, …)
➢Binary (e.g., {true, false})
➢Ordinal (e.g., {freshman, sophomore, junior, senior})
➢Numeric: quantitative (represented as numbers)
➢E.g., length, weight, …
➢Interval-scaled: measured on a scale of equal-sized units with no true zero point (e.g., temperature in °C, calendar dates)
➢Ratio-scaled: has a true zero point, so ratios of values are meaningful (e.g., length, weight, counts)
➢Discrete attributes (values can be counted and are not subdivided) vs. continuous attributes (values can be subdivided)
Attribute Types
String variable: letters and numbers

➢ Nominal: categories, states, or “names of things”

➢ Hair_color = {auburn, black, blond, brown, grey, red, white}
➢ marital status, occupation, ID numbers, zip codes

Hair_color   Gender   address
Black        male
Brown        Female

➢ Binary
➢ Nominal attribute with only 2 states (0 and 1)
➢ Symmetric binary: both outcomes equally important
➢ e.g., gender
➢ Asymmetric binary: outcomes not equally important.
➢ e.g., medical test (positive vs. negative)
➢ Convention: assign 1 to most important outcome (e.g., HIV positive)
➢ Ordinal
➢ Values have a meaningful order (ranking) but magnitude (difference) between successive
values is not known
➢ Survey Responses: In a customer satisfaction survey, responses like "Very Unsatisfied," "Unsatisfied," "Neutral," "Satisfied,"
"Very Satisfied" represent ordinal data. You can rank these responses, but the magnitude of the difference between "Neutral" and
"Satisfied" is subjective and cannot be easily measured.
➢ Size = {small, medium, large}, grades, army rankings
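The distinction between nominal, binary, and ordinal attributes can be made concrete in code. The sketch below is only one possible encoding, assuming Python's standard `enum` module; the class name `Size` and the variable names are hypothetical. The integer codes of an ordinal attribute encode rank only, so comparing them is meaningful but subtracting them is not.

```python
from enum import IntEnum


class Size(IntEnum):
    """Ordinal attribute: the integer codes encode rank, not magnitude."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


# Nominal attribute: values can be tested for equality but have no order.
hair_colors = {"auburn", "black", "blond", "brown", "grey", "red", "white"}

# Asymmetric binary attribute: 1 is assigned to the more important outcome.
test_result = {"positive": 1, "negative": 0}

shirts = [Size.LARGE, Size.SMALL, Size.MEDIUM]
ranked = sorted(shirts)           # ordering is meaningful for ordinal values
print([s.name for s in ranked])   # ['SMALL', 'MEDIUM', 'LARGE']
print(Size.SMALL < Size.LARGE)    # True
print("black" in hair_colors)     # True
```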
Discrete vs. Continuous Attributes
➢Discrete Attribute
➢Has only a finite or countably infinite set of values (the values can be counted and are not
subdivided)
➢E.g., number of courses, number of siblings, zip codes, profession, or the set of
words in a collection of documents
➢Sometimes, represented as integer variables
➢Note: Binary attributes are a special case of discrete attributes
➢Continuous Attribute
➢Has real numbers as attribute values
➢E.g., temperature, height, or weight
➢Practically, real values can only be measured and represented using a finite number
of digits
➢Continuous attributes are typically represented as floating-point variables
Statistics of Data

➢Measuring the Central Tendency

➢Measuring the Dispersion of Data

➢Covariance and Correlation Analysis

➢Graphic Displays of Basic Statistics of Data

Basic Statistical Descriptions of Data
➢ Motivation
➢ To better understand the data: central tendency, variation and spread

When analysing data, it's crucial to have a clear understanding of the central tendency, variation,
and spread of the data. These concepts help summarize large datasets, identify patterns, and
make informed decisions based on data.

Central Tendency:

•Purpose: To determine the centre or average of the dataset, providing a representative value of the data.
•Example: If we have exam scores like [50, 60, 70, 80, 90], the mean is 70, which gives us a
central idea of the overall performance.
Basic Statistical Descriptions of Data
➢ Data dispersion : refers to how spread out or scattered the
data points are.
➢ Characteristics: Median, max, min, quantiles, outliers,
variance, ...
➢ Numerical dimensions correspond to sorted intervals
➢ Data dispersion:
➢ Analyzed with multiple granularities of precision
➢ Granularity is the level of detail at which data is viewed
or analyzed. For example, you might look at data at a
coarse granularity (e.g., annual data), or at a fine
granularity (e.g., daily or hourly data).
➢ In the context of dispersion, different granularities can
help reveal patterns of variation in different time
periods, groups, or categories.
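The dispersion characteristics listed above (min, max, median, quantiles, variance) can be computed with Python's standard `statistics` module. This is a minimal sketch: the data are the dice totals that appear in this lecture's mean exercise, and the choice of population variance rather than sample variance is an assumption made here for illustration.

```python
import statistics

# Dice totals taken from the lecture's mean exercise.
scores = [7, 5, 2, 7, 6, 12, 10, 4, 8, 9]

low, high = min(scores), max(scores)
value_range = high - low
median = statistics.median(scores)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)

variance = statistics.pvariance(scores)   # population variance (assumed here)
std_dev = statistics.pstdev(scores)       # population standard deviation

print(f"min={low} max={high} range={value_range}")
print(f"median={median} Q1={q1} Q3={q3} IQR={q3 - q1}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
```

Note that different tools use slightly different quartile conventions, so Q1 and Q3 may differ a little from hand calculations.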
Measuring the Central Tendency: (1) Mean
➢Mean (algebraic measure)
Mean (Average): the most common and effective numeric measure of the
center of an attribute. It is the sum of all data points divided by the
number of points. It gives an overall idea of the "average" value of the
dataset.
Note: n is sample size and N is population size.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
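The formula translates directly into code. The sketch below uses a hypothetical helper named `mean` and applies it to the exam scores quoted earlier in the lecture.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


exam_scores = [50, 60, 70, 80, 90]   # example scores quoted earlier in the lecture
print(mean(exam_scores))             # 70.0
```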
Measuring the Central Tendency: (1) Mean
➢Mean (algebraic measure) Note: n is sample size and N is population
size.

➢Try it yourself:
Two dice were thrown 10 times. For each throw, their scores were added together
and recorded.
7, 5, 2, 7, 6, 12, 10, 4, 8, 9

Calculate the mean

Mean = (7 + 5 + 2 + 7 + 6 + 12 + 10 + 4 + 8 + 9) / 10 = 70 / 10 = 7
Measuring the Central Tendency: Weighted Mean
➢Weighted arithmetic mean:
➢Sometimes each value x_i in a set may be associated with a weight w_i, for i = 1, …, n:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

➢ The weights reflect the significance, importance, or occurrence frequency attached to their
respective values.
Measuring the Central Tendency: Weighted Mean
➢Weighted arithmetic mean:
Try it yourself
Find weighted mean for following data set w = {2, 5, 6, 8, 9}, x = {4, 3,
7, 5, 6}
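One way to check a hand calculation of this exercise is a short script. The helper name `weighted_mean` is hypothetical; the data are the w and x sets from the exercise.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


w = [2, 5, 6, 8, 9]
x = [4, 3, 7, 5, 6]

# Numerator: 2*4 + 5*3 + 6*7 + 8*5 + 9*6 = 159; denominator: 2+5+6+8+9 = 30.
result = weighted_mean(x, w)
print(result)   # 159 / 30 = 5.3
```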
Measuring the Central Tendency: Trimmed Mean
➢Trimmed mean:
➢Chopping extreme values
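A trimmed mean can be sketched as follows. This is an illustration under stated assumptions, not the lecture's own procedure: the function name `trimmed_mean` and the convention of cutting the same proportion from each end of the sorted data are choices made here, and the data are the dice totals from the lecture's mean exercise.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    if not values:
        raise ValueError("trimmed mean of an empty dataset is undefined")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


scores = [7, 5, 2, 7, 6, 12, 10, 4, 8, 9]
trimmed = trimmed_mean(scores, 0.10)
print(trimmed)   # drops 2 and 12, leaving 56 / 8 = 7.0
```

With a proportion of 0 the result equals the ordinary mean, which makes the effect of the extreme values easy to compare.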
Measuring the Central Tendency: Trimmed Mean
➢Trimmed mean:
➢Try it Yourself
Measuring the Central Tendency: (2) Median
➢Median:
➢After sorting the values in the data set, the median is the middle value if the data set has an odd
number of values, or the average of the middle two values otherwise
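The definition above can be written as a short function. The name `median` is a hypothetical helper; the two datasets are the exam scores and dice totals used elsewhere in this lecture.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle pair


print(median([50, 60, 70, 80, 90]))               # 70
print(median([7, 5, 2, 7, 6, 12, 10, 4, 8, 9]))   # 7.0
```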
Measuring the Central Tendency: (2) Median
➢Median:
➢Try it yourself
Measuring the Central Tendency: Median of large dataset
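Calculate the mean and median household size (number of family members) for 30 families. Because 30 is an even number, the median is the average of the two middle values.
➢ Percent = (frequency / total frequency) × 100, e.g., (3/30)×100 = 10.0 and (4/30)×100 = 13.3
➢ Cumulative frequency is the running total of the frequencies: 3+4 (household size 2 is included), 3+4+10 (sizes 2 and 3 are included), 3+4+10+4, and so on
➢ Note: the last cumulative frequency is the total number of families (30), and the last cumulative percent is 100%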
Measuring the Central Tendency: Median of large dataset

Adding one more observation makes the number of observations odd (31), so the median is the single middle value.
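Finding the mean and median from a frequency table can be sketched in code. The slide's table did not survive extraction, so the frequencies below are hypothetical values chosen only so that they sum to 30; the function names are also assumptions.

```python
def mean_from_frequencies(freq_table):
    """Mean of discrete data given as {value: frequency}."""
    n = sum(freq_table.values())
    if n == 0:
        raise ValueError("frequency table is empty")
    return sum(value * freq for value, freq in freq_table.items()) / n


def median_from_frequencies(freq_table):
    """Median of discrete data given as {value: frequency}, via cumulative frequency."""
    n = sum(freq_table.values())
    if n == 0:
        raise ValueError("frequency table is empty")
    # 1-based positions of the middle observation(s) in sorted order.
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    cumulative = 0
    for value in sorted(freq_table):
        cumulative += freq_table[value]
        # A position is located once the cumulative frequency reaches it.
        while len(found) < len(positions) and positions[len(found)] <= cumulative:
            found.append(value)
    return sum(found) / len(found)


# Hypothetical table: household size -> number of families (sums to 30).
household_sizes = {1: 3, 2: 4, 3: 10, 4: 4, 5: 5, 6: 4}

print(round(mean_from_frequencies(household_sizes), 2))   # 106 / 30, about 3.53
print(median_from_frequencies(household_sizes))           # 15th and 16th values are both 3
```

For an even total the function averages the two middle observations; for an odd total it returns the single middle observation.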
Measuring the Central Tendency: Median of grouped data
For larger data sets, the median is expensive to compute, so it is approximated from grouped (interval) data.

Median interval: the interval that contains the middle observation.

Interval width = 60 - 40 = 20

To get the median, compute the cumulative frequencies and find the middle position among all employees. In this
example there are 120 employees, and 120 / 2 = 60, so the middle position is the 60th employee. This employee
falls in the interval whose cumulative frequency is 77, so the median interval is $40 to $60.
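Once the median interval is known, a commonly used approximation interpolates linearly inside it: median ≈ L + ((n/2 − F) / f) × width, where L is the lower bound of the median interval, F is the cumulative frequency below it, and f is its own frequency. The sketch below assumes that formula. The interval frequencies are hypothetical, chosen only to match the slide's totals (120 employees, cumulative frequency 77 at the $40 to $60 interval).

```python
def grouped_median(intervals):
    """Approximate median from (lower, upper, frequency) intervals sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative_below = 0
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative_below + freq >= half:
            width = upper - lower
            # Linear interpolation inside the median interval.
            return lower + (half - cumulative_below) * width / freq
        cumulative_below += freq
    raise ValueError("median interval not found")


# Hypothetical frequencies: cumulative counts are 15, 37, 77, 102, 120.
salaries = [
    (0, 20, 15),
    (20, 40, 22),
    (40, 60, 40),
    (60, 80, 25),
    (80, 100, 18),
]

approx_median = grouped_median(salaries)
print(approx_median)   # 40 + (60 - 37) * 20 / 40 = 51.5
```

The result is an estimate: it assumes the observations are spread evenly within the median interval.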
Measuring the Central Tendency: Median of grouped data
Try it yourself

The following data represents the survey regarding the heights (in cm) of 51 girls of Class x. Find the median height.
Thank You
