
Lecture2

The document outlines the content of a course on Data Mining and Business Intelligence, specifically focusing on types of datasets, data objects, and statistical descriptions of data. It covers various data types including record data, graphs, ordered data, and spatial data, along with attributes and their classifications. Additionally, it discusses methods for measuring central tendency and data dispersion, providing examples and exercises for practical understanding.

Uploaded by

kiro2morris3
Copyright
© All Rights Reserved
SET 393: Data Mining and Business Intelligence

3rd Year

Spring 2025

Lec. 2

Chapter 2. Data, Measurements, and Data Preprocessing


Assistant Professor: Dr. Rasha Saleh
Agenda
➢Types of Datasets
➢Statistics of Data
Types of Data Sets: (1) Record Data
➢ Relational records
➢ Relational tables, highly structured
➢ Data matrix, e.g., numerical matrix, crosstabs

➢ Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

➢ Document data: Term-frequency vector (matrix) of text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
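A minimal sketch of how one row of a term-frequency matrix could be built. The vocabulary reuses the ten terms from the slide; the function name `term_frequency_vector` and the sample sentence are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Fixed vocabulary: each document becomes one row of counts in this order.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocab]


# Hypothetical document, not taken from the lecture.
doc = "the coach praised the team after the game and the team enjoyed the game"
vector = term_frequency_vector(doc, vocabulary)
print(vector)
```

Stacking one such vector per document gives the document-term matrix; real pipelines usually add tokenization and stemming, which this sketch leaves out.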


Types of Data Sets: (2) Graphs and Networks
➢ Transportation network
➢ A system of infrastructure, vehicles, and operations that facilitates the movement of people, goods,
and services across different locations. This network is complex and typically includes roads,
railways, airways, waterways, and pipelines, as well as the vehicles and systems that use these
paths, such as cars, buses, trains, planes, ships, and trucks. Transportation networks are crucial
for economic development, urban planning, logistics, and overall connectivity.

➢ Molecular Structures
➢ Molecular structures refer to the arrangement of atoms within
a molecule, including the bonding patterns, spatial
positioning, and interactions that define the molecule's shape
and properties.

➢ Social or information networks
➢ Systems of interconnected individuals or entities that
exchange information, ideas, or resources, facilitating
communication, collaboration, and the flow of knowledge.
Types of Data Sets: (3) Ordered Data
➢ Video data: sequence of images
A video dataset consisting of a sequence of images from a match can be
used for various purposes, such as analysing game tactics, player movements,
or event detection.

➢ Temporal data: time series
➢ E.g., frequency and magnitude of earthquakes over time

➢ Sequential data: transaction sequences
A series of actions or events (transactions) that occur in a specific order
over time. These transactions can happen in various domains, such as
e-commerce, banking, or retail, and are critical for understanding
patterns and behaviors and for making predictions.
➢ Genetic sequence data
A genetic sequence is a linear order of nucleotide bases, such as:
• DNA sequence: ATCGTAGC...
• RNA sequence: AUCGAUGC...
Types of Data Sets: (4) Spatial, image and multimedia Data

➢ Spatial data: maps (heat maps)

➢ Image data: Geospatial images

➢ Video data

Geospatial data refers to data that is associated with a
specific location on the Earth’s surface.
Data Objects and Attributes
Data Objects
➢Data sets are made up of data objects
➢A data object represents an entity

➢Examples:
➢sales database: customers, store items, sales
➢medical database: patients, treatments
➢university database: students, professors, courses
➢Also called samples, examples, instances, data points, objects, tuples
➢Data objects are described by attributes
➢Database rows → data objects; columns → attributes
Attributes
➢Attribute (also called dimension, feature, or variable)
➢A data field representing a characteristic or feature of a data object.
➢E.g., the columns Customer_ID, name, address
➢Types:
➢Nominal (e.g., colors (red, blue), name, ...)
➢Binary (e.g., {true, false})
➢Ordinal (e.g., {freshman, sophomore, junior, senior})
➢Numeric: quantitative (represented as numbers)
➢E.g., length, weight, ...
➢Interval-scaled: measured on a scale of equal-sized units with no true zero point
(e.g., temperature in °C, calendar dates). Note that binning exam scores into letter grades
(0 to 59: F, 60 to 70: D, ...) turns a numeric attribute into an ordinal one.
➢Ratio-scaled: has a true zero point, so ratios between values are meaningful
(e.g., length, weight, or a count such as 5 of 50 students earning an A, i.e., 10%)
➢Discrete attributes (values cannot be divided into parts) vs. continuous attributes
(values can be divided)
Attribute Types
String variable: letters and numbers

➢ Nominal: categories, states, or “names of things”
➢ Hair_color = {auburn, black, blond, brown, grey, red, white}
➢ marital status, occupation, ID numbers, zip codes
➢ Example table:

Hair_color  Gender  address
Black       Male
Brown       Female

➢ Binary
➢ Nominal attribute with only 2 states (0 and 1)
➢ Symmetric binary: both outcomes equally important
➢ e.g., gender
➢ Asymmetric binary: outcomes not equally important.
➢ e.g., medical test (positive vs. negative)
➢ Convention: assign 1 to most important outcome (e.g., HIV positive)
➢ Ordinal
➢ Values have a meaningful order (ranking) but magnitude (difference) between successive
values is not known
➢ Survey Responses: In a customer satisfaction survey, responses like "Very Unsatisfied," "Unsatisfied," "Neutral," "Satisfied,"
"Very Satisfied" represent ordinal data. You can rank these responses, but the magnitude of the difference between "Neutral" and
"Satisfied" is subjective and cannot be easily measured.
➢ Size = {small, medium, large}, grades, army rankings
Discrete vs. Continuous Attributes
➢Discrete Attribute
➢Has only a finite or countably infinite set of values (A whole number, cannot be
divided)
➢E.g., number of courses, number of siblings, zip codes, profession, or the set of
words in a collection of documents
➢Sometimes, represented as integer variables
➢Note: Binary attributes are a special case of discrete attributes
➢Continuous Attribute
➢Has real numbers as attribute values
➢E.g., temperature, height, or weight
➢Practically, real values can only be measured and represented using a finite number
of digits
➢Continuous attributes are typically represented as floating-point variables
Statistics of Data

➢Measuring the Central Tendency

➢Measuring the Dispersion of Data

➢Covariance and Correlation Analysis

➢Graphic Displays of Basic Statistics of Data


Basic Statistical Descriptions of Data
➢ Motivation
➢ To better understand the data: central tendency, variation and spread

When analysing data, it is crucial to have a clear
understanding of the central tendency, variation,
and spread of the data. These concepts help
summarize large datasets, identify patterns, and
make informed decisions based on data.

Central Tendency:

• Purpose: To determine the centre or average of the dataset, providing a representative
value of the data.
• Example: If we have exam scores like [50, 60, 70, 80, 90], the mean is 70, which gives us a
central idea of the overall performance.
Basic Statistical Descriptions of Data
➢ Data dispersion refers to how spread out or scattered the
data points are.
➢ Characteristics: median, max, min, quantiles, outliers,
variance, ...
➢ Numerical dimensions correspond to sorted intervals
➢ Data dispersion:
➢ Analyzed with multiple granularities of precision
➢ Granularity is the level of detail at which data is viewed
or analyzed. For example, you might look at data at a
coarse granularity (e.g., annual data) or at a fine
granularity (e.g., daily or hourly data).
➢ In the context of dispersion, different granularities can
help reveal patterns of variation in different time
periods, groups, or categories.
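The dispersion characteristics listed above can be sketched with Python's standard `statistics` module. The data reuses the exam scores [50, 60, 70, 80, 90] from the central-tendency example; the variable names are illustrative, and the quartile values depend on the interpolation method (the module's default is used here).

```python
import statistics

scores = [50, 60, 70, 80, 90]  # exam scores from the earlier example

value_range = max(scores) - min(scores)                 # max minus min
median = statistics.median(scores)                      # middle value
quartiles = statistics.quantiles(scores, n=4)           # three cut points: Q1, Q2, Q3
sample_variance = statistics.variance(scores)           # divides by n - 1
population_variance = statistics.pvariance(scores)      # divides by N

print("range:", value_range)
print("median:", median)
print("quartiles:", quartiles)
print("sample variance:", sample_variance)
print("population variance:", population_variance)
```

The two variance functions differ only in the divisor, which mirrors the distinction between sample size n and population size N used in the mean formulas.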
Measuring the Central Tendency: (1) Mean
➢Mean (algebraic measure)
Mean (average): the most common and effective numeric measure of the
center of an attribute. It is the sum of all data points divided by the
number of points. It gives an overall idea of the "average" value of the
dataset.
Note: n is sample size and N is population size.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Measuring the Central Tendency: (1) Mean
➢Mean (algebraic measure) Note: n is sample size and N is population
size.

➢Try it yourself:
Two dice were thrown 10 times. For each throw, their scores were added together
and recorded.
7, 5, 2, 7, 6, 12, 10, 4, 8, 9

Calculate the mean

Mean = (7 + 5 + 2 + 7 + 6 + 12 + 10 + 4 + 8 + 9) / 10 = 70 / 10 = 7
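The same calculation as a short Python check, using the ten dice totals from the exercise; the variable names are illustrative.

```python
# Totals of two dice over 10 throws, as given in the exercise.
throws = [7, 5, 2, 7, 6, 12, 10, 4, 8, 9]

# Mean = sum of all values divided by the number of values (n = 10).
mean = sum(throws) / len(throws)
print(mean)  # 7.0
```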
Measuring the Central Tendency: Weighted Mean
➢Weighted arithmetic mean:
➢Sometimes each value x_i in a set may be associated with a weight
w_i, for i = 1, ..., n:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

➢ The weights reflect the significance, importance, or occurrence frequency attached to their
respective values.
Measuring the Central Tendency: Weighted Mean
➢Weighted arithmetic mean:
Try it yourself
Find the weighted mean for the following data set: w = {2, 5, 6, 8, 9}, x = {4, 3, 7, 5, 6}
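One way to check the exercise is to apply the weighted-mean formula directly. This is a sketch with illustrative variable names; it pairs each weight with the value at the same position.

```python
weights = [2, 5, 6, 8, 9]
values = [4, 3, 7, 5, 6]

# Numerator: sum of w_i * x_i = 8 + 15 + 42 + 40 + 54 = 159
weighted_sum = sum(w * x for w, x in zip(weights, values))
# Denominator: sum of the weights = 30
total_weight = sum(weights)

weighted_mean = weighted_sum / total_weight
print(weighted_mean)  # 5.3
```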
Measuring the Central Tendency: Trimmed Mean
➢Trimmed mean:
➢The mean obtained after chopping off extreme values at the low and high ends, which reduces the influence of outliers
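A minimal sketch of a trimmed mean, assuming the same proportion of values is dropped from each end of the sorted data. The function name `trimmed_mean`, the `proportion` parameter, and the sample data are hypothetical and not taken from the lecture.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value (200).
data = [30, 31, 32, 33, 34, 35, 36, 37, 38, 200]

plain = sum(data) / len(data)      # pulled upward by the outlier
trimmed = trimmed_mean(data, 0.1)  # drops 30 and 200
print(plain, trimmed)  # 50.6 34.5
```

Trimming too large a proportion discards legitimate information, so small proportions are the usual choice.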
Measuring the Central Tendency: Trimmed Mean
➢Trimmed mean:
➢Try it Yourself
Measuring the Central Tendency: (2) Median
➢Median:
➢After sorting the values in the data set, the median is the middle value if the data set has an odd
number of values, or the average of the two middle values otherwise
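The odd and even cases of this definition can be sketched as follows. The function is a hand-written illustration (Python's `statistics.median` does the same job); the test data reuses the exam scores and dice totals from the mean slides.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([50, 60, 70, 80, 90])               # 5 values
even_case = median([7, 5, 2, 7, 6, 12, 10, 4, 8, 9])  # 10 values
print(odd_case, even_case)  # 70 7.0
```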
Measuring the Central Tendency: (2) Median
➢Median:
➢Try it yourself
Measuring the Central Tendency: Median of large dataset
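Calculate the mean and median household size (number of family members) for 30 families (an even number of observations).

[Frequency table not recovered; the slide annotations are summarized below.]
➢ Percent = (frequency / total frequency) × 100, e.g., (3/30) × 100 = 10.0 and (4/30) × 100 = 13.3
➢ Cumulative frequency is the running total of the frequencies: 3 + 4 (2 is included), 3 + 4 + 10 (2 and 3 are included), 3 + 4 + 10 + 4, ...
➢ Note: the last cumulative frequency is the total number of observations (30), and the last cumulative percent is 100%.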
Measuring the Central Tendency: Median of large dataset

The number of observations becomes odd (31).

[Slide annotation: 18 × 1]
Measuring the Central Tendency: Median of grouped data
For larger data sets, the median is expensive to compute, so it is approximated from grouped data.

➢ Median interval: the interval that contains the median; here its width = 60 − 40 = 20

➢ To get the median, compute the cumulative frequencies and find the middle position among all employees. In this
example there are 120 employees, and 120 / 2 = 60, so the middle position is the 60th employee. This employee
falls in the interval whose cumulative frequency is 77, so the median interval is $40 to $60.
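Once the median interval is known, a common way to approximate the median is linear interpolation inside that interval: lower boundary + ((N/2 − cumulative frequency before the interval) / frequency of the interval) × width. The sketch below assumes this interpolation rule. N = 120, the $40 to $60 interval, and the cumulative frequency of 77 come from the example; the split of 77 into 45 (before the interval) and 32 (inside it) is a hypothetical assumption, because the slide's table was not recovered.

```python
def grouped_median(lower, width, n, cum_before, freq):
    """Approximate the median by interpolating inside the median interval.

    lower      -- lower boundary of the median interval
    width      -- width of the median interval
    n          -- total number of observations
    cum_before -- cumulative frequency of all intervals below the median interval
    freq       -- frequency of the median interval itself
    """
    return lower + (n / 2 - cum_before) / freq * width


# N = 120 and the interval 40 to 60 are from the example.
# cum_before = 45 and freq = 32 are ASSUMED values chosen so that 45 + 32 = 77.
approx_median = grouped_median(lower=40, width=20, n=120, cum_before=45, freq=32)
print(approx_median)  # 49.375
```

With the real table, only `cum_before` and `freq` would change; the rest of the calculation stays the same.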
Measuring the Central Tendency: Median of grouped data
Try it yourself

The following data represents the survey regarding the heights (in cm) of 51 girls of Class x. Find the median height.
Thank You
