Lecture 2
Data: Part 1
Mohammed Brahimi & Sami Belkacem
Outline
1. What is a Dataset?
2. Types of Datasets
3. Types of Attributes
4. Data Preprocessing
5. Similarity and Dissimilarity Measures
1- What is a Dataset?
Definition of Dataset
● Dataset: a collection of objects and their attributes.
● Object: a collection of attributes; also known as a record, point, case, sample, entity, or instance.
[Figure: example dataset with objects as rows and attributes as columns]
Important Characteristics of Datasets
● Size: The type of analysis often depends on the size of the data.
2- Types of Datasets
Types of Datasets
● Record Data: records with fixed attributes
○ Relational records
○ Data matrix …
○ Transaction Data
● Spatial Data
○ RGB Images
○ Satellite images
3- Types of Attributes
Types of Attributes
● Nominal (Unordered Categories)
○ Examples: Gender, eye color, types of fruit (e.g., apple, orange), etc.
Types of Operations on Attributes
● Nominal
○ Distinctness ( =, ≠ )
● Ordinal
○ Distinctness ( =, ≠ )
○ Order ( <, > )
● Interval
○ Distinctness ( =, ≠ )
○ Order ( <, > )
○ Meaningful Differences ( +, - )
● Ratio
○ Distinctness ( =, ≠ )
○ Order ( <, > )
○ Meaningful Differences ( +, - )
○ Meaningful Ratios ( *, / )
Discrete vs. Continuous Attributes
● Discrete Attribute: takes values from a finite or countable set.
○ Examples: gender, eye color, swimming level.
○ Typically represented as integer variables.
○ Binary attributes are a special case of discrete attributes.
● Continuous Attribute: takes real-number values.
○ Examples: temperature, height, weight.
○ Typically represented as floating-point variables.
4- Data Preprocessing
Major Tasks of Data Preprocessing
Data integration (covered in the Advanced Databases course)
Data cleaning
● Handle duplicates and missing values, identify/remove outliers, smooth noisy data, etc.
Data transformation
● Sampling, encoding, normalization, discretization, etc.
Data Cleaning
● Poor data quality can negatively impact modeling efforts.
○ E.g., in bank loan prediction, poor data can lead to incorrect loan decisions:
– Some creditworthy candidates are denied loans.
– More loans are given to individuals who are unlikely to be creditworthy.
● What types of data quality issues exist, and how can we identify and handle them?
Missing Values
● Reasons for missing values:
○ Information is not collected (e.g., people decline to give their age and weight).
○ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
How to Handle Missing Values?
● Delete records: drop records if there is enough data and few missing values.
● Keep missing data: keep values as NaN if, for example, missing values make up ≥ 60% of the observations.
● Imputation-based techniques (see the sketch below):
○ Average: fill in with the mean or median for numerical data, and the mode for categorical data.
○ Nearest neighbor: fill in using the most similar data points (nearest neighbors) in the dataset.
○ Interpolation: train a prediction model on the dataset to predict the missing value.
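As an illustration, a minimal sketch of average imputation in Python with pandas (the dataset and column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, None, 40, 35, None],
    "income": [30000, 45000, None, 52000, 41000],
    "city":   ["Algiers", "Oran", None, "Algiers", "Oran"],
})

# Average imputation: median/mean for numerical data, mode for categorical data
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

For nearest-neighbor imputation, scikit-learn's KNNImputer offers a ready-made implementation.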
Outliers
● Data objects with characteristics significantly different from the majority in the dataset.
● Determining Causes:
○ Explore the reasons behind the presence of outliers.
How to Handle Outliers?
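As one common way to identify outliers in a numerical attribute, a minimal sketch using the interquartile range (IQR) rule (the data is hypothetical); flagged points can then be removed, capped, or investigated:

```python
import pandas as pd

x = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # only 102 is flagged
```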
Noise
● Noise in Objects: irrelevant elements affecting data integrity.
● Noise in Attributes: modification of original attribute values.
● Examples:
○ Erroneous values caused by data entry errors.
○ Distorted voice on a poor phone line.
○ "Snow" on a television screen.
○ Etc.
[Figure: noisy data in signal processing]
How to Handle Noise?
● Binning: Group data into bins and smooth it using means, medians, or defined boundaries.
● Clustering: Apply clustering to separate out noise points that do not fit well within any cluster.
● Semi-supervised method: Combine automated noise detection tools with human inspection
Note: Incorporating noise into data can sometimes enhance the robustness of data mining models
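As an illustration of smoothing by binning, a minimal sketch in Python with pandas (the values are hypothetical):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning into 3 bins, then smoothing by bin means
bins = pd.qcut(values, q=3, labels=False)
smoothed = values.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```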
Data Transformation
Convert data into a format that is suitable for analysis.
Sampling
Select a subset of the dataset to make it more manageable for analysis
● Challenges:
○ Ensure the sample is representative of the population.
○ Address potential bias in the sampling process.
Sampling methods
Simple Random Sampling
● Every item has an equal chance of being selected (could be with or without replacement)
Systematic Sampling
● Select individuals at regular intervals from a list or group.
Stratified Sampling
● Divide the population into groups (strata) based on a characteristic, then take random samples from each group.
Cluster Sampling
● Divide the population into clusters (often geographic), then randomly select entire clusters for sampling.
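A minimal sketch of the first three methods in Python with pandas (the dataset and the "stratum" column are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"stratum": ["A"] * 80 + ["B"] * 20,
                   "value": range(100)})

# Simple random sampling: 10 records, without replacement
simple = df.sample(n=10, random_state=0)

# Systematic sampling: every 10th record
systematic = df.iloc[::10]

# Stratified sampling: 10% from each stratum
stratified = df.groupby("stratum").sample(frac=0.1, random_state=0)
```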
Encoding
Convert categorical variables into numerical format for data mining algorithms
[Figure: one-hot encoding (suitable) vs. label encoding (not suitable) for a nominal attribute]
Encoding methods
Label Encoding
● Converts categories into numerical labels.
● Each category gets a unique integer.
● Can create ordinal relationships, even if not intended (e.g. France (0) < Spain (1))
● Suitable for ordinal attributes
One-Hot Encoding
● Creates a binary column for each category.
● Each category is represented by 1 in its column, 0 elsewhere.
● No ordinal relationships are implied.
● Suitable for nominal attributes
● Increases the dimensionality of the data, which can be a concern with many categories.
More advanced encoding techniques exist for nominal attributes, such as “word embeddings”.
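A minimal sketch of both encodings in Python with pandas (the column and categories are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "Spain", "France", "Germany"]})

# Label encoding: one integer per category; codes are assigned
# alphabetically here, implying France (0) < Germany (1) < Spain (2)
df["country_label"] = df["country"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no order implied
one_hot = pd.get_dummies(df["country"], prefix="country")
```

scikit-learn's LabelEncoder and OneHotEncoder provide the same transformations for use in pipelines.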
Normalization
Scale numerical data to a standard range to ensure attributes contribute equally to the analysis
● Normalization prevents attributes with larger ranges from dominating those with smaller ranges.
● Normalization is crucial for the convergence of many data mining algorithms.
Normalization methods
● Min-max normalization:
○ Attributes will have the exact same scale.
○ Does not handle outliers well.
● Z-score normalization:
○ More robust to outliers.
○ Does not produce normalized data with the exact same scale.
○ Still sensitive to extreme outliers.
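A minimal sketch of both methods in Python with pandas (the values are hypothetical; note how the outlier squeezes the min-max result):

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 500.0])  # 500 is an outlier

# Min-max normalization: rescale to the exact range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (x - x.mean()) / x.std()
```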
Normalization methods
Min-Max Normalization is suitable when:
● The chosen data mining algorithm is sensitive to the distribution of the data
Discretization
Transforming continuous data into discrete intervals (bins)
● The goal is to improve the quality and usability of data for analysis and modeling.
Discretization methods
Discretization methods can be classified into two categories:
1. Unsupervised Discretization
No class label is used during the discretization process.
2. Supervised Discretization
Class labels are used to guide the discretization, optimizing it for classification tasks.
Unsupervised Discretization Methods
Equal Width
Equal Frequency
K-means
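A minimal sketch of these three methods in Python (the ages are hypothetical); scikit-learn's KBinsDiscretizer covers all three via its strategy parameter:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [25], [27], [31], [35], [41], [52], [58], [64]])

# "uniform" -> equal width, "quantile" -> equal frequency, "kmeans" -> k-means
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(ages).ravel())
```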
Supervised Discretization Methods
Top-down (Splitting)
Bottom-up (Merging)
5- Similarity and Dissimilarity Measures
Similarity and Dissimilarity Measures
Similarity between objects or attributes reveals valuable data relationships for pattern recognition, clustering, and classification.
● Similarity Measure:
○ Quantifies data object likeness.
○ Higher values indicate greater similarity.
○ Typically within the range [0,1].
● Dissimilarity Measure:
○ Can also be referred to as Distance Measure.
○ Quantifies data object differences.
○ Lower values indicate greater similarity.
○ Often starts at 0 and varies in the upper limit.
● Proximity:
○ Refers to either similarity or dissimilarity.
Similarity and Dissimilarity Measures
1. Properties of Similarity
2. Properties of Distance
Properties of Similarity
● Identity:
○ s(x, y) = 1 (or maximum similarity) only if x = y.
○ Note: This property may not always hold, e.g., cosine similarity.
● Symmetry:
○ s(x, y) = s(y, x) for all x and y.
○ Symmetry ensures that the order of comparison does not affect the similarity score.
Properties of Distance
● Non-Negativity:
○ d(x, y) ≥ 0 for all x and y.
○ d(x, y) = 0 if and only if x = y.
● Symmetry:
○ d(x, y) = d(y, x) for all x and y.
● Triangle Inequality:
○ d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z.
Euclidean Distance
d(x, y) = √( Σₖ₌₁ⁿ (xₖ − yₖ)² )
● n: the number of attributes.
● xₖ, yₖ: the kth attributes of objects x and y, respectively.
Minkowski Distance
d(x, y) = ( Σₖ₌₁ⁿ |xₖ − yₖ|ʳ )^(1/r), where the parameter r selects the distance.
Special Cases of Minkowski Distance
● r = 1:
○ Called L1 norm or Manhattan distance.
○ Ideal for measuring distances in grid-like paths.
○ Binary vector example: Hamming distance counts differing bits.
● r = 2:
○ Called L2 norm or Euclidean distance.
○ The most commonly used distance metric.
○ Ideal for measuring the straight-line distance in Euclidean space.
● r → ∞:
○ Called Lmax norm or Chebyshev distance.
○ Calculates the maximum difference between any component of vectors.
○ Ideal when movement is unrestricted in any direction, e.g. king movement in chess
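A minimal sketch of the three special cases in Python with NumPy (the vectors are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

l1   = np.sum(np.abs(x - y))          # Manhattan: 3 + 2 + 0 = 5
l2   = np.sqrt(np.sum((x - y) ** 2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.61
linf = np.max(np.abs(x - y))          # Chebyshev: max(3, 2, 0) = 3
```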
Cosine Similarity
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
● x · y: the dot product of vectors x and y.
● ‖x‖: the length (Euclidean norm) of vector x.
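A minimal sketch in Python with NumPy (the vectors are hypothetical):

```python
import numpy as np

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 0.0])

# cos(x, y) = (x . y) / (||x|| ||y||)
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # 3 / (6.16 * 1) ≈ 0.49
```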
Simple Matching Coefficient (SMC) and Jaccard Coefficient
● SMC: the number of matches divided by the total number of attributes; designed for symmetric binary attributes.
● For asymmetric binary attributes, the Jaccard coefficient (J) is used instead: the number of 1-1 matches divided by the number of attributes where at least one object has a 1.
Example: two persons' purchases in a market represented by binary vectors x = [0, 0, 1] and y = [0, 1, 1].
Each element represents an asymmetric attribute: whether a person bought an item in the market.
J = 1/2 = 0.5 (one 1-1 match out of the two positions that are non-zero in at least one vector), while SMC = 2/3.
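A minimal sketch computing both coefficients for the example vectors:

```python
import numpy as np

x = np.array([0, 0, 1])
y = np.array([0, 1, 1])

m11 = np.sum((x == 1) & (y == 1))  # both bought the item: 1
m00 = np.sum((x == 0) & (y == 0))  # neither bought the item: 1
n = len(x)

smc = (m11 + m00) / n        # (1 + 1) / 3 ≈ 0.67
jaccard = m11 / (n - m00)    # 1 / 2 = 0.5
```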
Similarity, Distance, and Attribute Type
How to Choose the Similarity/Distance Measure?
The choice of the right measure depends on the domain. For example, cosine similarity suits sparse document vectors, while Euclidean distance suits dense numerical data.