
Data Mining

Lecture 2
Data: Part 1
Mohammed Brahimi & Sami Belkacem

1
Outline

1. What is a Dataset?

2. Types of Datasets

3. Types of Attributes

4. Data Preprocessing

5. Similarity measures

2
1- What is a Dataset?

3
Definition of Dataset
● Dataset: a collection of objects and their attributes.

● Attribute: a property or characteristic of an object.
Also known as a variable, field, characteristic, dimension, or feature.

● Object: a collection of attributes.
Also known as a record, point, case, sample, entity, or instance.

4
Important Characteristics of Datasets
● Size: The type of analysis often depends on the size of the data.

● Dimensionality: High-dimensional data presents unique challenges.

● Sparsity: In sparse data, only the presence of a value matters; most values are zero or absent.

● Distribution: Considers centrality and dispersion in the data.

● Resolution: Extracted patterns can vary based on the scale of measurement.

5
2- Types of Datasets

6
Types of Datasets
● Record Data: records with fixed attributes
○ Relational records
○ Data matrix …
○ Transaction Data

● Graphs and Networks


○ Transportation network
○ Social or information networks…
○ Molecular Structures

● Ordered (Sequence) Data


○ Video: sequence of images
○ Genetic Sequence Data
○ Temporal sequence …

● Spatial Data
○ RGB Images
○ Satellite images

7
3- Types of Attributes

11
Types of Attributes
● Nominal (Unordered Categories)
○ Examples: Gender, eye color, types of fruit (e.g., apple, orange), etc.

● Ordinal (Ordered Categories)


○ Examples: Grades (A,B,C), height (tall, medium, short), swimming level (beginner ... advanced)

● Interval (Numerical, Equal Intervals, No True Zero)


○ Examples: Calendar dates, temperatures in Celsius or Fahrenheit

● Ratio (Numerical, Equal Intervals, True Zero)


○ Examples: Temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)

12
Types of Operations on Attributes
● Nominal
○ Distinctness ( =, ≠ )
● Ordinal
○ Distinctness ( =, ≠ )
○ Order ( <, > )
● Interval
○ Distinctness ( =, ≠ )
○ Order ( <, > )
○ Meaningful Differences ( +, - )
● Ratio
○ Distinctness ( =, ≠ )
○ Order ( <, > )
○ Meaningful Differences ( +, - )
○ Meaningful Ratios ( *, / )
13
Discrete vs. Continuous Attributes
● Discrete Attribute: Takes values from a finite or
countable set.
○ Examples: gender, eye color, swimming level.
● Typically represented as integers.
● Binary attributes: special case of discrete attributes.

● Continuous Attribute: Takes values within a


continuous range.
○ Examples: height, length, temperature.
● Typically represented as floating-point variables.
14
Asymmetric Attributes
● In asymmetric attributes, only the presence (non-zero value) matters.
● Examples:
■ Words present in documents: Focus on words that appear.
■ Items present in customer transactions: Emphasize purchased items.

● Real-Life Scenario:

In a grocery store with a very large number of products, we don't say:

"Our purchases are similar because we both didn't buy most of the same products."
Instead, we focus on the products that were actually bought.

15
4- Data Preprocessing

16
Major Tasks of Data Preprocessing
Data integration (covered in the Advanced Databases course)

● Integration of multiple databases, data cubes, or files

Data reduction (covered in the next chapter)

● Dimensionality reduction and data compression

Data cleaning

● Handle duplicates and missing values, identify/remove outliers, smooth noisy data, etc.

Data transformation

● Data sampling, encoding, discretization, normalization, etc.

17
Data Cleaning
● Poor data quality can negatively impact modeling efforts.
E.g., in bank loan prediction, poor data can lead to incorrect loan decisions:
– Some creditworthy candidates are denied loans
– More loans are given to individuals who are unlikely to be creditworthy

● What types of data quality issues exist, and how can we identify and handle them?

● Examples of data quality problems:


– Duplicate data
– Missing values
– Outliers
– Noise
18
Duplicate Data
● Occurrence of identical or nearly identical data objects.
● Common when merging data from diverse sources.
○ Example: Identical individuals with multiple email addresses.

How to handle duplicate data

● Remove duplicate data objects.


● In some scenarios, we need to keep and handle duplicates, e.g.:
○ Customers with multiple accounts may unintentionally accumulate points separately.
○ Keeping duplicate data ensures customers receive all earned benefits.

19
Missing Values

● Reasons for missing values

○ Information is not collected (e.g., people decline to give their age and weight)

○ Attributes may not be applicable to all cases (e.g., annual income not applicable to children)
20
How to Handle Missing Values?
● Delete Records: Drop records if there is enough data and few missing values

● Keep Missing Data: Keep values as NaN if, for example, the missing values are ≥ 60% of the observations

● Imputation-Based Techniques (see the sketch below):

○ Random value: fill in with a random value if introducing noise is acceptable

○ Average: fill in with the mean or the median for numerical data, and the mode for categorical data

○ Nearest neighbor: fill in with similar data points based on the nearest neighbors in the dataset

○ Heuristic-based: make a reasonable guess based on knowledge of the underlying domain

○ Model-based (interpolation): train a prediction model on the remaining data to predict the missing value
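
A minimal imputation sketch with pandas and scikit-learn; the toy DataFrame and column names are illustrative, not from the lecture:

import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with missing values (illustrative)
df = pd.DataFrame({
    "age":    [25, None, 40, 35],
    "income": [30000, 32000, 58000, None],
    "city":   ["Algiers", "Oran", None, "Algiers"],
})

# Average imputation: mean/median for numerical data, mode for categorical data
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Nearest-neighbor imputation for the numerical attributes
imputer = KNNImputer(n_neighbors=2)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])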

21
Outliers
● Data objects with characteristics significantly different
from the majority in the dataset.

● Case 1: Outliers as Noise


○ Outliers can be noise that disrupts data analysis.

● Case 2: Outliers as the Focus


○ In certain scenarios, outliers are the primary focus of analysis.
■ Credit card fraud detection
■ Intrusion detection

● Determining Causes:
○ Explore the reasons behind the presence of outliers.
22
How to Handle Outliers?

23
Noise
● Noise in Objects: Irrelevant elements
affecting data integrity.
● Noise in Attributes: Modification of
original attribute values.
● Examples:
○ Erroneous values caused by data entry errors
○ Distorted voice on a poor phone line.
○ "Snow" on a television screen.
○ Etc.
(Figure: noisy data in signal processing)
24
How to Handle Noise?

● Binning: Group data into bins and smooth it using means, medians, or defined boundaries (see the sketch after this list).

● Clustering: Apply clustering to separate out noise points that do not fit well within any cluster.

● Imputation techniques: average, nearest neighbor, heuristic, interpolation, etc.

● Semi-supervised method: Combine automated noise detection tools with human inspection
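
A minimal smoothing-by-bin-means sketch, assuming equal-frequency binning on toy values:

import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(values, q=3)                        # equal-frequency binning into 3 bins
smoothed = values.groupby(bins).transform("mean")  # replace each value by its bin mean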

Note: Incorporating noise into data can sometimes enhance the robustness of data mining models

by preventing overfitting, improving generalization, and fostering adaptability to real-world variations.

25
Data Transformation
Convert data into a format that is suitable for analysis.

● Main data transformation techniques:

○ Sampling: Select a subset of objects to represent a larger population.

○ Encoding: Convert categorical attributes into numerical formats.

○ Normalization: Scale attribute values to a standard range (e.g., 0 to 1).

○ Discretization: Convert continuous attributes into discrete categories.

26
Sampling
Select a subset of the dataset to make it more manageable for analysis

while maintaining its representativeness.

● We use sampling because using the entire dataset is:


○ Expensive: Collecting, storing, and processing vast amounts of data
○ Time-consuming: Analyzing the complete dataset can be impractical due to time constraints.

● Challenges:
○ Ensure the sample is representative of the population.
○ Address potential bias in the sampling process.

27
Sampling methods

28
Sampling methods
Simple Random Sampling
● Every item has an equal chance of being selected (could be with or without replacement)
Systematic Sampling
● Select individuals at regular intervals from a list or group.
Stratified Sampling
● Divide the population into groups (strata) based on a characteristic, then take a random sample from each group (see the sketch below).
Cluster Sampling
● Divide the population into clusters (often geographically), then entire clusters are randomly
selected for sampling.
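
A minimal sketch of simple random and stratified sampling with pandas; the toy data and column names are illustrative:

import pandas as pd

df = pd.DataFrame({"income": range(1000), "segment": ["A", "B"] * 500})

# Simple random sampling: 10% of the objects, without replacement
simple = df.sample(frac=0.1, random_state=0)

# Stratified sampling: 10% from each stratum (segment)
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=0)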
29
Encoding
Convert categorical variables into numerical format for data mining algorithms

Example: encoding the nominal attribute “Country”. One-hot encoding is suitable; label encoding is not.

30
Encoding methods
Label Encoding
● Converts categories into numerical labels.
● Each category gets a unique integer.
● Can create unintended ordinal relationships (e.g., France (0) < Spain (1))
● Suitable for ordinal attributes
One-Hot Encoding
● Creates a binary column for each category.
● Each category is represented by 1 in its column, 0 elsewhere.
● No ordinal relationships are implied.
● Suitable for nominal attributes
● Increases the dimensionality of the data, which can be a concern with many categories.
More advanced encoding techniques exist for nominal attributes, such as “word embeddings”.
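
A minimal sketch contrasting the two encodings with pandas; the toy attribute follows the “Country” example above:

import pandas as pd

df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# Label encoding: one integer per category (implies an order; suited to ordinal attributes)
df["Country_label"] = df["Country"].astype("category").cat.codes

# One-hot encoding: one binary column per category (suited to nominal attributes)
one_hot = pd.get_dummies(df["Country"], prefix="Country")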
31
Normalization
Scale numerical data to a standard range to ensure attributes contribute equally to the analysis

● Normalization prevents attributes with larger ranges from dominating those with smaller ranges
● Normalization is crucial for the convergence of many data mining algorithms.

Min-max normalization: maps v from [minA, maxA] to [new_minA, new_maxA]:

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Z-score normalization (μ: mean, σ: standard deviation):

v' = (v - μ) / σ
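
A minimal normalization sketch with numpy on toy values:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])

# Min-max normalization to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization
x_zscore = (x - x.mean()) / x.std()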

32
Normalization methods

● Min-max normalization:
○ Attributes will have the exact same scale.
○ Does not handle outliers well.

● Z-score normalization:
○ More robust to outliers.
○ Does not produce normalized data with
the exact same scale.
○ Still sensitive to extreme outliers.

34
Normalization methods
Min-Max Normalization is suitable when:

● We need the data to be scaled to a specific range (e.g. [0, 1])

● The data has no significant outliers

Z-score Normalization is suitable when:

● The data follows a normal distribution (or approximately normal)

● The chosen data mining algorithm is sensitive to the distribution of the data

35
Discretization
Transforming continuous data into discrete intervals (bins)

● A potentially infinite number of values are mapped to a small number of categories.

● The goal is to improve the quality and usability of data for analysis and modeling.
36
Discretization methods
Discretization methods can be classified in two categories:

1. Unsupervised Discretization
No class label is used during the discretization process.

2. Supervised Discretization
Class labels are used to guide the discretization, optimizing it for
classification tasks.

37
Unsupervised Discretization Methods
Equal Width

● Divides the data range into intervals of equal size.


● Simple but can create unbalanced bin counts.

Equal Frequency

● Bins (intervals) have the same number of data points.


● Ensures balanced binning.
● May result in varying interval sizes.

K-means

● Cluster data into k groups


● Assigns each group a representative value.
● Effective in identifying natural groupings but requires
specifying the number of bins.
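
A minimal discretization sketch with pandas and scikit-learn on toy ages, covering the three methods above:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.Series([22, 25, 27, 34, 35, 41, 53, 60, 61, 70])

equal_width = pd.cut(ages, bins=3)  # equal-width intervals
equal_freq = pd.qcut(ages, q=3)     # equal-frequency intervals

# K-means-based binning
km = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
age_bins = km.fit_transform(ages.to_frame())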

38
Supervised Discretization Methods

Top-down (Splitting)

● Starts with all data in one bin.


● Recursively splits bins to enhance class separability.

Bottom-up (Merging)

● Starts with each data point in its own bin.


● Merges bins by maximizing class purity.

39
5- Similarity and Dissimilarity Measures

40
Similarity and Dissimilarity Measures
Similarity between objects or attributes reveals valuable relationships in the data for
pattern recognition, clustering, and classification.
● Similarity Measure:
○ Quantifies data object likeness.
○ Higher values indicate greater similarity.
○ Typically within the range [0,1].

● Dissimilarity Measure:
○ Can also be referred to as Distance Measure.
○ Quantifies data object differences.
○ Lower values indicate greater similarity.
○ Often starts at 0, with a varying upper limit.

● Proximity:
○ Refers to either similarity or dissimilarity.

41
Similarity and Dissimilarity Measures
1. Properties of Similarity

2. Properties of Distance

3. Similarity and Distance matrix

4. Examples of Similarity and Distance measures

5. Similarity, Distance, and Attribute type


6. How to choose the Similarity/Distance measure?

42
Properties of Similarity
● Identity:
○ s(x, y) = 1 (or maximum similarity) only if x = y.
○ Note: This property may not always hold, e.g., cosine similarity.

● Symmetry:
○ s(x, y) = s(y, x) for all x and y.
○ Symmetry ensures that the order of comparison does not affect the similarity score.

These properties ensure that similarity measures are


reliable and consistent in data analysis.

43
Properties of Distance
● Non-Negativity:
○ d(x, y) ≥ 0 for all x and y.
○ d(x, y) = 0 if and only if x = y.

● Symmetry:
○ d(x, y) = d(y, x) for all x and y.

● Triangle Inequality:
○ d(x, z) ≤ d(x,y) + d(y, z) for all x, y, and z.

These properties ensure that distance measures are


reliable and consistent in data analysis.
44
Similarity and Distance matrix
Consider a dataset with 4 points in a 2D space: A(1,2), B(2,3), C(3,5), D(4,6)

We compute the Euclidean Distance Matrix and Cosine Similarity Matrix

(Figure: 4 × 4 Euclidean distance matrix and cosine similarity matrix over A, B, C, D)

Distance Matrix
○ Distances between all data objects
○ Useful for clustering and nearest neighbor algorithms
○ Symmetric, with values reflecting dissimilarities

Similarity Matrix
○ Similarities between all data objects
○ Useful for clustering and recommendation systems
○ Often symmetric; higher values indicate stronger similarities
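
A minimal sketch computing both matrices for the four points with scipy and scikit-learn:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

points = np.array([[1, 2], [2, 3], [3, 5], [4, 6]])  # A, B, C, D

dist_matrix = cdist(points, points, metric="euclidean")  # 4 x 4 Euclidean distance matrix
sim_matrix = cosine_similarity(points)                   # 4 x 4 cosine similarity matrix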
45
Examples of Similarity and Distance measures

● Measures for numerical vectors


○ Euclidean Distance
○ Minkowski Distance
○ Cosine Similarity
○ Linear correlation

● Measures for binary vectors


○ Simple Matching Coefficient (SMC)
○ Jaccard Coefficient

46
Euclidean Distance

d(x, y) = sqrt( Σ k=1..n (xk - yk)² )

● n: number of attributes.
● xk, yk: the kth attributes of objects x and y, respectively.

Standardization is necessary if scales differ.

47
Minkowski Distance

d(x, y) = ( Σ k=1..n |xk - yk|^r )^(1/r)

● Generalization of the Euclidean distance.
● r: parameter.
● n: number of attributes.
● xk and yk are, respectively, the kth attributes of objects x and y.
● The hyperparameter r allows adapting the distance to the characteristics of the data.

48
Special Cases of Minkowski Distance
● r = 1:
○ Called L1 norm or Manhattan distance.
○ Ideal for measuring distances in grid-like paths.
○ Binary vector example: Hamming distance counts differing bits.

● r = 2:
○ Called L2 norm or Euclidean distance.
○ The most commonly used distance metric.
○ Ideal for measuring the straight-line distance in Euclidean space.

● r → ∞:
○ Called Lmax norm or Chebyshev distance.
○ Calculates the maximum difference between any component of vectors.
○ Ideal when movement is unrestricted in any direction, e.g., the king's movement in chess.
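
A minimal sketch of the three special cases with numpy on toy vectors:

import numpy as np

x = np.array([0.0, 3.0, 4.0])
y = np.array([7.0, 6.0, 3.0])

l1   = np.sum(np.abs(x - y))        # r = 1: Manhattan distance -> 11.0
l2   = np.sqrt(np.sum((x - y)**2))  # r = 2: Euclidean distance -> sqrt(59)
linf = np.max(np.abs(x - y))        # r -> infinity: Chebyshev distance -> 7.0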

49
Cosine Similarity

cos(A, B) = (A · B) / (||A|| × ||B||)

● A · B is the dot product of the two vectors; ||A|| and ||B|| are their norms.
● The normalized dot product is the cosine of the angle between the two vectors.
● Insensitive to magnitudes, focusing on orientation.
● Values are between -1 and 1:
○ 1 (perfectly similar)
○ 0 (orthogonal, no similarity)
○ -1 (completely dissimilar)
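
A minimal cosine similarity sketch with numpy; the two toy vectors are parallel, illustrating insensitivity to magnitude:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same orientation as a, twice the magnitude

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # -> 1.0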
52
Linear correlation

● Measures the linear relationship between two variables.

● Evaluates how well one variable predicts another.

● Values are between -1 and 1:


○ 1 (perfect positive correlation)
○ 0 (zero correlation, i.e. no linear relationship)
○ -1 (perfect negative correlation)
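
A minimal Pearson correlation sketch with numpy; the toy numbers are illustrative:

import numpy as np

hours = np.array([1, 2, 3, 4, 5])      # study hours (toy data)
marks = np.array([8, 10, 11, 14, 15])  # exam marks (toy data)

r = np.corrcoef(hours, marks)[0, 1]    # Pearson correlation coefficient, close to 1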
53
Simple Matching Coefficient (SMC)
● The number of matches divided by the total number of attributes.
● It is designed for symmetric binary attributes.

SMC = (f11 + f00) / (f01 + f10 + f00 + f11)

● f01 = the number of attributes where x was 0 and y was 1


● f10 = the number of attributes where x was 1 and y was 0
● f00 = the number of attributes where x was 0 and y was 0
● f11 = the number of attributes where x was 1 and y was 1

Example: Two persons represented by binary vectors, x=[0,0,1], y=[0,1,1]


Each element represents a symmetric attribute: marital status, smoking status, pet ownership.
SMC = 2/3 = 0.667
55
Jaccard Coefficient (J)
● The ratio of shared 1 values to the total number of 1 values across both sets.
● It is designed for asymmetric binary attributes.

J = f11 / (f01 + f10 + f11)

● f01 = the number of attributes where x was 0 and y was 1


● f10 = the number of attributes where x was 1 and y was 0
● f00 = the number of attributes where x was 0 and y was 0
● f11 = the number of attributes where x was 1 and y was 1

Example: Two persons’ buying in a market represented by binary vectors, x=[0,0,1], y=[0,1,1]
Each element represents an asymmetric attribute: whether a person bought an item in a market.
J = 1/2 = 0.5
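
A minimal sketch computing SMC and Jaccard for the binary vectors above:

import numpy as np

x = np.array([0, 0, 1])
y = np.array([0, 1, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f01 + f10 + f00 + f11)  # -> 0.667
jaccard = f11 / (f01 + f10 + f11)            # -> 0.5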
56
Similarity, Distance, and Attribute type

Similarity/Distance between two objects, x and y, with only one attribute, depends on the attribute type:
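
The per-type table itself is not recoverable from this copy. As a hedged sketch, assuming the common textbook definitions (equality test for nominal, normalized rank difference for ordinal, absolute difference for interval/ratio):

# A sketch of single-attribute dissimilarity by attribute type,
# assuming standard textbook definitions rather than the slide's own table.
def dissimilarity(x, y, attr_type, n_levels=None):
    if attr_type == "nominal":
        return 0 if x == y else 1           # match / mismatch
    if attr_type == "ordinal":
        return abs(x - y) / (n_levels - 1)  # ranks mapped to [0, 1]
    if attr_type in ("interval", "ratio"):
        return abs(x - y)                   # absolute difference
    raise ValueError(f"unknown attribute type: {attr_type}")

# Similarity can then be derived, e.g. s = 1 - d when d lies in [0, 1].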

57
How to Choose the Similarity/Distance Measure?
The choice of the right measure depends on the domain:

● Comparing two documents using word presence


○ Proximity Measure: Jaccard Coefficient
○ Similarity: Two documents are similar if they share a high number of common words.

● Comparing geographical locations of two cities


○ Proximity Measure: Euclidean Distance
○ Similarity: Two city locations are similar if they are close to each other by distance

● Comparing two time series of temperature (Celsius)


○ Proximity Measure: Cosine Similarity
○ Similarity: Two time series are similar if their “pattern” is similar (they vary the same way over time)

● Measuring the relationship between study hours and exam marks


○ Proximity Measure: Linear Correlation
○ Similarity: A higher correlation coefficient indicates a stronger relationship, suggesting that as study
hours increase, exam marks tend to increase as well.
58
