02 - ML - Data Presentation-24-03-09
Machine Learning
Data Presentation & Data Transformation
Data presentation
o Numeric feature
o Categorical feature
Feature engineering
Data transformation
Data Representation
Data in Machine Learning
Type of data
o Structured data
  • Numeric
  • Categorical
o Unstructured data
o Semi-structured data
Type of data: structured
• Numeric variable
  • Continuous
    • Floating-point numbers (real numbers)
    • E.g., a flower's sepal width and height (in centimeters)
  • Discrete
    • Interval values
    • E.g., flowers cannot have 10.14 petals, though they might have 9, 10, or 11.
Data transformation
o Structured data
o Numeric to Categorical (binning)
o Categorical to Numeric (encoding)
o Unstructured to structured data
o Text
o Bag of Words
o TF-IDF
o …
Categorical to numeric
Customers
ID  Gender  Education Level  Job Title
1   Female  Bachelor         Sales Representative
2   Male    High school      Customer Service Associate
3   Male    PhD              Software Engineer
4   Female  Master           Marketing Manager
…   …       …                …

Ordinal encoder (Categorical → Numeric, ordinal):
Education Level  Encoded value
High school      0
Bachelor         1
Master           2
PhD              3

Assume categories have an inherent order (higher education implies higher knowledge).
Note:
o Applicable when categories have a natural order (e.g., customer satisfaction levels).
o Assigns numerical values that reflect the order (e.g., satisfied = 1, neutral = 2, dissatisfied = 3).
o Advantage: Captures the order information present in the data.
o Disadvantage: Only suitable for ordinal data, not nominal data (with no inherent order).
Categorical to numeric
Customers
ID  Gender  Education Level  Job Title
1   Female  Bachelor         Sales Representative
2   Male    High school      Customer Service Associate
3   Male    PhD              Software Engineer

Label encoder (Categorical → Numeric, nominal):
Gender  Encoded value
Female  0
Male    1
Note:
o Simplest approach, assigns a unique integer to each category.
o Advantage: Easy to implement and understand.
o Disadvantage: Assumes an inherent order between categories, which might not be true (e.g., assigning "0" to Female and "1" to Male doesn't imply Female is "better").
Categorical to numeric
One-Hot Encoding (Categorical → Numeric, nominal):

Job Title                    Software  Marketing  Sales           Customer Service
                             Engineer  Manager    Representative  Associate
Software Engineer            1         0          0               0
Marketing Manager            0         1          0               0
Sales Representative         0         0          1               0
Customer Service Associate   0         0          0               1
Note:
o Creates a new binary feature for each category.
o Advantage: Preserves the categorical nature of the data and avoids imposing order.
o Disadvantage: Can lead to a high number of features if there are many categories,
increasing computational cost.
Categorical to numeric
Frequency Encoding (Categorical → Numeric, nominal)
Note:
o Assigns a value to each category based on its frequency / normalized frequency in the data.
o More frequent categories get higher values.
o Advantage: Can be useful for capturing the importance of categories based on their prevalence.
o Disadvantage: May not be suitable for all machine learning models.
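Frequency encoding with normalized frequencies can be sketched as follows (toy Gender column, not from the slide's table):

```python
from collections import Counter

# Frequency encoding: replace each category with its normalized frequency.
column = ["Male", "Male", "Female", "Male"]
counts = Counter(column)
n = len(column)
encoded = [counts[c] / n for c in column]
print(encoded)  # [0.75, 0.75, 0.25, 0.75]
```

One caveat: two different categories with the same frequency collapse to the same numeric value.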
Numeric to Categorical
o Limitations:
• Loss of information: Grouping continuous data into categories leads to
some loss of information about the original data points.
• Arbitrary choices: The choice of bin boundaries can be subjective and
can significantly impact the results, especially with a small number of bins.
• Potential for bias: If not done carefully, binning can introduce bias into
the data, leading to misleading interpretations or unfair outcomes.
Numeric to Categorical
o Some techniques: equal-width binning, percentile-based binning, K-means binning.
Numeric to Categorical
Scenario: Imagine you have data on customer income (continuous
numerical values) and want to group them into categories for analysis.
Data: (Example income values:
[10 000, 25 000, 38 000, 52 000, 68 000, 85 000, 102 000, 120 000])
Equal-Width Binning
• Define the number of bins: Let's choose 3 bins for simplicity.
• Calculate the bin width: Divide the total income range (highest - lowest)
by the number of bins: (120,000 - 10,000) / 3 = 36,666.67 (rounded to
nearest whole number: 36,667)
• Create bins of equal width starting from the minimum ($10,000):
• Bin 1: $10,000 - $46,667
• Bin 2: $46,667 - $83,333
• Bin 3: $83,333 - $120,000
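The equal-width procedure can be sketched directly on the slide's income values; bin edges are the data minimum plus multiples of the width:

```python
# Equal-width binning of the slide's income values into 3 bins.
incomes = [10_000, 25_000, 38_000, 52_000, 68_000, 85_000, 102_000, 120_000]
n_bins = 3
low, high = min(incomes), max(incomes)
width = (high - low) / n_bins  # about 36,667

def bin_index(x):
    i = int((x - low) / width)
    return min(i, n_bins - 1)  # the maximum value belongs to the last bin

bins = [bin_index(x) for x in incomes]
print(bins)  # [0, 0, 0, 1, 1, 2, 2, 2]
```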
Numeric to Categorical
Scenario: Imagine you have data on customer income (continuous
numerical values) and want to group them into categories for analysis.
Data: (Example income values:
[10 000, 25 000, 38 000, 52 000, 68 000, 85 000, 102 000, 120 000])
Percentile-Based Binning
o Calculate percentiles: We'll use quartiles (dividing into 4 equal groups). You can choose
other percentiles based on your needs.
• Q1 (25th percentile): $25,000
• Q2 (50th percentile, median): $52,000
• Q3 (75th percentile): $85,000
• Create bins:
• Bin 1: < $25,000
• Bin 2: $25,000 - $52,000
• Bin 3: $52,000 - $85,000
• Bin 4: > $85,000 (Note: This approach creates 4 bins, but you can merge the last two
if desired for 3 bins)
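The quartiles quoted above match the nearest-rank percentile method (the value at position ceil(p · n) in the sorted data); a sketch under that assumption:

```python
import math

# Percentile-based binning using the nearest-rank method, which
# reproduces the slide's quartiles: 25,000 / 52,000 / 85,000.
incomes = sorted([10_000, 25_000, 38_000, 52_000, 68_000, 85_000, 102_000, 120_000])

def percentile(data, p):
    # Nearest-rank: the value at position ceil(p * n) in the sorted data.
    k = math.ceil(p * len(data))
    return data[k - 1]

q1 = percentile(incomes, 0.25)  # 25,000
q2 = percentile(incomes, 0.50)  # 52,000
q3 = percentile(incomes, 0.75)  # 85,000

def bin_of(x):
    if x < q1:
        return 1
    if x < q2:
        return 2
    if x < q3:
        return 3
    return 4

bins = [bin_of(x) for x in incomes]
print(bins)  # [1, 2, 2, 3, 3, 4, 4, 4]
```

Other percentile definitions (e.g., linear interpolation) give slightly different quartiles and therefore different bin edges.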
Numeric to Categorical
Comparison:
o These three techniques result in different binning
structures:
Equal-width: Creates bins of equal size, but might not
capture the underlying distribution if the data is skewed.
Percentile-based: Captures the distribution better, but
the bin sizes might not be equal.
K-Means: Creates data-driven bins based on natural
groupings, but requires choosing the number of clusters
(k) and might be more complex to interpret.
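The K-means option in the comparison can be sketched as a minimal 1-D Lloyd's algorithm over the same income data (the initialization strategy here is an assumption, not from the slides); each resulting cluster is one bin:

```python
# 1-D K-means binning (Lloyd's algorithm): bins follow natural groupings
# in the data rather than fixed widths or percentiles.
incomes = [10_000, 25_000, 38_000, 52_000, 68_000, 85_000, 102_000, 120_000]

def kmeans_1d(data, k, iters=100):
    data = sorted(data)
    # Initialize centroids spread evenly across the sorted data.
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d(incomes, k=3)
print(clusters)
# [[10000, 25000, 38000], [52000, 68000, 85000], [102000, 120000]]
```

Note the bins differ from both equal-width and quartile binning: the two highest incomes form their own group because they sit far from the rest.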
Text (unstructured) to vector (structured)
o Some techniques:
Bag-of-words
TF-IDF (Term Frequency - Inverse Document Frequency)
Word2Vec
Doc2Vec
FastText
Other: GloVe, BERT, ELMo, …
o Notes:
The choice depends on the intended use, dataset size, and
computing resources.
BoW and TF-IDF are simple and suitable for small datasets.
Word2Vec, Doc2Vec, and FastText are more efficient but
more complex, suitable for large datasets.
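The two simple techniques, bag-of-words and TF-IDF, can be sketched over a toy corpus without any libraries (the documents and the plain log-based IDF formula are illustrative assumptions):

```python
import math

# Bag-of-words and TF-IDF over a toy three-document corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})
# vocab: ['cat', 'dog', 'ran', 'sat', 'the']

def bow(doc):
    # Bag-of-words: raw count of each vocabulary term in the document.
    words = doc.split()
    return [words.count(t) for t in vocab]

def tfidf(doc):
    # TF-IDF: term frequency scaled by inverse document frequency.
    words = doc.split()
    vec = []
    for t in vocab:
        tf = words.count(t) / len(words)
        df = sum(1 for d in docs if t in d.split())
        idf = math.log(len(docs) / df)
        vec.append(tf * idf)
    return vec

print(bow("the cat sat"))  # [1, 0, 0, 1, 1]
# 'the' occurs in every document, so its TF-IDF weight is 0:
print(tfidf("the cat sat")[vocab.index("the")])  # 0.0
```

Both representations produce one fixed-length vector per document, which is what makes unstructured text usable as structured input for a model.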