
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Data Presentation
& Data Transformation

Vuong Trong Nhan ([email protected])


Outline

• Data presentation
  o Numeric features
  o Categorical features
• Feature engineering
• Data transformation

Data Representation

Data in Machine Learning

𝑥ᵢ: input vector, independent variables

𝑦: response variable, dependent variable

o 𝑦 ∈ {−1, 1} or {0, 1}: binary classification
o 𝑦 ∈ {1, 2, …, K}: multi-class classification
o 𝑦 ∈ ℝ: regression

Goal: predict 𝑦 for a newly observed 𝑥

Type of data

Data can be grouped into three broad types:
o Structured data: numeric or categorical
o Semi-structured data
o Unstructured data
Type of data: structured

• Numeric variables
  • Continuous
    • Floating-point numbers (real numbers)
    • E.g., a flower's sepal width and height (in centimeters)
  • Discrete
    • Integer values
    • E.g., flowers cannot have 10.14 petals, though they might have 9, 10, or 11.
Type of data: structured

o Categorical variables: discrete or qualitative variables
  o Nominal
    o Two or more categories with no intrinsic order
    o E.g., eye color (brown, blue, green, etc.)
  o Ordinal
    o Two or more categories that can be ordered or ranked
    o E.g., education level
      (High School, Bachelor's Degree, Master's Degree, Doctorate)
Data transformation

o Structured data
  o Numeric to Categorical (binning)
  o Categorical to Numeric (encoding)
o Unstructured to structured data
  o Text
    o Bag of Words
    o TF-IDF
    o …

Categorical to numeric
Customers:

  ID | Gender | Education Level | Job Title
  1  | Female | Bachelor        | Sales Representative
  2  | Male   | High school     | Customer Service Associate
  3  | Male   | PhD             | Software Engineer
  4  | Female | Master          | Marketing Manager
  …  | …      | …               | …

Ordinal encoder: categorical (ordinal) → numeric

  Education Level | Encoded value
  High school     | 0
  Bachelor        | 1
  Master          | 2
  PhD             | 3

Assumes the categories have an inherent order (higher education implies higher knowledge).

Note:
o Applicable when categories have a natural order (e.g., customer satisfaction levels).
o Assigns numerical values that reflect the order (e.g., satisfied = 1, neutral = 2, dissatisfied = 3).
o Advantage: captures the order information present in the data.
o Disadvantage: only suitable for ordinal data, not nominal data (with no inherent order).
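A minimal sketch of this mapping, assuming scikit-learn is available (the values are the hypothetical ones from the table above):

```python
# Minimal sketch: ordinal encoding with scikit-learn (toy data from the table).
from sklearn.preprocessing import OrdinalEncoder

education = [["Bachelor"], ["High school"], ["PhD"], ["Master"]]
# Passing the category order explicitly makes the codes reflect the ranking.
encoder = OrdinalEncoder(categories=[["High school", "Bachelor", "Master", "PhD"]])
print(encoder.fit_transform(education))  # [[1.], [0.], [3.], [2.]]
```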
Categorical to numeric
Customers:

  ID | Gender | Education Level | Job Title
  1  | Female | Bachelor        | Sales Representative
  2  | Male   | High school     | Customer Service Associate
  3  | Male   | PhD             | Software Engineer
  4  | Female | Master          | Marketing Manager
  …  | …      | …               | …

Label encoder: categorical (nominal) → numeric

  Gender | Encoded value
  Female | 0
  Male   | 1

Note:
o Simplest approach: assigns a unique integer to each category.
o Advantage: easy to implement and understand.
o Disadvantage: implies an inherent order between categories, which might not be true (e.g., assigning "0" to Female and "1" to Male doesn't imply Female is "better").
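A minimal sketch with scikit-learn's LabelEncoder, using the Gender column from the table above:

```python
# Minimal sketch: label encoding with scikit-learn (toy data from the table).
from sklearn.preprocessing import LabelEncoder

gender = ["Female", "Male", "Male", "Female"]
encoder = LabelEncoder()  # classes are sorted alphabetically: Female=0, Male=1
print(encoder.fit_transform(gender))  # [0 1 1 0]
```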
Categorical to numeric
One-hot encoding: categorical (nominal) → numeric

  Job Title                  | Software Engineer | Marketing Manager | Sales Representative | Customer Service Associate
  Software Engineer          | 1 | 0 | 0 | 0
  Marketing Manager          | 0 | 1 | 0 | 0
  Sales Representative       | 0 | 0 | 1 | 0
  Customer Service Associate | 0 | 0 | 0 | 1

Note:
o Creates a new binary feature for each category.
o Advantage: preserves the categorical nature of the data and avoids imposing order.
o Disadvantage: can lead to a high number of features if there are many categories, increasing computational cost.
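A minimal sketch of the same table, assuming pandas is available:

```python
# Minimal sketch: one-hot encoding with pandas (toy data from the table).
import pandas as pd

jobs = pd.Series(["Software Engineer", "Marketing Manager",
                  "Sales Representative", "Customer Service Associate"])
# get_dummies creates one binary column per distinct job title.
print(pd.get_dummies(jobs))
```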
Categorical to numeric
Frequency encoding: categorical (nominal) → numeric

  Job Title                  | Frequency | Normalized frequency in [0, 1]
  Software Engineer          | 5         | 5/11 ≈ 0.45
  Marketing Manager          | 3         | 3/11 ≈ 0.27
  Sales Representative       | 2         | 2/11 ≈ 0.18
  Customer Service Associate | 1         | 1/11 ≈ 0.09

  Total = 5 + 3 + 2 + 1 = 11

Note:
o Assigns a value to each category based on its frequency / normalized frequency in the data.
o More frequent categories get higher values.
o Advantage: can be useful for capturing the importance of categories based on their prevalence.
o Disadvantage: may not be suitable for all machine learning models.
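A minimal pandas sketch that reproduces the normalized frequencies above (the category counts match the table):

```python
# Minimal sketch: frequency encoding with pandas (counts match the table).
import pandas as pd

jobs = pd.Series(["Software Engineer"] * 5 + ["Marketing Manager"] * 3 +
                 ["Sales Representative"] * 2 + ["Customer Service Associate"])
freq = jobs.value_counts(normalize=True)  # e.g. Software Engineer -> 5/11 = 0.45
print(jobs.map(freq))  # replace each category with its normalized frequency
```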
Numeric to Categorical

o Converting numerical data into categorical data, also known as binning or discretization, involves grouping continuous numerical values into distinct categories.
o Common scenarios where binning is used:
  • Exploratory data analysis (EDA)
  • Feature engineering for machine learning
  • Discretization for specific tasks
o Limitations:
  • Loss of information: grouping continuous data into categories leads to some loss of information about the original data points.
  • Arbitrary choices: the choice of bin boundaries can be subjective and can significantly impact the results, especially with a small number of bins.
  • Potential for bias: if not done carefully, binning can introduce bias into the data, leading to misleading interpretations or unfair outcomes.
Numeric to Categorical

o Example: binning income into five ranges (specific techniques follow on the next slides):

  Bin | Income Range      | Description
  1   | < $20,000         | Low Income
  2   | $20,000 - $39,999 | Lower-Middle Income
  3   | $40,000 - $59,999 | Middle Income
  4   | $60,000 - $79,999 | Upper-Middle Income
  5   | > $80,000         | High Income
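A minimal sketch of this table with pandas, using explicit bin edges (the income values are hypothetical):

```python
# Minimal sketch: the income bins above with pandas, using explicit edges.
import pandas as pd

incomes = pd.Series([15_000, 35_000, 47_000, 72_000, 95_000])  # hypothetical values
edges = [0, 20_000, 40_000, 60_000, 80_000, float("inf")]
labels = ["Low", "Lower-Middle", "Middle", "Upper-Middle", "High"]
# right=False makes intervals [low, high), matching "$20,000 - $39,999" etc.
print(pd.cut(incomes, bins=edges, labels=labels, right=False))
```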
Numeric to Categorical
Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

Equal-Width Binning
• Define the number of bins: let's choose 3 bins for simplicity.
• Calculate the bin width: divide the total income range (highest − lowest) by the number of bins: (120,000 − 10,000) / 3 ≈ 36,667.
• Create bins by starting at the minimum and stepping by the width:
  • Bin 1: $10,000 - $46,667
  • Bin 2: $46,667 - $83,333
  • Bin 3: $83,333 - $120,000
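A minimal pandas sketch of equal-width binning on the incomes above:

```python
# Minimal sketch: equal-width binning with pandas (incomes from the slide).
import pandas as pd

incomes = pd.Series([10_000, 25_000, 38_000, 52_000, 68_000,
                     85_000, 102_000, 120_000])
# bins=3 splits the range min..max into three intervals of equal width:
# 10k-38k -> Bin 1, 52k-68k -> Bin 2, 85k-120k -> Bin 3.
print(pd.cut(incomes, bins=3, labels=["Bin 1", "Bin 2", "Bin 3"]))
```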
Numeric to Categorical
Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

Percentile-Based Binning
o Calculate percentiles: we'll use quartiles (dividing the data into 4 equal groups); you can choose other percentiles based on your needs. Here each quartile is taken as the nearest actual data value:
  • Q1 (25th percentile): $25,000
  • Q2 (50th percentile, median): $52,000
  • Q3 (75th percentile): $85,000
o Create bins:
  • Bin 1: < $25,000
  • Bin 2: $25,000 - $52,000
  • Bin 3: $52,000 - $85,000
  • Bin 4: > $85,000 (note: this approach creates 4 bins, but you can merge the last two if you want 3)
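A minimal pandas sketch of quartile binning; note that pandas interpolates the quantiles, so its cut points differ slightly from the data-value quartiles quoted above:

```python
# Minimal sketch: quartile-based binning with pandas (incomes from the slide).
import pandas as pd

incomes = pd.Series([10_000, 25_000, 38_000, 52_000, 68_000,
                     85_000, 102_000, 120_000])
# qcut picks the bin edges from quantiles, so each bin gets ~25% of the values.
print(pd.qcut(incomes, q=4, labels=["Bin 1", "Bin 2", "Bin 3", "Bin 4"]))
```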
Numeric to Categorical

Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

K-Means Clustering (using k = 3)
o Apply the K-Means algorithm:
  • Run the algorithm with k (the number of clusters) set to 3.
  • This groups the data points into 3 clusters based on their income values.
o Define bin labels:
  • Based on the cluster centers (the average income within each cluster), assign appropriate labels to each bin, such as "Low Income," "Middle Income," and "High Income."
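A minimal sketch with scikit-learn's KMeans on the same incomes:

```python
# Minimal sketch: k-means binning with scikit-learn (incomes from the slide).
import numpy as np
from sklearn.cluster import KMeans

incomes = np.array([10_000, 25_000, 38_000, 52_000, 68_000,
                    85_000, 102_000, 120_000]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(incomes)
print(kmeans.labels_)           # cluster index assigned to each income
print(kmeans.cluster_centers_)  # per-cluster mean income, used to name the bins
```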
Numeric to Categorical

Comparison:
o These three techniques result in different binning structures:
  • Equal-width: creates bins of equal size, but might not capture the underlying distribution if the data is skewed.
  • Percentile-based: captures the distribution better, but the bin sizes might not be equal.
  • K-Means: creates data-driven bins based on natural groupings, but requires choosing the number of clusters (k) and might be more complex to interpret.
Numeric to Categorical

Choosing the Best Technique:
o The best technique depends on your specific goals and the data characteristics.
  • Equal-width binning might be suitable if you need equally sized groups or have a relatively uniform income distribution.
  • Percentile-based binning is appropriate if you want to capture the distribution of income levels and understand the proportions within each range.
  • K-Means clustering can be useful if you want data-driven groupings and the income distribution is not easily captured by simple binning methods.
Unstructured data vs structured data

Text (unstructured) to vector (structured)

o Some techniques:
  • Bag-of-Words (BoW)
  • TF-IDF (Term Frequency - Inverse Document Frequency)
  • Word2Vec
  • Doc2Vec
  • FastText
  • Others: GloVe, BERT, ELMo, …
o Notes:
  • The choice depends on the intended use, dataset size, and computing resources.
  • BoW and TF-IDF are simple and suitable for small datasets.
  • Word2Vec, Doc2Vec, and FastText are more effective but more complex, suitable for large datasets.
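A minimal scikit-learn sketch of the two simplest techniques, using a hypothetical two-document corpus:

```python
# Minimal sketch: bag-of-words and TF-IDF with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["machine learning transforms data",
          "data presentation and data transformation"]
bow = CountVectorizer().fit_transform(corpus)    # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(corpus)  # counts reweighted by rarity
print(bow.toarray())
print(tfidf.toarray())
```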
