
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Data Presentation
& Data Transformation

Vuong Trong Nhan ([email protected])


Outline

• Data presentation
  o Numeric features
  o Categorical features
• Feature engineering
• Data transformation

Data Representation

Data in Machine Learning

𝑥ᵢ: input vector, independent variables

𝑦: response variable, dependent variable

o 𝑦 ∈ {−1, 1} or {0, 1}: binary classification
o 𝑦 ∈ {1, 2, …, K}: multi-class classification
o 𝑦 ∈ ℝ: regression

Goal: predict 𝑦 for a newly observed 𝑥

Type of data

Data can be grouped into three broad types:
o Structured data: numeric or categorical
o Semi-structured data
o Unstructured data
Type of data: structured

• Numeric variables
  • Continuous
    • Floating-point numbers (real numbers)
    • E.g., a flower's sepal width and height (in centimeters)
  • Discrete
    • Integer values
    • E.g., flowers cannot have 10.14 petals, though they might have 9, 10, or 11.
Type of data: structured

o Categorical variables: discrete or qualitative variables
  o Nominal
    o Two or more categories with no intrinsic order
    o E.g., eye color (brown, blue, green, etc.)
  o Ordinal
    o Two or more categories that can be ordered or ranked
    o E.g., education level
      (High School, Bachelor's Degree, Master's Degree, Doctorate)
Data transformation

o Structured data
  o Numeric to Categorical (binning)
  o Categorical to Numeric (encoding)
o Unstructured to structured data
  o Text
    o Bag of Words
    o TF-IDF
    o …

Categorical to numeric
Customers:

  ID | Gender | Education Level | Job Title
  1  | Female | Bachelor        | Sales Representative
  2  | Male   | High school     | Customer Service Associate
  3  | Male   | PhD             | Software Engineer
  4  | Female | Master          | Marketing Manager
  …  | …      | …               | …

Ordinal encoder: categorical (ordinal) → numeric

  Education Level | Encoded value
  High school     | 0
  Bachelor        | 1
  Master          | 2
  PhD             | 3

Assumes the categories have an inherent order (higher education implies higher knowledge).

Note:
o Applicable when categories have a natural order (e.g., customer satisfaction levels).
o Assigns numerical values that reflect the order (e.g., satisfied = 1, neutral = 2, dissatisfied = 3).
o Advantage: captures the order information present in the data.
o Disadvantage: only suitable for ordinal data, not nominal data (with no inherent order).
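A minimal sketch of this mapping, assuming scikit-learn is available (the values are the hypothetical ones from the table above):

```python
# Minimal sketch: ordinal encoding with scikit-learn (toy data from the table).
from sklearn.preprocessing import OrdinalEncoder

education = [["Bachelor"], ["High school"], ["PhD"], ["Master"]]
# Passing the category order explicitly makes the codes reflect the ranking.
encoder = OrdinalEncoder(categories=[["High school", "Bachelor", "Master", "PhD"]])
print(encoder.fit_transform(education))  # [[1.], [0.], [3.], [2.]]
```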
Categorical to numeric
Customers:

  ID | Gender | Education Level | Job Title
  1  | Female | Bachelor        | Sales Representative
  2  | Male   | High school     | Customer Service Associate
  3  | Male   | PhD             | Software Engineer
  4  | Female | Master          | Marketing Manager
  …  | …      | …               | …

Label encoder: categorical (nominal) → numeric

  Gender | Encoded value
  Female | 0
  Male   | 1

Note:
o Simplest approach: assigns a unique integer to each category.
o Advantage: easy to implement and understand.
o Disadvantage: implies an inherent order between categories, which might not be true (e.g., assigning "0" to Female and "1" to Male doesn't imply Female is "better").
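A minimal sketch with scikit-learn's LabelEncoder, using the Gender column from the table above:

```python
# Minimal sketch: label encoding with scikit-learn (toy data from the table).
from sklearn.preprocessing import LabelEncoder

gender = ["Female", "Male", "Male", "Female"]
encoder = LabelEncoder()  # classes are sorted alphabetically: Female=0, Male=1
print(encoder.fit_transform(gender))  # [0 1 1 0]
```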
Categorical to numeric
One-hot encoding: categorical (nominal) → numeric

  Job Title                  | Software Engineer | Marketing Manager | Sales Representative | Customer Service Associate
  Software Engineer          | 1 | 0 | 0 | 0
  Marketing Manager          | 0 | 1 | 0 | 0
  Sales Representative       | 0 | 0 | 1 | 0
  Customer Service Associate | 0 | 0 | 0 | 1

Note:
o Creates a new binary feature for each category.
o Advantage: preserves the categorical nature of the data and avoids imposing order.
o Disadvantage: can lead to a high number of features if there are many categories, increasing computational cost.
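A minimal sketch of the same table, assuming pandas is available:

```python
# Minimal sketch: one-hot encoding with pandas (toy data from the table).
import pandas as pd

jobs = pd.Series(["Software Engineer", "Marketing Manager",
                  "Sales Representative", "Customer Service Associate"])
# get_dummies creates one binary column per distinct job title.
print(pd.get_dummies(jobs))
```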
Categorical to numeric
Frequency encoding: categorical (nominal) → numeric

  Job Title                  | Frequency | Normalized frequency in [0, 1]
  Software Engineer          | 5         | 5/11 ≈ 0.45
  Marketing Manager          | 3         | 3/11 ≈ 0.27
  Sales Representative       | 2         | 2/11 ≈ 0.18
  Customer Service Associate | 1         | 1/11 ≈ 0.09

  Total = 5 + 3 + 2 + 1 = 11

Note:
o Assigns a value to each category based on its frequency / normalized frequency in the data.
o More frequent categories get higher values.
o Advantage: can be useful for capturing the importance of categories based on their prevalence.
o Disadvantage: may not be suitable for all machine learning models.
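A minimal pandas sketch that reproduces the normalized frequencies above (the category counts match the table):

```python
# Minimal sketch: frequency encoding with pandas (counts match the table).
import pandas as pd

jobs = pd.Series(["Software Engineer"] * 5 + ["Marketing Manager"] * 3 +
                 ["Sales Representative"] * 2 + ["Customer Service Associate"])
freq = jobs.value_counts(normalize=True)  # e.g. Software Engineer -> 5/11 = 0.45
print(jobs.map(freq))  # replace each category with its normalized frequency
```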
Numeric to Categorical

o Converting numerical data into categorical data, also known as binning or discretization, involves grouping continuous numerical values into distinct categories.
o Common scenarios where binning is used:
  • Exploratory data analysis (EDA)
  • Feature engineering for machine learning
  • Discretization for specific tasks
o Limitations:
  • Loss of information: grouping continuous data into categories leads to some loss of information about the original data points.
  • Arbitrary choices: the choice of bin boundaries can be subjective and can significantly impact the results, especially with a small number of bins.
  • Potential for bias: if not done carefully, binning can introduce bias into the data, leading to misleading interpretations or unfair outcomes.
Numeric to Categorical

o Example: binning income into five ranges (specific techniques follow on the next slides):

  Bin | Income Range      | Description
  1   | < $20,000         | Low Income
  2   | $20,000 - $39,999 | Lower-Middle Income
  3   | $40,000 - $59,999 | Middle Income
  4   | $60,000 - $79,999 | Upper-Middle Income
  5   | > $80,000         | High Income
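A minimal sketch of this table with pandas, using explicit bin edges (the income values are hypothetical):

```python
# Minimal sketch: the income bins above with pandas, using explicit edges.
import pandas as pd

incomes = pd.Series([15_000, 35_000, 47_000, 72_000, 95_000])  # hypothetical values
edges = [0, 20_000, 40_000, 60_000, 80_000, float("inf")]
labels = ["Low", "Lower-Middle", "Middle", "Upper-Middle", "High"]
# right=False makes intervals [low, high), matching "$20,000 - $39,999" etc.
print(pd.cut(incomes, bins=edges, labels=labels, right=False))
```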
Numeric to Categorical
Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

Equal-Width Binning
• Define the number of bins: let's choose 3 bins for simplicity.
• Calculate the bin width: divide the total income range (highest − lowest) by the number of bins: (120,000 − 10,000) / 3 ≈ 36,667.
• Create bins by starting at the minimum and stepping by the width:
  • Bin 1: $10,000 - $46,667
  • Bin 2: $46,667 - $83,333
  • Bin 3: $83,333 - $120,000
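A minimal pandas sketch of equal-width binning on the incomes above:

```python
# Minimal sketch: equal-width binning with pandas (incomes from the slide).
import pandas as pd

incomes = pd.Series([10_000, 25_000, 38_000, 52_000, 68_000,
                     85_000, 102_000, 120_000])
# bins=3 splits the range min..max into three intervals of equal width:
# 10k-38k -> Bin 1, 52k-68k -> Bin 2, 85k-120k -> Bin 3.
print(pd.cut(incomes, bins=3, labels=["Bin 1", "Bin 2", "Bin 3"]))
```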
Numeric to Categorical
Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

Percentile-Based Binning
o Calculate percentiles: we'll use quartiles (dividing the data into 4 equal groups); you can choose other percentiles based on your needs. Here each quartile is taken as the nearest actual data value:
  • Q1 (25th percentile): $25,000
  • Q2 (50th percentile, median): $52,000
  • Q3 (75th percentile): $85,000
o Create bins:
  • Bin 1: < $25,000
  • Bin 2: $25,000 - $52,000
  • Bin 3: $52,000 - $85,000
  • Bin 4: > $85,000 (note: this approach creates 4 bins, but you can merge the last two if you want 3)
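A minimal pandas sketch of quartile binning; note that pandas interpolates the quantiles, so its cut points differ slightly from the data-value quartiles quoted above:

```python
# Minimal sketch: quartile-based binning with pandas (incomes from the slide).
import pandas as pd

incomes = pd.Series([10_000, 25_000, 38_000, 52_000, 68_000,
                     85_000, 102_000, 120_000])
# qcut picks the bin edges from quantiles, so each bin gets ~25% of the values.
print(pd.qcut(incomes, q=4, labels=["Bin 1", "Bin 2", "Bin 3", "Bin 4"]))
```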
Numeric to Categorical

Scenario: imagine you have data on customer income (continuous numerical values) and want to group them into categories for analysis.
Data (example income values):
[10,000, 25,000, 38,000, 52,000, 68,000, 85,000, 102,000, 120,000]

K-Means Clustering (using k = 3)
o Apply the K-Means algorithm:
  • Run the algorithm with k (the number of clusters) set to 3.
  • This groups the data points into 3 clusters based on their income values.
o Define bin labels:
  • Based on the cluster centers (the average income within each cluster), assign appropriate labels to each bin, such as "Low Income," "Middle Income," and "High Income."
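A minimal sketch with scikit-learn's KMeans on the same incomes:

```python
# Minimal sketch: k-means binning with scikit-learn (incomes from the slide).
import numpy as np
from sklearn.cluster import KMeans

incomes = np.array([10_000, 25_000, 38_000, 52_000, 68_000,
                    85_000, 102_000, 120_000]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(incomes)
print(kmeans.labels_)           # cluster index assigned to each income
print(kmeans.cluster_centers_)  # per-cluster mean income, used to name the bins
```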
Numeric to Categorical

Comparison:
o These three techniques result in different binning structures:
  • Equal-width: creates bins of equal size, but might not capture the underlying distribution if the data is skewed.
  • Percentile-based: captures the distribution better, but the bin sizes might not be equal.
  • K-Means: creates data-driven bins based on natural groupings, but requires choosing the number of clusters (k) and might be more complex to interpret.
Numeric to Categorical

Choosing the Best Technique:
o The best technique depends on your specific goals and the data characteristics.
  • Equal-width binning might be suitable if you need equally sized groups or have a relatively uniform income distribution.
  • Percentile-based binning is appropriate if you want to capture the distribution of income levels and understand the proportions within each range.
  • K-Means clustering can be useful if you want data-driven groupings and the income distribution is not easily captured by simple binning methods.
Unstructured data vs structured data

Text (unstructured) to vector (structured)

o Some techniques:
  • Bag-of-Words (BoW)
  • TF-IDF (Term Frequency - Inverse Document Frequency)
  • Word2Vec
  • Doc2Vec
  • FastText
  • Others: GloVe, BERT, ELMo, …
o Notes:
  • The choice depends on the intended use, dataset size, and computing resources.
  • BoW and TF-IDF are simple and suitable for small datasets.
  • Word2Vec, Doc2Vec, and FastText are more effective but more complex, suitable for large datasets.
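A minimal scikit-learn sketch of the two simplest techniques, using a hypothetical two-document corpus:

```python
# Minimal sketch: bag-of-words and TF-IDF with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["machine learning transforms data",
          "data presentation and data transformation"]
bow = CountVectorizer().fit_transform(corpus)    # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(corpus)  # counts reweighted by rarity
print(bow.toarray())
print(tfidf.toarray())
```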
