Hyp Data

The document describes methods for creating hypothetical data frames for modeling. It first generates a data frame with student IDs, gender, and programming language attributes filled with random sample values. It then applies one-hot encoding to transform categorical features into binary columns. A second data frame is generated containing student efficiency levels in various languages, also one-hot encoded. The document demonstrates how one-hot encoding can transform categorical data into numeric formats suitable for modeling.

Uploaded by

Subha

Hypothetical Data

2022-11-21

knitr::opts_chunk$set(echo = TRUE)

Required Libraries:

library(plyr)
library(dplyr)
library(caret)

Hypothetical Data

The hypothetical data is built in two steps: first create a data frame from randomly sampled attributes, then
apply one-hot encoding to its categorical columns. The first data frame, d1, uses the following attributes:
ID: the ID of the student (column id)
Gender: the gender of the student, “M” or “F” (column gender)
prog_language: the programming language the student is proficient in, one of “SQL”, “Python”, “Java”, “R”
or “C++” (column prog)
These attributes are sampled at random, with replacement, for the 10 students under consideration.

id = 1:10
# Sample each attribute with replacement; no seed is set, so values differ between runs
gender = sample(c("M", "F"), size = 10, replace = TRUE)
prog = sample(c("SQL", "Python", "Java", "R", "C++"), size = 10, replace = TRUE)

d1 = data.frame(id, gender, prog)
d1

##    id gender   prog
## 1   1      F    SQL
## 2   2      F   Java
## 3   3      M Python
## 4   4      F    C++
## 5   5      F Python
## 6   6      M   Java
## 7   7      F      R
## 8   8      F      R
## 9   9      F    SQL
## 10 10      F    C++
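
Because no seed is set, the table above changes on every run. A minimal sketch of a reproducible variant follows. The seed value 123 and the names n, langs, d1_repro and gender_again are illustrative assumptions, not part of the original code.

```r
# Fix the random number generator so the sampled attributes repeat across runs.
set.seed(123)  # assumed seed; any integer works

n <- 10
langs <- c("SQL", "Python", "Java", "R", "C++")

d1_repro <- data.frame(
  id     = seq_len(n),
  gender = sample(c("M", "F"), size = n, replace = TRUE),
  prog   = sample(langs, size = n, replace = TRUE)
)

# Re-seeding and drawing again reproduces the same gender column.
set.seed(123)
gender_again <- sample(c("M", "F"), size = n, replace = TRUE)
identical(as.character(d1_repro$gender), gender_again)
```

Calling set.seed() once at the top of the document should be enough to make every later sample() call repeatable as well.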

Transforming Data

When a feature has multiple levels, one-hot (dummy) encoding creates a binary column for each category
level. In each dummy variable, the value 1 marks the presence of that level and the value 0 marks its
absence. In caret, predict() on a dummyVars object returns an ordinary numeric matrix (not a sparse one),
which is converted to a data frame here.
The argument fullRank = TRUE creates n-1 columns for a categorical variable with n unique levels; the omitted
level is the baseline and shows up as zeros in all of that variable's columns.
The first statement builds the encoder with dummyVars(); the second uses predict() on that object to
transform the dataset.

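# Build the encoder from every column of d1; fullRank = TRUE drops one level per categorical variable
dmy = dummyVars(" ~ .", data = d1, fullRank = TRUE)

# Apply the encoder and convert the resulting matrix to a data frame
final = data.frame(predict(dmy, newdata = d1))
final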

##    id genderM progJava progPython progR progSQL
## 1   1       0        0          0     0       1
## 2   2       0        1          0     0       0
## 3   3       1        0          1     0       0
## 4   4       0        0          0     0       0
## 5   5       0        0          1     0       0
## 6   6       1        1          0     0       0
## 7   7       0        0          0     1       0
## 8   8       0        0          0     1       0
## 9   9       0        0          0     0       1
## 10 10       0        0          0     0       0
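
As a cross-check, the same n-1 encoding can be sketched with base R's model.matrix() instead of caret. This is a swapped-in base-R technique, not the caret workflow used in this document. The data frame d1_fixed hard-codes the ten rows of d1 printed earlier so that the result is deterministic; the names d1_fixed and mm are illustrative.

```r
# Recreate the ten printed rows of d1 (hard-coded, not sampled).
d1_fixed <- data.frame(
  id     = 1:10,
  gender = c("F", "F", "M", "F", "F", "M", "F", "F", "F", "F"),
  prog   = c("SQL", "Java", "Python", "C++", "Python",
             "Java", "R", "R", "SQL", "C++"),
  stringsAsFactors = TRUE
)

# model.matrix() uses treatment contrasts by default: each factor loses its
# first level (here "F" and "C++"), which becomes the all-zero baseline.
mm <- model.matrix(~ ., data = d1_fixed)

# Drop the intercept column to keep only id and the dummy columns.
mm <- mm[, colnames(mm) != "(Intercept)"]

colnames(mm)

# Students 4 and 10 use C++, so the sum over their prog* columns is zero.
rowSums(mm[, grepl("^prog", colnames(mm))])
```

A row whose prog* columns are all zero therefore does not mean "no language"; it stands for the omitted baseline level.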

Another way to construct the data is to record each student's efficiency level in every programming
language, using three categories: “Low”, “Medium” and “High”. The same one-hot encoding with
dummyVars() can then be applied.

id = 1:10
gender = sample(c("M", "F"), size = 10, replace = TRUE)

# Efficiency level per language, each sampled independently with replacement
eff_python = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_R = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_C_prog = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_SQL = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_java = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
d2 = data.frame(id, gender, eff_java, eff_python, eff_C_prog, eff_SQL, eff_R)
d2

##    id gender eff_java eff_python eff_C_prog eff_SQL  eff_R
## 1   1      M      Low        Low       High    High Medium
## 2   2      F     High     Medium        Low    High   High
## 3   3      M     High        Low        Low    High Medium
## 4   4      F     High       High       High  Medium    Low
## 5   5      F   Medium       High     Medium    High   High
## 6   6      M   Medium       High     Medium  Medium   High
## 7   7      M   Medium     Medium        Low    High    Low
## 8   8      M   Medium     Medium     Medium    High    Low
## 9   9      M      Low       High        Low     Low   High
## 10 10      F   Medium     Medium     Medium  Medium   High
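
The five efficiency columns all follow the same pattern, so they can also be generated in a loop. A minimal sketch, assuming an arbitrary seed and the hypothetical names n, eff_levels, langs, eff_cols and d2_alt:

```r
set.seed(2022)  # assumed seed, only for repeatability

n <- 10
eff_levels <- c("Low", "Medium", "High")
langs <- c("java", "python", "C_prog", "SQL", "R")

# One independently sampled efficiency vector per language.
eff_cols <- lapply(langs, function(lang) {
  sample(eff_levels, size = n, replace = TRUE)
})
names(eff_cols) <- paste0("eff_", langs)

# data.frame() expands the named list into one column per language.
d2_alt <- data.frame(
  id     = seq_len(n),
  gender = sample(c("M", "F"), size = n, replace = TRUE),
  eff_cols
)

names(d2_alt)
```

Adding a language then only requires extending langs, at the cost of being a little less explicit than the column-by-column version.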

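Applying the same encoding to d2 gives the final data. With fullRank = TRUE, “High” (alphabetically the
first level) is the omitted baseline for every efficiency column: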

dmy_2 = dummyVars(" ~ .", data = d2, fullRank = TRUE)
final_2 = data.frame(predict(dmy_2, newdata = d2))
final_2

##    id genderM eff_javaLow eff_javaMedium eff_pythonLow eff_pythonMedium
## 1   1       1           1              0             1                0
## 2   2       0           0              0             0                1
## 3   3       1           0              0             1                0
## 4   4       0           0              0             0                0
## 5   5       0           0              1             0                0
## 6   6       1           0              1             0                0
## 7   7       1           0              1             0                1
## 8   8       1           0              1             0                1
## 9   9       1           1              0             0                0
## 10 10       0           0              1             0                1
##    eff_C_progLow eff_C_progMedium eff_SQLLow eff_SQLMedium eff_RLow eff_RMedium
## 1              0                0          0             0        0           1
## 2              1                0          0             0        0           0
## 3              1                0          0             0        0           1
## 4              0                0          0             1        1           0
## 5              0                1          0             0        0           0
## 6              0                1          0             1        0           0
## 7              1                0          0             0        1           0
## 8              0                1          0             0        1           0
## 9              1                0          1             0        0           0
## 10             0                1          0             1        0           0
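
The encoded columns omit “High” because R orders factor levels alphabetically (High, Low, Medium) and the first level becomes the baseline. If “Low” is a more natural baseline, the level order can be declared explicitly before encoding. A minimal sketch using base R's model.matrix() on a small hand-made vector; the names eff and enc are illustrative and the values are not taken from d2:

```r
# Hand-made efficiency values (not taken from the sampled data).
eff <- c("Low", "High", "Medium", "Low", "High")

# Declare the level order explicitly so that "Low" comes first.
eff <- factor(eff, levels = c("Low", "Medium", "High"))

# With default treatment contrasts the first level ("Low") is the baseline;
# [, -1] drops the intercept column.
enc <- model.matrix(~ eff)[, -1]

colnames(enc)
enc
```

The same factor() call, applied to each eff_* column of d2 before encoding, should change the baseline used by dummyVars() in the same way, since it also works from the factor's level order.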
