Hyp Data

The document describes methods for creating hypothetical data frames for modeling. It first generates a data frame with student IDs, gender, and programming language attributes filled with random sample values. It then applies one-hot encoding to transform categorical features into binary columns. A second data frame is generated containing student efficiency levels in various languages, also one-hot encoded. The document demonstrates how one-hot encoding can transform categorical data into numeric formats suitable for modeling.

Uploaded by Subha
Hypothetical Data

PP

2022-11-21

knitr::opts_chunk$set(echo = TRUE)

Required Libraries:

library(plyr)
library(dplyr)
library(caret)

Hypothetical Data

Hypothetical data can be prepared for modeling with one-hot encoding. We first create a data frame from a
few randomly generated attributes, for example:
ID: the ID of the student (column id)
Gender: the gender of the student, "M" or "F" (column gender)
prog_language: the programming language the student is proficient in, one of "SQL", "Python", "Java",
"R", or "C++" (column prog)
These attributes are sampled at random for 10 students to create the data frame named d1.

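# Sample the attributes at random for 10 students (no seed is set, so values differ between runs)
id = 1:10
gender = sample(c("M", "F"), size = 10, replace = TRUE)
prog = sample(c("SQL", "Python", "Java", "R", "C++"), size = 10, replace = TRUE)

d1 = data.frame(id, gender, prog)
d1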

##    id gender   prog
## 1   1      F    SQL
## 2   2      F   Java
## 3   3      M Python
## 4   4      F    C++
## 5   5      F Python
## 6   6      M   Java
## 7   7      F      R
## 8   8      F      R
## 9   9      F    SQL
## 10 10      F    C++
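
Because sample() draws at random and no seed is set, the table above will differ on every run. A minimal sketch of making the draw reproducible with set.seed() follows. The seed value 123 and the helper name make_d1 are arbitrary choices for illustration and are not part of the original analysis.

```r
# Hypothetical helper: rebuilds d1 from a fixed seed (the seed value is an assumption)
make_d1 = function(seed = 123) {
  set.seed(seed)
  id = 1:10
  gender = sample(c("M", "F"), size = 10, replace = TRUE)
  prog = sample(c("SQL", "Python", "Java", "R", "C++"), size = 10, replace = TRUE)
  data.frame(id, gender, prog)
}

d1_a = make_d1()
d1_b = make_d1()
identical(d1_a, d1_b)  # TRUE: the same seed gives the same draws
```

Calling make_d1() with a different seed gives a different, but equally repeatable, data frame.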

Transforming Data

When a feature has multiple levels, one-hot (dummy) encoding is applied to it, creating a binary column for
each category level and returning a numeric matrix. In each dummy variable, the value 1 indicates that the
observation has that level, while the value 0 indicates that it does not.
The argument fullRank = TRUE creates n-1 columns for a categorical variable with n unique levels; the
omitted level acts as the reference and appears as 0 in all of that variable's dummy columns.
The predict() call then uses the output of the dummyVars() function to transform the dataset.

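# Build the encoder; fullRank = TRUE drops one level per categorical variable
dmy = dummyVars(" ~ .", data = d1, fullRank = TRUE)

# Apply the encoder to d1 and convert the resulting matrix to a data frame
final = data.frame(predict(dmy, newdata = d1))
final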

##    id genderM progJava progPython progR progSQL
## 1   1       0        0          0     0       1
## 2   2       0        1          0     0       0
## 3   3       1        0          1     0       0
## 4   4       0        0          0     0       0
## 5   5       0        0          1     0       0
## 6   6       1        1          0     0       0
## 7   7       0        0          0     1       0
## 8   8       0        0          0     1       0
## 9   9       0        0          0     0       1
## 10 10       0        0          0     0       0
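
To make the effect of fullRank concrete, the sketch below encodes a small toy data frame both ways and compares the column counts. The toy data and the names toy, full, and reduced are made up for illustration; the column is stored as a factor here, which is an assumption that differs from d1.

```r
library(caret)

# Toy data (not from the original): one row per programming language
toy = data.frame(
  id = 1:5,
  prog = factor(c("SQL", "Python", "Java", "R", "C++"))
)

# fullRank = FALSE keeps one dummy column for every level
full = predict(dummyVars(" ~ .", data = toy, fullRank = FALSE), newdata = toy)

# fullRank = TRUE drops the first level, leaving n-1 dummy columns
reduced = predict(dummyVars(" ~ .", data = toy, fullRank = TRUE), newdata = toy)

ncol(full)     # 6: id plus five prog columns
ncol(reduced)  # 5: id plus four prog columns
```

With fullRank = FALSE the dummy columns of one variable always sum to 1 in each row, so they are linearly dependent with an intercept; dropping one level avoids that in models such as linear regression.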

Another way of constructing the data is to record each student's proficiency level in each programming
language, using three categories: "Low", "Medium", and "High". The same one-hot encoding can then be
applied with the dummyVars() function in R.

# Sample gender and a proficiency level per language for 10 students
id = 1:10
gender = sample(c("M", "F"), size = 10, replace = TRUE)
eff_python = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_R = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_C_prog = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_SQL = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
eff_java = sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)
d2 = data.frame(id, gender, eff_java, eff_python, eff_C_prog, eff_SQL, eff_R)
d2

##    id gender eff_java eff_python eff_C_prog eff_SQL  eff_R
## 1   1      M      Low        Low       High    High Medium
## 2   2      F     High     Medium        Low    High   High
## 3   3      M     High        Low        Low    High Medium
## 4   4      F     High       High       High  Medium    Low
## 5   5      F   Medium       High     Medium    High   High
## 6   6      M   Medium       High     Medium  Medium   High
## 7   7      M   Medium     Medium        Low    High    Low
## 8   8      M   Medium     Medium     Medium    High    Low
## 9   9      M      Low       High        Low     Low   High
## 10 10      F   Medium     Medium     Medium  Medium   High
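
The five proficiency columns are all drawn the same way, so they could also be generated in a loop. The following is only a sketch of that idea; the names langs, eff_levels, eff, and d2_alt are hypothetical, and since the draws are random the values will not match the table above.

```r
# Language suffixes as used in the column names of d2
langs = c("java", "python", "C_prog", "SQL", "R")
eff_levels = c("Low", "Medium", "High")

# One random proficiency vector of length 10 per language
eff = lapply(langs, function(l) sample(eff_levels, size = 10, replace = TRUE))
names(eff) = paste0("eff_", langs)

# data.frame() expands the named list into one column per language
d2_alt = data.frame(
  id = 1:10,
  gender = sample(c("M", "F"), size = 10, replace = TRUE),
  eff
)
names(d2_alt)
```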

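The encoded data with these attributes then looks as follows. Because the levels are sorted alphabetically,
"High" is the level omitted by fullRank = TRUE, so a student with "High" has 0 in both the Low and Medium
columns for that language: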

# Encode d2 the same way as d1
dmy_2 = dummyVars(" ~ .", data = d2, fullRank = TRUE)
final_2 = data.frame(predict(dmy_2, newdata = d2))
final_2

##    id genderM eff_javaLow eff_javaMedium eff_pythonLow eff_pythonMedium
## 1   1       1           1              0             1                0
## 2   2       0           0              0             0                1
## 3   3       1           0              0             1                0
## 4   4       0           0              0             0                0
## 5   5       0           0              1             0                0
## 6   6       1           0              1             0                0
## 7   7       1           0              1             0                1
## 8   8       1           0              1             0                1
## 9   9       1           1              0             0                0
## 10 10       0           0              1             0                1
##    eff_C_progLow eff_C_progMedium eff_SQLLow eff_SQLMedium eff_RLow eff_RMedium
## 1              0                0          0             0        0           1
## 2              1                0          0             0        0           0
## 3              1                0          0             0        0           1
## 4              0                0          0             1        1           0
## 5              0                1          0             0        0           0
## 6              0                1          0             1        0           0
## 7              1                0          0             0        1           0
## 8              0                1          0             0        1           0
## 9              1                0          1             0        0           0
## 10             0                1          0             1        0           0
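
In the output above, the omitted level is "High" only because it comes first alphabetically. If a different reference level is preferred, one option is to store the column as a factor with the levels listed explicitly. This is a sketch under that assumption, using made-up toy data; the names lv, toy, and enc are hypothetical.

```r
library(caret)

# Toy data (not from the original); levels are listed from lowest to highest
lv = c("Low", "Medium", "High")
toy = data.frame(
  id = 1:4,
  eff_python = factor(c("Low", "High", "Medium", "Low"), levels = lv)
)

# fullRank = TRUE drops the first factor level, which is now "Low"
enc = predict(dummyVars(" ~ .", data = toy, fullRank = TRUE), newdata = toy)
colnames(enc)  # id plus one Medium and one High column; no Low column
```

Rows where eff_python is "Low" then carry 0 in both remaining dummy columns, which makes "Low" the baseline that the other levels are compared against.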
