Hyp Data
PP
2022-11-21
knitr::opts_chunk$set(echo = TRUE)
Required Libraries:
library(plyr)
library(dplyr)
library(caret)
Hypothetical Data
The hypothetical data can be prepared with the one-hot encoding method. We first create a data frame from
some randomly generated attributes, for example:
ID: The ID of the student
Gender: Gender of the student (“M” or “F”)
prog_language: The programming language the student is proficient in (“SQL”, “Python”, “Java”, “R” or “C++”)
These attributes are randomly assigned to 10 students and stored in the columns id, gender and prog of a
data frame named d1:
id = 1:10
gender = sample(c("M", "F"), size = 10, replace = TRUE)
prog = sample(c("SQL", "Python", "Java", "R", "C++"), size = 10, replace = TRUE)
d1 = data.frame(id,gender, prog)
d1
##    id gender   prog
## 1   1      F    SQL
## 2   2      F   Java
## 3   3      M Python
## 4   4      F    C++
## 5   5      F Python
## 6   6      M   Java
## 7   7      F      R
## 8   8      F      R
## 9   9      F    SQL
## 10 10      F    C++
Transforming Data
When a feature has multiple levels, one-hot (dummy) encoding is applied to it, creating one binary column
for each category level. The resulting matrix consists mostly of zeros. In each dummy variable, the value
“1” indicates that the observation has that level, while the value “0” indicates that it does not.
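To make the 1/0 labelling concrete, the following minimal sketch builds the indicator columns by hand in base R. The three values in prog_demo are hypothetical and chosen only for illustration; this is not the method used in the rest of the document, which relies on caret.

```r
# Hypothetical three-student example (values chosen for illustration only)
prog_demo <- factor(c("SQL", "Python", "SQL"))

# One indicator column per level: 1 where the student has that level, 0 otherwise
one_hot <- sapply(levels(prog_demo), function(lv) as.integer(prog_demo == lv))
one_hot
```

Each row contains exactly one 1, because every student has exactly one level of the variable.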
The argument fullRank = TRUE makes dummyVars() create n-1 columns for a categorical variable with n unique
levels, which avoids a redundant column. The predict() function is then applied to the output of
dummyVars() to transform the dataset.
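The chunk this paragraph describes does not appear in the text, so the following is only a sketch, assuming it follows the same dummyVars()/predict() pattern that the document applies to d2. The names d1_demo, dmy_1 and final_1 are hypothetical, and fixed values replace sample() so that the result is stable.

```r
library(caret)

# Hypothetical stand-in for d1; columns are stored as factors explicitly
d1_demo <- data.frame(
  id     = 1:4,
  gender = c("F", "M", "F", "F"),
  prog   = c("SQL", "Java", "Python", "SQL"),
  stringsAsFactors = TRUE
)

# Build the encoder; fullRank = TRUE drops one level per categorical variable
dmy_1 <- dummyVars(" ~ .", data = d1_demo, fullRank = TRUE)

# Apply the encoder to the data and convert the resulting matrix to a data frame
final_1 <- data.frame(predict(dmy_1, newdata = d1_demo))
final_1
```

With two gender levels and three language levels, this should give one gender column and two language columns alongside id.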
Another way of constructing the data is to record each student's level of proficiency in the programming
language in which they have expertise, using three categories: “Low”, “Medium” and “High”. The same
one-hot encoding method can then be applied with the dummyVars() function in R.
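The construction of this second data set is not shown in the text. One possible shape for it, assuming the same 10 students and a single proficiency column, could be sketched as follows. The column name level, the use of set.seed() and the exact columns of d2 are assumptions.

```r
set.seed(1)  # assumption: fixed seed so the sketch is repeatable; the original sets none

id <- 1:10
gender <- sample(c("M", "F"), size = 10, replace = TRUE)
# Hypothetical column name; the text only specifies the three categories
level <- sample(c("Low", "Medium", "High"), size = 10, replace = TRUE)

# Assumed layout of d2: one row per student, categorical columns stored as factors
d2 <- data.frame(id, gender, level, stringsAsFactors = TRUE)
d2
```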
Hence the final data with these attributes can be seen this way:
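# Build the encoder for d2; fullRank = TRUE drops one level per categorical variable
dmy_2 = dummyVars(" ~ .", data = d2, fullRank = TRUE)
# Apply the encoder and convert the resulting matrix to a data frame
final_2 = data.frame(predict(dmy_2, newdata = d2))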
final_2