C3 DSC551 R Programming
Science (R Programming)
3. Data Management in R Programming
1. Explains how to read data from various sources and how to write data to CSV and text
files.
2. Discusses how to identify, handle, and replace missing values in datasets.
3. Explores the concept of converting data between different classes and highlights the
importance of understanding coercion rules.
4. Demonstrates how to combine datasets and how to select specific variables or
observations.
5. Illustrates how to sort data in ascending or descending order.
Warning
Make sure you close the edit() window before you run the next function/expression. Your
console is not ready to run until you close the window.
# Open the help page for read.table()
?read.table
file: the name of a file, or a connection (which will be opened for reading if necessary).
header: a logical value indicating whether the file contains the names of the variables as its
first line.
sep: the field separator character. Values on each line of the file are separated by this
character. If sep = "" (the default for read.table) the separator is ‘white space’, that is
one or more spaces, tabs, newlines or carriage returns.
col.names: a vector of optional names for the variables. The default is to use "V" followed
by the column number.
stringsAsFactors: should character variables be converted to factors? The default is FALSE
since R 4.0.0 (it was TRUE in earlier versions).
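The arguments above can be tried out with a small sketch. The file contents and the column names (id, name, score, ID, Name, Score) are made up for illustration, and a temporary file is used so nothing on your computer is overwritten.

```r
# Write a small tab-separated sample file to a temporary location
# (the contents are hypothetical, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore",
             "1\tAli\t80.5",
             "2\tSiti\t91.0"), tmp)

# header = TRUE: the first line holds the variable names
# sep = "\t": fields are separated by tabs
# stringsAsFactors = FALSE: keep text columns as character
dat <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(dat)

# Skip the header line and supply names through col.names
# instead of the default V1, V2, ...
dat2 <- read.table(tmp, header = FALSE, sep = "\t", skip = 1,
                   col.names = c("ID", "Name", "Score"))
names(dat2)
```

Setting stringsAsFactors explicitly makes the result the same on old and new versions of R.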
Tip
To read data from Excel files, you can use the readxl or openxlsx package,
or use the menu File > Import Dataset >…
Warning
Make sure you close the window opened by file.choose() before you run the next
line of your R script. Your console is not ready to run the next expression until you close the
window.
Tip
# Directly export to Excel format
library(openxlsx)
write.xlsx(your_data_frame, file = "my_data.xlsx")

# Export to SPSS statistical software
library(haven)
write_sav(your_data_frame, "my_data.sav")
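For CSV and plain text files no extra package is needed. The sketch below uses base R's write.csv() and write.table(). The data frame scores and the temporary file names are hypothetical, chosen only for this illustration.

```r
# Hypothetical data frame used only for this illustration
scores <- data.frame(name = c("Ali", "Siti"), score = c(80.5, 91))

csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")

# CSV file; row.names = FALSE leaves out the row-number column
write.csv(scores, file = csv_file, row.names = FALSE)

# Tab-delimited text file; quote = FALSE writes strings without quotation marks
write.table(scores, file = txt_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read both files back to check the round trip
back_csv <- read.csv(csv_file)
back_txt <- read.table(txt_file, header = TRUE, sep = "\t")
```

In your own script, replace the temporary files with a real path such as "my_data.csv".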
You can use the is.*() family of functions to test an object's type or structure, for example
is.numeric(), is.matrix(), is.data.frame() and others.
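A short sketch of these tests follows. The objects v, m and d are made up for the example.

```r
v <- c(1.5, 2, 3)
m <- matrix(1:6, nrow = 2)
d <- data.frame(a = 1:3, b = c("x", "y", "z"))

is.numeric(v)      # TRUE
is.character(v)    # FALSE
is.matrix(m)       # TRUE
is.data.frame(d)   # TRUE
is.data.frame(m)   # FALSE
is.numeric(d$a)    # TRUE, because integer vectors count as numeric
```

Each is.*() function returns a single TRUE or FALSE, so it can be used directly inside an if() condition.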
Arithmetic expressions and functions that contain missing values yield missing values.
sum(vec1)
[1] NA
Set the na.rm = TRUE option to remove missing values before the calculation and
apply the function to the remaining values.
sum(vec1, na.rm = TRUE)
[1] 26
You can remove any observation with missing data by using the na.omit() function.
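As a minimal sketch of na.omit(), assume a small data frame df with made-up height and weight values. complete.cases() is shown alongside because it reports which rows na.omit() will keep.

```r
# Hypothetical data frame with missing values in both columns
df <- data.frame(height = c(160, NA, 172, 168),
                 weight = c(55, 60, NA, 70))

# TRUE for rows that have no missing value in any column
complete.cases(df)    # TRUE FALSE FALSE TRUE

# Drop every row that contains at least one NA
clean <- na.omit(df)  # keeps only rows 1 and 4
nrow(clean)           # 2
```

Note that na.omit() drops the whole row, so a single NA in one variable also removes that row's valid values in the other variables.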
is.na(w)
[1] FALSE FALSE FALSE TRUE FALSE
# counting the number of NAs
sum(is.na(w))
[1] 1
xx <- airquality[1:10, ]
xx
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
colMeans(xx)
  Ozone Solar.R    Wind    Temp   Month     Day
     NA      NA   11.98   65.10    5.00    5.50
colMeans(xx, na.rm = TRUE)
  Ozone Solar.R    Wind    Temp   Month     Day
 23.125 172.625  11.980  65.100   5.000   5.500
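Instead of skipping missing values, you may want to replace them. The sketch below fills the missing Ozone values with the mean of the observed values. Mean replacement is only one possible choice, used here as an assumption for illustration, and the copy yy is a hypothetical name.

```r
xx <- airquality[1:10, ]

# Copy first so the original subset stays unchanged
yy <- xx

# Replace each missing Ozone value with the mean of the observed values
yy$Ozone[is.na(yy$Ozone)] <- mean(yy$Ozone, na.rm = TRUE)

sum(is.na(yy$Ozone))   # 0
yy$Ozone[c(5, 10)]     # 23.125 23.125
```

The replacement value 23.125 is the same Ozone mean that colMeans(xx, na.rm = TRUE) reports.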
Coercion occurs so that every element in the vector has the same class.
R will try to find a way to represent all of the objects in the vector in a reasonable fashion.
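The following sketch shows this implicit coercion when values of different classes are combined with c(). The general direction is logical, then integer, then numeric, then character.

```r
class(c(1L, 2.5))        # "numeric": the integer becomes a double
class(c(1, TRUE))        # "numeric": TRUE becomes 1
class(c(TRUE, 2L))       # "integer"
class(c(1, "a", TRUE))   # "character": everything becomes text
c(1, "a", TRUE)          # "1" "a" "TRUE"
```

Once a single character value is present, the whole vector is stored as character.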
x1 <- 1:6
class(x1)
[1] "integer"
as.numeric(x1)
[1] 1 2 3 4 5 6
as.logical(x1)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x1)
[1] "1" "2" "3" "4" "5" "6"
Warning
While powerful, explicit coercion requires careful consideration. Forcing a conversion when
it’s not appropriate can lead to data loss or unexpected results. Always make sure the
conversion makes sense in the context of your data and analysis.
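A few typical cases where explicit coercion loses information are sketched below. The vectors s and f are made up for the example, and suppressWarnings() is used only to keep the output quiet.

```r
s <- c("10", "20", "abc")

# "abc" cannot be parsed as a number, so it becomes NA (with a warning)
n <- suppressWarnings(as.numeric(s))
n                             # 10 20 NA

# The fractional part is dropped, not rounded
as.integer(3.9)               # 3

# Converting a factor directly returns the internal level codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1
as.numeric(as.character(f))   # 10 20 10
```

For factors that hold numbers, convert to character first and then to numeric.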
If you don't need to specify a common key, you can use the cbind() function to combine
the data frames side by side.
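The difference between joining on a key and binding columns can be sketched as follows. The data frames left, right and extra are hypothetical, and merge() is assumed here as the key-based function.

```r
# Hypothetical data frames for illustration
left  <- data.frame(id = c(1, 2, 3),
                    gender = c("Male", "Female", "Female"))
right <- data.frame(id = c(3, 1, 2),
                    usage = c(14.8, 14.6, 15.7))

# merge() lines rows up by the common key "id",
# even though the rows of right are in a different order
merged <- merge(left, right, by = "id")

# cbind() places columns side by side, row by row, so it is only
# appropriate when both objects have the same number of rows in the same order
extra <- data.frame(hours = c(3.0, 3.8, 4.0))
bound <- cbind(left, extra)
```

Because cbind() does not check any key, make sure both objects are sorted the same way before you use it.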
Warning
To combine two data frames by rows (for example with rbind()), they must have the same
variables, but the variables don't have to be in the same order.
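A minimal sketch of this rule, assuming the row-wise combination is done with rbind() and using two made-up data frames a and b:

```r
a <- data.frame(name = c("Ali", "Siti"), score = c(80, 91))

# Same variables as a, but in a different order
b <- data.frame(score = 75, name = "Mei")

# The columns of b are matched to those of a by name
stacked <- rbind(a, b)
stacked
```

The result keeps the column order of the first data frame. If the variable names differ, rbind() stops with an error.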
str(mydata1)
'data.frame': 45 obs. of 6 variables:
 $ Gender       : chr "Male" "Female" "Female" "Female" ...
 $ Programs     : chr "Statistics" "Business" "Sciences" "Statistics" ...
 $ Car_Ownership: chr "Yes" "Yes" "No" "No" ...
 $ Telco_Prefer : chr "Celcom" "DiGi" "Celcom" "Maxis" ...
 $ Usage_GB     : num 14.6 15.7 14.8 15.4 12.9 22.4 28 19.2 25.4 25.3 ...
 $ Hour_Perday  : num 3 3.8 4 3.5 3 6 6.5 5.5 6 6 ...
# Keep the first three variables (all rows)
newdata1 <- mydata1[, 1:3]
head(newdata1)
  Gender   Programs Car_Ownership
1   Male Statistics           Yes
2 Female   Business           Yes
3 Female   Sciences            No
4 Female Statistics            No
5 Female   Sciences            No
6 Female   Sciences           Yes
head(newdata2)
  Gender   Programs Usage_GB
1   Male Statistics     14.6
2 Female   Business     15.7
3 Female   Sciences     14.8
4 Female Statistics     15.4
5 Female   Sciences     12.9
6 Female   Sciences     22.4
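Variables can also be selected by name, and observations by a condition. The sketch below uses a toy data frame that mimics a few columns of mydata1. The values and the names toy and heavy_f are made up, because the real dataset is not reproduced here.

```r
# Toy data frame mimicking a few columns of mydata1 (values are made up)
toy <- data.frame(
  Gender   = c("Male", "Female", "Female", "Male"),
  Programs = c("Statistics", "Business", "Sciences", "Business"),
  Usage_GB = c(14.6, 15.7, 12.9, 22.4)
)

# Select variables by name rather than by position
toy[, c("Gender", "Usage_GB")]

# Select observations with a logical condition
toy[toy$Usage_GB > 14, ]

# subset() does both: rows through the condition, columns through select
heavy_f <- subset(toy, Gender == "Female" & Usage_GB > 14,
                  select = c(Programs, Usage_GB))
heavy_f
```

Selecting by name is safer than selecting by position, because it still works if the column order of the source data changes.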
# Add a new variable Telco, copied from Telco_Prefer in the source data
newdata2$Telco <- mydata1$Telco_Prefer
head(newdata2)
  Gender   Programs Usage_GB    Telco
1   Male Statistics     14.6   Celcom
2 Female   Business     15.7     DiGi
3 Female   Sciences     14.8   Celcom
4 Female Statistics     15.4    Maxis
5 Female   Sciences     12.9 U-Mobile
6 Female   Sciences     22.4   Celcom
Tip
When you modify variables, be careful not to change the original data. It is advisable to
create a fresh data.frame object from the source dataset and edit that copy. That way, even
if you edit the contents incorrectly, you don't mess up the original data.
newdata6[order(newdata6$Usage_GB, newdata6$Hour_Perday), ]
   Gender   Programs Car_Ownership Telco_Prefer Usage_GB Hour_Perday
41 Female   Sciences           Yes        Maxis      7.6         2.0
14 Female    Account           Yes        Maxis      8.3         2.0
37 Female    Account            No        Maxis     11.7         3.0
34   Male Statistics            No        Maxis     11.9         3.0
29 Female   Business           Yes        Maxis     13.3         3.5
17   Male   Business           Yes        Maxis     15.2         3.5
4  Female Statistics            No        Maxis     15.4         3.5
27   Male   Business            No        Maxis     16.2         4.0
31   Male    Account           Yes        Maxis     16.5         4.0
36 Female    Account           Yes        Maxis     16.5         4.0
22 Female    Account            No        Maxis     17.9         4.5
24   Male    Account            No        Maxis     21.5         5.0
39   Male Statistics           Yes        Maxis     21.5         5.0
10 Female   Sciences            No        Maxis     25.3         6.0
7    Male    Account           Yes        Maxis     28.0         6.5
19   Male   Sciences           Yes        Maxis     28.2         6.0
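The call above sorts in ascending order, which is the default of order(). A sketch of descending and mixed sorting follows. The toy data frame and its values are made up for illustration.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  Programs = c("Account", "Sciences", "Business", "Account"),
  Usage_GB = c(16.5, 7.6, 13.3, 21.5)
)

# Descending order of a numeric column: negate it, or use decreasing = TRUE
toy[order(-toy$Usage_GB), ]
toy[order(toy$Usage_GB, decreasing = TRUE), ]

# Ascending by Programs, then descending by Usage_GB within each program
sorted <- toy[order(toy$Programs, -toy$Usage_GB), ]
sorted$Usage_GB   # 21.5 16.5 13.3 7.6
```

The minus sign only works for numeric columns. For a single character column, use decreasing = TRUE instead.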