R Programming-Chapter 6
Chapter 6
DATA MANIPULATION IN R
This chapter gives an overview of the features R provides for statistical and data
analysis. R is an environment within which many classical and modern statistical
techniques have been implemented [ref4]. Some of these techniques are built into the
base installation, which ships with about 25 packages (called “standard” and
“recommended” packages) [4]. Many other packages are contributed by users and must be
installed separately [2]; they are available through the CRAN family of Internet
sites (via https://CRAN.R-project.org) and elsewhere.
6.1 DataSets in R
The R programming language includes many built-in datasets that are commonly used as
sample data to illustrate how R functions behave.
For more details about the datasets package in R, see
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html. This page lists
the datasets available in R that can be explored with statistical functions. Table 6.1
presents a few of the datasets that exist in R.
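The same list can also be obtained from within R. The following is a minimal sketch, assuming only the base "datasets" package; the variable names catalog and dataset_index are chosen here for illustration.

```r
# List the datasets shipped with the base "datasets" package.
# data() returns an object whose $results element is a character matrix
# with one row per dataset.
catalog <- data(package = "datasets")$results

# Keep only the dataset name ("Item") and its short description ("Title").
dataset_index <- catalog[, c("Item", "Title")]

head(dataset_index)   # first few dataset names and descriptions
nrow(dataset_index)   # number of datasets listed
```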
Example: In this example, we explore the dataset "airquality". To display the
dataset, we simply pass its name to print(), as shown in table 6.2.
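As a short sketch of this example, the code below prints the dataset and then uses head() and str(), which are additions here and not part of the original example, to get a more compact view.

```r
# airquality is a built-in data frame, so no import step is needed.
print(airquality)   # prints every row of the data frame

# More compact views of the same data:
head(airquality)    # first six rows only
str(airquality)     # column names, types and a preview of the values
```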
Example:
The function "read.table" reads a text file saved in the current working directory
and imports its data into R as a data frame, as shown in figure 6.2:
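A minimal sketch of such an import is given below. The file name scores.txt and its contents are hypothetical, and a temporary directory stands in for the working directory so that the example can be run anywhere.

```r
# Hypothetical input file; tempdir() stands in for the working directory.
txt_file <- file.path(tempdir(), "scores.txt")
writeLines(c("name score",
             "Ali 12.5",
             "Sara 15",
             "Omar 9.75"), txt_file)

# header = TRUE tells read.table that the first line holds the column names.
scores <- read.table(txt_file, header = TRUE)

print(scores)
str(scores)   # shows that "score" was read as a numeric column
```

For a file that really is in the working directory, the call reduces to read.table("scores.txt", header = TRUE).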
The function "read.csv" reads a CSV file saved in the current working directory
and imports its data into R as a data frame, as shown in figure 6.3:
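A comparable sketch for CSV input follows. The file name scores.csv and its contents are hypothetical, and a temporary directory again stands in for the working directory.

```r
# Hypothetical CSV file; tempdir() stands in for the working directory.
csv_file <- file.path(tempdir(), "scores.csv")
writeLines(c("name,score",
             "Ali,12.5",
             "Sara,15",
             "Omar,9.75"), csv_file)

# read.csv assumes a header line and comma-separated fields by default.
scores_csv <- read.csv(csv_file)

print(scores_csv)
sum(scores_csv$score)   # the imported column can be used directly
```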
The function "read.xlsx" reads an Excel file saved in the current working directory
and imports its data into R, as shown in figure 6.4. This function is not part of
base R; it is provided by contributed packages such as xlsx and openxlsx, which must
be installed first:
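The sketch below assumes the openxlsx package, which is only one of the packages that provide read.xlsx; the chapter may have used a different one, in which case the arguments differ. The file name cars_sample.xlsx is hypothetical, and the code is skipped when openxlsx is not installed.

```r
# Assumption: the contributed package "openxlsx" supplies read.xlsx here.
roundtrip <- NULL

if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Hypothetical Excel file; tempdir() stands in for the working directory.
  xlsx_file <- file.path(tempdir(), "cars_sample.xlsx")

  # Create a small workbook so that there is something to read back.
  openxlsx::write.xlsx(head(cars), xlsx_file)

  # Import the first sheet as a data frame.
  roundtrip <- openxlsx::read.xlsx(xlsx_file, sheet = 1)
  print(roundtrip)
} else {
  message("openxlsx is not installed; run install.packages(\"openxlsx\") first")
}
```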
The function "sink" exports output to a text file in the current working directory
by diverting what R would otherwise print on the console to that file, as shown in
figure 6.5:
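A minimal sketch of this mechanism is shown below. The file name summary_output.txt is hypothetical, and a temporary directory stands in for the working directory.

```r
# Hypothetical output file; tempdir() stands in for the working directory.
out_file <- file.path(tempdir(), "summary_output.txt")

sink(out_file)                    # start diverting console output to the file
print(summary(airquality$Temp))   # this goes to the file, not the console
sink()                            # stop diverting; output returns to the console

# Read the file back to check what was written.
captured <- readLines(out_file)
print(captured)
```

Calling sink() with no arguments is what ends the diversion; if it is forgotten, later output keeps going to the file.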
The function "write.csv" exports data to a CSV file saved in the current working
directory, as shown in figure 6.6:
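The export can be sketched as follows, using the built-in cars data frame. The file name cars_export.csv is hypothetical, and a temporary directory stands in for the working directory.

```r
# Hypothetical output file; tempdir() stands in for the working directory.
export_file <- file.path(tempdir(), "cars_export.csv")

# row.names = FALSE avoids writing the row numbers as an extra first column.
write.csv(cars, export_file, row.names = FALSE)

# Read the file back to confirm that the export kept every row and column.
reloaded <- read.csv(export_file)
identical(dim(reloaded), dim(cars))
```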
Table 6.4 summarizes the most important basic statistical functions in R, giving
the name of each function, its syntax, and its role.
No.  Name                       Syntax                            Role
2    Trimmed Mean               mean(Vector, trim=0.##)           Calculates the mean after discarding the given fraction of observations from each end of the sorted vector.
4    Standard Deviation         sd(Vector)                        Measures the spread of the data in the vector.
5    Standard Error             sd(Vector)/sqrt(length(Vector))   Estimates the standard error of the mean, i.e. the uncertainty associated with that point estimate.
6    Median Absolute Deviation  mad(Vector)                       Calculates the median of the absolute deviations from the median (scaled by the constant 1.4826 by default).
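The functions in the table can be tried on a small vector. The sketch below uses made-up values chosen so that the results are easy to check by hand; the vector x and the result names are illustrative only.

```r
# Illustrative data: mean 5, median 4.5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Trimmed mean: trim = 0.25 drops 2 of the 8 values from each end,
# leaving 4, 4, 5, 5.
trimmed_mean <- mean(x, trim = 0.25)

# Standard deviation (uses the n - 1 denominator).
std_dev <- sd(x)

# Standard error of the mean.
std_error <- sd(x) / sqrt(length(x))

# Median absolute deviation: median of |x - median(x)|, times 1.4826.
med_abs_dev <- mad(x)

c(trimmed_mean = trimmed_mean, std_dev = std_dev,
  std_error = std_error, med_abs_dev = med_abs_dev)
```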
Example: We can use the summary() function to get statistical information about
a variable in a dataset, as shown in figure 6.7. For a numeric variable, this function
returns six summary statistics: minimum, first quartile, median, mean, third quartile,
and maximum. The example shows the statistical information about the Temp variable.
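This example can be sketched as follows, assuming that Temp refers to the Temp column of the built-in airquality dataset used earlier in the chapter.

```r
# Six-number summary of the Temp column of airquality.
temp_summary <- summary(airquality$Temp)
print(temp_summary)

# The result is a named vector, so single statistics can be extracted by name.
names(temp_summary)
temp_summary[["Median"]]
```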
To analyze this data properly, we follow these steps:
1. Get to know the details of this dataset by using the functions "names()", "colnames()",
and "row.names()".
2. Display the data cars by using the function "View()", which invokes a
spreadsheet-style data viewer within RStudio.
4. Use the function "summary()" to summarize each variable of the data frame.
5. Separate the information into two sections by using "summary()[,1]" (the first column
of the summary table) and "summary()[2,]" (its second row).
6. Plot the data, labelling the x-axis "speed" and the y-axis "stop distance", and give
the plot the title "cars data".
7. Select the data of the variable "speed" and of the variable "distance" (the column
"dist" in cars), and plot a histogram of each.
8. Carry out an ANOVA for the following variables: "cars.1, cars.2, cars.3, cars.4".
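The steps above can be sketched on the built-in cars data frame as follows. This is one possible reading of the steps, not a definitive solution: the names cars_summary, speed_summary and first_quartiles are chosen here for illustration, View() is only called in an interactive session, and the ANOVA step is left out because the variables cars.1 to cars.4 are not defined in this section.

```r
# When run as a script, draw on a null device so no plot file is written.
if (!interactive()) pdf(NULL)

# Step 1: inspect the structure of the dataset.
names(cars)            # column names
colnames(cars)         # same result for a data frame
head(row.names(cars))  # first few row labels

# Step 2: open the spreadsheet-style viewer (interactive sessions only).
if (interactive()) View(cars)

# Step 4: summarize every column of the data frame.
cars_summary <- summary(cars)
print(cars_summary)

# Step 5: split the summary table by column and by row.
speed_summary   <- cars_summary[, 1]  # all six statistics for speed
first_quartiles <- cars_summary[2, ]  # first quartile of each variable

# Step 6: scatter plot with axis labels and a title.
plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "stop distance", main = "cars data")

# Step 7: one histogram per variable.
hist(cars$speed, xlab = "speed", main = "Histogram of speed")
hist(cars$dist, xlab = "stop distance", main = "Histogram of stop distance")

if (!interactive()) invisible(dev.off())
```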