02.Session-notes-1 and 2-Basic Data Analysis
SLN
You can see that the working directory is set. Now start the process.
Data Import to R
In this section, I will discuss the steps for importing Excel files into R. Files with other
extensions can also be imported into R, but for the time being I will concentrate on Excel
files and cover the other formats in later sessions. Let us first get comfortable with Excel
files. Data, whether primary or secondary, are usually entered or stored in an Excel file,
arranged in rows and columns. Rows hold the responses and columns hold the variables on
which the data are collected. One can think of a data set in an Excel file as a matrix that
describes various aspects of the given situation. A situation is understood by listing, as
exhaustively as possible, all the parameters relevant to it; only then can the data be called
complete. Such a list can be drawn up from experience, from expert advice, or from a
thorough literature review. Once the list is ready, one associates a variable with each
parameter. For example, the parameters can be average revenue, average expense, median
salary, average number of customers, average customer satisfaction, etc. The corresponding
variables are revenue, expense, salary, number of customers, customer satisfaction, etc.
Each variable is measured on an appropriate scale. Parameters can also be categorical: for
example, the proportion of customers who are unhappy with the service, for which the
corresponding variable is a binary response (happy customer or not). It all depends on
what we are measuring.
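As a small illustration of the layout described above, the following sketch builds a data frame in which each row is one response and each column is one variable. The column names and values are made up for illustration; they are not taken from the survey file used in these notes.

```r
# Hypothetical data: each row is one respondent, each column one variable.
responses <- data.frame(
  revenue     = c(120.5, 98.0, 143.2, 110.7),      # numeric variable
  n_customers = c(35L, 28L, 41L, 30L),             # count variable
  happy       = factor(c("yes", "no", "yes", "yes"),
                       levels = c("no", "yes"))    # binary response variable
)

dim(responses)   # number of rows (responses) and columns (variables)
str(responses)   # type of each variable

# Parameters are estimated from the variables:
mean(responses$revenue)         # average revenue
mean(responses$happy == "no")   # proportion of unhappy customers
```

Here the parameter "proportion of unhappy customers" is obtained from the binary variable `happy`, in the same way that average revenue is obtained from the numeric variable `revenue`.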
Let us move forward with importing Excel files into R. To do so, one has to install and load
the corresponding package. The package names are easy to remember: to read an Excel file,
the package is “readxl”. We first install the package and then load it. Installation is needed
only once: the package is downloaded and installed into the R library, after which we can
load it with library() in any session where we need it. The following is the code for the same.
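A minimal sketch of the install, load, and import steps is given here. It assumes that the file is picked through a dialog with file.choose() (suggested by the instruction to select the file from a folder) and that the imported data frame is called cust_sat, the name used in the later code. The helper name import_excel is my own and not part of readxl.

```r
# Install once; uncomment on first use.
# install.packages("readxl")

library(readxl)   # load the package in every new session

# Hypothetical helper: with no argument it opens a file-selection dialog.
import_excel <- function(path = file.choose()) {
  read_excel(path)   # reads the first sheet of the workbook
}

# Interactive use: pick the Excel file in the dialog and click Open.
# cust_sat <- import_excel()

# Non-interactive check with an example workbook that ships with readxl:
demo <- import_excel(readxl_example("datasets.xlsx"))
head(demo)
```

Passing a path explicitly, as in the last two lines, is convenient when the same script has to be run again without clicking through the dialog.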
Select the file from the folder and click Open. The data will then be imported into R.
Tabular Presentation
Assume that you wish to build univariate tables based on demographics and other store
related aspects. The following code can be used.
Univariate tables
1. Table based on Gender.
table(cust_sat$Gender)
> table(cust_sat$Gender)
a b
31 19
> prop.table(table(cust_sat$Gender))
a b
0.62 0.38
> prop.table(table(cust_sat$Gender))*100
a b
62 38
From the above tables one can note that 62% of the 50 customers considered are male
(code a) and 38% are female (code b).
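The frequency and percentage tables can also be placed side by side. The sketch below uses a stand-in vector with the same counts as the output above (31 of code a and 19 of code b), since the survey file itself is not reproduced here.

```r
# Stand-in for cust_sat$Gender: 31 "a" and 19 "b", as in the output above.
gender <- factor(rep(c("a", "b"), times = c(31, 19)))

counts <- table(gender)              # frequencies
pct    <- prop.table(counts) * 100   # percentages

# Frequencies and percentages in one table
cbind(Frequency = counts, Percent = round(pct, 1))
```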
2. Tables based on educational qualification, profession, marital status, place of stay,
years of stay, number of visits, average amount spent, and whether a gift was received.
table(cust_sat$Edu_qua)
prop.table(table(cust_sat$Edu_qua))*100
table(cust_sat$Profession)
prop.table(table(cust_sat$Profession))*100
table(cust_sat$Marital_stat)
prop.table(table(cust_sat$Marital_stat))*100
table(cust_sat$Place)
prop.table(table(cust_sat$Place))*100
table(cust_sat$Years_stay)
prop.table(table(cust_sat$Years_stay))*100
table(cust_sat$No_times_visit)
prop.table(table(cust_sat$No_times_visit))*100
table(cust_sat$`Ave_amount spent`)
prop.table(table(cust_sat$`Ave_amount spent`))*100
table(cust_sat$Received_gift)
prop.table(table(cust_sat$Received_gift))*100
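Since the same two calls are repeated for every variable, one can also wrap them in a small function and apply it to several columns at once. This is only a sketch: the data frame demo and its values are made up, and pct_table is a helper name of my own.

```r
# Hypothetical data frame with a few categorical columns (values are made up).
set.seed(1)
demo <- data.frame(
  Gender       = sample(c("a", "b"), 50, replace = TRUE),
  Marital_stat = sample(c("a", "b"), 50, replace = TRUE),
  Place        = sample(c("a", "b", "c"), 50, replace = TRUE)
)

# Helper (assumed name): percentage table for one variable.
pct_table <- function(x) round(prop.table(table(x)) * 100, 1)

# Apply it to every column of interest in one call.
vars <- c("Gender", "Marital_stat", "Place")
pct_tables <- lapply(demo[vars], pct_table)

pct_tables$Place   # percentage table for one of the variables
```

With the real data, the same pattern would be applied to cust_sat and the column names listed above.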
Bivariate tables
I will give one as an example and you can try others.
Cross tabulation of gender and marital status
table(cust_sat$Gender, cust_sat$Marital_stat)
prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100
> table(cust_sat$Gender, cust_sat$Marital_stat)
     a  b
  a 23  8
  b  5 14
> prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100
     a  b
  a 46 16
  b 10 28
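The percentages above are relative to all 50 customers. Row or column percentages are often easier to interpret, and prop.table() gives them through its margin argument. The sketch below re-enters the counts from the cross tabulation above (rows: gender, columns: marital status) so that it runs without the data file.

```r
# Counts copied from the cross tabulation above.
tab <- matrix(c(23,  8,
                 5, 14),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("a", "b"), Marital_stat = c("a", "b")))
tab <- as.table(tab)

addmargins(tab)                               # counts with row and column totals
round(prop.table(tab, margin = 1) * 100, 1)   # row percentages (within each gender)
round(prop.table(tab, margin = 2) * 100, 1)   # column percentages (within each marital status)
```

With the real data, the same margin argument can be added to prop.table(table(cust_sat$Gender, cust_sat$Marital_stat)).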
Multivariate tables
Gender*Educational Qualification*Profession
table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession)
prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100
ftable(prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100)
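With three variables, the flat table is easier to read when the dimensions carry names and one chooses which variables go down the rows. The following sketch shows one way to do this with xtabs() and the row.vars argument of ftable(). The data frame demo is simulated; only the column names mirror the ones used above.

```r
# Simulated stand-in for the three columns (values are made up).
set.seed(2)
demo <- data.frame(
  Gender     = factor(sample(c("a", "b"), 50, replace = TRUE),
                      levels = c("a", "b")),
  Edu_qua    = factor(sample(c("a", "b", "c", "d"), 50, replace = TRUE),
                      levels = c("a", "b", "c", "d")),
  Profession = factor(sample(c("a", "b", "c"), 50, replace = TRUE),
                      levels = c("a", "b", "c"))
)

# xtabs() keeps the variable names as dimension names.
tab3 <- xtabs(~ Gender + Edu_qua + Profession, data = demo)

# Gender and Edu_qua down the rows, Profession across the columns.
ft <- ftable(tab3, row.vars = c("Gender", "Edu_qua"))
ft

# Percentages of the grand total, in the same flat layout.
ftable(round(prop.table(tab3) * 100, 1), row.vars = c("Gender", "Edu_qua"))
```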
Summary Statistics
Suppose that we wish to obtain the summary statistics for the variables that measure the
satisfaction levels of the customers. For this, we can use some of the packages available in R
and also existing built-in functions. For example, one can use summary() to get the
summary of the data set/variable.
summary(cust_sat$`Overall satisfaction`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7.00    8.00    8.00    8.28    9.00   10.00
We can also get the summary statistics using the functions provided by the package
“psych”.
install.packages("psych")
library(psych)
One can use the function describe() to get the summary statistics.
describe(cust_sat$`Overall satisfaction`)
   vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 50 8.28 0.76      8     8.3 1.48   7  10     3 0.06     -0.5 0.11
The above output gives us the summary of the variable overall satisfaction. Try to interpret
the same (assignment).
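As a help for the interpretation, most of the columns reported by describe() can be reproduced with base R functions. The sketch below does this for a short made-up vector of ratings. It assumes, as I understand the psych defaults, that "trimmed" is the mean after dropping 10% of the values from each tail and that "mad" is the scaled median absolute deviation returned by mad(). Skew and kurtosis are left out.

```r
x <- c(7, 8, 8, 9, 8, 10, 7, 9, 8, 8)   # hypothetical ratings, not the survey data

n <- length(x)
stats <- c(
  n       = n,
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),   # mean after dropping 10% from each tail
  mad     = mad(x),                # median absolute deviation, scaled by 1.4826
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(n)        # standard error of the mean
)
round(stats, 2)
```

The same relation can be checked on the output above: 0.76 / sqrt(50) is about 0.11, which is the value in the se column.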
Another way of getting summary statistics is using the package “pastecs”.
install.packages("pastecs")
library(pastecs)
stat.desc(cust_sat[,c(12:15)])
Inside the square brackets, select the columns for which the summary statistics are to be
computed. For example, I have selected columns 12 to 15, which measure the satisfaction
levels. The following table gives the output.
                     Q12a         Q12b         Q12c         Q12d
nbr.val        50.0000000   50.0000000   50.0000000   50.0000000
nbr.null        0.0000000    0.0000000    0.0000000    0.0000000
nbr.na          0.0000000    0.0000000    0.0000000    0.0000000
min             2.0000000    1.0000000    1.0000000    2.0000000
max             5.0000000    5.0000000    5.0000000    5.0000000
range           3.0000000    4.0000000    4.0000000    3.0000000
sum           205.0000000  178.0000000  190.0000000  205.0000000
median          4.0000000    4.0000000    4.0000000    4.0000000
mean            4.1000000    3.5600000    3.8000000    4.1000000
SE.mean         0.1040016    0.1962090    0.1456863    0.1040016
CI.mean.0.95    0.2089990    0.3942967    0.2927675    0.2089990
var             0.5408163    1.9248980    1.0612245    0.5408163
std.dev         0.7354022    1.3874069    1.0301575    0.7354022
coef.var        0.1793664    0.3897210    0.2710941    0.1793664
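Several rows of this table are derived from the others. The following sketch recomputes std.dev, SE.mean, CI.mean.0.95, and coef.var for Q12a from the n, mean, and var reported above, which shows how the rows are related. The variable names are my own.

```r
# Values for Q12a taken from the output above.
n      <- 50
mean_q <- 4.1
var_q  <- 0.5408163

sd_q <- sqrt(var_q)                    # std.dev
se_q <- sd_q / sqrt(n)                 # SE.mean
ci_q <- qt(0.975, df = n - 1) * se_q   # CI.mean.0.95: half-width of the 95% interval
cv_q <- sd_q / mean_q                  # coef.var

round(c(std.dev = sd_q, SE.mean = se_q, CI.mean.0.95 = ci_q, coef.var = cv_q), 7)

# 95% confidence interval for the mean of Q12a
c(lower = mean_q - ci_q, upper = mean_q + ci_q)
```

Note that CI.mean.0.95 is not an interval by itself; it is the amount to be added to and subtracted from the mean.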
Now, suppose one of us asks whether we can obtain summary tables based on demographic
or other factors. For this, we can use the function tapply().
tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender), mean)
The above function gives you the mean values of the overall satisfaction for male and
female separately.
a b
8.225806 8.368421
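tapply() also accepts several factors in list(), in which case it returns the mean for each
combination of their levels, as in the following three-factor output.
       a        b
a a 8.500000 8.000000
  b 8.000000 8.800000
  c 8.285714 8.000000
  d 8.000000 8.000000
b a 8.400000 8.000000
  b 8.000000 8.500000
  c 8.250000 8.500000
  d 8.500000 8.500000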
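The grouped means can be tried out on simulated data as well. The sketch below computes the mean by one factor and by two factors with tapply(), and then the same two-factor summary in long format with aggregate(). The data frame demo and its values are made up; only the idea carries over to cust_sat.

```r
# Simulated stand-in data (values are made up).
set.seed(3)
demo <- data.frame(
  Gender       = factor(sample(c("a", "b"), 50, replace = TRUE),
                        levels = c("a", "b")),
  Marital_stat = factor(sample(c("a", "b"), 50, replace = TRUE),
                        levels = c("a", "b")),
  Overall      = sample(7:10, 50, replace = TRUE)
)

# Mean by one factor
means1 <- tapply(demo$Overall, demo$Gender, mean)
means1

# Mean for each Gender x Marital_stat combination (a 2 x 2 matrix)
means2 <- tapply(demo$Overall,
                 list(Gender = demo$Gender, Marital = demo$Marital_stat),
                 mean)
round(means2, 2)

# The same summary in long format, one row per combination
aggregate(Overall ~ Gender + Marital_stat, data = demo, FUN = mean)
```

Naming the elements of list(), as done for means2, labels the rows and columns of the result and makes it clear which factor is which.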
Please practice by creating tables and summaries for the other variables. Thank you.