02.Session-notes-1 and 2-Basic Data Analysis

Uploaded by

nairsuraj725
Sessions 1 and 2 notes

SLN

Opening a new file in R


Open RStudio. Go to File -> New File -> R Script. Start entering the code for further
practice.
Setting the working directory
Before starting the data import or analysis, it is better to set the working directory.
This directory then acts as the reference folder for the entire work. To set a folder as
the working directory, copy the address of the folder and pass it to setwd(). An address
copied from Windows uses backward slashes, which R treats as escape characters inside a
string, so the following will not work as pasted.

setwd("F:\07.PGDM 2020\03.DAR\09.R-Codes")


Now change each backward slash to a forward slash.
setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")
To run the code in R, press Ctrl+Enter. To check whether the directory is set, use the
following code.
getwd()

You can see that the working directory is set. Now start the process.
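
The slash conversion and the set-then-check routine can also be done in code. The following is a minimal sketch rather than part of the original session: the names win_path, r_path and old_dir are made up for illustration, and tempdir() is used only so that the example runs on any computer.

```r
# A Windows address typed with doubled backslashes, so that R can parse the string.
win_path <- "F:\\07.PGDM 2020\\03.DAR\\09.R-Codes"

# Replace every backward slash with a forward slash.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)

# Set a working directory, confirm it, then go back to the previous one.
old_dir <- getwd()
setwd(tempdir())   # tempdir() always exists; use your own project folder instead
print(getwd())
setwd(old_dir)
```

Doubling each backward slash, as in win_path, is also accepted by setwd() directly, so either form can be used.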

Data Import to R
In this section, I discuss the steps for importing Excel files into R. Files with other
extensions can also be imported into R, but for the time being I will concentrate on Excel
files and cover other formats in later sessions. Data collected, whether primary or
secondary, is usually entered or stored in an Excel file as rows and columns: rows hold the
responses and columns hold the variables on which the data is collected. One can think of a
data set in an Excel file as a matrix that covers various aspects of the given situation. A
situation can be understood fully only by listing all the relevant parameters exhaustively;
only then is the data said to be complete. One can make the list complete through
experience, by taking expert advice, or by conducting a thorough literature review. Once
the list is ready, one associates a variable with each parameter. For example, the
parameters can be average revenue, average expense, median salary, average number of
customers, average customer satisfaction, etc. The corresponding variables are revenue,
expense, salary, number of customers, customer satisfaction, etc. Each variable is measured
using an appropriate scale. Parameters can be categorical as well: for the proportion of
customers who are unhappy with the service, the corresponding variable is a binary response
variable (happy customer or not). It all depends on what we are measuring.
Let us move forward with the import of Excel files to R. To import Excel files, one has to
download and install the corresponding package. The package names are easy to remember. For
example, we need to read an Excel file and the corresponding package is “readxl”. We need
to install the package and then call it. Installing downloads the package and stores it in
the R package library on your computer, so it has to be done only once. Calling the package
with library() has to be repeated in every new R session in which we need it. The following
is the code for the same.

Install the package


install.packages("readxl")

Call the package


library(readxl)
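
Since installation is needed only once, the install step can be skipped when the package is already present. The following is a minimal sketch of that idea; ensure_package() is a hypothetical helper written for these notes, not a function from R or from “readxl”.

```r
# Hypothetical helper: install a package only when it is not already available,
# then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
stats_ready <- ensure_package("stats")
print(stats_ready)
```

For this session, ensure_package("readxl") followed by library(readxl) would play the same role as the install and call steps.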
Importing the data file
The data considered for the session is the “Customer satisfaction” data, which was
introduced to you all in term-1. I want to use the same data and explain the complete
process of conducting the analysis in R. To import the data file, the following code is
used.
cust_sat <- read_excel(file.choose())
Here, cust_sat is the name assigned in R to the imported data. read_excel() is a function
provided by the “readxl” package. If one knows the path where the Excel file is stored, the
path can be given to the function directly. If one doesn’t know the path, the file.choose()
function can be used. As soon as it runs, a new window named “Select file” opens. Sometimes
it is not shown directly; in such cases, use Alt+Tab to find the window. Note that R is
case sensitive, so one has to be careful while typing code. Once the window is open,
navigate to the folder where the Excel file is stored and import it into R. If the same
Excel file has two or three sheets, specify the sheet name in the read_excel() function.
For example, read_excel(file.choose(), sheet = "name of the sheet").
Once the data import is done, one can attach the data file. Attaching is optional: it only
lets you refer to a variable by its name alone, and the code in these notes uses the
cust_sat$variable form, which works without attaching.

Select the file from the folder and click Open. The data will be imported to R.
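
When the path is known, the file window can be skipped. The following is a sketch under one assumption: instead of the customer satisfaction file, it uses the sample workbook datasets.xlsx that is bundled with “readxl”, so the names sample_path, sheet_names and cars_data are for illustration only.

```r
library(readxl)

# readxl_example() returns the path of a sample workbook bundled with readxl.
sample_path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to import.
sheet_names <- excel_sheets(sample_path)
print(sheet_names)

# Import one sheet by name, giving the path directly instead of file.choose().
cars_data <- read_excel(sample_path, sheet = "mtcars")
print(dim(cars_data))
```

For your own file, replace sample_path with the path of the Excel file, written with forward slashes.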

Attaching the data file to R


attach(cust_sat)

Opening the data file in the R data editor


fix(cust_sat)
Note that the data editor has to be closed before executing any other code. Until it is
closed, other code will not be executed.

Viewing the data in R as a separate window


View(cust_sat)
After viewing the data in R, one can start understanding the data set and begin the
analysis. Data analysis should always be linked to the objectives of the study; there is a
one-to-one link between the two. The variables in the data set have to be identified and
the corresponding data analyzed to draw appropriate inferences. I now discuss this in
detail and then explain how to analyze the data using R.

Basic Data Analysis


I now present the codes used for basic data analysis for the data imported to R. The data is
related to a store, where the store in-charge wants to measure the satisfaction levels of the
customers visiting the store regularly. For this, he collects data from the regular customers
using a well-designed questionnaire. The first part of the questionnaire has demographic
details of the customers, the second part has the details related to their visit to store and
other aspects, the third part has the statements that measure their satisfaction levels
towards various services being offered by the store. Satisfaction levels are measured on a
5-point Likert scale.
I first present the process for building the tables and then move on to summary statistics.
All of us know that the tables can be univariate, bivariate and multivariate.
Before proceeding to the analysis, it is better to know the variables and their structure.
This can be done using R.
To view the names of the variables in the data set imported, we can use
names(cust_sat)
The output lists the names of the variables in the data set.

To know the structure of the data set, we use


str(cust_sat)
The above gives the structure of the data set. Gender is a character variable, educational
qualification is a categorical variable, etc. Other variables are given as numeric
variables. The structure includes the variable name, its type and the codes used.
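
The same two checks can be tried on any data frame. The sketch below uses iris, a data set that ships with R, as a stand-in for cust_sat; the sapply() line is an addition that prints only the type of each column.

```r
# iris ships with R; it stands in for cust_sat here.
variable_names <- names(iris)
print(variable_names)

# str() prints one line per variable: name, type and the first few values.
str(iris)

# sapply(..., class) gives a compact view of the type of each column.
column_types <- sapply(iris, class)
print(column_types)
```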

Tabular Presentation
Assume that you wish to build univariate tables based on demographics and other store
related aspects. The following codes can be used.

Univariate tables
1. Table based on Gender.
table(cust_sat$Gender)
> table(cust_sat$Gender)

 a  b
31 19

> prop.table(table(cust_sat$Gender))

   a    b
0.62 0.38

> prop.table(table(cust_sat$Gender))*100

 a  b
62 38
From the above tables one can note that there are 62% males (code a) and 38% females
(code b) among the 50 customers considered.
2. Tables based on educational qualification, profession, marital status, place of stay, and
years of stay.
table(cust_sat$Edu_qua)
prop.table(table(cust_sat$Edu_qua))*100

table(cust_sat$Profession)
prop.table(table(cust_sat$Profession))*100

table(cust_sat$Marital_stat)
prop.table(table(cust_sat$Marital_stat))*100

table(cust_sat$Place)
prop.table(table(cust_sat$Place))*100

table(cust_sat$Years_stay)
prop.table(table(cust_sat$Years_stay))*100

3. Tables based on other aspects related to the customers


table(cust_sat$Years_Purchase)
prop.table(table(cust_sat$Years_Purchase))*100

table(cust_sat$No_times_visit)
prop.table(table(cust_sat$No_times_visit))*100

table(cust_sat$`Ave_amount spent`)
prop.table(table(cust_sat$`Ave_amount spent`))*100
table(cust_sat$Received_gift)
prop.table(table(cust_sat$Received_gift))*100
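
Because every univariate table above repeats the same two lines, the count and the percentage can be combined in one small function. This is only a sketch: freq_table() is a hypothetical helper, and the gender vector is rebuilt from the counts shown in the Gender output (31 a and 19 b) rather than read from the data file.

```r
# Hypothetical helper: counts and percentages of one variable in a single table.
freq_table <- function(x) {
  counts <- table(x, useNA = "ifany")  # missing values, if any, get their own row
  data.frame(
    level = names(counts),
    count = as.vector(counts),
    percent = round(100 * as.vector(counts) / sum(counts), 1)
  )
}

# Rebuild the Gender variable from the counts reported above.
gender <- rep(c("a", "b"), times = c(31, 19))
gender_summary <- freq_table(gender)
print(gender_summary)
```

With the imported data, the call would be freq_table(cust_sat$Gender), and similarly for the other variables.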

Bivariate tables
I will give one as an example and you can try others.
Cross tabulation of gender and marital status
table(cust_sat$Gender, cust_sat$Marital_stat)
prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100

> table(cust_sat$Gender, cust_sat$Marital_stat)

     a  b
  a 23  8
  b  5 14
> prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100

     a  b
  a 46 16
  b 10 28
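
The percentages above are out of all 50 customers. Row-wise or column-wise percentages can be obtained with the margin argument of prop.table(). The sketch below rebuilds the cross tabulation from the counts printed above instead of reading the data file; the object names are for illustration only.

```r
# Rebuild the Gender x Marital status table from the counts shown above.
gender_marital <- as.table(matrix(
  c(23, 5, 8, 14), nrow = 2,
  dimnames = list(Gender = c("a", "b"), Marital_stat = c("a", "b"))
))

# Percentages of the grand total (same as the output above).
cell_pct <- prop.table(gender_marital) * 100

# margin = 1: each row adds up to 100; margin = 2: each column adds up to 100.
row_pct <- prop.table(gender_marital, margin = 1) * 100
col_pct <- prop.table(gender_marital, margin = 2) * 100

print(cell_pct)
print(round(row_pct, 1))
print(round(col_pct, 1))
```

Row percentages answer questions such as "what share of gender a falls in each marital status", which the overall percentages cannot answer directly.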

Multivariate tables
Gender*Educational Qualification*Profession
table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession)
prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100
ftable(prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100)
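
A three-way table prints as one two-way table per level of the third variable, and ftable() flattens it into a single table. The following sketch shows this on mtcars, a data set that ships with R, used here only as a stand-in for cust_sat; the variables cyl, gear and am take the place of gender, educational qualification and profession.

```r
# Three-way table on a built-in data set.
three_way <- table(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am)

# Percentages of the grand total.
three_way_pct <- prop.table(three_way) * 100

# ftable() prints the three-way table as one flat table.
flat_pct <- ftable(round(three_way_pct, 1))
print(flat_pct)

# row.vars chooses which variable(s) go in the rows of the flat table.
flat_by_am <- ftable(three_way, row.vars = "am")
print(flat_by_am)
```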

Summary Statistics
Suppose that we wish to obtain the summary statistics for the variables that measure the
satisfaction levels of the customers. For this, we can use some of the packages available in R
and also existing built-in functions. For example, one can use summary() to get the
summary of the data set/variable.
summary(cust_sat$`Overall satisfaction`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7.00    8.00    8.00    8.28    9.00   10.00

We can also get the summary statistics using the functions available in the package
“psych”.
install.packages("psych")
library(psych)
One can use the function describe() to get the summary statistics.
describe(cust_sat$`Overall satisfaction`)

   vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 50 8.28 0.76      8     8.3 1.48   7  10     3 0.06     -0.5 0.11

The above output gives us the summary of the variable overall satisfaction. Try to interpret
the same (assignment).
Another way of getting summary statistics is using the package “pastecs”.
install.packages("pastecs")
library(pastecs)
stat.desc(cust_sat[,c(12:15)])
Inside the square brackets, select the columns for which we wish to compute the summary
statistics. For example, I have selected columns 12 to 15, which measure the satisfaction
levels. The following table gives the output.
                     Q12a         Q12b         Q12c         Q12d
nbr.val        50.0000000   50.0000000   50.0000000   50.0000000
nbr.null        0.0000000    0.0000000    0.0000000    0.0000000
nbr.na          0.0000000    0.0000000    0.0000000    0.0000000
min             2.0000000    1.0000000    1.0000000    2.0000000
max             5.0000000    5.0000000    5.0000000    5.0000000
range           3.0000000    4.0000000    4.0000000    3.0000000
sum           205.0000000  178.0000000  190.0000000  205.0000000
median          4.0000000    4.0000000    4.0000000    4.0000000
mean            4.1000000    3.5600000    3.8000000    4.1000000
SE.mean         0.1040016    0.1962090    0.1456863    0.1040016
CI.mean.0.95    0.2089990    0.3942967    0.2927675    0.2089990
var             0.5408163    1.9248980    1.0612245    0.5408163
std.dev         0.7354022    1.3874069    1.0301575    0.7354022
coef.var        0.1793664    0.3897210    0.2710941    0.1793664
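
Some rows of this output are derived from the others. The sketch below recomputes three of them for Q12a in base R, starting from the n, mean and std.dev values printed above; it assumes the usual formulas (standard error = sd / sqrt(n), confidence interval half-width from the t distribution, coefficient of variation = sd / mean), which agree with the printed figures.

```r
# Values for Q12a taken from the output above.
n <- 50
mean_q12a <- 4.1
sd_q12a <- 0.7354022

# SE.mean: standard error of the mean.
se_mean <- sd_q12a / sqrt(n)

# CI.mean.0.95: half-width of the 95% confidence interval for the mean.
ci_half_width <- qt(0.975, df = n - 1) * se_mean

# coef.var: standard deviation relative to the mean.
coef_var <- sd_q12a / mean_q12a

print(round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half_width, coef.var = coef_var), 7))
```

So the 95% confidence interval for the mean of Q12a is the mean plus or minus CI.mean.0.95, that is, roughly 4.1 plus or minus 0.21.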

Now suppose one of us asks whether we can have summary tables based on demographic or
other factors. For this, we can use the function tapply().
tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender), mean)
The above function gives you the mean values of the overall satisfaction for male and
female separately.
       a        b
8.225806 8.368421

tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender, cust_sat$Edu_qua), mean)


The above function gives you the mean values of the overall satisfaction for male and
female across the categories of educational qualification.
         a        b        c   d
a 8.250000 8.666667 8.250000 8.0
b 8.333333 8.333333 8.333333 8.5

ftable(tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender, cust_sat$Edu_qua,
    cust_sat$Profession), mean))
The above function gives you the mean values of the overall satisfaction for male and
female across the categories of educational qualification and profession.

            a        b

a a  8.500000 8.000000
  b  8.000000 8.800000
  c  8.285714 8.000000
  d  8.000000 8.000000
b a  8.400000 8.000000
  b  8.000000 8.500000
  c  8.250000 8.500000
  d  8.500000 8.500000
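
The tapply() pattern can be practiced on any data frame. The sketch below uses mtcars, a data set that ships with R, as a stand-in for cust_sat: mpg takes the place of the satisfaction score, and cyl and gear take the place of the grouping variables. The aggregate() line is an addition that returns the same one-factor summary as a data frame.

```r
# Mean of a numeric variable for each level of one factor.
mean_by_cyl <- tapply(mtcars$mpg, list(mtcars$cyl), mean)
print(mean_by_cyl)

# Mean for each combination of two factors.
# A combination with no observations shows up as NA.
mean_by_cyl_gear <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), mean)
print(round(mean_by_cyl_gear, 2))

# The same one-factor summary as a data frame, using the formula interface.
mean_df <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mean_df)
```

Replacing mean with median, sd or length in the calls above gives the corresponding group-wise summary.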

Please practice and try to create tables and summaries for the other variables. Thank you.
