02.Session-notes-1 and 2-Basic Data Analysis

Uploaded by

nairsuraj725

Sessions 1 and 2 notes

SLN

Opening a new file in R

Open RStudio. Go to File -> New File -> R Script. Enter the code in this script for further
practice.
Setting the working directory
Before starting the data import or analysis, it is always better to set the working directory.
This folder then acts as the reference directory for the entire work: files are read from it
and saved to it by default. To set a folder as the working directory, use setwd().
Copy the address of the folder first. An address copied from Windows uses backslashes:

F:\07.PGDM 2020\03.DAR\09.R-Codes

R treats the backslash as an escape character inside a string, so the address will not work
as copied. Change each backslash to a forward slash, and type the path in straight double
quotes (not the curly quotes that a word processor inserts).
setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")
To run a line of code in RStudio, place the cursor on it and press Ctrl+Enter. To check
whether the directory is set, use the following code
getwd()

The output should show the folder given to setwd(). Now start the process.
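The change of slashes can also be left to R itself. The following is a minimal sketch, assuming R 4.0 or later (needed for raw strings of the form r"(...)"); the object names win_path and r_path are illustrative only, and the folder is the one used in these notes.

```r
# Path exactly as copied from Windows, kept literal with a raw string
# (raw strings need R >= 4.0.0).
win_path <- r"(F:\07.PGDM 2020\03.DAR\09.R-Codes)"

# Replace every backslash with a forward slash.
r_path <- chartr("\\", "/", win_path)
print(r_path)

# Set the working directory only if the folder exists on this machine.
if (dir.exists(r_path)) {
  setwd(r_path)
}
getwd()
```

The dir.exists() check is there only so that the sketch does not stop with an error on a computer that does not have this folder.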

Data Import to R
In this section, I discuss the steps for importing Excel files into R. Files with other
extensions can also be imported into R, but for the time being I concentrate on Excel files
and discuss the other formats in later sessions. Let us first be comfortable with Excel files.
Data collected from primary or secondary sources is usually entered or stored in an Excel
file as rows and columns. Rows hold the individual responses, and columns hold the
variables on which the data is collected. One can visualize a data set in an Excel file as a
matrix that covers the various aspects of the given situation.
A situation can be understood only by listing, as exhaustively as possible, the parameters
that describe it. Only then is the data said to be complete. The list can be completed by
experience, by taking expert advice, or by conducting a thorough literature review. Once
the list is ready, one has to associate a variable with each parameter. For example, the
parameters can be average revenue, average expense, median salary, average number of
customers, average customer satisfaction, etc. Corresponding to these, one can associate
variables such as revenue, expense, salary, number of customers, customer satisfaction, etc.
Each variable is measured using an appropriate scale. Note that the parameters can be
categorical as well. For the proportion of customers who are unhappy with the service, the
corresponding variable is a binary response variable (happy customer or not). It all
depends on what we are measuring.
Let us move forward with the import of Excel files into R. To import Excel files, one has to
install and load the corresponding package. It is easy to remember the package names. For
example, we need to read an Excel file, and the corresponding package is "readxl".
Installing downloads the package and adds it to the R library on the computer, so it has to
be installed only once. After that, we load it with library() in every session in which we
need it. The following is the code for the same.

Install the package (needed only once on a computer)

install.packages("readxl")

Load the package (needed in every new R session)

library(readxl)
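If the same script is run many times, reinstalling the package on every run is unnecessary. One possible pattern, given here only as a sketch, is to install a package when it is missing and otherwise leave it alone. The helper name ensure_package is made up for this illustration and is not part of R.

```r
# Install a package only when it is missing, then report whether it is available.
# ensure_package() is a hypothetical helper written for this example.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers a download.
ensure_package("stats")

# For these notes one would call:
# ensure_package("readxl")
# library(readxl)
```

The function returns TRUE when the package can be loaded, so the result can be checked before calling library().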
Importing the data file
The data considered for the session is the "Customer satisfaction" data, which was
introduced to you all in term 1. I use the same data to explain the complete process of
conducting the analysis in R. To import the data file, the following code is used
cust_sat=read_excel(file.choose())
Here, cust_sat is the name given to the data set in R, and read_excel() is a function
provided by the "readxl" package. If one knows the path where the Excel file is stored,
the path can be given to the function directly, in quotes. If one does not know the path,
the file.choose() function can be used. As soon as it is run, a new window named "Select
file" opens. Sometimes the window opens behind RStudio and is not shown directly. In such
cases, use Alt+Tab to find the window. Navigate to the folder where the Excel file is
stored, select the file and click Open. The data will then be imported into R. Note that R
is case sensitive, so one has to be careful while typing code.
If the Excel file has two or three sheets, one has to specify the sheet in the read_excel()
function; otherwise the first sheet is read. For example,
read_excel(file.choose(), sheet = "name of the sheet").
Once the data import is done, the data file can be attached so that its variables can be
referred to by name.
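When the path is known, the file can be read without the file chooser, and the sheet names can be listed before one is picked. The sketch below assumes the "readxl" package is installed. Because the course file is not available here, it uses the small example workbook that ships with readxl as a stand-in; the object names path, sheet_names and first_sheet are illustrative.

```r
library(readxl)

# Example workbook bundled with readxl; it stands in for the course file.
path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook before choosing one.
sheet_names <- excel_sheets(path)
print(sheet_names)

# Read one sheet by name; the result is a data frame.
first_sheet <- read_excel(path, sheet = sheet_names[1])
head(first_sheet)
```

For the course data, path would be replaced by the location of the Excel file, written with forward slashes.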

Attaching the data file to R

attach(cust_sat)

Opening the data file in the R data editor

fix(cust_sat)
Note that the data editor has to be closed before executing any other code. Until it is
closed, no other code will be executed.

Viewing the data in R in a separate window

View(cust_sat)
After viewing the data in R, one can start understanding the data set and begin the analysis.
Data analysis should always be linked to the objectives of the study: each objective should
map to an analysis, and each analysis to an objective. The variables relevant to an objective
have to be identified in the data set, and the corresponding data analyzed to draw
appropriate inferences. I now discuss this in detail and then explain how to analyze the
data using R.

Basic Data Analysis

I now present the code used for basic data analysis of the data imported into R. The data
relates to a store whose in-charge wants to measure the satisfaction levels of the customers
who visit the store regularly. For this, he collects data from the regular customers using a
well-designed questionnaire. The first part of the questionnaire covers the demographic
details of the customers, the second part covers their visits to the store and related
aspects, and the third part contains the statements that measure their satisfaction with
the various services offered by the store. Satisfaction levels are measured on a
5-point Likert scale.
I first present the process for building tables and then move on to summary statistics.
Tables can be univariate, bivariate or multivariate.
Before proceeding to the analysis, it is better to know the variables and their structure.
This can be done using R.
To view the names of the variables in the imported data set, we can use
names(cust_sat)
The output lists the names of the variables in the data set.

To know the structure of the data set, we use

str(cust_sat)
The output gives the structure of the data set: the name of each variable, its type and the
first few values (the codes used). Gender is a character variable, educational qualification
is a categorical variable, etc. The other variables are given as numeric variables.
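To see what names() and str() report without the course file at hand, the same calls can be tried on a small made-up data frame. In the sketch below, the data frame toy and its values are invented for illustration; only the column names Gender and Edu_qua follow the notes. It also shows one common way to store coded categorical variables as factors.

```r
# Made-up miniature data set; values are invented for illustration only.
toy <- data.frame(
  Gender  = c("a", "b", "a", "a", "b"),
  Edu_qua = c("a", "c", "b", "d", "c"),
  Overall = c(8, 9, 7, 8, 10),
  stringsAsFactors = FALSE
)

names(toy)   # variable names
str(toy)     # type and first values of each variable

# Store the coded categorical variables as factors,
# so that R treats them as categories rather than free text.
toy$Gender  <- factor(toy$Gender)
toy$Edu_qua <- factor(toy$Edu_qua)
levels(toy$Edu_qua)
```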

Tabular Presentation
Assume that you wish to build univariate tables based on demographics and other store
related aspects. The following codes can be used.

Univariate tables
1. Table based on Gender.
table(cust_sat$Gender)
> table(cust_sat$Gender)

 a  b
31 19

> prop.table(table(cust_sat$Gender))

   a    b
0.62 0.38

> prop.table(table(cust_sat$Gender))*100

 a  b
62 38
From the above tables one can note that, of the 50 customers considered, 62% are male
(category a) and 38% are female (category b).
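Counts and percentages can also be placed side by side in one table. The sketch below does not use the course file: the vector gender is made up so that it reproduces the counts reported above (31 in category a, 19 in category b), and the object names are illustrative.

```r
# Made-up vector that reproduces the counts reported above (31 "a", 19 "b").
gender <- rep(c("a", "b"), times = c(31, 19))

counts  <- table(gender)
percent <- round(prop.table(counts) * 100, 1)

# Counts and percentages side by side, one row per category.
gender_summary <- cbind(Count = counts, Percent = percent)
print(gender_summary)
```

With the course data, gender would be replaced by cust_sat$Gender.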
2. Tables based on educational qualification, profession, marital status, place of stay, and
years of stay.
table(cust_sat$Edu_qua)
prop.table(table(cust_sat$Edu_qua))*100

table(cust_sat$Profession)
prop.table(table(cust_sat$Profession))*100

table(cust_sat$Marital_stat)
prop.table(table(cust_sat$Marital_stat))*100

table(cust_sat$Place)
prop.table(table(cust_sat$Place))*100

table(cust_sat$Years_stay)
prop.table(table(cust_sat$Years_stay))*100

3. Tables based on other aspects related to the customers

table(cust_sat$Years_Purchase)
prop.table(table(cust_sat$Years_Purchase))*100

table(cust_sat$No_times_visit)
prop.table(table(cust_sat$No_times_visit))*100

table(cust_sat$`Ave_amount spent`)
prop.table(table(cust_sat$`Ave_amount spent`))*100
table(cust_sat$Received_gift)
prop.table(table(cust_sat$Received_gift))*100
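Typing the same pair of lines for every variable becomes repetitive. One possible shortcut, sketched below, is to write the percentage table once as a small function and apply it to several columns with lapply(). The data frame toy and the function name percent_table are made up for this illustration.

```r
# Made-up data; only the column names follow the notes.
toy <- data.frame(
  Gender       = c("a", "b", "a", "a"),
  Marital_stat = c("a", "a", "b", "b")
)

# Percentage table for one variable, rounded to one decimal place.
percent_table <- function(x) round(prop.table(table(x)) * 100, 1)

# Apply the same function to each selected column; the result is a named list.
tables <- lapply(toy[c("Gender", "Marital_stat")], percent_table)
print(tables)
```

For the course data, toy would be replaced by cust_sat and the vector of column names extended to the variables of interest.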

Bivariate tables
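I will give one as an example and you can try others.
Cross tabulation of gender and marital status
table(cust_sat$Gender, cust_sat$Marital_stat)
prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100

> table(cust_sat$Gender, cust_sat$Marital_stat)

     a  b
  a 23  8
  b  5 14
> prop.table(table(cust_sat$Gender, cust_sat$Marital_stat))*100

     a  b
  a 46 16
  b 10 28
In both tables the rows are the gender categories and the columns are the marital status
categories. The percentages are out of all 50 customers: for example, 46% of the customers
are in gender category a and marital status category a.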
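Percentages within each row or each column are often easier to interpret than percentages of the grand total. The sketch below rebuilds the cross tabulation from the counts shown above and uses the margin argument of prop.table() to get them. The object names gm, row_pct and col_pct are illustrative.

```r
# Counts copied from the cross tabulation above
# (rows: Gender, columns: Marital_stat).
gm <- as.table(matrix(c(23, 5, 8, 14), nrow = 2,
                      dimnames = list(Gender = c("a", "b"),
                                      Marital_stat = c("a", "b"))))

# Counts with row and column totals added.
addmargins(gm)

# margin = 1: percentages within each row (each row sums to 100).
row_pct <- round(prop.table(gm, margin = 1) * 100, 1)
# margin = 2: percentages within each column (each column sums to 100).
col_pct <- round(prop.table(gm, margin = 2) * 100, 1)

print(row_pct)
print(col_pct)
```

With the course data, gm would simply be table(cust_sat$Gender, cust_sat$Marital_stat).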

Multivariate tables
Gender*Educational Qualification*Profession
table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession)
prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100
ftable(prop.table(table(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession))*100)

Summary Statistics
Suppose that we wish to obtain the summary statistics for the variables that measure the
satisfaction levels of the customers. For this, we can use built-in functions as well as
some of the packages available in R. For example, one can use summary() to get the
summary of a data set or a variable.
summary(cust_sat$`Overall satisfaction`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7.00    8.00    8.00    8.28    9.00   10.00

We can also get the summary statistics using the functions available in the package
"psych".
install.packages("psych")
library(psych)
One can use the function describe() to get the summary statistics.
describe(cust_sat$`Overall satisfaction`)

   vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 50 8.28 0.76      8     8.3 1.48   7  10     3 0.06     -0.5 0.11

The above output gives us the summary of the variable overall satisfaction. Try to interpret
the same (assignment).
Another way of getting summary statistics is using the package "pastecs".
install.packages("pastecs")
library(pastecs)
stat.desc(cust_sat[,c(12:15)])
Select the columns for which we wish to compute the summary statistics. For example, I
have selected columns 12 to 15, which measure the satisfaction levels. The following table
gives the output.
                     Q12a        Q12b        Q12c        Q12d
nbr.val        50.0000000  50.0000000  50.0000000  50.0000000
nbr.null        0.0000000   0.0000000   0.0000000   0.0000000
nbr.na          0.0000000   0.0000000   0.0000000   0.0000000
min             2.0000000   1.0000000   1.0000000   2.0000000
max             5.0000000   5.0000000   5.0000000   5.0000000
range           3.0000000   4.0000000   4.0000000   3.0000000
sum           205.0000000 178.0000000 190.0000000 205.0000000
median          4.0000000   4.0000000   4.0000000   4.0000000
mean            4.1000000   3.5600000   3.8000000   4.1000000
SE.mean         0.1040016   0.1962090   0.1456863   0.1040016
CI.mean.0.95    0.2089990   0.3942967   0.2927675   0.2089990
var             0.5408163   1.9248980   1.0612245   0.5408163
std.dev         0.7354022   1.3874069   1.0301575   0.7354022
coef.var        0.1793664   0.3897210   0.2710941   0.1793664
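Some rows of this output are derived from the others, and seeing how helps with interpretation. The sketch below recomputes three of them for Q12a from the values in the table: the standard error of the mean, the half-width of the 95% confidence interval (based on the t distribution with n - 1 degrees of freedom) and the coefficient of variation. The object names are illustrative, and the input values are copied from the table, not recomputed from the data.

```r
# Values for Q12a copied from the stat.desc() output above.
n      <- 50
mean_q <- 4.1
sd_q   <- 0.7354022

se_mean  <- sd_q / sqrt(n)                    # SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean   # CI.mean.0.95 (half-width)
coef_var <- sd_q / mean_q                     # coef.var

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 7)

# 95% confidence interval for the mean of Q12a.
c(lower = mean_q - ci_half, upper = mean_q + ci_half)
```

The three results agree with the SE.mean, CI.mean.0.95 and coef.var rows of the table, so the 95% confidence interval for the mean of Q12a is the mean plus or minus the CI.mean.0.95 value.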

Now, suppose that one of us asks whether we can have summary tables based on the
demographic or any other factors. For this, we can use the function tapply().
tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender), mean)
The above function gives you the mean values of the overall satisfaction for male and
female separately.
       a        b
8.225806 8.368421

tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender, cust_sat$Edu_qua), mean)

The above function gives you the mean values of the overall satisfaction for male and
female across the categories of educational qualification.
         a        b        c   d
a 8.250000 8.666667 8.250000 8.0
b 8.333333 8.333333 8.333333 8.5

ftable(tapply(cust_sat$`Overall satisfaction`, list(cust_sat$Gender, cust_sat$Edu_qua, cust_sat$Profession), mean))
The above function gives you the mean values of the overall satisfaction for male and
female across the categories of educational qualification and profession. In the output,
the first column is gender, the second is educational qualification, and the two columns
of means are the categories of profession.

            a        b

a a  8.500000 8.000000
  b  8.000000 8.800000
  c  8.285714 8.000000
  d  8.000000 8.000000
b a  8.400000 8.000000
  b  8.000000 8.500000
  c  8.250000 8.500000
  d  8.500000 8.500000
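The same idea works for any summary function, not only mean. The sketch below uses a small made-up data frame (toy, with invented values) to show tapply() with mean, sd and length, and the base R function aggregate() as an alternative that returns the result as a data frame.

```r
# Made-up data; values are invented for illustration only.
toy <- data.frame(
  Gender  = c("a", "a", "b", "b", "a", "b"),
  Overall = c(8, 9, 7, 9, 10, 8)
)

# Group-wise size, mean and standard deviation with tapply().
group_n    <- tapply(toy$Overall, toy$Gender, length)
group_mean <- tapply(toy$Overall, toy$Gender, mean)
group_sd   <- tapply(toy$Overall, toy$Gender, sd)
cbind(n = group_n, mean = group_mean, sd = group_sd)

# aggregate() gives the group means as a data frame.
agg <- aggregate(Overall ~ Gender, data = toy, FUN = mean)
print(agg)
```

For the course data, a column name that contains a space, such as Overall satisfaction, has to be written inside backticks, as in the tapply() calls above.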

Please practice and try to create tables and summary for other variables. Thank you.
