
PSR_ASSIGNMENT_01
TIPU THAKUR; PG_ID - 12120032
26/02/2022
1. If a random variable has the normal distribution with µ = 77.0 and σ = 3.4, find the
probability that the random variable is
(a) less than 72.6
(b) greater than 88.5
(c) between 81 and 84
(d) between 56 and 92

### µ= 77.0 and σ = 3.4

### for any point x, Z score will be given by

### Z = (x-µ) / σ
### Storing all x values in a collection and then creating a data frame for the random variable

### options(scipen) controls the preference for fixed over scientific notation

options(scipen = 0)

library(scales)

µ <- 77.0

σ <- 3.4

Random_Var <- data.frame ( Observation =c(72.6,88.5,81,84,56,92) )

### Z Scores for all are given by Z = (x-µ) / σ

Random_Var$Z_Score = round( (Random_Var$Observation-µ) / σ , 2)

### Cumulative probability to the left (as read from a Z table) can be computed with pnorm

Random_Var$CDF_Observation = round( pnorm(Random_Var$Observation,µ,σ) , 10)

### Cumulative Probability can also be taken from Z Scores with mean = 0; sd = 1

Random_Var$CDF_Z_Score = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Left of observation

Random_Var$Prob_Less_Than_x = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Right of observation

Random_Var$Prob_Greater_Than_x = 1- round( pnorm(Random_Var$Z_Score,0,1) , 10)

paste("Table for All Z Values and CDF are given below")


## [1] "Table for All Z Values and CDF are given below"

Random_Var

## Observation Z_Score CDF_Observation CDF_Z_Score Prob_Less_Than_x

## 1 72.6 -1.29 0.0978123934 0.0985253290 0.0985253290

## 2 88.5 3.38 0.9996406613 0.9996375709 0.9996375709

## 3 81.0 1.18 0.8802965606 0.8809998925 0.8809998925

## 4 84.0 2.06 0.9802444266 0.9803007296 0.9803007296

## 5 56.0 -6.18 0.0000000003 0.0000000003 0.0000000003

## 6 92.0 4.41 0.9999948734 0.9999948315 0.9999948315

## Prob_Greater_Than_x

## 1 0.9014746710

## 2 0.0003624291

## 3 0.1190001075

## 4 0.0196992704

## 5 0.9999999997

## 6 0.0000051685

paste("a. Probability for less than a= 72.6 is F(a) = ", percent ( Random_Var[which(Random_Va
r$Observation == 72.6),5],.001 ))

## [1] "a. Probability for less than a= 72.6 is F(a) = 9.853%"

paste("b. Probability for greater than a= 88.5 is 1- F(a) = ", percent ( 1- Random_Var[which
(Random_Var$Observation == 88.5),5],.001 ))

## [1] "b. Probability for greater than a= 88.5 is 1- F(a) = 0.036%"

paste("c. Probability for between a:81 and b:84 is F(b) - F(a) = ", percent ( Random_Var[whi
ch(Random_Var$Observation == 84),5] - Random_Var[which(Random_Var$Observation == 81),5],.001
))

## [1] "c. Probability for between a:81 and b:84 is F(b) - F(a) = 9.930%"

paste("d. Probability for between a:56 and b:92 is F(b) - F(a) = ", percent ( Random_Var[whi
ch(Random_Var$Observation == 92),5] - Random_Var[which(Random_Var$Observation == 56),5],.001
))

## [1] "d. Probability for between a:56 and b:92 is F(b) - F(a) = 99.999%"

2. In a car race, the finishing times are normally distributed with mean 145 minutes
and standard deviation of 12 minutes.
(a) Find the percentage of car racers whose finish time is between 130 and 160 minutes.
(b) Find the percentage of car racers whose finish time is less than 130 minutes.


### µ= 145 and σ = 12

### for any point x, Z score will be given by

### Z = (x-µ) / σ
### Storing all x values in a collection and then creating a data frame for the random variable

### options(scipen) controls the preference for fixed over scientific notation

options(scipen = 0)

library(scales)

µ <- 145

σ <- 12

Random_Var <- data.frame ( Observation =c(130,160) )

### Z Scores for all are given by Z = (x-µ) / σ

Random_Var$Z_Score = round( (Random_Var$Observation-µ) / σ , 2)

### Cumulative probability to the left (as read from a Z table) can be computed with pnorm

Random_Var$CDF_Observation = round( pnorm(Random_Var$Observation,µ,σ) , 10)

### Cumulative Probability can also be taken from Z Scores with mean = 0; sd = 1

Random_Var$CDF_Z_Score = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Left of observation

Random_Var$Prob_Less_Than_x = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Right of observation

Random_Var$Prob_Greater_Than_x = 1- round( pnorm(Random_Var$Z_Score,0,1) , 10)

paste("Table for All Z Values and CDF are given below")

## [1] "Table for All Z Values and CDF are given below"

Random_Var

## Observation Z_Score CDF_Observation CDF_Z_Score Prob_Less_Than_x

## 1 130 -1.25 0.1056498 0.1056498 0.1056498

## 2 160 1.25 0.8943502 0.8943502 0.8943502

## Prob_Greater_Than_x

## 1 0.8943502

## 2 0.1056498

paste("a. Percentage of Cars with Finish Time a:130 and b:160 is F(b) - F(a) = ", percent (
Random_Var[which(Random_Var$Observation == 160),5] - Random_Var[which(Random_Var$Observation
== 130),5],.001 ))


## [1] "a. Percentage of Cars with Finish Time a:130 and b:160 is F(b) - F(a) = 78.870%"

paste("b. Percentage of Cars with Finish Timeless than a:130 is F(a) = ", percent ( Random_V
ar[which(Random_Var$Observation == 130),5],.001 ))

## [1] "b. Percentage of Cars with Finish Timeless than a:130 is F(a) = 10.565%"

3. A test-taker has recently taken an aptitude test with 15 questions. All of the
questions are True/ False type in nature.
(a) What is the probability that the student got the first five questions correct? (b) What is
the probability that the student got exactly five questions correct?
Answer

a. Probability of getting the first five correct. All outcomes are independent of each other, and for
independent events P(A ∩ B) = P(A) * P(B).

Therefore P(C1 ∩ C2 ∩ C3 ∩ C4 ∩ C5 ∩ any outcome for the last 10) =

P(C1) * P(C2) * P(C3) * P(C4) * P(C5) * P(any outcome for the last 10) =

(1/2 * 1/2 * 1/2 * 1/2 * 1/2) * 1 = 1/32 = 0.03125

b. Probability of getting exactly five questions right.

Each trial is independent with success probability 1/2, and the number of correct answers out of 15 such
trials follows a binomial distribution.

## P(each trial) = 1/2

## N_Number of Trials = 15

## P (Exactly 5 Trials Right) = 15C5 * (p^(5)) * ((1-p)^(15-5))

p <- 0.5

n <- 15

r <- 5

P_5_Trial <- choose(n,r) * (p^r) * ((1-p)^(n-r))

paste("Probaility That 5 Trials are correct is ",P_5_Trial," i.e. ",round(P_5_Trial*100,2),


"%")

## [1] "Probaility That 5 Trials are correct is 0.091644287109375 i.e. 9.16 %"

4. 68% of the marks in exam are between 35 and 42. Assuming data is normally
distributed, what are the mean and standard deviation?
Assumption: both scores 35 and 42 are symmetric with respect to the mean.

Thus Mean = (35 + 42) / 2 = 38.5. By the empirical rule, about 68% of a normal distribution lies within ±1σ of
the mean, so the interval from 35 to 42 spans 2σ; thus σ = (42-35)/2 = 3.5.

Mean is 38.5 and S.D. is 3.5
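A one-line check of these values (a sketch; the empirical rule gives roughly 68%, not exactly):

pnorm(42, mean = 38.5, sd = 3.5) - pnorm(35, mean = 38.5, sd = 3.5)   # ~0.6827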


5. A professor asked students to take two exams. 30% of the class cleared both
exams and 55% of the class cleared the first exam. What percentage of class who
cleared first exam also cleared the second exam?
Answer

Unconditional probability of clearing Exam E1 = P(E1) = 0.55

Probability of clearing both Exam E1 and Exam E2 = P(E1 ∩ E2) = 0.30

By the definition of conditional probability,

P(E2 | E1) = P(E1 ∩ E2) / P(E1) = 0.30 / 0.55 ≈ 0.545

Thus, about 54.5% of the class that cleared the first exam also cleared the second exam.

6. In India, 82% of all urban homes have a TV. 62% have a TV and DVD player. What
is probability that a home has a DVD player given that the home has a TV.
Answer

Unconditional probability of a home with a TV = P(TV) = 0.82

Probability of a home with both a TV and a DVD player = P(TV ∩ DVD) = 0.62

By the definition of conditional probability,

P(DVD | TV) = P(TV ∩ DVD) / P(TV) = 0.62 / 0.82 ≈ 0.7561

Thus, a home with a TV has about a 75.61% chance of also having a DVD player.
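Questions 5 and 6 follow the same one-line conditional-probability computation; a minimal sketch:

p_e1 <- 0.55; p_e1_and_e2 <- 0.30

p_e1_and_e2 / p_e1    # Q5: P(E2 | E1) ~ 0.545

p_tv <- 0.82; p_tv_and_dvd <- 0.62

p_tv_and_dvd / p_tv   # Q6: P(DVD | TV) ~ 0.756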

7. You toss a coin three times. Assume that the coin is fair. What is the probability of
getting?
1) All three heads
2) Exactly one head
3) Given that you have seen exactly one head, what is the probability of getting at
least two heads?

## Checking the sample space for the coin tosses - the total number of elements in the sample space is 2^3 = 8

## The same can be done with the prob library's tosscoin function, which gives all possible outcomes in a data frame

library(prob)

## Loading required package: combinat

##

## Attaching package: 'combinat'

## The following object is masked from 'package:utils':

##

## combn


## Loading required package: fAsianOptions

## Loading required package: timeDate

## Loading required package: timeSeries

## Loading required package: fBasics

## Loading required package: fOptions

##

## Attaching package: 'prob'

## The following objects are masked from 'package:base':

##

## intersect, setdiff, union


samplespace <- tosscoin(3)

nsamp <- nrow (tosscoin(3))

## Identifying all outcomes to check for the required events:

## n_all_head counts outcomes with all heads

## n_exact_one_head counts outcomes with exactly one head

## n_atleast_two_head and n_atleast_one_head count outcomes with at least two / at least one head

n_all_head <- 0
n_exact_one_head <- 0
n_atleast_two_head <- 0
n_atleast_one_head <- 0

for (i in 1:nrow(samplespace)) {
  if (samplespace[i,1] == 'H' && samplespace[i,2] == 'H' && samplespace[i,3] == 'H')
    n_all_head <- n_all_head + 1
  # k holds one row of the sample space; we count the occurrences of H in it
  k <- as.vector(samplespace[i,])
  if (length(k[k == 'H']) == 1)
    n_exact_one_head <- n_exact_one_head + 1
  if (length(k[k == 'H']) >= 2)
    n_atleast_two_head <- n_atleast_two_head + 1
  if (length(k[k == 'H']) >= 1)
    n_atleast_one_head <- n_atleast_one_head + 1
}

paste("a. Probibility of All Three Heads is ",n_all_head/nsamp)

## [1] "a. Probibility of All Three Heads is 0.125"

paste("b. Probibility of Exactly One Head is ",n_exact_one_head/nsamp)

## [1] "b. Probibility of Exactly One Head is 0.375"

## Conditional probability: unconditional P(>= 2 heads) / unconditional P(>= 1 head)

paste("c. Probability of getting at least two heads given at least one head = ",n_atleast_two_head,"/",n_atleast_one_head," = ",n_atleast_two_head/n_atleast_one_head)

## [1] "c. Probability of getting at least two heads given at least one head = 4 / 7 = 0.571428571428571"
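The prob package pulls in several heavy dependencies; the same counts can be reproduced in base R alone. A minimal sketch, assuming all 2^3 outcomes are equally likely:

space <- expand.grid(rep(list(c("H", "T")), 3))   # all 8 outcomes of 3 tosses

heads <- rowSums(space == "H")                    # number of heads in each outcome

mean(heads == 3)                      # (a) all three heads: 0.125

mean(heads == 1)                      # (b) exactly one head: 0.375

sum(heads >= 2) / sum(heads >= 1)     # (c) P(>= 2 heads | >= 1 head) = 4/7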


8. Let x be the random variable that represents the speed of cars. x has μ = 90 and σ
= 10. We have to find the probability that x is higher than 100 or P(x > 100)

### µ= 90 and σ = 10

### for any point x, Z score will be given by

### Z = (x-µ) / σ
### Storing all x values in a collection and then creating a data frame for the random variable

### options(scipen) controls the preference for fixed over scientific notation

options(scipen = 0)

library(scales)

µ <- 90

σ <- 10

Random_Var <- data.frame ( Observation =c(100) )

### Z Scores for all are given by Z = (x-µ) / σ

Random_Var$Z_Score = round( (Random_Var$Observation-µ) / σ , 2)

### Cumulative Probability can also be taken from Z Scores with mean = 0; sd = 1

Random_Var$CDF_Z_Score = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Left of observation

Random_Var$Prob_Less_Than_x = round( pnorm(Random_Var$Z_Score,0,1) , 10)

### Cumulative Probability to Right of observation

Random_Var$Prob_Greater_Than_x = 1- round( pnorm(Random_Var$Z_Score,0,1) , 10)

paste("Table for All Z Values and CDF are given below")

## [1] "Table for All Z Values and CDF are given below"

Random_Var

## Observation Z_Score CDF_Z_Score Prob_Less_Than_x Prob_Greater_Than_x

## 1 100 1 0.8413447 0.8413447 0.1586553

paste("a. Probability for Value of Random Variable x higher than a:100) = ", percent ( 1- Ran
dom_Var[which(Random_Var$Observation == 100),4],.001 ))

## [1] "a. Probability for Value of Random Variable x higher than a:100) = 15.866%"

9. The blue M&M was introduced in 1995. Before then, the color mix in a bag of plain
M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan).

Afterward it was (24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13%
Brown). A friend of mine has two bags of M&Ms, and he tells me that one is from 1994
and one from 1996. He won’t tell me which is which, but he gives me one M&M from
each bag. One is yellow and one is green. What is the probability that the yellow M&M
came from the 1994 bag?
Answer

P(Y_94) = 0.20 P(Y_96) = 0.14 P(G_94) = 0.10 P(G_96) = 0.20

CASE A: Y_94 & G_96

CASE B: Y_96 & G_94

Probability of A & B is equal; Thus P(A) = P(B) = 0.5

P(Y) = P(A) * P(Y_94 ∩ G_96) + P(B) P(Y_96 ∩ G_94) P(Y) = (0.5 0.20 * 0.20) + (0.5 * 0.14 * 0.10) = 0.027

P(1994|Y) = P(CASE A|Y) = P(Y | CASE A) * P(A)) / P(Y) P(1994|Y) = (0.20 * 0.20 * 0.50) / 0.027 P(1994|Y) =
0.02 / 0.027 = 0.7407407

Thus 74% Chances Yellow came from 1994 Bag.
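The same posterior can be computed in a short R sketch (the variable names here are illustrative):

p_y94 <- 0.20; p_g96 <- 0.20   # yellow share in 1994 mix, green share in 1996 mix

p_y96 <- 0.14; p_g94 <- 0.10   # yellow share in 1996 mix, green share in 1994 mix

lik_A <- p_y94 * p_g96   # CASE A: yellow from 1994 bag, green from 1996 bag

lik_B <- p_y96 * p_g94   # CASE B: yellow from 1996 bag, green from 1994 bag

0.5 * lik_A / (0.5 * lik_A + 0.5 * lik_B)   # posterior P(CASE A | one yellow, one green) ~ 0.7407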

10. Find the daily stock price of Wal-Mart for the last three months. (A good source for
the data is http://moneycentral.msn.com or Yahoo Finance or Google Finance (there are many
more such sources). You can ask for the three-month chart and export the data to a
spreadsheet.)
(a) Calculate the mean and the standard deviation of the stock prices.
(b) Get the corresponding data for Target and calculate the mean and the standard
deviation.
(c) The coefficient of variation (CV) is defined as the ratio of the standard deviation
over the mean. Calculate the CV of Wal-Mart and Target stock prices.
(d) If the CV of the daily stock prices is taken as an indicator of risk of the stock, how
do Wal-Mart and Target stocks compare in terms of risk? (There are better measures
of risk, but we will use CV in this exercise.)
(e) Get the corresponding data of the Dow Jones Industrial Average (DJIA) and
compute its CV. How do Wal-Mart and Target stocks compare with the DJIA in
terms of risk?
(f) Suppose you bought 100 shares of Wal-Mart stock three months ago and held it.
What are the mean and the standard deviation of the daily market price of your
holding for the three months?
Note: Question referenced from Aczel A., Sounderpandian J., Complete Business
Statistics (7ed.)


## Load Stock Prices for Walmart, TARGET, DJIA in three Data Frame

## setwd("C:/Users/tiput/")

stock_WMT <- read.csv("WMT.csv", header=T)

stock_TGT <- read.csv("TGT.csv", header=T)

stock_DJIA <- read.csv("DJIA.csv", header=T)

stock_WMT$STOCK_NAME = 'WALMART'

stock_TGT$STOCK_NAME = 'TARGET'

stock_DJIA$STOCK_NAME = 'DJIA'

## Merging the data frames to generate a single data frame

library(dplyr)

##

## Attaching package: 'dplyr'

## The following objects are masked from 'package:prob':

##

## intersect, setdiff, union

## The following objects are masked from 'package:timeSeries':

##

## filter, lag

## The following objects are masked from 'package:stats':

##

## filter, lag

## The following objects are masked from 'package:base':

##

## intersect, setdiff, setequal, union

stock_review = bind_rows(stock_WMT,stock_TGT,stock_DJIA)

fact_stock_review <- stock_review %>%

select (everything()) %>%

mutate (VALUE_100_STOCKS = Close*100 ) %>%

group_by(STOCK_NAME) %>%

summarise ( MEAN = mean(Close), STDDEV= sd(Close), MEAN_MKT_PRICE_100_UNITS = mean(VALUE_100_STOCKS), STDEV_MKT_PRICE_100_UNITS = sd(VALUE_100_STOCKS) ) %>%

mutate( CV_Coefficient = round(STDDEV / MEAN,4)) %>%

arrange( desc(CV_Coefficient) )

fact_stock_review


## # A tibble: 3 x 6

## STOCK_NAME MEAN STDDEV MEAN_MKT_PRICE_100~ STDEV_MKT_PRICE_1~ CV_Coefficient

## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>

## 1 TARGET 223. 13.7 22306. 1374. 0.0616

## 2 DJIA 35273. 864. 3527269. 86394. 0.0245

## 3 WALMART 140. 3.40 14008. 340. 0.0243

A higher CV coefficient implies higher volatility relative to the central value of the stock. Target has the
highest CV coefficient and is therefore the most volatile of the three stocks/indexes; Wal-Mart and the DJIA
are roughly comparable.

Mean and standard deviation both depend on scale, so for the market price of a 100-share holding each is
simply multiplied by a factor of 100.
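This scaling is also why CV (rather than the standard deviation) is the right risk comparison across differently priced series: multiplying a price series by a constant leaves the CV unchanged. A tiny sketch with made-up prices:

x <- c(140, 142, 139, 141)    # illustrative daily prices

sd(x) / mean(x)               # CV of the price series

sd(100 * x) / mean(100 * x)   # identical CV for a 100-share holding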

11. For this problem, consider the dataset Prob_Assignment_Dataset.xlsx attached.


As the vigilant monitors of the Zappos.Com website, we are obsessed with who is
coming to our website and what they do during their visit. This challenge requires you
to look at data similar to that which Analytics teams at Zappos would use to assess
the overall performance of the business.
The data set is a fictional representation of actual data sets that we work with.
Perform univariate descriptive analysis on the variables, and submit a short
description explaining any insights or trends you discovered. Include graphs, data
tables, and other helpful visuals to communicate your discoveries. You do not have to
work on all of the variables (select any 3 - 4 that are of most interest to you).

setwd("C:/Users/tiput/Documents")

options(scipen = 0)

# scipen controls the preference for fixed over scientific formatting of numbers

library(readxl)

dataset_site_review = read_excel("Prob_Dataset.xlsx")

library(dplyr)

library(ggplot2)

# The complete six-number summary is given by the summary function

summary(dataset_site_review)


## day site new_customer

## Min. :2013-01-01 00:00:00 Length:21061 Min. :0.000

## 1st Qu.:2013-06-10 00:00:00 Class :character 1st Qu.:0.000

## Median :2013-08-21 00:00:00 Mode :character Median :0.000

## Mean :2013-07-30 13:23:22 Mean :0.448

## 3rd Qu.:2013-10-27 00:00:00 3rd Qu.:1.000

## Max. :2013-12-31 00:00:00 Max. :1.000

## NA's :8259

## platform visits distinct_sessions orders

## Length:21061 Min. : 0 Min. : 0 Min. : 0.00

## Class :character 1st Qu.: 3 1st Qu.: 2 1st Qu.: 0.00

## Mode :character Median : 24 Median : 19 Median : 0.00

## Mean : 1935 Mean : 1515 Mean : 62.38

## 3rd Qu.: 360 3rd Qu.: 274 3rd Qu.: 7.00

## Max. :136057 Max. :107104 Max. :4916.00

##

## gross_sales bounces add_to_cart product_page_views

## Min. : 1 Min. : 0.0 Min. : 0.0 Min. : 0

## 1st Qu.: 79 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 3

## Median : 851 Median : 5.0 Median : 4.0 Median : 53

## Mean : 16473 Mean : 743.3 Mean : 166.3 Mean : 4358

## 3rd Qu.: 3145 3rd Qu.: 97.0 3rd Qu.: 43.0 3rd Qu.: 708

## Max. :707642 Max. :54512.0 Max. :7924.0 Max. :187601

## NA's :9576

## search_page_views

## Min. : 0

## 1st Qu.: 4

## Median : 82

## Mean : 8584

## 3rd Qu.: 1229

## Max. :506629

##


# Identifying the summary for all columns: we unpivot all numeric columns into a single data
# frame, stacked one upon another, with the column name as the category

summary_frame_numeric = bind_rows (
  data.frame (CATEGORY = 'new_customer', VALUE = dataset_site_review$new_customer),
  data.frame (CATEGORY = 'visits', VALUE = dataset_site_review$visits),
  data.frame (CATEGORY = 'distinct_sessions', VALUE = dataset_site_review$distinct_sessions),
  data.frame (CATEGORY = 'orders', VALUE = dataset_site_review$orders),
  data.frame (CATEGORY = 'gross_sales', VALUE = dataset_site_review$gross_sales),
  data.frame (CATEGORY = 'bounces', VALUE = dataset_site_review$bounces),
  data.frame (CATEGORY = 'add_to_cart', VALUE = dataset_site_review$add_to_cart),
  data.frame (CATEGORY = 'product_page_views', VALUE = dataset_site_review$product_page_views),
  data.frame (CATEGORY = 'search_page_views', VALUE = dataset_site_review$search_page_views)
)

# Structure of Dataframe is shown here as

head(summary_frame_numeric)

## CATEGORY VALUE

## 1 new_customer NA

## 2 new_customer NA

## 3 new_customer NA

## 4 new_customer NA

## 5 new_customer NA

## 6 new_customer NA

tail(summary_frame_numeric)

## CATEGORY VALUE

## 189544 search_page_views 0

## 189545 search_page_views 0

## 189546 search_page_views 0

## 189547 search_page_views 2

## 189548 search_page_views 0

## 189549 search_page_views 1


# Identifying Summary of All Columns in dplyr package

summary_review <- summary_frame_numeric %>%

select (everything()) %>%

group_by(CATEGORY) %>%

summarise ( MIN = min(VALUE,na.rm=T),

MAX= max(VALUE,na.rm=T),

VARIANCE = round( var(VALUE,na.rm=T), 5),

STDEV = round( sd(VALUE,na.rm=T) ,6)

) %>%

arrange( CATEGORY )

summary_review

## # A tibble: 9 x 5

## CATEGORY MIN MAX VARIANCE STDEV

## <chr> <dbl> <dbl> <dbl> <dbl>

## 1 add_to_cart 0 7924 2.55e+5 505.

## 2 bounces 0 54512 9.95e+6 3155.

## 3 distinct_sessions 0 107104 3.51e+7 5926.

## 4 gross_sales 1 707642 2.61e+9 51111.

## 5 new_customer 0 1 2.47e-1 0.497

## 6 orders 0 4916 6.77e+4 260.

## 7 product_page_views 0 187601 2.05e+8 14327.

## 8 search_page_views 0 506629 9.68e+8 31120.

## 9 visits 0 136057 5.55e+7 7449.


# CHECKING ITEM WISE DATA VALIDATION

# Checking for Data for new Customers

check_new_customer <- dataset_site_review %>%

select (everything()) %>%

group_by(new_customer) %>%

summarise ( COUNT = n() ) %>%

arrange( new_customer )

# Plotting data for availability of new Customers

ggplot(check_new_customer, aes(x = "", y = COUNT , fill = new_customer)) +
  geom_col() +
  geom_text(aes(label = COUNT),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(
    title = "Availability of Data for New Customers",
    caption = "Availability of Data for New Customers"
  ) +
  theme(
    plot.title = element_text(size = 15L, face = "bold", hjust = 0.5),
    axis.title.y = element_text(size = 15L, face = "bold"),
    axis.title.x = element_text(size = 10L, face = "bold")
  )


# Checking platform-wise data

check_platform <- dataset_site_review %>%

select (everything()) %>%

group_by(platform) %>%

summarise ( HITS_COUNT = n() ) %>%

mutate(perc_platform = round((HITS_COUNT / sum(HITS_COUNT)) * 100.00,2) ) %>%

arrange( platform )

# Plotting platform wise data

ggplot(check_platform) +
  aes(x = platform, y = perc_platform) +
  geom_bar(stat="identity", fill="steelblue") +
  geom_text(aes(label=perc_platform), vjust=0.0, color="black", size=3.5) +
  theme_linedraw() +
  labs(
    x = "Platform",
    y = "Percentage",
    title = "PLATFORM WISE CUSTOMER PENETRATION"
  ) +
  coord_flip() +
  theme(
    plot.title = element_text(size = 20L, face = "bold", hjust = 0.5),
    axis.title.y = element_text(size = 15L, face = "bold"),
    axis.title.x = element_text(size = 10L, face = "bold")
  )



# Checking for Data for Visits & Quartiles for Visits

summary(dataset_site_review$visits)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 0 3 24 1935 360 136057


check_visits <- dataset_site_review %>%
  select (everything()) %>%
  mutate( VISITS_REC_BUCKETS = case_when( between(visits,0,3) ~ "1_QUAD", between(visits,4,24) ~ "2_QUAD", between(visits,25,360) ~ "3_QUAD", between(visits,361,999999) ~ "4_QUAD" ) ) %>%
  group_by(VISITS_REC_BUCKETS) %>%
  summarise ( VISITS_COUNT = sum(visits), N_RECORDS = n() ) %>%
  arrange( VISITS_REC_BUCKETS )

# Checking Data for Visits

ggplot(check_visits, aes(x = "", y = N_RECORDS , fill = VISITS_REC_BUCKETS)) +
  geom_col() +
  geom_text(aes(label = VISITS_COUNT),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(
    title = "QUARTILES WISE VISITS COUNTS AND SUMS"
  ) +
  theme(
    plot.title = element_text(size = 20L, face = "bold", hjust = 0.5),
    axis.title.y = element_text(size = 15L, face = "bold"),
    axis.title.x = element_text(size = 10L, face = "bold")
  )


# Data Frequency can further be categorized as

check_visits_1 <- dataset_site_review %>%

select (everything()) %>%

group_by(orders) %>%

summarise ( N_RECORDS = n() )

dataset_site_review

## # A tibble: 21,061 x 12

## day site new_customer platform visits distinct_sessions


## <dttm> <chr> <dbl> <chr> <dbl> <dbl>
## 1 2013-12-02 00:00:00 Acme NA Windows 136057 107104
## 2 2013-11-21 00:00:00 Acme NA Windows 99608 84429
## 3 2013-12-09 00:00:00 Acme NA Windows 99561 78790
## 4 2013-12-10 00:00:00 Acme NA Windows 95470 75574
## 5 2013-11-29 00:00:00 Acme NA Windows 95193 75949
## 6 2013-12-16 00:00:00 Acme NA Windows 93583 75161
## 7 2013-12-17 00:00:00 Acme NA Windows 88473 70930
## 8 2013-12-04 00:00:00 Acme NA Windows 87825 70995
## 9 2013-12-11 00:00:00 Acme NA Windows 87542 70372
## 10 2013-12-05 00:00:00 Acme NA Windows 85251 69100
## # ... with 21,051 more rows, and 6 more variables: orders <dbl>,

## # gross_sales <dbl>, bounces <dbl>, add_to_cart <dbl>,

## # product_page_views <dbl>, search_page_views <dbl>

# Generating order-bucket-wise data

barplot(check_visits_1$orders)


After checking the data for new customers, nearly 8,300 records do not have this field filled, so the data-capture
mechanism needs to be optimised. The most hits come through iOS (16.31%), followed by Android and Windows.
Most store visits fall in the 4th quartile bucket, which carries very high volumes. The maximum number of
orders in a single day is 4,916 (per the summary above).

12. Consider the same dataset as earlier. Now perform bivariate data analysis as
discussed in the last class to find out relationships between different variables. Write a
short description on what you find in the analyses along with any tables, graphs. You
do not have to analyze all variables, any 3 - 4 that interest you most should be the
focus here.

setwd("C:/Users/tiput/Documents")

library(readxl)

dataset_site_review = read_excel("Prob_Dataset.xlsx")

library(dplyr)

library(ggplot2)

library(calendR)

## ~~ Package calendR

## Visit https://r-coder.com/ for R tutorials ~~

## Plot for Days vs Distinct Sessions on each day

plot(dataset_site_review$day, dataset_site_review$distinct_sessions, xlab="Days", ylab="Distinct Sessions")


## Plot for Sites Wise Orders Placed

ggplot(dataset_site_review) +

aes(x = site, group = orders) +

geom_bar(fill = "#2659B4") +

labs(

x = "Sites",

y = "Number of Orders",

title = "Site Wise Orders Placed"

) +

theme_minimal()


## Plot for Orders vs Gross Sales Correlation

ggplot(dataset_site_review) +

aes(x = orders, y = gross_sales) +

geom_jitter(size = 1.5) +

labs(

x = "Orders",

y = "Gross Sales",

title = "Orders vs Gross Sales Corelation"

) +

theme_linedraw()

## Warning: Removed 9576 rows containing missing values (geom_point).


## Plot for Add to Cart vs Gross Sales Correlation

ggplot(dataset_site_review) +

aes(x = add_to_cart, y = gross_sales) +

geom_line(size = 1.75, colour = "#8E3213") +

labs(

x = "Add to Cart",

y = "Gross Sales",

title = "Add to Cart vs Gross Sales Corelation"

) +

theme_linedraw()

## Warning: Removed 3 row(s) containing missing values (geom_path).


## Plot for site-wise Visits vs Distinct Sessions

ggplot(dataset_site_review) +
  aes(x = visits, y = distinct_sessions) +
  geom_point(shape = "circle", size = 1.5, colour = "#228B22") +
  labs(
    x = "Visits",
    y = "Distinct Sessions",
    title = "Site-wise Visits vs Distinct Sessions"
  ) +
  theme_classic() +
  facet_wrap(vars(site))
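To put numbers on the relationships the scatter plots suggest, Pearson correlations can be computed directly; a sketch (use = "complete.obs" drops rows with missing values):

cor(dataset_site_review$orders, dataset_site_review$gross_sales, use = "complete.obs")

cor(dataset_site_review$visits, dataset_site_review$distinct_sessions, use = "complete.obs")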
