PSR ASSIGNMENT Final
PSR_ASSIGNMENT_01
TIPU THAKUR; PG_ID - 12120032
26/02/2022
1. If a random variable has the normal distribution with µ = 77.0 and σ = 3.4, find the
probability that the random variable is
(a) less than 72.6
(b) greater than 88.5
(c) between 81 and 84
(d) between 56 and 92
### Z = (x - µ) / σ
### Store the x values of interest in a vector, then build data frame Random_Var for the random variable
options(scipen = 0)
library(scales)
µ <- 77.0
σ <- 3.4
x <- c(72.6, 88.5, 81, 84, 56, 92)
### Cumulative probability to the left, as compared with a Z table, can be checked with pnorm();
### it can also be taken from the Z scores with mean = 0, sd = 1
z <- round((x - µ) / σ, 2)  # Z rounded to two decimals, as in a printed Z table
Random_Var <- data.frame(Observation = x,
                         Z_Exact = (x - µ) / σ,
                         Z_Rounded = z,
                         Prob_Greater_Than_x = 1 - pnorm(z),
                         Prob_Less_Than_x = pnorm(z))
paste("Table for All Z Values and CDF are given below")
## [1] "Table for All Z Values and CDF are given below"
Random_Var
## Prob_Greater_Than_x
## 1 0.9014746710
## 2 0.0003624291
## 3 0.1190001075
## 4 0.0196992704
## 5 0.9999999997
## 6 0.0000051685
paste("a. Probability for less than a= 72.6 is F(a) = ", percent( Random_Var[which(Random_Var$Observation == 72.6),5], .001 ))
## [1] "a. Probability for less than a= 72.6 is F(a) = 9.853%"
paste("b. Probability for greater than a= 88.5 is 1- F(a) = ", percent( 1 - Random_Var[which(Random_Var$Observation == 88.5),5], .001 ))
## [1] "b. Probability for greater than a= 88.5 is 1- F(a) = 0.036%"
paste("c. Probability for between a:81 and b:84 is F(b) - F(a) = ", percent( Random_Var[which(Random_Var$Observation == 84),5] - Random_Var[which(Random_Var$Observation == 81),5], .001 ))
## [1] "c. Probability for between a:81 and b:84 is F(b) - F(a) = 9.930%"
paste("d. Probability for between a:56 and b:92 is F(b) - F(a) = ", percent( Random_Var[which(Random_Var$Observation == 92),5] - Random_Var[which(Random_Var$Observation == 56),5], .001 ))
## [1] "d. Probability for between a:56 and b:92 is F(b) - F(a) = 99.999%"
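The table-based values round Z to two decimals, as a printed Z table would. As a cross-check, the same four probabilities can be computed directly with pnorm() on the original scale, without any rounding (a minimal sketch; small differences from the table values are expected):

```r
pnorm(72.6, mean = 77, sd = 3.4)                                 # (a) P(X < 72.6)   ≈ 0.0978
1 - pnorm(88.5, mean = 77, sd = 3.4)                             # (b) P(X > 88.5)   ≈ 0.0004
pnorm(84, mean = 77, sd = 3.4) - pnorm(81, mean = 77, sd = 3.4)  # (c) P(81 < X < 84) ≈ 0.0999
pnorm(92, mean = 77, sd = 3.4) - pnorm(56, mean = 77, sd = 3.4)  # (d) P(56 < X < 92) ≈ 1.0000
```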
2. In a car race, the finishing times are normally distributed with mean 145 minutes
and standard deviation of 12 minutes.
(a) Find percentage of car racers whose finish time is between 130 and 160 minutes.
(b) Find percentage of car racers whose finish time is less than 130 minutes.
### Z = (x - µ) / σ
options(scipen = 0)
library(scales)
µ <- 145
σ <- 12
x <- c(130, 160)
### Cumulative probability to the left, as compared with a Z table, can be checked with pnorm();
### it can also be taken from the Z scores with mean = 0, sd = 1
z <- round((x - µ) / σ, 2)
Random_Var <- data.frame(Observation = x,
                         Z_Exact = (x - µ) / σ,
                         Z_Rounded = z,
                         Prob_Greater_Than_x = 1 - pnorm(z),
                         Prob_Less_Than_x = pnorm(z))
paste("Table for All Z Values and CDF are given below")
## [1] "Table for All Z Values and CDF are given below"
Random_Var
## Prob_Greater_Than_x
## 1 0.8943502
## 2 0.1056498
paste("a. Percentage of Cars with Finish Time between a:130 and b:160 is F(b) - F(a) = ", percent( Random_Var[which(Random_Var$Observation == 160),5] - Random_Var[which(Random_Var$Observation == 130),5], .001 ))
## [1] "a. Percentage of Cars with Finish Time between a:130 and b:160 is F(b) - F(a) = 78.870%"
paste("b. Percentage of Cars with Finish Time less than a:130 is F(a) = ", percent( Random_Var[which(Random_Var$Observation == 130),5], .001 ))
## [1] "b. Percentage of Cars with Finish Time less than a:130 is F(a) = 10.565%"
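Because 130 and 160 happen to give exact Z scores of ±1.25, a direct pnorm() computation reproduces the percentages above; a quick sketch:

```r
pnorm(160, mean = 145, sd = 12) - pnorm(130, mean = 145, sd = 12)  # (a) ≈ 0.7887, i.e. 78.87%
pnorm(130, mean = 145, sd = 12)                                    # (b) ≈ 0.1056, i.e. 10.56%
```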
3. A test-taker has recently taken an aptitude test with 15 questions. All of the
questions are True/False in nature.
(a) What is the probability that the student got the first five questions correct?
(b) What is the probability that the student got exactly five questions correct?
Answer
a. Probability of getting the first five correct: each answer is independent of the others, and for
independent events P(A ∩ B) = P(A) * P(B). Thus
P(C1) * P(C2) * P(C3) * P(C4) * P(C5) = (0.5)^5 = 0.03125, with any outcome allowed on the last 10 questions.
b. For exactly five correct out of 15, the positions of the correct answers are not fixed, so the
number of correct answers follows the Binomial Distribution.
## n = Number of Trials = 15
p <- 0.5
n <- 15
r <- 5
paste("Probability That 5 Trials are correct is", dbinom(r, n, p), "i.e. 9.16 %")
## [1] "Probability That 5 Trials are correct is 0.091644287109375 i.e. 9.16 %"
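Both parts can be computed in one short sketch, assuming each True/False answer is an independent trial with p = 0.5:

```r
p <- 0.5
# (a) the first five specific questions correct (remaining ten unrestricted):
p^5                               # = 0.03125
# (b) exactly five of fifteen correct, in any order (binomial):
dbinom(5, size = 15, prob = p)    # ≈ 0.0916
choose(15, 5) * p^5 * (1 - p)^10  # same value via the binomial formula
```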
4. 68% of the marks in exam are between 35 and 42. Assuming data is normally
distributed, what are the mean and standard deviation?
Assumption: both scores 35 and 42 are symmetric with respect to the mean.
Thus Mean = (35 + 42) / 2 = 38.5. Since about 68% of a normal distribution lies within µ ± σ, the
interval (35, 42) spans 2σ, so σ = (42 - 35) / 2 = 3.5.
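A quick numerical check of this derivation: with µ = 38.5 and σ = 3.5, the interval (35, 42) is µ ± 1σ and should therefore cover roughly 68% of the distribution:

```r
pnorm(42, mean = 38.5, sd = 3.5) - pnorm(35, mean = 38.5, sd = 3.5)  # ≈ 0.6827
```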
5. A professor asked students to take two exams. 30% of the class cleared both
exams and 55% of the class cleared the first exam. What percentage of class who
cleared first exam also cleared the second exam?
Answer
Joint probability of clearing Exam E1 & Exam E2 = P(E1 ∩ E2) = 0.30; probability of clearing the first exam P(E1) = 0.55.
P(E2 | E1) = P(E1 ∩ E2) / P(E1) = 0.30 / 0.55 ≈ 0.545, so about 54.5% of the class that cleared the
first exam also cleared the second.
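The conditional probability follows from the rule P(E2 | E1) = P(E1 ∩ E2) / P(E1); as a minimal R sketch:

```r
p_both  <- 0.30   # P(E1 ∩ E2): cleared both exams
p_first <- 0.55   # P(E1): cleared the first exam
p_both / p_first  # P(E2 | E1) ≈ 0.5455
```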
6. In India, 82% of all urban homes have a TV. 62% have a TV and DVD player. What
is probability that a home has a DVD player given that the home has a TV.
Answer
Joint probability of a home with TV & DVD = P(TV ∩ DVD) = 0.62; probability of a home with a TV P(TV) = 0.82.
P(DVD | TV) = P(TV ∩ DVD) / P(TV) = 0.62 / 0.82 ≈ 0.756, so about 75.6% of homes with a TV also
have a DVD player.
7. You toss a coin three times. Assume that the coin is fair. What is the probability of
getting:
1) All three heads
2) Exactly one head
3) Given that you have seen at least one head, what is the probability of getting at
least two heads?
## Checking the sample space for three coin tosses - total number of elements in the sample space is 2^3 = 8
## The same can be done with the prob library's tosscoin() function, which gives all possible
## outcomes in a data frame
library(prob)
samplespace <- tosscoin(3)
n_all_head <- 0
n_exact_one_head <- 0
n_atleast_two_head <- 0
n_atleast_one_head <- 0
for (i in 1:nrow(samplespace)) {
  n_heads <- sum(samplespace[i, ] == 'H')
  if (n_heads == 3) n_all_head <- n_all_head + 1
  if (n_heads == 1) n_exact_one_head <- n_exact_one_head + 1
  if (n_heads >= 2) n_atleast_two_head <- n_atleast_two_head + 1
  if (n_heads >= 1) n_atleast_one_head <- n_atleast_one_head + 1
}
paste("a. Probability of getting all three heads =", n_all_head, "/ 8 =", n_all_head / 8)
## [1] "a. Probability of getting all three heads = 1 / 8 = 0.125"
paste("b. Probability of getting exactly one head =", n_exact_one_head, "/ 8 =", n_exact_one_head / 8)
## [1] "b. Probability of getting exactly one head = 3 / 8 = 0.375"
paste("c. Probability of getting at least two heads given at least one head =",
      n_atleast_two_head, "/", n_atleast_one_head, "=", n_atleast_two_head / n_atleast_one_head)
## [1] "c. Probability of getting at least two heads given at least one head = 4 / 7 = 0.571428571428571"
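The enumeration can be cross-checked analytically with binomial probabilities for three fair tosses (a small sketch):

```r
dbinom(3, size = 3, prob = 0.5)  # 1) all three heads: 1/8 = 0.125
dbinom(1, size = 3, prob = 0.5)  # 2) exactly one head: 3/8 = 0.375
# 3) at least two heads, conditioning on at least one head:
sum(dbinom(2:3, 3, 0.5)) / sum(dbinom(1:3, 3, 0.5))  # (4/8) / (7/8) = 4/7 ≈ 0.5714
```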
8. Let x be the random variable that represents the speed of cars. x has µ = 90 and σ
= 10. We have to find the probability that x is higher than 100, i.e. P(x > 100).
### µ = 90 and σ = 10
### Z = (x - µ) / σ
options(scipen = 0)
library(scales)
µ <- 90
σ <- 10
x <- c(100)
### Cumulative probability can also be taken from the Z scores with mean = 0, sd = 1
z <- (x - µ) / σ
Random_Var <- data.frame(Observation = x,
                         Z_Score = z,
                         Prob_Greater_Than_x = 1 - pnorm(z),
                         Prob_Less_Than_x = pnorm(z))
paste("Table for All Z Values and CDF are given below")
## [1] "Table for All Z Values and CDF are given below"
Random_Var
paste("a. Probability for Value of Random Variable x higher than a:100 = ", percent( 1 - Random_Var[which(Random_Var$Observation == 100),4], .001 ))
## [1] "a. Probability for Value of Random Variable x higher than a:100 = 15.866%"
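Since Z = (100 − 90)/10 = 1 exactly, the answer can be confirmed directly on either scale:

```r
1 - pnorm(100, mean = 90, sd = 10)  # ≈ 0.1587
pnorm(1, lower.tail = FALSE)        # same value on the standard normal scale
```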
9. The blue M&M was introduced in 1995. Before then, the color mix in a bag of plain
M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan).
Afterward it was (24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13%
Brown). A friend of mine has two bags of M&Ms, and he tells me that one is from 1994
and one from 1996. He won’t tell me which is which, but he gives me one M&M from
each bag. One is yellow and one is green. What is the probability that the yellow M&M
came from the 1994 bag?
Answer
Let Case A be "the yellow M&M came from the 1994 bag and the green from the 1996 bag", and Case B the
reverse, with equal priors P(A) = P(B) = 0.5.
P(Y, G) = P(A) * P(Y_94) * P(G_96) + P(B) * P(Y_96) * P(G_94) = (0.5 * 0.20 * 0.20) + (0.5 * 0.14 * 0.10) = 0.020 + 0.007 = 0.027
P(1994 | Y) = P(Case A | Y, G) = P(Y, G | Case A) * P(A) / P(Y, G) = (0.20 * 0.20 * 0.50) / 0.027 =
0.020 / 0.027 = 0.7407407
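The same Bayes computation as a short R sketch (Case A: the yellow M&M is from the 1994 bag, the green one from the 1996 bag):

```r
prior  <- 0.5            # equal prior for each bag assignment
like_A <- 0.20 * 0.20    # P(yellow | 1994 mix) * P(green | 1996 mix)
like_B <- 0.14 * 0.10    # P(yellow | 1996 mix) * P(green | 1994 mix)
prior * like_A / (prior * like_A + prior * like_B)  # P(1994 | yellow) ≈ 0.7407
```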
10. Find the daily stock price of Wal-Mart for the last three months. (A good source for
the data is http://moneycentral.msn.com or Yahoo Finance or Google Finance (there are
many more such sources). You can ask for the three-month chart and export the data to a
spreadsheet.)
(a) Calculate the mean and the standard deviation of the stock prices.
(b) Get the corresponding data for Target and calculate the mean and the standard
deviation.
(c) The coefficient of variation (CV) is defined as the ratio of the standard deviation
over the mean. Calculate the CV of Wal-Mart and Target stock prices.
(d) If the CV of the daily stock prices is taken as an indicator of risk of the stock, how
do Wal-Mart and Target stocks compare in terms of risk? (There are better measures
of risk, but we will use CV in this exercise.)
(e) Get the corresponding data of the Dow Jones Industrial Average (DJIA) and
compute its CV. How do Wal-Mart and Target stocks compare with the DJIA in
terms of risk?
(f) Suppose you bought 100 shares of Wal-Mart stock three months ago and held it.
What are the mean and the standard deviation of the daily market price of your
holding for the three months?
Note: Question referenced from Aczel A., Sounderpandian J., Complete Business
Statistics (7ed.)
## Load Stock Prices for Walmart, TARGET, DJIA in three Data Frames
## setwd("C:/Users/tiput/")
stock_WMT$STOCK_NAME = 'WALMART'
stock_TGT$STOCK_NAME = 'TARGET'
stock_DJIA$STOCK_NAME = 'DJIA'
library(dplyr)
stock_review = bind_rows(stock_WMT, stock_TGT, stock_DJIA)
fact_stock_review = stock_review %>%
  group_by(STOCK_NAME) %>%
  summarise(MIN = min(Close, na.rm = T),   # 'Close' price column assumed from the exported data
            MAX = max(Close, na.rm = T),
            MEAN = mean(Close, na.rm = T),
            STD_DEV = sd(Close, na.rm = T),
            CV_Coefficient = STD_DEV / MEAN) %>%
  arrange( desc(CV_Coefficient) )
fact_stock_review
## # A tibble: 3 x 6
A higher CV_Coefficient implies higher volatility relative to the central value of the stock; Target has
the highest CV_Coefficient and is thus the most volatile of the three stocks/indexes.
Mean and standard deviation both depend on scale, so for a holding of 100 shares both the mean and the
standard deviation of the daily market price get multiplied by a factor of 100.
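The scaling claim in part (f) can be illustrated on a small hypothetical price vector (the numbers below are made up; real closes would come from the exported three-month data):

```r
price <- c(142.1, 143.8, 141.5, 144.2, 145.0)  # hypothetical daily closing prices
holding <- 100 * price                         # market value of a 100-share holding
all.equal(mean(holding), 100 * mean(price))    # mean scales by 100
all.equal(sd(holding), 100 * sd(price))        # standard deviation scales by 100
all.equal(sd(holding) / mean(holding),
          sd(price) / mean(price))             # CV is unchanged: it is scale-free
```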
setwd("C:/Users/tiput/Documents")
options(scipen = 0)
library(readxl)
dataset_site_review = read_excel("Prob_Dataset.xlsx")
library(dplyr)
library(ggplot2)
summary(dataset_site_review)
## (summary output abridged in conversion; notable values: new_customer has 8,259 NA's, one other
## column has 9,576 NA's, and search_page_views ranges from 0 to 506,629 with median 82 and mean 8,584)
# Identifying summary stats for all columns: unpivot all numeric columns into a single data
# frame stacked one upon another, with the column name as CATEGORY
numeric_cols <- names(dataset_site_review)[sapply(dataset_site_review, is.numeric)]
summary_frame_numeric = bind_rows(
  lapply(numeric_cols, function(col)
    data.frame(CATEGORY = col, VALUE = dataset_site_review[[col]])))
head(summary_frame_numeric)
## CATEGORY VALUE
## 1 new_customer NA
## 2 new_customer NA
## 3 new_customer NA
## 4 new_customer NA
## 5 new_customer NA
## 6 new_customer NA
tail(summary_frame_numeric)
## CATEGORY VALUE
## 189544 search_page_views 0
## 189545 search_page_views 0
## 189546 search_page_views 0
## 189547 search_page_views 2
## 189548 search_page_views 0
## 189549 search_page_views 1
summary_review = summary_frame_numeric %>%
  group_by(CATEGORY) %>%
  summarise(MIN = min(VALUE, na.rm = T),
            MAX = max(VALUE, na.rm = T),
            MEAN = mean(VALUE, na.rm = T),
            STD_DEV = sd(VALUE, na.rm = T)) %>%
  arrange( CATEGORY )
summary_review
## # A tibble: 9 x 5
check_customer = dataset_site_review %>%  # object name and aesthetics assumed; original chunk truncated
  group_by(new_customer) %>%
  summarise(COUNT = n()) %>%
  arrange( new_customer )
ggplot(check_customer, aes(x = "", y = COUNT, fill = factor(new_customer))) +
  geom_col() +
  geom_text(aes(label = COUNT),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(
    title = "Records by New vs Returning Customer"
  ) +
  theme(
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  )
check_platform = dataset_site_review %>%  # object name and aesthetics assumed; original chunk truncated
  group_by(platform) %>%
  summarise(PERCENTAGE = n() / nrow(dataset_site_review) * 100) %>%
  arrange( platform )
ggplot(check_platform, aes(x = platform, y = PERCENTAGE)) +
  geom_bar(stat="identity", fill="steelblue") +
  theme_linedraw() +
  labs(
    x = "Platform",
    y = "Percentage"
  ) +
  coord_flip()
# Note: this theme() call is not added to a plot, so R simply prints the theme object
theme(
  axis.title.x = element_text(face = "bold"),
  axis.title.y = element_text(face = "bold"),
  plot.title = element_text(face = "bold",
                            hjust = 0.5)
)
## List of 3
## $ axis.title.x:List of 11
## $ axis.title.y:List of 11
## $ plot.title :List of 11
summary(dataset_site_review$visits)
# Bucket the visits into four ranges (bucket construction assumed; original chunk truncated)
check_visits = dataset_site_review %>%
  mutate(VISITS_REC_BUCKETS = ntile(visits, 4)) %>%
  group_by(VISITS_REC_BUCKETS) %>%
  summarise(VISITS_COUNT = n()) %>%
  arrange( VISITS_REC_BUCKETS )
ggplot(check_visits, aes(x = "", y = VISITS_COUNT, fill = factor(VISITS_REC_BUCKETS))) +
  geom_col() +
  geom_text(aes(label = VISITS_COUNT),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(
    title = "Visits by Bucket"
  ) +
  theme(
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  )
check_visits_1 = dataset_site_review %>%
  group_by(orders) %>%
  summarise(COUNT = n())
dataset_site_review
## # A tibble: 21,061 x 12
barplot(check_visits_1$orders)
After checking the data for new customers, nearly 8,300 records do not have this field filled, so the data
capture mechanism needs to be optimised. Most hits come through iOS (16.31%), followed by Android and
Windows. Most store visits fall in the fourth, highest-volume bucket. The maximum number of orders in a day is 4,000.
12. Consider the same dataset as earlier. Now perform bivariate data analysis as
discussed in the last class to find out relationships between different variables. Write a
short description of what you find in the analyses, along with any tables and graphs. You
do not have to analyze all variables; any 3 - 4 that interest you most should be the
focus here.
setwd("C:/Users/tiput/Documents")
library(readxl)
dataset_site_review = read_excel("Prob_Dataset.xlsx")
library(dplyr)
library(ggplot2)
library(calendR)
ggplot(dataset_site_review, aes(x = site, weight = orders)) +  # aesthetic mapping assumed; lost in conversion
  geom_bar(fill = "#2659B4") +
  labs(
    x = "Sites",
    y = "Number of Orders"
  ) +
  theme_minimal()
ggplot(dataset_site_review, aes(x = orders, y = gross_sales)) +  # 'gross_sales' column name assumed
  geom_jitter(size = 1.5) +
  labs(
    x = "Orders",
    y = "Gross Sales"
  ) +
  theme_linedraw()
ggplot(dataset_site_review, aes(x = add_to_cart, y = gross_sales)) +  # column names assumed
  geom_point(size = 1.5) +
  labs(
    x = "Add to Cart",
    y = "Gross Sales"
  ) +
  theme_linedraw()
ggplot(dataset_site_review, aes(x = visits, y = gross_sales)) +  # column names assumed
  geom_point(size = 1.5) +
  labs(
    x = "Visits",
    y = "Gross Sales"
  ) +
  theme_classic() +
  facet_wrap(vars(site))
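A numeric complement to the scatter plots: correlation coefficients quantify the same pairwise relationships (a sketch; the `gross_sales` and `add_to_cart` column names are assumed, since the original chunk was truncated):

```r
cor(dataset_site_review$orders, dataset_site_review$gross_sales, use = "complete.obs")
cor(dataset_site_review$visits, dataset_site_review$gross_sales, use = "complete.obs")
cor(dataset_site_review$add_to_cart, dataset_site_review$gross_sales, use = "complete.obs")
```

Values near 1 would confirm the strong positive trends the plots suggest; `use = "complete.obs"` drops the rows with missing values noted earlier.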