Assignment 1-Descriptive Data Mining, Statistical Inference and Bootstrapping
Bootstrapping
R code:
> # Load data
> data <- read.csv("Desktop/Data_for_Assignment1_heavenlychocolates.csv",
+ stringsAsFactors = FALSE, check.names = FALSE)
> # Check data was loaded correctly
> str(data)
'data.frame': 50 obs. of 6 variables:
$ Customer : int 1 2 3 4 5 6 7 8 9 10 ...
$ Day : chr "Mon" "Wed" "Mon" "Tue" ...
$ Browser : chr "Chrome" "Other" "Chrome" "Firefox" ...
$ Time (min) : num 12 19.5 8.5 11.4 11.3 10.5 11.4 4.3 12.7 24.7 ...
$ Pages Viewed : int 4 6 4 2 4 6 2 6 3 7 ...
$ Amount Spent ($): num 54.5 94.9 26.7 44.7 66.3 ...
> # Summary statistics
> summary(data$`Time (min)`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.30 8.65 11.40 12.81 14.90 32.90
> summary(data$`Pages Viewed`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 3.25 4.50 4.82 6.00 10.00
> summary(data$`Amount Spent ($)`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.84 45.56 62.15 68.13 82.73 158.51
> # Histograms
> hist(data$`Time (min)`, main="Time Spent on Website (min)")
> hist(data$`Pages Viewed`, main="Pages Viewed")
The numerical summaries show that the average time spent on the website is
12.81 minutes, with most customers spending between 9 and 16 minutes.
The average number of pages viewed is 4.94, with most customers viewing
between 4 and 6 pages. The average amount spent per transaction is $68.34,
with most customers spending between $52 and $84.
The histograms show roughly bell-shaped distributions for time spent and
amount spent, while pages viewed is somewhat right-skewed. A few outlying
customers spent a long time on the site, viewed many pages, and spent large
amounts (the maxima are 32.9 minutes, 10 pages, and $158.51), but most
customers behave similarly.
R code:
> library(dplyr)
> library(ggplot2)
+ group_by(Day) %>%
+ summarise(Frequency = n(),
> day_summary
# A tibble: 7 × 4
+ geom_col() +
+ labs(title="Total Sales by Day of Week")
Fridays have the highest total sales at $945, followed by Mondays at $813.
Saturdays and Sundays have the lowest totals, at $379 and $218 respectively,
which is consistent with more customers shopping online during the work
week. The average transaction amount is also higher on Mondays ($90.4) and
Fridays ($85.9). One possible explanation is that business buyers place
larger orders at the start and end of the work week, while weekend orders
are smaller personal purchases; the data do not record buyer type, so this
remains a conjecture. Overall, the analysis by day of week shows that sales
activity for Heavenly Chocolates is concentrated in the typical work week,
with weekends being slower.
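The per-day figures discussed above (number of transactions, total sales, average transaction) can be computed with a small grouping helper. The following is a minimal sketch in base R rather than the dplyr pipeline used in the assignment; the helper name `summarise_by`, the column name `Amount`, and the six toy rows are assumptions for illustration, not the assignment data.

```r
# Toy data: hypothetical values, not the 50-row assignment dataset.
toy <- data.frame(
  Day    = c("Mon", "Wed", "Mon", "Tue", "Fri", "Fri"),
  Amount = c(54.5, 94.9, 26.7, 44.7, 80.0, 120.0)
)

# Hypothetical helper: count, total, and mean of value_col per level of group_col.
summarise_by <- function(df, group_col, value_col) {
  groups <- split(df[[value_col]], df[[group_col]])
  data.frame(
    Group     = names(groups),
    Frequency = unname(vapply(groups, length, integer(1))),
    Total     = unname(vapply(groups, sum, numeric(1))),
    Average   = unname(vapply(groups, mean, numeric(1)))
  )
}

day_summary <- summarise_by(toy, "Day", "Amount")

# Sort so the highest-grossing day comes first.
day_summary <- day_summary[order(-day_summary$Total), ]
print(day_summary)
```

The same helper could be pointed at a browser column instead of the day column, since it only assumes one grouping column and one numeric column.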
R code:
Chrome has the highest number of transactions at 27 and the highest total
sales at $1657. However, its average transaction amount is the lowest of
the three groups at $61.4, compared with $76.8 for Firefox and $74.5 for
the "Other" browser category. The higher averages outside Chrome may
indicate more business users making purchases with other browsers, although
the data do not record buyer type.
In summary, Chrome generates the most transaction volume and total sales,
while Firefox and the "Other" browser category see fewer transactions but
higher ticket sizes on average. This may suggest that some business buyers
use alternative browsers such as Safari or Edge for larger purchases.
Segmenting customers by browser type provides insights into varying online
shopping behaviors among the site's users.
R code:
# Scatterplot
+ geom_point() +
[1] 0.5800476
R code:
> # Scatterplot
+ geom_point() +
+ x="Pages Viewed",
[1] 0.7236669
R code:
> # Scatterplot
> ggplot(data, aes(x=`Pages Viewed`, y=`Time (min)`)) +
+ geom_point() +
+ labs(title="Pages Viewed vs. Time Spent",
+ x="Pages Viewed",
+ y="Time Spent (min)")
> # Correlation
> cor(data$`Pages Viewed`, data$`Time (min)`)
[1] 0.5955675
In summary, the data show positive relationships between time spent, pages
viewed, and amount spent, with the reported correlations ranging from about
0.58 to 0.72. These are associations rather than causal estimates, so
increasing customer engagement in terms of time and page views could
potentially, but not necessarily, increase sales. Mondays and Fridays are
the highest sales days, while Chrome is the most widely used browser.
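To make the correlation figures concrete, the sketch below computes the Pearson coefficient from its definition and checks it against R's built-in `cor()`, then builds the full pairwise matrix in one call. The data frame uses only the first five rows visible in the `str()` output (amounts as rounded there), with shortened column names chosen for illustration, so the resulting values will not match the 50-row results reported above.

```r
# First five rows as shown in the str() output; column names are shortened
# for illustration and are not the names used in the CSV.
toy <- data.frame(
  time_min     = c(12.0, 19.5, 8.5, 11.4, 11.3),
  pages_viewed = c(4, 6, 4, 2, 4),
  amount_spent = c(54.5, 94.9, 26.7, 44.7, 66.3)
)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the product of the deviation norms.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson(toy$pages_viewed, toy$time_min)
r_builtin <- cor(toy$pages_viewed, toy$time_min)

# All pairwise correlations at once (symmetric, with 1s on the diagonal).
cor_matrix <- cor(toy)
print(round(cor_matrix, 3))
```

Calling `cor()` on the whole numeric data frame avoids repeating the same call for each pair of variables.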
Summary Statistic     Time (min)   Pages Viewed   Amount Spent ($)
Mean                       12.81           4.94              68.34
Median                     11.75           4.50              63.30
Standard Deviation          5.66           1.91              34.10
Range                      28.60           8.00             140.67
Min                         4.30           2.00              17.84
Max                        32.90          10.00             158.51
Sum                       640.50         247.00            3417.02
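A table of this shape can be produced in one pass by applying a small function to each numeric column. This is a minimal sketch, assuming a hypothetical helper named `describe` and a toy data frame built from the first five rows visible in the `str()` output (amounts as rounded there), so the numbers it prints differ from the 50-row table above.

```r
# First five rows as shown in the str() output; column names are shortened
# for illustration and are not the names used in the CSV.
toy <- data.frame(
  time_min     = c(12.0, 19.5, 8.5, 11.4, 11.3),
  pages_viewed = c(4, 6, 4, 2, 4),
  amount_spent = c(54.5, 94.9, 26.7, 44.7, 66.3)
)

# Hypothetical helper returning the seven statistics used in the table.
describe <- function(x) {
  c(Mean   = mean(x),
    Median = median(x),
    SD     = sd(x),            # sample standard deviation (n - 1 denominator)
    Range  = max(x) - min(x),
    Min    = min(x),
    Max    = max(x),
    Sum    = sum(x))
}

# sapply() returns a matrix: one row per statistic, one column per variable.
stats <- sapply(toy, describe)
print(round(stats, 2))
```

A quick consistency check on any such table is that Sum divided by the number of observations equals Mean, and that Max minus Min equals Range.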
1. Overall Probabilities
Questions:
b. To confirm the complete linkage distance of 2.577 units between the
clusters {10, 13} and {4, 20}, calculate the Euclidean distance for each of
the four cross-cluster pairs of observations (10 and 4, 10 and 20, 13 and 4,
13 and 20) and take the maximum of those distances.
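The calculation described above can be sketched as follows. The coordinates are hypothetical two-dimensional points chosen only to show the procedure, since the observation values are not given here, so the result will not equal 2.577. The function name `complete_linkage` is likewise an assumption for illustration.

```r
# Hypothetical coordinates for the four observations (not the assignment data).
cluster_a <- rbind(obs10 = c(1.0, 2.0), obs13 = c(1.5, 2.5))
cluster_b <- rbind(obs4  = c(3.0, 3.0), obs20 = c(4.0, 2.0))

# Complete linkage: the largest Euclidean distance over all cross-cluster pairs.
complete_linkage <- function(a, b) {
  # dist() computes Euclidean distances between all rows by default;
  # keep only the block with rows from a and columns from b.
  all_d <- as.matrix(dist(rbind(a, b)))
  cross <- all_d[rownames(a), rownames(b), drop = FALSE]
  list(pairwise = cross, distance = max(cross))
}

result <- complete_linkage(cluster_a, cluster_b)
print(round(result$pairwise, 3))  # the four cross-cluster distances
print(result$distance)            # their maximum
```

Printing the pairwise matrix alongside the maximum makes it easy to see which pair of observations determines the linkage distance.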