
Assignment 1-Descriptive Data Mining, Statistical Inference and Bootstrapping

The document analyzes data from Hamilton County judges, finding that the overall probability of a case being appealed is 0.96% and the probability of appeal being reversed is 0.11%. It also calculates the probabilities of appeal and reversal for each individual judge, showing variations across judges from 0.0063 to 0.0452 for probability of appeal.

Uploaded by

Joshua J Kennedy

Assignment 1: Descriptive Data Mining, Statistical Inference and Bootstrapping

Name: Joshua Kennedy Joseph


Reg No: 8916893

Managerial report for Case Problem 1:

1. Graphical and numerical summaries for time spent, pages viewed, and
amount spent:

R code:
> # Load data
> data <- read.csv("Desktop/Data_for_Assignment1_heavenlychocolates.csv",
+                  stringsAsFactors = FALSE, check.names = FALSE)
> # Check data was loaded correctly
> str(data)
'data.frame':   50 obs. of  6 variables:
 $ Customer        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Day             : chr  "Mon" "Wed" "Mon" "Tue" ...
 $ Browser         : chr  "Chrome" "Other" "Chrome" "Firefox" ...
 $ Time (min)      : num  12 19.5 8.5 11.4 11.3 10.5 11.4 4.3 12.7 24.7 ...
 $ Pages Viewed    : int  4 6 4 2 4 6 2 6 3 7 ...
 $ Amount Spent ($): num  54.5 94.9 26.7 44.7 66.3 ...
> # Summary statistics
> summary(data$`Time (min)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.30    8.65   11.40   12.81   14.90   32.90
> summary(data$`Pages Viewed`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    3.25    4.50    4.82    6.00   10.00
> summary(data$`Amount Spent ($)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  17.84   45.56   62.15   68.13   82.73  158.51
> # Histograms
> hist(data$`Time (min)`, main="Time Spent on Website (min)")
> hist(data$`Pages Viewed`, main="Pages Viewed")
> hist(data$`Amount Spent ($)`, main="Amount Spent ($)")

The numerical summaries show that the average time spent on the website is
12.81 minutes, with the middle half of customers spending between 8.65 and
14.90 minutes. The average number of pages viewed is 4.82, with the middle
half of customers viewing between about 3 and 6 pages. The average amount
spent per transaction is $68.13, with the middle half of customers spending
between $45.56 and $82.73.
The histograms show roughly bell-shaped distributions for time spent and
amount spent, although in both cases the mean exceeds the median, which
indicates a longer right tail; pages viewed is somewhat right-skewed. There
are a few outlier customers who spent a lot of time on the site, viewed many
pages, and spent large amounts. Overall, most customers behave similarly.

2. Summary by day of week:

R code:
> library(dplyr)
> library(ggplot2)
> day_summary <- data %>%
+   group_by(Day) %>%
+   summarise(Frequency = n(),
+             Total = sum(`Amount Spent ($)`),
+             Mean = mean(`Amount Spent ($)`))
> day_summary
# A tibble: 7 × 4
  Day   Frequency Total  Mean
  <chr>     <int> <dbl> <dbl>
1 Fri          11  945.  85.9
2 Mon           9  813.  90.4
3 Sat           7  379.  54.1
4 Sun           5  218.  43.6
5 Thu           5  294.  58.8
6 Tue           7  415.  59.3
7 Wed           6  342.  57.0
> ggplot(day_summary, aes(x=Day, y=Total)) +
+   geom_col() +
+   labs(title="Total Sales by Day of Week")

The data show that Fridays have the highest total sales at $945, followed by
Mondays at $813. The lowest totals fall on Sundays ($218) and Thursdays
($294), with Wednesdays ($342), Saturdays ($379), and Tuesdays ($415) in
between. The average transaction amount is also highest on Mondays ($90.4)
and Fridays ($85.9) and lowest on Sundays ($43.6) and Saturdays ($54.1),
while the midweek days cluster around $57 to $59. One possible explanation
is that business buyers make larger purchases at the start and end of the
work week while weekend purchases are smaller and personal, but the data do
not identify buyer type, and with only 5 to 11 transactions per day these
differences should be read cautiously. Overall, sales at Heavenly
Chocolates are concentrated on Mondays and Fridays, which together account
for 20 of the 50 transactions and just over half of total sales.

3. Summary by browser:

R code:
> # Table of frequency, total spend, average spend by browser
> browser_summary <- data %>%
+   group_by(Browser) %>%
+   summarise(Frequency = n(),
+             Total = sum(`Amount Spent ($)`),
+             Mean = mean(`Amount Spent ($)`))
> browser_summary
# A tibble: 3 × 4
  Browser Frequency Total  Mean
  <chr>       <int> <dbl> <dbl>
1 Chrome         27 1657.  61.4
2 Firefox        16 1228.  76.8
3 Other           7  521.  74.5
> # Bar chart of total sales by browser
> ggplot(browser_summary, aes(x=Browser, y=Total)) +
+   geom_col() +
+   labs(title="Total Sales by Browser")

Chrome has the highest number of transactions at 27 and the highest total
sales at $1657. However, its average transaction amount is the lowest of
the three at $61.4. Firefox has the highest average at $76.8, followed
closely by the "Other" browser category at $74.5.
In summary, Chrome generates the most transaction volume and total sales,
but Firefox and "Other" users spend roughly $13 to $15 more per order on
average. The "Other" category covers only 7 transactions, so its average is
the least reliable of the three, and the data do not show why order sizes
differ by browser. Segmenting customers by browser type still provides
insights into varying online shopping behaviors among the site's users.

4. Relationship between time spent and amount spent:

R code:
> # Scatterplot
> ggplot(data, aes(x=`Time (min)`, y=`Amount Spent ($)`)) +
+   geom_point() +
+   labs(title="Time Spent vs. Amount Spent",
+        x="Time Spent on Website (min)",
+        y="Amount Spent ($)")
> # Correlation
> cor(data$`Time (min)`, data$`Amount Spent ($)`)
[1] 0.5800476

The scatterplot and correlation coefficient of 0.58 show there is a
moderately positive relationship between time spent on the site and amount
spent. Customers who browse longer tend to spend more money. Though
there is some variability, the upward sloping scatterplot indicates greater
time on site is associated with higher purchases. This suggests increasing
engagement as measured by time on the Heavenly Chocolates website could
potentially lead to higher sales. Overall, the analysis provides useful insights
into how customer website behavior relates to online shopping patterns.

5. Relationship between pages viewed and amount spent:

R code:
> # Scatterplot
> ggplot(data, aes(x=`Pages Viewed`, y=`Amount Spent ($)`)) +
+   geom_point() +
+   labs(title="Pages Viewed vs. Amount Spent",
+        x="Pages Viewed",
+        y="Amount Spent ($)")
> # Correlation
> cor(data$`Pages Viewed`, data$`Amount Spent ($)`)
[1] 0.7236669

The scatterplot and correlation of 0.72 show a strong positive relationship
between pages viewed and amount spent. Customers who view more pages
on the site spend larger amounts. Though there is some variance, the upward
slope indicates greater pageviews are associated with higher purchases. This
suggests that increasing engagement through pages viewed could potentially
increase sales. The analysis provides useful insights into how customer web
browsing relates to online shopping behavior on the site.

6. Relationship between time spent and pages viewed:

R code:

> # Scatterplot
> ggplot(data, aes(x=`Pages Viewed`, y=`Time (min)`)) +
+ geom_point() +
+ labs(title="Pages Viewed vs. Time Spent",
+ x="Pages Viewed",
+ y="Time Spent (min)")
> # Correlation
> cor(data$`Pages Viewed`, data$`Time (min)`)
[1] 0.5955675

The scatterplot and correlation coefficient of 0.60 show a moderately
positive relationship between pages viewed and time spent on the site.
Customers who view more pages tend to spend more time browsing.
Though there is some variance, the upward slope indicates greater
pageviews are associated with more time on site. This suggests that
increasing engagement through pages viewed could also potentially increase
time spent. The analysis provides insights into how customer web browsing
behavior correlates on the Heavenly Chocolates website.
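
The three pairwise correlations in sections 4 to 6 can also be obtained in a single call by passing all three columns to cor(). The sketch below assumes a small made-up data frame (`toy`, a hypothetical name) that reuses the assignment's column names; its values are illustrative only and are not the Heavenly Chocolates data.

```r
# Hypothetical stand-in for the assignment data (values are made up)
toy <- data.frame(
  check.names = FALSE,
  `Time (min)`       = c(5, 8, 12, 15, 20, 25),
  `Pages Viewed`     = c(2, 3, 4, 4, 6, 8),
  `Amount Spent ($)` = c(20, 35, 50, 70, 90, 130)
)

# Pearson correlation matrix for all three variables at once
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# A single pair can be read off the matrix by column name
cor_matrix["Pages Viewed", "Amount Spent ($)"]
```

With the real data, the same call on the three numeric columns of `data` should reproduce the individual cor() results reported above.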

In summary, the data shows positive relationships between time spent, pages
viewed, and amount spent. Increasing customer engagement in terms of time
and page views could potentially increase sales. Mondays and Fridays are
the highest sales days, while Chrome is the most widely used browser.
Summary Statistic     Time (min)   Pages Viewed   Amount Spent ($)
Mean                  12.81        4.82           68.13
Median                11.40        4.50           62.15
Standard Deviation    5.66         1.91           34.10
Range                 28.60        8.00           140.67
Min                   4.30         2.00           17.84
Max                   32.90        10.00          158.51
Sum                   640.50       241.00         3406 (approx.)
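
A table like the one above could be produced with base R along the following lines. This is a minimal sketch: `summarise_column` is a hypothetical helper name, and the input vector is made up for illustration rather than taken from the assignment data.

```r
# Hypothetical helper: the statistics reported in the summary table
summarise_column <- function(x) {
  c(Mean   = mean(x),
    Median = median(x),
    SD     = sd(x),            # sample standard deviation (n - 1 denominator)
    Range  = max(x) - min(x),
    Min    = min(x),
    Max    = max(x),
    Sum    = sum(x))
}

# Made-up values, for illustration only
pages <- c(2, 4, 4, 6, 9)
stats <- summarise_column(pages)
print(round(stats, 2))
```

With the real data, `sapply(data[, c("Time (min)", "Pages Viewed", "Amount Spent ($)")], summarise_column)` would return one column of statistics per variable.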

Case Problem 2: Hamilton County Judges


Managerial Report:

1. Overall Probabilities

• Total cases disposed: 182,908
• Total cases appealed: 1,762
• Probability of appeal = 0.0096 or 0.96%
• Total cases reversed: 199
• Probability of reversal = 0.0011 or 0.11%
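
The arithmetic behind these overall probabilities can be checked in a few lines of R. The counts are the ones reported above; the variable names are chosen here for illustration.

```r
# Counts as reported in the overall probabilities above
total_disposed <- 182908
total_appealed <- 1762
total_reversed <- 199

# Each probability is the event count divided by all cases disposed
p_appeal   <- total_appealed / total_disposed
p_reversal <- total_reversed / total_disposed

print(round(c(appeal = p_appeal, reversal = p_reversal), 4))
```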

2. Probability of Appeal by Judge:

Judge         Total Cases   Appealed Cases   Probability of Appeal
Cartolano     3037          137              0.0451
Crush         3372          119              0.0353
Hogan         1954          60               0.0307
Kraft         3138          127              0.0405
Mathews       2264          91               0.0402
Morrissey     3032          121              0.0399
Nadel         2959          131              0.0443
Ney Jr.       3219          125              0.0388
Niehaus       3353          137              0.0409
Nurre         3000          120              0.0400
O'Connor      2969          129              0.0434
Ruehlman      3205          145              0.0452
Sundermann    9556          60               0.0063
Tracey        3141          127              0.0404
Winkler       3089          88               0.0285

3. Probability of Reversal by Judge:

Judge         Total Cases   Reversed Cases   Probability of Reversal
Cartolano     3037          12               0.0040
Crush         3372          10               0.0030
Dinkelacker   1258          8                0.0064
Hogan         1954          7                0.0036
Kraft         3138          7                0.0022
Mathews       2264          18               0.0080
Morrissey     3032          22               0.0073
Nadel         2959          20               0.0068
Ney Jr.       3219          14               0.0043
Niehaus       3353          16               0.0048
Nurre         3000          16               0.0053
O'Connor      2969          12               0.0040
Ruehlman      3205          18               0.0056
Sundermann    9556          10               0.0010
Tracey        3141          13               0.0041
Winkler       3089          6                0.0019
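
The per-judge probabilities are each judge's appealed or reversed count divided by that judge's total cases. The sketch below applies this to three rows copied from the tables above; the data frame and column names are chosen here for illustration.

```r
# Three rows copied from the per-judge tables above
judges <- data.frame(
  Judge    = c("Crush", "Hogan", "Winkler"),
  Total    = c(3372, 1954, 3089),
  Appealed = c(119, 60, 88),
  Reversed = c(10, 7, 6)
)

# Probability = count / total cases for that judge, rounded to 4 places
judges$P_Appeal   <- round(judges$Appealed / judges$Total, 4)
judges$P_Reversal <- round(judges$Reversed / judges$Total, 4)

# Sort from highest to lowest probability of appeal
print(judges[order(-judges$P_Appeal), ])
```

Extending the data frame to all judges would allow the full tables to be regenerated and checked in one step.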

Case Problem 3: Complete Linkage Clustering of Utility Companies

Questions:

a. Based on the following dendrogram, what is the most appropriate number
of clusters to organize these utility companies?

The most appropriate number of clusters to organize the utility companies,
based on the given dendrogram, is 3 clusters. This is evident from the
dendrogram having 3 clear branches splitting off from the left side,
indicating 3 distinct groupings of the companies based on their Euclidean
distances as calculated through hierarchical clustering using complete
linkage. Choosing 3 clusters allows the companies to be categorized into
their most natural groups according to the hierarchical clustering analysis
performed.

b. To confirm the complete linkage distance of 2.577 units between the
clusters {10, 13} and {4, 20}, we can calculate the Euclidean distance
between each pair of observations across the two clusters and take the
maximum of those distances.

The Euclidean distances are:

Between 10 and 4: √(1.04^2 + 0.534^2 + 0.742^2 + 0.782^2 + 0.677^2 +
0.953^2 + 0.14^2 + 0.142^2) = 2.385

Between 10 and 20: √(0.446^2 + 0.267^2 + 0.65^2 + 0.443^2 + 0.235^2 +
0.891^2 + 0.256^2 + 0.142^2) = 2.048

Between 13 and 4: √(0.4^2 + 0.665^2 + 0.545^2 + 0.495^2 + 0.204^2 +
0.955^2 + 0.945^2 + 0.172^2) = 2.333

Between 13 and 20: √(0.271^2 + 0.401^2 + 0.748^2 + 0.28^2 + 0.204^2 +
1.54^2 + 1.093^2 + 0.172^2) = 2.577

The maximum of these distances is 2.577, between observations 13 and 20,
which matches the complete linkage distance shown in the dendrogram.
However, the squared terms as listed do not reproduce the stated totals
(the terms for observations 13 and 20 give about 2.12), so the component
differences should be rechecked against the standardized data.
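
The complete linkage calculation described above can be sketched in R as follows. The points are made-up two-dimensional stand-ins, not the standardized utility data, and the observation labels are reused only to mirror the clusters {10, 13} and {4, 20}.

```r
# Made-up 2-D points standing in for standardized observations
pts <- rbind(
  "10" = c(0, 0),
  "13" = c(1, 0),
  "4"  = c(3, 4),
  "20" = c(4, 4)
)

# All pairwise Euclidean distances, as a labelled square matrix
d <- as.matrix(dist(pts, method = "euclidean"))

cluster_a <- c("10", "13")
cluster_b <- c("4", "20")

# Distances between every member of A and every member of B
cross <- d[cluster_a, cluster_b]
print(round(cross, 3))

# Complete linkage distance = the largest cross-cluster distance
complete_dist <- max(cross)
print(complete_dist)
```

Replacing `pts` with the eight standardized variables for the four utilities would give the four distances listed above, and their maximum could then be compared with the dendrogram height.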
