
Case Study

This R-based case study analyzes the Online Retail Dataset, focusing on customer behavior, product sales, and revenue. It includes data transformation and visualization techniques using libraries like ggplot2 and dplyr, performing analyses such as revenue calculation, hypothesis testing, and correlation analysis. Key findings include identifying top products by revenue and revenue distribution across countries, along with statistical tests to evaluate differences and relationships in the data.

Uploaded by

rutvik waghmare

Create an R-based case study that demonstrates data analysis, transformation, manipulation, and visualization techniques using a sample dataset.

or

Create an R-based case study that uses R code to analyze the Online Retail Dataset, a sample dataset with the fields InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.

R-Code:

Step 1: Objective

The Online Retail Dataset contains information about online transactions, including invoice details,
product information, customer IDs, and country of origin. We'll use R to analyze this data and gain
insights into customer behavior, product sales, and revenue.

Step 2: Dataset

The dataset contains the following fields:

• InvoiceNo: Unique invoice number

• StockCode: Product code

• Description: Product description

• Quantity: Number of units sold

• InvoiceDate: Date of invoice

• UnitPrice: Price per unit

• CustomerID: Unique customer ID

• Country: Country of origin

Step 3: R-Code

# Install the required packages (run once; skip any that are already installed)

install.packages(c("ggplot2", "dplyr", "lubridate", "readr"))

# Load necessary libraries

library(ggplot2)
library(dplyr)
library(lubridate)
library(readr)

# Load the dataset (adjust the path to where the CSV file is saved)

retail_data <- read_csv("C:/Users/DELL/Desktop/online_retail.csv")
View(retail_data)

# Explore the data

summary(retail_data)

# Convert InvoiceDate to date-time format
# (ymd_hms() expects year-month-day hour:minute:second values, e.g. "2010-12-01 08:26:00")

retail_data$InvoiceDate <- ymd_hms(retail_data$InvoiceDate)


retail_data$InvoiceDate

# Missing values: replace NA in Quantity with the mean of the observed values
Mean_Quantity <- mean(retail_data$Quantity, na.rm = TRUE)
Mean_Quantity
retail_data$Quantity <- ifelse(is.na(retail_data$Quantity), Mean_Quantity, retail_data$Quantity)
retail_data$Quantity
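
The mean-imputation step above can be checked on a small vector before it is applied to the full dataset. This is only a sketch in base R: the `quantity` values are made up for illustration and are not taken from the Online Retail Dataset.

```r
# Toy vector standing in for retail_data$Quantity (made-up values)
quantity <- c(6, NA, 8, 2, NA, 4)

# Mean of the observed (non-missing) values only
mean_quantity <- mean(quantity, na.rm = TRUE)

# Replace each NA with that mean; observed values are left unchanged
quantity_filled <- ifelse(is.na(quantity), mean_quantity, quantity)
quantity_filled
```

Mean imputation keeps every row but shrinks the spread of the column, so it is worth comparing summary(quantity) with summary(quantity_filled) before relying on it.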

# Calculate total revenue

retail_data$Revenue <- retail_data$Quantity * retail_data$UnitPrice
retail_data$Revenue

# OR, equivalently, with dplyr

retail_data <- retail_data %>%
  mutate(Revenue = Quantity * UnitPrice)
View(retail_data)

# Top 10 products by revenue

top_products <- retail_data %>%
  group_by(Description) %>%
  summarise(TotalRevenue = sum(Revenue)) %>%
  arrange(desc(TotalRevenue)) %>%
  head(10)
top_products

# Visualize top products

ggplot(top_products, aes(x = reorder(Description, TotalRevenue), y = TotalRevenue)) +
  geom_col() +
  xlab("Product") +
  ylab("Revenue") +
  ggtitle("Top 10 Products by Revenue")

# Note: reorder(Description, TotalRevenue) sorts the product names by total revenue,
# so the bars appear in ascending order of revenue.
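
To see what reorder() does to the axis order, it helps to inspect the factor levels it produces. The sketch below uses base R only; the `toy` data frame and its values are hypothetical.

```r
# Hypothetical three-product summary (made-up revenues)
toy <- data.frame(
  Description = c("A", "B", "C"),
  TotalRevenue = c(20, 50, 10)
)

# reorder() returns a factor whose levels are sorted by the second argument
ascending <- reorder(toy$Description, toy$TotalRevenue)
levels(ascending)   # lowest revenue first

# Negating the sort key reverses the order
descending <- reorder(toy$Description, -toy$TotalRevenue)
levels(descending)  # highest revenue first
```

ggplot2 draws discrete axis values in level order, so passing the negated key inside aes() would be one way to put the largest bar first.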
# Revenue by country

Revenue_by_country <- retail_data %>%
  group_by(Country) %>%
  summarise(TotalRevenue = sum(Revenue)) %>%
  arrange(desc(TotalRevenue))
Revenue_by_country

# Visualize Revenue by country

ggplot(Revenue_by_country, aes(x = reorder(Country, TotalRevenue), y = TotalRevenue)) +
  geom_col() +
  labs(title = "Revenue by Country", x = "Country", y = "Revenue")

# Note: reorder(Country, TotalRevenue) sorts the country names by total revenue,
# so the bars appear in ascending order of revenue.

# Descriptive Statistics

summary(retail_data$UnitPrice)
summary(retail_data$Revenue)

# Testing of Hypothesis

# 1) Two-sample t-test

# Null Hypothesis (H0): μ1 = μ2, i.e. there is no significant difference in mean Revenue
# between the United Kingdom and Australia.
# Alternative Hypothesis (H1): μ1 ≠ μ2, i.e. there is a significant difference in mean Revenue
# between the United Kingdom and Australia.

# Filter data for two countries


country1_data <- retail_data %>% filter(Country == "United Kingdom")
country1_data
country2_data <- retail_data %>% filter(Country == "Australia")
country2_data

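# Perform two-sample t-test
# (var.equal = TRUE assumes both countries have the same Revenue variance;
# omit it to run Welch's t-test, which does not make that assumption)

t_test_result <- t.test(country1_data$Revenue, country2_data$Revenue, var.equal = TRUE)
t_test_result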

# Decision rule (α is the chosen significance level):
# If p-value > α, fail to reject H0
# If p-value < α, reject H0
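
The decision rule can also be applied in code by reading the p-value from the test result. The sketch below is an illustration only: `alpha = 0.05` is an assumed significance level, and `group_a` and `group_b` are made-up samples, not revenue figures from the dataset.

```r
alpha <- 0.05  # assumed significance level

# Two made-up samples standing in for the two countries' Revenue columns
group_a <- c(10.2, 11.5, 9.8, 12.1, 10.9)
group_b <- c(14.3, 15.1, 13.8, 16.0, 14.7)

# Pooled-variance two-sample t-test, as in the case study
result <- t.test(group_a, group_b, var.equal = TRUE)

# t.test() returns a list; the p-value is stored in the p.value element
decision <- if (result$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```

The same pattern should work on the case study's own result by using t_test_result$p.value in place of result$p.value.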

# 2) ANOVA
# H0: Mean Revenue does not vary significantly across countries.
# H1: Mean Revenue varies significantly across countries.

# Perform ANOVA
anova_result <- aov(Revenue ~ Country, data = retail_data)
anova_result
summary(anova_result)

# Perform Tukey's HSD test


tukey_result <- TukeyHSD(anova_result)
tukey_result

# Decision rule (α is the chosen significance level):
# If p-value > α, fail to reject H0
# If p-value < α, reject H0
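
For ANOVA, the p-value sits inside the table returned by summary(). The sketch below shows one common way to pull it out in base R; the `toy` data frame and the `alpha` value are assumptions made for illustration.

```r
alpha <- 0.05  # assumed significance level

# Made-up revenues for three hypothetical countries, three rows each
toy <- data.frame(
  Revenue = c(10, 12, 11, 20, 22, 21, 30, 31, 29),
  Country = rep(c("A", "B", "C"), each = 3)
)

fit <- aov(Revenue ~ Country, data = toy)

# summary() returns a list whose first element is the ANOVA table;
# the first entry of its "Pr(>F)" column belongs to the Country term
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
decision
```

A significant ANOVA only says that at least one country mean differs; the Tukey HSD step in the case study is what identifies which pairs differ.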

# 3) Correlation Analysis

# Null Hypothesis (H0): There is no significant relationship between Revenue and Quantity (ρ = 0).
# Alternative Hypothesis (H1): There is a significant relationship between Revenue and Quantity (ρ ≠ 0).

# Perform correlation analysis

correlation_result <- cor.test(retail_data$Revenue, retail_data$Quantity)
correlation_result

# Visualize the relationship


ggplot(retail_data, aes(x = Quantity, y = Revenue)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Revenue and Quantity", x = "Quantity", y = "Revenue")

# Note: se = FALSE hides the shaded confidence interval around the fitted line.

# Fit the linear regression model


model <- lm(Revenue ~ Quantity, data = retail_data)
summary(model)
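
Beyond printing summary(model), the individual numbers can be extracted for reporting. The sketch below fits the same kind of model to a made-up data frame (`toy`), so the values it produces say nothing about the Online Retail Dataset.

```r
# Made-up data in which Revenue rises by roughly 2.5 per unit of Quantity
toy <- data.frame(
  Quantity = 1:6,
  Revenue = c(2.4, 5.1, 7.6, 9.9, 12.7, 15.0)
)

toy_model <- lm(Revenue ~ Quantity, data = toy)

# coef() returns a named vector: "(Intercept)" and one entry per predictor
slope <- coef(toy_model)[["Quantity"]]

# summary() stores the coefficient of determination in r.squared
r_squared <- summary(toy_model)$r.squared

slope
r_squared
```

The slope estimates the change in Revenue for each additional unit of Quantity, and r_squared is the share of the variance in Revenue that the fitted line explains.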
