
Applied Data Science with R
Capstone Project
Nguyen Hoai An
October 15th, 2024
Outline
• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

Executive Summary
• The project aims to collect and analyze real-world datasets through various stages, enhancing data quality and yielding insights.
• The project tackles a challenge that requires data collection, analysis, hypothesis testing, visualization, modeling, and dashboard creation using real-world datasets.
• Key tasks include:
  • Data Collection: Gathering and understanding data from multiple sources.
  • Data Wrangling: Preparing data using regular expressions and Tidyverse.
  • Exploratory Data Analysis: Using SQL and visualization techniques via Tidyverse and ggplot2.
  • Modeling: Building linear regression models using Tidymodels.
  • Dashboard Creation: Developing an interactive dashboard with R Shiny.
Introduction
• Module 1 - Capstone Overview and Data Collection
  • Hands-on Lab: DC with Web Scraping Notebook
  • Hands-on Lab: DC with OpenWeather API Notebook
• Module 2 - Data Wrangling (DW)
  • Hands-on Lab: DW with Regular Expressions Notebook
  • Hands-on Lab: DW with dplyr Notebook
• Module 3 - Performing Exploratory Data Analysis with SQL, Tidyverse & ggplot2
  • Hands-on Lab: EDA with SQL lab using RSQLite
  • Hands-on Lab: EDA with SQL lab using RODBC with IBM DB2
  • Hands-on Lab: EDA with Data Visualization Lab
• Module 4 - Predictive Analysis
  • Hands-on Lab: Building a Baseline Regression Model Lab
  • Hands-on Lab: Improving the Linear Model Lab
• Module 5 - Building an R Shiny Dashboard App
  • Hands-on Lab: Build a bike-sharing demand prediction app
• Module 6 - Present Your Data-Driven Insights
Methodology
• Perform data collection
• Perform data wrangling
• Perform exploratory data analysis (EDA) using SQL and visualization
• Perform predictive analysis using regression models
  • How to build the baseline model
  • How to improve the baseline model
• Build an R Shiny dashboard app
Data collection

Use the 'rvest' library to obtain an HTML table from a web page:

library(rvest)

url <- "https://example.com"
root_node <- read_html(url)
table_nodes <- html_nodes(root_node, "table")
df <- html_table(table_nodes[[1]], fill = TRUE)
write.csv(df, "data.csv", row.names = FALSE)

# Obtain a CSV file directly from a URL
url <- "https://example.com/data.csv"
download.file(url, destfile = "data.csv")

Collect data from an API using httr and jsonlite:

# Load required libraries
library(httr)
library(jsonlite)

url <- "https://api.example.com/data"  # Replace with actual API URL
api_key <- "YOUR_API_KEY"              # Replace with your actual API key
data_query <- list(
  q = "query_term",  # Replace with actual query term or data filter
  appid = api_key,   # Use the API key as a query parameter
  units = "unit"     # Replace with the appropriate unit system (optional)
)
response <- GET(url, query = data_query)
json_result <- content(response, as = "parsed", type = "application/json")
data <- data.frame(
  column_1 = json_result$main$field1,  # Replace with actual data fields
  column_2 = json_result$main$field2,
  column_3 = json_result$field3
)
write.csv(data, file = "data.csv", row.names = FALSE)
Data collection: Web Scraping Notebook

1. Use the 'rvest' library to obtain the HTML table from a web page:
   library(rvest)
   url <- "https://example.com"
   root_node <- read_html(url)
   table_nodes <- html_nodes(root_node, "table")

2. Convert the table into a data frame:
   df <- html_table(table_nodes[[1]], fill = TRUE)

3. Summarize the data frame:
   glimpse(df)  # from dplyr

4. Write the data frame to a CSV file:
   write.csv(df, "file_name.csv", row.names = FALSE)
Data collection with OpenWeather API Notebook

1. API request ('httr' library)
2. Parsing the JSON response ('jsonlite' library)
3. Extracting and storing data
4. Fetching data
5. Creating a data frame
6. Displaying the data frame
7. Saving to CSV
Data wrangling

Data manipulation, cleaning, and transformation workflow with:
• the 'tidyverse' library:
  • readr
  • dplyr
  • stringr
  • tidyr
• the 'fastDummies' package
Data wrangling

• Standardize column names
• Summarize the class of each column
• Clean up the values in the web-scraped dataset
• Detect and handle missing values
• Normalize data
• Create indicator (dummy) variables for categorical variables

The first four tasks are shown with code on the following slides; a sketch of the last two appears below.
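Normalization and dummy-variable creation are not reproduced with code in this deck, so the sketch below shows one plausible approach, assuming min-max scaling and a SEASONS categorical column; the data frame and column names are illustrative, not the project's exact code.

# Min-max normalization and dummy variables (illustrative sketch)
library(dplyr)
library(fastDummies)

# Normalize a numeric column to the [0, 1] range (column name is an assumption)
data_df <- data_df %>%
  mutate(TEMPERATURE = (TEMPERATURE - min(TEMPERATURE, na.rm = TRUE)) /
           (max(TEMPERATURE, na.rm = TRUE) - min(TEMPERATURE, na.rm = TRUE)))

# Create indicator (dummy) variables for a categorical column
data_df <- dummy_cols(data_df,
                      select_columns = "SEASONS",
                      remove_selected_columns = TRUE)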
Data wrangling: standardize column names

# List of datasets
dataset_list <- c('data_1.csv', 'data_2.csv')

# Load necessary libraries
library(readr)
library(stringr)  # for the str_replace_all() function

# Loop through each dataset
for (dataset_name in dataset_list) {
  # Read dataset without column specification messages
  dataset <- read_csv(dataset_name, show_col_types = FALSE)

  # Standardize column names: convert to uppercase and replace spaces with underscores
  colnames(dataset) <- str_replace_all(toupper(colnames(dataset)), " ", "_")

  # Save the standardized dataset
  write.csv(dataset, dataset_name, row.names = FALSE)
}
Data wrangling: summarize the class of each column

# Load necessary libraries
library(readr)
library(dplyr)
library(tidyr)

# Read the dataset
data_df <- read_csv("data_1.csv", show_col_types = FALSE)

# Select specific columns
df <- data_df %>% select(column1, column2, column3)

# Summarize the class of each column and gather the results
df %>%
  summarize_all(class) %>%
  gather(variable, class)
Data wrangling: clean up the values in the web-scraped dataset

# Check if the column contains any "strange" characters (e.g. reference links such as [12])
library(stringr)
ref_pattern <- "\\[[A-z0-9]+\\]"
find_ref_pattern <- function(strings) grepl(ref_pattern, strings)
df %>%
  select(column_2) %>%
  filter(find_ref_pattern(column_2)) %>%
  slice(1:10)

# Check if the column is purely numeric
find_character <- function(strings) grepl("[^0-9]", strings)
df %>%
  select(column_1) %>%
  filter(find_character(column_1)) %>%
  slice(1:10)

# Clean column_1: remove all non-numeric characters
df <- df %>%
  mutate(column_1 = str_replace_all(column_1, "[^0-9]", ""))

# Remove reference links from a character column
remove_ref <- function(strings) {
  ref_pattern <- "\\[[A-z0-9]+\\]"
  # Replace all matched substrings using str_replace_all()
  result <- str_replace_all(strings, ref_pattern, "")
  # Trim the result to remove any extra spaces
  result <- str_trim(result)
  return(result)
}

# Apply the remove_ref function to column_2 and column_3
df <- df %>%
  mutate(column_2 = remove_ref(column_2),
         column_3 = remove_ref(column_3))

# Then check the result
Data wrangling: detect and handle missing values

# Take a quick look at the dataset
summary(data_df)

# Option 1: drop the rows with NA values in the column
library(dplyr)
data_df <- data_df %>%
  filter(!is.na(column_X))

# Option 2: impute missing values in column_X with the column mean
mean_value <- mean(data_df$column_X, na.rm = TRUE)
data_df <- data_df %>%
  mutate(column_X = ifelse(is.na(column_X), mean_value, column_X))
EDA with SQL

Run RSQLite and establish a connection:
library("RSQLite")
db_path <- "dbname.sqlite"
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)

Load data into the database:
library(readr)
dbWriteTable(con, "table_name",
             read_csv("File.csv", show_col_types = FALSE),
             overwrite = TRUE)

Counting records:
T_count <- dbGetQuery(con, "SELECT COUNT(*) AS total_records FROM table_name")

Summing a column:
T_value <- dbGetQuery(con, "SELECT SUM(column_name) AS total_value FROM table_name")

Finding averages:
A_value <- dbGetQuery(con, "SELECT AVG(column_name) AS average_value FROM table_name")

Finding minimum and maximum values:
Min_max <- dbGetQuery(con, "SELECT MIN(column_name) AS min_value,
                                   MAX(column_name) AS max_value FROM table_name")

Grouping and aggregating data:
Group_1 <- dbGetQuery(con, "
  SELECT group_column,
         COUNT(*) AS total_records,
         AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY group_column")

Data filtering:
F_data <- dbGetQuery(con, "
  SELECT *
  FROM table_name
  WHERE condition1 AND condition2 AND ...")

Detecting trends over time:
Trend_data <- dbGetQuery(con, "
  SELECT time_column,
         COUNT(*) AS total_records,
         AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY time_column
  ORDER BY time_column")
EDA with SQL

Seasonality patterns:
Season_1 <- dbGetQuery(con, "
  SELECT season_column,
         COUNT(*) AS total_records,
         AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY season_column
  ORDER BY season_column")

Outlier detection (note: base SQLite has no STDDEV aggregate; with RSQLite it can be made available via initExtension(con), where the function is named stdev):
Group_1 <- dbGetQuery(con, "
  SELECT *
  FROM table_name
  WHERE numeric_column > (SELECT AVG(numeric_column) + 2 * STDDEV(numeric_column) FROM table_name)
     OR numeric_column < (SELECT AVG(numeric_column) - 2 * STDDEV(numeric_column) FROM table_name)")

Finding similarities between groups:
F_data <- dbGetQuery(con, "
  SELECT group_column,
         AVG(numeric_column) AS average_value,
         COUNT(*) AS total_records
  FROM table_name
  GROUP BY group_column
  ORDER BY average_value DESC")

Clustering and similarity:
Trend_data <- dbGetQuery(con, "
  SELECT group_column,
         AVG(numeric_column) AS average_value,
         COUNT(*) AS total_records
  FROM table_name
  GROUP BY group_column
  HAVING AVG(numeric_column) BETWEEN some_value AND another_value
  ORDER BY average_value")
EDA with data visualization

Use tidyverse and ggplot2 in R:
library(tidyverse)
library(ggplot2)

Create histograms:
data_frame %>%
  ggplot(aes(x = numeric_column)) +
  geom_histogram(binwidth = some_value, fill = "blue", color = "black") +
  labs(title = "Histogram of numeric_column", x = "numeric_column", y = "Frequency") +
  theme_minimal()

Generate scatterplots:
data_frame %>%
  ggplot(aes(x = numeric_column1, y = numeric_column2)) +
  geom_point(color = "blue", size = 2) +
  labs(title = "Scatterplot of numeric_column1 vs numeric_column2",
       x = "numeric_column1", y = "numeric_column2") +
  theme_minimal()

Employ box plots:
data_frame %>%
  ggplot(aes(x = categorical_column, y = numeric_column)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(title = "Box Plot of numeric_column by categorical_column",
       x = "categorical_column", y = "numeric_column") +
  theme_minimal()
Predictive analysis

1. Define the objective
2. Prepare data (collect predictors and the target variable)
3. Build initial models (linear regression)
4. Identify key predictors (analyze coefficients)
5. Add polynomial and interaction terms
6. Manage complexity and overfitting
7. Apply regularization (e.g., Lasso or Ridge)
8. Evaluate models (MSE, RMSE, R-squared)
9. Refine models (adjust terms, compare)
10. Select the final model

A minimal sketch of this workflow follows below.
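The sketch below uses Tidymodels, which the project names for modeling. The data object (bike_df), column names, split proportion, and the Lasso penalty are illustrative assumptions, not the project's exact settings.

library(tidymodels)

# Split the wrangled data into training and testing sets
set.seed(1234)
split    <- initial_split(bike_df, prop = 0.8)  # bike_df: prepared dataset (assumed)
train_df <- training(split)
test_df  <- testing(split)

# Baseline linear regression
lm_spec <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")
baseline_fit <- lm_spec %>%
  fit(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY, data = train_df)

# Improved model: polynomial and interaction terms
improved_fit <- lm_spec %>%
  fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 4) + poly(HUMIDITY, 4) +
        TEMPERATURE * HUMIDITY, data = train_df)

# Regularized model: Lasso (mixture = 1) via glmnet; penalty is an assumption
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")
lasso_fit  <- lasso_spec %>% fit(RENTED_BIKE_COUNT ~ ., data = train_df)

# Evaluate on the test set with RMSE and R-squared
test_results <- test_df %>%
  mutate(.pred = predict(lasso_fit, new_data = test_df)$.pred)
test_results %>% metrics(truth = RENTED_BIKE_COUNT, estimate = .pred)

yardstick's metrics() reports RMSE, R-squared, and MAE, matching the evaluation metrics listed above.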
Build an R Shiny dashboard

• Integrate regression models (predict hourly demand using weather, date, and time data)
• Display an interactive map (Leaflet map showing cities with predicted bike demand for the next five days)
• Enable user interaction (dropdown to select a specific city, or "All" for an overview)
• Generate detailed plots (ggplot charts of demand trends for the selected city, including temperature and humidity)
• Visualize data trends (line charts for temperature and demand over five days; scatterplot for the demand vs. humidity correlation)

A structural sketch of such an app follows below.
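A minimal structural sketch of the app described above, using shiny and leaflet. The data objects (city_df, forecast_df) and their columns are assumptions for illustration, not the project's actual code.

library(shiny)
library(leaflet)
library(ggplot2)

ui <- fluidPage(
  titlePanel("Bike-sharing Demand Prediction App"),
  sidebarLayout(
    sidebarPanel(
      # Dropdown to select a specific city, or "All" for the overview map
      selectInput("city", "Select city:",
                  choices = c("All", "Seoul", "London", "Suzhou", "New York", "Paris"))
    ),
    mainPanel(
      leafletOutput("demand_map"),  # interactive map of predicted demand
      plotOutput("trend_plot")      # demand trend for the selected city
    )
  )
)

server <- function(input, output) {
  output$demand_map <- renderLeaflet({
    df <- if (input$city == "All") city_df else subset(city_df, CITY == input$city)
    leaflet(df) %>%
      addTiles() %>%
      addCircleMarkers(lng = ~LNG, lat = ~LAT,
                       radius = ~MAX_PRED / 200,  # marker size tracks predicted demand
                       label = ~paste(CITY, "max predicted demand:", MAX_PRED))
  })
  output$trend_plot <- renderPlot({
    req(input$city != "All")                       # detailed plot needs one city
    df <- subset(forecast_df, CITY == input$city)  # hourly forecasts (assumed)
    ggplot(df, aes(x = FORECASTDATETIME, y = BIKE_PREDICTION)) +
      geom_line() +
      labs(x = "Time", y = "Predicted bike demand")
  })
}

shinyApp(ui, server)

The temperature line chart and the demand vs. humidity scatterplot described above follow the same renderPlot pattern.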
Results
• Exploratory data analysis results

• Predictive analysis results

• A dashboard demo in screenshots

EDA with SQL

Busiest bike rental times

• The result is a data frame displaying the top 10 bike-rental records from the SEOUL_BIKE_SHARING table, showing the highest rental counts together with their DATE and HOUR.

• The data shows a consistent pattern of high rentals during the same hour (18:00) across multiple days in June and September 2018, suggesting that this time slot is particularly popular for bike rentals.
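A hedged reconstruction of the query behind this result; the column list and ordering are assumptions based on the description.

busiest <- dbGetQuery(con, "
  SELECT DATE, HOUR, RENTED_BIKE_COUNT
  FROM SEOUL_BIKE_SHARING
  ORDER BY RENTED_BIKE_COUNT DESC
  LIMIT 10")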
Hourly popularity and temperature by season

To find hourly popularity and temperature by season:

• The query retrieves the average temperature and average bike rentals for each season and hour from the SEOUL_BIKE_SHARING table.
• It groups the data by both SEASONS and HOUR, then orders the results by average bike rentals in descending order.
• The top 10 results are stored in the variable avg_hourly_temp_bikes.
• The table shows that the highest average bike rentals occur during the summer months, particularly in the late afternoon and early evening hours. The average temperature during these peak rental hours is generally warm, suggesting that people in Seoul tend to use bike-sharing services more frequently on warm evenings.
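A hedged reconstruction of this query; the variable name comes from the text, while the alias names are assumptions.

avg_hourly_temp_bikes <- dbGetQuery(con, "
  SELECT SEASONS, HOUR,
         AVG(TEMPERATURE) AS avg_temperature,
         AVG(RENTED_BIKE_COUNT) AS avg_bike_count
  FROM SEOUL_BIKE_SHARING
  GROUP BY SEASONS, HOUR
  ORDER BY avg_bike_count DESC
  LIMIT 10")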
Rental seasonality

• The result retrieves seasonal bike rental statistics from the SEOUL_BIKE_SHARING table: the average, minimum, maximum, and standard deviation of bike rentals for each season.
• Overall, this data frame provides insight into seasonal patterns in bike rentals, highlighting that Summer not only has the highest average rentals but also the greatest variability, while Winter shows lower averages and more consistent rental counts.
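A hedged reconstruction of this query. Because base SQLite has no STDDEV aggregate, the standard deviation is derived from averages here; the original lab may instead have loaded RSQLite's extension functions.

seasonal_stats <- dbGetQuery(con, "
  SELECT SEASONS,
         AVG(RENTED_BIKE_COUNT) AS avg_count,
         MIN(RENTED_BIKE_COUNT) AS min_count,
         MAX(RENTED_BIKE_COUNT) AS max_count,
         SQRT(AVG(RENTED_BIKE_COUNT * RENTED_BIKE_COUNT)
              - AVG(RENTED_BIKE_COUNT) * AVG(RENTED_BIKE_COUNT)) AS std_count
  FROM SEOUL_BIKE_SHARING
  GROUP BY SEASONS")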
Weather seasonality

• The result retrieves seasonal statistics from the SEOUL_BIKE_SHARING table: averages of the rented bike count and of weather conditions such as temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall for each season.

• Overall, this data highlights seasonal trends in weather conditions and bike usage, showing that warmer seasons correlate with higher bike rentals and more favorable weather.
Bike-sharing info in Seoul

• The query joins two tables, WORLD_CITIES and BIKE_SHARING_SYSTEMS, to get information about bike sharing in Seoul.

• The resulting data frame includes the city name, country, latitude, longitude, and population, and indicates that there are 20,000 bikes in the bike-sharing system in Seoul, South Korea.
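A hedged reconstruction of the join; the join key and column names are assumptions based on the description.

seoul_info <- dbGetQuery(con, "
  SELECT W.CITY, W.COUNTRY, W.LAT, W.LNG, W.POPULATION, B.BICYCLES
  FROM WORLD_CITIES W
  JOIN BIKE_SHARING_SYSTEMS B ON W.CITY = B.CITY
  WHERE W.CITY = 'Seoul'")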
Cities similar to Seoul

• The code retrieves data about cities whose number of bicycles is between 15,000 and 20,000 from the WORLD_CITIES and BIKE_SHARING_SYSTEMS tables.
• A join operation is performed on the city names to combine information from both tables.
• The result provides insight into cities with moderately sized bike-sharing systems, particularly in China, showing their geographical coordinates and population alongside the number of bicycles available.
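The same join with a BETWEEN filter reproduces this result; again a hedged sketch with assumed column names.

similar_cities <- dbGetQuery(con, "
  SELECT W.CITY, W.COUNTRY, W.LAT, W.LNG, W.POPULATION, B.BICYCLES
  FROM WORLD_CITIES W
  JOIN BIKE_SHARING_SYSTEMS B ON W.CITY = B.CITY
  WHERE B.BICYCLES BETWEEN 15000 AND 20000")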
EDA with Visualization

Bike rental vs. Date

The plot shows rented bike counts over time, with denser clusters of points toward the middle of 2018, indicating higher bike rental activity around Summer and Autumn. The counts drop off at both ends (early and late 2018), suggesting lower rentals in colder months or off-peak times.
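A hedged sketch of such a plot, assuming a seoul_bike data frame with DATE and RENTED_BIKE_COUNT columns.

library(ggplot2)
ggplot(seoul_bike, aes(x = DATE, y = RENTED_BIKE_COUNT)) +
  geom_point(alpha = 0.3, color = "blue") +
  labs(title = "Rented bike count vs. date", x = "Date", y = "Rented bike count")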
Bike rental vs. Datetime

• This plot helps to observe both seasonal and hourly trends in bike rentals.
• It reveals a consistent pattern of high rentals during the same hour (18:00) throughout 2018, suggesting that this time slot is particularly popular for bike rentals.
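A hedged sketch, adding HOUR as the color aesthetic so the hourly structure is visible within the yearly trend (names assumed as before).

ggplot(seoul_bike, aes(x = DATE, y = RENTED_BIKE_COUNT, color = HOUR)) +
  geom_point(alpha = 0.4) +
  labs(title = "Rented bike count vs. datetime", x = "Date", y = "Rented bike count")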
Bike rental histogram

• This plot shows bike rentals with an overlaid kernel density curve.
• It reveals both the discrete and continuous distributions of bike rental counts.
• The distribution is right-skewed: there are more days with lower rented bike counts than with higher counts.
• The peak of the distribution is around 500 rented bikes, suggesting that this is the most common number of bikes rented in a day.
• There are some outliers on the right side of the plot, representing days with exceptionally high numbers of rented bikes. These might be due to special events, holidays, or other factors that increase demand for bike rentals.
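A hedged sketch of a histogram with an overlaid kernel density curve; the binwidth is an assumption.

ggplot(seoul_bike, aes(x = RENTED_BIKE_COUNT)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 100,
                 fill = "lightblue", color = "black") +
  geom_density(color = "red") +
  labs(title = "Distribution of rented bike count",
       x = "Rented bike count", y = "Density")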
Daily total rainfall and snowfall

The plot provides a clear comparison of daily total rainfall and snowfall over 2018. Rainfall was the predominant form of precipitation, with higher frequency and intensity during the warmer months (April to October), while snowfall was concentrated in the colder months, primarily January.
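A hedged sketch of the comparison, assuming hourly RAINFALL and SNOWFALL columns summed per day.

library(dplyr)
library(tidyr)

daily_precip <- seoul_bike %>%
  group_by(DATE) %>%
  summarise(RAINFALL = sum(RAINFALL), SNOWFALL = sum(SNOWFALL)) %>%
  pivot_longer(c(RAINFALL, SNOWFALL), names_to = "type", values_to = "amount")

ggplot(daily_precip, aes(x = DATE, y = amount, color = type)) +
  geom_line() +
  labs(title = "Daily total rainfall and snowfall", x = "Date", y = "Amount")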
Predictive analysis

Ranked coefficients

• Based on the chart, the most important factors influencing bike-sharing demand are weather-related variables such as RAINFALL, HUMIDITY, DEW_POINT_TEMPERATURE, and TEMPERATURE. These variables appear to have a stronger impact on bike-sharing usage than seasonal variations, holidays, or other environmental factors such as SOLAR_RADIATION, SNOWFALL, VISIBILITY, and WIND_SPEED.

• Among the HOUR dummy variables, we can see which times of day correlate strongly with bike-sharing demand, suggesting that those time slots are particularly popular for bike rentals.
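A hedged sketch of how such a ranked-coefficient chart can be produced from a fitted model using broom::tidy(); the model object name is an assumption.

library(broom)
library(dplyr)
library(ggplot2)

coef_df <- tidy(baseline_fit) %>%        # one row per term: estimate, std.error, ...
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(desc(abs_estimate))

# Horizontal bar chart of coefficient magnitudes, largest first
ggplot(coef_df, aes(x = reorder(term, abs_estimate), y = abs_estimate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Variable", y = "|Coefficient|")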
Model evaluation

Built at least five different models using polynomial terms, interaction terms, and regularization.
Find the best performing model

• Best model based on RMSE (308.3473): Lasso model
• Best model based on R-squared (0.7628875): Lasso model
• Model formula:

formula <- RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 6) +
  poly(DEW_POINT_TEMPERATURE, 6) + SUMMER + poly(SOLAR_RADIATION, 6) +
  H_18 + poly(VISIBILITY, 3) + AUTUMN + H_19 + H_17 + poly(WIND_SPEED, 5) +
  H_20 + H_21 + H_8 + H_16 + H_22 + NO_HOLIDAY + H_15 + H_14 + SPRING +
  H_13 + H_12 + H_23 + H_9 + H_7 + H_11 + H_0 + H_10 + HOLIDAY + H_1 +
  poly(RAINFALL, 6) + H_2 + H_6 + poly(SNOWFALL, 6) + H_3 + H_5 + H_4 +
  poly(HUMIDITY, 5) + WINTER
Q-Q plot of the best model
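The Q-Q figure itself is not reproduced here; below is a hedged sketch of one way such a plot can be drawn, assuming a test_results data frame holding the truth column and a .pred column of model predictions.

ggplot(test_results) +
  stat_qq(aes(sample = RENTED_BIKE_COUNT), color = "green") +
  stat_qq(aes(sample = .pred), color = "red") +
  labs(title = "Q-Q plot: actual (green) vs. predicted (red) rented bike count")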
Dashboard

Max bike prediction overview map

• Bike-sharing Demand Prediction App: the title of the application.
• World Map: depicts the globe, providing visual context for the data being displayed.
• City Markers: show the predicted bike-rental demand for each city.
• Weather Indicators: show the predicted weather conditions.
• City Selector: allows users to select a specific city for detailed analysis; options include "All" for a global view.
Predicted bike-sharing demand in London
• Map: shows the location of London.
• City Marker (green): indicates low predicted bike-rental demand.
• Weather Information: predicted weather conditions in London.
• Temperature Chart: shows the temperature in London over time.
• Bike Count Prediction Chart: displays the predicted number of bikes in London for the next three hours.
• Time and Bike Count Prediction: indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Seoul
• Map: shows the location of Seoul.
• City Marker (yellow): indicates medium predicted bike-rental demand.
• Weather Information: predicted weather conditions in Seoul.
• Temperature Chart: shows the temperature in Seoul over time.
• Bike Count Prediction Chart: displays the predicted number of bikes in Seoul for the next three hours.
• Time and Bike Count Prediction: indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Suzhou
• Map: shows the location of Suzhou.
• City Marker (yellow): indicates medium predicted bike-rental demand.
• Weather Information: predicted weather conditions in Suzhou.
• Temperature Chart: shows the temperature in Suzhou over time.
• Bike Count Prediction Chart: displays the predicted number of bikes in Suzhou for the next three hours.
• Time and Bike Count Prediction: indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in New York
• Map: shows the location of New York.
• City Marker (yellow): indicates medium predicted bike-rental demand.
• Weather Information: predicted weather conditions in New York.
• Temperature Chart: shows the temperature in New York over time.
• Bike Count Prediction Chart: displays the predicted number of bikes in New York for the next three hours.
• Time and Bike Count Prediction: indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Paris
• Map: shows the location of Paris.
• City Marker (yellow): indicates medium predicted bike-rental demand.
• Weather Information: predicted weather conditions in Paris.
• Temperature Chart: shows the temperature in Paris over time.
• Bike Count Prediction Chart: displays the predicted number of bikes in Paris for the next three hours.
• Time and Bike Count Prediction: indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: shows the relationship between bike predictions and humidity levels.
CONCLUSION
• The comprehensive workflow undertaken in this project highlights the critical stages of data handling, modeling, and visualization, ultimately aimed at predicting bike demand from weather and time factors.
• Using SQL together with visualization tools such as Tidyverse and ggplot2 enabled a thorough exploration of the data, yielding insights into patterns and trends that informed the modeling effort.
• Building a baseline linear regression model, followed by polynomial and regularized models with Tidymodels, illustrated the iterative process of model refinement needed to identify the best-performing approach for predicting bike demand.
• The R Shiny application integrates the regression models for hourly bike demand prediction, featuring an interactive map and detailed visualizations that enhance user engagement and provide insight into demand trends and variable relationships such as weather, date, time, and humidity.
APPENDIX 1. Data Collection
Web Scraping Notebook and OpenWeather API Notebook

APPENDIX 2. Data wrangling

APPENDIX 3. EDA with SQL

APPENDIX 3. EDA with data visualization

APPENDIX 4. Predictive analysis

APPENDIX 5. Build an R Shiny dashboard
Predicted bike-sharing demand in London, Seoul, Suzhou, New York, and Paris