Data analysis with R
Science with R
Capstone project
Nguyen Hoai An
October 15th, 2024
Outline
• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix
Executive Summary
• The project collects and analyzes real-world datasets through several stages, improving data quality and extracting insights along the way.
• It tackles a challenge that requires data collection, analysis, hypothesis testing, visualization, modeling, and dashboard creation using real-world datasets.
• Key tasks include:
  • Data Collection: Gathering and understanding data from multiple sources.
  • Data Wrangling: Preparing data using regular expressions and Tidyverse.
  • Exploratory Data Analysis: Using SQL and visualization techniques via Tidyverse and ggplot2.
  • Modeling: Building linear regression models using Tidymodels.
  • Dashboard Creation: Developing an interactive dashboard with R Shiny.
Introduction
Hands-on labs by module:
• Module 1 - Capstone Overview and Data Collection (DC)
  • DC with Web Scraping Notebook
  • DC with OpenWeather API Notebook
• Module 2 - Data Wrangling (DW)
  • DW with Regular Expressions Notebook
  • DW with dplyr Notebook
• Module 3 - Performing Exploratory Data Analysis with SQL, Tidyverse & ggplot2
  • EDA with SQL lab using RSQLite
  • EDA with SQL lab using RODBC with IBM DB2
  • EDA with Data Visualization Lab
• Module 4 - Predictive Analysis
  • Building a Baseline Regression Model Lab
  • Improving the Linear Model Lab
Methodology
• Perform data collection
• Perform data wrangling
• Perform exploratory data analysis (EDA) using SQL and
visualization
• Perform predictive analysis using regression models
• How to build the baseline model
• How to improve the baseline model
• Build an R Shiny dashboard app
Data collection
Use the 'rvest' library to obtain an HTML table from a web page; collect data from an API using 'httr' and 'jsonlite'.
1 Use the 'rvest' library to obtain an HTML table from a web page.
library(rvest)
url <- "https://example.com"
root_node <- read_html(url)
table_nodes <- html_nodes(root_node, "table")
2 Convert the table into a data frame.
df <- html_table(table_nodes[[1]], fill = TRUE)
4 Write the data frame to a CSV file.
write.csv(df, "File_name", row.names = FALSE)
Data collection: API request ('httr' library)
• Fetching data
• Saving to CSV
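The API steps named on this slide (fetching data, saving to CSV) can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the helper names `fetch_forecast` and `parse_forecast`, the sample JSON, and the output file name are assumptions made for illustration, and the request targets the OpenWeather 5-day forecast endpoint with a user-supplied API key. The parsing step is shown on a small hand-written JSON string so that it runs without network access.

```r
library(jsonlite)

# Hypothetical helper: request a forecast for one city.
# Requires the 'httr' package and a valid OpenWeather API key.
fetch_forecast <- function(city, api_key) {
  response <- httr::GET(
    "https://api.openweathermap.org/data/2.5/forecast",
    query = list(q = city, appid = api_key, units = "metric")
  )
  httr::stop_for_status(response)
  httr::content(response, as = "text", encoding = "UTF-8")
}

# Hypothetical helper: turn the JSON text into a flat data frame.
parse_forecast <- function(json_text) {
  parsed <- fromJSON(json_text, flatten = TRUE)
  data.frame(
    city = parsed$city$name,
    forecast_datetime = parsed$list$dt_txt,
    temperature = parsed$list$main.temp,
    humidity = parsed$list$main.humidity,
    stringsAsFactors = FALSE
  )
}

# Hand-written sample in the shape of a forecast response (values are made up).
sample_json <- '{"city":{"name":"Seoul"},"list":[
  {"dt_txt":"2024-10-15 12:00:00","main":{"temp":18.5,"humidity":60}},
  {"dt_txt":"2024-10-15 15:00:00","main":{"temp":19.1,"humidity":55}}]}'

forecast_df <- parse_forecast(sample_json)

# Save the parsed forecast to a CSV file (temporary file used here).
csv_path <- file.path(tempdir(), "forecast_sample.csv")
write.csv(forecast_df, csv_path, row.names = FALSE)
```

With a real key, `parse_forecast(fetch_forecast("Seoul", api_key))` would replace the sample string.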
Data wrangling
# List of datasets
dataset_list <- c('data_1.csv', 'data_2.csv')
Data wrangling: cleaning up the values in the web-scraped dataset
library(dplyr)
library(stringr)
# Check whether the column contains any "strange" characters, such as reference links like [1]
ref_pattern <- "\\[[A-Za-z0-9]+\\]"
find_ref_pattern <- function(strings) grepl(ref_pattern, strings)
df %>%
  select(column_2) %>%
  filter(find_ref_pattern(column_2)) %>%
  slice(1:10)
# Clean and replace all non-numeric characters
df <- df %>%
  mutate(column_1 = str_replace_all(column_1, "[^0-9]", ""))
# Apply the remove_ref function to column_2 and column_3
df <- df %>%
  mutate(column_2 = remove_ref(column_2),
         column_3 = remove_ref(column_3))
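The slide calls `remove_ref` without showing its definition. One plausible definition, offered as an assumption rather than the notebook's actual code, strips bracketed reference markers such as `[12]` or `[a]` and trims the leftover whitespace:

```r
library(stringr)

# Assumed definition of remove_ref: drop bracketed reference markers
# like "[12]" and trim surrounding whitespace.
remove_ref <- function(strings) {
  ref_pattern <- "\\[[A-Za-z0-9]+\\]"
  str_trim(str_replace_all(strings, ref_pattern, ""))
}

# Made-up example values
cleaned <- remove_ref(c("Seoul[12]", "1500 [a]", "no reference"))
```

The character class is written as `A-Za-z` rather than `A-z` because the `A-z` range also matches the punctuation characters that sit between `Z` and `a` in ASCII.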
# Drop the rows where column_X is NA
library(dplyr)
data_df <- data_df %>%
  filter(!is.na(column_X))
# Impute missing values for column_X with the mean value
# (Mean_value must be computed beforehand from the non-missing values)
library(dplyr)
data_df <- data_df %>%
  mutate(column_X = ifelse(is.na(column_X), Mean_value, column_X))
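The imputation snippet uses `Mean_value` without showing where it comes from. A minimal sketch, using a made-up five-row data frame, computes it with `mean(..., na.rm = TRUE)` before filling the gaps:

```r
library(dplyr)

# Made-up example data with two missing values
data_df <- tibble(column_X = c(10, NA, 30, NA, 20))

# Mean of the non-missing values
Mean_value <- mean(data_df$column_X, na.rm = TRUE)

# Replace each NA with the mean; other values are left unchanged
imputed_df <- data_df %>%
  mutate(column_X = if_else(is.na(column_X), Mean_value, column_X))
```

`if_else` is used here instead of `ifelse` only because it checks that both branches have the same type; either works for a numeric column.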
EDA with SQL
Run RSQLite and establish connection.
library("RSQLite")
db_path <- "dbname.sqlite"
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
Load Data into Database
library(readr)
dbWriteTable(con, "table_name", read_csv("File.csv",
  show_col_types = FALSE), overwrite = TRUE)
Counting Records
T_count <- dbGetQuery(con, "SELECT COUNT(*) AS
  total_records FROM table_name")
Summing a Column
T_value <- dbGetQuery(con, "SELECT
  SUM(column_name) AS total_value FROM table_name")
Finding Averages
A_value <- dbGetQuery(con, "SELECT AVG(column_name)
  AS average_value FROM table_name")
Finding Minimum and Maximum Values
Min_max <- dbGetQuery(con, "SELECT
  MIN(column_name) AS min_value, MAX(column_name)
  AS max_value FROM table_name")
Grouping and Aggregating Data
Group_1 <- dbGetQuery(con, "
  SELECT group_column,
    COUNT(*) AS total_records,
    AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY group_column")
Data Filtering
F_data <- dbGetQuery(con, "
  SELECT *
  FROM table_name
  WHERE condition1 AND condition2 AND ...")
Detecting Trends Over Time
Trend_data <- dbGetQuery(con, "
  SELECT time_column,
    COUNT(*) AS total_records,
    AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY time_column
  ORDER BY time_column")
EDA with SQL
Seasonality Patterns
Season_1 <- dbGetQuery(con, "
  SELECT season_column,
    COUNT(*) AS total_records,
    AVG(numeric_column) AS average_value
  FROM table_name
  GROUP BY season_column
  ORDER BY season_column")
Outlier Detection
# Note: STDDEV exists in IBM DB2 but is not a built-in SQLite function
Outlier_1 <- dbGetQuery(con, "
  SELECT *
  FROM table_name
  WHERE numeric_column > (SELECT AVG(numeric_column) +
      2 * STDDEV(numeric_column) FROM table_name)
    OR numeric_column < (SELECT AVG(numeric_column) - 2
      * STDDEV(numeric_column) FROM table_name)")
Generate Scatterplots.
data_frame %>%
  ggplot(aes(x = numeric_column1, y = numeric_column2)) +
  geom_point(color = "blue", size = 2) +
  labs(title = "Scatterplot of numeric_column1 vs numeric_column2", x = "numeric_column1",
    y = "numeric_column2") + theme_minimal()
Employ Box Plots.
data_frame %>%
  ggplot(aes(x = categorical_column, y = numeric_column)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(title = "Box Plot of numeric_column by categorical_column", x = "categorical_column",
    y = "numeric_column") + theme_minimal()
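The outlier query on this slide relies on a `STDDEV` aggregate, which SQLite does not provide out of the box. One possible workaround, sketched here on a made-up in-memory table, is to compute the mean and the population variance in SQL via the identity Var(x) = E[x²] − (E[x])², take the square root in R, and pass the two thresholds back as query parameters. The table contents are invented for illustration.

```r
library(DBI)
library(RSQLite)

# In-memory database with a made-up numeric column (one obvious outlier)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "table_name",
             data.frame(numeric_column = c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)))

# Mean and population variance computed in SQL
stats <- dbGetQuery(con, "
  SELECT AVG(numeric_column) AS mean_value,
         AVG(numeric_column * numeric_column)
           - AVG(numeric_column) * AVG(numeric_column) AS variance_value
  FROM table_name")

# Square root taken in R, then the 2-standard-deviation thresholds
sd_value <- sqrt(stats$variance_value)
upper <- stats$mean_value + 2 * sd_value
lower <- stats$mean_value - 2 * sd_value

# Rows outside the thresholds
outliers <- dbGetQuery(con, "
  SELECT *
  FROM table_name
  WHERE numeric_column > ? OR numeric_column < ?",
  params = list(upper, lower))

dbDisconnect(con)
```

Note that this gives the population standard deviation; a sample standard deviation would need the n/(n-1) correction.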
Predictive analysis
• Define Objective
• Prepare Data (collect predictors and target variable)
Build an R Shiny dashboard
• Integrate Regression Models (Predict hourly demand using weather, date, and time data)
• Display Interactive Map (Leaflet map showing cities with predicted bike demand for the next five days)
• Enable User Interaction - Dropdown to select specific city or "All" for overview
• Generate Detailed Plots - ggplot to show demand trends for selected city, including temperature and
humidity
• Visualize Data Trends - Line charts for temperature and demand over 5 days - Scatterplot for demand vs.
humidity correlation
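A stripped-down version of such a dashboard might look like the sketch below. It is an illustration under assumptions, not the project's app: the `city_predictions` data frame, its demand values, and the input/output IDs are all hypothetical, and the regression model is replaced by a fixed table of predicted values. It shows only the dropdown, the Leaflet map, and one ggplot chart.

```r
library(shiny)
library(leaflet)
library(ggplot2)

# Hypothetical predicted demand per city (values are made up)
city_predictions <- data.frame(
  city = c("Seoul", "London"),
  lat = c(37.5665, 51.5074),
  lng = c(126.9780, -0.1278),
  max_demand = c(1200, 300)
)

ui <- fluidPage(
  titlePanel("Bike-sharing demand prediction"),
  sidebarLayout(
    sidebarPanel(
      # Dropdown to select a specific city or "All" for the overview
      selectInput("city", "City", choices = c("All", city_predictions$city))
    ),
    mainPanel(
      leafletOutput("city_map"),
      plotOutput("demand_plot")
    )
  )
)

server <- function(input, output, session) {
  # Rows for the selected city, or every row when "All" is chosen
  selected <- reactive({
    if (input$city == "All") {
      city_predictions
    } else {
      city_predictions[city_predictions$city == input$city, ]
    }
  })

  # Interactive map with one marker per selected city
  output$city_map <- renderLeaflet({
    map <- leaflet(selected())
    map <- addTiles(map)
    addCircleMarkers(map, lng = ~lng, lat = ~lat,
                     label = ~paste(city, max_demand))
  })

  # Bar chart of predicted maximum demand
  output$demand_plot <- renderPlot({
    ggplot(selected(), aes(x = city, y = max_demand)) +
      geom_col(fill = "blue") +
      labs(x = "City", y = "Predicted max demand") +
      theme_minimal()
  })
}

# Build the app object; call runApp(app) to launch it interactively
app <- shinyApp(ui, server)
```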
Results
• Exploratory data analysis results
EDA with SQL
Busiest bike rental times
Hourly popularity and temperature by season
To find hourly popularity and temperature by season:
• The query retrieves the average temperature and average bike rentals for each season and hour from the SEOUL_BIKE_SHARING table.
• It groups the data by both SEASONS and HOUR, then orders the results by average bike rentals in descending order.
• The table shows that the highest average bike rentals occur during the summer months, particularly in the late afternoon and early evening hours. The average temperature during these peak rental hours is generally warm. This suggests that people in Seoul tend to use bike-sharing services more frequently on warm evenings.
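The query described above can be sketched as follows. The table and the SEASONS and HOUR columns are named in the text; the TEMPERATURE and RENTED_BIKE_COUNT column names, the `LIMIT 10`, and the six toy rows are assumptions for illustration only.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the SEOUL_BIKE_SHARING table (values are made up)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "SEOUL_BIKE_SHARING", data.frame(
  SEASONS = c("Summer", "Summer", "Summer", "Winter", "Winter", "Winter"),
  HOUR = c(18, 18, 8, 18, 18, 8),
  TEMPERATURE = c(28, 30, 24, -2, 0, -5),
  RENTED_BIKE_COUNT = c(2000, 2200, 900, 300, 340, 150)
))

# Average temperature and rentals per season and hour, busiest first
hourly_popularity <- dbGetQuery(con, "
  SELECT SEASONS, HOUR,
         AVG(TEMPERATURE) AS avg_temperature,
         AVG(RENTED_BIKE_COUNT) AS avg_rentals
  FROM SEOUL_BIKE_SHARING
  GROUP BY SEASONS, HOUR
  ORDER BY avg_rentals DESC
  LIMIT 10")

dbDisconnect(con)
```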
Rental Seasonality
• The query retrieves seasonal bike rental statistics from the SEOUL_BIKE_SHARING table: the average, minimum, maximum, and standard deviation of bike rentals for each season.
• Overall, the result shows the seasonal pattern in bike rentals: Summer has both the highest average rentals and the greatest variability, while Winter shows lower averages and more consistent rental counts.
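A per-season summary of this kind could be produced along the lines below. The RENTED_BIKE_COUNT column name and the toy rows are assumptions, and because SQLite has no built-in standard deviation aggregate, the sketch returns the population variance from SQL and takes the square root in R; the original lab may have computed it differently.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the SEOUL_BIKE_SHARING table (values are made up)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "SEOUL_BIKE_SHARING", data.frame(
  SEASONS = c("Summer", "Summer", "Summer", "Winter", "Winter", "Winter"),
  RENTED_BIKE_COUNT = c(900, 2000, 2200, 150, 300, 340)
))

# Average, minimum, maximum, and population variance per season
seasonal_stats <- dbGetQuery(con, "
  SELECT SEASONS,
         AVG(RENTED_BIKE_COUNT) AS avg_rentals,
         MIN(RENTED_BIKE_COUNT) AS min_rentals,
         MAX(RENTED_BIKE_COUNT) AS max_rentals,
         AVG(RENTED_BIKE_COUNT * RENTED_BIKE_COUNT)
           - AVG(RENTED_BIKE_COUNT) * AVG(RENTED_BIKE_COUNT) AS var_rentals
  FROM SEOUL_BIKE_SHARING
  GROUP BY SEASONS
  ORDER BY SEASONS")

# Standard deviation derived in R from the SQL variance
seasonal_stats$sd_rentals <- sqrt(seasonal_stats$var_rentals)

dbDisconnect(con)
```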
Weather Seasonality
• The query retrieves, for each season, the average rented bike count together with the average weather conditions from the SEOUL_BIKE_SHARING table: temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall, and snowfall.
• Overall, the data highlights seasonal trends in weather conditions and bike usage, showing that warmer seasons correlate with higher bike rentals and more favorable weather conditions.
Bike-sharing info in Seoul
Cities similar to Seoul
• The query retrieves cities whose number of bicycles is between 15,000 and 20,000 from the WORLD_CITIES and BIKE_SHARING_SYSTEMS tables.
• A join on the city names combines information from both tables.
• The result lists cities with moderately sized bike-sharing systems, particularly in China, showing their geographical coordinates and population alongside the number of bicycles available.
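The join described above can be sketched as below. The two table names come from the text; every column name (CITY_ASCII, COUNTRY, LAT, LNG, POPULATION, CITY, BICYCLES) and all rows are assumptions, with placeholder city names so that no real figures are implied.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Placeholder rows; none of these values describe real cities
dbWriteTable(con, "WORLD_CITIES", data.frame(
  CITY_ASCII = c("City A", "City B", "City C"),
  COUNTRY = c("Country X", "Country X", "Country Y"),
  LAT = c(10.0, 20.0, 30.0),
  LNG = c(100.0, 110.0, 120.0),
  POPULATION = c(9000000, 5000000, 800000)
))
dbWriteTable(con, "BIKE_SHARING_SYSTEMS", data.frame(
  CITY = c("City A", "City B", "City C"),
  BICYCLES = c(20000, 16000, 500)
))

# Join on city name and keep systems with 15,000 to 20,000 bicycles
similar_cities <- dbGetQuery(con, "
  SELECT W.CITY_ASCII, W.COUNTRY, W.LAT, W.LNG, W.POPULATION, B.BICYCLES
  FROM WORLD_CITIES W
  INNER JOIN BIKE_SHARING_SYSTEMS B
    ON W.CITY_ASCII = B.CITY
  WHERE B.BICYCLES BETWEEN 15000 AND 20000
  ORDER BY B.BICYCLES DESC")

dbDisconnect(con)
```

`BETWEEN` is inclusive at both ends, so a system with exactly 20,000 bicycles is kept.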
EDA with Visualization
Bike rental vs. Date
The plot shows rented bike counts over time, with denser
clusters of points toward the middle of 2018, indicating higher
bike rental activity around Summer and Autumn. The counts
drop off at both ends (early and late 2018), which could
suggest lower rentals in colder months or off-peak times.
Bike rental vs. Datetime
Bike rental histogram
Daily total rainfall and snowfall
The plot provides a clear comparison of daily total rainfall and snowfall over a specific period in 2018.
Rainfall was the predominant form of precipitation, with higher frequency and intensity during the warmer months (April to October).
Snowfall appears to be concentrated in the colder months, primarily in January.
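The daily totals behind such a plot can be obtained by summing the hourly records per date. The sketch below assumes hourly rows with DATE, RAINFALL, and SNOWFALL columns; those names and the four toy rows are assumptions, not values from the dataset.

```r
library(dplyr)

# Toy hourly records (values are made up)
hourly_weather <- tibble(
  DATE = as.Date(c("2018-01-10", "2018-01-10", "2018-07-02", "2018-07-02")),
  RAINFALL = c(0, 0, 5.5, 12),
  SNOWFALL = c(1.2, 0.8, 0, 0)
)

# One row per day with total rainfall and total snowfall
daily_totals <- hourly_weather %>%
  group_by(DATE) %>%
  summarise(
    total_rainfall = sum(RAINFALL),
    total_snowfall = sum(SNOWFALL),
    .groups = "drop"
  )
```

The resulting `daily_totals` data frame can then be passed to ggplot2, for example with one `geom_col` layer per precipitation type.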
Predictive analysis
Ranked coefficients
Model evaluation
• Built at least 5 different models using polynomial terms, interaction terms, and regularization.
Find the best performing model
• Best model based on RMSE (308.3473): Lasso model
• Best model based on R-squared (0.7628875): Lasso model
• formula <- RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 6) +
poly(DEW_POINT_TEMPERATURE, 6) + SUMMER +
poly(SOLAR_RADIATION, 6) + H_18 + poly(VISIBILITY, 3) + AUTUMN +
H_19 + H_17 + poly(WIND_SPEED, 5) + H_20 + H_21 + H_8 + H_16 +
H_22 + NO_HOLIDAY + H_15 + H_14 + SPRING + H_13 + H_12 + H_23 +
H_9 + H_7 + H_11 + H_0 + H_10 + HOLIDAY + H_1 + poly(RAINFALL, 6) +
H_2 + H_6 + poly(SNOWFALL, 6) + H_3 + H_5 + H_4 + poly(HUMIDITY, 5) +
WINTER
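Fitting and scoring a lasso model of this kind with Tidymodels could look like the sketch below. It is an illustration only: the data are synthetic, the formula is cut down to two predictors, and the penalty value of 0.1 is an arbitrary assumption rather than the tuned value behind the reported RMSE and R-squared. It assumes the `tidymodels` and `glmnet` packages are installed.

```r
library(tidymodels)

set.seed(1234)

# Synthetic data standing in for the bike-sharing dataset
n <- 200
bike_data <- tibble(
  TEMPERATURE = runif(n, -10, 35),
  HUMIDITY = runif(n, 20, 95)
) %>%
  mutate(RENTED_BIKE_COUNT = 500 + 30 * TEMPERATURE - 0.5 * TEMPERATURE^2 -
           3 * HUMIDITY + rnorm(n, sd = 50))

# Train/test split
data_split <- initial_split(bike_data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

# Lasso regression: mixture = 1 selects the pure L1 penalty in glmnet
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- lasso_spec %>%
  fit(RENTED_BIKE_COUNT ~ poly(TEMPERATURE, 2) + HUMIDITY, data = train_data)

# Predictions on the held-out data, paired with the true values
results <- predict(lasso_fit, new_data = test_data) %>%
  mutate(truth = test_data$RENTED_BIKE_COUNT)

# Evaluation metrics used to compare models
model_rmse <- rmse(results, truth = truth, estimate = .pred)
model_rsq <- rsq(results, truth = truth, estimate = .pred)
```

Repeating the last two steps for each candidate model and comparing the `.estimate` values is one way to arrive at a "best model by RMSE / R-squared" statement.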
Q-Q plot of the best model
Dashboard
Max bike prediction overview map
Predicted bike-sharing demand in London
• Map: Shows the location of London.
• City marker (green): Indicates low predicted demand for bike rentals.
• Weather information: Predicted weather conditions in London.
• Temperature chart: Shows the temperature in London over time.
• Bike Count Prediction Chart: Displays the predicted number of bikes in London for the next three hours.
• Time and Bike Count Prediction: Indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: Shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Seoul
• Map: Shows the location of Seoul.
• City marker (yellow): Indicates medium predicted demand for bike rentals.
• Weather information: Predicted weather conditions in Seoul.
• Temperature chart: Shows the temperature in Seoul over time.
• Bike Count Prediction Chart: Displays the predicted number of bikes in Seoul for the next three hours.
• Time and Bike Count Prediction: Indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: Shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Suzhou
• Map: Shows the location of Suzhou.
• City marker (yellow): Indicates medium predicted demand for bike rentals.
• Weather information: Predicted weather conditions in Suzhou.
• Temperature chart: Shows the temperature in Suzhou over time.
• Bike Count Prediction Chart: Displays the predicted number of bikes in Suzhou for the next three hours.
• Time and Bike Count Prediction: Indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: Shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in New York
• Map: Shows the location of New York.
• City marker (yellow): Indicates medium predicted demand for bike rentals.
• Weather information: Predicted weather conditions in New York.
• Temperature chart: Shows the temperature in New York over time.
• Bike Count Prediction Chart: Displays the predicted number of bikes in New York for the next three hours.
• Time and Bike Count Prediction: Indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: Shows the relationship between bike predictions and humidity levels.
Predicted bike-sharing demand in Paris
• Map: Shows the location of Paris.
• City marker (yellow): Indicates medium predicted demand for bike rentals.
• Weather information: Predicted weather conditions in Paris.
• Temperature chart: Shows the temperature in Paris over time.
• Bike Count Prediction Chart: Displays the predicted number of bikes in Paris for the next three hours.
• Time and Bike Count Prediction: Indicates the time of access and the corresponding predicted bike demand.
• Bike Prediction Chart: Shows the relationship between bike predictions and humidity levels.
CONCLUSION
• The workflow undertaken in this project covers the critical stages of data handling, modeling, and visualization, all aimed at predicting bike demand from weather and time factors.
• Using SQL together with tools such as Tidyverse and ggplot2 enabled a thorough exploration of the data, revealing patterns and trends that informed the modeling effort.
• Building a linear regression model, along with polynomial and regularized models, using Tidymodels illustrated the iterative process of model refinement needed to identify the best-performing approach for predicting bike demand.
• The R Shiny application integrates regression models for hourly bike demand predictions, featuring an interactive map and detailed visualizations that give users insight into demand trends and variable relationships, such as weather, date, time, and humidity.
APPENDIX 1. Data Collection
Web Scraping Notebook with OpenWeather API Notebook
APPENDIX 2. Data wrangling
APPENDIX 3. EDA with SQL
APPENDIX 3. EDA with data visualization
APPENDIX 4. Predictive analysis
APPENDIX 5. Build an R Shiny dashboard
Predicted bike-sharing demand in London
Predicted bike-sharing demand in Seoul
Predicted bike-sharing demand in Suzhou
Predicted bike-sharing demand in New York
Predicted bike-sharing demand in Paris