CV0003
MINI PROJECT
PRESENTATION
Source : https://covidtracking.com/
Documentation : https://covidtracking.com/data/api
Define your own problem on any dataset you extract
using the APIs.
Primarily offers historic and current values of US
COVID-19 stats.
INTRODUCTION
COVID-19, caused by the SARS-CoV-2 virus,
emerged in late 2019 and quickly spread globally,
leading to one of the largest pandemics in recent
history.
The virus impacted all aspects of daily life, including
public health systems, economies, and social
behaviors, due to its high transmission rate and
severity.
Understanding COVID-19 trends has become crucial
for managing public health responses, predicting
outbreaks, and planning future pandemic
preparedness.
PROBLEM & OBJECTIVES
WHAT ARE THE KEY TRENDS IN THE COVID-19 PANDEMIC?
Daily.csv
Identify significant turning points in COVID-19 case
numbers, hospitalizations, and mortality rates.
https://covidtracking.com/data/api
DATA EXTRACTION & CLEANING
DATA EXTRACTION & CLEANING
DATA EXTRACTION
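A minimal extraction sketch, assuming the national historic endpoint listed in the API documentation and pandas for loading (the endpoint URL and column selection are assumptions, not taken from the slides):

```python
import pandas as pd

# Historic US daily values from the COVID Tracking Project API
# (endpoint assumed from the API documentation linked above).
url = "https://api.covidtracking.com/v1/us/daily.csv"
df = pd.read_csv(url)

print(df.shape)
print(df[["date", "positiveIncrease", "hospitalizedCurrently", "deathIncrease"]].head())
```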
CORRELATION HEATMAP
Shows the relationships between variables such as cases, hospitalizations, and deaths. It helps identify related trends, guiding feature selection and highlighting significant interactions for deeper analysis.
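A minimal heatmap sketch, assuming the `df` loaded in the extraction step and the daily-increase and hospitalization columns used elsewhere in this project:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Columns assumed from the predictors and target used later in the project.
cols = ["positiveIncrease", "negativeIncrease", "deathIncrease",
        "totalTestResultsIncrease", "hospitalizedCurrently"]

corr = df[cols].corr()                      # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between COVID-19 variables")
plt.tight_layout()
plt.show()
```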
EXPLORATORY DATA ANALYSIS (EDA)
STATISTICAL DESCRIPTION
Table showing the mean, standard deviation, quartiles, and median produced by the .describe() function
NORMALIZED BOXPLOT
Visualize the distribution of the normalized data using boxplots
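A short sketch of both EDA views, assuming the same `df` as above; the min-max step is only there so all variables share one boxplot axis:

```python
import matplotlib.pyplot as plt

cols = ["positiveIncrease", "negativeIncrease", "deathIncrease",
        "totalTestResultsIncrease", "hospitalizedCurrently"]

# Summary statistics: mean, std, quartiles (the median is the 50% row)
print(df[cols].describe())

# Min-max normalize so all variables fit on one boxplot axis
normalized = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
normalized.boxplot(rot=45)
plt.title("Normalized boxplot of COVID-19 variables")
plt.tight_layout()
plt.show()
```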
EXPLORATORY DATA ANALYSIS (EDA)
Correct the date format to align with time-series analysis requirements (%Y%m%d).
Normalize the data to ensure that variables with different scales or ranges are brought to a comparable scale, making them suitable for plotting on the same graph or for algorithms that are sensitive to feature magnitude.
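A minimal cleaning sketch for both steps, assuming the API's integer date column (YYYYMMDD) and min-max normalization of the numeric columns:

```python
import pandas as pd

# Parse the integer date column (e.g. 20210307) into datetimes and sort chronologically
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
df = df.sort_values("date").set_index("date")

# Min-max normalize the numeric columns onto a common 0-1 scale
numeric = df.select_dtypes("number")
df_norm = (numeric - numeric.min()) / (numeric.max() - numeric.min())
```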
Regression / Classification
Rupture / Turning Point
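A minimal turning-point sketch, assuming the change-point detection is done with the `ruptures` library on daily case increases and the date-indexed `df` from the cleaning step (library choice and penalty value are assumptions, not confirmed by the slides):

```python
import ruptures as rpt

# Daily new cases as the signal to segment
signal = df["positiveIncrease"].fillna(0).to_numpy()

# PELT change-point search; the penalty controls how many breaks are reported
algo = rpt.Pelt(model="rbf").fit(signal)
change_points = algo.predict(pen=10)

# Dates where the trend shifts (turning points); the last index is just the series end
print(df.index[[cp - 1 for cp in change_points[:-1]]])
```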
SMOOTHING PARAMETER
Higher (~1): more sensitive to recent data and largely ignores older observations.
Lower (~0): smoother forecast but slower to respond to recent changes.
HIGH SMOOTHING LEVEL
WEAKNESS
It may lead to overfitting by closely following random noise, which can reduce the generalization of the forecast.
LOW SMOOTHING LEVEL
STRENGTH
Smoother and more stable forecasts: less reactive to recent changes, focusing on the long-term trend rather than short-term fluctuations.
Reduces overfitting risk: the forecast line smooths out random variations and focuses on broader patterns.
WEAKNESS
Less reactive to short-term or sudden changes.
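A minimal sketch of the two settings, assuming statsmodels' simple exponential smoothing with fixed smoothing levels (the exact values 0.8 and 0.2 are illustrative):

```python
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

series = df["positiveIncrease"].fillna(0).astype(float)

# High smoothing level: reacts strongly to the most recent observations
high = SimpleExpSmoothing(series).fit(smoothing_level=0.8, optimized=False)

# Low smoothing level: smoother and more stable, but slower to react
low = SimpleExpSmoothing(series).fit(smoothing_level=0.2, optimized=False)

print(high.forecast(7))   # 7-day-ahead forecast under each setting
print(low.forecast(7))
```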
MACHINE LEARNING: CLASSIFICATION
CLASSIFICATION INTO LOW,
MEDIUM, HIGH
Calculate the 33rd and 66th percentiles
of the "hospitalizedCurrently" data.
Label each day as low (below the 33rd percentile), medium (between the 33rd and 66th percentiles), or high (above the 66th percentile).
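A minimal labeling sketch, assuming NumPy percentiles and a hypothetical category column name `hosp_level`:

```python
import numpy as np

# Keep only days with a reported hospitalization count
df = df.dropna(subset=["hospitalizedCurrently"])

# 33rd and 66th percentiles of current hospitalizations
p33, p66 = np.percentile(df["hospitalizedCurrently"], [33, 66])

def hosp_category(value):
    """Map a hospitalization count to low / medium / high."""
    if value < p33:
        return "low"
    if value <= p66:
        return "medium"
    return "high"

df["hosp_level"] = df["hospitalizedCurrently"].apply(hosp_category)
print(df["hosp_level"].value_counts())
```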
ENSEMBLE LEARNING
Ensemble learning technique that combines multiple decision trees to make accurate and robust predictions.
BOOTSTRAP SAMPLING
Each tree in the forest is trained on a random subset
of the training data (with replacement).
RANDOMNESS
When splitting nodes, a random subset of
features is considered to create diverse trees.
AVERAGING IN REGRESSION
For regression tasks, the average of all tree
outputs is taken as the final prediction.
VOTING MECHANISM
For classification tasks, each tree makes a
prediction, and the final output is determined by
majority voting.
RANDOM FOREST
OBJECTIVE
Predict whether daily COVID-19 hospitalization cases fall into the categories of "low," "medium," or "high".
SELECTING PREDICTORS
['positiveIncrease', 'negativeIncrease', 'deathIncrease',
'totalTestResultsIncrease']
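A minimal classifier sketch using scikit-learn, assuming the `hosp_level` labels from the percentile step; the train/test split, tree count, and random seed are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

features = ["positiveIncrease", "negativeIncrease",
            "deathIncrease", "totalTestResultsIncrease"]
X = df[features].fillna(0)
y = df["hosp_level"]            # low / medium / high from the percentile step

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 100 bootstrap-sampled trees; the final class is the majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(classification_report(y_test, rf.predict(X_test)))
```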
WHAT IS LSTM?
LSTM is a type of Recurrent Neural Network (RNN)
designed to handle sequences and time-series
data.
It overcomes the limitations of traditional RNNs by
solving the "short-term memory" issue.
APPLICATION
In this project, the LSTM model is used to predict
future COVID-19 cases by learning from past case
trends.
Data is scaled and divided into training, validation,
and testing sets to enhance accuracy.
LSTM
DATA SCALING
Why Scale? Scaling ensures that all data is
within a similar range, which helps the LSTM
model learn effectively.
MinMaxScaler: This tool scales data to a range
between 0 and 1, which is ideal for LSTM input.
CREATING SEQUENCES
Window Size: A window size of 5 is used,
meaning each data point is based on the
previous 5 values.
Features (X) and Labels (y): The LSTM function
creates input-output pairs based on the
specified window size.
LSTM (MINMAXSCALER)
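A minimal sketch of the scaling and windowing steps, assuming scikit-learn's MinMaxScaler and a single target column such as `positiveIncrease`; the helper name `make_sequences` and the split proportions are assumptions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

window_size = 5

# Scale daily cases into [0, 1] so the LSTM trains on a consistent range
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df[["positiveIncrease"]].fillna(0))

def make_sequences(values, window_size):
    """Turn a series into (previous window_size values -> next value) pairs."""
    X, y = [], []
    for i in range(len(values) - window_size):
        X.append(values[i:i + window_size])
        y.append(values[i + window_size])
    return np.array(X), np.array(y)

X, y = make_sequences(scaled, window_size)   # X shape: (samples, 5, 1)

# Chronological split into training, validation, and test sets
n = len(X)
X_train, y_train = X[: int(0.7 * n)], y[: int(0.7 * n)]
X_val, y_val = X[int(0.7 * n): int(0.85 * n)], y[int(0.7 * n): int(0.85 * n)]
X_test, y_test = X[int(0.85 * n):], y[int(0.85 * n):]
```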
LSTM
MODEL DESIGNING
Model Architecture
Sequential Model: A linear stack of layers, ideal
for time series forecasting with LSTM.
Input Layer
Window Size: The input shape is (window_size,
1), where the window size is 5.
LSTM(64): 64 units in the LSTM layer to capture
time dependencies in data. This helps the model
retain memory over long sequences, which is
crucial for time series forecasting.
Dense(8, activation='relu'): A dense layer with 8
nodes and ReLU activation to process learned
features.
Dense(1, activation='linear'): A single-node
output layer with linear activation for continuous
output, ideal for forecasting.
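A minimal Keras sketch matching the architecture described above (layer sizes and activations are taken from the slide; everything else is a plain default):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input, LSTM

window_size = 5

model = Sequential([
    Input(shape=(window_size, 1)),        # 5 past values, 1 feature
    LSTM(64),                             # captures time dependencies
    Dense(8, activation="relu"),          # processes learned features
    Dense(1, activation="linear"),        # continuous forecast output
])

model.summary()
```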
LSTM MODEL SUMMARY
LSTM
TRAINING
Checkpoint:
Saves the model only when it achieves a new best score on the validation data. This prevents overfitting and ensures that the best model configuration is retained.
Compiling the Model:
MeanSquaredError() minimizes the difference between predicted
and actual values.
Adam optimizer with a learning rate of 0.0001, enhancing stability
and convergence.
Tracks RootMeanSquaredError (RMSE) for model performance on
training and validation.
Training Execution:
Runs training for 100 epochs with both training and validation data.
Monitors performance using the ModelCheckpoint callback.
Loading the Best Model:
Reloads the best version of the trained model (saved in
"LSTM_model.keras") for accurate evaluation and predictions.
LSTM
PLOTTING
Inverse Transformation:
Converts scaled predictions and actual values back to original
COVID-19 case counts.
Separate predictions for training, validation, and testing sets.
Combining Predictions and Actual Values:
Concatenate predictions and actual values for a unified view.
Create a DataFrame with Prediction and Actual columns indexed
by date.
Plotting the Results:
Visualizes both Prediction and Actual on the same graph.
Labels and titles clarify the plot's purpose.
Gridlines and Legend aid readability and data interpretation.
Performance Metrics:
MAPE measures average error percentage between predictions
and actuals.
RMSE quantifies prediction error magnitude.
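A minimal evaluation sketch, assuming the scaler, best model, splits, and window size from the earlier sketches; aligning the date index with the targets is illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Undo the MinMax scaling so values are back in original case counts
prediction = scaler.inverse_transform(
    np.concatenate([best_model.predict(X_train),
                    best_model.predict(X_val),
                    best_model.predict(X_test)]))
actual = scaler.inverse_transform(
    np.concatenate([y_train, y_val, y_test]).reshape(-1, 1))

results = pd.DataFrame({"Prediction": prediction.ravel(),
                        "Actual": actual.ravel()},
                       index=df.index[window_size:])   # dates matching each target

results.plot(grid=True, title="LSTM predictions vs actual COVID-19 cases")
plt.ylabel("Daily cases")
plt.show()

print("MAPE:", mean_absolute_percentage_error(results["Actual"], results["Prediction"]))
print("RMSE:", np.sqrt(mean_squared_error(results["Actual"], results["Prediction"])))
```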
LSTM MODEL VISUALISATION
LSTM
FORECASTING
Setting Up the Forecast:
Choose future_days (e.g., 30 days) to set the forecast length.
Start with the final sequence of test data for prediction.
Iterative Forecasting:
Input the last sequence into the model to predict the next day’s
case count and append the predicted case to the forecast list.
Add the prediction to the sequence and remove the oldest value,
maintaining the window_size length for further predictions.
Inverse Scaling & Date Generation:
Transform predictions back to the original COVID case scale
Extend the timeline by future_days.
Plotting the Forecast:
Actual and predicted cases are plotted along with the 30-day
forecast. Titles and labels clarify each line’s significance.
Forecast Table:
Print the DataFrame of forecasted values, showing daily cases
over the specified forecast period.
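A minimal iterative-forecast sketch following the steps above (the 30-day horizon is from the slide; the variable names carry over from the earlier sketches):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

future_days = 30
last_sequence = X_test[-1].copy()          # shape: (window_size, 1)
forecast_scaled = []

for _ in range(future_days):
    # Predict the next day from the current window, then slide the window forward
    next_value = best_model.predict(last_sequence[np.newaxis, ...], verbose=0)[0, 0]
    forecast_scaled.append(next_value)
    last_sequence = np.vstack([last_sequence[1:], [[next_value]]])

# Back to original case counts, with a date range extending past the data
forecast = scaler.inverse_transform(np.array(forecast_scaled).reshape(-1, 1)).ravel()
future_dates = pd.date_range(df.index[-1] + pd.Timedelta(days=1), periods=future_days)
forecast_df = pd.DataFrame({"Forecast": forecast}, index=future_dates)

ax = results["Actual"].plot(label="Actual", grid=True)
forecast_df["Forecast"].plot(ax=ax, label="30-day forecast",
                             title="COVID-19 case forecast")
ax.legend()
plt.show()

print(forecast_df)   # daily forecasted cases over the forecast period
```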
LSTM MODEL VISUALISATION
(FORECASTING)
WHAT WE APPLIED FROM THE LECTURE
WHAT WE EXPLORED