CV0003: INTRODUCTION TO DATA SCIENCE AND AI

MINI PROJECT PRESENTATION

Presented by : CV4 Group 6


TABLE OF CONTENTS
INTRODUCTION
DATA EXTRACTION & CLEANING
EXPLORATORY DATA ANALYSIS (EDA)
MACHINE LEARNING
KEY INSIGHTS
INTRODUCTION
(Intro to Dataset & Problem Statement)
DATASET
Dataset 2 : The COVID Tracking Project APIs

Source : https://covidtracking.com/
Documentation : https://covidtracking.com/data/api
Define your own problem on any dataset you extract
using the APIs.
Primarily offers historic and current values of US
COVID-19 stats.
INTRODUCTION
COVID-19, caused by the SARS-CoV-2 virus,
emerged in late 2019 and quickly spread globally,
leading to one of the largest pandemics in recent
history.
The virus impacted all aspects of daily life, including
public health systems, economies, and social
behaviors, due to its high transmission rate and
severity.
Understanding COVID-19 trends has become crucial
for managing public health responses, predicting
outbreaks, and planning future pandemic
preparedness.
PROBLEM & OBJECTIVES
WHAT ARE THE KEY TRENDS IN COVID-19 EVENTS?
Identify significant turning points in COVID-19 case
numbers, hospitalizations, and mortality rates.

WHAT WAS THE IMPACT OF INTERVENTIONS?
Examine how specific events (e.g., lockdowns, vaccines)
affected COVID-19 trends, to help improve public health policy.

HOW TO DEVELOP FORECASTING MODELS?
Use machine learning to forecast future trends of
hospitalizations and positive cases (potential outbreaks).

Daily.csv
The dataset contains historic U.S. COVID-19 data from January 2020
to March 2021. It includes 420 daily records with 25 columns,
covering variables such as positive and negative cases,
hospitalizations, ICU usage, ventilator counts, and deaths.

https://covidtracking.com/data/api
DATA EXTRACTION & CLEANING
DATA EXTRACTION & CLEANING
DATA EXTRACTION

DROP COLUMNS WITH TOO MANY MISSING VALUES

FILL MISSING VALUES USING THE MEDIAN

CONVERT DATE INTO DATETIME FORMAT

CONVERT COLUMNS WITH NUMERIC DATA TO NUMERIC TYPE

DROP IRRELEVANT COLUMNS

RENAME COLUMNS FOR CLARITY
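
The cleaning steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration, not the group's actual code: the 50% missing-value threshold, the dropped column `hash`, the new column names, and the tiny sample table are all assumptions made for the example.

```python
import pandas as pd

def clean_daily(df, max_missing=0.5):
    """Apply the cleaning steps to a Daily.csv-style table (sketch)."""
    # Drop columns where more than max_missing of the values are missing
    df = df.loc[:, df.isna().mean() <= max_missing].copy()
    # Drop irrelevant columns (assumed name; ignored if absent)
    df = df.drop(columns=["hash"], errors="ignore")
    # Convert integer dates such as 20200301 into datetime
    df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
    # Coerce the remaining columns to numeric, then fill gaps with the median
    value_cols = df.columns.drop("date")
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
    df[value_cols] = df[value_cols].fillna(df[value_cols].median())
    # Rename columns for clarity (hypothetical new names)
    df = df.rename(columns={"positive": "total_positive", "death": "total_deaths"})
    return df.sort_values("date").reset_index(drop=True)

# Made-up sample rows, for illustration only
raw = pd.DataFrame({
    "date": [20200303, 20200301, 20200302],
    "positive": ["10", None, "30"],
    "death": [1, 2, 3],
    "recovered": [None, None, None],
    "hash": ["a", "b", "c"],
})
cleaned = clean_daily(raw)
```

The median is used for filling because it is less affected by extreme daily values than the mean.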


DATA EXTRACTION
EXPLORATORY DATA ANALYSIS (EDA)
EXPLORATORY DATA ANALYSIS (EDA)

CORRELATION HEATMAP
Shows the relationships between variables, such as cases,
hospitalizations, and deaths. It helps identify related
trends, guiding feature selection and highlighting
significant interactions for deeper analysis.
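
A correlation heatmap of this kind can be sketched with pandas and matplotlib as below. The numbers are made up for illustration and the plotting choices (colour map, labels) are assumptions; the slides may well have used a different plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for illustration only; not taken from Daily.csv
df = pd.DataFrame({
    "positiveIncrease": [100, 200, 300, 400, 500],
    "hospitalizedCurrently": [12, 19, 33, 41, 48],
    "deathIncrease": [5, 3, 6, 2, 4],
})

corr = df.corr()  # pairwise Pearson correlation between columns

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Values close to 1 or -1 mark strongly related variables, which is what guides feature selection.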
EXPLORATORY DATA ANALYSIS (EDA)

STATISTICAL DESCRIPTION
Table showing the mean, standard deviation, quartiles,
and median, produced by the .describe() function

NORMALIZED BOXPLOT
Visualizes the statistical information of the
normalized data using a boxplot
EXPLORATORY DATA ANALYSIS (EDA)

Correct the date format to align with the time series analysis requirement (%Y%m%d).

Remove unnecessary, repeated, and empty variables from the dataset.

Filter the data to start from March 2020.

Normalize the data to ensure that variables with different scales or ranges are brought
to a comparable scale, making them suitable for plotting on the same graph or for
algorithms that are sensitive to feature magnitude.

Mark past interventions on the COVID-19 case plot with vertical lines.
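
The normalization and intervention-marking steps above might look like the following sketch. Min-max normalization is assumed here (the slides do not say which method was used), and the data values and the intervention date are invented purely for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def min_max_normalize(df):
    """Rescale every column to the range [0, 1]."""
    return (df - df.min()) / (df.max() - df.min())

# Made-up daily values for illustration only
dates = pd.date_range("2020-02-27", periods=8, freq="D")
data = pd.DataFrame(
    {
        "positiveIncrease": [5, 8, 20, 45, 90, 160, 240, 300],
        "hospitalizedCurrently": [1, 1, 2, 4, 7, 12, 15, 21],
    },
    index=dates,
)

data = data.loc["2020-03-01":]      # keep data starting from March 2020
scaled = min_max_normalize(data)    # bring both columns to a comparable scale

fig, ax = plt.subplots()
for column in scaled.columns:
    ax.plot(scaled.index, scaled[column], label=column)
# Hypothetical intervention date, marked with a vertical line
ax.axvline(pd.Timestamp("2020-03-04"), color="red", linestyle="--",
           label="intervention (hypothetical)")
ax.legend()
fig.autofmt_xdate()
```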


EXPLORATORY DATA ANALYSIS (EDA)
MACHINE LEARNING
Machine Learning

  Supervised Learning
    Regression: Linear Regression, Random Forest, Exponential Smoothing, LSTM
    Classification: Random Forest

  Unsupervised Learning
    Rupture
    Turning Point
MACHINE LEARNING TECHNIQUES :
LINEAR REGRESSION

Measures the goodness of fit of the model
Determines how the variables depend on each other

Import the Linear Regression model:

Split the data into train and test sets uniformly
Fit the Linear Regression model on the training dataset
Check the coefficients of the Linear Regression model
Plot the graph

Repeat for x values of hospitalised cases and ICU cases
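
The steps above can be sketched with scikit-learn as follows. The data here is synthetic and perfectly linear, so the fit is exact; on the real dataset the slides report a much weaker fit. The variable names and the 75/25 split are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: target = 2 * x + 1 (not real COVID-19 values)
x = np.arange(0, 100, dtype=float).reshape(-1, 1)   # e.g. hospitalised cases
target = 2.0 * x.ravel() + 1.0

# Split the data into train and test sets at random
x_train, x_test, y_train, y_test = train_test_split(
    x, target, test_size=0.25, random_state=0
)

# Fit the model on the training set and inspect its coefficients
model = LinearRegression()
model.fit(x_train, y_train)
slope = model.coef_[0]
intercept = model.intercept_

# Goodness of fit on the held-out test set
y_pred = model.predict(x_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
```

To repeat the analysis for another predictor such as ICU cases, only `x` needs to change.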


LIMITATIONS OF LINEAR REGRESSION

Low explained variance and high MSE:
the model is not accurate!
EXPONENTIAL SMOOTHING
OBJECTIVE
Forecast COVID-19 cases by weighting past observations,
giving more importance to recent data

SMOOTHING PARAMETER
Higher (~1): more sensitive to recent data, largely ignoring older observations
Lower (~0): smoother forecast, but slow to respond to recent data
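
To make the role of the smoothing parameter concrete, here is a minimal sketch of simple exponential smoothing written by hand. It is not the code from the slides, which most likely relied on a forecasting library; the function name and the sample numbers are assumptions. Each smoothed value is `alpha * observation + (1 - alpha) * previous smoothed value`.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha is the smoothing level in (0, 1]."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Made-up daily case counts for illustration only
cases = [10, 20, 30]

high = exponential_smoothing(cases, alpha=1.0)   # follows the data exactly
mid = exponential_smoothing(cases, alpha=0.5)    # blends new and old values
low = exponential_smoothing(cases, alpha=0.1)    # changes slowly

# The one-step-ahead forecast is the last smoothed value
next_day_forecast = mid[-1]
```

With a level near 1 the smoothed line tracks every fluctuation; with a level near 0 it barely moves away from its starting point.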

WHY EXPONENTIAL SMOOTHING?


Simplicity: Easy to implement and computationally
efficient.
Adaptability: Flexible in adjusting to recent changes,
with variants available to account for trends and
seasonality.
Short-Term Focus: Best suited for short- to medium-
term forecasts, where recent trends are expected to
continue.

Plots: HIGH SMOOTHING LEVEL, LOW SMOOTHING LEVEL


HIGH SMOOTHING LEVEL
STRENGTH
More Responsive to Recent Changes: more weight on
recent data, better suited to datasets where recent trends
or changes are significant.
Captures Volatility: useful for volatile datasets where
fluctuations are important, such as daily or weekly data with
frequent shifts.

WEAKNESS
It may lead to overfitting by closely following random noise,
which can reduce the generalization of the forecast.
LOW SMOOTHING LEVEL
STRENGTH
Smoother Forecasts and More Stable: creates a smoother,
more stable forecast that is less reactive to recent changes.
It focuses more on the long-term trend rather than short-
term fluctuations.
Reduces Overfitting Risk: can reduce the risk of overfitting,
as the forecast line smooths out random variations and
focuses on broader patterns.

WEAKNESS
Less reactive to short-term or sudden changes
CODE SNIPPET
MACHINE LEARNING : CLASSIFICATION
CLASSIFICATION INTO LOW, MEDIUM, HIGH
Calculate the 33rd and 66th percentiles
of the "hospitalizedCurrently" data.
Label each day as low (below the 33rd percentile),
medium (between the 33rd and 66th
percentiles), or high (above the 66th
percentile).
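
The percentile-based labelling described above can be sketched with NumPy as follows. The function name and the sample values are assumptions; how values exactly on a threshold are handled is also a choice made here (they fall into "medium").

```python
import numpy as np

def label_levels(values):
    """Label each value as low, medium, or high using the 33rd/66th percentiles."""
    values = np.asarray(values, dtype=float)
    p33, p66 = np.percentile(values, [33, 66])
    labels = np.where(
        values < p33, "low",
        np.where(values <= p66, "medium", "high"),
    )
    return labels, p33, p66

# Stand-in for the "hospitalizedCurrently" column: the numbers 1 to 100
hospitalized = np.arange(1, 101)
labels, p33, p66 = label_levels(hospitalized)
counts = {level: int((labels == level).sum()) for level in ("low", "medium", "high")}
```

Because the thresholds are percentiles of the data itself, the three classes come out roughly balanced.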

Line plot drawn with plt (matplotlib.pyplot)


RANDOM FOREST

ENSEMBLE LEARNING
Ensemble learning technique that combines multiple
decision trees to make accurate and robust
predictions.

BOOTSTRAP SAMPLING
Each tree in the forest is trained on a random subset
of the training data (with replacement).

RANDOMNESS
When splitting nodes, a random subset of
features is considered to create diverse trees.

AVERAGING IN REGRESSION
For regression tasks, the average of all tree
outputs is taken as the final prediction.

VOTING MECHANISM
For classification tasks, each tree makes a
prediction, and the final output is determined by
majority voting.
RANDOM FOREST
OBJECTIVE
Predict whether daily COVID-19 hospitalization cases fall
into the categories of "low," "medium," or "high"

SELECTED PREDICTORS
['positiveIncrease', 'negativeIncrease', 'deathIncrease',
'totalTestResultsIncrease']

WHY RANDOM FOREST


Captures complex relationships and interactions
between variables, and is robust against overfitting.
~ 82 % ACCURACY
RANDOM FOREST

The “Positive Increase” variable
contributes the most to the
hospitalization forecast.
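
A random forest classifier over the four predictors, together with its feature importances, could be sketched with scikit-learn as below. The data is synthetic: the labels are constructed from `positiveIncrease` alone, so the accuracy and importance ranking here are properties of the toy data, not a reproduction of the ~82% result. The split ratio, tree count, and random seeds are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

features = ["positiveIncrease", "negativeIncrease",
            "deathIncrease", "totalTestResultsIncrease"]

# Synthetic predictors (uniform random numbers, not real COVID-19 data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1000, size=(300, 4)), columns=features)

# Synthetic labels: low / medium / high terciles of positiveIncrease
p33, p66 = np.percentile(X["positiveIncrease"], [33, 66])
y = np.where(X["positiveIncrease"] < p33, "low",
             np.where(X["positiveIncrease"] <= p66, "medium", "high"))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each tree sees a bootstrap sample and a random subset of features per split;
# the predicted class is decided by majority vote across trees
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
```

`importances` ranks the predictors by how much each one contributes to the splits across the forest.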
RANDOM FOREST : CODE SNIPPET
UNSUPERVISED LEARNING : RUPTURE

PURPOSE OF RUPTURE FUNCTION


Detects significant changes in the trend of
time-series data.

HOW RUPTURE FUNCTION WORKS?


Highlights the effect of interventions such as
lockdowns, mask mandates, and vaccinations.
Red dashed lines show where the function
identified statistically significant shifts in the
hospitalization trend.
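
The idea behind change-point detection can be illustrated with a deliberately simplified stand-in. This is not the function used in the slides: it finds only a single break, by trying every split position and keeping the one where the two segments deviate least from their own means. The function name and the sample signal are assumptions.

```python
import numpy as np

def best_single_breakpoint(signal):
    """Return the index that best splits the signal into two constant-mean segments."""
    x = np.asarray(signal, dtype=float)
    best_index, best_cost = None, np.inf
    for k in range(1, len(x)):
        left, right = x[:k], x[k:]
        # Cost: total squared deviation of each segment from its own mean
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index

# Made-up signal: a flat level of 0 that jumps to 5 at position 10
signal = [0.0] * 10 + [5.0] * 10
breakpoint_index = best_single_breakpoint(signal)
```

Dedicated change-point libraries extend this idea to multiple breaks and add a penalty so that noise is not mistaken for a shift.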
RUPTURE
INTERPRETATION IN COVID-19 DATA
After vaccine approvals (Pfizer, Moderna),
there’s a clear shift in the trend, indicating a
decline in hospitalizations. This suggests the
effectiveness of vaccination in lowering hospitalizations.

The vaccine's effect takes about 25 days to appear in the trend.

Cases spiked in July, likely because of the easing of
lockdowns and reopenings, summer travel
and gatherings (U.S. Independence Day
celebrations on July 4), and improved testing
capabilities.
TURNING POINT

The line is smoothed using the moving average method.

Identify significant turning points in
COVID-19 hospitalization data; key events
(e.g., lockdowns, vaccine approvals) are
represented by vertical lines.

Red represents a declining trend, and
green represents a rising trend.
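
A moving average followed by a simple turning-point check could be sketched as below. The slides do not state the window length or the exact rule used, so both are assumptions: here a turning point is any position where the slope of the smoothed line changes sign.

```python
import numpy as np

def moving_average(values, window):
    """Smooth a series with an unweighted moving average."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def turning_points(smoothed):
    """Indices in the smoothed series where the trend flips direction."""
    slope = np.sign(np.diff(smoothed))
    # A negative product means a rising step is followed by a falling one, or vice versa
    flips = np.where(slope[:-1] * slope[1:] < 0)[0] + 1
    return flips.tolist()

# Made-up hospitalization counts: a rise followed by a decline
hospitalized = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1], dtype=float)

smoothed = moving_average(hospitalized, window=3)
points = turning_points(smoothed)
# Colour for each step of the smoothed line: green when rising, red when declining
trend_colors = ["green" if step > 0 else "red" for step in np.diff(smoothed)]
```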
LSTM MODEL
LSTM (LONG SHORT-TERM MEMORY) MODEL

WHAT IS LSTM?
LSTM is a type of Recurrent Neural Network (RNN)
designed to handle sequences and time-series
data.
It overcomes the limitations of traditional RNNs by
solving the "short-term memory" issue.

WHY USE LSTM?


Captures Sequential Patterns: LSTM is capable of
learning long-term dependencies in data, making
it ideal for time-series forecasting.
Effective Memory Units: LSTM cells have memory
gates that allow them to remember relevant
information and forget unnecessary data as
needed.
LSTM (LONG SHORT-TERM MEMORY) MODEL

HOW LSTM WORKS?


Input Gates: Control what new information enters
the memory cell.
Forget Gates: Decide what information to discard
from the cell.
Output Gates: Control what information to output
based on the cell state.
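
For reference, the three gates described above are usually written as follows in the standard LSTM formulation. The symbols are notation introduced here, not taken from the slides: $x_t$ is the input at time $t$, $h_t$ the hidden state, $c_t$ the cell state, $W$, $U$, $b$ are learned weights and biases, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(updated cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output / hidden state)}
\end{aligned}
```

The forget gate scales the old cell state, the input gate scales the new candidate, and the output gate decides how much of the cell state is exposed as output.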

APPLICATION
In this project, the LSTM model is used to predict
future COVID-19 cases by learning from past case
trends.
Data is scaled and divided into training, validation,
and testing sets so the model can be tuned and
evaluated on unseen data.
LSTM
DATA SCALING
Why Scale? Scaling ensures that all data is
within a similar range, which helps the LSTM
model learn effectively.
MinMaxScaler: This tool scales data to a range
between 0 and 1, which is ideal for LSTM input.

CREATING SEQUENCE
Window Size: A window size of 5 is used,
meaning each prediction is based on the
previous 5 values.
Features (X) and Labels (y): A helper function
creates input-output pairs based on the
specified window size.
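
Scaling and sequence creation can be sketched in plain NumPy as follows. The scaling function applies the same formula as a min-max scaler with a 0 to 1 range; the function names and the sample series are assumptions for the example.

```python
import numpy as np

def min_max_scale(values):
    """Scale values to the range [0, 1]; also return the bounds for inverse scaling."""
    lowest, highest = values.min(), values.max()
    return (values - lowest) / (highest - lowest), lowest, highest

def make_windows(series, window_size=5):
    """Build (X, y) pairs: each y is the value that follows a window of inputs."""
    X, y = [], []
    for start in range(len(series) - window_size):
        X.append(series[start:start + window_size])
        y.append(series[start + window_size])
    # LSTM layers expect input shaped (samples, window_size, features)
    return np.array(X).reshape(-1, window_size, 1), np.array(y)

# Made-up case counts for illustration only
cases = np.arange(10, dtype=float) * 100      # 0, 100, ..., 900

scaled, lowest, highest = min_max_scale(cases)
X, y = make_windows(scaled, window_size=5)

# Inverse transformation back to the original case scale
restored = y * (highest - lowest) + lowest
```

With 10 observations and a window of 5, this yields 5 input-output pairs.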
LSTM (MINMAXSCALER)
LSTM
MODEL DESIGNING
Model Architecture
Sequential Model: A linear stack of layers, ideal
for time series forecasting with LSTM.

Input Layer
Window Size: The input shape is (window_size,
1), where the window size is 5.
LSTM(64): 64 units in the LSTM layer to capture
time dependencies in data. This helps the model
retain memory over long sequences, which is
crucial for time series forecasting.
Dense(8, activation='relu'): A dense layer with 8
nodes and ReLU activation to process learned
features.
Dense(1, activation='linear'): A single-node
output layer with linear activation for continuous
output, ideal for forecasting.
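
The architecture described above maps onto Keras roughly as in this sketch. It assumes TensorFlow with its bundled Keras API is installed, and it only builds the model; nothing is trained here.

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential

window_size = 5  # each sample holds the previous 5 scaled values

model = Sequential([
    Input(shape=(window_size, 1)),   # (time steps, features per step)
    LSTM(64),                        # 64 units to capture time dependencies
    Dense(8, activation="relu"),     # small dense layer on the learned features
    Dense(1, activation="linear"),   # single continuous output
])

model.summary()
```

With these layer sizes the model has 17,425 trainable parameters, most of them in the LSTM layer.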
LSTM MODEL SUMMARY
LSTM
TRAINING
Checkpoint:
Saves the model only when it achieves a new best score on
validation data. This limits the effect of overfitting in later
epochs and ensures that the best model configuration is retained.
Compiling the Model:
MeanSquaredError() is the loss function: training minimizes the
squared difference between predicted and actual values.
Adam optimizer with a learning rate of 0.0001, enhancing stability
and convergence.
Tracks RootMeanSquaredError (RMSE) for model performance on
training and validation.
Training Execution:
Runs training for 100 epochs with both training and validation data.
Monitors performance using the ModelCheckpoint callback.
Loading the Best Model:
Reloads the best version of the trained model (saved in
"LSTM_model.keras") for accurate evaluation and predictions.
LSTM
PLOTTING
Inverse Transformation :
Converts scaled predictions and actual values back to original
COVID-19 case counts.
Separate predictions for training, validation, and testing sets.
Combining Predictions and Actual Values:
Concatenate predictions and actual values for a unified view.
Create a DataFrame with Prediction and Actual columns indexed
by date.
Plotting the Results:
Visualizes both Prediction and Actual on the same graph.
Labels and titles clarify the plot's purpose.
Gridlines and Legend aid readability and data interpretation.
Performance Metrics:
MAPE measures average error percentage between predictions
and actuals.
RMSE quantifies prediction error magnitude.
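
The two performance metrics can be written out directly, as in this sketch. The function names and sample numbers are assumptions; note that MAPE is undefined when any actual value is zero.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only
actual_cases = [100, 200]
predicted_cases = [110, 180]

error_percent = mape(actual_cases, predicted_cases)
error_magnitude = rmse(actual_cases, predicted_cases)
```

MAPE is scale-free, which makes it easy to communicate, while RMSE penalizes large misses more heavily.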
LSTM MODEL VISUALISATION
LSTM
FORECASTING
Setting Up the Forecast:
Choose future_days (e.g., 30 days) to set the forecast length.
Start with the final sequence of test data for prediction.
Iterative Forecasting:
Input the last sequence into the model to predict the next day’s
case count and append the predicted case to the forecast list.
Add the prediction to the sequence and remove the oldest value,
maintaining the window_size length for further predictions.
Inverse Scaling & Date Generation:
Transform predictions back to the original COVID case scale
Extend the timeline by future_days.
Plotting the Forecast:
Actual and predicted cases are plotted along with the 30-day
forecast. Titles and labels clarify each line’s significance.
Forecast Table:
Print the DataFrame of forecasted values, showing daily cases
over the specified forecast period.
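
The iterative forecasting loop described above can be sketched independently of any particular model. Here `predict_next` is a hypothetical stand-in for the trained LSTM (with a Keras model it would wrap a prediction call on the window reshaped to one sample), and a simple window average is used so the example runs on its own.

```python
import numpy as np

def iterative_forecast(predict_next, last_window, future_days):
    """Roll a one-step predictor forward, feeding each prediction back in."""
    window = list(last_window)
    forecast = []
    for _ in range(future_days):
        next_value = float(predict_next(np.array(window)))
        forecast.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction
        window = window[1:] + [next_value]
    return forecast

def mean_of_window(window):
    """Stand-in predictor: the next value is the average of the window."""
    return window.mean()

# Made-up final sequence of (scaled) test data
last_window = [1.0, 2.0, 3.0]
forecast = iterative_forecast(mean_of_window, last_window, future_days=2)
```

Because every step consumes earlier predictions, errors can compound as the forecast horizon grows.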
LSTM MODEL VISUALISATION
(FORECASTING)
WHAT WE APPLIED FROM THE LECTURE

Inspecting Data
“.info()”, “.head()”, and “.describe()”

Visualizing and Forecasting Data
heatmap, scatterplot, and boxplots
Linear Regression

WHAT WE EXPLORED

ML : Supervised Learning
Exponential Smoothing, Random Forest, LSTM

ML : Unsupervised Learning
Rupture and Turning Point
THANK YOU
