Climate Change Modeling

The Climate Change Modeling project utilizes machine learning to analyze user comments from NASA's Facebook page about climate change, aiming to predict and understand various climate indicators. The project involves multiple steps including data preparation, exploration, model training, and future projections, while ensuring ethical considerations for data privacy. The dataset, although small, provides opportunities for sentiment and trend analysis, and the project emphasizes the importance of impartiality in handling diverse public opinions.

Project Title Climate Change Modeling

Tools Jupyter Notebook and VS Code

Technologies Machine learning

Domain
Data Science

Project Difficulty Level Advanced

Dataset: The dataset is available at the link given below; you can download it at your
convenience.

Click here to download the dataset

About Dataset
Overview

This dataset encompasses over 500 user comments collected from high-performing
posts on NASA's Facebook page dedicated to climate change
(https://web.facebook.com/NASAClimateChange/). The comments, gathered from
various posts between 2020 and 2023, offer a diverse range of public opinions and
sentiments about climate change and NASA's related activities.
Data Science Applications

Despite not being a large dataset, it offers valuable opportunities for analysis and
Natural Language Processing (NLP). Potential applications include:

● Sentiment Analysis: Gauge public opinion on climate change and NASA's
communication strategies (a minimal sketch follows this list).
● Trend Analysis: Identify shifts in public sentiment over the specified period.
● Engagement Analysis: Understand the correlation between the content of a
post and user engagement.
● Topic Modeling: Discover prevalent themes in public discourse about climate
change.
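
As an illustration of the first application, here is a minimal sentiment analysis
sketch. It assumes the dataset has been saved locally as climate_nasa.csv (a
hypothetical filename) with the Text column described below, and uses NLTK's
VADER scorer; this is one possible approach, not part of the original materials.

# Score each comment with VADER (compound score ranges from -1 to +1)
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)

comments = pd.read_csv('climate_nasa.csv')  # hypothetical filename
sia = SentimentIntensityAnalyzer()
comments['sentiment'] = comments['Text'].fillna('').map(
    lambda t: sia.polarity_scores(t)['compound'])
print(comments[['Text', 'sentiment']].head())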

Column Descriptors

1. Date: The date and time when the comment was posted.
2. LikesCount: The number of likes each comment received.
3. ProfileName: The anonymized name of the user who posted the comment.
4. CommentsCount: The number of responses each comment received.
5. Text: The actual text content of the comment.

Ethical Considerations and Data Privacy

All profile names in this dataset have been hashed using SHA-256 to ensure privacy
while maintaining data usability. This approach aligns with ethical data mining
practices, ensuring that individual privacy is respected without compromising the
dataset's analytical value.
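
For reference, the hashing step described above can be reproduced with Python's
standard hashlib module; the profile name below is a made-up example, not a value
from the dataset.

import hashlib

profile_name = 'Jane Doe'  # hypothetical raw profile name
hashed = hashlib.sha256(profile_name.encode('utf-8')).hexdigest()
print(hashed)  # 64-character hex digest stored in place of the real name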

Acknowledgements
We extend our gratitude to NASA and their Facebook platform for facilitating open
discussions on climate change. Their commitment to fostering public engagement and
awareness on this critical global issue is deeply appreciated.

Note to Data Scientists

As data scientists analyze this dataset, it is crucial to approach the data impartially.
Climate change is a subject with diverse viewpoints, and it is important to handle the
data and any derived insights in a manner that respects these different perspectives.

Climate Change Modeling Machine Learning Project

Project Overview

The Climate Change Modeling project aims to develop a machine learning model to
predict and understand various aspects of climate change. This can include predicting
temperature changes, sea level rise, extreme weather events, and other related
phenomena. The project involves analyzing historical climate data, identifying trends,
and making future projections to help in planning and mitigation efforts.

Project Steps

1. Understanding the Problem
○ The goal is to predict and model various climate change indicators, such as
temperature anomalies, precipitation patterns, and sea level changes,
using historical climate data and machine learning techniques.
2. Dataset Preparation
○ Data Sources: Collect data from sources like NOAA (National Oceanic and
Atmospheric Administration), NASA, IPCC (Intergovernmental Panel on
Climate Change), and other climate research organizations.
○ Features: Include variables like temperature, precipitation, CO2 levels,
solar radiation, sea level, and other relevant environmental factors.
○ Labels: Climate change indicators such as temperature anomalies, sea
level rise, frequency of extreme weather events.
3. Data Exploration and Visualization
○ Load and explore the dataset using descriptive statistics and visualization
techniques.
○ Use libraries like Pandas for data manipulation and Matplotlib/Seaborn for
visualization.
○ Identify trends, patterns, and correlations in the data.
4. Data Preprocessing
○ Handle missing values through imputation or removal.
○ Standardize or normalize continuous features.
○ Encode categorical variables using techniques like one-hot encoding.
○ Split the dataset into training, validation, and testing sets.
5. Feature Engineering
○ Create new features that may be useful for prediction, such as rolling
averages or lagged variables (a minimal sketch follows this list).
○ Perform feature selection to identify the most relevant features for the
model.
6. Model Selection and Training
○ Choose appropriate machine learning algorithms based on the problem.
Common choices include:
■ Linear Regression
■ Decision Trees
■ Random Forest
■ Gradient Boosting Machines (e.g., XGBoost)
■ Neural Networks
■ Long Short-Term Memory (LSTM) networks for time series data
○ Train multiple models to find the best-performing one.
7. Model Evaluation
○ Evaluate the models using metrics like Mean Absolute Error (MAE), Mean
Squared Error (MSE), and R-squared.
○ Use cross-validation to ensure the model generalizes well to unseen data.
○ Visualize model performance using plots like residual plots and predicted
vs. actual plots.
8. Future Projections
○ Use the trained model to make future projections of climate change
indicators.
○ Validate the projections using available data and compare them with
scientific forecasts and models.
9. Scenario Analysis
○ Conduct scenario analysis to understand the impact of different factors
(e.g., CO2 emission scenarios) on climate change.
○ Use the model to simulate different scenarios and assess their potential
impact.
10. Deployment (Optional)
○ Deploy the model using a web framework like Flask or Django.
○ Create a user-friendly interface where users can input data and receive
climate change predictions and scenarios.
11. Documentation and Reporting
○ Document the entire process, including data exploration, preprocessing,
feature engineering, model training, evaluation, and projections.
○ Create a final report or presentation summarizing the project, results, and
insights.
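
As referenced in step 5, here is a minimal feature engineering sketch. The
dataframe and column names are illustrative assumptions, not taken from a
specific dataset.

# Build rolling-average and lagged features from a monthly temperature series
import pandas as pd

df = pd.DataFrame({'temperature': range(24)},
                  index=pd.date_range('2020-01-01', periods=24, freq='MS'))

df['temp_roll_3'] = df['temperature'].rolling(window=3).mean()  # 3-month rolling mean
df['temp_lag_1'] = df['temperature'].shift(1)                   # previous month's value
df = df.dropna()  # rolling/lag features are undefined for the first rows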
Example: You can get a basic idea of how to create such a project from here

Sample Code

Here's a basic example using Python and scikit-learn to model climate change
indicators:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
# Example: Using a mock dataset with climate data
data = pd.read_csv('climate_data.csv')

# Explore the dataset
print(data.head())
print(data.describe())

# Preprocess the data
# Separate features and labels
X = data.drop('temperature_anomaly', axis=1)
y = data['temperature_anomaly']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'R2: {r2}')

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Temperature Anomaly')
plt.ylabel('Predicted Temperature Anomaly')
plt.title('Actual vs Predicted Temperature Anomaly')
plt.show()

# Future projections (mock example)
# Assuming we have future data for the same features
future_data = pd.read_csv('future_climate_data.csv')
future_data_scaled = scaler.transform(future_data)
future_predictions = model.predict(future_data_scaled)

print(future_predictions)

This code demonstrates loading a climate dataset, preprocessing the data, training a
Random Forest regressor, evaluating the model, and making future projections.
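
The evaluation above uses a single train/test split. As a sketch, the same model
can also be checked with 5-fold cross-validation (reusing the X and y defined
earlier) to confirm it generalizes:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                            cv=5, scoring='neg_mean_absolute_error')
print(f'CV MAE: {-cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})')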

Additional Tips

● Incorporate domain expertise to ensure the model's predictions are realistic and
scientifically valid.
● Use advanced time series forecasting techniques like LSTM networks for more
accurate long-term predictions (a minimal sketch follows this list).
● Continuously update the model with new data to improve its accuracy and
relevance over time.
● Collaborate with climate scientists to validate and interpret the model's
predictions.
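
As mentioned in the tips, LSTM networks are one option for time series
forecasting. Below is a minimal sketch using Keras; the synthetic series and
window size are illustrative assumptions, not project specifications.

# Frame a univariate series as a supervised sequence problem and fit an LSTM
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, window=12):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.sin(np.linspace(0, 20, 300))  # stand-in for a real climate series
X, y = make_windows(series)

model = Sequential([LSTM(32, input_shape=(X.shape[1], 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))  # one-step-ahead forecast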
Example: You can get a basic idea of how to create such a project from here

Sample code with output

%%capture
# Install relevant libraries
!pip install geopandas folium

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import random
import os
from tqdm.notebook import tqdm

import geopandas as gpd
from shapely.geometry import Point
import folium

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

pd.options.display.float_format = '{:.5f}'.format
pd.options.display.max_rows = None

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# You can ignore the Shapely GEOS warning :-)

/opt/conda/lib/python3.7/site-packages/geopandas/_compat.py:115
: UserWarning: The Shapely GEOS version (3.9.1-CAPI-1.14.2) is
incompatible with the GEOS version PyGEOS was compiled with
(3.10.4-CAPI-1.16.2). Conversions between both will be slow.
shapely_geos_version, geos_capi_version_string

In [3]:
# Set seed for reproducibility
SEED = 2023
random.seed(SEED)
np.random.seed(SEED)
2. Loading and previewing data

In [4]:
DATA_PATH = '/kaggle/input/playground-series-s3e20'
# Load files
train = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
samplesubmission = pd.read_csv(os.path.join(DATA_PATH,
'sample_submission.csv'))

# Preview train dataset
train.head()

Out[4]:

[train.head() output: a 5 × 76 preview of the train set. Columns include
ID_LAT_LON_YEAR_WEEK, latitude, longitude, year, week_no, satellite-derived
feature groups (SulphurDioxide_*, CarbonMonoxide_*, NitrogenDioxide_*,
Formaldehyde_*, UvAerosolLayerHeight_*, Ozone_*, Cloud_*) and the emission
target; the wide table does not reproduce legibly here.]

5 rows × 76 columns

In [5]:
# Preview test dataset
test.head()

Out[5]:

[test.head() output: a 5 × 75 preview with the same feature columns as the
train set but without the emission target; the wide table does not reproduce
legibly here.]

5 rows × 75 columns
In [6]:
# Preview sample submission file
samplesubmission.head()

Out[6]:

   ID_LAT_LON_YEAR_WEEK      emission
0  ID_-0.510_29.290_2022_00  81.94000
1  ID_-0.510_29.290_2022_01  81.94000
2  ID_-0.510_29.290_2022_02  81.94000
3  ID_-0.510_29.290_2022_03  81.94000
4  ID_-0.510_29.290_2022_04  81.94000

In [7]:
# Check size and shape of datasets
train.shape, test.shape, samplesubmission.shape

Out[7]:

((79023, 76), (24353, 75), (24353, 2))

In [8]:
# Train to test sets ratio
(test.shape[0]) / (train.shape[0] + test.shape[0])

Out[8]:

0.23557692307692307

3. Statistical summaries

In [9]:
# Train statistical summary
train.describe(include = 'all')
Out[9]:

[train.describe(include='all') output: an 11 × 76 summary (count, unique, top,
freq, mean, std, min, 25%, 50%, 75%, max). Most columns have 79,023 entries,
while the SulphurDioxide columns have only 64,414 non-null values and some
Cloud columns 78,539, confirming missing data in the satellite features; the
wide table does not reproduce legibly here.]

11 rows × 76 columns

From the above statistical summary, we can deduce the following insights:

● The train data provided ranges from year 2019 to 2021
● The minimum recorded CO2 emission is 0.32064 and the maximum is 3167.76800
● The week of the year runs from 0 to 52
● The latitude and longitude ranges show that the regions are mostly within
Rwanda

In [10]:
# Target variable distribution
sns.set_style('darkgrid')
plt.figure(figsize = (13, 7))
sns.histplot(train.emission, kde = True, bins = 15)
plt.title('Target variable distribution', y = 1.02, fontsize =
15)
display(plt.show(), train.emission.skew())
None

10.173825825101622

The target variable is skewed to the right, with a skewness of ~10.

Some of the techniques used to handle skewness include:

● Log transform
● Box-Cox transform
● Square root transform
● etc.
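
A minimal sketch of these transforms applied to the emission target (log1p is
used since the minimum emission is close to zero, and Box-Cox requires strictly
positive values, hence the small shift):

import numpy as np
from scipy import stats

log_emission = np.log1p(train.emission)
sqrt_emission = np.sqrt(train.emission)
boxcox_emission, lam = stats.boxcox(train.emission + 1e-6)
print(log_emission.skew(), sqrt_emission.skew())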

4. Outliers

In [11]:
# Plotting boxplot for the CO2 emissions
sns.set_style('darkgrid')
plt.figure(figsize = (13, 7))
sns.boxplot(train.emission)
plt.title('Boxplot showing CO2 emission outliers', y = 1.02,
fontsize = 15)
plt.show()
Outliers are data points that differ significantly from the other observations
in the given dataset.

Suggestions on how to handle outliers (a minimal sketch follows this list):

● Transforming the outliers by scaling - log transformation, Box-Cox
transformation, etc.
● Dropping the outliers
● Imputation - replacing outliers with the mean, median, etc.
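
For example, clipping the emission outliers at the usual 1.5 * IQR fence is one
alternative to dropping them; the new column name below is illustrative:

q1, q3 = train.emission.quantile([0.25, 0.75])
iqr = q3 - q1
train['emission_clipped'] = train.emission.clip(upper=q3 + 1.5 * iqr)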

5. Geo Visualisation - EDA

In [12]:
# Combine train and test for easy visualisation
train_coords = train.drop_duplicates(subset = ['latitude',
'longitude'])
test_coords = test.drop_duplicates(subset = ['latitude',
'longitude'])
train_coords['set_type'], test_coords['set_type'] = 'train',
'test'

all_data = pd.concat([train_coords, test_coords], ignore_index


= True)
# Create point geometries

geometry = gpd.points_from_xy(all_data.longitude,
all_data.latitude)
geo_df = gpd.GeoDataFrame(
all_data[["latitude", "longitude", "set_type"]],
geometry=geometry
)

# Preview the geopandas df


geo_df.head()

Out[12]:

   latitude  longitude  set_type                   geometry
0  -0.51000   29.29000     train  POINT (29.29000 -0.51000)
1  -0.52800   29.47200     train  POINT (29.47200 -0.52800)
2  -0.54700   29.65300     train  POINT (29.65300 -0.54700)
3  -0.56900   30.03100     train  POINT (30.03100 -0.56900)
4  -0.59800   29.10200     train  POINT (29.10200 -0.59800)

In [13]:
# Create a canvas to plot your map on
all_data_map = folium.Map(prefer_canvas=True)

# Create a geometry list from the GeoDataFrame
geo_df_list = [[point.xy[1][0], point.xy[0][0]] for point in geo_df.geometry]

# Iterate through the list and add a marker for each location,
# color-coded by its set type (train or test).
i = 0
for coordinates in geo_df_list:
    # Assign a marker color for the set type
    if geo_df.set_type[i] == "train":
        type_color = "green"
    elif geo_df.set_type[i] == "test":
        type_color = "orange"

    # Place the markers
    all_data_map.add_child(
        folium.CircleMarker(
            location=coordinates,
            radius=1,
            weight=4,
            popup="Set: " + str(geo_df.set_type[i]) + "<br>"
                  "Coordinates: " + str([round(x, 2) for x in geo_df_list[i]]),
            color=type_color)
    )
    i = i + 1

all_data_map.fit_bounds(all_data_map.get_bounds())
all_data_map

Out[13]:

[Interactive folium map with train locations marked in green and test
locations in orange. To render it in Jupyter: File -> Trust Notebook.]

6. Missing values and duplicates

In [14]:
# Check for missing values
train.isnull().sum().any(), test.isnull().sum().any()

Out[14]:

(True, True)

In [15]:
# Plot missing values in train set
ax = train.isna().sum().sort_values(ascending = False)[:15][::-1].plot(kind = 'barh', figsize = (9, 10))
plt.title('Percentage of Missing Values Per Column in Train Set', fontdict={'size':15})
for p in ax.patches:
    percentage = '{:,.0f}%'.format((p.get_width() / train.shape[0]) * 100)
    width, height = p.get_width(), p.get_height()
    x = p.get_x() + width + 0.02
    y = p.get_y() + height / 2
    ax.annotate(percentage, (x, y))
Suggestions on how to handle missing values (a minimal sketch follows this list):

● Fill in missing values with the mode, mean, or median
● Drop rows or columns with missing values
● Fill in with a sentinel value, e.g. -999999
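
As referenced above, a minimal median-imputation sketch (the ID and target
columns are excluded, and the train medians are reused on test to avoid
leakage):

feature_cols = [c for c in train.columns
                if c not in ('ID_LAT_LON_YEAR_WEEK', 'emission')]
medians = train[feature_cols].median(numeric_only=True)
train[feature_cols] = train[feature_cols].fillna(medians)
test[feature_cols] = test[feature_cols].fillna(medians)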

In [16]:
# Check for duplicates
train.duplicated().any(), test.duplicated().any()
Out[16]:

(False, False)

7. Date features EDA

In [17]:
# Year countplot
plt.figure(figsize = (14, 7))
sns.countplot(x = 'year', data = train)
plt.title('Year count plot - Train')
plt.show()
In [18]:
# Year countplot
plt.figure(figsize = (4, 7))
sns.countplot(x = 'year', data = test)
plt.title('Year count plot - Test')
plt.show()
● The number of observations of CO2 emissions is the same across the years
(2019, 2020, 2021)
● Year 2022 (in the test set) has fewer observations

In [19]:
# Week countplot
plt.figure(figsize = (14, 7))
sns.countplot(x = 'week_no', data = train)
plt.title('Week count plot - Train')
plt.show()

● The number of observations of CO2 emissions is roughly the same across
the weeks

In [20]:
train.drop_duplicates(subset = ['year', 'week_no']).groupby(['year'])[['week_no']].count()

Out[20]:

      week_no
year
2019       53
2020       53
2021       53

8. Correlations - EDA

In [21]:
# Top 20 correlated features to the target
top20_corrs = abs(train.corr()['emission']).sort_values(ascending = False).head(20)
top20_corrs

Out[21]:

emission                                                    1.00000
longitude                                                   0.10275
UvAerosolLayerHeight_aerosol_height                         0.06901
UvAerosolLayerHeight_aerosol_pressure                       0.06814
Cloud_surface_albedo                                        0.04659
CarbonMonoxide_H2O_column_number_density                    0.04322
CarbonMonoxide_CO_column_number_density                     0.04133
Formaldehyde_tropospheric_HCHO_column_number_density_amf    0.04026
UvAerosolLayerHeight_aerosol_optical_depth                  0.04016
UvAerosolLayerHeight_sensor_azimuth_angle                   0.03514
NitrogenDioxide_solar_azimuth_angle                         0.03342
Formaldehyde_tropospheric_HCHO_column_number_density        0.03333
SulphurDioxide_solar_azimuth_angle                          0.03234
Formaldehyde_solar_azimuth_angle                            0.03081
NitrogenDioxide_sensor_altitude                             0.02754
UvAerosolLayerHeight_solar_azimuth_angle                    0.02721
NitrogenDioxide_sensor_azimuth_angle                        0.02710
CarbonMonoxide_solar_azimuth_angle                          0.02628
SulphurDioxide_sensor_azimuth_angle                         0.02508
Ozone_solar_azimuth_angle                                   0.02485
Name: emission, dtype: float64

In [22]:
# Quantify correlations between features
corr = train[list(top20_corrs.index)].corr()
plt.figure(figsize = (13, 8))
sns.heatmap(corr, cmap='RdYlGn', annot = True, center = 0)
plt.title('Correlogram', fontsize = 15, color = 'darkgreen')
plt.show()

9. Timeseries visualization - EDA

In [23]:
# Sample a unique location and visualize its emissions across the years
train.latitude, train.longitude = round(train.latitude, 2), round(train.longitude, 2)
sample_loc = train[(train.latitude == -0.51) & (train.longitude == 29.29)]

# Plot a line plot per year
sns.set_style('darkgrid')
fig, axes = plt.subplots(nrows = 3, ncols = 1, figsize = (13, 10))
fig.suptitle('CO2 emissions for location lat -0.51 lon 29.29', y=1.02, fontsize = 15)

for ax, year, color in zip(axes.flatten(), sample_loc.year.unique(),
                           ['#882255', '#332288', '#999933']):
    df = sample_loc[sample_loc.year == year]
    sns.lineplot(x=df.week_no, y=df.emission, ax=ax, label=year, color=color)
    ax.legend()
plt.tight_layout()
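
As a next step, here is a minimal baseline sketch (not part of the original
notebook): fit the Random Forest imported earlier on the location and date
columns and report a validation RMSE. The feature list is a simplifying
assumption.

# Baseline: Random Forest on latitude, longitude, year and week number
features = ['latitude', 'longitude', 'year', 'week_no']
X_tr, X_val, y_tr, y_val = train_test_split(
    train[features], train['emission'], test_size=0.2, random_state=SEED)

rf = RandomForestRegressor(n_estimators=100, random_state=SEED)
rf.fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, rf.predict(X_val)) ** 0.5
print(f'Validation RMSE: {rmse:.3f}')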
Reference code
