BA Assignment - v4

ba

Uploaded by

rasikajs
Weather Forecasting Analysis using the Linear Regression Algorithm

Rasika Jayawardena
CBO12267 - MSc in IT
Asia Pacific Institute of Information Technology
Colombo, Sri Lanka
[email protected]

ABSTRACT

The paper discusses the increasing diversity of information sources and the exponential growth of data volume, particularly in open data initiatives and platforms. It highlights the importance of Information Visualization (OD) in various fields and sectors, particularly for CSV files, and how it can help users quickly understand the structure and issues of these files.

INTRODUCTION

The exponential growth of heterogeneous data from various sources, including the Internet of Things and user-generated content, presents both challenges and opportunities for businesses and academics (Bernd, 2016). This data, encompassing both structured and unstructured data, has transformed operations through improved customer service, payments, business models, and online engagement (Tanvi, 2021).

BACKGROUND

Weather refers to the daily fluctuations of the atmosphere, which are recorded through various observations such as sea, ground, and radar measurements. This data is used to forecast weather through various applications and models. Weather forecasts are crucial for daily decision-making, impacting agriculture, irrigation, and marine trade, and saving lives from accidents. They also affect industries, transportation, disaster management, and energy management.

➢ Agriculture/Food

Weather forecasting and big data analytics can improve agricultural production by predicting soil erosion, overwatering, and drought, enabling farmers to estimate food prices and supermarket chains to control stock more efficiently (Sudhnya, 2019).

➢ Tourism

Climate conditions are crucial in tourism, as they influence people's choice of destinations for various purposes. Weather forecasts are essential for ensuring safety and convenience, and can also be used to estimate the effect of climate change on the industry's benefits (Simone, 2021).

➢ Construction

Weather forecasting helps protect construction workers, activities, and resources from climate-related hazards, enabling earlier identification and planning, thereby ensuring worker safety and saving money (Hafiz, 2018).

➢ Sports

Weather conditions like rainfall, lightning, and wind significantly impact sports such as sailing and golf, and forecasting helps determine when to cover courts during rain (Aishwary, 2023).

➢ Transportation

Weather forecasts help shipping lines anticipate storm dangers and determine sailing routes, and help airlines avoid flight delays and cancellations worldwide (Mahesh, 2019).

➢ Disaster Management

Natural disasters worldwide can be predicted using big data analytics, reducing fatalities and economic damage. The accuracy and lead time vary by disaster type (Kuldeep, 2020).

An overview of the dataset

| S/N | Attribute | Description | Data Type |
|-----|-----------|-------------|-----------|
| 1 | Time | Observation timestamps. | Date |
| 2 | Weathercode | A numerical code that represents the current weather conditions. | Numeric |
| 3 | Temperature_2m_max | The maximum temperature at 2 meters. | Numeric |
| 4 | Temperature_2m_min | The minimum temperature at 2 meters. | Numeric |
| 5 | Temperature_2m_mean | The mean temperature at 2 meters. | Numeric |
| 6 | Apparent_temperature_max | The maximum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric |
| 7 | Apparent_temperature_min | The minimum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric |
| 8 | Apparent_temperature_mean | The mean apparent temperature, determined by considering factors like wind chill and heat index. | Numeric |
| 9 | Sunrise | Each day's sunrise time. | Datetime |
| 10 | Sunset | Each day's sunset time. | Datetime |
| 11 | Shortwave radiation sum | Observed shortwave radiation. | Numeric |
| 12 | Precipitation sum | Sum of daily precipitation. | Numeric |
| 13 | Rain sum | Rain sum. | Numeric |
| 14 | Snowfall sum | Snowfall sum. | Numeric |
| 15 | Precipitation hours | Precipitation hours. | Numeric |
| 16 | Windspeed_10m_max | Maximum wind speed at 10 meters above ground. | Numeric |
| 17 | Windgusts_10m_max | Maximum wind gust at 10 meters above ground. | Numeric |
| 18 | Winddirection_10m_dominant | Dominant wind direction at 10 meters above ground. | Numeric |
| 19 | et0_fao_evapotranspiration | Reference evapotranspiration (ET0) based on the FAO Penman-Monteith equation. | Numeric |
| 20 | latitude | City latitude. | Numeric |
| 21 | longitude | City longitude. | Numeric |
| 22 | elevation | City elevation. | Numeric |
| 23 | Country | Country name associated with each weather observation. | String |
| 24 | City | Name of the observation city. | String |

DEFINING THE QUESTION

This study examines Sri Lanka's weather data from 2010 to 2023, focusing on the impact of weather on agriculture. Variables like temperature, precipitation, and sunlight hours are crucial. Time-related variables are used to analyze seasonal changes and precipitation patterns. Extreme weather events are recognized using parameters like 'precipitation_sum' and 'temperature_2m_max'. Stakeholders, such as agricultural experts, environmental scientists, and local communities, are consulted to understand the problem and its causes. This helps frame the problem more accurately (Randima, 2022).
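As a rough illustration of how extreme weather days might be recognized from 'precipitation_sum' and 'temperature_2m_max', the sketch below flags values above a high quantile. The sample values, the helper name flag_extremes, and the 95th-percentile threshold are all assumptions made for this example; the study does not state which threshold it used.

```python
import pandas as pd

# Tiny made-up sample; column names follow the dataset overview.
weather = pd.DataFrame({
    "time": pd.date_range("2023-01-01", periods=10, freq="D"),
    "precipitation_sum": [0.0, 1.2, 0.4, 55.0, 2.1, 0.0, 3.3, 0.8, 1.0, 0.2],
    "temperature_2m_max": [30.1, 29.8, 30.5, 27.0, 31.0, 36.9, 30.2, 29.9, 30.4, 30.0],
})


def flag_extremes(df, column, quantile=0.95):
    """Return a Boolean Series marking rows above the given quantile of `column`.

    Hypothetical helper: the quantile threshold is an assumption, not the paper's rule.
    """
    threshold = df[column].quantile(quantile)
    return df[column] > threshold


weather["extreme_rain"] = flag_extremes(weather, "precipitation_sum")
weather["extreme_heat"] = flag_extremes(weather, "temperature_2m_max")

# Keep the days flagged by either rule.
extreme_days = weather[weather["extreme_rain"] | weather["extreme_heat"]]
print(extreme_days[["time", "precipitation_sum", "temperature_2m_max"]])
```

A fixed physical threshold (for example, a rainfall amount defined by a meteorological agency) could replace the quantile if one is available.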

COLLECTING THE DATA

The Sri Lanka Weather Dataset, available on Kaggle, contains comprehensive weather data for 30 major cities in Sri Lanka from 2010 to 2023. Exploratory data analysis (EDA) is performed on the data to check column names, data types, and overall consistency. The dataset is then used to support better understanding and decision-making (Randima, 2022).
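The initial checks on column names, data types, and coverage could look roughly like the sketch below. It reads a three-row stand-in CSV from memory so that it runs on its own; the sample rows and the set of expected columns are assumptions, and in practice the path to the downloaded Kaggle file would be passed to read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV (made-up rows); pass the real file path in practice.
csv_text = """time,city,temperature_2m_max,precipitation_sum
2010-01-01,Colombo,30.0,0.0
2010-01-02,Colombo,29.9,0.1
2010-01-01,Kandy,26.5,1.8
"""

weather = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

# Columns this sketch expects to find (an assumption for the example).
expected_columns = {"time", "city", "temperature_2m_max", "precipitation_sum"}
missing_columns = expected_columns - set(weather.columns)

print(weather.dtypes)                                # data type of each column
print(weather["city"].nunique())                     # number of distinct cities
print(weather["time"].min(), weather["time"].max())  # covered date range
print("missing columns:", missing_columns)
```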

CLEANING THE DATA

Data errors, duplicates, or mislabeled items from multiple sources can negatively impact outcomes and algorithms. Data cleaning involves removing garbage, incorrect, duplicate, corrupt, or incomplete data from a dataset. Clean datasets are essential for the analytical process and information science, making analytical tools and business intelligence more effective and user-friendly (Randima, 2022).

RESEARCH METHODOLOGY

The data analysis cycle is a process of transforming raw data into meaningful insights that can inform decision-making. The cycle typically consists of the following steps (Randima, 2022):

• Defining the question
• Collecting the data
• Cleaning the data
• Analyzing the data
• Sharing the results

➢ Error-Free Data

Data cleaning is a crucial process that removes errors and garbage values from data, enhancing analysis efficiency and saving time. Inaccurate data can lead to mistakes, and incorrect or corrupt data is easier to fix when errors are monitored and properly reported.

➢ Data Quality

The analysis of weather data can be improved by addressing potential data quality issues such as missing values, incorrect date/time formats, temperature and precipitation outliers, inconsistent unit values, invalid wind direction values, and errors in geographical coordinates and elevation.

➢ Accurate and Efficient

The weather dataset, including time, temperature, precipitation, wind, and geographic coordinates, was validated and standardized to address issues like inconsistent temperature values, outliers, and data entry errors, enhancing its reliability for climate research and agriculture planning.

➢ Complete Data

The complete weather dataset includes essential fields like time, weather code, temperature, sunrise, sunset, precipitation sums, rain sums, snowfall sums, and more. It has been cleaned and validated to address missing values, inconsistent temperature values, outliers, and data entry errors. This refined dataset is an excellent resource for climate research, agriculture planning, and energy management.

➢ Maintain Data Consistency

Consistency in data can be measured by comparing systems within the same dataset or across multiple datasets. The weather dataset, which includes fields like time, weather code, temperature, sunrise, sunset, and more, requires consistency across these fields. The dataset uses uniform units, date/time formats, and measurement methods to maintain high accuracy, improving its use in climate research, agriculture planning, and energy management.

DATA CLEANING CYCLE

Data cleaning is a data analysis method that involves analyzing, identifying, and correcting untidy raw data: filling in missing values and identifying and correcting errors, with techniques varying depending on the dataset type (Jiawei, 2012).

To clean CSV data with Python, follow these steps:

1. Importing the necessary libraries: Access the Pandas library by typing "import pandas as pd". Depending on your cleaning needs, you might also need to import NumPy and Matplotlib.

2. Reading the CSV file: Pandas' read_csv() function reads CSV files with parameters like file path, delimiter, and column names, returning a DataFrame object for data manipulation and cleaning.

3. Checking for missing values: Use isnull() to create a Boolean mask of missing values and sum() to count them per column; then remove the affected rows with dropna() or fill them with fillna().

4. Removing duplicate rows: The drop_duplicates() function removes duplicate rows from a DataFrame. Its 'subset' parameter restricts the duplicate check to specific columns, and its 'keep' parameter controls which occurrence is kept, for example only the first.

5. Handling outliers: The describe() function gives summary statistics that help spot outliers in the numerical columns of a DataFrame. Using the Z-score method or the Interquartile Range (IQR) method, outliers in specific fields like temperature, precipitation, and wind speed can be mitigated. Filtering out extreme values in temperature and precipitation makes analyses for climate research, agriculture planning, and energy management more reliable.

6. Data transformation: To optimize data analysis, it may be necessary to convert data types, normalize data, or create new columns based on existing data, such as date columns.

7. Saving the cleaned data: The Pandas to_csv() function allows the cleaned data to be saved as a new CSV file, taking parameters like file path, index, and header.

8. Data validation: After cleaning, it is crucial to validate the data by comparing it with the original or external data sources to confirm that the necessary information has been preserved.
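The cleaning steps for CSV data can be sketched end to end as follows. This is a minimal example on made-up rows held in memory, not the pipeline actually used in the study: the sample values, the choice to drop (rather than fill) missing rows, and the 1.5 × IQR rule applied only to 'temperature_2m_max' are assumptions for illustration.

```python
import io

import pandas as pd  # Step 1: import the library

# Made-up sample with one duplicate row, one missing value, and one outlier.
raw_csv = """time,city,temperature_2m_max,precipitation_sum
2023-01-01,Colombo,30.0,0.0
2023-01-01,Colombo,30.0,0.0
2023-01-02,Colombo,,1.2
2023-01-03,Colombo,30.4,0.4
2023-01-04,Colombo,95.0,0.1
2023-01-05,Colombo,29.8,2.0
2023-01-06,Colombo,30.2,0.0
"""

# Step 2: read the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(raw_csv))
rows_before = len(df)

# Step 3: count missing values per column, then drop incomplete rows.
missing_per_column = df.isnull().sum()
df = df.dropna()

# Step 4: remove exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first")

# Step 5: keep rows whose temperature lies within 1.5 * IQR of the quartiles.
q1 = df["temperature_2m_max"].quantile(0.25)
q3 = df["temperature_2m_max"].quantile(0.75)
iqr = q3 - q1
in_range = df["temperature_2m_max"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].copy()

# Step 6: convert the time column to datetime and derive a month column.
df["time"] = pd.to_datetime(df["time"])
df["month"] = df["time"].dt.month

# Step 7: serialise the cleaned data (to_csv returns a string when no path is given).
cleaned_csv = df.to_csv(index=False)

# Step 8: a basic validation, comparing row counts before and after cleaning.
print(f"{len(df)} of {rows_before} rows kept")
print(cleaned_csv)
```

Whether to drop or fill missing values depends on the analysis; fillna() with an interpolated or per-city mean value is a common alternative when losing rows is undesirable.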
VISUALIZING THE DATA

In order to gain a better understanding of the data, plots are used to visualize it (J. Phys, 2021).

The Sri Lanka diagram below shows the cities that are available in the data set.

[Figure: Analysis of temperature, precipitation, and wind speed in Sri Lanka from 2010 to 2023]

[Figure: Analysis of temperature, precipitation, and wind speed in Sri Lanka, annually]

[Figure: Analysis of temperature, precipitation, and wind speed in Sri Lanka, monthly]

Using minimum and maximum temperatures, precipitation, and wind speed to analyze weather conditions can provide numerous benefits to agriculture, sports, tourism, and travel.

Weather data is crucial to crop management, pest and disease control, and soil moisture management in agriculture. In addition to monitoring wind speed, farmers use temperature and precipitation patterns to optimize planting, irrigation, and harvesting schedules. Farmers can also adjust pest management strategies based on weather information in order to predict and manage pest and disease outbreaks. Using precipitation data, they can plan irrigation schedules and implement better water conservation strategies.

[Figure: Analysis of temperature and precipitation by season]

During outdoor events, sports organizers can implement safety precautions based on weather data. In cricket, for example, a ball's bounce and movement are affected by pitch hardness and moisture content. Rain can dampen pitches, favoring spin bowling. Wind speed alters the trajectory of the ball, affecting a bowler's swing and movement and, in turn, how batsmen play it. Weather conditions can also affect player performance, with high temperatures causing fatigue and dehydration, and precipitation making conditions slippery. Teams adapt their strategies to the weather, such as using swing bowlers when the conditions are swing-friendly. Weather interruptions can change match dynamics, affecting schedules and outcomes.

Tourism destinations and outdoor activities are heavily influenced by weather conditions. Travelers prioritize destinations with favorable weather conditions for specific activities, and tourism industry professionals use weather data to promote such destinations. The weather forecast helps travelers plan outdoor activities such as hiking, sightseeing, or beach trips, ensuring they are prepared appropriately and have a pleasant experience. Knowledge of precipitation and wind speed is particularly valuable to travelers for optimizing their travel experiences.

[Figure: Analysis of temperature, precipitation, and wind speed for cities in Sri Lanka]
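The monthly and annual summaries behind the analyses in this section can be sketched as a groupby aggregation. The data below is randomly generated for a single hypothetical city, and the choice of aggregation (mean for temperature and wind speed, sum for precipitation) is an assumption; the original plots may have been built differently.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for one city; values are made up for illustration.
dates = pd.date_range("2022-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
weather = pd.DataFrame({
    "time": dates,
    "temperature_2m_max": 30 + rng.normal(0, 1, len(dates)),
    "precipitation_sum": rng.gamma(1.0, 3.0, len(dates)),
    "windspeed_10m_max": 15 + rng.normal(0, 3, len(dates)),
})

# How each variable is summarised within a group (an assumption of this sketch).
aggregations = {
    "temperature_2m_max": "mean",
    "precipitation_sum": "sum",
    "windspeed_10m_max": "mean",
}

# Group by calendar month (pooled over all years) and by year.
monthly = weather.groupby(weather["time"].dt.month).agg(aggregations)
annual = weather.groupby(weather["time"].dt.year).agg(aggregations)

print(monthly.round(2))
print(annual.round(2))
```

Each resulting table can be passed to a plotting call such as DataFrame.plot() to draw the corresponding chart.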

There is a good correlation between maximum temperature and apparent maximum temperature.
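A correlation of this kind can be checked numerically with a Pearson correlation matrix. The six rows below are invented values chosen only to show the mechanics, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd

# Invented values for illustration; not taken from the Sri Lanka dataset.
weather = pd.DataFrame({
    "temperature_2m_max": [28.0, 29.5, 30.1, 31.2, 32.0, 33.4],
    "apparent_temperature_max": [31.0, 33.2, 34.0, 35.9, 37.1, 39.0],
    "precipitation_sum": [4.0, 0.0, 12.5, 0.3, 7.8, 1.1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = weather.corr(method="pearson")
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.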
The graph below shows the relationship between temperature_2m_max and precipitation sum. A very high correlation might be expected here but, as the graph shows, this correlation is absent.

Prediction using Regression

Based on regression modeling, I made predictions for three scenarios.

1. X = weather['temperature_max'].values.reshape(-1,1)
   y = weather['evapotranspiration'].values.reshape(-1,1)

2. X = weather['temperature_max'].values.reshape(-1,1)
   y = weather['apparent_temperature'].values.reshape(-1,1)

3. X = weather['temperature_2m_max'].values.reshape(-1,1)
   y = weather['target'].values.reshape(-1,1)

[Figure: Plot diagrams of the three scenarios]

A regression model's R-squared indicates how well it fits the observed data. It is the proportion of variance in the dependent variable that can be explained by the independent variable (Gbadamosi, 2019).

In the first scenario, the R-squared value is 0.5033482483253012. This value is the coefficient of determination (R-squared) of the linear regression model on the given X_test and y_test. Based on the features in X_test, approximately 50.33% of the variance in y_test can be explained by the linear regression model.

In the second scenario, the R-squared value is 0.7342544181383424. The model with this X_test provides a better fit than the previous one, since its higher R-squared value means it explains more of the variability in the target variable. This linear regression model, with a different target, explains approximately 73.43% of the variance in y_test.

In the third scenario, the R-squared value is 0.8167653460930548. The model explains 81.68% of the variability in the target variable, suggesting a strong fit.

A linear regression model's intercept represents the predicted value of the dependent variable (y) when all independent variables (X) are zero. The intercept values obtained can be interpreted as follows:

The first intercept value, -6.10439771, is the y-intercept of a linear regression model: the model predicts a target value of approximately -6.10 when all independent variables (features) are zero. The model's baseline prediction is therefore negative even when the features contribute nothing to the target variable.

In the same way, the intercept value -0.75067964 is the y-intercept of another linear regression model. When all features are zero, the model predicts a target value of approximately -0.75. This indicates that the model has a baseline prediction even without any feature contribution.

For the remaining model, when all features are zero, the predicted target value is approximately 2.69. This positive baseline prediction indicates that the model expects a non-zero value regardless of feature inputs.
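To make the roles of the intercept and R-squared concrete, the sketch below fits a one-feature linear model with NumPy and evaluates it on held-out data. It does not reproduce the study's models or numbers: the data is synthetic, the generating coefficients (slope 0.35, intercept -6.0) are arbitrary assumptions, and a plain 80/20 split with np.polyfit stands in for whatever library and split the author used.

```python
import numpy as np

# Synthetic stand-in for one scenario: x = daily max temperature, y = target.
rng = np.random.default_rng(42)
x = rng.uniform(24.0, 36.0, size=200)
y = 0.35 * x - 6.0 + rng.normal(0.0, 0.5, size=200)

# Hold out the last 20% of rows as a test set (a simple split for illustration).
split = int(0.8 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# R-squared on the test set: 1 - SS_res / SS_tot.
y_pred = slope * x_test + intercept
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# The intercept is the prediction at x = 0, far outside the sampled 24-36 range.
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

Because x = 0 lies well outside the range of the synthetic inputs, the negative intercept here is an extrapolation rather than a value the model is expected to predict in practice.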
CONCLUSION

Weather forecasting is one of the most scientifically and technologically challenging problems in the world. To predict weather conditions, two things need to be done correctly: collecting data from the meteorological department and selecting the right data mining methods. The accuracy of the model and the timeliness of its output are the two most important aspects of weather prediction. Given the complex nature of the weather forecasting problem domain, it is feasible to use data mining techniques to obtain accurate results. Weather prediction is improved by applying more than one data mining technique in parallel. By combining several forecasting and data mining techniques, we attempt to forecast different weather conditions. The proposed model achieved an impressive prediction accuracy using only a limited number of the many available parameters.

REFERENCES

1. Bernd Skiera (2016), Data, Data and Even More Data: Harvesting Insights From the Data Jungle.
2. Tanvi Patil, Kamal Shah (2021), Weather Forecasting Analysis using Linear and Logistic Regression Algorithm.
3. Sudhnya Kashikar, Sumedha Patil, Ameya Vedantwar, Shivani Katpatal, Sofia Pillai (2019), Weather Prediction using Scikit-Learn.
4. Simone Lionetti, Daniel Pfaffli, Marc Pouly, Tim Vor Der Bruck, Philipp Wegelin (2021), Tourism Forecast with Weather, Event, and Cross-industry Data.
5. Hafiz A. Alaka, Lukumon O. Oyedele, Hakeem A. Owolabi, Muhammad Bilal, Saheed O. Ajayi, Olugbenga O. Akinade (2018), A framework for big data analytics approach to failure prediction of construction firms.
6. Aishwary Kumar Tiwari, Ujjawal Tomar (2023), Sports Results Prediction.
7. K. Mahesh Babu, J. Rene Beulah (2019), Air Quality Prediction based on Supervised Machine Learning Methods.
8. Kuldeep Chaurasia, Unnam Tarun, Guddanti V. S. Sarala, Komal Soni (2020), AI Based Prediction of Daily Rainfall from Satellite Observation for Disaster Management.
9. DM Randima Dinalankara, Dulani Yasara Mudunkotuwa (2022), Machine learning based Weather Prediction Model for Short Term Weather Prediction in Sri Lanka.
10. Jiawei Han, Micheline Kamber, Jian Pei (2012), Data Mining Concepts and Techniques.
11. J. Phys (2021), Research on Python Data Visualization Technology.
12. Gbadamosi Babatunde, Adeniyi Abidemi Emmanuel, Ogundokun Roseline Oluwaseun, Oladosu Bukola Bunmi, Anyaiwe Ehiedu Precious (2019), Impact of Climatic Change on Agricultural Product Yield Using KMeans and Multiple Linear Regressions.
