
BA Assignment_pdf_v4

The paper analyzes weather forecasting in Sri Lanka from 2010 to 2023 using linear regression, emphasizing the impact of weather on various sectors such as agriculture, sports, and tourism. It discusses the importance of data cleaning and visualization in ensuring accurate predictions and decision-making. The findings indicate that the linear regression models can explain a significant portion of variance in weather-related outcomes, highlighting the challenges and importance of accurate weather forecasting.

Uploaded by

rcj5831
Copyright
© © All Rights Reserved

Weather Forecasting Analysis using the Linear Regression Algorithm

Rasika Jayawardena
CBO12267 - MSc in IT
Asia Pacific Institute of Information Technology
Colombo, Sri Lanka
[email protected]

ABSTRACT

The paper discusses the increasing diversity of information sources and the exponential growth of data volume, particularly in open data initiatives and platforms. It highlights the importance of information visualization for open data (OD) in various fields and sectors, particularly for CSV files, and how it can help users quickly understand the structure and issues of these files.

INTRODUCTION

The exponential growth of heterogeneous data from various sources, including the Internet of Things and user-generated content, presents both challenges and opportunities for businesses and academics (Bernd, 2016). This data, encompassing both structured and unstructured data, has transformed operations through improved customer service, payments, business models, and online engagement (Tanvi, 2021).

BACKGROUND

Weather refers to the daily fluctuations of the atmosphere, recorded through observations from sources such as sea, ground, and radar stations. This data is used to forecast weather through various applications and models. Weather forecasts are crucial for daily decision-making, impacting agriculture, irrigation, marine trade, and saving lives from accidents. They also affect industries, transportation, disaster management, and energy management.

➢ Agriculture/Food

Weather forecasting and big data analytics can improve agricultural production by predicting soil erosion, overwatering, and drought, enabling farmers to estimate food prices and supermarket chains to control stock more efficiently (Sudhnya, 2019).

➢ Tourism

Climate conditions are crucial in tourism, as they influence people's choice of destinations for various purposes. Weather forecasts are essential for ensuring safety and convenience, and can also be used to estimate the industry's benefits under climate change (Simone, 2021).

➢ Construction

Weather forecasting helps protect construction workers, activities, and resources from climate-related hazards, enabling earlier identification and planning, thereby ensuring worker safety and saving money (Hafiz, 2018).

➢ Sports

Weather conditions like rainfall, lightning, and wind significantly impact sports such as sailing and golf, and forecasting helps determine when to cover courts during rain (Aishwary, 2023).

➢ Transportation

Weather forecasts help shipping lines predict storm dangers and determine sailing routes, and help airlines avoid flight delays and cancellations worldwide (Mahesh, 2019).

➢ Disaster Management

Natural disasters worldwide can be predicted using big data analytics, reducing fatalities and economic damage. The accuracy and lead time vary by disaster type (Kuldeep, 2020).

An overview of the dataset

S/N | Attribute | Description | Data Type
1 | Time | Observation timestamps. | Date
2 | Weathercode | A numerical code that represents the current weather conditions. | Numeric
3 | Temperature_2m_max | The maximum temperature at 2 meters. | Numeric
4 | Temperature_2m_min | The minimum temperature at 2 meters. | Numeric
5 | Temperature_2m_mean | The mean temperature at 2 meters. | Numeric
6 | Apparent_temperature_max | The maximum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
7 | Apparent_temperature_min | The minimum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
8 | Apparent_temperature_mean | The mean apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
9 | Sunrise | Each day's sunrise time. | Datetime
10 | Sunset | Each day's sunset time. | Datetime
11 | Shortwave radiation sum | Observed shortwave radiation. | Numeric
12 | Precipitation sum | Precipitation sum. | Numeric
13 | Rain sum | Rain sum. | Numeric
14 | Snowfall sum | Snowfall sum. | Numeric
15 | Precipitation hours | Precipitation hours. | Numeric
16 | Windspeed_10m_max | Maximum wind speed at 10 meters above ground. | Numeric
17 | Windgusts_10m_max | Maximum wind gusts at 10 meters above ground. | Numeric
18 | Winddirection_10m_dominant | Dominant wind direction at 10 meters above ground. | Numeric
19 | et0_fao_evapotranspiration | Reference evapotranspiration (ET0) based on the Penman-Monteith equation provided by FAO. | Numeric
20 | latitude | City latitude. | Numeric
21 | longitude | City longitude. | Numeric
22 | elevation | City elevation. | Numeric
23 | Country | Country name associated with each weather observation. | String
24 | City | Name of the city of each observation. | String

RESEARCH METHODOLOGY

The data analysis cycle is a process of transforming raw data into meaningful insights that can inform decision-making. The cycle typically consists of the following steps (Randima, 2022).

• Defining the question
• Collecting the data
• Cleaning the data
• Analyzing the data
• Sharing the results

DEFINING THE QUESTION

This study examines Sri Lanka's weather data from 2010 to 2023, focusing on the impact of weather on agriculture. Variables like temperature, precipitation, and sunlight hours are crucial. Time-related variables are used to analyze seasonal changes and precipitation patterns. Extreme weather events are recognized using parameters like 'precipitation_sum' and 'temperature_2m_max'. The perspectives of stakeholders such as agricultural experts, environmental scientists, and local communities are considered to understand the problem and its causes, which helps frame the problem more accurately (Randima, 2022).

COLLECTING THE DATA

The Sri Lanka Weather Dataset, available on Kaggle, contains comprehensive weather data for 30 major cities in Sri Lanka from 2010 to 2023. Exploratory data analysis (EDA) is performed on the data to ensure accuracy and consistency, checking column names, data types, and overall consistency. The dataset is used for better understanding and decision-making (Randima, 2022).

CLEANING THE DATA

Data errors, duplicates, or mislabeled items from multiple sources can negatively impact outcomes and algorithms. Data cleaning involves removing garbage, incorrect, duplicate, corrupt, or incomplete data from a dataset. Cleaner data sets are essential for the analytical process and information science, making data analytical tools and business intelligence more effective and user-friendly (Randima, 2022).

➢ Error-Free Data

Data cleaning is a crucial process that removes errors and garbage values from data, enhancing analysis efficiency and saving time. Inaccurate data can lead to mistakes, and it is easier to fix incorrect or corrupt data when errors are monitored and properly reported.

➢ Data Quality

The analysis of weather data can be improved by addressing potential data quality issues such as missing values, incorrect date/time formats, temperature and precipitation outliers, inconsistent unit values, invalid wind direction values, and errors in geographical coordinates and elevation.

➢ Accurate and Efficient

The weather dataset, including time, temperature, precipitation, wind, and geographic coordinates, was validated and standardized to address issues like inconsistent temperature values, outliers, and data entry errors, enhancing its reliability for climate research and agriculture planning.
➢ Complete Data

The complete weather dataset includes essential fields like time, weather code, temperature, sunrise, sunset, precipitation sums, rain sums, snowfall sums, and more. It has been cleaned and validated to address missing values, inconsistent temperature values, outliers, and data entry errors. This refined dataset is an excellent resource for climate research, agriculture planning, and energy management.

➢ Maintain Data Consistency

Consistency in data can be measured by comparing records within the same dataset or across multiple datasets. The weather dataset, which includes fields like time, weather code, temperature, sunrise, sunset, and more, requires consistency in various fields. The dataset uses uniform units, date/time formats, and measurement methods to maintain high accuracy, improving its use in climate research, agriculture planning, and energy management.

DATA CLEANING CYCLE

Data cleaning is a data analysis method that involves analyzing, identifying, and correcting untidy raw data, filling in missing values, and identifying and correcting errors, with techniques varying depending on the dataset type (Jiawei, 2012).

To clean CSV data with Python, follow these steps:

1. Importing the necessary libraries: Access the Pandas library by typing "import pandas as pd". Depending on your cleaning needs, you might also need to import NumPy and Matplotlib.

2. Reading the CSV file: Pandas' read_csv() function reads CSV files with parameters like file path, delimiter, and column names, returning a DataFrame object for data manipulation and cleaning.

3. Checking for missing values: Use isnull() to create a Boolean mask for missing values and sum it to count each column's missing values, then remove them with dropna() or fill them with fillna().

4. Removing duplicate rows: The function drop_duplicates() can be used to remove duplicate rows from a DataFrame, with parameters like 'subset' and 'keep' allowing for checking for duplicates in specific columns or keeping only the first occurrence.

5. Handling outliers: The describe() function summarizes the numerical columns of a DataFrame, which helps spot outliers. Using the Z-score method or the Interquartile Range (IQR) method, outliers in specific fields like temperature, precipitation, and wind speed can be mitigated. This helps in climate research, agriculture planning, and energy management by filtering out extreme values in temperature and precipitation, ensuring more reliable analyses.

6. Data transformation: To optimize data analysis, it may be necessary to convert data types, normalize data, or create new columns based on existing data, such as date columns.

7. Saving the cleaned data: The Pandas library's to_csv() function allows the cleaned data to be saved as a new CSV file, taking parameters like file path, index, and header.

8. Data validation: After cleaning, it is crucial to validate the data by comparing it with the original or external data sources to confirm that the necessary information has been preserved.

VISUALIZING THE DATA

In order to gain a better understanding of the data, plots are used to visualize it (J. Phys, 2021).

The Sri Lanka diagram below shows the cities that are available in the data set.

[Figure: map of Sri Lanka showing the cities in the data set]
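The eight cleaning steps listed for CSV data (import, read, missing values, duplicates, outliers, transformation, saving, validation) can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the in-memory sample values are made up, the choice of the IQR rule with a 1.5 multiplier is an assumption, and the helper name clean_weather is hypothetical.

```python
import io

import pandas as pd  # Step 1: import the library

# Made-up sample standing in for the Kaggle CSV; column names follow the dataset overview.
RAW_CSV = """time,city,temperature_2m_max,precipitation_sum
2010-01-01,Colombo,30.0,0.0
2010-01-01,Colombo,30.0,0.0
2010-01-02,Colombo,,1.2
2010-01-03,Colombo,31.0,0.4
2010-01-04,Colombo,29.5,0.0
2010-01-05,Colombo,30.5,250.0
2010-01-06,Colombo,30.2,0.1
"""

NUMERIC_COLS = ["temperature_2m_max", "precipitation_sum"]


def clean_weather(df, numeric_cols):
    """Hypothetical helper covering steps 3 to 6 of the cleaning list."""
    # Step 3: drop rows with missing values in the numeric columns.
    df = df.dropna(subset=numeric_cols)
    # Step 4: remove duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates(keep="first")
    # Step 5: keep only values inside the IQR fences (assumed 1.5 * IQR rule).
    for col in numeric_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    # Step 6: convert the timestamp column and derive a month column.
    df = df.assign(time=pd.to_datetime(df["time"]))
    df["month"] = df["time"].dt.month
    return df.reset_index(drop=True)


# Step 2: read the CSV (a file path would normally be passed here).
raw = pd.read_csv(io.StringIO(RAW_CSV))
# Step 3 (reporting part): count missing values per column.
missing_per_column = raw.isnull().sum()
cleaned = clean_weather(raw, NUMERIC_COLS)
# Step 7: to_csv() without a path returns the CSV text; pass a path to write a file.
csv_text = cleaned.to_csv(index=False)
# Step 8: a simple validation that no rows were invented and no gaps remain.
assert len(cleaned) <= len(raw)
assert cleaned[NUMERIC_COLS].isnull().sum().sum() == 0
```

In this sample the duplicate row, the row with the missing temperature, and the extreme precipitation value are removed. Whether dropping or filling is appropriate for the real dataset depends on how much data is missing.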
[Figure: Analysis of temperature, precipitation and wind speed in Sri Lanka from 2010 to 2023]

[Figure: Analysis of temperature, precipitation and wind speed in Sri Lanka annually]

[Figure: Analysis of temperature, precipitation and wind speed in Sri Lanka monthly]

[Figure: Analysis of temperature and precipitation by season]
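The annual, monthly, and seasonal views named in the figure titles can be produced by grouping on parts of the timestamp. The sketch below is one possible way to do it, not the author's code: the sample values are made up, and the month-to-season mapping (the four monsoon periods commonly used for Sri Lanka) is an assumption, since the paper does not state which season definition it used.

```python
import pandas as pd

# Assumed mapping from month number to season; the paper does not specify one.
SEASON_BY_MONTH = {
    12: "Northeast monsoon", 1: "Northeast monsoon", 2: "Northeast monsoon",
    3: "First inter-monsoon", 4: "First inter-monsoon",
    5: "Southwest monsoon", 6: "Southwest monsoon", 7: "Southwest monsoon",
    8: "Southwest monsoon", 9: "Southwest monsoon",
    10: "Second inter-monsoon", 11: "Second inter-monsoon",
}

COLS = ["temperature_2m_mean", "precipitation_sum", "windspeed_10m_max"]

# Made-up observations for illustration only.
df = pd.DataFrame({
    "time": ["2010-01-15", "2010-07-15", "2011-01-15", "2011-07-15"],
    "temperature_2m_mean": [26.0, 28.0, 27.0, 29.0],
    "precipitation_sum": [2.0, 6.0, 4.0, 8.0],
    "windspeed_10m_max": [10.0, 20.0, 12.0, 22.0],
})
df["time"] = pd.to_datetime(df["time"])

# Mean of each variable per calendar year, per month, and per season.
yearly = df.groupby(df["time"].dt.year)[COLS].mean()
monthly = df.groupby(df["time"].dt.month)[COLS].mean()
seasonal = df.groupby(df["time"].dt.month.map(SEASON_BY_MONTH))[COLS].mean()
```

Each resulting table has one row per year, month, or season and can be passed to a plotting call such as DataFrame.plot() to draw charts like those titled above.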


[Figure: Analysis of temperature, precipitation and wind speed for cities in Sri Lanka]

There is a good correlation between maximum temperature and apparent maximum temperature.

Using minimum and maximum temperatures, precipitation, and wind speed to analyze weather conditions can provide numerous benefits to agriculture, sports, tourism, and travel.

Weather data is crucial to crop management, pest and disease control, and soil moisture management in agriculture. In addition to monitoring wind speed, farmers use temperature and precipitation patterns to optimize planting, irrigation, and harvesting schedules. Farmers can also adjust pest management strategies based on weather information in order to predict and manage pest and disease outbreaks. Precipitation data lets farmers plan irrigation schedules and implement better water conservation strategies.

During outdoor events, sports organizers can implement safety precautions based on weather data. A ball's bounce and movement are affected by pitch hardness and moisture content. Rain and precipitation can dampen pitches, favoring spin bowling. A bowler's swing and movement are affected by wind speed, which alters the trajectory of the ball. Weather conditions can affect player performance, with high temperatures causing fatigue and dehydration, and precipitation making conditions slippery. Wind speed influences ball behavior for bowlers and batsmen alike. Teams adapt their strategies to the weather, such as using swing bowlers when the conditions are swing-friendly. Weather interruptions can change match dynamics, affecting schedules and outcomes.

Tourism destinations and outdoor activities are heavily influenced by weather conditions. Travelers prioritize destinations with favorable weather conditions for specific activities, and tourism industry professionals use weather data to promote such destinations. The weather forecast helps travelers plan outdoor activities such as hiking, sightseeing, or beach trips, ensuring they are prepared appropriately and have a pleasant experience. Knowledge of precipitation and wind speed is particularly valuable to travelers for optimizing their travel experiences.

The graph below shows the relationship between temperature_2m_max and precipitation sum. One might expect a very high correlation between the two, but as the graph shows, there is no such correlation.

[Figure: temperature_2m_max plotted against precipitation sum]

Prediction using Regression

Based on regression modeling, I made predictions for three scenarios.

1. X = df['temperature_max'].values.reshape(-1,1)
   y = df['evapotranspiration'].values.reshape(-1,1)

2. X = df['temperature_max'].values.reshape(-1,1)
   y = df['apparent_temperature'].values.reshape(-1,1)

3. X = df['temperature_2m_max'].values.reshape(-1,1)
   y = df['target'].values.reshape(-1,1)

[Figures: plot diagrams of the three scenarios]
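Each scenario pairs one feature column X with one target column y, both shaped as column vectors. The following is a minimal sketch of how such a pair could be split and fitted with ordinary least squares. It uses NumPy only rather than whichever library the author used, the data are synthetic, and the helper names split_train_test, fit_line, and predict are hypothetical.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = max(1, int(round(len(X) * test_fraction)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def fit_line(X, y):
    """Least-squares fit of y = intercept + slope * x for column vectors X and y."""
    design = np.hstack([np.ones_like(X, dtype=float), X.astype(float)])
    coef, *_ = np.linalg.lstsq(design, y.astype(float), rcond=None)
    intercept, slope = coef.ravel()
    return float(intercept), float(slope)


def predict(X, intercept, slope):
    """Predicted target for each row of X."""
    return intercept + slope * X


# Synthetic stand-in for one scenario: the target is an exact linear function
# of the feature, so the fit should recover the coefficients used here.
temperature_max = np.linspace(25.0, 35.0, 50)
X = temperature_max.reshape(-1, 1)
y = (2.0 + 0.5 * temperature_max).reshape(-1, 1)

X_train, X_test, y_train, y_test = split_train_test(X, y)
intercept, slope = fit_line(X_train, y_train)
y_pred = predict(X_test, intercept, slope)

# The intercept is simply the prediction at x = 0.
at_zero = predict(np.array([[0.0]]), intercept, slope)
```

On real weather data the points scatter around the line, so the held-out predictions will not match exactly as they do for this noise-free example.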

A regression model's R-squared indicates how well it fits the observed data: it is the proportion of variance in the dependent variable that can be explained by the independent variable (Gbadamosi, 2019).
In the first scenario, the R-squared value is 0.5033482483253012. This is the coefficient of determination of the linear regression model evaluated on X_test and y_test. Based on the feature in X_test, approximately 50.33% of the variance in y_test can be explained by the linear regression model.
In the second scenario, the R-squared value is 0.7342544181383424. This model provides a better fit than the previous one, since it explains approximately 73.43% of the variance in y_test.
In the third scenario, the R-squared value is 0.8167653460930548. The model therefore explains approximately 81.68% of the variability in the target variable, suggesting a strong fit.
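The R-squared values quoted above follow the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch of that calculation, with made-up numbers and a hypothetical function name r_squared:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Made-up example: SS_res = 0.5 and SS_tot = 5.0, so R-squared is about 0.9,
# i.e. roughly 90% of the variance in y_true is explained.
r2_example = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])

# Predicting the mean everywhere explains none of the variance.
r2_mean_only = r_squared([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
```

Read this way, a value of 0.50 means about half of the variation in the test targets is captured by the fitted line, and 0.82 means about four fifths.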
A linear regression model's intercept represents the predicted value of the dependent variable (y) when all independent variables (X) are zero. The intercept values obtained can be interpreted as follows:
The intercept value -6.10439771 is the first linear regression model's y-intercept. The model predicts a target value of approximately -6.10 when all independent variables (features) are zero. In other words, with every feature (e.g., temperature) set to zero, the model still predicts a negative value.
In the same way, the intercept value -0.75067964 is the y-intercept for another linear regression model. When all features are zero, the model predicts a target value of approximately -0.75. This indicates that the model has a baseline prediction even when the features contribute nothing.
For the remaining model, when all features are zero, the predicted target value is approximately 2.69. This positive baseline prediction indicates that the model expects a non-zero value even when the feature inputs are zero.

CONCLUSION

Weather forecasting is one of the most scientifically and technologically challenging problems in the world. In order to predict weather conditions, two things need to be done correctly: collecting data from the meteorological department and selecting the right data mining methods. Accuracy of the model and its timely output are the two most important aspects of weather prediction. Despite the complex nature of the weather forecasting problem domain, it is feasible to use data mining techniques to obtain reasonably accurate results. Weather prediction is improved by applying more than one data mining technique in parallel. By combining several forecasting and data mining techniques, we attempt to forecast different weather conditions. The proposed models achieved a reasonably good fit (R-squared of up to about 0.82) using only a few of the many available parameters.
REFERENCES

1. Bernd Skiera (2016), Data, Data and Even More Data: Harvesting Insights From the Data Jungle.
2. Tanvi Patil, Kamal Shah (2021), Weather Forecasting Analysis using Linear and Logistic Regression Algorithm.
3. Sudhnya Kashikar, Sumedha Patil, Ameya Vedantwar, Shivani Katpatal, Sofia Pillai (2019), Weather Prediction using Scikit-Learn.
4. Simone Lionetti, Daniel Pfaffli, Marc Pouly, Tim Vor Der Bruck, Philipp Wegelin (2021), Tourism Forecast with Weather, Event, and Cross-industry Data.
5. Hafiz A. Alaka, Lukumon O. Oyedele, Hakeem A. Owolabi, Muhammad Bilal, Saheed O. Ajayi, Olugbenga O. Akinade (2018), A framework for big data analytics approach to failure prediction of construction firms.
6. Aishwary Kumar Tiwari, Ujjawal Tomar (2023), Sports Results Prediction.
7. K. Mahesh Babu, J. Rene Beulah (2019), Air Quality Prediction based on Supervised Machine Learning Methods.
8. Kuldeep Chaurasia, Unnam Tarun, Guddanti V. S. Sarala, Komal Soni (2020), AI Based Prediction of Daily Rainfall from Satellite Observation for Disaster Management.
9. DM Randima Dinalankara, Dulani Yasara Mudunkotuwa (2022), Machine learning based Weather Prediction Model for Short Term Weather Prediction in Sri Lanka.
10. Jiawei Han, Micheline Kamber, Jian Pei (2012), Data Mining Concepts and Techniques.
11. J. Phys (2021), Research on Python Data Visualization Technology.
12. Gbadamosi Babatunde, Adeniyi Abidemi Emmanuel, Ogundokun Roseline Oluwaseun, Oladosu Bukola Bunmi, Anyaiwe Ehiedu Precious (2019), Impact of Climatic Change on Agricultural Product Yield Using KMeans and Multiple Linear Regressions.
