BA Assignment_pdf_v4
Rasika Jayawardena
CBO12267 - MSc in IT
Asia Pacific Institute of Information Technology
Colombo, Sri Lanka
[email protected]
INTRODUCTION
The exponential growth of heterogeneous data from various sources, including the Internet of Things and user-generated content, presents both challenges and opportunities for businesses and academics (Brend, 2016). This data, both structured and unstructured, has transformed operations through improved customer service, payments, business models, and online engagement (Tanvi, 2021).

BACKGROUND
Weather refers to the daily fluctuations of the atmosphere. Weather data is collected through sea, ground, and radar observations and is used to forecast weather through various applications and models. Weather forecasts are crucial for daily decision-making: they impact agriculture, irrigation, and marine trade, and help save lives from accidents. They also affect industries, transportation, disaster management, and energy management.

➢ Agriculture/Food
Weather forecasting and big data analytics can improve the industry's benefits based on climate change (Simone, 2021).

➢ Transportation
Weather forecasts help shipping lines predict storm dangers and determine sailing routes, and help avoid flight delays and cancellations worldwide (Mahesh, 2019).

➢ Disaster Management
Natural disasters worldwide can be predicted using big data analytics, reducing fatalities and economic damage. The accuracy and lead time vary by disaster type (Kuldeep, 2020).

An overview of the dataset

S/N  Attribute                  Description                                                        Data Type
1    Time                       Observation timestamps.                                            Date
2    Weathercode                A numerical code that represents the current weather conditions.  Numeric
3    Temperature_2m_max         The maximum temperature at 2 meters.                               Numeric
4    Temperature_2m_min         The minimum temperature at 2 meters.                               Numeric
5    Temperature_2m_mean        The mean temperature at 2 meters.                                  Numeric
6    Apparent_temperature_max   The maximum apparent value of temperatures is                      Numeric
11   Shortwave radiation sum    Observed shortwave radiation                                       Numeric
12   Precipitation sum          Precipitation duration                                             Numeric
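As a minimal sketch of how the attributes and data types listed in the overview could be checked after loading, the example below reads a small in-memory CSV with pandas. The lowercase column names, the sample values, and the helper name load_weather are assumptions made for illustration; they are not taken from the actual dataset file.

```python
import io

import pandas as pd

# Hypothetical three-row sample. Column names follow the attribute table
# (lowercased) and the values are made up for illustration only.
SAMPLE_CSV = """time,weathercode,temperature_2m_max,temperature_2m_min,temperature_2m_mean
2010-01-01,2,30.1,22.4,26.0
2010-01-02,51,29.5,22.9,25.8
2010-01-03,3,29.9,23.1,26.2
"""


def load_weather(source):
    """Read the CSV and parse the 'time' column as dates (hypothetical helper)."""
    return pd.read_csv(source, parse_dates=["time"])


df = load_weather(io.StringIO(SAMPLE_CSV))

# 'time' should be a datetime column; every other attribute should be numeric.
print(df.dtypes)
```

With a real file, the same helper could be given a file path instead of the in-memory buffer.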
This study examines Sri Lanka's weather data from 2010 to 2023, focusing on the impact of weather on agriculture. Variables like temperature, precipitation, and sunlight hours are crucial. Time-related variables are used to analyze seasonal changes and precipitation patterns. Extreme weather events are recognized using parameters like 'precipitation_sum' and 'temperature_2m_max'. Stakeholders, such as agricultural experts, environmental scientists, and local communities, are considered in order to understand the problem and its causes. This helps frame the problem more accurately (Randima, 2022).

➢ Error-Free Data
➢ Data Quality
The analysis of weather data can be improved by addressing potential data quality issues such as missing values, incorrect date/time formats, temperature and precipitation outliers, inconsistent unit values, invalid wind direction values, and errors in geographical coordinates and elevation.

➢ Accurate and Efficient
The weather dataset, including time, temperature, precipitation, wind, and geographic coordinates, was validated, cleaned, and standardized to address issues like inconsistent temperature values, outliers, and data entry errors, enhancing its reliability for climate research and agriculture planning.
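The quality issues named above (missing values, incorrect date/time formats, invalid wind direction values, and errors in coordinates) could be counted with a small report like the one sketched below. This is only an illustration: the column names winddirection_10m_dominant and latitude, the sample rows, and the helper quality_report are assumptions, not the checks actually used in this study.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "time": ["2010-01-01", "2010-01-02", "not a date", "2010-01-04"],
    "temperature_2m_max": [30.1, None, 29.8, 31.0],
    "winddirection_10m_dominant": [45.0, 370.0, 180.0, 90.0],
    "latitude": [6.93, 6.93, 95.0, 6.93],
})


def quality_report(frame):
    """Count common data quality problems per category (hypothetical helper)."""
    # Unparseable timestamps become NaT when errors="coerce" is used.
    parsed_time = pd.to_datetime(frame["time"], errors="coerce")
    wind = frame["winddirection_10m_dominant"]
    lat = frame["latitude"]
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "bad_timestamps": int(parsed_time.isna().sum()),
        # Wind direction is assumed to be in degrees, so valid values are 0 to 360.
        "invalid_wind_direction": int((~wind.between(0, 360)).sum()),
        # Latitude must lie between -90 and 90 degrees.
        "invalid_latitude": int((~lat.between(-90, 90)).sum()),
    }


report = quality_report(df)
print(report)
```

Counting problems before fixing them gives a baseline against which the cleaned dataset can later be compared.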
➢ Complete Data
The complete weather dataset includes essential fields like time, weather code, temperature, sunrise, sunset, precipitation sums, rain sums, snowfall sums, and more. It has been cleaned and validated to address missing values, inconsistent temperature values, outliers, and data entry errors. This refined dataset is an excellent resource for climate research, agriculture planning, and energy management.

➢ Maintain Data Consistency
Consistency in data can be measured by comparing values within the same dataset or across multiple datasets. The weather dataset, which includes fields like time, weather code, temperature, sunrise, sunset, and more, requires consistency in various fields. The dataset uses uniform units, date/time formats, and measurement methods to maintain high accuracy, improving its use in climate research, agriculture planning, and energy management.

DATA CLEANING CYCLE
Data cleaning is a data analysis method that involves inspecting untidy raw data, identifying errors, correcting them, and filling in missing values. The techniques vary depending on the dataset type (Jiawei, 2012).

3. Checking for missing values: Use isnull() to create a Boolean mask for missing values and sum it to count each column's missing values. Then remove them with dropna() or fill them with fillna().
4. Removing duplicate rows: The function drop_duplicates() removes duplicate rows from a DataFrame. The 'subset' parameter restricts the duplicate check to specific columns, and the 'keep' parameter controls which occurrence is retained, for example only the first.
5. Handling outliers: The describe() function summarizes the numerical columns of a DataFrame, which helps spot suspicious extreme values. Using the Z-score method or the Interquartile Range (IQR) method, outliers in specific fields like temperature, precipitation, and wind speed can be mitigated. Filtering out extreme values in temperature and precipitation supports climate research, agriculture planning, and energy management by ensuring more reliable analyses.
6. Data transformation: To optimize data analysis, it may be necessary to convert data types, normalize data, or create new columns based on existing data, such as date columns.
7. Saving the cleaned data: The Pandas library's to_csv() function allows the cleaned data to be saved as a new CSV file, taking parameters like file path, index, and header.
8. Data validation: After cleaning, it is crucial to validate the data by comparing it with the original or external data sources to confirm that it still provides the necessary information.
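Steps 3 to 8 can be sketched end to end with pandas as follows. This is a minimal illustration under stated assumptions, not the exact pipeline used in this study: the sample rows are made up, rows with missing values are dropped rather than filled, the IQR method with the common 1.5 multiplier is chosen for outliers, and the result is written to an in-memory buffer instead of a real file path.

```python
import io

import pandas as pd

# Hypothetical raw data: one duplicate row, one missing value, one extreme value.
raw = pd.DataFrame({
    "time": ["2010-01-01", "2010-01-02", "2010-01-02", "2010-01-03",
             "2010-01-04", "2010-01-05", "2010-01-06"],
    "temperature_2m_max": [30.0, 31.0, 31.0, None, 29.5, 30.5, 95.0],
    "precipitation_sum": [0.0, 2.5, 2.5, 1.0, 0.0, 4.0, 0.5],
})

# Step 3: count missing values per column, then drop incomplete rows.
missing_per_column = raw.isnull().sum()
df = raw.dropna()

# Step 4: remove duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first")

# Step 5: keep only temperatures within 1.5 * IQR of the quartiles.
q1 = df["temperature_2m_max"].quantile(0.25)
q3 = df["temperature_2m_max"].quantile(0.75)
iqr = q3 - q1
in_range = df["temperature_2m_max"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].copy()

# Step 6: convert the time column to datetime and derive a month column.
df["time"] = pd.to_datetime(df["time"])
df["month"] = df["time"].dt.month

# Step 7: save the cleaned data as CSV (a file path could be used instead).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Step 8: simple validation against the raw data: every cleaned value
# must also appear in the original column.
values_match_raw = df["temperature_2m_max"].isin(raw["temperature_2m_max"]).all()

print(missing_per_column)
print(df)
```

Dropping rows is the simplest choice for missing values. Where losing observations is not acceptable, fillna() with a suitable value could be used in step 3 instead.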
The Sri Lanka diagram below shows the cities that are
available in the data set.
During outdoor events, sports organizers can implement safety precautions based on weather data. A ball's bounce and movement are affected by pitch hardness and moisture content. Rain and precipitation can dampen pitches, favoring spin bowling. A bowler's swing and movement are affected by wind speed, which alters the trajectory of the ball. Weather conditions can affect player performance, with high temperatures causing fatigue and dehydration, and precipitation making conditions

Prediction using Regression.
Based on regression modeling, I made predictions for three scenarios.

1. X = df['temperature_max'].values.reshape(-1,1)
   y = df['evapotranspiration'].values.reshape(-1,1)
2. X = df['temperature_max'].values.reshape(-1,1)
   y = df['apparent_temperature'].values.reshape(-1,1)
3. X = df['temperature_2m_max'].values.reshape(-1,1)
   y = df['target'].values.reshape(-1,1)

PLOT DIAGRAMS OF THREE SCENARIOS
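As a minimal sketch of how one of these scenarios could be fitted, the example below runs a simple linear regression for the first scenario (maximum temperature against evapotranspiration). The regression library is not named here, so NumPy's polyfit is used as a stand-in. The sample values and the helper predict are assumptions for illustration only and are not results from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data for scenario 1; the values are made up and exactly linear.
df = pd.DataFrame({
    "temperature_max": [28.0, 29.0, 30.0, 31.0, 32.0],
    "evapotranspiration": [3.0, 3.5, 4.0, 4.5, 5.0],
})

X = df["temperature_max"].values.reshape(-1, 1)
y = df["evapotranspiration"].values.reshape(-1, 1)

# Ordinary least squares fit of a straight line (degree-1 polynomial).
# polyfit expects 1-D arrays, so the column vectors are flattened first.
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)


def predict(temperature):
    """Predict evapotranspiration from maximum temperature (hypothetical helper)."""
    return slope * temperature + intercept


# Fitted values along the observed temperatures; this is the line that a
# scenario plot would draw over the scatter of observations.
fitted = predict(X.ravel())
print(slope, intercept)
```

The other two scenarios would follow the same pattern with their own X and y columns.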