Air Quality Prediction Using Machine Learning Algorithms
-----------------------------------------------------------------------------------------------------------------------------
Abstract: Examining and protecting air quality has become one of the most essential activities for governments in many
industrial and urban areas today. Meteorological and traffic factors, the burning of fossil fuels, and industrial parameters play
significant roles in air pollution. With air pollution increasing, we need models that record
information about the concentrations of air pollutants (SO2, NO2, etc.). The deposition of these harmful gases in the air is affecting the quality of
people's lives, especially in urban areas. Lately, many researchers have begun to use Big Data Analytics approaches, as
environmental sensing networks and sensor data are available. In this paper, machine learning techniques are used to predict the
concentration of SO2 in the environment. Sulphur dioxide irritates the skin and the mucous membranes of the eyes, nose, throat, and
lungs. Time series models are employed to predict SO2 readings in the coming months or years.
Keywords: Machine Learning, Time Series, Prediction, Air Quality, SO2
www.ijcat.com 367
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 367-370, 2019, ISSN:-2319–8656
(ANN), Genetic Algorithm ANN Model, Random forest, decision tree, and Deep belief network are the algorithms which
were used, and various pros and cons of the models were presented.[5]

3. DATASET

3.1 Dataset/Source: Kaggle
Structured/Unstructured data: Structured data in CSV format.

Dataset Description:
The dataset consists of around 450,000 records from all the states of India. We worked only on the data for
Maharashtra, which left us with 60,383 records. The dataset consists of 13 attributes, listed below.
1) stn_code
2) sampling_date
3) state
4) location
5) agency
6) type
7) so2
8) no2
9) rspm
10) spm
11) location_monitoring_station
12) pm2_5
13) date

Splitting for Testing: The data was split 80% for training and 20% for testing.

Preprocessing and Feature Selection:
We studied and applied algorithms only on the data for Maharashtra state. Hence, the number of rows was reduced to 60,383,
and the state column is of no further use.
All the values in pm2_5 were null, so we dropped the column. The agency's name has nothing to do with how
polluted the state is. Similarly, stn_code is not useful.
The date attribute is a cleaner representation of the sampling_date attribute, so we eliminate the redundancy by removing the
latter. The location_monitoring_station attribute is also unnecessary, as it contains the location of the monitoring
station, which we do not need to consider for the analysis.
So, to summarize, we deleted the following features from our dataset:
state, pm2_5, agency, stn_code, sampling_date and location_monitoring_station
We simplified the type attribute to contain only one of three categories: industrial, residential, other. For SO2 and
NO2, we replaced NaN values with the mean. For date, we dropped the NaN values, as there were only 3 null values.
After pre-processing, our dataset contains 60,380 rows and 7 columns.

4. EXPLORATORY DATA ANALYSIS:
The graph below shows the concentration of SO2 over the years. It was highest in 1997 and 2001 and lowest in
1988 and 2003. However, it has been stable in recent years.
This graph shows that the amount of SO2 is highest in the industrial areas.
From this graph we can conclude that Nagpur has the highest amount of SO2 compared to other cities, whereas
Akole and Amravati are the least polluted, followed by Jalna and Kolhapur.
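The preprocessing steps described in Section 3 (keeping only Maharashtra, dropping six columns, simplifying type, mean imputation for so2 and no2, and dropping rows with a missing date) can be sketched with pandas roughly as follows. This is a minimal illustration, not the code used for the paper: the helper names simplify_type and preprocess, the keyword rule used to collapse the type labels, and the small toy frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Columns the text says were removed.
DROPPED = ["state", "pm2_5", "agency", "stn_code",
           "sampling_date", "location_monitoring_station"]


def simplify_type(value):
    """Collapse a raw area label into industrial / residential / other.

    The keyword rule is an assumption; the paper does not give the mapping.
    """
    text = str(value).lower()
    if "industrial" in text:
        return "industrial"
    if "residential" in text:
        return "residential"
    return "other"


def preprocess(raw, state="Maharashtra"):
    """Apply the preprocessing steps in the order the text lists them."""
    df = raw[raw["state"] == state].copy()
    df = df.drop(columns=DROPPED)
    df["type"] = df["type"].map(simplify_type)
    for col in ("so2", "no2"):
        # Replace missing readings with the column mean.
        df[col] = df[col].fillna(df[col].mean())
    # Rows without a date are dropped rather than imputed.
    df = df.dropna(subset=["date"])
    return df.reset_index(drop=True)


# Toy records (invented values, NOT the Kaggle data) with the 13 attributes.
raw = pd.DataFrame({
    "stn_code": [1, 2, 3, 4, 5],
    "sampling_date": ["a", "b", "c", "d", "e"],
    "state": ["Maharashtra"] * 4 + ["Delhi"],
    "location": ["Nagpur", "Pune", "Nagpur", "Pune", "Delhi"],
    "agency": ["x"] * 5,
    "type": ["Industrial Area", "Residential, Rural and Other Areas",
             "Sensitive Area", "Industrial Area", "Industrial Area"],
    "so2": [10.0, np.nan, 20.0, 30.0, 50.0],
    "no2": [20.0, 30.0, np.nan, 40.0, 60.0],
    "rspm": [100.0, 110.0, 120.0, 130.0, 140.0],
    "spm": [200.0, 210.0, 220.0, 230.0, 240.0],
    "location_monitoring_station": ["s"] * 5,
    "pm2_5": [np.nan] * 5,
    "date": ["2001-01-01", "2001-01-02", "2001-01-03", None, "2001-01-04"],
})

clean = preprocess(raw)
print(clean)
```

On the cleaned frame, an aggregation such as clean.groupby("type")["so2"].mean() would give the kind of per-category comparison discussed in the exploratory analysis.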
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)
Because the regression model uses data from the same input variable at previous time steps, it is referred to as an
autoregression (regression of self).[6]
This model is not able to produce the expected output because the data is not in sequence by the date column. The same
problem applies to the cities. If we predict for the entire state, it won't be helpful. So we will now calculate the AQI
and use classification models further.
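As a rough illustration of the autoregression idea, the sketch below fits a two-lag model by ordinary least squares with NumPy. It is not the paper's implementation. It uses the conventional indexing X(t) = b0 + b1*X(t-1) + b2*X(t-2), sorts the records by date before fitting (out-of-order dates being the problem noted above), and holds out the last 20% of the series chronologically. The function names fit_ar and predict_next, the noise-free synthetic series, and the chronological form of the split are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fit_ar(series, lags=2):
    """Least-squares fit of X(t) = b0 + b1*X(t-1) + ... + bp*X(t-p)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # One row per target value; columns are intercept, lag 1, ..., lag p.
    cols = [np.ones(n - lags)]
    for k in range(1, lags + 1):
        cols.append(x[lags - k : n - k])
    design = np.column_stack(cols)
    target = x[lags:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_next(history, coef):
    """One-step-ahead forecast from the most recent observations."""
    lags = len(coef) - 1
    # Reverse so the order is X(t-1), X(t-2), ... to match coef[1:].
    recent = np.asarray(history, dtype=float)[-lags:][::-1]
    return float(coef[0] + coef[1:] @ recent)


# Synthetic, noise-free series (NOT the paper's data) generated from a known rule.
values = [1.0, 2.0]
for _ in range(48):
    values.append(2.0 + 0.5 * values[-1] + 0.3 * values[-2])
dates = pd.date_range("2001-01-01", periods=len(values), freq="D")

# Shuffle the rows to mimic records stored out of date order, then restore the order.
frame = pd.DataFrame({"date": dates, "so2": values}).sample(frac=1.0, random_state=0)
ordered = frame.sort_values("date")["so2"].to_numpy()

# Assumed chronological split: first 80% for fitting, last 20% held out.
split = int(len(ordered) * 0.8)
coef = fit_ar(ordered[:split], lags=2)
forecast = predict_next(ordered[:split], coef)
print("coefficients:", coef)
print("forecast:", forecast, "actual:", ordered[split])
```

Because the lagged values are taken by position, fitting on the shuffled rows would pair each reading with unrelated "previous" readings, so sorting by date has to come before building the lag columns.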