Weather Prediction Model Using Random Forest Algorithm and Apache Spark
Volume 3 Issue 6, October 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470
I. INTRODUCTION
Weather forecasting has always been one of the most technologically and scientifically challenging problems around the world. This is mainly due to two factors: first, forecasts are used in many human activities, and second, technological advances directly related to this research field, such as the evolution of computation and improvements in measurement systems, have created new opportunities. Hence, making an accurate prediction is one of the major challenges that meteorologists face around the world. Since ancient times, weather prediction has been one of the most interesting and fascinating domains of study. Scientists have tried to forecast meteorological features using a number of approaches, some of which are more accurate than others.

Weather forecasting involves predicting how the current state of the atmosphere will change. Current weather conditions are obtained from ground observations as well as observations from aircraft, ships, satellites, and radars. The information is sent to meteorological centers, which collect, analyze, and project the data into a variety of graphs and charts. Computers draw the lines on the graphs with the help of meteorologists, who check for and correct any errors. These computers not only produce graphs but also predict how the graphs may look in the near future. This estimation of weather by computers is known as numerical weather prediction [1]. To predict weather by numerical means, meteorologists developed atmospheric models, which approximate the atmosphere using mathematical equations that describe how atmospheric conditions and rain change over time. These equations are programmed into the computer, and data on the current atmospheric conditions are fed in. The computer solves the equations to determine how different atmospheric variables may change over the forecast period. The result is known as a prognostic chart, which is a forecast chart drawn by the computer.

II. PREDICTING WEATHER
Fig. 1 shows that the weather data are first collected from weather sensors and power stations. These data can be ingested through different data sources such as Kafka and Flume. In the proposed system, the data set is loaded through the Spark API, and the random forest algorithm is used to regress and classify the weather data.
@ IJTSRD | Unique Paper ID – IJTSRD29133 | Volume – 3 | Issue – 6 | September - October 2019 Page 549
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
Figure 1: Design of the system

A. RANDOM FORESTS MODEL
Random Forests (RF) is one of the most popular methods in data mining. The method is widely used for time series forecasting in fields such as biostatistics, climate monitoring, planning in the energy industry, and weather forecasting. Random forest is an ensemble learning algorithm that can handle both high-dimensional classification and regression. RF is a tree-based ensemble method in which all trees depend on a collection of random variables. That is, the forest is grown from many regression trees put together to form an ensemble [4]. After the individual trees in the ensemble are fitted using bootstrap samples, the final decision is obtained by aggregating over the ensemble, i.e. by averaging the outputs for regression or by voting for classification. This procedure, called bagging, improves the stability and accuracy of the model, reduces variance, and helps to avoid overfitting. The bias of the bagged trees is the same as that of the individual trees, but the variance is decreased by reducing the correlation between trees (this is discussed in [10]). Random forests correct for decision trees' habit of overfitting to their training set and produce a limiting value of the generalization error [6].

Here, K represents the number of trees in the forest and F represents the number of input variables randomly chosen at each split. The number of trees can be determined experimentally: successive trees can be added during training until the out-of-bag (OOB) error stabilizes. The RF procedure is not overly sensitive to the value of F. The inventors of the algorithm recommend F = n/3 for regression RFs, where n is the number of input variables. Another parameter is the minimum node size m. The smaller the minimum node size, the deeper the trees. Many publications recommend m = 5, which is also the default value in many programs that implement RFs. RFs show little sensitivity to this parameter.

Using RFs, we can determine the prediction strength or importance of variables, which is useful for ranking and selecting variables, interpreting data, and understanding underlying phenomena. The importance of a variable can be estimated in RF as the increase in prediction error when the values of that variable are randomly permuted across the OOB samples. The increase in error resulting from this permutation is averaged over all trees and divided by the standard deviation over the entire ensemble. The larger the increase in OOB error, the more important the variable.

The original training dataset is formalized as S = {(x_i, y_j), i = 1, 2, ..., N; j = 1, 2, ..., M}, where x is a sample and y is a feature variable of S. Namely, the original training dataset contains N samples, and there are M feature variables in each sample. The main process of the construction of the RF algorithm is presented in Fig. 2.
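The permutation-based variable importance described above can be illustrated with a small sketch. This is not the paper's implementation: it assumes one-split regression trees (stumps) in place of full trees, synthetic data in which only the first of three variables carries signal, and hypothetical helper names (`fit_stump`, `predict`, `mse`, `permutation_importance`). It uses K = 60 trees and F = 2 candidate variables per split purely for illustration.

```python
import random
import statistics


def fit_stump(X, y, features):
    """Fit a one-split regression tree using the best of the candidate features."""
    mean_y = statistics.fmean(y)
    best = (float("inf"), None, None, mean_y, mean_y)
    for f in features:
        values = sorted({row[f] for row in X})
        for thr in values[:-1]:  # splitting at "<= thr" keeps both sides non-empty
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = statistics.fmean(left), statistics.fmean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if sse < best[0]:
                best = (sse, f, thr, lm, rm)
    return best[1:]  # (feature, threshold, left_mean, right_mean)


def predict(stump, row):
    f, thr, lm, rm = stump
    if f is None:  # no valid split was found; fall back to the training mean
        return lm
    return lm if row[f] <= thr else rm


def mse(stump, X, y):
    return statistics.fmean([(predict(stump, r) - t) ** 2 for r, t in zip(X, y)])


def permutation_importance(X, y, n_trees=60, n_split_features=2, seed=0):
    """Average OOB error increase per variable, divided by its spread over the trees."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    increases = [[] for _ in range(m)]  # one list of per-tree increases per variable
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = sorted(set(range(n)) - set(in_bag))       # rows never drawn
        if not oob:
            continue
        candidates = rng.sample(range(m), n_split_features)  # F random variables
        stump = fit_stump([X[i] for i in in_bag], [y[i] for i in in_bag], candidates)
        X_oob, y_oob = [X[i] for i in oob], [y[i] for i in oob]
        base_error = mse(stump, X_oob, y_oob)
        for j in range(m):
            column = [row[j] for row in X_oob]
            rng.shuffle(column)  # permute variable j across the OOB samples
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_oob, column)]
            increases[j].append(mse(stump, X_perm, y_oob) - base_error)
    scores = []
    for inc in increases:
        mean, sd = statistics.fmean(inc), statistics.pstdev(inc)
        scores.append(mean / sd if sd > 0 else mean)
    return scores


# Synthetic data (an assumption): only variable 0 influences the target.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(120)]
y = [10 * row[0] + data_rng.gauss(0, 0.5) for row in X]
scores = permutation_importance(X, y)
print(scores)
```

On data built this way, the score of the informative variable should clearly exceed the scores of the two noise variables, which is the ranking behaviour the text describes.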
At the same time, the records that are not selected in each sampling round form an Out-Of-Bag (OOB) dataset. In this way, k OOB sets are constructed, which together form the collection S_OOB.
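A minimal sketch of this sampling step follows, assuming the usual bootstrap scheme in which each of the k training subsets is drawn with replacement and has the same size as the original dataset. The function name `bootstrap_with_oob` and the values N = 1000 and k = 5 are illustrative assumptions, not taken from the paper.

```python
import random


def bootstrap_with_oob(n_samples, k, seed=0):
    """Draw k bootstrap index samples and the matching out-of-bag index sets."""
    rng = random.Random(seed)
    in_bag_sets, oob_sets = [], []
    for _ in range(k):
        # Sample N record indices with replacement (one bootstrap training subset).
        in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Records never drawn in this round form the OOB set for this round.
        oob = sorted(set(range(n_samples)) - set(in_bag))
        in_bag_sets.append(in_bag)
        oob_sets.append(oob)
    return in_bag_sets, oob_sets


n_records = 1000
train_sets, s_oob = bootstrap_with_oob(n_samples=n_records, k=5)
oob_fractions = [len(oob) / n_records for oob in s_oob]
print(oob_fractions)
```

Under this scheme, each record is left out of a given round with probability (1 - 1/N)^N, which approaches 1/e (about 0.368) for large N, so roughly a third of the records are available in each OOB set for estimating the error of the corresponding tree.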
[4] K. Singh, S. C. Guntuku, A. Thakur, and C. Hota, “Big data analytics framework for peer-to-peer botnet detection using random forests,” Information Sciences, vol. 278, pp. 488–497, September 2014.
[4] What is Twitter and How does it work? http://www.lifewire.com/what-is-twitter.
[5] Apache, “Spark,” Website, June 2016, http://spark-project.org.
[6] G. Wu and P. H. Huang, “A vectorization-optimization method-based type-2 fuzzy neural network for noisy data classification,” Fuzzy Systems, IEEE Transactions on, vol. 21, no. 1, pp. 1–15, February 2013.
[7] H. Abdulsalam, D. B. Skillicorn, and P. Martin, “Classification using streaming random forests,” Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 1, pp. 22–36, January 2011.
[8] C. Lindner, P. A. Bromiley, M. C. Ionita, and T. F. Cootes, “Robust and accurate shape model matching using random forest regression-voting,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 3, pp. 1–14, December 2014.
[9] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, October 2001.
[9] S. Tyree, K. Q. Weinberger, and K. Agrawal, “Parallel boosted regression trees for web search ranking,” in International Conference on World Wide Web, March 2011, pp. 387–396.
[10] D. Warneke and O. Kao, “Exploiting dynamic resource allocation for efficient parallel data processing in the cloud,” Parallel and Distributed Systems, IEEE Transactions on, vol. 22, no. 6, pp. 985–997, June 2011.