
International Journal of Trend in Scientific Research and Development (IJTSRD)

Volume 3 Issue 6, October 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470

Weather Prediction Model using Random Forest Algorithm and Apache Spark

Thin Thin Swe1, Phyu Phyu1, Sandar Pa Pa Thein2
1Lecturer, Faculty of Information Science, 2Lecturer, Faculty of Computing,
1,2University of Computer Studies, Pathein, Myanmar

ABSTRACT

One of the greatest challenges that meteorological departments face is to predict weather accurately. These predictions are important because they influence daily life and also affect the economy of a state or even a nation. Weather predictions are also necessary since they form the first level of preparation against natural disasters, which may make the difference between life and death. They also help to reduce the loss of resources and to plan the mitigation steps that are expected to be taken after a natural disaster occurs. This research work focuses on analyzing algorithms on big data that are suitable for weather prediction and highlights the performance analysis of Random Forest algorithms in the Spark framework.

KEYWORDS: Weather forecasting, Apache Spark, Random Forest algorithms (RF), Big Data Analysis

How to cite this paper: Thin Thin Swe | Phyu Phyu | Sandar Pa Pa Thein, "Weather Prediction Model using Random Forest Algorithm and Apache Spark", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3 | Issue-6, October 2019, pp. 549-552, URL: https://www.ijtsrd.com/papers/ijtsrd29133.pdf

Copyright © 2019 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)

I. INTRODUCTION
Weather forecasting has always been one of the major technologically and scientifically challenging issues around the world. This is mainly due to two factors: firstly, its results are used for several human activities, and secondly, because of the opportunities created by numerous technological advances directly associated with this research field, such as the evolution of computation and improvements in measurement systems. Hence, making an exact prediction is one of the major challenges that meteorologists face around the world. From ancient times, weather prediction has been one of the most interesting and fascinating study domains. Scientists have been working to forecast meteorological features using a number of approaches, some of these approaches being better than others in terms of accuracy. Weather forecasting encompasses predicting in what way the current state of the atmosphere will change. Existing weather conditions are obtained from ground observations and from observations by aircraft, ships, satellites, and radars. The information is sent to meteorological centers, which collect, analyze, and project the data into a variety of graphs and charts. The computers plot lines on the graphs with the help of meteorologists, who look for and correct any errors, if present. These computers not only produce graphs but also predict how the graphs may look in the near future. This estimation of weather by computers is known as numerical weather prediction [1]. Hence, for predicting weather by numerical means, meteorologists developed atmospheric models, which approximate the atmosphere with mathematical equations that portray how the atmosphere and precipitation will change over time. These equations are programmed into the computer, and the data for the current atmospheric conditions are fed into it. Computers solve these equations to determine how the different atmospheric variables may change over the forecast period. The result is known as a prognostic chart, which is a forecast chart drawn by the computer.

II. PREDICTING WEATHER
Fig. 1 shows that initially the weather data are collected from weather sensors and power stations. These weather data can be collected through different data sources like Kafka, Flume, etc. In the proposed system, the data set is loaded into the Spark API, and a random forest algorithm is used to regress and classify the weather data.
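As a concrete illustration of the "regress and classify" step, the sketch below shows in plain Python (with made-up numbers; in the proposed system these outputs would come from the forest trees trained in Spark) how a random forest aggregates its trees' predictions: averaging for regression and majority voting for classification.

```python
import statistics

# Illustrative sketch only: each "tree" is stood in for by its output value.

def aggregate_regression(tree_outputs):
    """Regression: the forest's prediction is the mean of the tree outputs."""
    return statistics.mean(tree_outputs)

def aggregate_classification(tree_labels):
    """Classification: the forest's prediction is the majority-voted label."""
    return statistics.mode(tree_labels)

# Hypothetical outputs from five trees predicting tomorrow's temperature (C)
# and whether it will rain.
temp_predictions = [30.1, 29.5, 31.0, 30.4, 29.0]
rain_votes = ["rain", "no rain", "rain", "rain", "no rain"]

print(aggregate_regression(temp_predictions))   # mean of the five outputs
print(aggregate_classification(rain_votes))     # majority label
```

This is exactly the bagging aggregation described in Section A below: the ensemble answer is stabler than any single tree's.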

@ IJTSRD | Unique Paper ID – IJTSRD29133 | Volume – 3 | Issue – 6 | September - October 2019 Page 549
Figure 1: Design of the system

A. RANDOM FORESTS MODEL
Random Forests (RF) is one of the most popular methods in data mining. The method is widely used in different time series forecasting fields, such as biostatistics, climate monitoring, planning in the energy industry, and weather forecasting. Random forest is an ensemble learning algorithm that can handle both high-dimensional classification and regression. RF is a tree-based ensemble method where all trees depend on a collection of random variables. That is, the forest is grown from many regression trees put together, forming an ensemble [4]. After the individual trees in the ensemble are fitted using bootstrap samples, the final decision is obtained by aggregating over the ensemble, i.e., by averaging the outputs for regression or by voting for classification. This procedure, called bagging, improves the stability and accuracy of the model, reduces variance, and helps to avoid overfitting. The bias of the bagged trees is the same as that of the individual trees, but the variance is decreased by reducing the correlation between trees (this is discussed in [10]). Random forests correct for decision trees' habit of overfitting to their training set and produce a limiting value of the generalization error [6].

The RF generalization error is estimated by an out-of-bag (OOB) error, i.e., the error for training points which are not contained in the bootstrap training sets (about one-third of the points are left out of each bootstrap training set). An OOB error estimate is almost identical to that obtained by N-fold cross-validation. The large advantage of RFs is that they can be fitted in one sequence, with cross-validation being performed along the way. The training can be terminated when the OOB error stabilizes [7]. The algorithm of RF for regression is shown in Figure 2 [5].

Figure 2: Algorithm of RF for regression [8]

Here K represents the number of trees in the forest and F represents the number of input variables randomly chosen at each split. The number of trees can be determined experimentally: successive trees can be added during the training procedure until the OOB error stabilizes. The RF procedure is not overly sensitive to the value of F; the inventors of the algorithm recommend F = n/3 for regression RFs. Another parameter is the minimum node size m. The smaller the minimum node size, the deeper the trees. In many publications m = 5 is recommended, and this is the default value in many programs which implement RFs. RFs show small sensitivity to this parameter.

Using RFs we can determine the prediction strength, or importance, of the variables, which is useful for ranking and selecting variables, for interpreting data, and for understanding the underlying phenomena. The variable importance can be estimated in RF as the increase in prediction error when the values of that variable are randomly permuted across the OOB samples. The increase in error caused by this permuting is averaged over all trees and divided by the standard deviation over the entire ensemble. The larger the increase in OOB error, the more important the variable.

The original training dataset is formalized as S = {(xi, yj), i = 1, 2, …, N; j = 1, 2, …, M}, where x is a sample and y is a feature variable of S. Namely, the original training dataset contains N samples, and there are M feature variables in each sample. The main process of the construction of the RF algorithm is presented in Fig. 2.

Fig. 2: Process of the construction of the RF algorithm

The steps of the construction of the random forest algorithm are as follows.

Step 1: Sampling k training subsets.

In this step, k training subsets are sampled from the original training dataset S in a bootstrap sampling manner. Namely, N records are selected from S by random sampling with replacement in each sampling round. After this step, the k training subsets are collected as STrain:

STrain = {S1, S2, …, Sk}.
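Step 1 above can be sketched in plain Python (illustrative only; the record format and function name are made up) using sampling with replacement:

```python
import random

# Sketch of Step 1: draw k bootstrap training subsets from the original
# dataset S. Each subset holds N records sampled *with* replacement, so a
# record may appear several times in one subset.

def sample_training_subsets(S, k, seed=42):
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    N = len(S)
    # random.Random.choices samples with replacement: bootstrap sampling
    return [rng.choices(S, k=N) for _ in range(k)]

S = [("record", i) for i in range(8)]   # stand-in weather records
STrain = sample_training_subsets(S, k=3)

# Each of the k subsets has the same size N as the original dataset,
# and every drawn record comes from S.
```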

At the same time, the records that are not selected in each sampling round compose an Out-Of-Bag (OOB) dataset. In this way, k OOB sets are constructed as a collection SOOB:

SOOB = {OOB1, OOB2, …, OOBk},

where k << N, Si ∩ OOBi = ∅ and Si ∪ OOBi = S. To obtain the classification accuracy of each tree model, these OOB sets are used as testing sets after the training process.
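The OOB construction can be illustrated the same way (a plain-Python sketch over record indices; the function name is made up). Sampling N indices with replacement leaves out roughly one-third of the records, and those left-out records form the OOB set, so Si ∩ OOBi = ∅ and Si ∪ OOBi = S by construction:

```python
import random

# Sketch: build k bootstrap index sets and their out-of-bag complements.
# On average about 1/e (roughly 36.8%) of the N records are left out of
# each bootstrap draw, which matches the "about one-third" figure above.

def bootstrap_with_oob(n_records, k, seed=7):
    rng = random.Random(seed)
    all_idx = set(range(n_records))
    in_bag_sets, oob_sets = [], []
    for _ in range(k):
        # indices drawn with replacement; the set keeps distinct ones
        drawn = {rng.randrange(n_records) for _ in range(n_records)}
        in_bag_sets.append(drawn)
        oob_sets.append(all_idx - drawn)  # records never drawn -> OOB
    return in_bag_sets, oob_sets

in_bag, oob = bootstrap_with_oob(n_records=1000, k=5)
# For every i: in_bag[i] and oob[i] are disjoint and together cover S.
```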

Step 2: Constructing each decision tree model.

In an RF model, each meta decision tree is created by the CART algorithm from each training subset Si. In the growth process of each tree, m feature variables of dataset Si are randomly selected from the M variables. In each tree node's splitting process, the gain ratio of each feature variable is calculated, and the best one is chosen as the splitting node. This splitting process is repeated until a leaf node is generated. Finally, k decision trees are trained from the k training subsets in the same way.

Step 3: Collecting the k trees into an RF model.

The k trained trees are collected into an RF model, which is defined in Eq. (1):

H(X, Θj) = Σ(i=1 to k) hi(x, Θj), (j = 1, 2, …, m)   (1)

where hi(x, Θj) is a meta decision tree classifier, X denotes the input feature vectors of the training dataset, and Θj is an independent and identically distributed random vector that determines the growth process of the tree.

To explain why we select the random forest algorithm, some of its benefits are listed below:
 The random forest algorithm can be used for both classification and regression tasks.
 It provides higher accuracy.
 The random forest classifier can handle missing values and maintain accuracy over a large proportion of the data.
 With more trees, the model does not overfit.
 It has the power to handle a large data set with high dimensionality [3].

B. APACHE SPARK
Apache Spark is an all-purpose data processing and machine learning tool that can be used for a variety of operations. Data scientists and application developers can integrate Apache Spark into their applications to query, analyze, and transform data at scale. It is up to 100 times faster than Hadoop MapReduce. It can handle petabytes of data at once, distributed over a cluster of thousands of cooperating virtual or physical servers. Apache Spark has been developed in Scala, and it supports Python, R, Java and, of course, Scala. Apache Spark is a fast and general-purpose engine for large-scale data processing [9-10]. The architecture of Spark has Spark Core at its bottom, on top of which the Spark SQL, MLlib, Spark Streaming, and GraphX libraries are provided for data processing [2].

Fig. 3: Architecture of Spark

Apache Spark is very good for in-memory computing. Spark has its own cluster management, but it can also work with Hadoop. There are three core building blocks of Spark programming: Resilient Distributed Datasets (RDD), transformations, and actions. An RDD is an immutable data structure on which various transformations can be applied. After the transformations, any action on an RDD causes the complete lineage of transformations to be executed before the result is produced.

Fig. 4: Working with RDD in Spark

III. CONCLUSIONS
In this paper, a random forest algorithm has been proposed for big data. The accuracy of the RF algorithm is optimized through dimension reduction and the weighted vote approach. Then, data-parallel optimization combining data from different stations and task-parallel optimization are performed and implemented on Apache Spark. Taking advantage of the data-parallel optimization, the training dataset is reused and the volume of data is reduced significantly. Benefitting from the task-parallel optimization, the data transmission cost is effectively reduced and the performance of the algorithm is clearly improved. Experimental results indicate the superiority and notable strengths of RF over the other algorithms in terms of classification accuracy, performance, and scalability. For future work, we will focus on an incremental parallel random forest algorithm for data streams in cloud environments, and on improving the data allocation and task scheduling mechanism of the algorithm in a distributed and parallel environment.

References
[1] Guidelines on Climate Metadata and Homogenization, World Climate Data and Monitoring Programme, Geneva.

[2] https://spark.org

[3] https://www.newgenapps.com/blog/random-forest-analysis-in-ml-and-when-to-use-it

[4] K. Singh, S. C. Guntuku, A. Thakur, and C. Hota, "Big data analytics framework for peer-to-peer botnet detection using random forests," Information Sciences, vol. 278, pp. 488–497, September 2014.

[5] Apache, "Spark," website, June 2016, http://spark-project.org.

[6] G. Wu and P. H. Huang, "A vectorization-optimization-method-based type-2 fuzzy neural network for noisy data classification," IEEE Transactions on Fuzzy Systems, vol. 21, no. 1, pp. 1–15, February 2013.

[7] H. Abdulsalam, D. B. Skillicorn, and P. Martin, "Classification using streaming random forests," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, pp. 22–36, January 2011.

[8] C. Lindner, P. A. Bromiley, M. C. Ionita, and T. F. Cootes, "Robust and accurate shape model matching using random forest regression-voting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 3, pp. 1–14, December 2014.

[9] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, October 2001.

[9] S. Tyree, K. Q. Weinberger, and K. Agrawal, "Parallel boosted regression trees for web search ranking," in International Conference on World Wide Web, March 2011, pp. 387–396.

[10] D. Warneke and O. Kao, "Exploiting dynamic resource allocation for efficient parallel data processing in the cloud," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 985–997, June 2011.

