ABSTRACT Machine learning algorithms have been intensively applied to load forecasting to obtain better accuracy than traditional statistical methods. However, with the huge increase in data size, sophisticated models have to be created, and these require big data platforms. Optimal and effective use of the available computational resources can be attained by maximizing the effective utilization of the cluster nodes. Parallel computing is therefore required for optimal resource utilization when dealing with smart grid big data. In this paper, a master-slave parallel computing paradigm is employed and evaluated for load forecasting in a multi-AMI environment. The paper proposes a concurrent job scheduling algorithm for a multi-energy data source environment using Apache Spark. An efficient resource utilization strategy is proposed for submitting multiple Spark jobs to reduce the job completion time. The optimal number of clusters is used to group the data and further reduce the computational time. Multiple tree-based machine learning algorithms are tested with parallel computation to evaluate the performance with tunable parameters on a real-world dataset. Three years of real data from one thousand distribution transformers in Spain are used to demonstrate the performance of the proposed methodology, with a trade-off between accuracy and processing time.
INDEX TERMS Apache Spark, concurrent computing, load forecasting, parallel processing, resource management.
of samples. Traditional ML algorithms were built with the assumption that the data can fit into memory, but in the era of big data it becomes challenging for ML models to adapt to the deluge of data. Because of its velocity, big data also imposes the challenge that not all of the data is available during training. Therefore, as the size of the data increases, distributed processing frameworks, parallel data structures, data reuse, and data partitioning become important characteristics. Resilient distributed datasets (RDDs) implemented in the Spark cluster computing framework exhibit in-memory characteristics [4]. This leads to the use of an architecture that can accommodate both the cluster computing framework and machine learning capabilities. To improve the performance of big data machine learning algorithms, changes to the way ML algorithms execute and to the processing infrastructure are necessary. Among the various ML paradigms in big data, this paper focuses on tree-based methods and ensemble learning techniques. Splitting a deluge of data into multiple datasets to train ML models has yielded significant improvement in the learning process in the big data context. For example, the authors in [5] applied ensemble learning to subsamples of big data, improving learning accuracy while simultaneously decreasing the computation time.

The multi-AMI infrastructure mostly concentrates on forecasting the load of all the distribution transformers (DTs) at the same time. In this paper, a novel scheduling technique based on the Apache Spark platform is proposed to short-term forecast the load of all one thousand transformers simultaneously. The Spark cluster submits big data analytics tasks as Spark jobs, and the computational resources are allocated optimally to these jobs. The amount of resources allocated to these jobs is customizable by the user and affects the Job Completion Time (JCT) significantly. This paper utilizes ML algorithms such as Spark Random Forest and Spark Gradient Boosted Regression Trees for training and forecasting the load. The proposed method performs load forecasting by submitting multiple jobs concurrently on the datasets, utilizing the cluster resources optimally.

The main contributions of this paper can be summarized as follows:
1) Proposing an optimal scheduling algorithm to perform load forecasting with parallel and distributed execution in a multi-AMI environment on smart grid big data.
2) Tuning the ML models to attain high accuracy along with measures to combat overfitting.
3) Testing the proposed methodology on all one thousand transformers' data without grouping, and then comparing it against the proposed grouping technique to show the latter's merits.

The performance of the proposed method is tested on real big data from an industry partner, Iberdrola, to validate its effect on performance.

The paper is organized as follows. Section II discusses the related research in the field of load forecasting using big data platforms. Section III describes the novel scheduling algorithm proposed to perform load forecasting on multiple datasets utilizing Apache Spark. Section IV describes the experimental setup of the optimal scheduling algorithm along with the results obtained by implementing the proposed methodology on real big data. Section V concludes the paper.

II. RELATED WORK
Many papers have proposed benchmarking results with the use of ML for load forecasting; in this section, the essence of big data smart grid load forecasting using Spark is outlined. The widely installed smart meters collect huge amounts of load data for each of the grid's distribution transformers. Many computing frameworks [6]-[9] have been developed for the analysis of big data, but MapReduce [6] is the most famous one because of its fault tolerance, parallel computation, and flexibility. Apache Spark [10], proposed by Zaharia et al., emerged to overcome the drawbacks of MapReduce. It is an open-source framework and is up to 100 times faster than Hadoop MapReduce [11]. Spark can execute over several cluster managers such as Hadoop YARN [12], Apache Mesos [13], and Spark's standalone scheduler. Spark can also interface with a variety of data storage repositories such as the Hadoop Distributed File System (HDFS) [14], Hive [15], and HBase [16], to name a few. However, Spark relies on distributed computing, which results in an increase in communication overhead. Previous research has observed that by only increasing the computational capability, the JCT first decreases but then starts increasing because of communication overhead [17]. Hence, the scheduling algorithm proposed in this paper focuses on utilizing the available computation capability while still being able to submit multiple jobs, without losing out to communication overhead.

Highly cited algorithms for forecasting smart grid data include linear regression, SVM and its variants [18], and artificial neural networks (ANNs) [19], [20]. A pooling-based deep recurrent neural network (DRNN) was proposed to learn spatial information, and it outperformed the support vector regressor (SVR), the auto-regressive integrated moving average (ARIMA) model, and the classical deep recurrent neural network (RNN) [21]. In [22], Aprillia et al. proposed a statistical approach for load forecasting using quantile regression random forest, a risk assessment index, and a probability map. In [23], a backpropagation approach was utilized to perform short-term load forecasting utilizing weather data. In [24], Jiang et al. performed mid-term load forecasting of a power supply unit (PSU), considered as a collection of distribution transformers. The authors utilized a dynamic based network (DBN), with the peak loads of all the distribution transformers within a PSU summed. All of the summed load values are utilized to train and forecast the load using Spark's standalone cluster. However, the use of the complete data for training instead of the summed load values can result in better training accuracy, but it requires an optimized scheduling method, which is achieved in this paper.
This paper focuses on hourly day-ahead load forecasting with the use of Spark ML tree-based algorithms. The models are trained with the spark.ml Application Programming Interface (API) of Spark, which is DataFrame-based and facilitates ML pipelines and flexible feature transformations [25].

III. PROPOSED LOAD FORECASTING METHODOLOGY FOR OPTIMIZED COMPUTATION WITH APACHE SPARK
The Spark ML library supports tree-based models, namely spark.ml decision trees, and ensemble models, namely spark.ml random forests and spark.ml gradient boosted regression trees [26]. The Spark session connects to the master node to submit jobs, where each job is split into stages, and stages are further split into tasks. Adding more tasks to a single job, where possible, is recommended over starting new jobs, to avoid start-up costs. In the case of data from multiple transformers, each dataset can be assigned as a job. To reduce the execution time of the load forecasting models, the load forecasting of multiple DTs is performed simultaneously with the help of parallel job submission in Spark. Moreover, the shortest job submitted may consume fewer resources than the other submitted jobs. To overcome this, Python's thread pool concurrency feature can be used in addition to the Spark fair scheduler. A solution is to decompose the complete dataset into clusters of transformer IDs and use multiple computing nodes to train the clustered models, with an added sequential step to test the model of each of the transformers within the clusters. However, it is necessary to train the clustered models first and then test the individual models within the clusters. This adds multiple layers of parallel processes executed sequentially, as illustrated in Fig. 1. For n clusters, the number of iterations to train the clustered data is n/j, where j is the number of jobs submitted simultaneously. As the n clusters are accessed repeatedly by the processes, the training data pertaining to the n clusters is cached into memory. Similarly, for t transformers belonging to a cluster, the total number of iterations to test the holdout data of each of the transformers (TFs) is n * (t/j). Although the value of n * (t/j) is larger than n/j for all values of t, the time in the former case (with n clusters) is much less than without clustering, provided the data size for each of the jobs in both cases is the same. The proposed parallel and sequential approach of the tree-based ensemble model is deployed on Spark. Fig. 2 is an illustration of the employed master-slave parallel computing paradigm, where a single master and multiple slaves are used. To incorporate the proposed methodology, parallelism in the datastore and in training are discussed next.

FIGURE 1. Nested parallelism with Spark (sequential and parallel runs).

FIGURE 2. Experimental setup of the Spark framework for load forecasting.

A. DATASTORE PARALLELISM
The big data of the transformers' load values with their timestamps is stored in HDFS with a replication factor of 3. The resulting load data partitions are constructed into RDDs and stored in the corresponding data nodes. The number of partitions is automatically set by Spark as one partition per file block; however, the data is repartitioned to 20 partitions, equal to the number of cores in each of the nodes, using the PySpark programming interface. Spark also supports the Kryo serializer, which is almost 10x faster than the default Java serializer. The Kryo serializer is a graph serialization framework that is efficient and fast and performs a direct copy from object to object rather than going through an intermediate byte representation.
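A minimal PySpark sketch of the datastore-side setup just described is given below; the HDFS path, application name, and column layout are illustrative assumptions rather than the exact code used in the paper.

```python
# Sketch: Kryo serialization, reading the load data from HDFS,
# repartitioning to the per-node core count, and caching.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stlf-datastore")                 # illustrative name
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .config("spark.rdd.compress", "true")      # compress serialized partitions
         .getOrCreate())

# Hypothetical HDFS location of the transformers' load data (ORC files).
load_df = spark.read.orc("hdfs:///smartgrid/load/")

# Repartition to 20 partitions (the per-node core count mentioned above)
# and cache, since the grouped training data is accessed repeatedly.
load_df = load_df.repartition(20).cache()
```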
B. TRAINING PARALLELISM
The data from HDFS is read into a Spark DataFrame for the analysis. By using the DataFrame API only, all the physical execution is compiled in native Spark on the Java Virtual Machine (JVM), while only the logical plan is constructed in PySpark [27]. The use of the DataFrame API in PySpark results in efficient execution, as it avoids the creation of the key-value pairs that occur in Scala. DataFrames in Spark are immutable like RDDs and are conceptually similar to a pandas data frame or a relational database table. However, the important difference is the execution of transformations and actions in Spark: Spark's catalyst optimizer creates an optimized logical plan before sending instructions to the Spark driver. As the catalyst optimizer functions are the same across all the language APIs, DataFrames provide equivalent performance for all the Spark APIs. Once a logical plan is created, it is visualized as a Directed Acyclic Graph (DAG), as shown in Fig. 3, and is distributed among all the tasks in a job so that each of the stages can be performed concurrently.
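To make the logical-versus-physical plan distinction concrete, the short sketch below builds a simple aggregation on the cached DataFrame load_df from the earlier sketch and prints the plans produced by the Catalyst optimizer; the column names are assumptions.

```python
# Sketch: only the logical plan is built in PySpark; Catalyst produces the
# optimized logical plan and the physical plan executed on the JVM.
from pyspark.sql import functions as F

daily_avg = (load_df
             .groupBy("meterID", "year", "month", "day")
             .agg(F.avg("load_kwh").alias("avg_daily_load")))

# Prints parsed, analyzed, and optimized logical plans plus the physical plan.
daily_avg.explain(True)
```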
Considering the merits of Spark, it is used as the big data processing platform in our application for two main computing tasks:
1) Average load matrix calculation: the elements of the average load matrix consist of the load averaged for 1 lag day, 7 lag days, etc. The data is fed into the matrix calculation from the historical data stored in HDFS, and the computations are carried out in PySpark.
2) Simultaneous training of the DTs' load forecasting models with the help of thread pools in Python and multiple jobs in Spark utilizing a fair scheduler; a minimal sketch of this mechanism is given after the list.
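The following sketch illustrates the second task: several per-transformer Spark jobs are submitted concurrently from Python threads under the fair scheduler. The pool name, the body of train_one, and the meter-ID list are illustrative assumptions that stand in for the full training logic.

```python
# Sketch: concurrent Spark job submission with a thread pool and
# the fair scheduler.
from multiprocessing.pool import ThreadPool
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stlf-concurrent-jobs")
         .config("spark.scheduler.mode", "FAIR")    # fair scheduling across jobs
         .getOrCreate())

load_df = spark.read.orc("hdfs:///smartgrid/load/")   # as in the earlier sketch
meter_ids = [f"TF{i:04d}" for i in range(1000)]       # illustrative meter IDs

def train_one(meter_id):
    # Each thread submits its own Spark job into a fair-scheduler pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stlf")
    d = load_df.filter(load_df.meterID == meter_id)
    return meter_id, d.count()        # stand-in for fitting a forecasting model

pool = ThreadPool(18)                 # jobs submitted simultaneously (illustrative)
results = pool.map(train_one, meter_ids)
pool.close()
pool.join()
```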
IV. OPTIMAL SCHEDULING ALGORITHM
Scheduling jobs while considering the available resources is challenging. An optimal scheduling algorithm is necessary to schedule the jobs so as to reduce the execution time. As the load of multiple transformers has to be predicted at the same time, two scheduling algorithms are leveraged in this paper. In this section, the solutions for optimal scheduling when communication costs are ignored and when they are considered are discussed.

A. IGNORING COMMUNICATION COSTS
Considering w available workers and M jobs to be executed, three cases can be distinguished: (w < M) with M an exact multiple of w (wx = M); (w < M) with M not a multiple of w (w(x - 1) < M < wx); and w >= M, where x is the number of submission rounds. The algorithm in this section is structured as follows.
Step 1: Submit the array of tasks to the w workers.
Step 2: w jobs are submitted to the available w workers.
Step 3: Whenever a processor becomes available, assign it the unexecuted ready job with the highest priority.

Submitting the jobs with the help of a thread pool as discussed in Section III, w jobs are submitted at the same time. Writing pool(train, [1, 2, ..., w]) for the submission of w concurrent training jobs, the algorithm flow can be elaborated for the three cases as follows:
Case I, (w < M) and (wx = M): the pool is executed x times, each round running jobs Tm1, Tm2, ..., Tmw.
Case II, (w < M) and (w(x - 1) < M < wx): the pool is executed x - 1 times with w jobs per round, followed by a final round with the remaining jobs Tm1, Tm2, ..., Tm(M-(x-1)w).
Case III, (w >= M): a single round executes all M jobs Tm1, Tm2, ..., TmM at once.

Here Tm is the time taken for an individual job execution and is assumed to be the same for all jobs. Case III rarely applies in practice, since the computational capability is usually not that high when the number of jobs to be submitted is in the thousands. The total execution time in all three cases can be summarized in (1), where T denotes the common per-round (equivalently per-job) execution time:

T_{total} = \begin{cases} xT, & (w < M) \text{ and } (wx = M) \\ xT, & (w < M) \text{ and } (w(x-1) < M < wx) \\ T, & w \geq M \end{cases} \qquad (1)

For example, with w = 20 workers and M = 100 jobs, x = 5 rounds are needed and the total time is 5T. Because of the way the concurrent jobs are submitted, w workers are assigned for each round of parallel runs. Even the last round, in which fewer than w jobs may remain, still takes the same time T, as w workers are reserved to perform it. Algorithm 1 details the overall proposed load forecasting methodology based on the optimal scheduling algorithm discussed in Section III.

Algorithm 1 Proposed Optimal Scheduling Algorithm
Input:
  j: the number of batches
  T: an array consisting of each of the meterIDs
  w: number of workers (indicates the number of cores in an executor)
  D: data filtered as per meterID
  tfs: transformers allocated to a cluster with clusterID
  csv: an empty csv file to accumulate all the results
Initialize:
  def cluster(clusterID):
    D_clusterID = clusterArray[clusterID];
    Cache D_clusterID into memory for repeated access;
    Create the train and holdout data from D_clusterID;
    Perform ML modeling on the grouped train data by performing hyperparameter tuning;
    Choose the hyperparameters with the least error and store the model M_clusterID;
  end def
  def forecast(n, D_t):
    Create the train and holdout data from D_t;
    Choose the hyperparameters with the least error and store the model M;
    Perform testing on the holdout data with M_clusterID;
    Use model M to predict the holdout data;
    Test the accuracy of the predicted model;
    Read the results into the csv file;
    Update the csv file with the training accuracy and the holdout dataset accuracy along with the meterID;
  end def
Output:
  csv: the accuracy of the holdout data of all the T models.
1. groupBy D with meterID and timestamp
2. Perform clustering with the optimal value of k to obtain the group of clusters as clusterlist
3. Split D into an array of dataframes based on the clusterlist as clusterArray[D_1, D_2, ..., D_n], where n is the number of clusters
4. Call the pool.map function with the cluster function and clusterID as variables;
   The function cluster is called n times in batches of j, resulting in n/j iterations. If any processor is available, the next ready job is assigned to it. The results are updated simultaneously and at any point equal the number of completed processes;
5. Close the pool;
6. Call the join function after all the n/j iterations are completed;
7. Open a csv file to store the results;
8. for n:
9.   tfs = clusterlist[n];
10.  Split D_n into an array of dataframes based on the tfs as tfArray[D_1, D_2, ..., D_t], where t is the number of transformers belonging to cluster n.
11.  Call the pool.map function for the forecast function with n and D_t as variables;
     The function forecast is called t times in batches. If any processor is available, the next ready job is assigned to it. The results are updated simultaneously and at any point equal the number of completed forecast processes;
12.  Close the pool;
13.  Call the join function after all the t/j iterations are completed;
14. end for
15. return csv
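A compact Python rendering of Algorithm 1's structure is sketched below: one model is trained per cluster with jobs running concurrently in batches, and the transformers of each cluster are then forecast concurrently. It reuses the load_df DataFrame from the earlier sketches; the random forest pipeline, feature columns, and helper names are readability assumptions, and per-transformer re-fitting and hyperparameter tuning are omitted. The pseudocode above remains the authoritative description.

```python
# Sketch of Algorithm 1: cluster-level training followed by
# per-transformer forecasting, both driven by a thread pool.
from multiprocessing.pool import ThreadPool
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

j = 18                    # concurrent jobs per batch (illustrative)
n_clusters = 93           # cluster count used in the paper

# Assumed grouping step: one DataFrame per cluster of transformer IDs
# (a clusterID column is assumed to exist after the grouping).
cluster_frames = {c: load_df.filter(load_df.clusterID == c).cache()
                  for c in range(n_clusters)}
cluster_members = {c: [r.meterID for r in
                       cluster_frames[c].select("meterID").distinct().collect()]
                   for c in cluster_frames}
models = {}

def train_cluster(cluster_id):
    # Train one grouped model per cluster (tuning omitted for brevity).
    train, _ = cluster_frames[cluster_id].randomSplit([0.9, 0.1], seed=42)
    pipe = Pipeline(stages=[
        VectorAssembler(inputCols=["year", "month", "day", "hour"],
                        outputCol="features"),
        RandomForestRegressor(featuresCol="features", labelCol="load_kwh")])
    models[cluster_id] = pipe.fit(train)

def forecast_transformer(job):
    # Apply the cluster model to one transformer's holdout data.
    cluster_id, meter_id = job
    d = cluster_frames[cluster_id]
    d = d.filter(d.meterID == meter_id)
    _, holdout = d.randomSplit([0.9, 0.1], seed=42)
    return meter_id, models[cluster_id].transform(holdout)

pool = ThreadPool(j)
pool.map(train_cluster, list(cluster_frames))        # n/j training iterations
forecasts = pool.map(forecast_transformer,
                     [(c, m) for c in cluster_members
                      for m in cluster_members[c]])
pool.close()
pool.join()
```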
B. CONSIDERING COMMUNICATION COSTS
The main idea of this scheduling task is to augment the scheduling with new precedence relations in order to compensate for the communication time. By clustering the jobs into C clusters and submitting each cluster to the same worker, the overall communication between clusters is minimized. If T~ is the time taken by a cluster including the communication costs, and y is the number of submission rounds for the clusters such that wy = C, then yT~ is the time taken for all the jobs, where y < x.
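A small sketch of this idea follows, under the assumption that a plain round-robin grouping of jobs is acceptable: each thread receives one whole group of related jobs and processes it sequentially, so data shared within a group is not shuttled between workers. It reuses the train_one helper and meter_ids list from the earlier sketch.

```python
# Sketch: group jobs and keep each group on one worker slot to limit
# inter-worker communication; the grouping rule is illustrative.
from multiprocessing.pool import ThreadPool

def run_group(job_group):
    # One thread handles a whole group of related meter IDs sequentially.
    return [train_one(meter_id) for meter_id in job_group]

C = 93                                             # number of job clusters (groups)
job_groups = [meter_ids[g::C] for g in range(C)]   # simple round-robin grouping
pool = ThreadPool(18)                              # w concurrent worker slots
grouped_results = pool.map(run_group, job_groups)  # y = C / w submission rounds
pool.close()
pool.join()
```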
C. OBJECTIVE FUNCTION
This section formulates the theoretical functions for the parallel and sequential training approaches and proposes an implementation based on the Spark platform. The collected transformer power data is denoted as D, where D_1, D_2, D_3, ..., D_M denote the data for the M meters. The data D_m for meter m consists of F features, namely month, day, year, etc. Therefore, the chunk of data for a meter ID can be expressed by D_m as follows:

D_m = [X_1^m, X_2^m, \ldots, X_F^m] \qquad (2)

D = \bigcup_{m=1}^{M} D_m = \bigcup_{m=1}^{M} \bigcup_{n=1}^{N^m} D_n^m \qquad (3)

where X_f^m is feature f of the chunk of data for meter ID m, and N^m is the size of the m-th dataset. This chunk of data is the trainable input to the machine learning model. Additionally, based on the data decomposition shown in (2), the mean square error for regression of the parallel training of the ML model is represented as

RMSE_{OOB} = \min \frac{1}{N} \sum_{m=1}^{M} J^{m} = \min \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{N^{m}} J_{n}^{m} \qquad (4)

and the loss function J_n^m of sample n in data subset m is given by (5):

J_{n}^{m} = \sqrt{\frac{1}{N^{OOB}} \sum_{n=1}^{N} \left(y_{n}^{m} - \hat{y}_{n}^{m}\right)^{2}} \qquad (5)

where J^m in (6) is the loss function of the m-th dataset:

J^{m} = \sum_{n=1}^{N^{m}} J_{n}^{m} \qquad (6)

y_n^m and \hat{y}_n^m are the observed and predicted load values, respectively, of sample n in data subset m, and N is the dimension of each of the output samples. The ML model training is performed to minimize RMSE_OOB in (4) and obtain the trees using the dataset D. Similar procedures are performed for the subset dataset D_m concerning data subset m for transformer-level load forecasting.

V. CASE STUDY
Firstly, the experimental setup is introduced in this section. Secondly, the performance of the proposed scheduling algorithms is evaluated. Finally, the results are presented and discussed.

A. EXPERIMENTAL SETUP
1) CLUSTER CONFIGURATION
The Apache Spark platform, where all the computations are performed, consists of one master node and 5 slave nodes, as shown in Fig. 2. Each of the 5 compute nodes is Linux-based and contains 24 physical CPU cores (2 processor sockets with 12 cores per socket) and 128GB of RAM. The interconnect is comprised of the Cray Aries network, which is employed both for MPI and for storage traffic [28]. Hadoop 2.8.0 and Spark 3.0.0 are installed on both the master and the slave nodes. The load forecasting algorithm is implemented in Python 3.6.4.

2) DATA COLLECTION, STORAGE, & PREPROCESSING
In the experiments, the dataset consists of the load value and timestamp of 1000 transformer meters of the Iberdrola network [29]. The data is split into 90% training data (Jan 2017 to Jun 2019) and a 10% holdout dataset (July 2019 to September 2019). The total dataset amounts to around ~24,000,000 records. The data was collected from the utility company in the Optimized Row Columnar (ORC) format and was stored in HDFS on 5 data nodes and replicated 3 times. Currently, Spark supports timestamp input for time series through the Flint library only as a Flint context and not a Flint session. As a consequence of this limitation, the timestamp is split into the year, month, day, and hour.
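A short PySpark sketch of that preprocessing step is shown below; the input column name timestamp and the HDFS path are assumptions.

```python
# Sketch: derive calendar features from the raw timestamp column,
# since a dedicated time-series (Flint) session is not used.
from pyspark.sql import functions as F

raw_df = spark.read.orc("hdfs:///smartgrid/load/")        # assumed location
feat_df = (raw_df
           .withColumn("year",  F.year("timestamp"))
           .withColumn("month", F.month("timestamp"))
           .withColumn("day",   F.dayofmonth("timestamp"))
           .withColumn("hour",  F.hour("timestamp")))
```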
Fig. 4 shows the power consumption pattern for all three years in the top left, data with large load values on the top right, and the frequency of the load values in the bottom left and bottom right graphs. It can be noted that the bottom left graph is right-skewed; after log normalization, the spread of the data is comparatively more diverse but still not normally distributed. The bottom right graph uses a log(x + 1) normalization, as the data contains load values of 0. It can be noticed that the data is right-skewed.

FIGURE 4. Top left: load distribution across all three years (the vertical axis indicates the load value in kWh and the x-axis indicates the timestamp). Top right: data with large load values greater than 1000 kWh (the vertical axis indicates the transformer ID the data belongs to and the x-axis indicates the load value in kWh). Bottom left: frequency of the load distribution limited to 1000 kWh. Bottom right: frequency of the log-normalized load plus 1.

In the short-term load forecasting (STLF) scenario of this work, the load values of 1000 DTs need to be forecasted at the same time. The data profile of each DT ranges from January 2017 to September 2019. Based on the above information, an ideal load forecasting model for STLF requires:
1) The time series of the historical data for the load profile.
2) The parameters of the trained ML model, to accelerate the load forecasting.
3) A trained model for a single DT consisting of simple, yet accurate, parameters.
4) A model executed and realized efficiently on a parallel processing platform, i.e., Apache Spark.
Spark tree models support both continuous and categorical features, partitioning of the data by rows, and distributed training. Algorithms available in spark.ml are used for the performance comparison, which includes the Spark decision tree (Spark DT) and the tree ensembles, i.e., the Spark parallelized random forest (Spark RF) and Spark gradient boosted trees (Spark GBT).

B. PERFORMANCE EVALUATION
1) AVERAGE RMSE
The objective of forecasting future load consumption is to predict the load with high precision and speed so as to achieve near real-time processing ability. The root mean square error (RMSE) is used as the error metric because of its wide use. To evaluate the predictive performance, the training dataset is separated from the holdout dataset (data never used for training). All the models are built on the training data and optimized to obtain as low an RMSE_train as possible, and then used to predict on the holdout dataset to note the RMSE_holdout. Moreover, to evaluate the performance on all the holdout datasets of the different transformers, the average RMSE (ARMSE) is calculated as described below:

ARMSE = \frac{1}{M} \sum_{i=1}^{M} RMSE_{holdout}^{i}, \quad 1 \leq i \leq M \qquad (7)

The ARMSE shows how well the ML model learns the data for all the distribution transformers. The reason for choosing the ARMSE is to attain a high average accuracy across all the distribution transformers and not just one or a few.
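The sketch below shows one way to obtain each transformer's holdout RMSE with spark.ml's RegressionEvaluator and to average these into the ARMSE of (7); it reuses the (meter ID, prediction DataFrame) pairs produced by the Algorithm 1 sketch and assumes the column names used earlier.

```python
# Sketch: per-transformer holdout RMSE, averaged into the ARMSE of Eq. (7).
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="load_kwh",
                                predictionCol="prediction",
                                metricName="rmse")

# forecasts: list of (meter_id, holdout-prediction DataFrame) pairs
# returned by the earlier Algorithm 1 sketch.
rmse_holdout = {meter_id: evaluator.evaluate(pred) for meter_id, pred in forecasts}
armse = sum(rmse_holdout.values()) / len(rmse_holdout)
```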
2) EXECUTION TIME
An important objective of the proposed methodology is to reduce the processing time for the transformers' data.

3) SPARK OPTIMIZATION
Besides Spark being an in-memory computing framework, it runs on top of Java Virtual Machines (JVMs). Hence, tuning the JVM parameters is necessary to improve the performance of Spark. In this paper, the authors have identified three key Spark parameters that impact the utilization of resources so as to reduce the workload execution time. The paper has also focused on the right choice of parameters that impact memory serialization, data compression, caching, and repartitioning of data. Compressing serialized RDDs helps in saving substantial space at the expense of some extra CPU time. Compressing RDDs in shuffle operations is particularly advantageous because of their random and repeated reads and writes. Compression of Spark RDDs is achieved with the help of a codec. Experiments are conducted considering: i) various combinations of the number of executors, ii) the number of cores per executor, and iii) the amount of memory for each of the executors. If CO is the total number of cores in the configuration, then

CO = E \times COperE \qquad (8)

where E is the total number of executors assigned and COperE is the number of cores assigned per executor in the Spark configuration. The distribution of the total memory in the Spark configuration is given as follows:

MEM = (0.9 \times MEMperE \times E) + (0.1 \times MEMperE \times E) \qquad (9)

where MEMperE is the memory assigned per executor. The second term in (9) is the overhead memory allocated to each of the executors, which accounts for virtual machine or other native overheads. This additional memory is usually chosen by the Spark cluster computing system as either 10% of the executor memory or a minimum of 384MB [30]. Further, MEMperE is divided into two fractions, one for memory and the other for storage. The memory fraction handles the data structures and protects against out-of-memory errors, and the storage fraction handles the cached blocks of data. The values of CO and MEM can vary and are very specific to the cluster used for configuring Spark. Choosing a larger value of E results in a smaller COperE to balance CO; similarly, choosing a larger value of E reduces MEMperE to balance MEM.
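As an illustration of (8) and (9), the configuration sketch below requests E = 5 executors with COperE = 20 cores each (CO = 100) and sets a per-executor memory allocation plus roughly 10% overhead; the specific memory figures are placeholders, not the tuned values reported later.

```python
# Sketch: executor sizing following Eq. (8)-(9); memory values are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stlf-tuned")
         .config("spark.executor.instances", "5")          # E
         .config("spark.executor.cores", "20")             # COperE, so CO = 100
         .config("spark.executor.memory", "100g")          # MEMperE (main allocation)
         .config("spark.executor.memoryOverhead", "10g")   # ~10% overhead term in (9)
         .getOrCreate())
```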
C. RESULTS AND DISCUSSION
In this section, the metrics discussed in the previous section are evaluated on the dataset to showcase the benefits of the optimal scheduling algorithm. The ARMSE and the execution
FIGURE 5. Performance evaluation. (a) shows the speedup for various cluster sizes for a concurrent job submission size of 18 and (b) presents the
speedup of increasing the number of jobs. A value of y = 93 is chosen for all the job submission values.
FIGURE 7. Comparison of compute time at various stages of load forecasting. (a) Results obtained for the time taken to perform clustering,
training time and testing time on the holdout dataset for SLR(spark LR), SDT, SRF and SGBT. (b) The execution time involves clustering, training
of grouped data, testing on clustered data, training on individual transformers and testing on individual transformers for the spark ML models.
transformers with clustering, the training time of the individual transformers, and the testing time for the individual transformers) for the 1000 models is shown in Fig. 7(b). The time taken by the gradient boosted algorithm is the highest compared to the other spark.ml algorithms. Although both random forest and gradient boosted trees are ensemble models, the random forest takes noticeably less time than the gradient boosted trees. The inference from this observation is that gradient boosting is inherently sequential and is therefore expected to take more execution time, whereas the multiple trees of the random forest can be run in parallel across the nodes to speed up the execution. The times observed in Fig. 7(a) show the lowest training time for the Spark decision tree regressor. It can be noted that the time taken to perform testing is almost as long as the training time. This supports the proposed methodology, which states that performing the analysis on grouped data is preferred over individual transformers' data. However, as the testing has to be performed on all the DTs' datasets, grouping cannot be applied there to reduce the execution time.

To compare the results with different previous works, the comparison has been made with datasets of similar sizes, and the computational capacities utilized have been documented. The comparison has been made with methodologies that have utilized distributed ML modeling with Apache Spark and is presented in Table 1. It can be observed that, although the proposed methodology operates on a dataset of ~24 million records, it is superior in terms of execution time compared to previous works performing distributed machine learning on big data.

TABLE 1. Comparison of performance of ML model in terms of execution time with previous works.

2) VALIDATION OF SPARK OPTIMIZATION
To validate the use of an optimal number of COperE, experiments are conducted based on various combinations of COperE and E, which in turn affect MEMperE. Fig. 8 displays the comparison of run time for various combinations of executors and cores per executor. The combination with the largest number of cores per executor shows the lowest run time, as per the secondary y-axis in Fig. 8. As the job submission computes multiple jobs at the same time, a larger number of workers helps in distributing the jobs across more workers. However, choosing more E and less COperE is not expected to be efficient, as the work will be distributed across more executors, resulting in a larger transfer of data across the executors. A choice of fewer than 5 executors is not possible, as the number of nodes in the configuration is 5 and each node contains a total of 120GB of memory; reducing the number of executors would require each executor to contain more than 120GB, which exceeds the threshold and is practically not possible. Hence, a choice of 5 executors and 20 cores per executor is decided as the optimized combination of the Spark configuration. It is also worth mentioning that as the number of executors is increased, MEMperE is reduced, as it is distributed among the executors to sum up to MEM.

Other than time, communication overhead and data transfer are also a concern in distributed computing. By increasing the depth of

FIGURE 8. Run time comparison for various Spark optimization parameters.

FIGURE 9. ARMSE of training and holdout dataset for the Spark decision tree. The spot above 820 nodes results in overfitting of the datasets.
TABLE 3. Final ARMSE, for training and holdout dataset after choosing tuned parameters.
FIGURE 10. ARMSE comparison of training and holdout dataset for all the DT’s.
spark in terms of execution can be observed here. Thus, it can be concluded that the Spark RF performs better than the other spark.ml models under comparison.

Fig. 10 shows the plot of the RMSE of all the distribution transformers under consideration. The red line indicates the RMSE (kWh) obtained for all the distribution transformers. The blue line, indicated as the holdout RMSE (the RMSE of the data never used for training), is the forecasting error in kWh. To measure the quality of the trained models, the holdout RMSE is expected to be as close as possible to the training RMSE. From Fig. 10, it can be observed that the blue line follows the red line for almost all the transformers. Randomly chosen DTs, indexed as 0, 78, 208, 91, 13, 39, 104, 1, and 52, present the training RMSE and holdout RMSE zoomed in at the top right of the figure. The plots indicate that the forecasting accuracy follows the training accuracy closely, showing that the built ML model is quite robust in terms of performance while increasing the speedup when a large number of jobs is performed.

VI. CONCLUSION
In this paper, a smart scheduling algorithm to perform load forecasting on multiple DTs was proposed. The proposed approach was implemented on Apache Spark not only to deal with the challenges associated with computation time while handling the big data but also to submit jobs using an optimized methodology in a parallel manner. The processed big data was partitioned into various chunks and cached to improve the performance in terms of storage and in-memory processing. One distinctive characteristic of the proposed methodology is the ability to submit the maximum number of jobs and to process all the jobs in parallel. Several experiments were performed to optimize the scheduling strategy in terms of ML model error and execution time. A large number of DT training procedures were performed with reduced run times, which allows handling big data that is too large to be stored in memory. The training of the 93 grouped clusters with a data size of ~24 million records was performed in ~50 sec, and forecasting the 1000 transformers with ~2.4 million records took ~57 sec. The total time, including grouping, training, and forecasting, was ~450 sec. The other important achievement of this paper is a 2 times faster execution time with the use of the thread pool and fair scheduler. This is a good optimization strategy for load forecasting using multi-sensor big datasets. Empirical evaluations significantly outperformed the previously proposed iterative algorithms. Moreover, the proposed ML models achieved higher accuracies. The merits shown in the experiments indicate that there is great potential for the proposed method to be used in big data load forecasting of multi-AMI environments.

As this work chooses the optimized cluster value of 93, the next plan is to conduct experiments to investigate the optimal cluster value utilizing the proposed approach while using the Spark platform. Also, scaling the dataset to more than 1000 DTs requires more than a minimum of 100 jobs to be submitted. Scaling the size of the Spark cluster to an optimal value is a subject for future work.
ACKNOWLEDGMENT
The HPC (and/or scientific visualization) resources and services used in this work were provided by the Research Computing group in Texas A&M University at Qatar. Research Computing is funded by the Qatar Foundation for Education, Science and Community Development (http://www.qf.org.qa).

REFERENCES
[1] P. Wang, B. Liu, and T. Hong, ''Electric load forecasting with recency effect: A big data approach,'' Int. J. Forecasting, vol. 32, no. 3, pp. 585-597, Jul. 2016, doi: 10.1016/j.ijforecast.2015.09.006.
[2] A. L'Heureux, K. Grolinger, H. F. Elyamany, and M. A. M. Capretz, ''Machine learning with big data: Challenges and approaches,'' IEEE Access, vol. 5, no. 1, pp. 7776-7797, 2017, doi: 10.1109/ACCESS.2017.2696365.
[3] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, ''Core vector machines: Fast SVM training on very large data sets,'' J. Mach. Learn. Res., vol. 6, pp. 363-392, Apr. 2005.
[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, ''Spark: Cluster computing with working sets,'' in Proc. 2nd USENIX Workshop Hot Topics Cloud Comput. (HotCloud), vol. 10, 2010, p. 95.
[5] Y. Tang, Z. Xu, and Y. Zhuang, ''Bayesian network structure learning from big data: A reservoir sampling based ensemble method,'' in Proc. Int. Conf. Database Syst. Adv. Appl., vol. 9645, Dallas, TX, USA, 2016, pp. 209-222, doi: 10.1007/978-3-319-32055-7_18.
[6] J. Dean and S. Ghemawat, ''MapReduce: Simplified data processing on large clusters,'' Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008, doi: 10.1145/1327452.1327492.
[7] P. Mika, ''Flink: Semantic Web technology for the extraction and analysis of social networks,'' J. Web Semantics, vol. 3, nos. 2-3, pp. 211-223, Oct. 2005, doi: 10.1016/j.websem.2005.05.006.
[8] A. Baldominos, E. Albacete, Y. Saez, and P. Isasi, ''A scalable machine learning online service for big data real-time analysis,'' in Proc. IEEE Symp. Comput. Intell. Big Data (CIBD), Orlando, FL, USA, Dec. 2014, pp. 1-8, doi: 10.1109/CIBD.2014.7011537.
[9] Y. Zhang, S. Chen, Q. Wang, and G. Yu, ''i2MapReduce: Incremental MapReduce for mining evolving big data,'' IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1906-1919, Jul. 2015, doi: 10.1109/TKDE.2015.2397438.
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, ''Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,'' in Proc. 9th USENIX Symp. Networked Syst. Design Implement., San Jose, CA, USA, 2012, pp. 15-28.
[11] N. Bharill, A. Tiwari, and A. Malviya, ''Fuzzy based scalable clustering algorithms for handling big data using Apache Spark,'' IEEE Trans. Big Data, vol. 2, no. 4, pp. 339-352, Dec. 2016, doi: 10.1109/TBDATA.2016.2622288.
[12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, ''Apache Hadoop YARN: Yet another resource negotiator,'' in Proc. 4th Annu. Symp. Cloud Comput., Santa Clara, CA, USA, Oct. 2013, pp. 1-16, doi: 10.1145/2523616.2523633.
[13] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, ''Mesos: A platform for fine-grained resource sharing in the data center,'' in Proc. NSDI, vol. 11, 2011, pp. 295-308.
[14] T. White, Hadoop: The Definitive Guide, 3rd ed. Sebastopol, CA, USA: O'Reilly Media, 2012.
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. S. Wyckoff, and R. Murthy, ''Hive: A warehousing solution over a map-reduce framework,'' Proc. VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.
[16] L. George, HBase: The Definitive Guide: Random Access to Your Planet-Sized Data, 1st ed. Sebastopol, CA, USA: O'Reilly Media, 2011.
[17] Z. Hu, D. Li, and D. Guo, ''Balance resource allocation for spark jobs based on prediction of the optimal resource,'' Tsinghua Sci. Technol., vol. 25, no. 4, pp. 487-497, Aug. 2020, doi: 10.26599/TST.2019.9010054.
[18] R. E. Edwards, J. New, and L. E. Parker, ''Predicting future hourly residential electrical consumption: A machine learning case study,'' Energy Buildings, vol. 49, pp. 591-603, Jun. 2012, doi: 10.1016/j.enbuild.2012.03.010.
[19] S. S. Reddy and J. A. Momoh, ''Short term electrical load forecasting using back propagation neural networks,'' in Proc. North Amer. Power Symp. (NAPS), Sep. 2014, pp. 1-6, doi: 10.1109/NAPS.2014.6965453.
[20] S. S. Reddy, C.-M. Jung, and K. J. Seog, ''Day-ahead electricity price forecasting using back propagation neural networks and weighted least square technique,'' Frontiers Energy, vol. 10, no. 1, pp. 105-113, Mar. 2016, doi: 10.1007/s11708-016-0393-y.
[21] H. Shi, M. Xu, and R. Li, ''Deep learning for household load forecasting—A novel pooling deep RNN,'' IEEE Trans. Smart Grid, vol. 9, no. 5, pp. 5271-5280, Sep. 2018, doi: 10.1109/TSG.2017.2686012.
[22] H. Aprillia, H.-T. Yang, and C.-M. Huang, ''Statistical load forecasting using optimal quantile regression random forest and risk assessment index,'' IEEE Trans. Smart Grid, vol. 12, no. 2, pp. 1467-1480, Mar. 2021, doi: 10.1109/TSG.2020.3034194.
[23] S. S. Reddy, ''Bat algorithm-based back propagation approach for short-term load forecasting considering weather factors,'' Electr. Eng., vol. 100, no. 3, pp. 1297-1303, Sep. 2018, doi: 10.1007/s00202-017-0587-2.
[24] W. Jiang, H. Tang, L. Wu, H. Huang, and H. Qi, ''Parallel processing of probabilistic models-based power supply unit mid-term load forecasting with Apache Spark,'' IEEE Access, vol. 7, pp. 7588-7598, 2019, doi: 10.1109/ACCESS.2018.2890339.
[25] Classification and Regression—Spark 3.0.1 Documentation. Accessed: Nov. 26, 2020. [Online]. Available: https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-trees
[26] X. Meng, J. Bradley, B. Yavuz, and E. Sparks, ''MLlib: Machine learning in Apache Spark,'' J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235-1241, 2016.
[27] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, ''Spark SQL: Relational data processing in Spark,'' in Proc. ACM SIGMOD Int. Conf. Manage. Data, May 2015, pp. 1383-1394, doi: 10.1145/2723372.2742797.
[28] TAMUQ Research Computing Policies, Research Computing @ TAMUQ. Accessed: Jan. 11, 2021. [Online]. Available: https://rc-docs.qatar.tamu.edu/index.php/Main_Page
[29] STAR Project, Iberdrola. Accessed: Jan. 11, 2021. [Online]. Available: https://www.iberdrola.com/about-us/lines-business/flagship-projects/star-project
[30] The Apache Software Foundation. Spark Configuration. Accessed: Feb. 11, 2021. [Online]. Available: http://spark.apache.org/docs/1.2.1/ec2-scripts.html
[31] D. Syed, H. Abu-Rub, A. Ghrayeb, S. S. Refaat, M. Houchati, O. Bouhali, and S. Banales, ''Deep learning-based short-term load forecasting approach in smart grid with clustering and consumption pattern recognition,'' IEEE Access, early access, Apr. 8, 2021, doi: 10.1109/ACCESS.2021.3071654.
[32] D. Syed, S. S. Refaat, and H. Abu-Rub, ''Performance evaluation of distributed machine learning for load forecasting in smart grids,'' in Proc. Cybern. Informat. (K&I), Jan. 2020, pp. 1-6, doi: 10.1109/KI48306.2020.9039797.
[33] Y. Xu, H. Liu, and Z. Long, ''A distributed computing framework for wind speed big data forecasting on Apache Spark,'' Sustain. Energy Technol. Assessments, vol. 37, Feb. 2020, Art. no. 100582, doi: 10.1016/j.seta.2019.100582.
[34] A. Zainab, D. Syed, A. Ghrayeb, H. Abu-Rub, S. S. Refaat, M. Houchati, O. Bouhali, and S. Banales Lopez, ''A multiprocessing-based sensitivity analysis of machine learning algorithms for load forecasting of electric power distribution system,'' IEEE Access, vol. 9, pp. 31684-31694, 2021, doi: 10.1109/ACCESS.2021.3059730.

AMEEMA ZAINAB (Member, IEEE) received the bachelor's degree in electronics and communication engineering from Osmania University, Hyderabad, India, in 2013, and the M.S. degree in data science and engineering from Hamad Bin Khalifa University (HBKU), Qatar. She is currently pursuing the Ph.D. degree in electrical engineering with Texas A&M University (TAMU), College Station, TX, USA. She has three years of industry experience, working as a Data Analytics Professional, supporting audit at Deloitte Touche LLP, Hyderabad. She is also a base SAS Certified Programmer. Her research interests include data science, big data machine learning, power forecasting, and big data management in the smart grids.
ALI GHRAYEB (Fellow, IEEE) received the Ph.D. degree in electrical engineering from The University of Arizona, Tucson, AZ, USA, in 2000. He was a Professor with the Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada. He is currently a Professor with the Department of Electrical and Computer Engineering, Texas A&M University at Qatar. His research interests include wireless and mobile communications, physical layer security, massive MIMO, and visible light communications. He served as an instructor or a co-instructor in technical tutorials at several major IEEE conferences. He served as the Executive Chair for the 2016 IEEE WCNC Conference. He has served on the editorial board of several IEEE and non-IEEE journals.

SHADY S. REFAAT (Senior Member, IEEE) received the B.A.Sc., M.A.Sc., and Ph.D. degrees in EE from Cairo University, Giza, Egypt, in 2002, 2007, and 2013, respectively. For more than 12 years, he has worked in the industry as an Engineering Team Leader, a Senior EE, and an Electrical Design Engineer. He is currently an Associate Research Scientist with the Department of ECEN, TAMU-Q. He has published more than 100 journal and conference papers. His main research interests include power systems, electrical machines, smart grid, big data, development of fault-tolerant systems, reliability of power grids and electric machinery, fault detection, condition monitoring, and energy management systems. He is also a member of IET and the SGC-Q.