0% found this document useful (0 votes)

1 views

machine learning tree

This paper presents a distributed tree-based machine learning approach for short-term load forecasting using Apache Spark, focusing on optimizing resource utilization through a master-slave parallel computing paradigm. It introduces a concurrent job scheduling algorithm to enhance performance in a multi-energy data source environment, utilizing real-world data from one thousand distribution transformers in Spain. The methodology aims to balance accuracy and processing time, demonstrating significant improvements in load forecasting capabilities with the application of various tree-based ML algorithms.

Uploaded by

sumamarceline1963

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

1 views

machine learning tree

Uploaded by

sumamarceline1963

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Received March 22, 2021, accepted April 5, 2021, date of publication April 12, 2021, date of current version

April 20, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3072609

Distributed Tree-Based Machine Learning for

Short-Term Load Forecasting With Apache Spark
AMEEMA ZAINAB 1,2 , (Member, IEEE), ALI GHRAYEB 2 , (Fellow, IEEE),
HAITHAM ABU-RUB 2 , (Fellow, IEEE), SHADY S. REFAAT 2 , (Senior Member, IEEE),
AND OTHMANE BOUHALI3,4 , (Member, IEEE)
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
2 Department of Electrical and Computer Engineering, Texas A&M University at Qatar, Doha 5825, Qatar
3 Research Computing, Texas A&M University at Qatar, Doha 5825, Qatar
4 Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 5825, Qatar

Corresponding author: Ameema Zainab ([email protected])

This work was supported in part by the National Priorities Research Program (NPRP) Grant from the Qatar National Research Fund
(a member of Qatar Foundation) under Grant NPRP10-0101-170082, in part by the Internal Seed Grant from Texas A&M University at
Qatar, in part by IBERDROLA Qatar Science and Technology Park (QSTP) LLC, and in part by the Open Access Funding by the Qatar
National Library.

ABSTRACT Machine learning algorithms have been intensively applied to perform load forecasting to
obtain better accuracies as compared to traditional statistical methods. However, with the huge increase in
data size, sophisticated models have to be created which require big data platforms. Optimal and effective
use of the available computational resources can be attained by maximizing the effective utilization of the
cluster nodes. Parallel computing is demanded to allow for optimal resource utilization in dealing with
smart grid big data. In this paper, a master-slave parallel computing paradigm is utilized and experimented
with for load forecasting in a multi-AMI environment. The paper proposes a concurrent job scheduling
algorithm in a multi-energy data source environment using Apache Spark. An efficient resource utilization
strategy is proposed for submitting multiple Spark jobs to reduce job completion time. The optimal value
of clustering is used in this paper to cluster the data into groups to be able to reduce the computational
time additionally. Multiple tree-based machine learning algorithms are tested with parallel computation
to evaluate the performance with tunable parameters on a real-world dataset. One thousand distribution
transformers’ real data from Spain for three years are used to demonstrate the performance of the proposed
methodology with a trade-off between accuracy and processing time.

INDEX TERMS Apache spark, concurrent computing, load forecasting, parallel processing, resource
management.

I. INTRODUCTION decision-making processes and big data provides power in

With the development of the smart infrastructure in the elec- better decision making. For an hourly load forecast, if the
trical grids, the data collected from various units and loca- algorithm takes few hours to calibrate the hour-ahead fore-
tions over time have begun to receive the attention of grid casting model, then it is a big data problem [1]. The comput-
operators and research centers. Data centers usually collect ing time can not be longer than the lead time for any business
15-minutes to one-hour frequency of grid data, which creates application.
enormous amounts of data streams. The power grid opera- To manage big data and perform ML with big data, most of
tors are looking forward to creating data analytics solutions the researchers focused on one of the important challenges,
to benefit from these enormous amounts of collected data. i.e., handling large size of data stored historically in the
Processing large amounts of data and deriving insights will data centers [2]. Several ML algorithms were designed with
help in the purpose of knowledge discovery and better deci- the assumption that the entire dataset fits in the memory.
sion making. Machine learning (ML) techniques help in the This assumption negatively affects the ML algorithms while
impeding their performance. For instance, a support vector
2 and a train-

The associate editor coordinating the review of this manuscript and machine (SVM) has a space complexity of O m
ing complexity of O m3 [3], where m indicates the number

approving it for publication was Lin Zhang .

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
57372 VOLUME 9, 2021
A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

of samples. Traditional ML algorithms were built with the scheduling algorithm proposed to perform load forecast-
assumption that the data can fit into memory, but with the era ing on multiple datasets utilizing apache spark. Section IV
of big data, it becomes challenging for ML models to adapt to describes the experimental setup of the optimal scheduling
the deluge of data. Big data due to its nature of velocity also algorithm along with the results obtained by implement-
imposes the challenge of all the data being available during ing the proposed methodology on real big data. Section V
training. Therefore, as the size of the data increases, dis- concludes the paper.
tributed processing frameworks, parallel data structures, data
reuse, and data partitioning become important characteristics.
Resilient distributed datasets (RDDs) implemented in a Spark II. RELATED WORK
cluster computing framework exhibit in-memory characteris- Many papers have proposed benchmarking results with the
tics [4]. This leads to the use of the typical architecture which use of ML for load forecasting, but in this section, the essence
can accommodate both the cluster computing framework and of big data smart grid load forecasting using spark is outlined.
machine learning capabilities. To improve the performance of The widely installed smart meters collect huge amounts of
the big data machine learning algorithms, manipulations in load data for each of the grid’s distribution transformers.
terms of the way ML algorithms execute and the processing Many computing frameworks [6]–[9] have been developed
infrastructure is necessary. Among the various ML paradigms for the analysis of big data but, MapReduce [6] is the most
in big data, this paper focuses on tree-based methods and famous one because of its features of fault-tolerance, parallel
ensemble learning techniques. Splitting a deluge of data into computation, and flexibility. Apache Spark [10] proposed by
multiple datasets to perform training with the ML models the Zaharia et al. team emerged to overcome the drawbacks in
has gained significant improvement in the learning process MapReduce. It is an open-source framework and is 100 times
in terms of the big data context. For example, the authors faster than Hadoop MapReduce [11]. Spark can execute over
in [5] applied ensemble learning to subsamples of big data several cluster managers such as Hadoop YARN [12], Apache
improving learning accuracy and simultaneously decreased Mesos [13], and spark’s standalone scheduler. Spark can
the computation time. also interface with a variety of data storage repositories like
The multi AMI infrastructure mostly concentrates on fore- Hadoop Distributed File System (HDFS) [14], Hive [15],
casting the load of all the distribution transformers (DT’s) at Hbase [16], to name a few. However, spark supports dis-
the same time. In this paper, a novel scheduling technique tributed computing resulting in a communication overhead
with the help of the Apache spark platform is proposed to increase. Previous research has observed that by only increas-
short-long term forecast the load of all the one thousand ing the computational capability, JCT reduces but then starts
transformers simultaneously. The spark cluster submits big increasing in communication overhead [17]. Hence the pro-
data analytics tasks as spark jobs and the computational posed scheduling algorithm in this paper focuses to utilize the
resources are allocated optimally to these spark jobs. The available computation capability and still be able to submit
amount allocated to these jobs is customizable by the user multiple jobs with the help of an optimal scheduling algo-
which affects the Job Completion Time (JCT) significantly. rithm and not losing on communication overhead.
This paper utilizes ML algorithms such as Spark Random Highly cited algorithms for forecasting smart grid data
Forest and Spark Gradient boosted regression trees for train- include linear regression, SVM and its variants [18], and
ing and forecasting the load. The proposed method performs artificial neural networks (ANN) [19] [20]. A pooling-based
load forecasting by submitting multiple jobs concurrently on deep recurrent neural network (DRNN) was proposed to
the data sets utilizing the cluster resources optimally. learn the spatial information, which outperformed Support
The main contributions of this paper can be summarized as Vector regressor (SVR), Auto-Regressive Integrated Moving
follows: Average (ARIMA), and the classical deep recurrent neural
1) Proposing an optimal scheduling algorithm to perform network (RNN) [21]. In [22], Happy et al, proposed a statis-
load forecasting with parallel and distributed execution tical approach for load forecasting using quantile regression
in a multi-AMI environment on the smart grid big data. random forest, risk assessment index, and probability map.
2) Tuning the ML models to gain high accuracy along with In [23], a backpropagation approach was utilized to perform
measures to combat overfitting. short-term load forecasting utilizing weather data. In [24],
3) Testing the proposed methodology on all the one thou- Wei et al performed midterm load forecasting of power sup-
sand transformers’ data without grouping, then a com- ply unit (PSU) considered as a collection of distribution trans-
parison is made against the proposed grouping technique formers. The authors have utilized a dynamic based network
to show its merits. (DBN), with a peak load of all the distribution transformers
The performance of the proposed method is tested on real within a PSU summed. All of the summed load values are
big data from an industry partner Iberdrola to validate its utilized to train and forecast the load using sparks standalone
effect on ascended performance. cluster. However, the use of complete data to train instead
The paper is organized as follows. Section II dis- of summed load values can result in better training accura-
cusses the related research in the field of load forecasting cies but will require optimized scheduling method which is
using big data platforms. Section III describes the novel achieved in this paper.

VOLUME 9, 2021 57373

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

This paper focuses on an hourly day-ahead load forecast the time in the previous case (with n clusters) is much less
with the use of spark ML tree-based algorithms. The models than without clustering, provided the data size for each of
are trained with the spark.ml Application Programming Inter- the jobs in both cases is the same. The proposed parallel
face (API) of spark which is data frame based facilitating ML and sequential approach of the tree-based ensemble model
pipelines and feasible feature transformations [25]. is deployed on the spark. Fig. 2 is an illustration of the
employed master-slave parallel computing paradigm where a
III. PROPOSED LOAD FORECASTING METHODOLOGY single master and multiple slaves are used. To incorporate the
FOR OPTIMIZED COMPUTATION WITH APACHE SPARK proposed methodology parallelism in datastore and training
The Spark ML library supports tree-based models namely are discussed further.
spark ml decision trees and ensemble models namely spark
ml random forest and spark ml gradient boosted regression
trees [26]. Spark session connects to the master node to
submit jobs, where each job is split into stages, and stages
are further split into tasks. Adding more tasks to a single
job if possible is recommended as compared to starting new
jobs to avoid start-up costs. In the case of data from mul-
tiple transformers, each dataset can be assigned as a job.
To reduce the execution time of the load forecasting models,
multiple DTs’ load forecasting is performed simultaneously
with the help of parallel job submission in spark. Moreover,
the shortest job submitted may consume fewer resources as
compared to the other jobs submitted. To overcome this,
python’s thread pool concurrency feature in addition to the
spark fair scheduler can be used. A solution is to decompose
the complete dataset into a cluster of transformers IDs and
use multiple computing nodes to train the clustered model
with an added sequential step to test the model of each of FIGURE 2. Experimental setup of spark framework for load forecasting.
the transformers within the clusters. However, it is necessary
to train clustered models first and then test the individual
models within the clusters. This attempts to add multiple
layers of parallel processes executed sequentially as iterated A. DATASTORE PARALLELISM
in Fig. 1. For n clusters, the number of iterations to train the The big data of the transformers load values with the times-
clustered data is n/j, where j is the number of jobs submitted tamp is stored in the HDFS with replication factor 3. The
simultaneously. As the n clusters are accessed repeatedly by resulting load data partitions are constructed into RDDs and
the processes, the training data pertaining to n clusters is stored in the corresponding data nodes. The number of par-
cached into memory. Similarly for t transformers belonging titions is automatically set by spark as one partition for a
to a cluster, the number of iterations in total to test the block of file, however, the data is repartitioned to 20 which is
holdout data of each of the transformers (TF) is n ∗ (t/j). equal to the number of cores in each of the nodes using the
As the value of (n ∗ (t/j)) is larger than n/j in all cases of t, pyspark programming interface. Spark by default supports
Kyro serializer, that is almost 10x faster than the default java
serializer. Kyro serializer is a graph serialization framework
which is efficient and fast, and performs direct copy from
object to object rather than a transition phase of bytes in the
middle.

B. TRAINING PARALLELISM
The data from HDFS is read into the spark data frame for the
analysis. By using the data frame API only, all the physical
execution is compiled in native spark using Java Virtual
Machine (JVM), while only the logical plan is constructed in
pyspark [27]. The use of data frame API in pyspark results in
efficient execution as it avoids the creation of key-value pairs
that occur in scala. Data frames in spark are immutable like
RDD and are conceptually similar to a pandas data frame or a
relational database. However, the important difference is the
FIGURE 1. Nested parallelism with spark (sequential and parallel runs). execution of transformations and actions in spark. The spark’s

57374 VOLUME 9, 2021

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

FIGURE 3. Spark-based DAG visualization for random forest regressor.

catalyst optimizer creates an optimized logical plan before IV. OPTIMAL SCHEDULING ALGORITHM
sending an instruction to the spark driver. As the catalyst Scheduling jobs considering the available resources is chal-
optimizer functions are the same across all the language APIs, lenging. An optimal scheduling algorithm is necessary to
data frames provide equivalent performance to all the spark schedule jobs to be able to reduce the execution time. As per
API. Once a logical plan is created, it visualizes it as a the requirement of load prediction of multiple transformers
Directed Acyclic Graph (DAG) as shown in Fig. 3, and is at the same time, two scheduling algorithms are leveraged in
distributed among all the tasks in a job to be able to perform this paper. In this section, the solutions of optimal scheduling
each of the stages concurrently. when communication costs are ignored and when considered
Considering the merits of spark, it is used as the big data are discussed.
processing platform in our application for two of the main
computing tasks:
A. IGNORING COMMUNICATION COSTS
1) Average load matrices calculation: the elements of the
average load matrix consist of load averaged for 1 lag Considering w workers available and M jobs to be executed,
day, 7 lag days, etc. The data is inputted into the matrix three cases can be obtained where (w < M ) &(wx = M ),
calculation from the historical data stored in HDFS and (w < M ) &(wx < M ), and w ≥ M where x is a multiple
the computations are carried out in pyspark. of M resulting in wx = M . The algorithm in this section is
2) Simultaneous training of DTs’ load forecasting models structured as follows.
with the help of thread pools in python and multiple jobs Step 1: Submit the array of tasks to the w workers
in spark utilizing a fair scheduler. Step 2: w jobs are submitted to the available w workers

VOLUME 9, 2021 57375

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

Step 3: Whenever a processor becomes available, assign it D : data filtered as per meterID
the unexecuted ready job with the highest priority. tfs: transformer allocated to a cluster with clusterID
Submitting the jobs with the help of a thread pool as csv: an empty csv file to accumulate all the results
discussed in section III, w jobs are submitted at the same Initialize:
time. Considering the three cases, the algorithm flow can be def Cluster(clusterID):
elaborated for three of the cases as below: DclusterID = clusterArray [clusterID] ;
Case I : Cache DclusterID into memory for repeated access;
Create the train and holdout data from DclusterID ;
pool ω (train, [1, 2, . . . .w])
Perform ML modeling on grouped train data by
(w < M ) &(wx = M ) performing hyperparameter tuning;
Tm1 Tm2 Tm3 Tm4 ..Tmw Chose the hyperparameters with least error and store the
. model M clusterID ;
end def
xtimes
def forecast(n, Dt ):
. Create the train and holdout data from Dt ;
Tm1 Tm2 Tm3 Tm4 ..TMw Chose the hyperparameters with least error and store the
model M ;
Case II: Perform testing on holdout data with M clusterID ;
pool ω (train, [1, 2, . . . .w]) Use model M to predict the holdout data;
(w < M ) &(wx < M ) Test the accuracy of the predicted model;
Read the results in the csv file;
Tm1 Tm2 Tm3 Tm4 ..Tmw Update csv file with training accuracy and holdout
. dataset accuracy along with the meterID;
x − 1times end def
. Output:
csv: the accuracy of holdout data of all the T models.
Tm1 Tm2 Tm3 ..TmM −(x−1)w
Case III: 1. groupBy D with meterID and timestamp
2. Perform clustering on optimal value of k to obtain the
pool ω (train, [1, 2, . . . .w]) group of clusters as clusterlist
(w ≥ M ) 3. Split D into an array of dataframes based on the clus-
Tm1 Tm2 Tm3 Tm4 ..TMw terlist as clusterArray [D1 , D2 , . . . .Dn ] where n is the
number of clusters
where Tm is the time taken for individual job execution and 4. call pool.map function with cluster function and
is assumed to be the same. In case III, the computational clusterID as varibles;
capabilities are not as high most of the time when the number The function cluster is called n times in j batches
of jobs to be submitted is in the thousands. The total execution resulting in n/j iterations. If any processor is available
time in all three cases can be summarized in (1) as: the available job is assigned to the processor. The
results are updated simultaneously and any point is

 xT , where (w < M ) and (wx = M )

equal to the number of forecast processes completed
Ttotaltime = xT , where (w < M ) and &(wx < M ) (1)
execution;
T , where w ≥ M


5. Close pool;
Because of the way the number of concurrent jobs is sub- 6. Call join function after all the n/j iterations are com-
mitted, w workers are assigned for each step of parallel pleted;
runs. Although at the last step of execution wx < M still 7. Open a csv file to store results;
takes the same amount of time, as w workers are assigned to 8. for n:
perform the job. Algorithm 1 details the overall proposed load 9. tfs = clusterlist [n] ;
forecasting methodology based on the optimal scheduling 10. Split Dn into an array of dataframes based on the
algorithm discussed in section III. tfs as tfArray [D1 , D2 , . . . .Dt ] where t is the number
of transformers belonging to cluster n.
Algorithm 1 Proposed Optimal Scheduling Algorithm 11. call pool.map function for forecast function with n
Input: and Dt as varibles;
j : the number of batches The function forecast is called t times in x batches.
T : an array consisting of each of the meterID’s If any processor is available the available
w : number of workers (indicates the number of cores in job is assigned to the processor. The results are
an executor) updated simultaneously and any point is equal to the

57376 VOLUME 9, 2021

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

number of forecast processes completed execution; for the subset dataset Dm concerning the data subset m for
12. close pool; transformer level load forecasting.
13. call join function after all the t/j iterations
are completed; V. CASE STUDY
14. end for Firstly, the experimental setup is introduced in this section.
15. return csv Secondly, the performance of the proposed scheduling algo-
rithms is evaluated. Finally, the results are presented and
B. CONSIDERING COMMUNICATION COSTS discussed.
The main idea of this scheduling task is to augment the
scheduling with new precedence relations to be able to com- A. EXPERIMENTAL SETUP
pensate for the communication time. By clustering the jobs 1) CLUSTER CONFIGURATION
into C clusters and submitting them to the same worker, The apache-spark platform, where all the computations are
the overall communication between clusters will be mini- performed, consists of one master node and 5 slave nodes as
mized. If T̃ is the time taken by a cluster including the shown in Fig. 2. Each of the 5 compute nodes is Linux-based
communication costs, and y is a multiple of the total number and contains 24 physical CPU cores – 2 processor sockets
of clusters resulting in wy = C, yT̃ , is the time taken for all with 12 cores per socket – and 128GB of RAM. The inter-
the jobs where y < x. connect is comprised of the Cray Aries network, which is
employed both for MPI as well as for storage traffic [28].
C. OBJECTIVE FUNCTION Hadoop 2.8.0 and spark 3.0.0 are installed on both the master
This section attempts to create the theoretical functions for and slave nodes. The load forecasting algorithm is imple-
parallel and sequential training approaches and to propose mented in Python 3.6.4.
an implementation solution based on the spark platform. The
collected transformers power data is denoted as D where D1 , 2) DATA COLLECTION, STORAGE, & PREPROCESSING
D2 , D3 , . . . DM denote the data for meter m. The data Dm con- In used experiments, the dataset consists of load value and
sists of F features namely month, day, year, etc. Therefore, timestamp of 1000 transformers meters of the Iberdrola net-
the chunk of data for a meter ID can be expressed by Dm as work [29]. The data is split into 90% (Jan 2017 to Jun 2019)
following: of training and 10% (July 2019 to September 2019) holdout
dataset. The total dataset counts to around ∼24000000. The
Dm = [X1m , X2m , . . . ..XFm ] (2) data was collected from the utility company in an Optimized
[M [M [N m
D= Dm = Dm (3) Row Columnar (ORC) format and was stored in the HDFS
m=1 m=1 n=1 n storage on 5 data nodes and replicated 3 times. Currently,
where Xfm is the feature f of the chunk of the data for a meter spark supports timestamp input with the help of flint time
ID m; N m is the size of the mth dataset. This chunk of data is series as flint context and not flint session. As a consequence
trainable input to the machine learning model. Additionally, of this limitation, the timestamp is split into the year, month,
based on the data decomposition shown in (2), the mean day, and hour.
square error (MSE) for regression of the parallel training of Fig. 4 shows the power consumption pattern for all the
the ML model is represented as 3 years in the top left, data with large load values on the
1 XM top right, and the frequency of the load values in the bottom
RMSE OOB = min Jm left and bottom right graphs. It can be noted that the bot-
N m=1
1 XM XN m m tom left graph is right-skewed, and after log normalization,
= min J (4) the spread of the data is more diverse comparatively but still
N m=1 n=1 n
not normally distributed. The bottom right graph also has
And, the loss function Jnm of the sample n in data with a log+1 normalization as the data consists of load values of 0.
subset, m is given by (5) It can be noticed that the data is right-skewed.
s
1 In the short-term load forecasting (STLF) scenario of this
m
Jn = (5) work, the load value of 1000 DTs needs to be forecasted at
OOB X m 2
PN
N n=1 ym n − ŷn n the same time. The data profile of each of the DT ranges from
where J m in (6) is the loss function of the mth data set January 2017 to September 2019. Based on the above infor-
mation, an ideal load forecasting model for STLF requires
XN m
Jm = Jnm (6) 1) The time series of the historical data for the load profile.
n=1 2) The parameters of the trained ML model to accelerate
ym m
n and ŷn are the observed and the predicted load values, the load forecasting.
respectively, of sample n in data subset m; and N is the dimen- 3) The trained model of a single DT consisting of simple
sion of each of the output samples. The ML model training is parameters, yet accurate.
performed to minimize the RMSE OOB in (4) and obtain the 4) Model executed and realized efficiently on a parallel
trees using the dataset D. Similar procedures are performed processing platform i.e., apache-spark.

VOLUME 9, 2021 57377

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

To improve the performance in terms of execution time, total

time Ttotaltime is first measured by submitting individual jobs
and T̃totaltime by considering a cluster of jobs. The time is
compared in both cases to choose the methodology with the
lowest execution time and still retain the skewed distribution
of the multiple meters data.

3) SPARK OPTIMIZATION
Besides spark being an in-memory computing framework,
it runs on top of the Java Virtual machines (JVMs). Hence
tuning the JVM parameters is necessary to improve the per-
formance of spark. In this paper, the authors have identified
three key spark parameters that impact the utilization of
resources to reduce the workload execution time. The paper
has also focused on the right choice of parameters that impact
the memory serialization, data compression, caching, and
repartitioning of data. Compressing serialized RDDs helps in
saving substantial space at the expense of some extra CPU
FIGURE 4. Top left - Load distribution across all the three years (The time. Compression of RDD in shuffle operations has a great
vertical axis indicates the load value in kWh and the x axis indicates the
time stamp). Top right – Data with large load values greater than
advantage due to it random read/write and multiple times
1000 kWh (The vertical axis indicates the transformer id the data belongs read/write. Compression of spark RDD is achieved with the
to and the x axis indicates the load value in kWh. Bottom left – Frequency help of codec. Experiments are conducted considering: i)
of the load distribution limiting to 1000 kWh. Bottom right – Frequency of
log normalized load plus 1. various combinations of several executors, ii) the number of
cores per executor, and iii) the amount of memory for each
of the executors. If CO is the total number of cores in the
Spark tree models support both continuous and categorical configuration then
features, partitioning data by rows, and distributed training.
Algorithms available in spark ml are used for performance CO = E ∗ COperE (8)
comparison, which includes the spark decision trees (Spark where E is the total number of executors assigned and
DT) and tree ensembles i.e., spark parallelized random forests COperE is the number of cores assigned per executor in the
(Spark RF), and spark gradient boosted trees (Spark GBTs). spark configuration. The distribution of total memory in the
spark configuration is given as follows.
B. PERFORMANCE EVALUATION
1) AVERAGE RMSE MEM = (0.9 ∗ MEMperE ∗ E) + (0.1 ∗ MEMperE ∗ E)
The objective of future load consumption is to predict the (9)
load with high precision and speed to have near real-time
processing ability. Root mean square error (RMSE) is used as where MEMperE is the memory assigned per executor.
the error metric because of its wide use. To evaluate the pre- The second term in (9) is the overhead memory allocated
dictive performance, the training dataset is separated from the to each of the executors which accounts for virtual machine
holdout dataset (data never used for training). All the models overheads or other native overheads. The additional memory
are built on the training data and optimized to obtain as low and is usually chosen as either 10% of the executor memory
RMSEtrain as possible and predicted on the holdout dataset to or a minimum of 384MB by the spark cluster computing
note the RMSEholdout . Moreover, to evaluate the performance system [30]. Further, the MEMperE is divided into two frac-
on all the holdout datasets for different transformers, the tions, one for memory and the other for storage. The memory
average RMSE (ARMSE) is calculated as described below: fraction handles the data structures, out-of-memory error and
the storage fraction handles the cached blocks of data. The
1 XM values of CO and MEM can vary and are very specific to the
ARMSE = RMSE holdout 1 < i < M (7)
M i=1 cluster used for configuring spark. Choosing a larger value
The ARMSE shows how well the ML model learns the of E results in reducing the COperE to balance the CO.
data for all the distribution transformers. The reason for Similarly, choosing a larger value of E reduces the MEMperE
choosing ARMSE to have high average accuracy across all to balance the MEM .
the distribution transformers and not just one or a few.
C. RESULTS AND DISCUSSION
2) EXECUTION TIME In this section, the metrics discussed in the previous section
An important objective of choosing the proposed methodol- are evaluated on the dataset to showcase the benefits of the
ogy is to reduce the processing time of the transformer’s data. optimal scheduling algorithm. The ARMSE and the execution

57378 VOLUME 9, 2021

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

FIGURE 5. Performance evaluation. (a) shows the speedup for various cluster sizes for a concurrent job submission size of 18 and (b) presents the
speedup of increasing the number of jobs. A value of y = 93 is chosen for all the job submission values.

times are noted under experiments to determine the robust-

ness of load forecasting methodology using spark.

1) VALIDATION OF EXECUTION TIME

In this experiment, the proposed optimal scheduling method
is validated in terms of the training time and the forecasting
time. The total time T and T̃ are measured for both cases
of x and y number of jobs submitted. The proposed scheme
is tested for x using the K-means clustering algorithm to
group the data in order to obtain clusters with higher accuracy.
To validate the proposed method, various chunks of y values
are considered and compared against the time taken for x
number of chunks of data. For the given data as the value FIGURE 6. Performance benefit in terms of execution time with
scheduling that can be achieved by using thread pool.
of x is 1000, values ranging from 750 to 25 are chosen as
shown in Fig. 5(a). The speedup is calculated for the various
combinations by performing T /T̃ . As the value of y increases,
the size of the data is distributed among the y chunks which multiple times for various processing stages. As shown in
also affects the processing time for different sizes of y. For Fig. 5 (b), based on the right y-axis, the speedup is approxi-
varying values of y the speedup is increasing, indicating the mately increasing linearly up to a value close to a number of
time T̃ is always less than T for all the values of y. Choosing cores and makes it less linear after a value of 18-20. When
a lesser value of y and still not losing on speedup is recom- the number of jobs submitted increases above the threshold
mended, as in practice it will help in reducing the execution of a possible number of concurrent threads that can be sub-
time in cases of performing representative clustering. Hence mitted, the data transfer among the processes increases com-
the proposed optimal scheduling algorithm stated in section munication overhead which eventually increases the parallel
IV improves the performance by reducing the time to perform management overhead. Hence a trend of less linearity can be
the analysis. similarly, for a y value of 93, varying values observed clearly after a value of jobs = 18. Fig. 6, shows
of the thread pool are performed to analyze the speedup as the benefits in terms of scheduling that can be achieved by
shown in Fig. 5(b). The choice value of y = 93 is based on using thread pool. The experiments are performed y = 93 and
the kmedoids clustering approach proposed in [31]. Hence a randomly chosen group value of y = 193 to compare the
the value of y is chosen as 93 to group data closest to each performance. The execution time is lesser in both the cases
other, rather than choosing a random value of y. Compared as proposed for a value y < x. The training time for yT̃ , for
with a single job submission, the calculation time tends to y = 93 executes 2.26 times, and for Y = 193, 2 times faster
decrease gradually as the number of concurrent jobs sub- for a pool value of 18, as compared to without threadpool.
mitted increases based on the left axis in Fig. 5(b). Massive Additionally, the clustering time, training time, and testing
jobs are distributed across the slave nodes, which reduces the time for a value of k = 93 is estimated based on the clus-
computational load. The spark computing platform captures tering performed to choose the best cluster number value as
the intermediates results to memory resulting in the ineffi- shown in Fig. 7(a). The total execution time (includes the
ciency of iterative processing where each data frame is called training time of grouped clusters, testing time of individual

VOLUME 9, 2021 57379

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

FIGURE 7. Comparison of compute time at various stages of load forecasting. (a) Results obtained for the time taken to perform clustering,
training time and testing time on the holdout dataset for SLR(spark LR), SDT, SRF and SGBT. (b) The execution time involves clustering, training
of grouped data, testing on clustered data, training on individual transformers and testing on individual transformers for the spark ML models.

transformers with clustering, the training time of the individ- TABLE 1. Comparison of performance of ML model in terms of execution
time with previous works.
ual transformers, and the testing time for individual trans-
formers) for the 1000 models is shown in Fig. 7(b). The
time taken by the gradient boosted algorithm is the high-
est compared to the other spark ml algorithms. Although
both random forest and gradient boosted trees are ensemble
models, the random forest takes noticeably lesser time as
compared to random forest. Inference out of this observa-
tion is that gradient boosted is a boosting algorithm that
is quite sequential and is intended to take more execution
time whereas multiple trees in the random forest can be run
parallelly across the nodes to speed up the execution. The
times observed in Fig. 7(a) show the lowest training time for
spark decision tree regressor. It can be noted that the time
taken to perform testing is almost close to the training time.
This is evident from the proposed methodology which states
that performing analysis on grouped data is preferred over with the largest number of cores per executor shows the
individual transformers data. However, as the testing has to lowest run time as per the secondary y-axis in Fig. 8. As the
be performed on all the DT’s datasets, grouping cannot be job submission computes multiple jobs at the same time more
performed to reduce execution time. number of workers helps in the distribution of the jobs to
To compare the results with different previous works, more number of workers. However, choosing more E and less
the comparison has been done with datasets of similar sizes COperE is not expected to be efficient as the work will be
and computational capacities utilized have been documented. distributed across more executors resulting in larger transfer
The comparison has been done with methodologies that have of data across the executors. The choice of executors less
utilized distributed ML modeling with Apache Spark. The than 5 is not possible as the number of nodes in the configu-
comparison with previous works is presented in Table 1. It can ration is 5, each node consists of a total of 120GB memory.
be observed that although the proposed methodology consists Reducing the number of executors will result in each executor
of a dataset size of ∼24 million records is superior in terms of to contain more than 120GB which exceeds the threshold and
execution time as compared to previous works in performing is practically not possible. Hence a choice of 5 executors and
distributed machine learning on big data. 20 cores per executor is decided as an optimized combination
of the spark configuration. It is worth to mention also that
as the number of executors are increased, the MEMperE is
2) VALIDATION OF SPARK OPTIMIZATION reduced as it is distributed among the executors, to sum up
To validate the use of an optimal number of COperE, to MEM .
experiments are conducted based on various combinations Other than time, communication overhead and data trans-
of COperE and E which in turn affects the MEMperE. fer is also a concern in distributed computing. Other than
Fig. 8 displays the comparison of run-time for various combi- time, communication overhead and data transfer is also a
nations of executors and cores per executor. The combination concern in distributed computing. By increasing the depth of

57380 VOLUME 9, 2021

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

FIGURE 8. Run time comparison for various spark optimization FIGURE 9. ARMSE of training and holdout dataset for spark decision tree.
parameters. The spot above 820 nodes result in overfitting of the datasets.

TABLE 2. Performance of ML model in terms of RMSE and training time

to monitor the effect of deep networks.
tree or the number of nodes while training can be regulated.
An experiment is performed by increasing the depth of the
tree and the num of nodes in the trained model is monitored
(refer to Fig. 9). The x-axis shows the number of nodes and
the y-axis the performance measure in terms of RMSE. It can
be noted that the red line in Fig. 9, which indicates the training
RMSE, is decreasing with an increase in the number of nodes
by fitting the dataset onto the trees as deeper as possible.
Whereas, the blue line which denotes the holdout RMSE does
not show such a trend. After a point, the models’ performance
starts deteriorating. This point is called the sweet spot where
the models tend to start overfitting. Hence the depth of the
model has to be restricted below this point which indicates a
value of 820 nodes for a tree depth of 9 in Fig. 9. By taking
a random forest regressor (refer to Table 2), it is noticed that such a measure on the average RMSE of all the transformers
as each of the tree grows larger, after a max depth of 10, large data, a max depth of 8 is chosen to perform training on the
task transfer warning is shown by spark indicating that deep spark ML models.
models with large number of tree nodes are being transferred
across the tasks which results in more data transfer. Referring 4) VALIDATION OF ACCURACY
to Table 2, it can be noted that by increasing the depth of the This section discusses the results after considering the vali-
model, the training accuracy is reducing and the time taken is dation of accuracy, computation time, and overfitting in the
close to each other. However, after a max depth of 10, there previous sections. For each of the spark ml models, the per-
is a jump in the time and the time is gradually increasing. formance of the training dataset and holdout dataset are
This indicates that more amount of time is being utilized in compared. Table 3 reports the ARMSE computed for all the
transferring data, hence such a scenario has to be avoided 1000 datasets. Holdout ARMSE indicates the quality of load
during the execution or the spark parameters have to be tuned forecasting. Thus, a lower value of ARMSE indicates a better
further to accommodate large task binaries. load forecasting model. The table indicates that the values of
holdout ARMSE are the lowest for the spark random forest
3) OVERFITTING regression model. Referring back to Fig. 7 (b), the execution
Most of the machine learning models perform accurately time for the random forest is not the lowest but is comparable
post tuning of hyperparameters. However, excess tuning of to the spark DT model. The training RMSE is lowest for
parameters tends to fit the training data so accurately that the Gradient-Boosted Trees, but the lowest holdout RMSE is
model is overfitted. Once overfit, the models do not perform noted for the spark random forest algorithm. Even though the
as expected on the new forecasting dataset. To avoid such a random forest is an ensemble model, the execution time is
case many measures are taken to avoid overfitting in the train- not as large compared to other models. This is because the
ing data. Consideration of holdout data set which has never way spark performs its execution is that it utilizes its parallel
been used in the training is one of the measures to prevent computing capability to execute each of the decision trees
overfitting. In the case of tree ML models, the depth of the individually and gives back the result. The actual power of

VOLUME 9, 2021 57381

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

TABLE 3. Final ARMSE, for training and holdout dataset after choosing tuned parameters.

FIGURE 10. ARMSE comparison of training and holdout dataset for all the DT’s.

spark in terms of execution can be observed here. Thus, it processing. One distinctive characteristic of the proposed
can be concluded that the spark RF performs better than the methodology is to be able to submit a maximum number of
other spark ml models under comparison. jobs and to process all the jobs in parallel. Several experi-
Fig. 10 shows the plot of RMSE of all the distribution ments were performed to optimize the scheduling strategy
transformers under consideration. The red line indicates the in terms of ML model error and execution time. A large
RMSE(kWh) obtained for all the distribution transformers. number of DTs training procedures were performed with
The blue line indicated as the holdout RMSE (RMSE of the reduced run-times which allow handling big data that is too
data never used for training) is the forecasting error in kWh. large to be stored. The training time for a group of 93 clus-
To measure the quality of the trained models the holdout ters data with a data size of ∼24 million records was per-
RMSE is expected to be as close as possible to the train- formed in ∼50 sec and forecasting of 1000 transformers with
ing RMSE. From Fig. 10, it can be observed that the blue ∼2.4 million records took ∼57 sec. The total time includ-
line follows the red line for almost all the transformers. For ing, grouping, training, and forecasting was performed in
randomly chosen DTs, indexed as 0, 78, 208, 91, 13, 39, ∼450 sec. The other important achievement of this paper is
104, 1, 52 present the training RMSE and holdout RMSE 2 times faster execution time with the use of thread pool and
zoomed in the top right of the figure. The plots indicate that fair scheduler. This is a good optimization strategy for load
the forecasting accuracy follows the training accuracy closely forecasting using multi-sensor big datasets. Empirical eval-
attributing to the fact that the built ML model is quite robust uations significantly outperformed the previously proposed
in terms of performance while increasing the speedup when iterative algorithms. Moreover, the proposed ML models
a large number of jobs is performed. achieved higher accuracies. The merits shown in the exper-
iment indicated that there is great potential for the proposed
VI. CONCLUSION method to be used in big data load forecasting of multi AMI
In this paper, a smart scheduling algorithm to perform load environments.
forecasting on multiple DTs was proposed. The proposed As this work chooses the optimized cluster value of 93,
approach was implemented on Apache spark to not only the next plan is to conduct experiments to investigate the
deal with the challenges associated with computation time optimal cluster value utilizing the proposed approach while
while handling the big data but also to submit jobs using an using the spark platform. Also, scaling the dataset to more
optimized methodology in a parallel manner. The processed than 1000 DTs requires more than a minimum of 100 jobs
big data was partitioned into various chunks and cached to to be submitted. Scaling the size of the spark cluster to an
improve the performance in terms of storage and in-memory optimal value is a subject for future work.

57382 VOLUME 9, 2021

A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

ACKNOWLEDGMENT [19] S. S. Reddy and J. A. Momoh, ‘‘Short term electrical load forecasting using
The HPC (and/or scientific visualization) resources and ser- back propagation neural networks,’’ in Proc. North Amer. Power Symp.
(NAPS), Sep. 2014, pp. 1–6, doi: 10.1109/NAPS.2014.6965453.
vices used in this work were provided by the Research [20] S. S. Reddy, C.-M. Jung, and K. J. Seog, ‘‘Day-ahead electricity price fore-
Computing group in Texas A&M University at Qatar. casting using back propagation neural networks and weighted least square
Research Computing is funded by the Qatar Founda- technique,’’ Frontiers Energy, vol. 10, no. 1, pp. 105–113, Mar. 2016, doi:
10.1007/s11708-016-0393-y.
tion for Education, Science and Community Development [21] H. Shi, M. Xu, and R. Li, ‘‘Deep learning for household load forecasting—
(http://www.qf.org.qa). A novel pooling deep RNN,’’ IEEE Trans. Smart Grid, vol. 9, no. 5,
pp. 5271–5280, Sep. 2018, doi: 10.1109/TSG.2017.2686012.
REFERENCES [22] H. Aprillia, H.-T. Yang, and C.-M. Huang, ‘‘Statistical load forecast-
[1] P. Wang, B. Liu, and T. Hong, ‘‘Electric load forecasting with recency ing using optimal quantile regression random forest and risk assessment
effect: A big data approach,’’ Int. J. Forecasting, vol. 32, no. 3, index,’’ IEEE Trans. Smart Grid, vol. 12, no. 2, pp. 1467–1480, Mar. 2021,
pp. 585–597, Jul. 2016, doi: 10.1016/j.ijforecast.2015.09.006. doi: 10.1109/tsg.2020.3034194.
[2] A. L’Heureux, K. Grolinger, H. F. Elyamany, and [23] S. S. Reddy, ‘‘Bat algorithm-based back propagation approach for short-
M. A. M. Capretz, ‘‘Machine learning with big data: Challenges and term load forecasting considering weather factors,’’ Electr. Eng., vol. 100,
approaches,’’ IEEE Access, vol. 5, no. 1, pp. 7776–7797, 2017, doi: 10. no. 3, pp. 1297–1303, Sep. 2018, doi: 10.1007/s00202-017-0587-2.
1109/ACCESS.2017.2696365. [24] W. Jiang, H. Tang, L. Wu, H. Huang, and H. Qi, ‘‘Parallel processing
[3] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, ‘‘Core vector machines: of probabilistic models-based power supply unit mid-term load forecast-
Fast SVM training on very large data sets,’’ J. Mach. Learn. Res., vol. 6, ing with apache spark,’’ IEEE Access, vol. 7, pp. 7588–7598, 2019, doi:
pp. 363–392, Apr. 2005. 10.1109/ACCESS.2018.2890339.
[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, [25] Classification and Regression—Spark 3.0.1 Documentation. Accessed:
‘‘Spark: Cluster computing with working sets,’’ in 2nd USENIX Work. Hot Nov. 26, 2020. [Online]. Available: https://spark.apache.org/docs/latest/
Top. Cloud Comput. (HotCloud), vol. 10, 2010, p. 95. ml-classification-regression.html#decision-trees
[5] Y. Tang, Z. Xu, and Y. Zhuang, ‘‘Bayesian network structure learning from [26] X. Meng, J. Bradley, B. Yavuz, and E. Sparks, ‘‘MLlib: Machine learning
big data: A reservoir sampling based ensemble method,’’ in Proc. Int. Conf. in Apache Spark,’’ J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235–1241,
Database Syst. Adv. Appl., vol. 9645. Dallas, TX, USA, 2016, pp. 209–222, 2016.
doi: 10.1007/978-3-319-32055-7_18. [27] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng,
[6] J. Dean and S. Ghemawat, ‘‘MapReduce: Simplified data processing on T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, ‘‘Spark SQL: Rela-
large clusters,’’ Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008, tional data processing in spark,’’ in Proc. ACM SIGMOD Int. Conf. Man-
doi: 10.1145/1327452.1327492. age. Data, May 2015, pp. 1383–1394, doi: 10.1145/2723372.2742797.
[7] P. Mika, ‘‘Flink: Semantic Web technology for the extraction and analysis [28] TAMUQ Research Computing Policies–Research Computing @
of social networks,’’ J. Web Semantics, vol. 3, nos. 2–3, pp. 211–223, TAMUQ. Accessed: Jan. 11, 2021. [Online]. Available: https://rc-
Oct. 2005, doi: 10.1016/j.websem.2005.05.006. docs.qatar.tamu.edu/index.php/Main_Page
[8] A. Baldominos, E. Albacete, Y. Saez, and P. Isasi, ‘‘A scalable machine [29] STAR Project–Iberdrola. Accessed: Jan. 11, 2021. [Online]. Available:
learning online service for big data real-time analysis,’’ in Proc. IEEE https://www.iberdrola.com/about-us/lines-business/flagship-projects/star-
Symp. Comput. Intell. Big Data (CIBD), Orlando, FL, USA, Dec. 2014, project
pp. 1–8, doi: 10.1109/CIBD.2014.7011537. [30] The Apache Software Foundation. Spark Configuration. Accessed:
[9] Y. Zhang, S. Chen, Q. Wang, and G. Yu, ‘‘i2 MapReduce: Feb. 11, 2021. [Online]. Available: http://spark.apache.org/docs/1.2.1/ec2-
Incremental mapreduce for mining evolving big data,’’ IEEE Trans. scripts.html
Knowl. Data Eng., vol. 27, no. 7, pp. 1906–1919, Jul. 2015, doi: [31] D. Syed, H. Abu-Rub, A. Ghrayeb, S. S. Refaat, M. Houchati,
10.1109/TKDE.2015.2397438. O. Bouhali, and S. Banales, ‘‘Deep learning-based short-term load
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, forecasting approach in smart grid with clustering and consumption
M. J. Franklin, S. Shenker, and I. Stoica, ‘‘Resilient distributed datasets: A pattern recognition,’’ IEEE Access, early access, Apr. 8, 2021, doi:
fault-tolerant abstraction for in-memory cluster computing,’’ in Proc. 9th 10.1109/ACCESS.2021.3071654.
USENIX Symp. Networked Syst. Design Implement., San Jose, CA, USA, [32] D. Syed, S. S. Refaat, and H. Abu-Rub, ‘‘Performance evalua-
2012, pp. 15–28. tion of distributed machine learning for load forecasting in smart
[11] N. Bharill, A. Tiwari, and A. Malviya, ‘‘Fuzzy based scalable grids,’’ in Proc. Cybern. Informat. (K&I), Jan. 2020, pp. 1–6, doi:
clustering algorithms for handling big data using apache spark,’’ 10.1109/KI48306.2020.9039797.
IEEE Trans. Big Data, vol. 2, no. 4, pp. 339–352, Dec. 2016, doi: [33] Y. Xu, H. Liu, and Z. Long, ‘‘A distributed computing frame-
10.1109/tbdata.2016.2622288. work for wind speed big data forecasting on apache spark,’’ Sustain.
[12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, Energy Technol. Assessments, vol. 37, Feb. 2020, Art. no. 100582, doi:
R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, 10.1016/j.seta.2019.100582.
O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, ‘‘Apache [34] A. Zainab, D. Syed, A. Ghrayeb, H. Abu-Rub, S. S. Refaat, M. Houchati,
Hadoop YARN: Yet another resource negotiator,’’ in Proc. 4th Annu. O. Bouhali, and S. Banales Lopez, ‘‘A multiprocessing-based sensitivity
Symp. Cloud Comput., Santa Clara, CA, USA, Oct. 2013, pp. 1–16, analysis of machine learning algorithms for load forecasting of electric
doi: 10.1145/2523616.2523633. power distribution system,’’ IEEE Access, vol. 9, pp. 31684–31694, 2021,
[13] B. Hindman, A Konwinski, M Zaharia, A Ghodsi, A. D. Joseph, R. H. Katz, doi: 10.1109/ACCESS.2021.3059730.
S. Shenker, and I. Stoica, ‘‘Mesos: A platform for fine-grained resource
sharing in the data center,’’ in Proc. NSDI, 2011, vol. 11, no. 2011,
pp. 295–308. AMEEMA ZAINAB (Member, IEEE) received the
[14] T. White, Hadoop: The Definitive Guide, 3rd ed. Sebastopol, CA, USA: bachelor’s degree in electronics and communica-
O’Reilly Media, 2012. tion engineering from Osmania University, Hyder-
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, abad, India, in 2013, and the M.S. degree in data
P. S. Wyckoff, and R. Murthy, ‘‘Hive: A warehousing solution over a map- science and engineering from Hamad Bin Khalifa
reduce framework,’’ Proc. VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, University (HBKU), Qatar. She is currently pursu-
2009.
ing the Ph.D. degree in electrical engineering with
[16] L. George, HBase: The Definitive Guide: Random Access to Your Planet-
Texas A&M University (TAMU), College Station,
Sized Data, 1st ed. Sebastopol, CA, USA: O’Reilly Media, 2011.
[17] Z. Hu, D. Li, and D. Guo, ‘‘Balance resource allocation for spark jobs based
TX, USA. She has three years of industry expe-
on prediction of the optimal resource,’’ Tsinghua Sci. Technol., vol. 25, rience, working as a Data Analytics Professional,
no. 4, pp. 487–497, Aug. 2020, doi: 10.26599/TST.2019.9010054. supporting audit at Deloitte Touche LLP, Hyderabad. She is also a base
[18] R. E. Edwards, J. New, and L. E. Parker, ‘‘Predicting future hourly residen- SAS Certified Programmer. Her research interests include data science, big
tial electrical consumption: A machine learning case study,’’ Energy Build- data machine learning, power forecasting, and big data management in the
ings, vol. 49, pp. 591–603, Jun. 2012, doi: 10.1016/j.enbuild.2012.03.010. smart grids.
VOLUME 9, 2021 57383
A. Zainab et al.: Distributed Tree-Based ML for Short-Term Load Forecasting With Apache Spark

ALI GHRAYEB (Fellow, IEEE) received the Ph.D. SHADY S. REFAAT (Senior Member, IEEE)
degree in electrical engineering from The Uni- received the B.A.Sc., M.A.Sc., and Ph.D. degrees
versity of Arizona, Tucson, AZ, USA, in 2000. in EE from Cairo University, Giza, Egypt, in 2002,
He was a Professor with the Department of Electri- 2007, and 2013, respectively. For more than
cal and Computer Engineering, Concordia Univer- 12 years, he has worked in the industry as an Engi-
sity, Montreal, Canada. He is currently a Professor neering Team Leader, a Senior EE, and an Electri-
with the Department of Electrical and Computer cal Design Engineer. He is currently an Associate
Engineering, Texas A&M University at Qatar. Research Scientist with the Department of ECEN,
His research interests include wireless and mobile TAMU-Q. He has published more than 100 journal
communications, physical layer security, massive and conference papers. His main research interests
MIMO, and visible light communications. He served as an instructor or include power systems, electrical machines, smart grid, big data, devel-
a co-instructor in technical tutorials at several major IEEE conferences. opment of fault-tolerant systems, reliability of power grids and electric
He served as the Executive Chair for the 2016 IEEE WCNC Conference. machinery, fault detection, condition monitoring, and energy management
He has served on the editorial board of several IEEE and non-IEEE journals. systems. He is also a member of IET and the SGC-Q.

HAITHAM ABU-RUB (Fellow, IEEE) received

two Ph.D. degrees. He has been with many uni-
versities in many countries, including Poland,
Palestine, USA, Germany, and Qatar. Since 2006,
he has been with Texas A&M University at Qatar
(TAMU-Q). He is currently a Full Professor of OTHMANE BOUHALI (Member, IEEE) is a
electrical engineering (EE). He is also the Manag- currently a Research Professor of physics with
ing Director of the Smart Grid Center-Extension in TAMU-Q. He is also the Founder and the Director
Qatar (SGC-Q). He has supervised many research of the TAMU-Q Advanced Scientific Computing
projects on the smart grid and renewable energy Center. He has been involved in the Large Hadron
systems. He has published more than 400 journal and conference papers, five Collider research program for more than 25 years
books, and five book chapters. His principal research interests include smart and has supervised various research projects. His
grid, power electronic converters, renewable energy, and electric drives. research interests include large scale modeling,
He was a recipient of the American Fulbright Scholarship, the German high-performance computing, and detector tech-
Alexander von Humboldt Fellowship, and many national and international nologies for radiation and medical physics.
awards and recognitions.

57384 VOLUME 9, 2021

Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
Research Article: Load Forecasting Method Based On Improved Deep Learning in Cloud Computing Environment
No ratings yet
Research Article: Load Forecasting Method Based On Improved Deep Learning in Cloud Computing Environment
11 pages
E3sconf Icmed-Icmpc2023 01048
No ratings yet
E3sconf Icmed-Icmpc2023 01048
9 pages
1-s2.0-S0306261924023572-main
No ratings yet
1-s2.0-S0306261924023572-main
12 pages
Forecasting of Power Demands Using Deep Learning
No ratings yet
Forecasting of Power Demands Using Deep Learning
11 pages
Performance Comparison of Simple Regression Random Forest and XGBoost Algorithms For Forecasting Electricity Demand
No ratings yet
Performance Comparison of Simple Regression Random Forest and XGBoost Algorithms For Forecasting Electricity Demand
7 pages
Abstract
No ratings yet
Abstract
6 pages
Power Plays: Unleashing Machine Learning Magic in Smart Grids
No ratings yet
Power Plays: Unleashing Machine Learning Magic in Smart Grids
16 pages
1 s2.0 S2352484723000653 Main
No ratings yet
1 s2.0 S2352484723000653 Main
8 pages
Smart Grid Management System Based on Machine Lear (1)
No ratings yet
Smart Grid Management System Based on Machine Lear (1)
7 pages
Energy Demand Forecasting Using Deep Learning Applications for the French Grid
No ratings yet
Energy Demand Forecasting Using Deep Learning Applications for the French Grid
15 pages
sustainability-15-15055
No ratings yet
sustainability-15-15055
29 pages
Energies Transformer Load
No ratings yet
Energies Transformer Load
23 pages
DrRKNayak-Article-IEEE-AnOverviewofDeepLearninginSmartGrids
No ratings yet
DrRKNayak-Article-IEEE-AnOverviewofDeepLearninginSmartGrids
5 pages
2024 Review on Smart Grid Load Forecasting for Smart Energy Management Using
No ratings yet
2024 Review on Smart Grid Load Forecasting for Smart Energy Management Using
17 pages
ETASR_8304
No ratings yet
ETASR_8304
6 pages
Mishra_2022_J._Phys.__Conf._Ser._2161_012068
No ratings yet
Mishra_2022_J._Phys.__Conf._Ser._2161_012068
13 pages
Machine Learning Applications in Power Systems
No ratings yet
Machine Learning Applications in Power Systems
5 pages
Smart Power Consumption Forecast Model With Optimized Weighted Average Ensemble
No ratings yet
Smart Power Consumption Forecast Model With Optimized Weighted Average Ensemble
15 pages
energies-14-07788-v2
No ratings yet
energies-14-07788-v2
14 pages
Application of Big Data Analytics Pertaining To Power System Security
No ratings yet
Application of Big Data Analytics Pertaining To Power System Security
6 pages
Energies 15 04427 v2
No ratings yet
Energies 15 04427 v2
31 pages
Short-Term_Load_Forecasting_with_Temporal_Fusion_Transformers_for_Power_Distribution_Networks
No ratings yet
Short-Term_Load_Forecasting_with_Temporal_Fusion_Transformers_for_Power_Distribution_Networks
5 pages
Vol 28
No ratings yet
Vol 28
11 pages
A Multi-Scale Time-Series Dataset With Benchmark For Machine Learning in Decarbonized Energy Grids
No ratings yet
A Multi-Scale Time-Series Dataset With Benchmark For Machine Learning in Decarbonized Energy Grids
18 pages
Conference Template A4
No ratings yet
Conference Template A4
11 pages
Short Term Load Forcasting
No ratings yet
Short Term Load Forcasting
21 pages
On Short Term Load Forecasting Using Mac
No ratings yet
On Short Term Load Forecasting Using Mac
22 pages
Short-Term Load Forecasting Using Smart Meter Data
No ratings yet
Short-Term Load Forecasting Using Smart Meter Data
22 pages
UGC List of Approved Journals
No ratings yet
UGC List of Approved Journals
4 pages
Privacy_Constrained_Load_Profiling_using_Smart_Meter_Data
No ratings yet
Privacy_Constrained_Load_Profiling_using_Smart_Meter_Data
6 pages
Energies 16 01434
No ratings yet
Energies 16 01434
21 pages
1 s2.0 S2772671123001882 Main
No ratings yet
1 s2.0 S2772671123001882 Main
13 pages
Data Mining in Smart Grids
No ratings yet
Data Mining in Smart Grids
118 pages
Supervised Machine Learning Techniques For Short-Term Load Foreca
No ratings yet
Supervised Machine Learning Techniques For Short-Term Load Foreca
94 pages
OceanofPDF.com AI and Blockchain in Smart Grids - Amit Kumar Tyagi
No ratings yet
OceanofPDF.com AI and Blockchain in Smart Grids - Amit Kumar Tyagi
681 pages
energies-16-01480
No ratings yet
energies-16-01480
33 pages
An Integrated Gaussian Process Modeling Framework For Residential Load Prediction
No ratings yet
An Integrated Gaussian Process Modeling Framework For Residential Load Prediction
11 pages
MY Thesis
No ratings yet
MY Thesis
39 pages
Probabilistic Electric Load Forecasting Through Bayesian Mixture Density Networks
No ratings yet
Probabilistic Electric Load Forecasting Through Bayesian Mixture Density Networks
31 pages
ref-30
No ratings yet
ref-30
28 pages
Deep Neural Networks for Energy Load Forecasting
No ratings yet
Deep Neural Networks for Energy Load Forecasting
6 pages
03 22 1-S2.0-S1364032122000569-Main
No ratings yet
03 22 1-S2.0-S1364032122000569-Main
35 pages
Short-Term Load Forecasting of An Interconnected Grid by Using Neural Network
No ratings yet
Short-Term Load Forecasting of An Interconnected Grid by Using Neural Network
10 pages
Powernet: A Smart Energy Forecasting Architecture Based On Neural Networks
No ratings yet
Powernet: A Smart Energy Forecasting Architecture Based On Neural Networks
10 pages
Supporting Future Electrical Utilities - Infotech
No ratings yet
Supporting Future Electrical Utilities - Infotech
6 pages
Energies 17 05524
No ratings yet
Energies 17 05524
27 pages
Dynamic Feedback Neuro-Evolutionary Networks For Forecasting The Highly Fluctuating Electrical Loads
No ratings yet
Dynamic Feedback Neuro-Evolutionary Networks For Forecasting The Highly Fluctuating Electrical Loads
18 pages
Term Paper Presentation M. Tafseer-23101118 (MSDS-Fall-2024) (3)
No ratings yet
Term Paper Presentation M. Tafseer-23101118 (MSDS-Fall-2024) (3)
17 pages
Machine and Deep Learning Approaches For Forecasting Electricity Price and Energy Load Assessment On Real Datasets
No ratings yet
Machine and Deep Learning Approaches For Forecasting Electricity Price and Energy Load Assessment On Real Datasets
18 pages
Wang 2022 J. Phys. Conf. Ser. 2378 012068
No ratings yet
Wang 2022 J. Phys. Conf. Ser. 2378 012068
6 pages
Concurrency and Computation - 2020 - Li - An effective deep learning neural network model for short‐term load forecasting
No ratings yet
Concurrency and Computation - 2020 - Li - An effective deep learning neural network model for short‐term load forecasting
10 pages
Deep Neural Networks For Short-Term Load Forecasting in ERCOT System
No ratings yet
Deep Neural Networks For Short-Term Load Forecasting in ERCOT System
6 pages
A Pyramid-CNN Based Deep Learning Model For Power Load Forecasting of Similar-Profile Energy Customers Based On Clustering
No ratings yet
A Pyramid-CNN Based Deep Learning Model For Power Load Forecasting of Similar-Profile Energy Customers Based On Clustering
12 pages
Energies 11 03283 PDF
No ratings yet
Energies 11 03283 PDF
20 pages
Residential Energy Consumption Forecasting Using Deep Learning Models
No ratings yet
Residential Energy Consumption Forecasting Using Deep Learning Models
14 pages
IET Generation Trans Dist - 2019 - Tang - Short‐term power load forecasting based on multi‐layer bidirectional recurrent
No ratings yet
IET Generation Trans Dist - 2019 - Tang - Short‐term power load forecasting based on multi‐layer bidirectional recurrent
8 pages
Tony Kiema 3116999147 Final It Report
No ratings yet
Tony Kiema 3116999147 Final It Report
6 pages
Load Forecasting
No ratings yet
Load Forecasting
26 pages
Artificial Intelligence To Enhance Energy Management and Distribution in Smart Grid Communication Networks
No ratings yet
Artificial Intelligence To Enhance Energy Management and Distribution in Smart Grid Communication Networks
13 pages
KRA-eTIMS-OSCU-VSCU-Technical-Guide
No ratings yet
KRA-eTIMS-OSCU-VSCU-Technical-Guide
8 pages
Network Diagram
No ratings yet
Network Diagram
10 pages
Decathlon & Under Armour
No ratings yet
Decathlon & Under Armour
16 pages
DL Seismic FWI
No ratings yet
DL Seismic FWI
57 pages
Olympus Cv180 User Manual-Cap2 Nomenclature and Functions
No ratings yet
Olympus Cv180 User Manual-Cap2 Nomenclature and Functions
19 pages
A Cruel Angels Thesis Spanish
100% (2)
A Cruel Angels Thesis Spanish
6 pages
SW Hart Selection Program v212 (2) (1) 1 (2)
No ratings yet
SW Hart Selection Program v212 (2) (1) 1 (2)
55 pages
Fuzzy Logic Based Control of Variable Wind Energy System: Trisha Bora Prateekee Chatterjee Saradindu Ghosh
No ratings yet
Fuzzy Logic Based Control of Variable Wind Energy System: Trisha Bora Prateekee Chatterjee Saradindu Ghosh
5 pages
Cued
No ratings yet
Cued
1 page
Parrot Analytics - The Global TV Demand Report Q2 2018
No ratings yet
Parrot Analytics - The Global TV Demand Report Q2 2018
55 pages
HashedIn_DevOps_JD_24-25
No ratings yet
HashedIn_DevOps_JD_24-25
3 pages
Agile Project Manager JD - 20223101
No ratings yet
Agile Project Manager JD - 20223101
3 pages
SDA Lab 4
No ratings yet
SDA Lab 4
17 pages
MGT Eva)
No ratings yet
MGT Eva)
2 pages
Marketing For Architects
No ratings yet
Marketing For Architects
13 pages
Indirect Feedback Compensation of CMOS Op-Amps: Vishal Saxena and R. Jacob Baker
No ratings yet
Indirect Feedback Compensation of CMOS Op-Amps: Vishal Saxena and R. Jacob Baker
2 pages
Installation Instructions: Sunbeacon Ii Series
No ratings yet
Installation Instructions: Sunbeacon Ii Series
9 pages
PParts For Developers Created 11 - 2015 BVO
No ratings yet
PParts For Developers Created 11 - 2015 BVO
17 pages
Ansoff Matrix - Overview, Strategies and Practical Examples
No ratings yet
Ansoff Matrix - Overview, Strategies and Practical Examples
1 page
L7 Uses of Artificial Intelligence Neural Netwroks
No ratings yet
L7 Uses of Artificial Intelligence Neural Netwroks
11 pages
CRC Handbook of Modern Telecommunications, Second Edition (101-199)
No ratings yet
CRC Handbook of Modern Telecommunications, Second Edition (101-199)
99 pages
Cot Powerpoint English 5-Types of Viewing Materials 1
100% (1)
Cot Powerpoint English 5-Types of Viewing Materials 1
137 pages
Chap4-Loaders-And-Linkers
No ratings yet
Chap4-Loaders-And-Linkers
25 pages
St2 6 Mathematics q4
No ratings yet
St2 6 Mathematics q4
4 pages
Lab Manual CS601 (Lab 12)
No ratings yet
Lab Manual CS601 (Lab 12)
6 pages
COA Part-I
No ratings yet
COA Part-I
53 pages
Why Do You Want To Be A Part of Experience@Singapore?: Entrepôt Hong Kong South Korea Taiwan Four Asian Tigers
No ratings yet
Why Do You Want To Be A Part of Experience@Singapore?: Entrepôt Hong Kong South Korea Taiwan Four Asian Tigers
4 pages
ErwinCoumans ExploringMLCPSolversAndFeatherstone
No ratings yet
ErwinCoumans ExploringMLCPSolversAndFeatherstone
85 pages
Lecture-47 (Exception Handling)
No ratings yet
Lecture-47 (Exception Handling)
32 pages

machine learning tree

Uploaded by

machine learning tree

Uploaded by

Received March 22, 2021, accepted April 5, 2021, date of publication April 12, 2021, date of current version

April 20, 2021.

Distributed Tree-Based Machine Learning for

Corresponding author: Ameema Zainab ([email protected])

I. INTRODUCTION decision-making processes and big data provides power in

VOLUME 9, 2021 57373

57374 VOLUME 9, 2021

FIGURE 3. Spark-based DAG visualization for random forest regressor.

VOLUME 9, 2021 57375

57376 VOLUME 9, 2021

VOLUME 9, 2021 57377

To improve the performance in terms of execution time, total

57378 VOLUME 9, 2021

times are noted under experiments to determine the robust-

1) VALIDATION OF EXECUTION TIME

VOLUME 9, 2021 57379

57380 VOLUME 9, 2021

TABLE 2. Performance of ML model in terms of RMSE and training time

VOLUME 9, 2021 57381

57382 VOLUME 9, 2021

HAITHAM ABU-RUB (Fellow, IEEE) received

57384 VOLUME 9, 2021

You might also like