Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

Laouni Djafri
Ibn Khaldoun University, Tiaret, Algeria and
EEDIS laboratory, Djillali Liabes University, Sidi Bel Abbes, Algeria

Received 13 June 2021
Revised 23 November 2021
Accepted 23 November 2021
Abstract
Purpose – This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach – In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies’ hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, we now speak of Big Data mining. For this reason, the authors’ proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem the authors raise in this work is thus how machine learning algorithms can be made to work in a distributed and parallel way at the same time without losing the accuracy of the classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture the authors designed is specially directed at handling big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2). The experimental results show the efficiency of the authors’ solution without significant loss of classification quality. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings – The authors got very satisfactory classification results.
Originality/value – DDPML system is specially designed to smoothly handle big data mining classification.
Keywords Big data mining, Statistical sampling, Map-reduce, Machine learning, Distributed and parallel
processing, Big data platforms
Paper type Research paper
1. Introduction
Since the advent of the Internet to this day, we have seen explosive growth in the volume,
velocity and variety of data created daily (Sathyaraj et al., 2020); this amount of data is
generated by a variety of methods such as click stream data, financial transaction data, log
files generated by web or mobile applications, sensor data from the Internet of things (IoT), in-game player activity and telemetry from connected devices, and many other methods
(O’Donovan et al., 2015; Hariri et al., 2019). This data is commonly referred to as “Big Data”
because of its volume, the velocity with which it arrives and the variety of forms it takes. In 2001, Gartner proposed a three-dimensional, or 3 Vs (volume, variety and velocity), view of the challenges and opportunities associated with data growth (Chen et al., 2014). In 2012, Gartner updated this report as follows: big data is high-volume, high-velocity and/or high-variety information resources that require new forms of processing to improve decision-making (Erl et al., 2016). Oftentimes, these Vs are supplemented by a fourth V, veracity: how accurate is the data? (Chan, 2013; Roos et al., 2013). We can extend this model to Big Data dimensions of over ten Vs: volume, variety, velocity, veracity, value, variability, validity, volatility, viability and viscosity (Hariri et al., 2019; Khan et al., 2018; Kayyali et al., 2013; Katal et al., 2013; Ferguson, 2013; Ripon and Arif, 2016; IBM, 2014; Elgendy and Elragal, 2014). Accordingly, the increasing digitization of our activities, the ever-increasing ability to store digital data and the accumulation of information of all kinds are
generating a new sector of activity aimed at analyzing these large amounts of data. This
leads to the emergence of new approaches, new methods, new knowledge, and ultimately,
undoubtedly, new ways of thinking and acting. Hence, this very large amount of data
must be exploited in order to better understand big data and how to extract knowledge
from it; this is known as big data mining (Cen et al., 2019; Dunren et al., 2013). Its main
purpose is to extract and retrieve desired information or patterns from a large amount of
data (Oussous et al., 2017). It is usually performed on a large amount of structured or
unstructured data using a combination of techniques that make it possible to explore
these large amounts of data, automatically or semi-automatically (Xindong et al., 2014;
Xingquan and Ian, 2007).
Every second, we see massive amounts of data growing exponentially, and this huge amount of data has no value unless we extract the information and knowledge it contains, or simply what we call big data mining. The real problems that we currently face in big data mining are: how do we deal with this huge amount of data (volume)? How do we get the results in the shortest time (velocity)? And can we maintain or improve the precision (veracity and validity) of the results after reducing the size? These and other questions will be discussed in this article. But before we answer them, we must understand very well the close relationship between these four characteristics (volume, veracity, validity and velocity). We know very well that if the size (volume) is large, the precision (veracity and validity) will be high and the speed (velocity) will be low; if the size is small, the speed will be high and the precision may be low. Therefore, our goal in this work is to reduce the size as far as possible in order to increase the speed to the maximum extent possible. This speed will increase even more if we use platforms and architectures prepared for this purpose, provided that the precision of the results obtained is taken into account.
First and foremost, if we want to reduce the volume of Big Data in a scientifically correct way, we naturally think of mathematical methods, especially mathematical statistics (Che et al., 2013; Trovati and Bessis, 2015; Urrehman et al., 2016). So what are the effective mathematical statistics methods that we must apply in such cases to obtain very satisfactory results? On the other hand, if we want to speed up the processing, we consider the application of parallel and distributed computing (Lu and Zhan, 2020; Concurrency-Computat:Pract.Exper, 2016; Brown et al., 2020) supported by big data solutions (Jun et al., 2019; Zhang et al., 2017; Palanisamy and Thirunavukarasu, 2019).
Today, big data mining relies mainly on statistical methods to overcome and control data so that we can handle them comfortably. Thus, mathematical statistics is of prime importance in data science and particularly in big data analytics (Bucchianico et al., 2019; Weihs and Ickstadt, 2018; HLG-BAS, 2011). The reason may lie in its function: mathematical statistics reveals correlations between statistical groups, and this function is pivotal for reducing size in order to better understand the data, and thus extract information with greater precision and speed. As a result, statisticians and data science experts have turned to statistical sampling techniques (Rojas et al., 2017; Liu and Zhang, 2020; Mahmud et al., 2020).
In the Big Data analytics context, we often work with small-scale sets (sub-datasets) that are part of the original dataset. For this reason, we mainly use mathematical statistics. In mathematical statistics, a population usually contains too many individuals to study them all properly, so a survey is often limited to taking one or more samples. A well-chosen sample will contain most of the information about a particular population parameter without requiring study of the community as a whole; this process is called sampling (Xindong et al., 2014; Berndt, 2020; Turner, 2020). The goal is to generalize the results from the sample to the population (Singh and Masuku, 2014; Den-Broeck et al., 2013). Therefore, we must emphasize the importance of a good choice of sample elements to make them representative of our population. A sample is said to be representative when it represents the original dataset as faithfully as possible, by virtue of its characteristics and quantity (Singh and Masuku, 2014; Andrade, 2020; Lee et al., 2020). There are several sampling methods, both probabilistic and non-probabilistic (Berndt, 2020; Etikan and Bala, 2017). In probability sampling, the first important point is that each individual of the population must have a known nonzero chance of selection, though the chances need not be equal. We also want the selection to be done independently; in other words, the selection of one individual does not affect the chance that other individuals will be chosen. We achieve this by selecting through a process in which only chance acts, such as flipping one or more coins, usually using a set of random numbers (Turner, 2020; Taherdoost, 2016; Robbins et al., 2020). A sample chosen in this way is called a random sample (West, 2016). The word “random” does not describe the sample as such, but the way in which it is selected (Brechon, 2015; Bhardwaj, 2019). If a sampling unit can be selected more than once, because it is placed back in the population before selecting the next unit, this is called “random sampling with replacement”; if a sampling unit can be selected only once, i.e. it is not replaced, it is called “random sampling without replacement” (West, 2016; Antal and Tille, 2011). One of the most important methods of probability sampling is stratified sampling (Yadav and Tailor, 2020), and we will use it in our work. This type of sampling divides the population into non-overlapping subpopulations called “strata” (Howell et al., 2020); this division works according to certain characteristics, so that the units of a stratum are as similar as possible (KSteven, 2012). Although one stratum may differ significantly from another, a stratified sample with the required number of units from each population stratum tends to be “representative” of the population as a whole. Stratified sampling is unlikely to produce an absurd sample because it guarantees the relative presence of all the different subgroups that make up the population (Etikan and Bala, 2017; Padilla et al., 2017). Non-probability sampling, by contrast, is generally based on subjective ideas. In other words, the statistical sample is selected based on personal estimation rather than random selection, and this type of sampling does not guarantee equal opportunity for every object of the target population (Iachan et al., 2019; Gravetter and Forzano, 2012; Moorley and Shorten, 2014).
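To make the stratified selection concrete, here is a minimal sketch in Python of stratified random sampling over a labeled dataset; it is our illustration, not the paper's code, and the column name label, the 10% fraction and the fixed seed are assumptions.

import pandas as pd

# Minimal sketch of stratified random sampling with pandas.
# Assumptions (not from the paper): the class column is named "label",
# the sampling fraction is 10% and the seed is fixed for repeatability.
def stratified_sample(df, frac, replace=False, seed=42):
    # Draw `frac` of each stratum so every class keeps its relative
    # presence in the sample, with or without replacement.
    return (df.groupby("label", group_keys=False)
              .apply(lambda stratum: stratum.sample(frac=frac,
                                                    replace=replace,
                                                    random_state=seed)))

# Toy population with two unbalanced strata (900 vs 100 individuals).
df = pd.DataFrame({"x": range(1000), "label": [0] * 900 + [1] * 100})
sample = stratified_sample(df, frac=0.1)      # without replacement
print(sample["label"].value_counts())         # 90 of class 0, 10 of class 1

Sampling with replacement, as used later for PLBL2, only requires replace=True.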
Big Data Mining is a great source of information and knowledge from systems to end users. However, managing such a large amount of data or knowledge requires automation, which leads to serious thinking about the use of machine learning techniques. Machine learning consists of many powerful algorithms for learning patterns, acquiring knowledge and predicting future events. Specifically, these algorithms work by searching a space of possible predictive models to capture the best relationship between the descriptive features and the target function in the dataset. Based on this, the machine learning algorithm makes its selection during the training process. The clear criterion driving this choice is the search for models compatible with the data (Erl et al., 2016; Bailly et al., 2018). We can then use this model to make predictions for new cases (instances) (Klaine et al., 2017). Therefore, machine learning, which is one of the subdomains of artificial intelligence, aims to automatically extract and exploit the information present in a dataset, that is, to equip machines with human-like intelligence so that they are able to make predictions based on huge amounts of data, an almost impossible task for a human being (Burhan et al., 2014). For example, machine learning plays a key role in better understanding and coping with the COVID-19 crisis, where machine learning algorithms allow computers to mimic human intelligence and ingest large volumes of data to quickly identify models and information; these models are used to predict newly observed values. After that, smart decisions can be taken to help us out of the crisis (An et al., 2020; Goodman-Meza et al., 2020).
Machine learning algorithms are broadly classified into three categories: supervised, unsupervised and reinforcement learning (Dasgupta and Nath, 2016). In our work, we have relied on supervised algorithms in order to build predictive models that connect past and current datasets, with the help of labeled data, to predict future events (Mathkunti and Rangaswamy, 2020). We can simply say that supervised learning refers to known labels (the predicted classes are known beforehand) in a set of samples used to predict future events (Muhammad et al., 2020; Li et al., 2020). It is divided into three phases: the learning phase, the validation phase and the test phase. Supervised learning is also divided into two broad categories (James et al., 2013): classification and regression. Classification algorithms are suitable for systems that produce discrete responses (Siirtola and Röning, 2020); in other words, the responses are categorical variables. Regression algorithms, by contrast, develop a model that relies on equations or mathematical operations over the values taken from the input attributes to produce a continuous value representing the output (James et al., 2013). This means that the input of these algorithms can take continuous and discrete values depending on the algorithm, whereas the output is a continuous value (Siirtola and Röning, 2020). Supervised learning algorithms in the context of big data are more complex. In this case, we must take into account the method of processing, since this massive amount of data poses real problems for machine learning algorithms. Sometimes these problems cause the system to crash completely; then we cannot produce results, and we cannot talk about velocity. However, to overcome these problems, we can make use of distributed and parallel processing techniques (Assunção et al., 2015; Debauche et al., 2018; Bendechache et al., 2019). Big data mining processing requires massively parallel and widely distributed computing resources due to the amount of data involved in a computation, so that the results are delivered in a rather short time; otherwise, this processing may lose value over time. Distributed and parallel processing emerged as a solution for solving complex problems using nodes that have multiple processors, or nodes connected to each other across a network (Kambatla et al., 2014). The shift from sequential processing to distributed and parallel processing provides high performance and reliability for applications, but it also introduces new challenges in terms of hardware architectures, inter-process communication technologies, algorithms and systems design.
Parallel processing uses computing nodes or modern machines whose hardware infrastructures contain shared processors, often multicore, multithreaded or GPU-based (Fang et al., 2019; Chen et al., 2016), or it uses sophisticated platforms or technologies as software infrastructures, often Hadoop and its ecosystem (Mostafaeipour et al., 2020; Singh et al., 2020). These technologies make it possible to rapidly increase processor speed and power efficiency. In addition, the processing of big data is speeded up by the distributed systems used to enable the exchange of data between compute nodes (Pop et al., 2017). In parallel processing, several processors cooperate to solve a problem, which reduces the processing time because several operations can be performed simultaneously. The use of multiple processors working together on the same computation illustrates a paradigm of computer problem solving completely different from sequential processing. Parallel processing provides models and architectures for performing multiple tasks within a single compute node, or within a group of tightly coupled nodes with homogeneous devices (Conti, 2015). The processing resources can be connected by commodity network products such as Ethernet; however, it is often useful to design a custom network or, at least, use a custom configuration of basic switches that meets the communication requirements. On the other hand, the distributed paradigm has emerged as an alternative to expensive supercomputers in order to meet the needs of new users and applications. Unlike supercomputers, distributed computing systems are networks made up of a large number of nodes or entities connected through a fast local area network. The nodes of a distributed system are independent, in that they do not physically share memory or processors, but they appear to their users as one coherent system (Van Steen and Tanenbaum, 2016). In distributed computing, several compute nodes cooperate to solve a very complex problem, but there is one important thing to know about distributed architectures: by themselves they cannot reduce the processing time, because several operations cannot be performed simultaneously. Accordingly, in our work we have combined both concepts, parallel and distributed computing, and presented them as a unified resource over a high-speed local network.
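To illustrate the parallel half of this combination, the sketch below (ours, not the authors' implementation) runs the same per-block task sequentially and then in parallel on the cores of a single node; the squared-sum workload is only a stand-in for a real per-partition computation.

import time
from multiprocessing import Pool

def process_partition(partition):
    # Stand-in for a real per-partition task (e.g. classifying one block).
    return sum(i * i for i in partition)

if __name__ == "__main__":
    partitions = [range(2_000_000)] * 8          # eight data blocks

    t0 = time.perf_counter()
    sequential = [process_partition(p) for p in partitions]
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    with Pool() as pool:                         # one worker per core
        parallel = pool.map(process_partition, partitions)
    t_par = time.perf_counter() - t0

    assert sequential == parallel                # same results, less wall time
    print(f"sequential: {t_seq:.2f} s, parallel: {t_par:.2f} s")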
2. Related works
Big Data Mining is considered a hot topic for researchers in the fields of mathematics and
computer science. Modeling and predictive analytics can be of critical importance to
institutions if they are properly aligned with their processes and business needs. These
institutions can also significantly improve their performance and the validity of their
decisions, which increases their business value. Every institution can analyze its data
statistically and better understand its environment, but the greatest profit potential lies
with those who are able to perform modeling and predictive analysis based on machine
learning algorithms. In data science and artificial intelligence, big data is the beating heart
of machine learning, because data is the tool that enables machine learning to understand human behavior and ways of thinking and to translate them automatically. But we must never forget that profit has to be made at the right time: with the increasing complexity of modern scientific and technical problems, the requirements for real-time processing are increasing more and more. For this reason, we must use big data solutions.
3. Research methodology
Our proposed work is primarily focused on creating a system that performs big data mining classification in a distributed and parallel way, or rather, on how one can run machine learning algorithms in a distributed and parallel manner during classification. To do this, we have proposed sampling techniques that fit well with a distributed architecture in order to optimize parallel big data mining processing and provide satisfactory classification results in the shortest time. This constitutes the practical part of our work: a group of coordinated and controlled tests comprising big data architectures and solutions supported by machine learning algorithms. It therefore includes defining the objectives, describing and clarifying the activities to be carried out, choosing the techniques to be implemented as well as the technological tools to be mobilized and, finally, mastering the time and material constraints that will determine the success of our work. We will then discuss the results obtained in our experiments. We will therefore be at the heart of a work in which we must identify a need and propose well-thought-out solutions to the problem posed.
Figure 1. Proposed distributed architecture
Distribute the shared learning base to the compute nodes (L2), and distribute the instances validation or sub-instances validation (same thing) to the compute nodes (L2).
(3) The Reduce method, to collect the classification results from the compute nodes (L2) and send them to the central node.
The operating scenario (logical organization) of our distributed architecture is detailed in Figure 2, and a simplified code sketch of it follows.
Figure 2. The mode of operation of the proposed model
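The following is a simplified, single-machine simulation of the Map/Reduce scenario described above: map hands every node the shared learning base plus one chunk of the validation instances, and reduce collects the partial results at the central node. Names such as classify_on_node and NUM_NODES are illustrative assumptions, not the authors' code.

from concurrent.futures import ProcessPoolExecutor

NUM_NODES = 4    # assumed number of compute nodes (L2)

def classify_on_node(args):
    shared_learning_base, validation_chunk = args
    # Stand-in: a real node would classify its validation chunk with a
    # model built from the shared learning base and return the result.
    return [x in shared_learning_base for x in validation_chunk]

def map_reduce(shared_learning_base, instances_validation):
    # Map: distribute the SLB and one validation chunk to each node.
    chunks = [instances_validation[i::NUM_NODES] for i in range(NUM_NODES)]
    tasks = [(shared_learning_base, chunk) for chunk in chunks]
    with ProcessPoolExecutor(max_workers=NUM_NODES) as pool:
        partial_results = list(pool.map(classify_on_node, tasks))
    # Reduce: the central node collects and merges the partial results.
    return [r for partial in partial_results for r in partial]

if __name__ == "__main__":
    print(map_reduce({1, 2, 3}, list(range(10))))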
3.1.2 Proposed sampling technique (second part). Theoretically, statistical researchers want to take samples from a population in order to generalize their results. Accessible populations are groups of research units that the researcher can actually sample. The expected sample is the set of research units chosen by the researcher to participate in the research. In practice, it is somewhat difficult to obtain data from every selected research unit: the researcher may not be able to find all the intended participants; some may choose not to participate; some may start but not complete the study; some may give bad data, etc. The real sample is the group of research units from which we can actually get data. It must be determined whether or not the actual sample is representative of the population to which one wishes to generalize the results. This is what we want to achieve in our study. But first, we must think about the methods and tools used for this purpose.
As is well known, if we want to get good results, we must use the best methods. In our work, we chose stratified sampling among the random sampling methods, but why? It is a compound question, and to answer it we must answer two secondary questions: first, why did we choose probability sampling methods and not non-probability sampling methods? Second, why did we choose stratified sampling over the other probability sampling methods?
Regarding the first question, we rely on the work provided by MacInnis et al. (2018) and Espinosa et al. (2012), who assured us that probability sampling is the better choice. Concerning the answer to the second question, we based ourselves on the work proposed by Espinosa et al. (2012), Fei (2015), Peter (1976), Okororie and Otuonye (2015) and Puech et al. (2014). We can now confirm that the stratified sampling method is the best and most optimal, especially in the field of big data mining (Zhao et al., 2019; Pandey and Shukla, 2020; Alim and Shukla, 2020). In addition, it is a method widely used in various fields because it has proven its efficiency and success (Fellers and Kuiper, 2020; Shen, 2020; Gong et al., 2020). When applying sampling methods in mathematical statistics, one must first know the sample size and the formulas used to determine it. The most famous formulas are as follows:
To determine the sample size Te, there are two approaches (Ataro, 1967):
(1) From a proportion (Cochran, 1977), we use the following formula:

Te = (n² × p × (1 − p)) / me²

where:
Te: size of the expected sample;
n: confidence level according to the reduced centered normal law;
p: estimated proportion of the population presenting the characteristic (when unknown, traditionally we use: n = 95%, p = 50%, me = 5%);
me: margin of error tolerated.
(2) From an average (Gupta and Jain, 2015), we use the following formula:

Te = (n² × σ²) / me²
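As a worked check of the two formulas (our snippet, not part of the paper), the following computes Te with the traditional values quoted above, taking n = 1.96, the reduced centered normal quantile for a 95% confidence level:

import math

def sample_size_proportion(n, p, me):
    # Te = n^2 * p * (1 - p) / me^2   (Cochran, 1977)
    return math.ceil(n ** 2 * p * (1 - p) / me ** 2)

def sample_size_mean(n, sigma, me):
    # Te = n^2 * sigma^2 / me^2   (from an average)
    return math.ceil(n ** 2 * sigma ** 2 / me ** 2)

# Traditional values: 95% confidence (n = 1.96), p = 50%, me = 5%.
print(sample_size_proportion(1.96, 0.50, 0.05))   # -> 385 individuals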
1–2 for each compute node (level 1) do: [...]

2- Reduce phase
2–1 from the central node:
(1) Collect the preliminary results received from the compute nodes of level 2;
(2) Display the final result of the predictive analysis;
2–2 for each compute node (level 1) do:
(1) Calculate the partial learning base (level 1): dp(NcL1) ← Σi=1..NBR(m) mi;
2–3 for each compute node (level 2) do:
(1) Calculate d′p(NcL2) ← Σi=1..NBR(mi) m′i;
(2) Delete the duplicated individuals;
(3) Calculate the representative learning base dr // dri = d′pi(NcL2) + ds;
(4) Build the model with the Random Forests algorithm from (dri + I′v) // I′v: sub-instances validation;
(5) Send the preliminary result to the central node;
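Read as code, the level-2 steps above amount to unioning the received blocks, deleting duplicates, appending the shared base and training a Random Forests model. The sketch below mirrors that sequence with pandas and scikit-learn; the function names and the target column are illustrative assumptions, not the authors' implementation.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def build_representative_base(received_blocks, shared_base):
    # d'p <- union of the received blocks m'_i (step (1))
    d_prime_p = pd.concat(received_blocks, ignore_index=True)
    d_prime_p = d_prime_p.drop_duplicates()                   # step (2)
    # dr = d'p + ds, deduplicated again after the merge (step (3))
    return pd.concat([d_prime_p, shared_base],
                     ignore_index=True).drop_duplicates()

def train_on_node(dr, target="label"):
    # Step (4): build the model with the Random Forests algorithm from dr.
    model = RandomForestClassifier(n_estimators=600, random_state=42)
    model.fit(dr.drop(columns=[target]), dr[target])
    return model    # step (5): its result would be sent to the central node

The choice of 600 trees follows the observation, reported below, that the precision stops improving beyond roughly 600 trees.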
Figure 3. Stability of the classification result using SLB

Table 2. Classification result (performance metrics) using SLB
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
87.5074   12.4926   0.7767   60.2263   44.3721   0.1439   0.0508   0.8582662875809086
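For reference, the metrics reported in this and the following tables can be recomputed from a vector of predictions. The sketch below is ours; in particular, it assumes Weka-style definitions of relative absolute error (RAE) and root relative squared error (RSE), which is only our reading of the column names, and it takes NumPy arrays as inputs.

import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score)

def performance_metrics(y_true, y_pred, y_prob):
    # y_true/y_pred: 0-1 labels; y_prob: predicted class-1 probabilities.
    cci = 100 * accuracy_score(y_true, y_pred)       # correctly classified %
    kappa = cohen_kappa_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_prob)
    rmse = mean_squared_error(y_true, y_prob) ** 0.5
    base = np.abs(y_true - y_true.mean())            # errors of the mean model
    rae = 100 * np.abs(y_true - y_prob).sum() / base.sum()
    rse = 100 * (((y_true - y_prob) ** 2).sum() / (base ** 2).sum()) ** 0.5
    return {"CCI %": cci, "ICI %": 100 - cci, "Kappa": kappa,
            "RSE %": rse, "RAE %": rae, "RMSE": rmse, "MAE": mae,
            "Precision": precision_score(y_true, y_pred)}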
Figure 4. Classification result using PLBL1

Table 3. Classification result (performance metrics) using PLBL1
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
84.2914   15.7086   0.7157   64.8949   49.0757   0.1547   0.0557   0.8333855308487316

Figure 5. Stability of the classification result using PLBL2
Figure 5 shows the step in which we extracted the partial learning base for the second level: we extracted a sample from each subset using the stratified sampling method with replacement, knowing that we divided the partial learning base of the first level using the stratified sampling method without replacement, so that we can obtain a representative partial learning base at the second level. Each time, we give the sample extracted from the three subsets to the Random Forests classifier and then see whether the result is satisfactory or not. If it is not, we add another sample to it in the same way. At this stage, we remove the duplicate individuals, present the new sample to the classifier again and look at the classification result once more. This process is repeated until a satisfactory result is obtained. Moreover, this result cannot be improved by adding more samples, as the classification result then remains constant.
The results obtained after classification are given in Table 4.
Figure 5 also expresses the size of PLBL2 (MB) against the precision. We ran this test to find out how stable the classification result is using PLBL2. It can be seen that the classification result improved significantly (precision = 0.846) after using 340 MB as a minimum size for this learning base, and that this result remains constant if the size of PLBL2 increases beyond 340 MB.
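The grow-until-stable procedure just described, draw a stratified sample with replacement from each subset, merge, deduplicate, re-classify and stop when the precision no longer improves, can be sketched as follows; evaluate_precision, the label column and the tolerance are our assumptions, not the authors' code.

import pandas as pd

def grow_until_stable(subsets, evaluate_precision, frac=0.1, tol=1e-3,
                      max_rounds=50):
    # subsets: list of DataFrames; evaluate_precision: callable that runs
    # the Random Forests classifier on a base and returns its precision.
    base, best = pd.DataFrame(), 0.0
    for _ in range(max_rounds):
        # One stratified sample with replacement from each subset.
        draws = [s.groupby("label", group_keys=False)
                   .apply(lambda g: g.sample(frac=frac, replace=True))
                 for s in subsets]
        base = pd.concat([base, *draws]).drop_duplicates()
        precision = evaluate_precision(base)
        if precision - best < tol:     # no significant improvement: stop
            return base, best
        best = precision
    return base, best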
Test 4. Presents the big data mining classification using PLBL1 and SLB. The result of this test is shown in Figure 6.
The results obtained after classification are given in Table 5.
We see from Figure 6 that the classification result improved greatly after merging the two datasets (PLBL1 and SLB), from precision = 88.1407% using only ten trees, which is an insufficient number for the random forests classifier, up to precision = 89.7679% (an increase rate of 1.5%) after using more than 600 trees. We also note that the precision improved by 6% when merging the two bases PLBL1 and SLB (precision = 0.8976796797931227) instead of precision = 0.8333855308487316 using only PLBL1.
Table 4. Classification result (performance metrics) using PLBL2
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
85.6756   14.3244   0.7414   62.6255   46.0367   0.1492   0.0523   0.8464812552851639
Figure 6. Classification result using PLBL1 and SLB

Table 5. Classification result (performance metrics) using PLBL1 with SLB
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
90.7998   9.2002    0.8363   53.1302   37.1303   0.1269   0.0424   0.8976796197931227
Test 5. Presents the work of Djafri et al. (2018), who classified big data (KDD Cup 2012) using a representative learning base and the classical random forests (CRF), as well as the improved random forest classifier (IRF), but with a method completely different from the one proposed in this work. The results obtained are shown in Table 6.
What attracts attention in Table 6 is that the precision is somewhat improved when using the improved random forests: they obtained precision = 88.288% using classical random forests, whereas they obtained precision = 91.592% using improved random forests, an estimated average increase of 3%.
Test 6. Presents the big data mining classification using the RLB, which is an integration of two bases: the PLBL2 and the SLB extracted from the original dataset (instances training) located in the central node, knowing that after the integration of these two bases the duplicated individuals are deleted. This is the most important part of the work proposed in this paper. The different sizes of each learning base used in our work, expressed in MB, are as follows:
Active sub-set = RLB + sub-instances validation = 444.73999 MB, so that: RLB = 368.406 MB and sub-instances validation = 76.33333 MB; knowing that RLB = SLB + PLBL2 = (420 × 0.226998 MB = 95.34 MB) + 273.066 MB.
From Figure 7, we note the following: first, the classification results obtained using RLB (in our proposed work we have six subsets) are very similar (see Table 7) [...]

Figure 7. Classification result using RLB

Table 7. Classification result using the proposed architecture

Figure 8. Classification result using original dataset (KDD Cup 2012)

Figure 9. Final classification result of the original dataset and RLB using CRF and IRF
of 1% compared to the result obtained using only PLBL1. After that, the classification result increased to 89.76% when the two bases, SLB and PLBL1, were combined, where we recorded an increase of 3% compared to the classification result obtained using only SLB, and of 6% compared to the classification result obtained using only PLBL1. The classification result also increases up to a precision rate = 91.48% when using RLB, and this result is very satisfactory compared to the classification result obtained using the original dataset (precision = 91.73%), i.e. an error rate = 0.0025. In addition, the final classification result improved at an estimated rate of 3% compared to the work presented by Djafri et al. (2018), and this precision can be increased by approximately 3% more (precision = 94.59%) using the improved random forests classifier proposed by Djafri et al. (2018). Through the results obtained in Test 6, it is shown that using RLB improves the big data mining classification result from 2% to 11%. Furthermore, the classification error rate between the compute nodes of the second level is very low (mean error = 0.0000096), i.e. it does not exceed 1/4 per thousand. This shows that RLB is similar in all compute nodes L2 thanks to our proposed architecture. Also, we see from the above results (Table 8) that the difference in classification precision between the big instances training (original dataset) and the RLB (our proposed work) is very small (0.0028357467138873) compared to the size of the data.
The classification results presented by Emara and Huang (2020) are good despite a 4% loss of classification precision, while in our proposed work we gained up to 7%. Also, their processing time was very slow compared to the time we obtained. In the work presented by Ibrahim and Bassiouni (2020), the processing time was also rather slow, although their machines have better characteristics than the ones we used and are more numerous (20 machines); yet we obtained better classification results, and in real time. From this, we conclude that the number of machines and their characteristics are not sufficient to achieve better execution time; rather, it requires more efficient methods and strategies. This is what Liu and Zhang (2020) confirmed in their work. Compared with the results obtained by Pandey and Shukla (2019), our proposed work remains better in terms of increase in precision, as well as in execution time, although their experiments used a small dataset. In the same context, our model gave a good classification result compared to the work proposed by Islam and Amin (2020) using real data and a distributed random forest classifier. We can now say that our proposed work maintains the stability of the classification result in distributed systems. This result is also guaranteed in the parallel computing of the learning base during classification. Henceforward, this is known by the term Dynamic Distributed and Parallel Machine Learning (DDPML).
Figure 10. Processing time required to extract PLBL1 and SLB by central node

Figure 11. Processing time required and corresponding to the progress size of the PLBL2 in compute node level 1
The processing time increases proportionally with the size of the PLBL2 until it reaches 83 s at a size of 1200 MB, and we also note that this processing time reaches about 150 s when processing twice this size.
Figure 12 shows the processing time for different sizes at the second level using compute nodes of level 2. We see that the processing time is negligible if the size is less than 400 MB; it reaches almost 4 s if the size is almost equal to 600 MB, and it reaches 120 s at a size of 2 GB.
Table 9 shows the coordination between the number of compute nodes of the first and second level to extract the PLBL2, so that this coordination also gives a fixed size to this partial learning base (see Figure 5). In addition, the processing of the active subset can be performed in real time (average duration between 0.280 s and 0.231 s); our proposed work is presented on the second line, in green, in Table 9. See also Figures 12 and 13.
Figure 12. Processing time required and corresponding to the progress of size in compute node level 2

Table 9. Coordination between MBR of compute nodes of the first and second level to extract the PLBL2

Figure 13. Average run time of our model for different sizes (GB) of data-set (instances training: It) according to number of compute nodes (L2)
To obtain real-time classification results, it is necessary to fix the size of the PLBL2 at 1/3 of the original learning base. In addition, it is necessary to use a minimum number of machines (depending, of course, on the characteristics of the machines used), whether in level 1 or level 2, according to the following equations:
(1) Number of nodes at level 2 = 3 * instances training volume.
(2) Number of nodes at level 1 = 32 * instances training volume.
You should also know that the PLBL2 size as well as the processing time can vary depending on the availability and capacity of the machines used (see the partial learning base level 2 extraction algorithm).
Figure 13 shows the coordination between the number of compute nodes L2 and the data size for real-time execution.
4.2.1 Second part discussion with conclusions. In this section, we discuss the processing time required when classifying different datasets. From Figure 10, we see that the execution time increases much more during the extraction of the PLBL1 than during the extraction of the SLB; the reason is the partitioning of the original dataset to build PLBL1. We also see that the execution time is slightly reduced when the PLBL2 is extracted, so that the execution time in this case can be considered moderate compared to the execution times above: it increases by approximately one-third compared to the execution time required to extract the SLB and decreases by approximately one-third compared to the execution time required to extract the PLBL1 (see Figure 10). In this case, the execution time is rather low despite the partitioning process; the reason is the increase in the number of nodes at the first and second levels (see Table 9). Furthermore, the size of the subsets at the first level is small compared to the original dataset. Moreover, from Figure 12 it can be concluded that our proposed system performs massive processing in real time through the active subset (RLB + sub-instances validation) that we applied for the classification, as the time taken to process a volume = 444.739 MB is 0.230 s (see Figure 12). However, this execution time is variable when processing other large sizes. For example, when processing a dataset with a size of 5 gigabytes using our distributed architecture, the execution time is around 60 s; in this case, we switch from the real-time (streaming) data processing mode to the micro-batch processing mode. When processing a dataset with a size of 10 gigabytes, the execution time increases to about 1.5 min; in this case, our system fits among systems that process data in batch mode. This execution time also increases to 570 s, or the equivalent of 9.5 min, to process 75 GB. However, this time can be brought back to real time simply by increasing the number of nodes at the first and second levels (see Table 9 and Figure 13). We can now say that our proposed system is dynamic, flexible and extensible according to the execution mode (streaming, micro-batch or batch) that we want to apply when analyzing big data of different sizes.
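The size-dependent behavior just described can be condensed into a small helper; the thresholds below merely restate the measurements reported in this section, and the exact cut-off values are our assumption.

def processing_mode(size_gb):
    # Thresholds restate the measurements above; exact cut-offs are ours.
    if size_gb <= 0.5:       # ~444.739 MB handled in ~0.230 s
        return "streaming (real time)"
    if size_gb <= 5:         # ~5 GB handled in ~60 s
        return "micro-batch"
    return "batch"           # e.g. 75 GB in ~570 s (9.5 min)

for size in (0.444, 5, 10, 75):
    print(size, "GB ->", processing_mode(size))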
Figure 14. The operating principle of our proposed system DDPML

Table 10. Comparison of classification results (binary classification: KDD Cup 2012) using multiple classifiers
Performance metrics   SVM %       ANN %       KNN %       RF %        LR %        BN %
Precision             91.032623   91.097246   86.394230   91.453504   90.960674   89.869753
Recall                95.792956   95.443117   93.187269   95.688636   95.623671   95.143884
AUC                   98.344301   98.251421   96.809512   98.423291   98.309555   98.090501
F-measure             93.352143   93.219558   89.662269   93.526135   93.233905   92.431645
Note(s): The italic values represent the good results of this classifier compared to other classifiers
We selected these six classifiers because they are the most widely used and are highly suitable for big data analytics. To confirm our choices, we quote the following:
SVM (Support Vector Machine): over the past decade, SVM has been gradually integrated into the Big Data field. It solves big data classification problems; in particular, it can help multi-domain applications in a big data environment (Sadrfaridpour et al., 2019).
ANN (Artificial Neural Networks) constitute a realistic criterion in the Big Data field; knowledge of this field is thus of paramount importance for those who wish to extract significant information from the big data available to date (Chiroma et al., 2019).
KNN (K-Nearest Neighbors) is widely used in big data analytics, especially as it is developed more and more in order to give satisfactory classification results (Deng et al., 2016; Xing and Bei, 2020).
RF (Random Forests) seems insensitive to over-fitting, and this method generally does not require a lot of parameter optimization effort. Random forests therefore avoid one of the main pitfalls of Big Data approaches in machine learning, as we have discussed in detail in our previous work (Djafri et al., 2018).
LR (Logistic Regression) gives better results for analyzing big data (Dhamodharavadhani and Rathipriya, 2019).
BN (Bayesian Network), or Naïve Bayes, can also be used in the Big Data field; it is very useful for generating synthetic data when the actual data are insufficient (Scutari et al., 2019).
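To show how such a six-classifier comparison can be assembled, here is a compact scikit-learn sketch; the synthetic dataset is a stand-in for KDD Cup 2012, and the default hyperparameters are our assumption rather than the paper's configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic binary dataset standing in for KDD Cup 2012.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

classifiers = {
    "SVM": SVC(probability=True),
    "ANN": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
    "RF":  RandomForestClassifier(n_estimators=600),
    "LR":  LogisticRegression(max_iter=1000),
    "BN":  GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: precision={precision_score(y_te, y_pred):.4f} "
          f"recall={recall_score(y_te, y_pred):.4f} "
          f"AUC={roc_auc_score(y_te, y_prob):.4f} "
          f"F={f1_score(y_te, y_pred):.4f}")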
5.1.1 The first experiment. In this experiment, we use the KDD Cup 2012 dataset, knowing that this dataset has two classes (binary classification) and that the number of features is very large (see Table 1).
The results (classifier performance metrics) obtained are shown in Table 10.
From Table 10, we note that the RF classifier is the best among the classifiers we used in our work to classify big data mining: it gave precision = 91.45350422688846%, followed by the ANN classifier with a precision = 91.09724638550603%, a result very close to the previous one. The SVM classifier comes in third place, with almost the same classification result (precision = 91.03262395418047%), although it comes in first place with recall = 95.79295674781375%. The LR classifier also gave a good result (precision = 90.96067415730337%), close to the results obtained using RF, ANN and SVM. As for the KNN classifier, it gave a classification result with lower precision than the other classifiers (see Figure 15).
Figure 15. Classifiers performance metrics (binary classification)
5.1.2 The second experiment. In this experiment, we use the Mnist8m dataset, knowing that this dataset has ten classes (multi-class classification). For more information on this database, visit this site [2] (see Table 11).
The results obtained are shown in Table 12:

Table 12. Comparison of classification results (multi-class classification: Mnist8m) using multiple classifiers
Performance metrics   SVM %       ANN %       KNN %       RF %        LR %        BN %
Precision             88.676159   89.979204   86.112947   90.003798   89.258160   86.458410
Recall                94.386830   94.654088   92.439889   94.874874   94.611029   92.437967
AUC                   97.464970   97.885518   96.153445   97.979399   97.656028   96.285861
F-measure             91.442422   92.257462   89.164322   92.375166   91.856677   89.348256
Note(s): The italic values represent the good results of this classifier compared to other classifiers

From Table 12, we can see that the RF classifier gives the best classification result for big data mining, with a precision = 90.00379842998227%. In second place comes the ANN classifier with a precision = 89.9792045749935%, a result very close to the previous one obtained using the RF classifier, followed by the LR classifier, which gave a classification precision = 89.25816023738873%, also a result very close to those obtained using RF and ANN. Regarding the SVM classifier, it gives a moderate classification result with a precision = 88.67615997819718%. Finally, the remaining two classifiers, BN and KNN, gave slightly lower classification results compared to the other classifiers mentioned previously (see Figure 16).
Figure 16. Classifiers performance metrics (multi-class classification)

From the results obtained in Tables 10 and 12, we conclude that the RF classifier remains the best classifier for big data mining classification in both cases, whether binary or multi-class classification. We also conclude that SVM is better suited to big data mining classification in the binary case than in the multi-class case. On the other hand, the LR classifier gives us good multi-class classification results compared to binary classification. In addition, ANN gives us results very close to RF in both cases, binary and multi-class classification. Regarding the two classifiers BN and KNN, they give somewhat insufficient results in both cases (binary and multi-class classification). Finally, and as a result of the above, we conclude that the IRF developed by Djafri et al. (2018) gives better classification results than all the classifiers used in our work, because we obtained precision = 94.78324049406636% in the case of binary classification and precision = 94.14018519392236% in the case of multi-class classification (see Figure 17).

Figure 17. Classifiers performance metrics in the case of binary and multi-class classification
Thanks to our DDPML system, we can say that we have achieved several objectives related to the characteristics of big data, such as:
(1) Volume: Reduce data volume (from big data to small data).
(2) Velocity: Speed up execution time (in real time).
(3) Veracity: Getting correct results using the representative learning base (RLB).
(4) Validity: Getting accurate results using our distributed architecture that enables us to
choose the best classifier.
5.2 Practical contribution of our proposed system (DDPML) for big data analytics
DDPML is a system that greatly helps companies and research laboratories that need huge computing resources and high human competence to analyze their big data to save money. It also leaves these institutions free to use their local resources according to the distributed systems available to them; only a simple configuration of these available systems is required.

Figure 18. Big Data Analytics using the DDPML system

From Figure 18, we see that DDPML works comfortably even when the big instances validation base is very large, because sometimes this base can be even bigger than the original dataset. In this case, companies only need a simple distributed and parallel system. But we may face another very difficult problem when these two bases (big instances validation and original dataset) are big at the same time. To solve this problem, it is preferable to apply the DDPML system, which is able to process these data and obtain high precision in a very short time.
5.2.1 The overall discussion. To enrich the discussion and confirm the results obtained, and so that we can compare the results of our proposed work with the results of other works, whether regarding the design of the structure, the classification results or the processing time, some related works are presented, for example but not limited to, in the following:
The survey carried out by Mahmud et al. (2020) explains how to partition big data in distributed systems. They mention that the traditional data partition methods (range and hash) and the data division methods via distributed files (HDFS) often give inaccurate results, because these methods do not take statistical methods into account, which leads to poor results. From this, we conclude that data partitioning with statistical sampling methods applied to it contributes to increasing the accuracy of the results. Among the works presented in this survey, “Approximate cluster computing for big data analysis” is somewhat similar to our proposed work. A survey realized by Liu and Zhang (2020) confirmed to us again that sampling methods reduce the volume of big data more effectively and help to speed up its processing; sampling thus plays an important role in the era of big data, now and in the future. In the same context, Emara and Huang (2020) developed two strategies for distributing data across multiple data centers. In the first strategy, they relied on the Random Sample Partition (RSP) data model to convert big data into sets of random sample data blocks and distribute them across multiple data centers with or without replication. The second strategy allows them to analyze data in any data center by randomly selecting a sample of data blocks copied from other data centers. They concluded from the results obtained in their work that the second strategy is better than the first, because they obtained classification results between 95 and 96% using the random forests classifier, but the processing time was about 23 min. Through this work, we confirm once again that statistical methods, especially random sampling, are very useful when partitioning data in distributed systems. In another work, Ibrahim and Bassiouni (2020) introduced a new partitioning system called Balanced Data Clusters Partitioner (BDCP) to make Hadoop Yarn more efficient in cloud data centers. In their experiments, they used 20 machines with excellent characteristics. Their goal is to minimize the job completion time of Map-Reduce jobs. The runtime results obtained using a skew degree = 0.1 are approximately 100 s for 2 GB, approximately 200 s for 4 GB and just over 200 s for 6 GB. Using a skew degree = 1.1, they obtained execution times between 100 and 150 s for 2 GB, between 200 and 300 s for 4 GB and between 300 and 400 s for 6 GB. Through this work, we understand that we must develop methods that allow us to take advantage of the execution time, especially in distributed systems and parallel computing. Another work, no less important than the previous ones, was proposed by Islam and Amin (2020); they relied on the Distributed Random Forest (DRF) and the Gradient Boosting Machine (GBM), and concluded from their results that DRF gives better classification results (precision = 0.8436) than GBM (precision = 0.7916) when using real data.
At the end of this discussion, we present the survey conducted by Verbraeken et al. (2020), which provides a general and comprehensive overview of the latest technologies in the field of distributed machine learning. They showcased the currently available systems, whether distributed machine learning architectures or distributed machine learning ecosystems. However, we found that these works lack accurate results due to their architecture design (simple and traditional), so that data partitioning is also simple (it does not depend on the leading mathematical methods in this field). These systems also require a fairly long processing time compared to our proposed system. In addition, deploying predictive models requires human skills (developer, data expert, etc.). Furthermore, in their entirety, these systems are intended for private or fixed use, in contrast to the DDPML system, which is intended for organizations that analyze big data. DDPML is a dynamic and vital system because it contributes greatly to fault tolerance; for example, if one, two or three nodes go down, our system always remains in service, because DDPML operates autonomously of the network architecture and the platforms, thanks to the representative learning base that can be processed in any system. This learning base is completely independent of the hardware and its composition. Moreover, the results always remain precise, because the classification process depends on the representative base. DDPML also does not depend on the architectural design of the distributed system or the number of nodes that make up these systems; these mainly help to speed up processing time only.
References
Alam, A. and Ahmed, J. (2014), “Hadoop architecture and its issues”, International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, pp. 288-291, doi: 10.1109/CSCI.2014.140.
Alim, A. and Shukla, D. (2020), “Solution approach to big data regarding parameter estimation
problems in predictive analytics model”, Research Journal of Computer and Information
Technology Sciences, Vol. 8 No. 1, pp. 1-8.
An, C., Lim, H. and Kim, D. (2020), “Machine learning prediction for mortality of patients diagnosed
with covid-19: a nationwide Korean cohort study”, Scientific Reports, Vol. 10, doi: 10.1038/
s41598-020-75767-2.
Andrade, C. (2020), “Sample size and its importance in research”, Indian Journal of Psychological
Medicine, Vol. 42 No. 1, pp. 102-103.
Antal, E. and Tille, Y. (2011), “Simple random sampling with over-replacement”, Journal of Statistical
Planning and Inference, Vol. 141 No. 1, pp. 597-601.
Ardakani, A.A., Kanafi, A., Acharya, U.R., Khadem, N. and Mohammadi, A. (2020), “Application of deep learning technique to manage covid-19 in routine clinical practice using CT images: results of 10 convolutional neural networks”, Computers in Biology and Medicine, Vol. 121, doi: 10.1016/j.compbiomed.2020.103795.
Assunção, M.D., Calheiros, R.N., Bianchi, S., Netto, M.A. and Buyya, R. (2015), “Big data computing and clouds: trends and future directions”, Journal of Parallel and Distributed Computing, Vol. 79, pp. 3-15, doi: 10.1016/j.jpdc.2014.08.003.
Ataro, Y. (1967), Statistics, an Introductory Analysis, 2nd ed., Harper & Row, New York.
Bailly, S., Meyfroidt, G. and Timsit, J. (2018), “What’s new in icu in 2050: big data and machine
learning”, Intensive Care Med, Vol. 44, pp. 1524-1527, doi: 10.1007/s00134-017-5034-3.
Bei, Z., Yu, Z., Luo, N., Jiang, C., Xu, C. and Feng, S. (2018), “Configuring in-memory cluster computing
using random forest”, Future Generation Computer Systems, Vol. 79, pp. 1-15, doi: 10.1016/j.
future.2017.08.011.
Bendechache, M., Tari, A.-K. and Kechadi, M.-T. (2019), “Parallel and distributed clustering framework
for big spatial data mining”, International Journal of Parallel, Emergent and Distributed
Systems, Vol. 34 No. 6, doi: 10.1080/17445760.2018.1446210.
Berndt, A.E. (2020), “Sampling methods”, Journal of Human Lactation, Vol. 36 No. 2, pp. 224-226,
doi: 10.1177/0890334420906850.
Bhandari, A. (2020), Introduction to the Hadoop Ecosystem for Big Data and Data Engineering.
Bhardwaj, P. (2019), “Types of sampling in research”, Journal of the Practice of Cardiovascular
Sciences, Vol. 5 No. 3, pp. 157-163.
Bhaskar, S.B. and Zulfiqar, A. (2016), “Basic statistical tools in research and data analysis”, Indian Journal of Anaesthesia, Vol. 60 No. 9, pp. 662-669, doi: 10.4103/0019-5049.190623.
Bhattacharya, A. and Bhatnagar, S. (2016), “Big data and Apache spark: a review”, International
Journal of Engineering Research Science, Vol. 2 No. 5.
Borthakur, D. (2007), The Hadoop Distributed File System: Architecture and Design, The Apache
Software Foundation.
Brechon, P. (2015), “Random sample, quota sample: the teachings of the evs 2008 survey in France”, BMS:
Bulletin of Sociological Methodology/Bulletin De Methodologie Sociologique, Vol. 126, pp. 67-83.
Brown, D.W., Ford, V. and Ghafoor, S.K. (2020), “A framework for the evaluation of parallel and distributed computing educational resources”, IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), doi: 10.1109/IPDPSW50202.2020.00057.
Bruce, P. and Bruce, A. (2017), Practical Statistics for Data Scientists, O’Reilly Media, Sebastopol, CA.
Bucchianico, A.D., Iapichino, L., Litvak, N., van der Meulen, F. and Wehrens, R. (2019), “Mathematics
for big data”, Book: the Best Writing on Mathematics. doi: 10.2307/j.ctvggx33b.13.
Burhan, U.I.K., Rashidah, F.O., Hunain, A. and Asadullah, S. (2014), “Critical insight for mapreduce optimization in hadoop”, International Journal of Computer Science and Control Engineering, Vol. 2 No. 1, pp. 1-7.
Cebeci, Z. and Yildiz, F. (2016), “Efficiency of random sampling based data size reduction on
computing time and validity of clustering in data mining”, Journal of Agricultural Informatics,
Vol. 7 No. 1, pp. 53-64, doi: 10.17700/jai.2016.7.1.266.
Cen, T., Chu, Q. and He, R. (2019), “Big data mining for investor sentiment”, Journal of Physics:
Conference Series, Vol. 1187 No. 5.
Chan, J.O. (2013), “An architecture for big data analytics”, Communications of the IIMA, Vol. 13
No. 2, pp. 1-13.
Chauhan, R., Kaur, H. and Chang, V. (2017), “Advancement and applicability of classifiers for variant
exponential model to optimize the accuracy for deep learning”, Journal of Ambient Intelligence
and Humanized Computing. doi: 10.1007/s12652-017-0561-x.
Che, D., Safran, M. and Peng, Z. (2013), “From big data to big data mining: challenges, issues, and
opportunities”, Database Systems for Advanced Applications.
Chen, M., Mao, S. and Liu, Y. (2014), “Big data: a survey”, Mobile Networks and Application, Vol. 19
No. 2, pp. 171-209, doi: 10.1007/s11036-013-0489-0.
Chen, W., Xu, S., Jiang, H., Weng, T., Marino, M., Chen, Y. and Li, K. (2016), “Gpu computations
on hadoop clusters for massive data processing”, Proceedings of the 3rd International Conference
on Intelligent Technologies and Engineering Systems (ICITES2014), Springer, pp. 515-521.
Chiroma, H., Abdullahi, U.A., Abdulhamid, S.M., AlArood, A.A., Gabralla, L.A., Rana, N. and Herawan,
T. (2019), “Progress on artificial neural networks for big data analytics: a survey”, IEEE Access,
Vol. 7, doi: 10.1109/access.2018.2880694.
Chung, W.C., Wu, T.L., Lee, Y.H., Huang, K.C., Hsiao, H.C. and Lai, K.C. (2020), “Minimizing resource
waste in heterogeneous resource allocation for data stream processing on clouds”, Applied
Sciences, Vol. 11 No. 1, doi: 10.3390/app11010149.
Cochran, W.G. (1977), Sampling Techniques, 3rd ed., John Wiley and Sons, New York, pp. 4-6.
Concurrency and Computation: Practice and Experience (2016), Parallel and Distributed Computing for Big Data Applications, Wiley Online Library, doi: 10.1002/cpe.3813.
Conti, F. (2015), “Heterogeneous architectures for parallel acceleration”, Doctoral Thesis, University of
Bologna.
Coulet, A., Chawki, M., Jay, N., Shah, N., Wack, M. and Dumontier, M. (2018), “Predicting the need for a
reduced drug dose at first prescription”, Scientific Reports, Vol. 8 No. 1, doi: 10.1038/s41598-018-33980-0.
Dasgupta, A. and Nath, A. (2016), “Classification of machine learning algorithms”, International
Journal of Innovative Research in Advanced Engineering, Vol. 3 No. 3.
Dataflair, T. (2020), Spark Tutorial: Learn Spark Programming.
Davenport, T. and Kim, J. (2013), Keeping up with the Quants, Harvard Business Review Press.
Debauche, O., Mahmoudi, S.A., Mahmoudi, S. and Manneback, P. (2018), “Cloud platform using big
data and hpc technologies for distributed and parallels treatments”, Procedia Computer Science,
Vol. 141, pp. 112-118, doi: 10.1016/j.procs.2018.10.156.
Den-Broeck, V., Sandøy, I.F. and Brestoff, J.R. (2013), “The recruitment, sampling, and enrollment plan”, in Epidemiology: Principles and Practical Guidelines, Springer, pp. 171-196.
Deng, Z., Zhu, X., Cheng, D., Zong, M. and Zhang, S. (2016), “Efficient knn classification algorithm for big data”, Neurocomputing, Vol. 195, pp. 143-148, doi: 10.1016/j.neucom.2015.08.112.
Deshpande, S., Gogtay, N. and Thatte, U. (2016), “Data types”, Journal of The Association of Physicians
of India, Vol. 64.
Dhamodharavadhani, S. and Rathipriya, R. (2019), Enhanced Logistic Regression (ELR) Model for Big Data, IGI Global, doi: 10.4018/978-1-7998-0106-1.ch008.
Dhyani, B. and Barthwal, A. (2014), “Big data analytics using hadoop”, International Journal of Computer Applications, Vol. 108 No. 12.
Djafri, L., Amar-Bensaber, D. and Adjoudj, R. (2018), “Big data analytics for prediction: parallel processing of the big learning base with the possibility of improving the final result of the prediction”, Information Discovery and Delivery, Vol. 46 No. 3, pp. 147-160, doi: 10.1108/IDD-02-2018-0002.
Dong, L.J., Li, X.B. and Peng, K. (2013), “Prediction of rockburst classification using random forest”, Transactions of Nonferrous Metals Society of China, Vol. 23, pp. 472-477, doi: 10.1016/S1003-6326(13)62487-5.
Dunren, C., Mejdl, S. and Zhiyong, P. (2013), “From big data to big data mining: challenges, issues, and
opportunities”, DASFAA Workshops LNCS 7827, pp. 1-15.
Elgendy, N. and Elragal, A. (2014), “Big data analytics: a literature review paper”, in Perner, P. (Ed.),
Advances in Data Mining. Applications and Theoretical Aspects. ICDM, Lecture Notes in
Computer Science, 8557, doi: 10.1007/978-3-319-08976-8-16.
Ellis, G., Bertini, E. and Dix, A. (2005), “The sampling lens: making sense of saturated visualisations”,
Conference on Human Factors in Computing Systems, ACM, Portland, Oregon, USA, pp. 1351-1354.
Emara, T.Z. and Huang, J.Z. (2020), “Distributed data strategies to support large-scale data analysis
across geo-distributed data centers”, IEEE Access, Vol. 8, pp. 178526-178538, doi: 10.1109/
access.2020.3027675.
Erl, T., Khattak, W. and Buhler, P. (2016), Big Data Fundamentals: Concepts, Drivers and Techniques,
Prentice Hall Press.
Espinosa, M.M., Bieski, I. and de Oliveira Martins, D.T. (2012), “Probability sampling design in ethnobotanical surveys of medicinal plants”, Revista Brasileira de Farmacognosia, Vol. 22 No. 6, doi: 10.1590/S0102-695X2012005000091.
Etikan, I. and Bala, K. (2017), “Sampling and sampling methods”, Biometrics and Biostatistics
International Journal, Vol. 5 No. 6, pp. 138-149, doi: 10.15406/bbij.2017.05.00149.
Fang, Y., Chen, Q. and Xiong, N. (2019), “A multi-factor monitoring fault tolerance model based on a
gpu cluster for big data processing”, Information Sciences, Vol. 496, pp. 300-316.
Fei, S. (2015), “Study on a stratified sampling investigation method for resident travel and the
sampling rate”, Discrete Dynamics in Nature and Society, 496179, doi: 10.1155/2015/496179.
Fellers, P.S. and Kuiper, S. (2020), “Introducing undergraduates to concepts of survey data analysis”,
Journal of Statistics Education, Vol. 28 No. 1, pp. 18-24, doi: 10.1080/10691898.2020.1720552.
Ferguson, M. (2013), Enterprise Information Protection- the Impact of Big Data, IBM.
Gandomi, A., Movaghar, A. and Reshadi, M. (2020), “Designing a mapreduce performance model in
distributed heterogeneous platforms based on benchmarking approach”, The Journal of
Supercomputing, Vol. 76, pp. 7177-7203, doi: 10.1007/s11227-020-03162-9.
Gong, Y., Xie, H., Tong, X., Jin, Y., Xv, X. and Wang, Q. (2020), “Area estimation of multi-temporal global impervious land cover based on stratified random sampling”, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 103-108, doi: 10.5194/isprs-archives-XLIII-B4-2020-103-2020.
Gonzalez, J., Xin, A.D.R. and Crankshaw, D. (2014), “Graphx: graph processing in a distributed
dataflow framework”, Proceeding OSDI’14 Proceedings of the 11th USENIX Conference on
Operating Systems Design and Implementation, pp. 599-613.
Goodman-Meza, D., Rudas, A., Chiang, J., Adamson, P., Ebinger, J. and Sun, N. (2020), “A machine learning algorithm to increase covid-19 inpatient diagnostic capacity”, PLoS ONE, Vol. 15 No. 9, doi: 10.1371/journal.pone.0239474.
Gravetter, J. and Forzano, B. (2012), “Selecting research participants”, Behavior Research Methods,
pp. 125-139.
Gupta, A. and Jain, S. (2015), “Estimation of sample size in dental research”, International Dental and
Medical Journal of Advanced Research, Vol. 1, doi: 10.15713/ins.idmjar.9.
Haoyuan, L., Matei, Z., Scott, S., Tathagata, D., Timothy, H. and Ion, S. (2013), “Discretized streams: fault-tolerant streaming computation at scale”, SOSP’13, ACM, Farmington, Pennsylvania, USA, Nov. 3-6, doi: 10.1145/2517349.2522737.
Hariri, R., Fredericks, E. and Bowers, K. (2019), “Uncertainty in big data analytics: survey,
opportunities, and challenges”, Journal of Big Data, Vol. 44 No. 6, doi: 10.1186/s40537-019-
0206-3.
HLG-BAS (2011), “Strategic vision of the high-level group for strategic developments in business
architecture in statistics”, Conference of European Statisticians, 59th Plenary.
Honnutagi, P. (2014), “The hadoop distributed file system”, International Journal of Computer Science
and Information Technologies, Vol. 5 No. 5, pp. 6238-6243.
Howell, C., Su, W. and Nassel, A. (2020), “Area based stratified random sampling using geospatial
technology in a community-based survey”, BMC Public Health, Vol. 20, doi: 10.1186/s12889-020-
09793-0.
Iachan, R., Berman, L., Kyle, T.M., Martin, K.J., Deng, Y., Moyse, D.N., Middleton, D. and Atienza, A.A.
(2019), “Weighting nonprobability and probability sample surveys in describing cancer
catchment areas”, Cancer Epidemiol Biomarkers Prev, Vol. 28 No. 3, pp. 471-477, doi: 10.1158/
1055-9965.EPI-18-0797.
IBM (2014), The Top Five Ways to Get Started with Big Data.
Ibrahim, I. and Bassiouni, M. (2020), “Improvement of job completion time in data-intensive cloud
computing applications”, Journal of Cloud Computing, Vol. 9 No. 8, doi: 10.1186/s13677-019-
0139-6.
Inderpal, S. (2013), “Review on parallel and distributed computing”, Scholars Journal of Engineering
and Technology, Vol. 1 No. 4, pp. 218-225.
Islam, S. and Amin, S. (2020), “Prediction of probable backorder scenarios in the supply chain using
distributed random forest and gradient boosting machine learning techniques”, Journal of Big
Data, Vol. 7 No. 1, doi: 10.1186/s40537-020-00345-2.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013), “Statistical learning”, in An Introduction to Statistical Learning, Springer Texts in Statistics, Vol. 103, Springer, New York, NY, pp. 15-57.
Jaradat, M., Jarrah, M., Bousselham, A., Jararweh, Y. and Al-Ayyouba, M. (2015), “The internet of
energy: smart sensor networks and big data management for smart grid”, Procedia Computer
Science, Vol. 56, pp. 592-597.
Jeong, H. and Cha, K.J. (2019), “An efficient mapreduce based parallel processing framework for user
based collaborative filtering”, Symmetry, Vol. 11 No. 6, doi: 10.3390/sym11060748.
Jun, S., Lee, S. and Ryu, J. (2015), “A divided regression analysis for big data”, International Journal of
Software Engineering and Its Applications, Vol. 9 No. 5, pp. 21-32.
Jun, C., Lee, J.Y. and Kim, B.H. (2019), “Cloud-based big data analytics platform using algorithm templates for the manufacturing industry”, International Journal of Computer Integrated Manufacturing, Vol. 32, pp. 723-738, doi: 10.1080/0951192X.2019.1610578.
Kambatla, K., Kollias, G., Kumarand, V. and Grama, A. (2014), “Trends in big data analytics”, Journal
of Parallel and Distributed Computing, Vol. 74 No. 7, pp. 2561-2573, doi: 10.1016/j.jpdc.2014.
01.003.
Kandel, S., Hellerstein, J., Paepcke, A. and Heer, J. (2012), “Enterprise data analysis and visualization: an interview study”, IEEE Transactions on Visualization and Computer Graphics, Vol. 18 No. 12, pp. 2917-2926, doi: 10.1109/TVCG.2012.219.
Katal, A., Wazid, M. and Goudar, R. (2013), “Big data: issues, challenges, tools and good practices”,
Contemporary Computing (IC3) Sixth International Conference, IEEE, pp. 404-409.
Kayyali, B., Knott, D. and Van Kuiken, S. (2013), The Big-Data Revolution in US Health Care: Accelerating Value and Innovation, McKinsey & Company, Vol. 2 No. 8, pp. 1-13.
Khan, N., Shah, H., Badsha, G., Abbasi, A.A., Alsaqer, M. and Salehian, S. (2018), “10 Vs, issues and challenges of big data”, International Conference on Big Data and Education ICBDE ’18, pp. 203-210.
Kiran, M., Murphy, P., Monga, I., Dugan, J. and Baveja, S. (2015), “Lambda architecture for cost-effective batch and speed big data processing”, IEEE International Conference on Big Data, doi: 10.1109/BigData.2015.7364082.
Klaine, P.V., Imran, M.A., Onireti, O. and Souza, R.D. (2017), “A survey of machine learning techniques
applied to self-organizing cellular networks”, IEEE Communications Surveys and Tutorials,
Vol. 19 No. 4, pp. 2392-2431, doi: 10.1109/COMST.2017.2727878.
Steven, T.K. (2012), Sampling, Chapter 6: Unequal Probability Sampling, 3rd ed., John Wiley & Sons.
Kulkarni, A.P. and Khandewal, M. (2014), “Survey on hadoop and introduction to yarn”, International
Journal of Emerging Technology and Advanced Engineering, Vol. 4 No. 5.
Lalmuanawma, S., Hussain, J. and Chhakchhuak, L. (2020), “Applications of machine learning and
artificial intelligence for covid-19 (sars-cov-2) pandemic: a review”, Chaos, Solitons and Fractals,
Vol. 139 No. C, doi: 10.1016/j.chaos.2020.110059.
Landis, J. and Koch, G. (1977), “The measurement of observer agreement for categorical data”,
Biometrics, Vol. 33 No. 1, pp. 159-174.
Lee, K., Fitts, M. and Conigrave, J. (2020), “Recruiting a representative sample of urban south
australian aboriginal adults for a survey on alcohol consumption”, BMC Medical Research
Methodology. doi: 10.1186/s12874-020-01067-y.
Li, J. and Liu, H. (2017), “Challenges of feature selection for big data analytics”, IEEE Intelligent
Systems, Vol. 32 No. 2, pp. 9-15, doi: 10.1109/mis.2017.38.
Li, Y., Hai-Tao, Z. and Jorge, G. (2020), A Machine Learning-Based Model for Survival Prediction in Patients with Severe Covid-19 Infection, MedRxiv, doi: 10.1101/2020.02.27.20028027.
Liu, Z. and Zhang, A. (2020), “Sampling for big data profiling: a survey”, IEEE Access, Vol. 8, pp. 72713-72726, doi: 10.1109/ACCESS.2020.2988120.
Lu, X. and Zhan, J. (2020), “Workshop 7: hpbdc high-performance big data and cloud computing”,
IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW),
doi: 10.1109/IPDPSW50202.2020.00073.
MacInnis, B., Krosnick, J.A., Ho, A.S. and Cho, M.-J. (2018), “The accuracy of measurements with
probability and nonprobability survey samples: replication and extension”, Public Opinion
Quarterly, Vol. 82 No. 4, pp. 707-744, doi: 10.1093/poq/nfy038.
Mahmud, M.S., Huang, J.Z., Salloum, S., Emara, T.Z. and Sadatdiynov, K. (2020), “A survey of data
partitioning and sampling methods to support big data analysis”, Big Data Mining and
Analytics, Vol. 3 No. 2, pp. 85-101, doi: 10.26599/BDMA.2019.9020015.
Mathkunti, N. and Rangaswamy, S. (2020), “Machine learning techniques to identify dementia”, SN
Comput Sci, Vol. 118 No. 1, doi: 10.1007/s42979-020-0099-4.
Mayya, S., Monteiro, A. and Ganapathy, S. (2017), “Types of biological variables”, Journal of Thoracic
Disease, Vol. 9 No. 6, pp. 1730-1733, doi: 10.21037/jtd.2017.05.75.
Mazhar, R., Awais, A. and Anand, P. (2016), “Real time intrusion detection system for ultra-high-speed
big data environments”, Journal of Supercomputing, Vol. 72, pp. 3489-3510, doi: 10.1007/s11227-
015-1615-5.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Matei, Z. and Talwalkar, A. (2016), “Mllib: machine learning in Apache spark”, Journal of Machine Learning Research, Vol. 17 No. 34, pp. 1-7.
Mohan, A., Venkatesan, R. and Pramod, K. (2017), “A scalable method for link prediction in large real
world networks”, Journal of Parallel and Distributed Computing. doi: 10.1016/j.jpdc.2017.05.009.
Moorley, C. and Shorten, A. (2014), “Selecting the sample”, Evidence Based Nursing, Vol. 17 No. 2, pp. 32-33, doi: 10.1136/eb-2014-101747.
Mostafaeipour, A., Rafsanjani, J., Ahmadi, M. and Dhanraj, A. (2020), “Investigating the performance
of hadoop and spark platforms on machine learning algorithms”, The Journal of
Supercomputing. doi: 10.1007/s11227-020-03328-5.
Muhammad, L., Algehyne, E., Usman, S., Ahmad, A., Chakraborty, C. and Mohammed, I. (2020),
“Supervised machine learning models for prediction of covid-19 infection using epidemiology
dataset”, SN Computer Science, Vol. 2 No. 1, doi: 10.1007/s42979-020-00394-7.
Muthusami, R. and Saritha, K. (2020), “Statistical analysis and visualization of the potential cases of
pandemic coronavirus”, VirusDis, Vol. 31, pp. 204-208, doi: 10.1007/s13337-020-00610-1.
Nguyen, D., Long, T., Jia, X., Lu, W., Gu, X., Iqbal, Z. and Jiang, S. (2019), “A feasibility study for
predicting optimal radiation therapy dose distributions of prostate cancer patients from patient
anatomy using deep learning”, Scientific Reports, Vol. 9 No. 1, doi: 10.1038/s41598-018-37741-x.
Okororie, C. and Otuonye, E.L. (2015), “Efficiency of some sampling techniques”, Journal of Scientific
Research and Studies, Vol. 2 No. 3, pp. 63-69.
Oussous, A., Benjelloun, F.-Z., Lahcen, A. and Belfkih, S. (2017), “Big data technologies: a survey”,
Journal of King Saud University - Computer and Information Sciences. doi: 10.1016/j.jksuci.2017.
06.001.
Ozturk, T., Talo, M., Yildirim, E.A., Baloglu, U.B., Yildirim, O. and Acharya, U.R. (2020), “Automated
detection of covid-19 cases using deep neural networks with x-ray images”, Computers in
Biology and Medicine. doi: 10.1016/j.compbiomed.2020.103792.
O’Donovan, P., Leahy, K., Bruton, K. and O’Sullivan, T.J. (2015), “Big data in manufacturing: a
systematic mapping study”, Journal of Big Data, Vol. 20 No. 2, doi: 10.1186/s40537-015-0028-x.
Padilla, M., Olofsson, P., Stehman, V.S., Tansey, K. and Chuvieco, E. (2017), “Stratification and sample
allocation for reference burned area data”, Remote Sensing of Environment, Vol. 203,
pp. 240-255, doi: 10.1016/j.rse.2017.06.041.
Palanisamy, V. and Thirunavukarasu, R. (2019), “Implications of big data analytics in developing
healthcare frameworks – a review”, Journal of King Saud University – Computer and
Information Sciences, Vol. 31 No. 4, pp. 415-425, doi: 10.1016/j.jksuci.2017.12.007.
Pandey, K.K. and Shukla, D. (2019), “Optimized sampling strategy for big data mining through
stratified sampling”, International Journal of Scientific and Technology Research, Vol. 8 No. 11.
Pandey, K. and Shukla, D. (2020), “Stratified sampling-based data reduction and categorization model
for big data mining”, in Bansal, J., Gupta, M., Sharma, H. and Agarwal, B. (Eds), Communication
and Intelligent Systems. ICCIS 2019. Lecture Notes in Networks and Systems 120, Springer,
Singapore.
Peter, S. (1976), “The foundations of survey sampling: a review”, Journal of the Royal Statistical
Society, Vol. 139 No. 2, pp. 183-204.
Pham, Q., Nguyen, D.C., Huynh-The, T., Hwang, W. and Pathirana, P.N. (2020), “Artificial intelligence
(ai) and big data for coronavirus (covid-19) pandemic: a survey on the state-of-the-arts”, IEEE
Access, Vol. 8, pp. 130820-130839, doi: 10.1109/ACCESS.2020.3009328.
Poornima, S. and Pushpalatha, M. (2016), “A journey from big data towards prescriptive analytics”,
Arpn Journal of Engineering and Applied Sciences, Vol. 19 No. 11.
Pop, F., Dobre, C. and Costan, A. (2017), “AutoCompBD: Autonomic computing and big data platforms”, Soft Computing, Vol. 21 No. 16, pp. 4497-4499, doi: 10.1007/s00500-017-2739-8.
Prakash, V. and Atul, P. (2016), “Comparison of mapreduce and spark programming frameworks for
big data analytics on hdfs”, International Journal of Computer Science Communication, Vol. 7
No. 2, pp. 80-84.
Puech, P.L., Cardot, H. and Goga, C. (2014), “Analysing large datasets of functional data: a survey
sampling point of view”, Journal de la Societe Francaise de Statistique, Vol. 155 No. 4.
Rahul, V. and Pravin, P. (2016), “A survey on: predictive analytics for credit risk assessment”,
International Research Journal of Engineering and Technology, Vol. 3.
Reddy, G.T., Reddy, M.P.K., Lakshmanna, K., Kaluri, R., Rajput, D.S., Srivastava, G. and Baker, T. (2020), “Analysis of dimensionality reduction techniques on big data”, IEEE Access, Vol. 8, pp. 54776-54788, doi: 10.1109/access.2020.2980942.
Ripon, P. and Arif, A. (2016), “Big data: the v’s of the game changer paradigm”, IEEE 18th
International Conference on High Performance Computing and Communications ; IEEE 14th
International Conference on Smart City ; IEEE 2nd International Conference on Data Science
and Systems. doi: 10.1109/HPCC-SmartCity-DSS.2016.8.
Robbins, I.W., Ghosh-Dastidar, B. and Ramchand, R. (2020), “Blending probability and nonprobability
samples with applications to a survey of military caregivers”, Journal of Survey Statistics and
Methodology. doi: 10.1093/jssam/smaa037.
Rojas, J.A.R., Kery, M.B., Rosenthal, S. and Dey, A.K. (2017), “Sampling techniques to improve big
data exploration”, 2017 IEEE 7th Symposium on Large Data Analysis and Visualization
(LDAV), doi: 10.1109/LDAV.2017.8231848.
Roos, Deutsch, Corrigan, Zikopoulos, Parasuraman and Giles (2013), Harness the Power of Big Data:
The Ibm Big Data Platform, McGraw-Hill, New York.
Sadrfaridpour, E., Razzaghi, T. and Safro, I. (2019), “Engineering fast multilevel support vector
machines”, Machine Learning, Vol. 108, doi: 10.1007/s10994-019-05800-7.
Sathyaraj, R., Ramanathan, L., Lavanya, K., Balasubramanian, V. and Saira Banu, J. (2020), “Chicken
swarm foraging algorithm for big data classification using the deep belief network classifier”,
Data Technologies and Applications. doi:10.1108/DTA-08-2019-0146.
Schifano, E.D., Wu, J., Wang, C., Yan, J. and Chen, M.H. (2016), “Online updating of statistical inference
in the big data setting”, Technometrics. doi: 10.1080/00401706.2016.1142900.
Shmueli, G. and Koppius, O. (2011), “Predictive analytics in information systems research”, MIS Quarterly, Vol. 35 No. 3, pp. 553-572.
Schwab-McCoy, A., Baker, C.M. and Gasper, R.E. (2020), “Data science in 2020: computing, curricula, and challenges for the next 10 years”, Journal of Statistics Education, doi: 10.1080/10691898.2020.1851159.
Scutari, M., Vitolo, C. and Tucker, A. (2019), “Learning bayesian networks from big data with greedy
search: computational complexity and efficient implementation”, Statistics and Computing,
Vol. 29, pp. 1095-1108, doi: 10.1007/s11222-019-09857-1.
Sharma, R. and Singh, S.N. (2019), “Data mining classification techniques – comparison for better
accuracy in prediction of cardiovascular disease”, International Journal of Data Analysis
Techniques and Strategies, Vol. 11 No. 4.
Shen, E. (2020), “On the use of sampling weights for retrospective medical record reviews”, The
Permanente Journal, Vol. 24, doi: 10.7812/TPP/18.308.
Shim, K., Cha, S., Chen, L., Han, W.-S., Srivastava, D., Tanaka, K., Yu, H. and Zhou, X. (2012), “Data
management challenges and opportunities in cloud computing”, 17th International Conference
on Database Systems for Advanced Applications (DASFAA’2012), Springer, Berlin/Heidelberg.
Siirtola, P. and Röning, J. (2020), “Comparison of regression and classification models for user-independent and personal stress detection”, Sensors.
Singh, A.S. and Masuku, M.B. (2014), “Sampling techniques and determination of sample size in applied statistics research: an overview”, International Journal of Economics, Commerce and Management, Vol. 2 No. 11, pp. 1-22.
Singh, A., Choudhary, S. and Kumari, M. (2020), “Hadoop ecosystem analytics and big data for
advanced computing platforms”, International Journal of Advanced Science and Technology,
Vol. 29 No. 5, pp. 6633-6642.
Sliwinski, T. and Kang, S. (2017), Applying Parallel Computing Techniques to Analyze Terabyte Atmospheric Boundary Layer Model Outputs, Elsevier, doi: 10.1016/j.bdr.2017.01.001.
Sun, Z. and Wang, P. (2017), “A mathematical foundation of big data”, New Mathematics and Natural
Computation, Vol. 13 No. 2, doi: 10.1142/s1793005717400014.
Sun, L., Liu, G., Song, F., Shi, N., Liu, F., Li, S., Li, P., Zhang, W., Jiang, X., Zhang, Y., Sun, L., Chen, X.
and Shi, Y. (2020), “Combination of four clinical indicators predicts the severe/critical symptom
of patients infected covid-19”, Journal of Clinical Virology. doi: 10.1016/j.jcv.2020.104431.
Taherdoost, H. (2016), “Sampling methods in research methodology; how to choose a sampling
technique for research”, International Journal of Academic Research in Management.
Trovati, M. and Bessis, N. (2015), “An influence assessment method based on co-occurrence for topologically reduced big data sets”, Soft Computing, pp. 1-10.
Tukey, J. (1977), Exploratory Data Analysis, Addison-Wesley, Reading, MA.
Turner, D.P. (2020), “Sampling methods in research design”, Headache: The Journal of Head and Face
Pain, Vol. 60 No. 1, pp. 8-12, doi: 10.1111/head.13707.
Ur Rehman, M.H., Liew, C.S., Abbas, A., Jayaraman, P., Wah, T.Y. and Khan, S.U. (2016), “Big data reduction methods: a survey”, Data Science and Engineering, Vol. 1, pp. 265-284.
Van Steen, M. and Tanenbaum, A.S. (2016), “A brief introduction to distributed systems”, Computing,
Vol. 98, pp. 967-1009, doi: 10.1007/s00607-016-0508-7.
Velliangiri, S., Alagumuthukrishnan, S. and Joseph, S.I.T. (2019), “A review of dimensionality reduction techniques for efficient computation”, Procedia Computer Science, Vol. 165, pp. 104-111, doi: 10.1016/j.procs.2020.01.079.
Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T. and Rellermeyer, J.S. (2020), “A survey
on distributed machine learning”, ACM Computing Surveys, Vol. 53 No. 2, doi: 10.1145/3377454.
Verma, N., Malhotra, D. and Singh, J. (2020), “Big data analytics for retail industry using mapreduce-
apriori framework”, Journal of Management Analytics, Vol. 7 No. 3, pp. 424-442, doi: 10.1080/
23270012.2020.1728403.
Wah, B.W. (2008), Interconnection Networks for Parallel Computers, Wiley Encyclopedia of Computer
Science and Engineering.
Wei, C. and Chou, T. (2020), “Typhoon quantitative rainfall prediction from big data analytics by
using the Apache hadoop spark parallel computing framework”, Atmosphere, Vol. 11, doi: 10.
3390/atmos11080870.
Weihs, C. and Ickstadt, K. (2018), “Data science: the impact of statistics”, International Journal of Data
Science and Analytics, Vol. 6, pp. 189-194, doi: 10.1007/s41060-018-0102-5.
West, P. (2016), “Simple random sampling of individual items in the absence of a sampling frame that
lists the individuals”, New Zealand Journal of Forestry Science, Vol. 46 No. 15, doi: 10.1186/
s40490-016-0071-1.
Wu, J., Zhang, P., Zhang, L., Meng, W., Li, J., Tong, C., Li, Y., Cai, J., Yang, Z., Zhu, J., Zhao, M., Huang,
H., Xie, X. and Li, S. (2020), Rapid and Accurate Identification of Covid-19 Infection through
Machine Learning Based on Clinical Available Blood Test Results, medRxiv, doi: 10.1101/2020.04.
02.20051136.
Xindong, W., Xingquan, Z., Gong-Qing, W. and Wei, D. (2014), “Data mining with big data”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 26 No. 1, pp. 97-107, doi: 10.1109/TKDE.
2013.109.
Xing, W. and Bei, Y. (2020), “Medical health big data classification based on knn classification algorithm”, IEEE Access, Vol. 8, pp. 28808-28819, doi: 10.1109/ACCESS.2019.2955754.
Xingquan, Z. and Ian, D. (2007), Knowledge Discovery and Data Mining: Challenges and Realities,
Hershey, New York, ISBN: 978-1-59904-252.
Yadav, R. and Tailor, R. (2020), “Estimation of finite population mean using two auxiliary variables
under stratified random sampling”, Statistics in Transition New Series, Vol. 21 No. 1, pp. 1-12,
doi: 10.21307/stattrans-2020-001.
Yanchao, D., Jiguang, Y., Yan, Z. and Zhencheng, H. (2016), “Comparison of random forest, random ferns and support vector machine for eye state classification”, Multimedia Tools and Applications, Vol. 75, pp. 11763-11783, doi: 10.1007/s11042-015-2635-0.
Yang, C., Chen, S., Liu, J., Liu, R. and Chang, C. (2020), “On construction of an energy monitoring
service using big data technology for the smart campus”, Cluster Computing, Vol. 23 No. 1,
doi: 10.1007/s10586-019-02921-5.
Zanoon, N., Alkharabsheh, K. and Ryalat, M.H. (2020), “Optimizing mapreduce model for big data
analytics using subtractive clustering algorithm”, International Journal of Advanced Science
and Technology, Vol. 29 No. 4, pp. 4106-4119.
Zhang, Y., Ren, S., Liu, Y., Sakao, T. and Huisingh, D. (2017), “A framework for big data driven
product lifecycle management”, Journal of Cleaner Production, Vol. 159, pp. 229-240.
Zhao, X., Liang, J. and Dang, C. (2019), “A stratified sampling based clustering algorithm for large-
scale data”, Knowledge-Based Systems, Vol. 163, pp. 416-428, doi: 10.1016/j.knosys.2018.09.007.
Appendix
A1. Some concepts in mathematical statistics
A1.2 Bias
The bias is used to detect the possible presence of systematic error. It corresponds to the difference between the mathematical expectation of the estimator of a parameter and the parameter itself:

$$\mathrm{Bias} = E(\hat{\theta}) - \theta$$
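As a quick illustration (not part of the DDPML system), the following Python sketch estimates the bias of the plug-in variance estimator by Monte Carlo simulation; the sample size, number of trials and true variance are arbitrary values chosen for the example.

```python
# Hedged sketch: Monte Carlo estimate of Bias = E(theta_hat) - theta for the
# plug-in variance estimator, whose expectation is ((n-1)/n) * sigma^2.
import random
import statistics

random.seed(0)
n, trials, sigma2 = 10, 20000, 4.0            # arbitrary example values
estimates = []
for _ in range(trials):
    sample = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    m = statistics.fmean(sample)
    estimates.append(sum((x - m) ** 2 for x in sample) / n)  # divide by n, not n-1

bias = statistics.fmean(estimates) - sigma2   # Bias = E(theta_hat) - theta
print(f"empirical bias ~ {bias:.3f}, theoretical bias = {-sigma2 / n:.3f}")
```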
A1.4 Variance
The variance of a statistical series is the number noted V such that:

$$V = \frac{n_1(x_1 - \bar{x})^2 + n_2(x_2 - \bar{x})^2 + \dots + n_p(x_p - \bar{x})^2}{N}$$

We notice:

$$V = \frac{1}{N}\sum_{i=1}^{p} n_i (x_i - \bar{x})^2$$

Remark:
(1) The variance is a positive number.
(2) With the frequencies $f_i = n_i / N$, we also have:

$$V = \sum_{i=1}^{p} f_i (x_i - \bar{x})^2$$
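To make the grouped-data formulas concrete, here is a minimal Python sketch (the values are illustrative, not taken from the paper) that computes V in both the count form and the equivalent frequency form:

```python
# Hedged sketch: grouped-data variance V = (1/N) * sum(n_i * (x_i - mean)^2),
# and the equivalent frequency form with f_i = n_i / N.
values = [2.0, 5.0, 8.0]      # distinct values x_i (illustrative)
counts = [3, 5, 2]            # multiplicities n_i (illustrative)

N = sum(counts)
mean = sum(n * x for n, x in zip(counts, values)) / N
V = sum(n * (x - mean) ** 2 for n, x in zip(counts, values)) / N

freqs = [n / N for n in counts]                                # f_i = n_i / N
V_freq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values))
assert abs(V - V_freq) < 1e-9                                  # same quantity
print(f"mean = {mean}, variance = {V}")
```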
A1.5 Precision
The precision is the ratio of the number of true positives to the sum of true positives and false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
A1.12 Kappa
The Kappa coefficient measures the degree of agreement between two or more judges:

$$k = \frac{P_0 - P_e}{1 - P_e}$$

with P0 the proportion of the sample on which the two judges agree (i.e. the main diagonal of the confusion matrix), and

$$P_e = \frac{\sum_i P_{i.}\, P_{.i}}{n^2}$$

where $P_{i.}$ and $P_{.i}$ are the row and column totals of the confusion matrix for category i, and n is the total number of observations.
K                 Interpretation (Landis and Koch, 1977)
< 0               Poor agreement
0.00-0.20         Slight agreement
0.21-0.40         Fair agreement
0.41-0.60         Moderate agreement
0.61-0.80         Substantial agreement
0.81-1.00         Almost perfect agreement
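The sketch below (a toy 2x2 confusion matrix, not the paper's experimental results) computes the precision and Kappa defined above; on this example k = 0.7, which the table reads as substantial agreement:

```python
# Hedged sketch: precision and Cohen's kappa from a toy confusion matrix
# (illustrative counts; class 0 is taken as the "positive" class).
cm = [[40, 10],   # row = actual class, column = predicted class
      [5, 45]]

n = sum(sum(row) for row in cm)
tp, fp = cm[0][0], cm[1][0]                      # true/false positives for class 0
precision = tp / (tp + fp)                       # Precision = TP / (TP + FP)

p0 = sum(cm[i][i] for i in range(2)) / n         # observed agreement P0 (diagonal)
pe = sum(sum(cm[i]) * sum(r[i] for r in cm)      # Pe: row total * column total,
         for i in range(2)) / n ** 2             # summed over classes, over n^2
kappa = (p0 - pe) / (1 - pe)

print(f"precision = {precision:.3f}, kappa = {kappa:.3f}")   # 0.889, 0.700
```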
Corresponding author
Laouni Djafri can be contacted at: [email protected]