Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

Laouni Djafri
Ibn Khaldoun University, Tiaret, Algeria and
EEDIS laboratory, Djillali Liabes University, Sidi Bel Abbes, Algeria

Received 13 June 2021
Revised 23 November 2021
Accepted 23 November 2021
Abstract
Purpose – This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach – In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies’ hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, we now speak of Big Data mining. For this reason, the authors’ proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem the authors raise in this work is thus how machine learning algorithms can be made to work in a distributed and parallel way at the same time without losing the accuracy of the classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture the authors designed is specially directed at handling big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2). The experimental results show the efficiency of the authors’ solution without significant loss of classification quality. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings – The authors got very satisfactory classification results.
Originality/value – DDPML system is specially designed to smoothly handle big data mining classification.
Keywords Big data mining, Statistical sampling, Map-reduce, Machine learning, Distributed and parallel
processing, Big data platforms
Paper type Research paper
1. Introduction
Since the advent of the Internet to this day, we have seen explosive growth in the volume,
velocity and variety of data created daily (Sathyaraj et al., 2020); this amount of data is
generated by a variety of methods such as click stream data, financial transaction data, log
files generated by web or mobile applications, sensor data from the Internet of things (IoT), in-game player activity and telemetry from connected devices, and many other methods
(O’Donovan et al., 2015; Hariri et al., 2019). This data is commonly referred to as “Big Data”
because of its volume, the velocity with which it arrives and the variety of forms it takes. In 2001, Gartner proposed a three-dimensional, or 3 Vs (volume, variety and velocity), view of the challenges and opportunities associated with data growth (Chen et al., 2014). In 2012, Gartner updated this report as follows: big data is high-volume, high-velocity and/or high-variety information resources that require new forms of processing to improve decision-making (Erl et al., 2016). Oftentimes, these Vs are supplemented by a fourth V, veracity: how accurate is the data? (Chan, 2013; Roos et al., 2013). We can extend this model to Big Data dimensions of over ten Vs: volume, variety, velocity, veracity, value, variability, validity, volatility, viability and viscosity (Hariri et al., 2019; Khan et al., 2018; Kayyali et al., 2013; Katal et al., 2013; Ferguson, 2013; Ripon and Arif, 2016; IBM, 2014; Elgendy and Elragal, 2014). Accordingly, the increasing digitization of our activities, the ever-increasing ability to store digital data and the accumulation of information of all kinds are
generating a new sector of activity aimed at analyzing these large amounts of data. This
leads to the emergence of new approaches, new methods, new knowledge, and ultimately,
undoubtedly, new ways of thinking and acting. Hence, this very large amount of data
must be exploited in order to better understand big data and how to extract knowledge
from it; this is known as big data mining (Cen et al., 2019; Dunren et al., 2013). Its main
purpose is to extract and retrieve desired information or patterns from a large amount of
data (Oussous et al., 2017). It is usually performed on a large amount of structured or
unstructured data using a combination of techniques that make it possible to explore
these large amounts of data, automatically or semi-automatically (Xindong et al., 2014;
Xingquan and Ian, 2007).
Every second, we see massive amounts of data growing exponentially, and this huge amount of data has no value unless we extract the information and knowledge it contains, or simply what we call big data mining. The real problems that we currently face in big data mining are: how do we deal with this huge amount of data (volume)? How do we get the results in the shortest time (velocity)? And can we maintain or improve the precision (veracity and validity) of the results after reducing the size? These and other questions will be discussed in this article. But before we answer them, we must understand very well the close relationship between these four characteristics (volume, veracity, validity and velocity). We know very well that if the size (volume) is large, the precision (veracity and validity) will be high and the speed (velocity) will be low; if the size is small, the speed will be high and the precision may be low. Therefore, our goal in this work is to reduce the size as far as possible in order to increase the speed to the maximum extent possible. This speed will increase even more if we use platforms and architectures prepared for this purpose, provided that the precision of the results obtained is taken into account.
First and foremost, if we want to reduce the volume of Big Data in a scientifically correct way, we naturally think of mathematical methods, especially mathematical statistics (Che et al., 2013; Trovati and Bessis, 2015; Urrehman et al., 2016). So what are the effective mathematical statistics methods that we must apply in such cases to obtain very satisfactory results? On the other hand, if we want to speed up the processing, we consider the application of parallel and distributed computing (Lu and Zhan, 2020; Concurrency-Computat:Pract.Exper, 2016; Brown et al., 2020) supported by big data solutions (Jun et al., 2019; Zhang et al., 2017; Palanisamy and Thirunavukarasu, 2019).
Today, big data mining relies mainly on statistical methods to overcome and control data so that we can handle them comfortably. Thus, mathematical statistics is of prime importance in data science and particularly in big data analytics (Bucchianico et al., 2019; Weihs and Ickstadt, 2018; HLG-BAS, 2011). The reason may lie in its function: mathematical statistics reveals correlations between statistical groups, and this function is pivotal for reducing size in order to better understand the data, and thus extract information with greater precision and speed. As a result, statisticians and data science experts have turned to statistical sampling techniques (Rojas et al., 2017; Liu and Zhang, 2020; Mahmud et al., 2020).
In the Big Data analytics context, we often work with small-scale sets (sub-datasets) that are part of the original dataset. For this reason, we mainly use mathematical statistics. In mathematical statistics, a population usually contains too many individuals to study them all properly, so a survey is often limited to taking one or more samples. A well-chosen sample will contain most of the information about a particular population parameter without requiring study of the community as a whole; this process is called sampling (Xindong et al., 2014; Berndt, 2020; Turner, 2020). The goal is to generalize the results from the sample to the population (Singh and Masuku, 2014; Den-Broeck et al., 2013). Therefore, we must emphasize the importance of a good choice of sample elements to make them representative of our population. A sample is said to be representative when it represents the original dataset as faithfully as possible, by virtue of its characteristics and quantity (Singh and Masuku, 2014; Andrade, 2020; Lee et al., 2020). There are several sampling methods, both probabilistic and non-probabilistic (Berndt, 2020; Etikan and Bala, 2017). In probability sampling, the first important point is that each individual of the population must have a known nonzero chance of selection, though the chances need not be equal. We also want the selection to be done independently; in other words, the selection of one individual does not affect the chance that other individuals will be chosen. We achieve this by selecting through a process in which only chance acts, such as flipping one or more coins, usually using a set of random numbers (Turner, 2020; Taherdoost, 2016; Robbins et al., 2020). A sample chosen in this way is called a random sample (West, 2016). The word “random” does not describe the sample as such, but the way in which it is selected (Brechon, 2015; Bhardwaj, 2019). If a sampling unit can be selected more than once, because it is placed back in the population before selecting the next unit, this is called “random sampling with replacement”; if a sampling unit can be selected only once, i.e. it is not replaced, it is called “random sampling without replacement” (West, 2016; Antal and Tille, 2011). One of the most important methods of probability sampling is stratified sampling (Yadav and Tailor, 2020), and we will use it in our work. This type of sampling divides the population into non-overlapping subpopulations called “strata” (Howell et al., 2020); this division works according to certain characteristics, so that the units of a stratum are as similar as possible (KSteven, 2012). Although one stratum may differ significantly from another, a stratified sample with the required number of units from each population stratum tends to be “representative” of the population as a whole. Stratified sampling is unlikely to produce an absurd sample because it guarantees the relative presence of all the different subgroups that make up the population (Etikan and Bala, 2017; Padilla et al., 2017). Non-probability sampling, by contrast, is generally based on subjective ideas. In other words, the statistical sample is selected based on personal estimation rather than random selection, and this type of sampling does not guarantee equal opportunity for every object of the target population (Iachan et al., 2019; Gravetter and Forzano, 2012; Moorley and Shorten, 2014).
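To make the stratified selection concrete, here is a minimal sketch in Python of stratified random sampling over a labeled dataset; it is our illustration, not the paper's code, and the column name label, the 10% fraction and the fixed seed are assumptions.

import pandas as pd

# Minimal sketch of stratified random sampling with pandas.
# Assumptions (not from the paper): the class column is named "label",
# the sampling fraction is 10% and the seed is fixed for repeatability.
def stratified_sample(df, frac, replace=False, seed=42):
    # Draw `frac` of each stratum so every class keeps its relative
    # presence in the sample, with or without replacement.
    return (df.groupby("label", group_keys=False)
              .apply(lambda stratum: stratum.sample(frac=frac,
                                                    replace=replace,
                                                    random_state=seed)))

# Toy population with two unbalanced strata (900 vs 100 individuals).
df = pd.DataFrame({"x": range(1000), "label": [0] * 900 + [1] * 100})
sample = stratified_sample(df, frac=0.1)      # without replacement
print(sample["label"].value_counts())         # 90 of class 0, 10 of class 1

Sampling with replacement, as used later for PLBL2, only requires replace=True.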
Big Data Mining is a great source of information and knowledge from systems to end users. However, managing such a large amount of data or knowledge requires automation, which leads to serious thinking about the use of machine learning techniques. Machine learning consists of many powerful algorithms for learning patterns, acquiring knowledge and predicting future events. Specifically, these algorithms work by searching a space of possible predictive models to capture the best relationship between the descriptive features and the target function in the dataset. Based on this, the machine learning algorithm makes its selection during the training process. The clear criterion driving this choice is the search for models compatible with the data (Erl et al., 2016; Bailly et al., 2018). We can then use this model to make predictions for new cases (instances) (Klaine et al., 2017). Therefore, machine learning, which is one of the subdomains of artificial intelligence, aims to automatically extract and exploit the information present in a dataset, that is, to equip machines with human-like intelligence so that they are able to make predictions based on huge amounts of data, an almost impossible task for a human being (Burhan et al., 2014). For example, machine learning plays a key role in better understanding and coping with the COVID-19 crisis, where machine learning algorithms allow computers to mimic human intelligence and ingest large volumes of data to quickly identify models and information; these models are used to predict newly observed values. After that, smart decisions can be taken to help us out of the crisis (An et al., 2020; Goodman-Meza et al., 2020).
Machine learning algorithms are broadly classified into three categories: supervised, unsupervised and reinforcement learning (Dasgupta and Nath, 2016). In our work, we have relied on supervised algorithms in order to build predictive models that connect past and current datasets, with the help of labeled data, to predict future events (Mathkunti and Rangaswamy, 2020). We can simply say that supervised learning refers to known labels (the predicted classes are known beforehand) in a set of samples used to predict future events (Muhammad et al., 2020; Li et al., 2020). It is divided into three phases: the learning phase, the validation phase and the test phase. Supervised learning is also divided into two broad categories (James et al., 2013): classification and regression. Classification algorithms are suitable for systems that produce discrete responses (Siirtola and Röning, 2020); in other words, the responses are categorical variables. Regression algorithms, by contrast, develop a model that relies on equations or mathematical operations over the values taken from the input attributes to produce a continuous value representing the output (James et al., 2013). This means that the input of these algorithms can take continuous and discrete values depending on the algorithm, whereas the output is a continuous value (Siirtola and Röning, 2020). Supervised learning algorithms in the context of big data are more complex. In this case, we must take into account the method of processing, since this massive amount of data poses real problems for machine learning algorithms. Sometimes these problems cause the system to crash completely; then we cannot produce results, and we cannot talk about velocity. However, to overcome these problems, we can make use of distributed and parallel processing techniques (Assunção et al., 2015; Debauche et al., 2018; Bendechache et al., 2019). Big data mining processing requires massively parallel and widely distributed computing resources due to the amount of data involved in a computation, so that the results are delivered in a rather short time; otherwise, this processing may lose value over time. Distributed and parallel processing emerged as a solution for solving complex problems using nodes that have multiple processors, or nodes connected to each other across a network (Kambatla et al., 2014). The shift from sequential processing to distributed and parallel processing provides high performance and reliability for applications, but it also introduces new challenges in terms of hardware architectures, inter-process communication technologies, algorithms and systems design.
Parallel processing uses computing nodes or modern machines whose hardware infrastructures contain shared processors, often multicore, multithreaded or GPU-based (Fang et al., 2019; Chen et al., 2016), or it uses sophisticated platforms or technologies as software infrastructures, often Hadoop and its ecosystem (Mostafaeipour et al., 2020; Singh et al., 2020). These technologies make it possible to rapidly increase processor speed and power efficiency. In addition, the processing of big data is speeded up by the distributed systems used to enable the exchange of data between compute nodes (Pop et al., 2017). In parallel processing, several processors cooperate to solve a problem, which reduces the processing time because several operations can be performed simultaneously. The use of multiple processors working together on the same computation illustrates a paradigm of computer problem solving completely different from sequential processing. Parallel processing provides models and architectures for performing multiple tasks within a single compute node, or within a group of tightly coupled nodes with homogeneous devices (Conti, 2015). The processing resources can be connected by commodity network products such as Ethernet; however, it is often useful to design a custom network or, at least, use a custom configuration of basic switches that meets the communication requirements. On the other hand, the distributed paradigm has emerged as an alternative to expensive supercomputers in order to meet the needs of new users and applications. Unlike supercomputers, distributed computing systems are networks made up of a large number of nodes or entities connected through a fast local area network. The nodes of a distributed system are independent, in that they do not physically share memory or processors, but they appear to their users as one coherent system (Van Steen and Tanenbaum, 2016). In distributed computing, several compute nodes cooperate to solve a very complex problem, but there is one important thing to know about distributed architectures: by themselves they cannot reduce the processing time, because several operations cannot be performed simultaneously. Accordingly, in our work we have combined both concepts, parallel and distributed computing, and presented them as a unified resource over a high-speed local network.
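To illustrate the parallel half of this combination, the sketch below (ours, not the authors' implementation) runs the same per-block task sequentially and then in parallel on the cores of a single node; the squared-sum workload is only a stand-in for a real per-partition computation.

import time
from multiprocessing import Pool

def process_partition(partition):
    # Stand-in for a real per-partition task (e.g. classifying one block).
    return sum(i * i for i in partition)

if __name__ == "__main__":
    partitions = [range(2_000_000)] * 8          # eight data blocks

    t0 = time.perf_counter()
    sequential = [process_partition(p) for p in partitions]
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    with Pool() as pool:                         # one worker per core
        parallel = pool.map(process_partition, partitions)
    t_par = time.perf_counter() - t0

    assert sequential == parallel                # same results, less wall time
    print(f"sequential: {t_seq:.2f} s, parallel: {t_par:.2f} s")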
2. Related works
Big Data Mining is considered a hot topic for researchers in the fields of mathematics and
computer science. Modeling and predictive analytics can be of critical importance to
institutions if they are properly aligned with their processes and business needs. These
institutions can also significantly improve their performance and the validity of their
decisions, which increases their business value. Every institution can analyze its data
statistically and better understand its environment, but the greatest profit potential lies
with those who are able to perform modeling and predictive analysis based on machine
learning algorithms. In data science and artificial intelligence, big data is the beating heart
of machine learning, because data is the tool that enables machine learning to understand human behavior and ways of thinking and to translate them automatically. But we must never forget that profit has to be made at the right time: with the increasing complexity of modern scientific and technical problems, the requirements for real-time processing are increasing more and more. For this reason, we must use big data solutions.
3. Research methodology
Our proposed work is primarily focused on creating a system that performs big data mining classification in a distributed and parallel way, or rather, on how one can run machine learning algorithms in a distributed and parallel manner during classification. To do this, we have proposed sampling techniques that fit well with a distributed architecture in order to optimize parallel big data mining processing and provide satisfactory classification results in the shortest time. This constitutes the practical part of our work: a group of coordinated and controlled tests comprising big data architectures and solutions supported by machine learning algorithms. It therefore includes defining the objectives, describing and clarifying the activities to be carried out, choosing the techniques to be implemented as well as the technological tools to be mobilized and, finally, mastering the time and material constraints that will determine the success of our work. We will then discuss the results obtained in our experiments. We will therefore be at the heart of a work in which we must identify a need and propose well-thought-out solutions to the problem posed.
Figure 1. Proposed distributed architecture
Distribute the shared learning base to the compute nodes (L2), and distribute the instances validation or sub-instances validation (same thing) to the compute nodes (L2).
(3) The Reduce method, to collect the classification results from the compute nodes (L2) and send them to the central node.
The operating scenario (logical organization) of our distributed architecture is detailed in Figure 2, and a simplified code sketch of it follows.
Figure 2. The mode of operation of the proposed model
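The following is a simplified, single-machine simulation of the Map/Reduce scenario described above: map hands every node the shared learning base plus one chunk of the validation instances, and reduce collects the partial results at the central node. Names such as classify_on_node and NUM_NODES are illustrative assumptions, not the authors' code.

from concurrent.futures import ProcessPoolExecutor

NUM_NODES = 4    # assumed number of compute nodes (L2)

def classify_on_node(args):
    shared_learning_base, validation_chunk = args
    # Stand-in: a real node would classify its validation chunk with a
    # model built from the shared learning base and return the result.
    return [x in shared_learning_base for x in validation_chunk]

def map_reduce(shared_learning_base, instances_validation):
    # Map: distribute the SLB and one validation chunk to each node.
    chunks = [instances_validation[i::NUM_NODES] for i in range(NUM_NODES)]
    tasks = [(shared_learning_base, chunk) for chunk in chunks]
    with ProcessPoolExecutor(max_workers=NUM_NODES) as pool:
        partial_results = list(pool.map(classify_on_node, tasks))
    # Reduce: the central node collects and merges the partial results.
    return [r for partial in partial_results for r in partial]

if __name__ == "__main__":
    print(map_reduce({1, 2, 3}, list(range(10))))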
3.1.2 Proposed sampling technique (second part). Theoretically, statistical researchers want to take samples from a population in order to generalize their results. Accessible populations are groups of research units that the researcher can actually sample. The expected sample is the set of research units chosen by the researcher to participate in the research. In practice, it is somewhat difficult to obtain data from every selected research unit: the researcher may not be able to find all the intended participants; some may choose not to participate; some may start but not complete the study; some may give bad data, etc. The real sample is the group of research units from which we can actually get data. It must be determined whether or not the actual sample is representative of the population to which one wishes to generalize the results. This is what we want to achieve in our study. But first, we must think about the methods and tools used for this purpose.
As is well known, if we want to get good results, we must use the best methods. In our work, we chose stratified sampling among the random sampling methods, but why? It is a compound question, and to answer it we must answer two secondary questions: first, why did we choose probability sampling methods and not non-probability sampling methods? Second, why did we choose stratified sampling over the other probability sampling methods?
Regarding the first question, we rely on the work provided by MacInnis et al. (2018) and Espinosa et al. (2012), who assured us that probability sampling is the better choice. Concerning the answer to the second question, we based ourselves on the work proposed by Espinosa et al. (2012), Fei (2015), Peter (1976), Okororie and Otuonye (2015) and Puech et al. (2014). We can now confirm that the stratified sampling method is the best and most optimal, especially in the field of big data mining (Zhao et al., 2019; Pandey and Shukla, 2020; Alim and Shukla, 2020). In addition, it is a method widely used in various fields because it has proven its efficiency and success (Fellers and Kuiper, 2020; Shen, 2020; Gong et al., 2020). When applying sampling methods in mathematical statistics, one must first know the sample size and the formulas used to determine it. The most famous formulas are as follows:
To determine the sample size Te, there are two approaches (Ataro, 1967):
(1) From a proportion (Cochran, 1977), we use the following formula:

Te = (n² × p × (1 − p)) / me²

where:
Te: size of the expected sample;
n: confidence level according to the reduced centered normal law;
p: estimated proportion of the population presenting the characteristic (when unknown, traditionally we use: n = 95%, p = 50%, me = 5%);
me: margin of error tolerated.
(2) From an average (Gupta and Jain, 2015), we use the following formula:

Te = (n² × σ²) / me²
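As a worked check of the two formulas (our snippet, not part of the paper), the following computes Te with the traditional values quoted above, taking n = 1.96, the reduced centered normal quantile for a 95% confidence level:

import math

def sample_size_proportion(n, p, me):
    # Te = n^2 * p * (1 - p) / me^2   (Cochran, 1977)
    return math.ceil(n ** 2 * p * (1 - p) / me ** 2)

def sample_size_mean(n, sigma, me):
    # Te = n^2 * sigma^2 / me^2   (from an average)
    return math.ceil(n ** 2 * sigma ** 2 / me ** 2)

# Traditional values: 95% confidence (n = 1.96), p = 50%, me = 5%.
print(sample_size_proportion(1.96, 0.50, 0.05))   # -> 385 individuals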
1–2 for each compute node (level 1) do: [...]

2- Reduce phase
2–1 from the central node:
(1) Collect the preliminary results received from the compute nodes of level 2;
(2) Display the final result of the predictive analysis;
2–2 for each compute node (level 1) do:
(1) Calculate the partial learning base (level 1): dp(NcL1) ← Σi=1..NBR(m) mi;
2–3 for each compute node (level 2) do:
(1) Calculate d′p(NcL2) ← Σi=1..NBR(mi) m′i;
(2) Delete the duplicated individuals;
(3) Calculate the representative learning base dr // dri = d′pi(NcL2) + ds;
(4) Build the model with the Random Forests algorithm from (dri + I′v) // I′v: sub-instances validation;
(5) Send the preliminary result to the central node;
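Read as code, the level-2 steps above amount to unioning the received blocks, deleting duplicates, appending the shared base and training a Random Forests model. The sketch below mirrors that sequence with pandas and scikit-learn; the function names and the target column are illustrative assumptions, not the authors' implementation.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def build_representative_base(received_blocks, shared_base):
    # d'p <- union of the received blocks m'_i (step (1))
    d_prime_p = pd.concat(received_blocks, ignore_index=True)
    d_prime_p = d_prime_p.drop_duplicates()                   # step (2)
    # dr = d'p + ds, deduplicated again after the merge (step (3))
    return pd.concat([d_prime_p, shared_base],
                     ignore_index=True).drop_duplicates()

def train_on_node(dr, target="label"):
    # Step (4): build the model with the Random Forests algorithm from dr.
    model = RandomForestClassifier(n_estimators=600, random_state=42)
    model.fit(dr.drop(columns=[target]), dr[target])
    return model    # step (5): its result would be sent to the central node

The choice of 600 trees follows the observation, reported below, that the precision stops improving beyond roughly 600 trees.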
Figure 3. Stability of the classification result using SLB

Table 2. Classification result (performance metrics) using SLB
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
87.5074   12.4926   0.7767   60.2263   44.3721   0.1439   0.0508   0.8582662875809086
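For reference, the metrics reported in this and the following tables can be recomputed from a vector of predictions. The sketch below is ours; in particular, it assumes Weka-style definitions of relative absolute error (RAE) and root relative squared error (RSE), which is only our reading of the column names, and it takes NumPy arrays as inputs.

import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score)

def performance_metrics(y_true, y_pred, y_prob):
    # y_true/y_pred: 0-1 labels; y_prob: predicted class-1 probabilities.
    cci = 100 * accuracy_score(y_true, y_pred)       # correctly classified %
    kappa = cohen_kappa_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_prob)
    rmse = mean_squared_error(y_true, y_prob) ** 0.5
    base = np.abs(y_true - y_true.mean())            # errors of the mean model
    rae = 100 * np.abs(y_true - y_prob).sum() / base.sum()
    rse = 100 * (((y_true - y_prob) ** 2).sum() / (base ** 2).sum()) ** 0.5
    return {"CCI %": cci, "ICI %": 100 - cci, "Kappa": kappa,
            "RSE %": rse, "RAE %": rae, "RMSE": rmse, "MAE": mae,
            "Precision": precision_score(y_true, y_pred)}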
Figure 4. Classification result using PLBL1

Table 3. Classification result (performance metrics) using PLBL1
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
84.2914   15.7086   0.7157   64.8949   49.0757   0.1547   0.0557   0.8333855308487316

Figure 5. Stability of the classification result using PLBL2
Figure 5 shows the step in which we extracted the partial learning base for the second level: we extracted a sample from each subset using the stratified sampling method with replacement, knowing that we divided the partial learning base of the first level using the stratified sampling method without replacement, so that we can obtain a representative partial learning base at the second level. Each time, we give the sample extracted from the three subsets to the Random Forests classifier and then see whether the result is satisfactory or not. If it is not, we add another sample to it in the same way. At this stage, we remove the duplicate individuals, present the new sample to the classifier again and look at the classification result once more. This process is repeated until a satisfactory result is obtained. Moreover, this result cannot be improved by adding more samples, as the classification result then remains constant.
The results obtained after classification are given in Table 4.
Figure 5 also expresses the size of PLBL2 (MB) against the precision. We ran this test to find out how stable the classification result is using PLBL2. It can be seen that the classification result improved significantly (precision = 0.846) after using 340 MB as a minimum size for this learning base, and that this result remains constant if the size of PLBL2 increases beyond 340 MB.
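The grow-until-stable procedure just described, draw a stratified sample with replacement from each subset, merge, deduplicate, re-classify and stop when the precision no longer improves, can be sketched as follows; evaluate_precision, the label column and the tolerance are our assumptions, not the authors' code.

import pandas as pd

def grow_until_stable(subsets, evaluate_precision, frac=0.1, tol=1e-3,
                      max_rounds=50):
    # subsets: list of DataFrames; evaluate_precision: callable that runs
    # the Random Forests classifier on a base and returns its precision.
    base, best = pd.DataFrame(), 0.0
    for _ in range(max_rounds):
        # One stratified sample with replacement from each subset.
        draws = [s.groupby("label", group_keys=False)
                   .apply(lambda g: g.sample(frac=frac, replace=True))
                 for s in subsets]
        base = pd.concat([base, *draws]).drop_duplicates()
        precision = evaluate_precision(base)
        if precision - best < tol:     # no significant improvement: stop
            return base, best
        best = precision
    return base, best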
Test 4. Presents the big data mining classification using PLBL1 and SLB. The result of this test is shown in Figure 6.
The results obtained after classification are given in Table 5.
We see from Figure 6 that the classification result improved greatly after merging the two datasets (PLBL1 and SLB), from precision = 88.1407% using only ten trees, which is an insufficient number for the random forests classifier, up to precision = 89.7679% (an increase rate of 1.5%) after using more than 600 trees. We also note that the precision improved by 6% when merging the two bases PLBL1 and SLB (precision = 0.8976796797931227) instead of precision = 0.8333855308487316 using only PLBL1.
Table 4. Classification result (performance metrics) using PLBL2
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
85.6756   14.3244   0.7414   62.6255   46.0367   0.1492   0.0523   0.8464812552851639
Figure 6. Classification result using PLBL1 and SLB

Table 5. Classification result (performance metrics) using PLBL1 with SLB
CCI %     ICI %     Kappa    RSE %     RAE %     RMSE     MAE      Precision
90.7998   9.2002    0.8363   53.1302   37.1303   0.1269   0.0424   0.8976796197931227
Test 5. Presents the work of Djafri et al. (2018), who classified big data (KDD Cup 2012) using a representative learning base and the classical random forests (CRF), as well as the improved random forest classifier (IRF), but with a method completely different from the one proposed in this work. The results obtained are shown in Table 6.
What attracts attention in Table 6 is that the precision is somewhat improved when using the improved random forests: they obtained precision = 88.288% using classical random forests, whereas they obtained precision = 91.592% using improved random forests, an estimated average increase of 3%.
Test 6. Presents the big data mining classification using the RLB, which is an integration of two bases: the PLBL2 and the SLB extracted from the original dataset (instances training) located in the central node, knowing that after the integration of these two bases the duplicated individuals are deleted. This is the most important part of the work proposed in this paper. The different sizes of each learning base used in our work, expressed in MB, are as follows:
Active sub-set = RLB + sub-instances validation = 444.73999 MB, so that: RLB = 368.406 MB and sub-instances validation = 76.33333 MB; knowing that RLB = SLB + PLBL2 = (420 × 0.226998 MB = 95.34 MB) + 273.066 MB.
From Figure 7, we note the following: first, the classification results obtained using RLB (in our proposed work we have six subsets) are very similar (see Table 7) [...]

Figure 7. Classification result using RLB

Table 7. Classification result using the proposed architecture

Figure 8. Classification result using original dataset (KDD Cup 2012)

Figure 9. Final classification result of the original dataset and RLB using CRF and IRF
of 1% compared to the result obtained using only PLBL1. After that, the classification result increased to 89.76% when the two bases, SLB and PLBL1, were combined, where we recorded an increase of 3% compared to the classification result obtained using only SLB, and of 6% compared to the classification result obtained using only PLBL1. The classification result also increases up to a precision rate = 91.48% when using RLB, and this result is very satisfactory compared to the classification result obtained using the original dataset (precision = 91.73%), i.e. an error rate = 0.0025. In addition, the final classification result improved at an estimated rate of 3% compared to the work presented by Djafri et al. (2018), and this precision can be increased by approximately 3% more (precision = 94.59%) using the improved random forests classifier proposed by Djafri et al. (2018). Through the results obtained in Test 6, it is shown that using RLB improves the big data mining classification result from 2% to 11%. Furthermore, the classification error rate between the compute nodes of the second level is very low (mean error = 0.0000096), i.e. it does not exceed 1/4 per thousand. This shows that RLB is similar in all compute nodes L2 thanks to our proposed architecture. Also, we see from the above results (Table 8) that the difference in classification precision between the big instances training (original dataset) and the RLB (our proposed work) is very small (0.0028357467138873) compared to the size of the data.
The classification results presented by Emara and Huang (2020) are good despite a 4% loss of classification precision, while in our proposed work we gained up to 7%. Also, their processing time was very slow compared to the time we obtained. In the work presented by Ibrahim and Bassiouni (2020), the processing time was also rather slow, although their machines have better characteristics than the ones we used and are more numerous (20 machines); yet we obtained better classification results, and in real time. From this, we conclude that the number of machines and their characteristics are not sufficient to achieve better execution time; rather, it requires more efficient methods and strategies. This is what Liu and Zhang (2020) confirmed in their work. Compared with the results obtained by Pandey and Shukla (2019), our proposed work remains better in terms of increase in precision, as well as in execution time, although their experiments used a small dataset. In the same context, our model gave a good classification result compared to the work proposed by Islam and Amin (2020) using real data and a distributed random forest classifier. We can now say that our proposed work maintains the stability of the classification result in distributed systems. This result is also guaranteed in the parallel computing of the learning base during classification. Henceforward, this is known by the term Dynamic Distributed and Parallel Machine Learning (DDPML).
Figure 10. Processing time required to extract PLBL1 and SLB by central node

Figure 11. Processing time required and corresponding to the progress size of the PLBL2 in compute node level 1
The processing time increases proportionally with the size of the PLBL2 until it reaches 83 s at a size of 1200 MB, and we also note that this processing time reaches about 150 s when processing twice this size.
Figure 12 shows the processing time for different sizes at the second level using compute nodes of level 2. We see that the processing time is negligible if the size is less than 400 MB; it reaches almost 4 s if the size is almost equal to 600 MB, and it reaches 120 s at a size of 2 GB.
Table 9 shows the coordination between the number of compute nodes of the first and second level to extract the PLBL2, so that this coordination also gives a fixed size to this partial learning base (see Figure 5). In addition, the processing of the active subset can be performed in real time (average duration between 0.280 s and 0.231 s); our proposed work is presented on the second line, in green, in Table 9. See also Figures 12 and 13.
Figure 12. Processing time required and corresponding to the progress of size in compute node level 2

Table 9. Coordination between MBR of compute nodes of the first and second level to extract the PLBL2

Figure 13. Average run time of our model for different sizes (GB) of data-set (instances training: It) according to number of compute nodes (L2)
To obtain real-time classification results, it is necessary to fix the size of the PLBL2 at 1/3 of the original learning base. In addition, it is necessary to use a minimum number of machines (depending, of course, on the characteristics of the machines used), whether in level 1 or level 2, according to the following equations:
(1) Number of nodes at level 2 = 3 * instances training volume.
(2) Number of nodes at level 1 = 32 * instances training volume.
You should also know that the PLBL2 size as well as the processing time can vary depending on the availability and capacity of the machines used (see the partial learning base level 2 extraction algorithm).
Figure 13 shows the coordination between the number of compute nodes L2 and the data size for real-time execution.
4.2.1 Second part discussion with conclusions. In this section, we discuss the processing time required when classifying different datasets. From Figure 10, we see that the execution time increases much more during the extraction of the PLBL1 than during the extraction of the SLB; the reason is the partitioning of the original dataset to build PLBL1. We also see that the execution time is slightly reduced when the PLBL2 is extracted, so that the execution time in this case can be considered moderate compared to the execution times above: it increases by approximately one-third compared to the execution time required to extract the SLB and decreases by approximately one-third compared to the execution time required to extract the PLBL1 (see Figure 10). In this case, the execution time is rather low despite the partitioning process; the reason is the increase in the number of nodes at the first and second levels (see Table 9). Furthermore, the size of the subsets at the first level is small compared to the original dataset. Moreover, from Figure 12 it can be concluded that our proposed system performs massive processing in real time through the active subset (RLB + sub-instances validation) that we applied for the classification, as the time taken to process a volume = 444.739 MB is 0.230 s (see Figure 12). However, this execution time is variable when processing other large sizes. For example, when processing a dataset with a size of 5 gigabytes using our distributed architecture, the execution time is around 60 s; in this case, we switch from the real-time (streaming) data processing mode to the micro-batch processing mode. When processing a dataset with a size of 10 gigabytes, the execution time increases to about 1.5 min; in this case, our system fits among systems that process data in batch mode. This execution time also increases to 570 s, or the equivalent of 9.5 min, to process 75 GB. However, this time can be brought back to real time simply by increasing the number of nodes at the first and second levels (see Table 9 and Figure 13). We can now say that our proposed system is dynamic, flexible and extensible according to the execution mode (streaming, micro-batch or batch) that we want to apply when analyzing big data of different sizes.
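The size-dependent behavior just described can be condensed into a small helper; the thresholds below merely restate the measurements reported in this section, and the exact cut-off values are our assumption.

def processing_mode(size_gb):
    # Thresholds restate the measurements above; exact cut-offs are ours.
    if size_gb <= 0.5:       # ~444.739 MB handled in ~0.230 s
        return "streaming (real time)"
    if size_gb <= 5:         # ~5 GB handled in ~60 s
        return "micro-batch"
    return "batch"           # e.g. 75 GB in ~570 s (9.5 min)

for size in (0.444, 5, 10, 75):
    print(size, "GB ->", processing_mode(size))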
Figure 14. The operating principle of our proposed system DDPML

Table 10. Comparison of classification results (binary classification: KDD Cup 2012) using multiple classifiers
Performance metrics   SVM %       ANN %       KNN %       RF %        LR %        BN %
Precision             91.032623   91.097246   86.394230   91.453504   90.960674   89.869753
Recall                95.792956   95.443117   93.187269   95.688636   95.623671   95.143884
AUC                   98.344301   98.251421   96.809512   98.423291   98.309555   98.090501
F-measure             93.352143   93.219558   89.662269   93.526135   93.233905   92.431645
Note(s): The italic values represent the good results of this classifier compared to other classifiers
We selected these six classifiers because they are the most widely used and are highly suitable for big data analytics. To confirm our choices, we quote the following:
SVM (Support Vector Machine): over the past decade, SVM has been gradually integrated into the Big Data field. It solves big data classification problems; in particular, it can help multi-domain applications in a big data environment (Sadrfaridpour et al., 2019).
ANN (Artificial Neural Networks) constitute a realistic criterion in the Big Data field; knowledge of this field is thus of paramount importance for those who wish to extract significant information from the big data available to date (Chiroma et al., 2019).
KNN (K-Nearest Neighbors) is widely used in big data analytics, especially as it is developed more and more in order to give satisfactory classification results (Deng et al., 2016; Xing and Bei, 2020).
RF (Random Forests) seems insensitive to over-fitting, and this method generally does not require a lot of parameter optimization effort. Random forests therefore avoid one of the main pitfalls of Big Data approaches in machine learning, as we have discussed in detail in our previous work (Djafri et al., 2018).
LR (Logistic Regression) gives better results for analyzing big data (Dhamodharavadhani and Rathipriya, 2019).
BN (Bayesian Network), or Naïve Bayes, can also be used in the Big Data field; it is very useful for generating synthetic data when the actual data are insufficient (Scutari et al., 2019).
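To show how such a six-classifier comparison can be assembled, here is a compact scikit-learn sketch; the synthetic dataset is a stand-in for KDD Cup 2012, and the default hyperparameters are our assumption rather than the paper's configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic binary dataset standing in for KDD Cup 2012.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

classifiers = {
    "SVM": SVC(probability=True),
    "ANN": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
    "RF":  RandomForestClassifier(n_estimators=600),
    "LR":  LogisticRegression(max_iter=1000),
    "BN":  GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: precision={precision_score(y_te, y_pred):.4f} "
          f"recall={recall_score(y_te, y_pred):.4f} "
          f"AUC={roc_auc_score(y_te, y_prob):.4f} "
          f"F={f1_score(y_te, y_pred):.4f}")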
5.1.1 The first experiment. In this experiment, we use the KDD Cup 2012 dataset, knowing that this dataset has two classes (binary classification) and that the number of features is very large (see Table 1).
The results (classifier performance metrics) obtained are shown in Table 10.
From Table 10, we note that the RF classifier is the best among the classifiers we used in our work to classify big data mining: it gave precision = 91.45350422688846%, followed by the ANN classifier with a precision = 91.09724638550603%, a result very close to the previous one. The SVM classifier comes in third place, with almost the same classification result (precision = 91.03262395418047%), although it comes in first place with recall = 95.79295674781375%. The LR classifier also gave a good result (precision = 90.96067415730337%), close to the results obtained using RF, ANN and SVM. As for the KNN classifier, it gave a classification result with lower precision than the other classifiers (see Figure 15).
Figure 15. Classifiers performance metrics (binary classification)
5.1.2 The second experiment. In this experiment, we use the Mnist8m dataset, knowing that this dataset has ten classes (multi-class classification). For more information on this database, visit this site [2] (see Table 11).
The results obtained are shown in Table 12:

Table 12. Comparison of classification results (multi-class classification: Mnist8m) using multiple classifiers
Performance metrics   SVM %       ANN %       KNN %       RF %        LR %        BN %
Precision             88.676159   89.979204   86.112947   90.003798   89.258160   86.458410
Recall                94.386830   94.654088   92.439889   94.874874   94.611029   92.437967
AUC                   97.464970   97.885518   96.153445   97.979399   97.656028   96.285861
F-measure             91.442422   92.257462   89.164322   92.375166   91.856677   89.348256
Note(s): The italic values represent the good results of this classifier compared to other classifiers

From Table 12, we can see that the RF classifier gives the best classification result for big data mining, with a precision = 90.00379842998227%. In second place comes the ANN classifier with a precision = 89.9792045749935%, a result very close to the previous one obtained using the RF classifier, followed by the LR classifier, which gave a classification precision = 89.25816023738873%, also a result very close to those obtained using RF and ANN. Regarding the SVM classifier, it gives a moderate classification result with a precision = 88.67615997819718%. Finally, the remaining two classifiers, BN and KNN, gave slightly lower classification results compared to the other classifiers mentioned previously (see Figure 16).
Figure 16. Classifiers performance metrics (multi-class classification)

From the results obtained in Tables 10 and 12, we conclude that the RF classifier remains the best classifier for big data mining classification in both cases, whether binary or multi-class classification. We also conclude that SVM is better suited to big data mining classification in the binary case than in the multi-class case. On the other hand, the LR classifier gives us good multi-class classification results compared to binary classification. In addition, ANN gives us results very close to RF in both cases, binary and multi-class classification. Regarding the two classifiers BN and KNN, they give somewhat insufficient results in both cases (binary and multi-class classification). Finally, and as a result of the above, we conclude that the IRF developed by Djafri et al. (2018) gives better classification results than all the classifiers used in our work, because we obtained precision = 94.78324049406636% in the case of binary classification and precision = 94.14018519392236% in the case of multi-class classification (see Figure 17).

Figure 17. Classifiers performance metrics in the case of binary and multi-class classification
Thanks to our DDPML system, we can say that we have achieved several objectives related to the characteristics of big data, such as:
(1) Volume: Reduce data volume (from big data to small data).
(2) Velocity: Speed up execution time (in real time).
(3) Veracity: Getting correct results using the representative learning base (RLB).
(4) Validity: Getting accurate results using our distributed architecture that enables us to
choose the best classifier.
5.2 Practical contribution of our proposed system (DDPML) for big data analytics
DDPML is a system that greatly helps companies and research laboratories that need huge computing resources and high human competence to analyze their big data to save money. It also leaves these institutions free to use their local resources according to the distributed systems available to them; only a simple configuration of these available systems is required.

Figure 18. Big Data Analytics using the DDPML system

From Figure 18, we see that DDPML works comfortably even when the big instances validation base is very large, because sometimes this base can be even bigger than the original dataset. In this case, companies only need a simple distributed and parallel system. But we may face another very difficult problem when these two bases (big instances validation and original dataset) are big at the same time. To solve this problem, it is preferable to apply the DDPML system, which is able to process these data and obtain high precision in a very short time.
5.2.1 The overall discussion. To enrich the discussion and confirm the results obtained, and so that we can compare the results of our proposed work with the results of other works, whether regarding the design of the structure, the classification results or the processing time, some related works are presented, for example but not limited to, in the following:
The survey carried out by Mahmud et al. (2020) explains how to partition big data in distributed systems. They mention that the traditional data partition methods (range and hash) and the data division methods via distributed files (HDFS) often give inaccurate results, because these methods do not take statistical methods into account, which leads to poor results. From this, we conclude that data partitioning with statistical sampling methods applied to it contributes to increasing the accuracy of the results. Among the works presented in this survey, “Approximate cluster computing for big data analysis” is somewhat similar to our proposed work. A survey realized by Liu and Zhang (2020) confirmed to us again that sampling methods reduce the volume of big data more effectively and help to speed up its processing; sampling thus plays an important role in the era of big data, now and in the future. In the same context, Emara and Huang (2020) developed two strategies for distributing data across multiple data centers. In the first strategy, they relied on the Random Sample Partition (RSP) data model to convert big data into sets of random sample data blocks and distribute them across multiple data centers with or without replication. The second strategy allows them to analyze data in any data center by randomly selecting a sample of data blocks copied from other data centers. They concluded from the results obtained in their work that the second strategy is better than the first, because they obtained classification results between 95 and 96% using the random forests classifier, but the processing time was about 23 min. Through this work, we confirm once again that statistical methods, especially random sampling, are very useful when partitioning data in distributed systems. In another work, Ibrahim and Bassiouni (2020) introduced a new partitioning system called Balanced Data Clusters Partitioner (BDCP) to make Hadoop Yarn more efficient in cloud data centers. In their experiments, they used 20 machines with excellent characteristics. Their goal is to minimize the job completion time of Map-Reduce jobs. The runtime results obtained using a skew degree = 0.1 are approximately 100 s for 2 GB, approximately 200 s for 4 GB and just over 200 s for 6 GB. Using a skew degree = 1.1, they obtained execution times between 100 and 150 s for 2 GB, between 200 and 300 s for 4 GB and between 300 and 400 s for 6 GB. Through this work, we understand that we must develop methods that allow us to take advantage of the execution time, especially in distributed systems and parallel computing. Another work, no less important than the previous ones, was proposed by Islam and Amin (2020); they relied on the Distributed Random Forest (DRF) and the Gradient Boosting Machine (GBM), and concluded from their results that DRF gives better classification results (precision = 0.8436) than GBM (precision = 0.7916) when using real data.
At the end of this discussion, we present the survey conducted by Verbraeken et al. (2020), which provides a general and comprehensive overview of the latest technologies in the field of distributed machine learning. They showcased the currently available systems, whether distributed machine learning architectures or distributed machine learning ecosystems. However, we found that these works lack accurate results due to their architecture design (simple and traditional), so that data partitioning is also simple (it does not depend on the leading mathematical methods in this field). These systems also require a fairly long processing time compared to our proposed system. In addition, deploying predictive models requires human skills (developer, data expert, etc.). Furthermore, in their entirety, these systems are intended for private or fixed use, in contrast to the DDPML system, which is intended for organizations that analyze big data. DDPML is a dynamic and vital system because it contributes greatly to fault tolerance; for example, if one, two or three nodes go down, our system always remains in service, because DDPML operates autonomously of the network architecture and the platforms, thanks to the representative learning base that can be processed in any system. This learning base is completely independent of the hardware and its composition. Moreover, the results always remain precise, because the classification process depends on the representative base. DDPML also does not depend on the architectural design of the distributed system or the number of nodes that make up these systems; these mainly help to speed up processing time only.
References
Alam, A. and Ahmed, J. (2014), “Hadoop architecture and its issues”, International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, pp. 288-291, doi: 10.1109/CSCI.2014.140.
Alim, A. and Shukla, D. (2020), “Solution approach to big data regarding parameter estimation
problems in predictive analytics model”, Research Journal of Computer and Information
Technology Sciences, Vol. 8 No. 1, pp. 1-8.
An, C., Lim, H. and Kim, D. (2020), “Machine learning prediction for mortality of patients diagnosed
with covid-19: a nationwide Korean cohort study”, Scientific Reports, Vol. 10, doi: 10.1038/
s41598-020-75767-2.
Andrade, C. (2020), “Sample size and its importance in research”, Indian Journal of Psychological
Medicine, Vol. 42 No. 1, pp. 102-103.
Antal, E. and Tille, Y. (2011), “Simple random sampling with over-replacement”, Journal of Statistical
Planning and Inference, Vol. 141 No. 1, pp. 597-601.
Ardakani, A.A., Kanafi, A., Acharya, U.R., Khadem, N. and Mohammadi, A. (2020), “Application of deep learning technique to manage covid-19 in routine clinical practice using CT images: results of 10 convolutional neural networks”, Computers in Biology and Medicine, Vol. 121, doi: 10.1016/j.compbiomed.2020.103795.
Assunção, M.D., Calheiros, R.N., Bianchi, S., Netto, M.A. and Buyya, R. (2015), “Big data computing and clouds: trends and future directions”, Journal of Parallel and Distributed Computing, Vol. 79, pp. 3-15, doi: 10.1016/j.jpdc.2014.08.003.
Ataro, Y. (1967), Statistics, an Introductory Analysis, 2nd ed., Harper & Row, New York.
Bailly, S., Meyfroidt, G. and Timsit, J. (2018), “What’s new in icu in 2050: big data and machine
learning”, Intensive Care Med, Vol. 44, pp. 1524-1527, doi: 10.1007/s00134-017-5034-3.
Bei, Z., Yu, Z., Luo, N., Jiang, C., Xu, C. and Feng, S. (2018), “Configuring in-memory cluster computing
using random forest”, Future Generation Computer Systems, Vol. 79, pp. 1-15, doi: 10.1016/j.
future.2017.08.011.
Bendechache, M., Tari, A.-K. and Kechadi, M.-T. (2019), “Parallel and distributed clustering framework
for big spatial data mining”, International Journal of Parallel, Emergent and Distributed
Systems, Vol. 34 No. 6, doi: 10.1080/17445760.2018.1446210.
Berndt, A.E. (2020), “Sampling methods”, Journal of Human Lactation, Vol. 36 No. 2, pp. 224-226,
doi: 10.1177/0890334420906850.
Bhandari, A. (2020), Introduction to the Hadoop Ecosystem for Big Data and Data Engineering.
Bhardwaj, P. (2019), “Types of sampling in research”, Journal of the Practice of Cardiovascular
Sciences, Vol. 5 No. 3, pp. 157-163.
Bhaskar, S.B. and Zulfiqar, A. (2016), “Basic statistical tools in research and data analysis”, Indian Journal of Anaesthesia, Vol. 60 No. 9, pp. 662-669, doi: 10.4103/0019-5049.190623.
Bhattacharya, A. and Bhatnagar, S. (2016), “Big data and Apache spark: a review”, International
Journal of Engineering Research Science, Vol. 2 No. 5.
Borthakur, D. (2007), The Hadoop Distributed File System: Architecture and Design, The Apache
Software Foundation.
Brechon, P. (2015), “Random sample, quota sample: the teachings of the evs 2008 survey in France”, BMS:
Bulletin of Sociological Methodology/Bulletin De Methodologie Sociologique, Vol. 126, pp. 67-83.
Brown, D.W., Ford, V. and Ghafoor, S.K. (2020), “A framework for the evaluation of parallel and distributed computing educational resources”, IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), doi: 10.1109/IPDPSW50202.2020.00057.
Bruce, P. and Bruce, A. (2017), Practical Statistics for Data Scientists, O’Reilly Media, Sebastopol, CA.
Bucchianico, A.D., Iapichino, L., Litvak, N., van der Meulen, F. and Wehrens, R. (2019), “Mathematics
for big data”, Book: the Best Writing on Mathematics. doi: 10.2307/j.ctvggx33b.13.
Burhan, U.I.K., Rashidah, F.O., Hunain, A. and Asadullah, S. (2014), “Critical insight for mapreduce optimization in hadoop”, International Journal of Computer Science and Control Engineering, Vol. 2 No. 1, pp. 1-7.
Cebeci, Z. and Yildiz, F. (2016), “Efficiency of random sampling based data size reduction on
computing time and validity of clustering in data mining”, Journal of Agricultural Informatics,
Vol. 7 No. 1, pp. 53-64, doi: 10.17700/jai.2016.7.1.266.
Cen, T., Chu, Q. and He, R. (2019), “Big data mining for investor sentiment”, Journal of Physics:
Conference Series, Vol. 1187 No. 5.
Chan, J.O. (2013), “An architecture for big data analytics”, Communications of the IIMA, Vol. 13
No. 2, pp. 1-13.
Chauhan, R., Kaur, H. and Chang, V. (2017), “Advancement and applicability of classifiers for variant
exponential model to optimize the accuracy for deep learning”, Journal of Ambient Intelligence
and Humanized Computing. doi: 10.1007/s12652-017-0561-x.
Che, D., Safran, M. and Peng, Z. (2013), “From big data to big data mining: challenges, issues, and
opportunities”, Database Systems for Advanced Applications.
Chen, M., Mao, S. and Liu, Y. (2014), “Big data: a survey”, Mobile Networks and Application, Vol. 19
No. 2, pp. 171-209, doi: 10.1007/s11036-013-0489-0.
Chen, W., Xu, S., Jiang, H., Weng, T., Marino, M., Chen, Y. and Li, K. (2016), “Gpu computations
on hadoop clusters for massive data processing”, Proceedings of the 3rd International Conference
on Intelligent Technologies and Engineering Systems (ICITES2014), Springer, pp. 515-521.
Chiroma, H., Abdullahi, U.A., Abdulhamid, S.M., AlArood, A.A., Gabralla, L.A., Rana, N. and Herawan,
T. (2019), “Progress on artificial neural networks for big data analytics: a survey”, IEEE Access,
Vol. 7, doi: 10.1109/access.2018.2880694.
Chung, W.C., Wu, T.L., Lee, Y.H., Huang, K.C., Hsiao, H.C. and Lai, K.C. (2020), “Minimizing resource
waste in heterogeneous resource allocation for data stream processing on clouds”, Applied
Sciences, Vol. 11 No. 1, doi: 10.3390/app11010149.
Cochran, W.G. (1977), Sampling Techniques, 3rd ed., John Wiley and Sons, New York, pp. 4-6.
Concurrency and Computation: Practice and Experience (2016), Parallel and Distributed Computing for Big Data Applications, Wiley Online Library, doi: 10.1002/cpe.3813.
Conti, F. (2015), “Heterogeneous architectures for parallel acceleration”, Doctoral Thesis, University of
Bologna.
Coulet, A., Chawki, M., Jay, N., Shah, N., Wack, M. and Dumontier, M. (2018), “Predicting the need for a
reduced drug dose at first prescription”, Scientific Reports, Vol. 8 No. 1, doi: 10.1038/s41598-018-33980-0.
Dasgupta, A. and Nath, A. (2016), “Classification of machine learning algorithms”, International
Journal of Innovative Research in Advanced Engineering, Vol. 3 No. 3.
Dataflair, T. (2020), Spark Tutorial: Learn Spark Programming.
Davenport, T. and Kim, J. (2013), Keeping up with the Quants, Harvard Business Review Press.
Debauche, O., Mahmoudi, S.A., Mahmoudi, S. and Manneback, P. (2018), “Cloud platform using big
data and hpc technologies for distributed and parallels treatments”, Procedia Computer Science,
Vol. 141, pp. 112-118, doi: 10.1016/j.procs.2018.10.156.
Den-Broeck, V., Sandøy, I.F. and Brestoff, J.R. (2013), “The recruitment, sampling, and enrollment plan”, in Epidemiology: Principles and Practical Guidelines, Springer, pp. 171-196.
Deng, Z., Zhu, X., Cheng, D., Zong, M. and Zhang, S. (2016), “Efficient knn classification algorithm for big data”, Neurocomputing, Vol. 195, pp. 143-148, doi: 10.1016/j.neucom.2015.08.112.
Deshpande, S., Gogtay, N. and Thatte, U. (2016), “Data types”, Journal of The Association of Physicians
of India, Vol. 64.
Dhamodharavadhani, S. and Rathipriya, R. (2019), Enhanced Logistic Regression (ELR) Model for Big Data, IGI Global, doi: 10.4018/978-1-7998-0106-1.ch008.
Dhyani, B. and Barthwal, A. (2014), “Big data analytics using hadoop”, International Journal of Computer Applications, Vol. 108 No. 12.
Djafri, L., Amar-Bensaber, D. and Adjoudj, R. (2018), “Big data analytics for prediction: parallel processing of the big learning base with the possibility of improving the final result of the prediction”, Information Discovery and Delivery, Vol. 46 No. 3, pp. 147-160, doi: 10.1108/IDD-02-2018-0002.
Dong, L.J., Li, X.B. and Peng, K. (2013), “Prediction of rockburst classification using random forest”, Transactions of Nonferrous Metals Society of China, Vol. 23, pp. 472-477, doi: 10.1016/S1003-6326(13)62487-5.
Dunren, C., Mejdl, S. and Zhiyong, P. (2013), “From big data to big data mining: challenges, issues, and
opportunities”, DASFAA Workshops LNCS 7827, pp. 1-15.
Elgendy, N. and Elragal, A. (2014), “Big data analytics: a literature review paper”, in Perner, P. (Ed.),
Advances in Data Mining. Applications and Theoretical Aspects. ICDM, Lecture Notes in
Computer Science, 8557, doi: 10.1007/978-3-319-08976-8-16.
Ellis, G., Bertini, E. and Dix, A. (2005), “The sampling lens: making sense of saturated visualisations”,
Conference on Human Factors in Computing Systems, ACM, Portland, Oregon, USA, pp. 1351-1354.
Emara, T.Z. and Huang, J.Z. (2020), “Distributed data strategies to support large-scale data analysis
across geo-distributed data centers”, IEEE Access, Vol. 8, pp. 178526-178538, doi: 10.1109/
access.2020.3027675.
Erl, T., Khattak, W. and Buhler, P. (2016), Big Data Fundamentals: Concepts, Drivers and Techniques,
Prentice Hall Press.
Espinosa, M.M., Bieski, I. and de Oliveira Martins, D.T. (2012), “Probability sampling design in ethnobotanical surveys of medicinal plants”, Revista Brasileira de Farmacognosia, Vol. 22 No. 6, doi: 10.1590/S0102-695X2012005000091.
Etikan, I. and Bala, K. (2017), “Sampling and sampling methods”, Biometrics and Biostatistics
International Journal, Vol. 5 No. 6, pp. 138-149, doi: 10.15406/bbij.2017.05.00149.
Fang, Y., Chen, Q. and Xiong, N. (2019), “A multi-factor monitoring fault tolerance model based on a
gpu cluster for big data processing”, Information Sciences, Vol. 496, pp. 300-316.
Fei, S. (2015), “Study on a stratified sampling investigation method for resident travel and the
sampling rate”, Discrete Dynamics in Nature and Society, 496179, doi: 10.1155/2015/496179.
Fellers, P.S. and Kuiper, S. (2020), “Introducing undergraduates to concepts of survey data analysis”,
Journal of Statistics Education, Vol. 28 No. 1, pp. 18-24, doi: 10.1080/10691898.2020.1720552.
Ferguson, M. (2013), Enterprise Information Protection- the Impact of Big Data, IBM.
Gandomi, A., Movaghar, A. and Reshadi, M. (2020), “Designing a mapreduce performance model in
distributed heterogeneous platforms based on benchmarking approach”, The Journal of
Supercomputing, Vol. 76, pp. 7177-7203, doi: 10.1007/s11227-020-03162-9.
Gong, Y., Xie, H., Tong, X., Jin, Y., Xv, X. and Wang, Q. (2020), “Area estimation of multi-temporal global impervious land cover based on stratified random sampling”, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 103-108, doi: 10.5194/isprs-archives-XLIII-B4-2020-103-2020.
Gonzalez, J., Xin, A.D.R. and Crankshaw, D. (2014), “Graphx: graph processing in a distributed
dataflow framework”, Proceeding OSDI’14 Proceedings of the 11th USENIX Conference on
Operating Systems Design and Implementation, pp. 599-613.
Goodman-Meza, D., Rudas, A., Chiang, J., Adamson, P., Ebinger, J. and Sun, N. (2020), “A machine learning algorithm to increase covid-19 inpatient diagnostic capacity”, PLoS ONE, Vol. 15 No. 9, doi: 10.1371/journal.pone.0239474.
Gravetter, J. and Forzano, B. (2012), “Selecting research participants”, Behavior Research Methods,
pp. 125-139.
Gupta, A. and Jain, S. (2015), “Estimation of sample size in dental research”, International Dental and
Medical Journal of Advanced Research, Vol. 1, doi: 10.15713/ins.idmjar.9.
Haoyuan, L., Matei, Z., Scott, S., Tathagata, D., Timothy, H. and Ion, S. (2013), “Discretized streams: fault-tolerant streaming computation at scale”, SOSP’13, ACM, Farmington, Pennsylvania, USA, Nov. 3-6, doi: 10.1145/2517349.2522737.
Hariri, R., Fredericks, E. and Bowers, K. (2019), “Uncertainty in big data analytics: survey,
opportunities, and challenges”, Journal of Big Data, Vol. 44 No. 6, doi: 10.1186/s40537-019-
0206-3.
HLG-BAS (2011), “Strategic vision of the high-level group for strategic developments in business
architecture in statistics”, Conference of European Statisticians, 59th Plenary.
Honnutagi, P. (2014), “The hadoop distributed file system”, International Journal of Computer Science
and Information Technologies, Vol. 5 No. 5, pp. 6238-6243.
Howell, C., Su, W. and Nassel, A. (2020), “Area based stratified random sampling using geospatial
technology in a community-based survey”, BMC Public Health, Vol. 20, doi: 10.1186/s12889-020-
09793-0.
Iachan, R., Berman, L., Kyle, T.M., Martin, K.J., Deng, Y., Moyse, D.N., Middleton, D. and Atienza, A.A.
(2019), “Weighting nonprobability and probability sample surveys in describing cancer
catchment areas”, Cancer Epidemiol Biomarkers Prev, Vol. 28 No. 3, pp. 471-477, doi: 10.1158/
1055-9965.EPI-18-0797.
IBM (2014), The Top Five Ways to Get Started with Big Data.
Ibrahim, I. and Bassiouni, M. (2020), “Improvement of job completion time in data-intensive cloud
computing applications”, Journal of Cloud Computing, Vol. 9 No. 8, doi: 10.1186/s13677-019-
0139-6.
Inderpal, S. (2013), “Review on parallel and distributed computing”, Scholars Journal of Engineering
and Technology, Vol. 1 No. 4, pp. 218-225.
Islam, S. and Amin, S. (2020), “Prediction of probable backorder scenarios in the supply chain using
distributed random forest and gradient boosting machine learning techniques”, Journal of Big
Data, Vol. 7 No. 1, doi: 10.1186/s40537-020-00345-2.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013), “Statistical learning”, in An Introduction to Statistical Learning, Springer Texts in Statistics, Vol. 103, Springer, New York, NY, pp. 15-57.
Jaradat, M., Jarrah, M., Bousselham, A., Jararweh, Y. and Al-Ayyouba, M. (2015), “The internet of
energy: smart sensor networks and big data management for smart grid”, Procedia Computer
Science, Vol. 56, pp. 592-597.
Jeong, H. and Cha, K.J. (2019), “An efficient mapreduce based parallel processing framework for user
based collaborative filtering”, Symmetry, Vol. 11 No. 6, doi: 10.3390/sym11060748.
Jun, S., Lee, S. and Ryu, J. (2015), “A divided regression analysis for big data”, International Journal of
Software Engineering and Its Applications, Vol. 9 No. 5, pp. 21-32.
Jun, C., Lee, J.Y. and Kim, B.H. (2019), “Cloud-based big data analytics platform using algorithm templates for the manufacturing industry”, International Journal of Computer Integrated Manufacturing, Vol. 32, pp. 723-738, doi: 10.1080/0951192X.2019.1610578.
Kambatla, K., Kollias, G., Kumarand, V. and Grama, A. (2014), “Trends in big data analytics”, Journal
of Parallel and Distributed Computing, Vol. 74 No. 7, pp. 2561-2573, doi: 10.1016/j.jpdc.2014.
01.003.
Kandel, S., Hellerstein, J., Paepcke, A. and Heer, J. (2012), “Enterprise data analysis and visualization: an interview study”, IEEE Transactions on Visualization and Computer Graphics, Vol. 18 No. 12, pp. 2917-2926, doi: 10.1109/TVCG.2012.219.
Katal, A., Wazid, M. and Goudar, R. (2013), “Big data: issues, challenges, tools and good practices”,
Contemporary Computing (IC3) Sixth International Conference, IEEE, pp. 404-409.
Kayyali, B., Knott, D. and Van Kuiken, S. (2013), The Big-Data Revolution in US Health Care: Accelerating Value and Innovation, McKinsey & Company, Vol. 2 No. 8, pp. 1-13.
Khan, N., Shah, H., Badsha, G., Abbasi, A.A., Alsaqer, M. and Salehian, S. (2018), “10 Vs, issues and challenges of big data”, International Conference on Big Data and Education ICBDE ’18, pp. 203-210.
Kiran, M., Murphy, P., Monga, I., Dugan, J. and Baveja, S. (2015), “Lambda architecture for cost-effective batch and speed big data processing”, IEEE International Conference on Big Data, doi: 10.1109/BigData.2015.7364082.
Klaine, P.V., Imran, M.A., Onireti, O. and Souza, R.D. (2017), “A survey of machine learning techniques
applied to self-organizing cellular networks”, IEEE Communications Surveys and Tutorials,
Vol. 19 No. 4, pp. 2392-2431, doi: 10.1109/COMST.2017.2727878.
Steven, T.K. (2012), Sampling, Chapter 6: Unequal Probability Sampling, 3rd ed., John Wiley & Sons.
Kulkarni, A.P. and Khandewal, M. (2014), “Survey on hadoop and introduction to yarn”, International
Journal of Emerging Technology and Advanced Engineering, Vol. 4 No. 5.
Lalmuanawma, S., Hussain, J. and Chhakchhuak, L. (2020), “Applications of machine learning and
artificial intelligence for covid-19 (sars-cov-2) pandemic: a review”, Chaos, Solitons and Fractals,
Vol. 139 No. C, doi: 10.1016/j.chaos.2020.110059.
Landis, J. and Koch, G. (1977), “The measurement of observer agreement for categorical data”,
Biometrics, Vol. 33 No. 1, pp. 159-174.
Lee, K., Fitts, M. and Conigrave, J. (2020), “Recruiting a representative sample of urban south
australian aboriginal adults for a survey on alcohol consumption”, BMC Medical Research
Methodology. doi: 10.1186/s12874-020-01067-y.
Li, J. and Liu, H. (2017), “Challenges of feature selection for big data analytics”, IEEE Intelligent
Systems, Vol. 32 No. 2, pp. 9-15, doi: 10.1109/mis.2017.38.
Li, Y., Hai-Tao, Z. and Jorge, G. (2020), A Machine Learning-Based Model for Survival Prediction in Patients with Severe Covid-19 Infection, MedRxiv, doi: 10.1101/2020.02.27.20028027.
Liu, Z. and Zhang, A. (2020), “Sampling for big data profiling: a survey”, IEEE Access, Vol. 8, pp. 72713-72726, doi: 10.1109/ACCESS.2020.2988120.
Lu, X. and Zhan, J. (2020), “Workshop 7: hpbdc high-performance big data and cloud computing”,
IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW),
doi: 10.1109/IPDPSW50202.2020.00073.
MacInnis, B., Krosnick, J.A., Ho, A.S. and Cho, M.-J. (2018), “The accuracy of measurements with
probability and nonprobability survey samples: replication and extension”, Public Opinion
Quarterly, Vol. 82 No. 4, pp. 707-744, doi: 10.1093/poq/nfy038.
Mahmud, M.S., Huang, J.Z., Salloum, S., Emara, T.Z. and Sadatdiynov, K. (2020), “A survey of data
partitioning and sampling methods to support big data analysis”, Big Data Mining and
Analytics, Vol. 3 No. 2, pp. 85-101, doi: 10.26599/BDMA.2019.9020015.
Mathkunti, N. and Rangaswamy, S. (2020), “Machine learning techniques to identify dementia”, SN
Comput Sci, Vol. 118 No. 1, doi: 10.1007/s42979-020-0099-4.
Mayya, S., Monteiro, A. and Ganapathy, S. (2017), “Types of biological variables”, Journal of Thoracic
Disease, Vol. 9 No. 6, pp. 1730-1733, doi: 10.21037/jtd.2017.05.75.
Mazhar, R., Awais, A. and Anand, P. (2016), “Real time intrusion detection system for ultra-high-speed
big data environments”, Journal of Supercomputing, Vol. 72, pp. 3489-3510, doi: 10.1007/s11227-
015-1615-5.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Matei, Z. and Talwalkar, A. (2016), “Mllib: machine learning in Apache spark”, Journal of Machine Learning Research, Vol. 17 No. 34, pp. 1-7.
Mohan, A., Venkatesan, R. and Pramod, K. (2017), “A scalable method for link prediction in large real
world networks”, Journal of Parallel and Distributed Computing. doi: 10.1016/j.jpdc.2017.05.009.
Moorley, C. and Shorten, A. (2014), “Selecting the sample”, Evidence Based Nursing, Vol. 17 No. 2, pp. 32-33, doi: 10.1136/eb-2014-101747.
Mostafaeipour, A., Rafsanjani, J., Ahmadi, M. and Dhanraj, A. (2020), “Investigating the performance
of hadoop and spark platforms on machine learning algorithms”, The Journal of
Supercomputing. doi: 10.1007/s11227-020-03328-5.
Muhammad, L., Algehyne, E., Usman, S., Ahmad, A., Chakraborty, C. and Mohammed, I. (2020),
“Supervised machine learning models for prediction of covid-19 infection using epidemiology
dataset”, SN Computer Science, Vol. 2 No. 1, doi: 10.1007/s42979-020-00394-7.
Muthusami, R. and Saritha, K. (2020), “Statistical analysis and visualization of the potential cases of
pandemic coronavirus”, VirusDis, Vol. 31, pp. 204-208, doi: 10.1007/s13337-020-00610-1.
Nguyen, D., Long, T., Jia, X., Lu, W., Gu, X., Iqbal, Z. and Jiang, S. (2019), “A feasibility study for
predicting optimal radiation therapy dose distributions of prostate cancer patients from patient
anatomy using deep learning”, Scientific Reports, Vol. 9 No. 1, doi: 10.1038/s41598-018-37741-x.
Okororie, C. and Otuonye, E.L. (2015), “Efficiency of some sampling techniques”, Journal of Scientific
Research and Studies, Vol. 2 No. 3, pp. 63-69.
Oussous, A., Benjelloun, F.-Z., Lahcen, A. and Belfkih, S. (2017), “Big data technologies: a survey”,
Journal of King Saud University - Computer and Information Sciences. doi: 10.1016/j.jksuci.2017.
06.001.
Ozturk, T., Talo, M., Yildirim, E.A., Baloglu, U.B., Yildirim, O. and Acharya, U.R. (2020), “Automated
detection of covid-19 cases using deep neural networks with x-ray images”, Computers in
Biology and Medicine. doi: 10.1016/j.compbiomed.2020.103792.
O’Donovan, P., Leahy, K., Bruton, K. and O’Sullivan, T.J. (2015), “Big data in manufacturing: a
systematic mapping study”, Journal of Big Data, Vol. 20 No. 2, doi: 10.1186/s40537-015-0028-x.
Padilla, M., Olofsson, P., Stehman, V.S., Tansey, K. and Chuvieco, E. (2017), “Stratification and sample
allocation for reference burned area data”, Remote Sensing of Environment, Vol. 203,
pp. 240-255, doi: 10.1016/j.rse.2017.06.041.
Palanisamy, V. and Thirunavukarasu, R. (2019), “Implications of big data analytics in developing
healthcare frameworks – a review”, Journal of King Saud University – Computer and
Information Sciences, Vol. 31 No. 4, pp. 415-425, doi: 10.1016/j.jksuci.2017.12.007.
Pandey, K.K. and Shukla, D. (2019), “Optimized sampling strategy for big data mining through
stratified sampling”, International Journal of Scientific and Technology Research, Vol. 8 No. 11.
Pandey, K. and Shukla, D. (2020), “Stratified sampling-based data reduction and categorization model
for big data mining”, in Bansal, J., Gupta, M., Sharma, H. and Agarwal, B. (Eds), Communication
and Intelligent Systems. ICCIS 2019. Lecture Notes in Networks and Systems 120, Springer,
Singapore.
Peter, S. (1976), “The foundations of survey sampling: a review”, Journal of the Royal Statistical
Society, Vol. 139 No. 2, pp. 183-204.
Pham, Q., Nguyen, D.C., Huynh-The, T., Hwang, W. and Pathirana, P.N. (2020), “Artificial intelligence
(ai) and big data for coronavirus (covid-19) pandemic: a survey on the state-of-the-arts”, IEEE
Access, Vol. 8, pp. 130820-130839, doi: 10.1109/ACCESS.2020.3009328.
Poornima, S. and Pushpalatha, M. (2016), “A journey from big data towards prescriptive analytics”,
Arpn Journal of Engineering and Applied Sciences, Vol. 19 No. 11.
Pop, F., Dobre, C. and Costan, A. (2017), “AutoCompBD: Autonomic computing and big data platforms”, Soft Computing, Vol. 21 No. 16, pp. 4497-4499, doi: 10.1007/s00500-017-2739-8.
Prakash, V. and Atul, P. (2016), “Comparison of mapreduce and spark programming frameworks for
big data analytics on hdfs”, International Journal of Computer Science Communication, Vol. 7
No. 2, pp. 80-84.
Puech, P.L., Cardot, H. and Goga, C. (2014), “Analysing large datasets of functional data: a survey
sampling point of view”, Journal de la Societe Francaise de Statistique, Vol. 155 No. 4.
Rahul, V. and Pravin, P. (2016), “A survey on: predictive analytics for credit risk assessment”,
International Research Journal of Engineering and Technology, Vol. 3.
Reddy, G.T., Reddy, M.P.K., Lakshmanna, K., Kaluri, R., Rajput, D.S., Srivastava, G. and Baker, T. (2020), “Analysis of dimensionality reduction techniques on big data”, IEEE Access, Vol. 8, pp. 54776-54788, doi: 10.1109/access.2020.2980942.
Ripon, P. and Arif, A. (2016), “Big data: the v’s of the game changer paradigm”, IEEE 18th
International Conference on High Performance Computing and Communications ; IEEE 14th
International Conference on Smart City ; IEEE 2nd International Conference on Data Science
and Systems. doi: 10.1109/HPCC-SmartCity-DSS.2016.8.
Robbins, I.W., Ghosh-Dastidar, B. and Ramchand, R. (2020), “Blending probability and nonprobability
samples with applications to a survey of military caregivers”, Journal of Survey Statistics and
Methodology. doi: 10.1093/jssam/smaa037.
Rojas, J.A.R., Kery, M.B., Rosenthal, S. and Dey, A.K. (2017), “Sampling techniques to improve big
data exploration”, 2017 IEEE 7th Symposium on Large Data Analysis and Visualization
(LDAV), doi: 10.1109/LDAV.2017.8231848.
Roos, Deutsch, Corrigan, Zikopoulos, Parasuraman and Giles (2013), Harness the Power of Big Data:
The Ibm Big Data Platform, McGraw-Hill, New York.
Sadrfaridpour, E., Razzaghi, T. and Safro, I. (2019), “Engineering fast multilevel support vector
machines”, Machine Learning, Vol. 108, doi: 10.1007/s10994-019-05800-7.
Sathyaraj, R., Ramanathan, L., Lavanya, K., Balasubramanian, V. and Saira Banu, J. (2020), “Chicken
swarm foraging algorithm for big data classification using the deep belief network classifier”,
Data Technologies and Applications. doi:10.1108/DTA-08-2019-0146.
Schifano, E.D., Wu, J., Wang, C., Yan, J. and Chen, M.H. (2016), “Online updating of statistical inference
in the big data setting”, Technometrics. doi: 10.1080/00401706.2016.1142900.
Shmueli, G. and Koppius, O. (2011), “Predictive analytics in information systems research”, MIS Quarterly, Vol. 35 No. 3, pp. 553-572.
Schwab-McCoy, A., Baker, C.M. and Gasper, R.E. (2020), “Data science in 2020: computing, curricula, and challenges for the next 10 years”, Journal of Statistics Education, doi: 10.1080/10691898.2020.1851159.
Scutari, M., Vitolo, C. and Tucker, A. (2019), “Learning bayesian networks from big data with greedy
search: computational complexity and efficient implementation”, Statistics and Computing,
Vol. 29, pp. 1095-1108, doi: 10.1007/s11222-019-09857-1.
Sharma, R. and Singh, S.N. (2019), “Data mining classification techniques – comparison for better
accuracy in prediction of cardiovascular disease”, International Journal of Data Analysis
Techniques and Strategies, Vol. 11 No. 4.
Shen, E. (2020), “On the use of sampling weights for retrospective medical record reviews”, The
Permanente Journal, Vol. 24, doi: 10.7812/TPP/18.308.
Shim, K., Cha, S., Chen, L., Han, W.-S., Srivastava, D., Tanaka, K., Yu, H. and Zhou, X. (2012), “Data
management challenges and opportunities in cloud computing”, 17th International Conference
on Database Systems for Advanced Applications (DASFAA’2012), Springer, Berlin/Heidelberg.
Siirtola, P. and Röning, J. (2020), “Comparison of regression and classification models for user-independent and personal stress detection”, Sensors.
Singh, A.S. and Masuku, M.B. (2014), “Sampling techniques and determination of sample size in applied statistics research: an overview”, International Journal of Economics, Commerce and Management, Vol. 2 No. 11, pp. 1-22.
Singh, A., Choudhary, S. and Kumari, M. (2020), “Hadoop ecosystem analytics and big data for
advanced computing platforms”, International Journal of Advanced Science and Technology,
Vol. 29 No. 5, pp. 6633-6642.
Sliwinski, T. and Kang, S. (2017), Applying Parallel Computing Techniques to Analyze Terabyte Atmospheric Boundary Layer Model Outputs, Elsevier, doi: 10.1016/j.bdr.2017.01.001.
Sun, Z. and Wang, P. (2017), “A mathematical foundation of big data”, New Mathematics and Natural
Computation, Vol. 13 No. 2, doi: 10.1142/s1793005717400014.
Sun, L., Liu, G., Song, F., Shi, N., Liu, F., Li, S., Li, P., Zhang, W., Jiang, X., Zhang, Y., Sun, L., Chen, X.
and Shi, Y. (2020), “Combination of four clinical indicators predicts the severe/critical symptom
of patients infected covid-19”, Journal of Clinical Virology. doi: 10.1016/j.jcv.2020.104431.
Taherdoost, H. (2016), “Sampling methods in research methodology; how to choose a sampling
technique for research”, International Journal of Academic Research in Management.
Trovati, M. and Bessis, N. (2015), “An influence assessment method based on co-occurrence for topologically reduced big data sets”, Soft Computing, pp. 1-10.
Tukey, J. (1977), Exploratory Data Analysis, Addison-Wesley, Reading, MA.
Turner, D.P. (2020), “Sampling methods in research design”, Headache: The Journal of Head and Face
Pain, Vol. 60 No. 1, pp. 8-12, doi: 10.1111/head.13707.
Ur Rehman, M.H., Liew, C.S., Abbas, A., Jayaraman, P., Wah, T.Y. and Khan, S.U. (2016), “Big data reduction methods: a survey”, Data Science and Engineering, Vol. 1, pp. 265-284.
Van Steen, M. and Tanenbaum, A.S. (2016), “A brief introduction to distributed systems”, Computing,
Vol. 98, pp. 967-1009, doi: 10.1007/s00607-016-0508-7.
Velliangiri, S., Alagumuthukrishnan, S. and Joseph, S.I.T. (2019), “A review of dimensionality reduction techniques for efficient computation”, Procedia Computer Science, Vol. 165, pp. 104-111, doi: 10.1016/j.procs.2020.01.079.
Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T. and Rellermeyer, J.S. (2020), “A survey
on distributed machine learning”, ACM Computing Surveys, Vol. 53 No. 2, doi: 10.1145/3377454.
Verma, N., Malhotra, D. and Singh, J. (2020), “Big data analytics for retail industry using mapreduce-
apriori framework”, Journal of Management Analytics, Vol. 7 No. 3, pp. 424-442, doi: 10.1080/
23270012.2020.1728403.
Wah, B.W. (2008), Interconnection Networks for Parallel Computers, Wiley Encyclopedia of Computer
Science and Engineering.
Wei, C. and Chou, T. (2020), “Typhoon quantitative rainfall prediction from big data analytics by
using the Apache hadoop spark parallel computing framework”, Atmosphere, Vol. 11, doi: 10.
3390/atmos11080870.
Weihs, C. and Ickstadt, K. (2018), “Data science: the impact of statistics”, International Journal of Data
Science and Analytics, Vol. 6, pp. 189-194, doi: 10.1007/s41060-018-0102-5.
West, P. (2016), “Simple random sampling of individual items in the absence of a sampling frame that
lists the individuals”, New Zealand Journal of Forestry Science, Vol. 46 No. 15, doi: 10.1186/
s40490-016-0071-1.
Wu, J., Zhang, P., Zhang, L., Meng, W., Li, J., Tong, C., Li, Y., Cai, J., Yang, Z., Zhu, J., Zhao, M., Huang,
H., Xie, X. and Li, S. (2020), Rapid and Accurate Identification of Covid-19 Infection through
Machine Learning Based on Clinical Available Blood Test Results, medRxiv, doi: 10.1101/2020.04.
02.20051136.
Xindong, W., Xingquan, Z., Gong-Qing, W. and Wei, D. (2014), “Data mining with big data”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 26 No. 1, pp. 97-107, doi: 10.1109/TKDE.
2013.109.
Xing, W. and Bei, Y. (2020), “Medical health big data classification based on knn classification algorithm”, IEEE Access, Vol. 8, pp. 28808-28819, doi: 10.1109/ACCESS.2019.2955754.
Xingquan, Z. and Ian, D. (2007), Knowledge Discovery and Data Mining: Challenges and Realities,
Hershey, New York, ISBN: 978-1-59904-252.
Yadav, R. and Tailor, R. (2020), “Estimation of finite population mean using two auxiliary variables
under stratified random sampling”, Statistics in Transition New Series, Vol. 21 No. 1, pp. 1-12,
doi: 10.21307/stattrans-2020-001.
Yanchao, D., Jiguang, Y., Yan, Z. and Zhencheng, H. (2016), “Comparison of random forest, random ferns and support vector machine for eye state classification”, Multimedia Tools and Applications, Vol. 75, pp. 11763-11783, doi: 10.1007/s11042-015-2635-0.
Yang, C., Chen, S., Liu, J., Liu, R. and Chang, C. (2020), “On construction of an energy monitoring
service using big data technology for the smart campus”, Cluster Computing, Vol. 23 No. 1,
doi: 10.1007/s10586-019-02921-5.
Zanoon, N., Alkharabsheh, K. and Ryalat, M.H. (2020), “Optimizing mapreduce model for big data
analytics using subtractive clustering algorithm”, International Journal of Advanced Science
and Technology, Vol. 29 No. 4, pp. 4106-4119.
Zhang, Y., Ren, S., Liu, Y., Sakao, T. and Huisingh, D. (2017), “A framework for big data driven
product lifecycle management”, Journal of Cleaner Production, Vol. 159, pp. 229-240.
Zhao, X., Liang, J. and Dang, C. (2019), “A stratified sampling based clustering algorithm for large-
scale data”, Knowledge-Based Systems, Vol. 163, pp. 416-428, doi: 10.1016/j.knosys.2018.09.007.
Appendix
A1. Some concepts in mathematical statistics
A1.2 Bias
The bias is used to detect the possible presence of systematic error. It corresponds to the difference between the mathematical expectation of the estimator of a parameter and the parameter itself:

$$\mathrm{Bias} = E(\hat{\theta}) - \theta$$
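As a quick illustration (not part of the DDPML system), the following Python sketch estimates the bias of the plug-in variance estimator by Monte Carlo simulation; the sample size, number of trials and true variance are arbitrary values chosen for the example.

```python
# Hedged sketch: Monte Carlo estimate of Bias = E(theta_hat) - theta for the
# plug-in variance estimator, whose expectation is ((n-1)/n) * sigma^2.
import random
import statistics

random.seed(0)
n, trials, sigma2 = 10, 20000, 4.0            # arbitrary example values
estimates = []
for _ in range(trials):
    sample = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    m = statistics.fmean(sample)
    estimates.append(sum((x - m) ** 2 for x in sample) / n)  # divide by n, not n-1

bias = statistics.fmean(estimates) - sigma2   # Bias = E(theta_hat) - theta
print(f"empirical bias ~ {bias:.3f}, theoretical bias = {-sigma2 / n:.3f}")
```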
A1.4 Variance
The variance of a statistical series is the number noted V such that:

$$V = \frac{n_1(x_1 - \bar{x})^2 + n_2(x_2 - \bar{x})^2 + \dots + n_p(x_p - \bar{x})^2}{N}$$

We notice:

$$V = \frac{1}{N}\sum_{i=1}^{p} n_i (x_i - \bar{x})^2$$

Remark:
(1) The variance is a positive number.
(2) With the frequencies $f_i = n_i / N$, we also have:

$$V = \sum_{i=1}^{p} f_i (x_i - \bar{x})^2$$
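To make the grouped-data formulas concrete, here is a minimal Python sketch (the values are illustrative, not taken from the paper) that computes V in both the count form and the equivalent frequency form:

```python
# Hedged sketch: grouped-data variance V = (1/N) * sum(n_i * (x_i - mean)^2),
# and the equivalent frequency form with f_i = n_i / N.
values = [2.0, 5.0, 8.0]      # distinct values x_i (illustrative)
counts = [3, 5, 2]            # multiplicities n_i (illustrative)

N = sum(counts)
mean = sum(n * x for n, x in zip(counts, values)) / N
V = sum(n * (x - mean) ** 2 for n, x in zip(counts, values)) / N

freqs = [n / N for n in counts]                                # f_i = n_i / N
V_freq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values))
assert abs(V - V_freq) < 1e-9                                  # same quantity
print(f"mean = {mean}, variance = {V}")
```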
A1.5 Precision
The precision is the ratio of the number of true positives to the sum of true positives and false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
A1.12 Kappa
The Kappa coefficient measures the degree of agreement between two or more judges:

$$k = \frac{P_0 - P_e}{1 - P_e}$$

with P0 the proportion of the sample on which the two judges agree (i.e. the main diagonal of the confusion matrix), and

$$P_e = \frac{\sum_i P_{i.}\, P_{.i}}{n^2}$$

where $P_{i.}$ and $P_{.i}$ are the row and column totals of the confusion matrix for category i, and n is the total number of observations.
K                 Interpretation (Landis and Koch, 1977)
< 0               Poor agreement
0.00-0.20         Slight agreement
0.21-0.40         Fair agreement
0.41-0.60         Moderate agreement
0.61-0.80         Substantial agreement
0.81-1.00         Almost perfect agreement
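The sketch below (a toy 2x2 confusion matrix, not the paper's experimental results) computes the precision and Kappa defined above; on this example k = 0.7, which the table reads as substantial agreement:

```python
# Hedged sketch: precision and Cohen's kappa from a toy confusion matrix
# (illustrative counts; class 0 is taken as the "positive" class).
cm = [[40, 10],   # row = actual class, column = predicted class
      [5, 45]]

n = sum(sum(row) for row in cm)
tp, fp = cm[0][0], cm[1][0]                      # true/false positives for class 0
precision = tp / (tp + fp)                       # Precision = TP / (TP + FP)

p0 = sum(cm[i][i] for i in range(2)) / n         # observed agreement P0 (diagonal)
pe = sum(sum(cm[i]) * sum(r[i] for r in cm)      # Pe: row total * column total,
         for i in range(2)) / n ** 2             # summed over classes, over n^2
kappa = (p0 - pe) / (1 - pe)

print(f"precision = {precision:.3f}, kappa = {kappa:.3f}")   # 0.889, 0.700
```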
Corresponding author
Laouni Djafri can be contacted at: [email protected]