


A Random Sample Partition Data Model for Big Data Analysis

Salman Salloum, Yulin He, Joshua Zhexue Huang, Xiaoliang Zhang, Tamer Z. Emara,
Chenghao Wei, Heping He
Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University,
Shenzhen 518060, Guangdong, China
{ssalloum, yulinhe, zx.huang, zhangxlassz, tamer, chenghao.wei}@szu.edu.cn
[email protected]

arXiv:1712.04146v2 [cs.DC] 20 Jan 2018

ABSTRACT

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has a probability distribution similar to the whole big data set. Under this data model, efficient block level sampling is used to randomly select RSP data blocks, replacing expensive record level sampling for selecting sample data from a big distributed data set on a computing cluster. We show how RSP data blocks can be employed to estimate statistics of a big data set and to build models which are equivalent to those built from the whole big data set. In this approach, analysis of a big data set becomes analysis of a few RSP data blocks which have been generated in advance on the computing cluster. Therefore, the new method for data analysis based on RSP data blocks is scalable to big data.

1. KEY INSIGHTS

• Big data analysis requires not only computationally efficient, but also statistically effective, approaches.

• A random sample partition (RSP) data model for big data can be used efficiently and effectively in building ensemble models, estimating statistics, and other big data analysis tasks on a computing cluster.

• Experiments show that a sample of RSP data blocks from an RSP data model can be used to get approximate analysis results equivalent to those obtained from the whole data set.

• In addition to the scalability advantage, adopting this approach saves time and computing resources.

∗Corresponding Author

2. INTRODUCTION

Big data analysis is a challenging problem in many application areas, especially with the ever increasing volume of data. In this regard, divide-and-conquer is used as a common strategy in current big data frameworks to distribute both data and computation on computing clusters. A big data set is divided into smaller data blocks and distributed on the nodes of a computing cluster so that data-parallel computations can be run on these blocks. However, analyzing a big data set all at once may require more computing resources than are available on the computing cluster. Moreover, current frameworks require new data-parallel implementations of traditional data mining and analysis algorithms. In order to enable efficient and effective big data analysis when data volume goes beyond the available computing resources, a different approach is required, one which considers the computational as well as the statistical aspects of big data on both the data management and the data analysis levels. Such an approach is essential for investigating a key research question: must the full data set be used to find properties and reveal valuable insights from big data, or is a subset of this data good enough?

Although the existing mainstream big data frameworks (e.g., Hadoop's MapReduce [2], Apache Spark [11]) run data-parallel computations over distributed data blocks on computing clusters, current data partitioning techniques do not consider the probability distributions of data in these blocks. Sequentially chunking a big data set into small data blocks does not guarantee that each block is a random sample when the data is not randomly ordered in the big data set. In such a case, using data blocks directly to estimate statistics and build models may lead to statistically incorrect or biased results. Furthermore, classical random sampling techniques, which require a full scan of the data set each time a random sample is generated, are no longer effective with the increasing volume of big data sets stored in distributed systems [10]. Thus, partitioning a big data set into small data subsets (i.e., data blocks), each being a random sample of the whole data set, is a fundamental operation for big data analysis. These data blocks can be used to estimate statistics and build models, especially when analyzing the whole big data set requires more than the available resources in order to meet specific application requirements [4][3]. Therefore, it is necessary to develop statistically-aware data partitioning methods which enable effective and efficient usage of data blocks to fulfill the statistical and scalability requirements of big data analysis tasks.
Multivariate data is a common form of data in many application areas. Let D = {x1, x2, · · · , xN} be a multivariate data set of N records, where N is too big for statistical analysis of D on a single machine. Each record is described by M attributes or features, i.e., xi = (xi1, xi2, · · · , xiM) for any i ∈ {1, 2, · · · , N}. In order to make data partitioning statistically aware, we propose the random sample partition (RSP) data model to represent D as a set of non-overlapping data blocks (called RSP data blocks) where each block itself is a random sample. If the records in D are independently and identically distributed (i.i.d.), we call D a randomized data set (i.e., the order of records in D is random). Our previous empirical study [8] showed that data sets are naturally randomized in many application areas and thus can be directly represented as RSP data models using the current data partitioning techniques. In case the data is not randomized, scalable algorithms are developed to randomize big data sets on computing clusters.

The RSP data blocks in an RSP can be drawn directly as random samples of the whole big data set in data exploration and analysis tasks. Since the RSP data blocks are generated in advance and stored on the computing cluster, randomly selecting a set of RSP data blocks is much more efficient than sampling a set of records from a distributed big data file, because a full scan of the whole file is no longer needed. In this article, we show that a small sample of a few RSP data blocks from an RSP is enough to build models and compute statistical estimates which are equivalent to those calculated from the whole big data set. We propose the asymptotic ensemble learning framework for big data analysis, which depends on ensemble methods as a general approach for building block-based ensemble models from RSP data blocks. With this framework, results can be improved incrementally without the need to load and analyze the whole big data set all at once. This approach can be generalized for statistics estimation by defining appropriate ensemble functions (e.g., averaging the means from different RSP data blocks).

In this article, we introduce the RSP data model of a big data set and show how RSP data blocks are essential for efficient and effective big data analysis. We also discuss how this new RSP data model can be employed for different data analysis tasks using the asymptotic ensemble learning framework. Finally, we summarize the implications of this new model and conclude the article with some of our current works.

3. DISTRIBUTED DATA-PARALLEL COMPUTING

As data volume in different application areas goes beyond the petabyte scale, divide-and-conquer is taken as a general strategy to process big data on computing clusters, considering the recent advancements in distributed and parallel computing technology. In this strategy, a big data file is chunked into small non-overlapping data blocks and distributed on the nodes of a computing cluster using a distributed file system such as the Hadoop distributed file system (HDFS) [9]. Then, data-parallel computations are run on these blocks considering data locality. After that, intermediate results from the processed blocks are integrated to produce the final result for the whole data set. This is usually done using the MapReduce computing model [2] adopted by mainstream big data frameworks such as Apache Hadoop (http://hadoop.apache.org/), Apache Spark (https://spark.apache.org/), and Microsoft R Server (https://www.microsoft.com/en-us/cloud-platform/r-server).

As a unified engine for big data processing, Apache Spark has been adopted in a variety of applications in both academia and industry [12][7] as a new generation engine after Hadoop's MapReduce. It uses a new data abstraction and in-memory computation model, the resilient distributed datasets (RDDs) [11], where collections of objects (e.g., records) are partitioned across a cluster, kept in memory and processed in parallel. Similarly, Microsoft R Server addresses the in-memory limitations of the open source statistical system R by adding parallel and chunk-wise distributed processing across multiple cores and nodes. It comes with the proprietary eXternal data frame (XDF) format and a framework for parallel external memory algorithms (PEMAs) for statistical analysis and machine learning. Both Apache Spark and Microsoft R Server operate on data stored on HDFS. A fundamental operation is importing such data into XDF format or RDDs, and then running data-parallel operations.

Although the current frameworks employ the data-parallel model to run scalable algorithms on computing clusters, analyzing a whole big data set may exceed the available computing resources. There are technical solutions for this problem, such as loading and processing blocks in batches according to the available resources, but this still requires analyzing every block in the data set, which leads to longer computation time. Furthermore, current data partitioning techniques simply cut data sets into blocks without considering the probability distributions of data in these blocks. This can lead to statistically incorrect or biased results in some data analysis tasks. We argue that solving big data analysis problems requires solutions which consider not only the computational aspects of big data, but also the statistical ones. To address this issue, we propose the RSP data model.

4. RSP OF BIG DATA

The RSP data model is a new model to represent a big data set as a set of RSP data blocks. It depends on two fundamental concepts: random sample and partition of a data set. First, we define these basic concepts and show how they are applied to the RSP data models of big data sets. Then, we present a formal definition of the RSP data model. For convenience of discussion, we do not consider the set anisotropy.

Random Sample of a Big Data Set.
Sampling is an essential technique for big data analysis when the volume of data goes beyond the available computing resources. Random samples are widely used in statistics to explore the statistical properties and distributions of big data, calculate statistical estimates, and build regression and classification models. In big data analysis, we define a random sample of a big data set as follows:
Definition 1 (Random Sample): Let D = {x1, x2, · · · , xN} be a big data set of N records. Let Dn be a subset of D containing n records chosen from D using a random process. Dn is a random sample of D if

    E[F̃n(x)] = F(x), for n ≤ N,

where F̃n(x) and F(x) denote the sample distribution functions of Dn and D, respectively, and E[F̃n(x)] denotes the expectation of F̃n(x).

According to the law of large numbers, we consider a big data set to be a random sample of the population in a certain application domain. For example, a big data set of customers is a random sample of the customer population in a company. Thus, we can use such a data set to estimate the distribution of all customers in the company. For big data analysis, a random sample of a big data set D is often used to investigate the statistical properties of D when D is too big to be analyzed using the available computing resources. A random sample is taken from D using a random sampling process as stated in Lemma 1. However, if D is distributed on a computing cluster, taking a random sample from D itself is a computationally expensive process because a full scan of D needs to be conducted on the distributed nodes of the computing cluster. To avoid the random sampling process on D in big data analysis, we can generate a set of random samples from D in advance and save these random samples to be used in the analysis of D. This idea is materialized in the RSP data model.

Partition of a Big Data Set.
On distributed file systems, a big data set D is divided into small data blocks which are distributed on the nodes of a computing cluster. Each block contains a subset of records from D. From a mathematical point of view, this set of blocks forms a partition of D. In mathematics, a partition of a set is a grouping of the set's elements into non-empty subsets, in such a way that every element is included in one and only one of the subsets (https://en.wikipedia.org/wiki/Partition_of_a_set).

Definition 2 (Partition of a Data Set): Let D = {x1, x2, · · · , xN} be a data set containing N objects. Let T be an operation which divides D into a set of subsets T = {D1, D2, · · · , DK}. T is called a partition of data set D if

    (1) D1 ∪ D2 ∪ · · · ∪ DK = D;
    (2) Di ∩ Dj = ∅, when i, j ∈ {1, 2, · · · , K} and i ≠ j.

Accordingly, T is called a partition operation on D and each Dk (k = 1, 2, · · · , K) is called a data block of D.

According to this definition, many partitions can be generated from a data set D. For example, an HDFS file is a particular partition of D generated by a partitioning operation T which sequentially cuts the original data file D into data blocks {D1, D2, · · · , DK}. Even with the same partitioning operation, different partitions can be generated using different partitioning parameters such as the size of each block or the number of records in each block. However, a key issue with this sequential partitioning is that the data blocks may not have statistical properties similar to the big data set. In fact, it is theoretically possible and practically required to find alternatives for generating a partition of a data set which satisfies certain application requirements or holds certain statistical properties, e.g., each data block is a random sample of D.

Random Sample Partition of a Big Data Set.
We define the Random Sample Partition (RSP) to represent a big data set as a family of non-overlapping random samples.

Definition 3 (Random Sample Partition): Let D = {x1, x2, · · · , xN} be a big data set which is a random sample of a population, and assume F(x) is the sample distribution function (s.d.f.) of D. Let T be a partition operation on D and T = {D1, D2, · · · , DK} be a partition of D accordingly. T is called a random sample partition of D if

    E[F̃k(x)] = F(x) for each k = 1, 2, · · · , K,

where F̃k(x) denotes the sample distribution function of Dk and E[F̃k(x)] denotes its expectation. Accordingly, each Dk is called an RSP data block of D and T is called an RSP operation on D. In the next section, we discuss the partitioning process used to generate RSPs from big data sets.

5. TWO-STAGE DATA PARTITIONING

Given a big data set in HDFS, a partitioning algorithm is required to convert the HDFS file into an RSP data model of K RSP data blocks, each with n records, which are also stored in HDFS. In this case, the partitioning algorithm consists of two main steps, data chunking and data randomization, as discussed below. The pseudo code is given in Algorithm 1.

• Data Chunking: D is divided into P data blocks. We call these blocks the original data blocks; they are not necessarily randomized. In the current big data frameworks, this operation is straightforward and readily available.

• Data Randomization: a slice of δ records is randomly selected without replacement from each of the original P data blocks to form a new RSP data block. First, each original data block is randomized locally. Then, the randomized block is further chunked into K sub-blocks, each with δ records. After that, a new RSP block is created by selecting one sub-block from each of the randomized original blocks and combining these selected sub-blocks together. The last step is repeated K times to produce the required number of RSP data blocks, where each RSP data block has n = Kδ records.

The number of records in a slice δ can be determined depending on K and n (e.g., δ = n/K to distribute data evenly over all blocks). The number of RSP data blocks K, or the number of records in an RSP data block n, is selected depending on both the available computing resources and the target analysis tasks, so that a single RSP data block can be processed efficiently on one node (or core).

A prototype of this algorithm was implemented using Apache Spark's RDD API. As a preliminary test on a small cluster of 5 nodes (each node has 24 cores, 128 GB RAM and 12.5 TB disk storage), Figure (1) shows the partitioning time for synthesized numerical data. We can see that the partitioning time increases almost linearly with the number of records in the data set, which shows the scalability of such partitioning algorithms. In this experiment, the number of RSP data blocks is the same as the number of original blocks, and each block contains 100,000 records. However, we also found that the partitioning time did not vary much when the number of RSP data blocks differs from the number of original blocks. Thus, this kind of algorithm can be used to fulfill different requirements, as the number of records in a block is an essential factor for some analysis tasks.
Algorithm 1: Partitioning Algorithm
Input:
  - D: big data set;
  - P: number of data blocks of D;
  - n: number of records in an RSP data block;
  - K: number of required RSP data blocks;
Method:
  δ = n/K;
  Divide D into P data blocks;
  for i = 1 to P do
    Randomize block Di;
    Sequentially cut the randomized Di into K sub-blocks, each with δ records;
  end for
  for k = 1 to K do
    Dk = ∅;
    for i = 1 to P do
      Select one sub-block from Di without replacement;
      Append the sub-block to Dk;
    end for
    Save Dk as an RSP data block;
  end for
  T = {D1, D2, · · · , DK};
Output:
  - T: a set of RSP data blocks;

Figure 1: Time for creating RSPs from synthesized numerical datasets with 100 features using Apache Spark on a computing cluster of 5 nodes (each node has 24 cores, 128 GB RAM and 12.5 TB disk storage). The algorithm was run on 10 different sizes of data, from 100 million records (1,000 blocks) to 1 billion records (10,000 blocks). The storage size is approximately 100 GB for 100 million records and 1 TB for 1 billion records.
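To make the two-stage procedure concrete, the following is a minimal single-machine sketch of the partitioning logic in Python with NumPy. It is an illustration only, not the distributed Spark implementation used in the experiments; the in-memory representation of blocks and the helper name rsp_partition are assumptions made for this example. As in the experiment above, the demo uses the same number of original and RSP data blocks (P = K).

import numpy as np

def rsp_partition(D, P, K, seed=0):
    """Two-stage partitioning sketch: chunk D into P original blocks, randomize
    each block locally, cut it into K slices of delta records, then build K RSP
    blocks by taking one slice from every original block."""
    rng = np.random.default_rng(seed)
    N = len(D)
    assert N % (P * K) == 0, "for simplicity, require N divisible by P*K"
    delta = N // (P * K)                    # records per slice; with P == K this is n/K
    original_blocks = np.array_split(D, P)  # stage 1: data chunking
    rsp_blocks = [[] for _ in range(K)]
    for block in original_blocks:           # stage 2: data randomization
        shuffled = rng.permutation(block)   # randomize the original block locally
        for k in range(K):                  # cut into K slices of delta records each
            rsp_blocks[k].append(shuffled[k * delta:(k + 1) * delta])
    # each RSP block combines one slice from every original block
    return [np.concatenate(parts) for parts in rsp_blocks]

# toy usage: 1,000,000 records, 10 original blocks, 10 RSP blocks (P = K)
D = np.random.default_rng(1).normal(size=1_000_000)
blocks = rsp_partition(D, P=10, K=10)
print(len(blocks), len(blocks[0]))          # 10 blocks of 100,000 records each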
6. PROBABILITY DISTRIBUTIONS OF RSP DATA BLOCKS

The key idea behind an RSP is partitioning a big data set into RSP data blocks which can be used directly as random samples of the whole data set. Thus, it is essential to investigate the statistical properties of these RSP data blocks. We used hypothesis testing and exploratory data analysis in our previous empirical studies [8] to compare RSP data blocks. Here, we discuss the underlying theory of using RSP data blocks as random samples of the whole data set.

Each block in an RSP should follow a data distribution similar to the whole data set. For a categorical feature, records that belong to the same category should be evenly distributed over all blocks. A good example is the label feature in a classification task. For instance, Figure (2.a) shows the distribution of a label feature (from the HIGGS data set, https://archive.ics.uci.edu/ml/datasets/HIGGS) in the whole data set and in some randomly selected RSP data blocks. This also applies to continuous features, as shown in Figure (2.b). In addition, the similarity between RSP data blocks and samples obtained by simple random sampling, or the whole data set if possible, can be computed using quantitative measures such as the MMD measure [5].
whole data set if possible, can be computed using quantity
measures such as MMD measure [5]. Thus, each Di is an RSP data block of D.
Selecting a block from an RSP of D is equivalent to draw-
ing a random sample directly from the big data set D. If Theorem 1: Suppose data set A has N1 records and
D is a randomized data set, cutting a block of n records is data set B has N2 records. If A1 is an RSP data block of
equivalent to randomly selecting these n records. This also A with records n1 and
S B1 is an RSP data block of B
S with
applies when partitioning a non-randomized data set using records n2 , then A1 B1 is an RSP data block of A B as
n1 N1
Algorithm 1 because each RSP data block is formed from n2
= N 2
.
5
https://archive.ics.uci.edu/ml/datasets/HIGGS Proof : Suppose the sample distribution functions of data
Figure 2: Probability distribution in data blocks and the whole data set.

Proof: Suppose the sample distribution functions of A, B, A1 and B1 are F1(x), F2(x), F̃1(x) and F̃2(x), respectively. We have E[F̃1(x)] = F1(x) and E[F̃2(x)] = F2(x). For any real number x, the number of records of A1 ∪ B1 whose values are not greater than x is n1·F̃1(x) + n2·F̃2(x). Therefore, the sample distribution function of A1 ∪ B1 is:

    F̃(x) = (n1·F̃1(x) + n2·F̃2(x)) / (n1 + n2).

Similarly, the sample distribution function of A ∪ B is:

    F(x) = (N1·F1(x) + N2·F2(x)) / (N1 + N2).

The expectation of F̃(x) is

    E[F̃(x)] = E[(n1·F̃1(x) + n2·F̃2(x)) / (n1 + n2)]
             = (n1·E[F̃1(x)] + n2·E[F̃2(x)]) / (n1 + n2)
             = ((n1/n2)·F1(x) + F2(x)) / ((n1/n2) + 1)
             = ((N1/N2)·F1(x) + F2(x)) / ((N1/N2) + 1)
             = F(x).

Therefore, A1 ∪ B1 is an RSP data block of A ∪ B.

7. BLOCK LEVEL SAMPLE: A SAMPLE OF RANDOM SAMPLE DATA BLOCKS

As each block in an RSP data model is a random sample of the big data set D, the RSP data model can be used as a reduced representative space from which samples of RSP data blocks are drawn directly, instead of drawing samples of individual records from D. In contrast to classical record level sampling, we call this sampling method block level sampling: blocks are sampled without replacement and with equal probability. Thus, a sample of RSP data blocks is a set of non-overlapping random samples from D. This is more efficient, especially when K << N, because it avoids the full scan of D each time a random sample is needed.

Definition 4 (Block Level Sample): Let T be an RSP of a big data set D. S = {D1, D2, · · · , Dg}, where g < K, is a block level sample if its RSP data blocks are randomly selected from T without replacement and with equal probability.

The block-sampling operation is called separately in each analysis process so that samples of RSP data blocks are selected without replacement, i.e., without repeating a block either in the same sample or in other samples of the same analysis process. The number of RSP data blocks g in a block sample depends on the available computing resources, so that each selected block can be loaded and analyzed locally on one node or core. In such a case, a sample of RSP data blocks is analyzed in one batch in a perfectly parallel manner. This sampling process can be refined to select blocks depending on the availability of nodes on a computing cluster, which can lead to better scheduling algorithms.

Block level sampling can be employed to design data analysis pipelines, for example, analyzing big data in batches (i.e., each batch uses one sample of RSP data blocks) to stepwise collect and improve statistical models and estimates rather than trying to load the data set all at once. Further statistical investigation can be done on the selected RSP data blocks. For example, exploratory data analysis techniques can be used to visualize and compare data distributions between RSP data blocks and classical random samples. Quantitative measures can also be used, such as the maximum mean discrepancy (MMD) measure [5] for the similarity between data distributions and Hotelling's T-squared test for the difference between means of features. This kind of statistical testing is fundamental for many data analysis tasks. In the following sections, we describe how block level samples can be used effectively and efficiently for estimating statistics and building ensemble models.
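To make Definition 4 concrete, here is a minimal sketch of block level sampling over an in-memory RSP: g blocks are drawn uniformly without replacement, and blocks already consumed by the analysis process are never re-selected. The list-of-arrays representation and the hypothetical rsp_partition helper from the earlier sketch are assumptions of the illustration, not part of the paper's implementation.

import numpy as np

def block_level_sample(rsp_blocks, g, used, rng):
    """Draw g block indices uniformly, without replacement, and without
    repeating blocks already consumed by this analysis process (`used`)."""
    available = [i for i in range(len(rsp_blocks)) if i not in used]
    chosen = rng.choice(available, size=g, replace=False)
    used.update(int(i) for i in chosen)
    return [rsp_blocks[i] for i in chosen]

# usage: two consecutive batches of g = 5 blocks from a 100-block RSP
rng = np.random.default_rng(3)
data = rng.normal(size=1_000_000)
rsp = rsp_partition(data, P=100, K=100)      # helper from the earlier sketch
used = set()
batch1 = block_level_sample(rsp, g=5, used=used, rng=rng)
batch2 = block_level_sample(rsp, g=5, used=used, rng=rng)
print(len(batch1), len(batch2), len(used))   # 5 5 10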

8. ESTIMATION FROM A SAMPLE OF RSP DATA BLOCKS

The method of bootstrapping multiple samples from a given data set to evaluate statistical estimates is widely used in statistics. However, the traditional bootstrap method is not suitable for a distributed big data set because of the high computational and storage costs. One approach for reducing these costs is drawing samples of small sizes, such as the bag of little bootstraps [6], and calculating the average estimates from those small samples. However, this still requires scanning the whole data set each time a sample is needed. In addition, extra storage space to store these samples is also inevitable in most cases. Instead, storing a big data set in an RSP data model can avoid both the repeated random sampling and the extra storage costs. This gives a chance not only to directly estimate statistics and sample distributions from RSP data blocks, but also to turn the focus to improving the quality of estimates, because the whole data set is stored, by default, as a collection of random samples.

Estimated statistics from individual RSP data blocks of an RSP data model can be combined (e.g., by averaging) into a single estimate which is generally better than any of the individual ones. For example, Figure (3) shows the average means of 4 features in the HIGGS data. Each value is an average of the estimated means from the blocks used up to that point. We can see that the error of the means is not significant even in the first several batches, and the estimated mean values converge to the true mean values as more blocks are added. In a similar way, Figure (4) shows the estimated standard deviations for the same features in the HIGGS data. As we can see, samples of RSP data blocks from an RSP data model can be used efficiently to estimate the statistics of a big data set. This approach can be generalized to other statistical analysis tasks such as building classification or regression models. In the following section, we show how a sample of RSP data blocks can be used effectively for building ensemble models.

Figure 3: Block level estimation of the means of 4 features in HIGGS data using 5 RSP data blocks. Each point represents the estimated value after each batch (averaged over 100 runs). The dotted line represents the true value calculated from the entire HIGGS data.

Figure 4: Block level estimation of the standard deviations of 4 features in HIGGS data using 5 RSP data blocks. Each point represents the estimated value after each batch (averaged over 100 runs). The dotted line represents the true value calculated from the entire HIGGS data.
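As an illustration of this batch-wise estimation, the sketch below averages block-level means over successive batches of RSP blocks and compares the running estimate with the value computed from all the data. It reuses the hypothetical rsp_partition and block_level_sample helpers from the earlier sketches and synthetic data rather than the HIGGS data.

import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=2.0, size=1_000_000)
rsp = rsp_partition(data, P=100, K=100)

used, block_means = set(), []
for batch in range(5):                                   # 5 batches of 5 blocks each
    for block in block_level_sample(rsp, g=5, used=used, rng=rng):
        block_means.append(block.mean())                 # estimate from one RSP block
    running_estimate = np.mean(block_means)              # combine by averaging
    print(f"after batch {batch + 1}: mean ~ {running_estimate:.4f}")

print("true mean from all data:", data.mean())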
9. ASYMPTOTIC ENSEMBLE LEARNING FRAMEWORK

The RSP data blocks of an RSP data model can be used as component data sets to build ensemble models for different tasks (e.g., classification, regression, clustering). Ensemble methods use multiple base models built with the same or different algorithms from different component data sets. The results from the base models are combined into a single ensemble model which generally outperforms each of the individual ones [1].

As shown in Figure (5), the asymptotic ensemble learning framework provides a general framework to analyze big data by building base models from randomly selected RSP data blocks of the big data set. Given a learning algorithm f and an RSP data model T = {D1, D2, · · · , DK}, an ensemble model is learnt from RSP data blocks in batches as follows:

• Blocks Selection: a sample of RSP data blocks is drawn without replacement from the RSP data model, e.g., D5, D120, D506 and D890. This sample of RSP data blocks is put in one batch.

• Learning Base Models: a base model is built from each selected RSP data block. These base models are built in parallel on a computing cluster. For instance, four classifiers π1, π2, π3, π4 are built in parallel from the four selected RSP data blocks as shown in the figure.

• Ensemble Learning and Update: the base models are collected to form an ensemble model Π which is updated after each batch.

• Ensemble Evaluation: the current ensemble model Π is evaluated. If it does not satisfy the termination condition(s), go back to the first step; otherwise, output the ensemble model. This process continues until a satisfactory ensemble model Π is obtained or all blocks are used up.

Figure 5: Asymptotic Ensemble Learning Framework: base models are built from a sample of RSP data blocks and then combined in an ensemble model.

The basic operations for building an ensemble model using the asymptotic ensemble learning framework are described in Algorithm 2. Base models are built from samples of RSP data blocks until there is no significant increase in the accuracy of the ensemble model. The BlocksSampling function is used to select g RSP data blocks from T. Ω() is an evaluation function which evaluates whether the ensemble model Π satisfies the analysis requirements. An empirical study of this method was presented in [8].
Algorithm 2: Asymptotic Ensemble Learning Algorithm
Input:
  - T: an RSP of D;
  - f: a learning algorithm;
  - g: number of blocks in one batch;
Method:
  Π = ∅;
  1- Blocks Selection: S = BlocksSampling(T, f, g)
  2- Base Models Learning:
     for Dq ∈ S do
       πq = f(Dq)
     end for
  3- Ensemble Update: Π = Π ∪ {π1, · · · , πg}
  4- Ensemble Evaluation:
     If Ω(Π) < threshold (or no more blocks)
       Stop
     Else
       Go to 1
Output: Π
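Below is a minimal single-machine sketch of this loop using scikit-learn decision trees as base models and majority voting as the combination function. The synthetic data, the voting rule, and the reuse of the hypothetical rsp_partition and block_level_sample helpers from the earlier sketches are assumptions of the example; they stand in for the Microsoft R Server and Apache Spark prototype described next.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
N = 205_000
X = rng.normal(size=(N, 4))
y = (X[:, 0] + X[:, 1] ** 2 + 0.3 * rng.normal(size=N) > 1.0).astype(int)
table = np.column_stack([X, y])               # one row per record, label in last column
test, train = table[:5000], table[5000:]      # simple split: test rows kept out of the RSP
rsp = rsp_partition(train, P=20, K=20)        # hypothetical helper from the earlier sketch
X_test, y_test = test[:, :-1], test[:, -1]

base_models, used = [], set()
for batch in range(3):                        # analyze 3 batches of g = 2 blocks
    for block in block_level_sample(rsp, g=2, used=used, rng=rng):
        clf = DecisionTreeClassifier(max_depth=8, random_state=0)
        clf.fit(block[:, :-1], block[:, -1])  # base model from one RSP data block
        base_models.append(clf)
    # ensemble prediction: majority vote over all base models collected so far
    votes = np.mean([m.predict(X_test) for m in base_models], axis=0)
    acc = np.mean((votes > 0.5) == (y_test == 1.0))
    print(f"batch {batch + 1}: {len(base_models)} base models, accuracy {acc:.3f}")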
A prototype of this framework was implemented using Microsoft R Server packages with Apache Spark. Taking decision trees as an example, Figure (6) shows the results on the HIGGS data set. Ensemble models were built using the previous ensemble process and evaluated on the same test data after each batch. We can see how the ensemble model accuracy changes after each batch. However, this change is not significant after using about 15-20% of the data. Furthermore, the ensemble model accuracy is generally better than the accuracy of a single model built using the whole data (the dotted line in the figure). This framework can also be used to get indicators of a model's quality using only a subset of RSP data blocks.

Figure 6: Accuracy of block-based ensemble classifiers on HIGGS data, averaged over 100 runs of Algorithm 2. Base models were trained in batches of 5 blocks with 1% of the data in each block. The dotted line is the accuracy of a single model built using all data with the same algorithm.

These results show that a small subset of RSP data blocks from a big data set is enough to obtain ensemble models equivalent to the models built using the whole data set. In this way, computation time can be decreased significantly, as shown in Figure (7). As long as the number of RSP data blocks in a batch is equal to or smaller than the number of available executors in the computing cluster, the computation time of a batch does not vary a lot. Even when using all RSP data blocks of a big data set in one batch, which is generally not required, it still takes less time than training a single model from the whole data (about 8 minutes for the HIGGS data set on the same cluster).

Figure 7: Training time for block-based models on a cluster of 5 nodes. Each block holds 1% of the data (e.g., 5% of data means that 5 blocks were used in one batch). For comparison, the time for building a single model from all data on the same cluster is shown on the far right.

10. ADVANTAGES AND IMPLICATIONS

In this work, we argue that alleviating the challenges of big data analysis, especially when data volume goes beyond the available computing resources, requires a different way of thinking. It is not efficient to treat data analysis as a pure computational problem and ignore its statistical aspects, whether on the data analysis level or the data management level. Applying the RSP data model to big data sets is a promising approach for a variety of data analysis tasks. It can open the door for further investigations of new, innovative solutions to cope with the ever increasing volume of big data sets.

A key advantage is that RSP data blocks can be used directly as random samples without the need for expensive online sampling and extra storage space. These blocks can be used to obtain unbiased estimators of a big data set because each data block has approximately the same distribution as the whole data set. In addition, creating an RSP data model from a big data set is performed only once. After that, samples of RSP data blocks can be used directly, not only for approximate computing and ensemble learning, but also for exploratory data analysis, interactive analysis and quickly piloting models. As we can see from the experimental results, while there is no significant difference in accuracy, block-based ensembles and estimations require less computational time and resources. At the least, equivalent results can be reached using only a small portion of a big data set. In this way, memory limitation is no longer critical, because data is analyzed in batches and RSP data blocks are small enough to be processed on single nodes or cores.

Since the RSP-based analysis approach enables reusing sequential algorithms for conducting data analysis on RSP data blocks, without the need to rewrite these algorithms for distributed and parallel environments, tackling big data analysis problems turns into solving small data problems and defining proper ensemble or combination functions. Furthermore, statistically-aware data partitioning algorithms can be implemented using the current APIs of distributed computation engines, such as Apache Hadoop and Apache Spark. As a result, many of the current algorithms can be scaled to big data without the need for new parallel implementations.

This approach can also help in separating big data storage from big data analysis, so that selected RSP data blocks can be loaded on or transferred to local machines or computing clusters for analysis. As such, a computing cluster can be used to analyze data sets stored in different clusters by combining the outputs from different locations. If RSP data blocks from different data centers have different probability distributions, a combination criterion can be defined to produce representative RSP data blocks of the whole data set and then analyze the combined blocks to produce the estimated results of the big data in the different data centers.

11. CONCLUSIONS

In this article, we introduced the random sample partition (RSP) data model to represent a big data set as a set of RSP data blocks, where each RSP data block itself is a random sample of the whole data set. An RSP data model of a big data set can be created using distributed data partitioning algorithms. We showed that block level samples from an RSP data model can be used efficiently and effectively for different data analysis tasks such as statistics estimation and ensemble learning. We demonstrated how the asymptotic ensemble learning framework is used as a general framework for building block-based ensemble models using block samples from an RSP data model. For further testing and investigation, our future work includes implementing different statistically-aware partitioning algorithms, extending the framework to build random subspace base models in order to increase the variety of base models, and testing other data analysis tasks.

12. ACKNOWLEDGMENTS

The first and second authors contributed equally to this paper, which was supported by grants from the National Natural Science Foundation of China (Nos. 61473194, 61503252) and the China Postdoctoral Science Foundation (No. 2016T90799).

13. REFERENCES

[1] C. C. Aggarwal. Data Mining: The Textbook. Springer Publishing Company, Incorporated, 2015.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, Jan. 2008.
[3] J. Gantz and D. Reinsel. Extracting value from chaos, 2011.
[4] Í. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing approximations to MapReduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 383–397. ACM, 2015.
[5] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, Mar. 2012.
[6] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan. The big data bootstrap. In ICML, 2012.
[7] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang. Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3):145–164, 2016.
[8] S. Salloum, J. Z. Huang, and Y. He. Empirical analysis of asymptotic ensemble learning for big data. In Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT '16, pages 8–17, New York, NY, USA, 2016. ACM.
[9] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.
[10] M. Vojnovic, F. Xu, and J. Zhou. Sampling based range partition methods for big data analytics. Technical Report MSR-TR-2012-18, February 2012.
[11] M. Zaharia, M. Chowdhury, T. Das, and A. Dave. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI'12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2, Apr. 2012.
[12] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, Oct. 2016.