A Random Sample Partition Data Model For Big Data
Salman Salloum∗, Yulin He, Joshua Zhexue Huang, Xiaoliang Zhang, Tamer Z. Emara, Chenghao Wei, Heping He
Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
{ssalloum, yulinhe, zx.huang, zhangxlassz, tamer, chenghao.wei}@szu.edu.cn
∗Corresponding Author

arXiv:1712.04146v2 [cs.DC] 20 Jan 2018
ABSTRACT

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has a probability distribution similar to the whole big data set. Under this data model, efficient block level sampling is used to randomly select RSP data blocks, replacing expensive record level sampling to select sample data from a big distributed data set on a computing cluster. We show how RSP data blocks can be employed to estimate statistics of a big data set and build models which are equivalent to those built from the whole big data set. In this approach, analysis of a big data set becomes analysis of a few RSP data blocks which have been generated in advance on the computing cluster. Therefore, the new method for data analysis based on RSP data blocks is scalable to big data.

1. KEY INSIGHTS

• Big data analysis requires not only computationally efficient, but also statistically effective approaches.

• A random sample partition (RSP) data model for big data can be used efficiently and effectively in building ensemble models, estimating statistics, and other big data analysis tasks on a computing cluster.

• Experiments show that a sample of RSP data blocks from an RSP data model can be used to get approximate analysis results equivalent to those computed from the whole data set.

• In addition to the scalability advantage, adopting this approach saves time and computing resources.

2. INTRODUCTION

Big data analysis is a challenging problem in many application areas, especially with the ever increasing volume of data. In this regard, divide-and-conquer is used as a common strategy in current big data frameworks to distribute both data and computation on computing clusters. A big data set is divided into smaller data blocks and distributed on the nodes of a computing cluster so that data-parallel computations are run on these blocks. However, analyzing a big data set all at once may require more computing resources than those available on the computing cluster. Moreover, current frameworks require new data-parallel implementations of traditional data mining and analysis algorithms. In order to enable efficient and effective big data analysis when data volume goes beyond the available computing resources, a different approach is required which considers computational, as well as statistical, aspects of big data on both the data management and analysis levels. Such an approach is essential for investigating a key research problem nowadays: should the full set of data be used to find properties and reveal valuable insights from big data, or would a subset of this data be good enough?

Although the existing mainstream big data frameworks (e.g., Hadoop's MapReduce [2], Apache Spark [11]) run data-parallel computations over distributed data blocks on computing clusters, current data partitioning techniques do not consider the probability distributions of data in these blocks. Sequentially chunking a big data set into small data blocks does not guarantee that each block is a random sample when the data is not randomly ordered in the big data set. In such a case, using data blocks directly to estimate statistics and build models may lead to statistically incorrect or biased results. Furthermore, classical random sampling techniques, which require a full scan of the data set each time a random sample is generated, are no longer effective with the increasing volume of big data sets stored in distributed systems [10]. Thus, partitioning a big data set into small data subsets (i.e., data blocks), each being a random sample of the whole data set, is a fundamental operation for big data analysis. These data blocks can be used to estimate statistics and build models, especially when analyzing big data sets requires more than the available resources in order to meet specific application requirements [4][3]. Therefore, it is necessary to develop statistically-aware data partitioning methods which enable effective and efficient usage of data blocks to fulfill the statistical and scalability requirements of big data analysis tasks.
Multivariate data is a common form of data in many application areas. Let D = {x1, x2, · · · , xN} be a multivariate data set of N records, where N is too big for statistical analysis of D on a single machine. Each record is depicted with M attributes or features, i.e., xi = (xi1, xi2, · · · , xiM) for any i ∈ {1, 2, · · · , N}. In order to make data partitioning statistically aware, we propose the random sample partition (RSP) data model to represent D as a set of non-overlapping data blocks (called RSP data blocks), where each block itself is a random sample. If the records in D are independently and identically distributed (i.i.d.), we call D a randomized data set (i.e., the order of records in D is random). Our previous empirical study [8] showed that data sets are naturally randomized in many application areas and thus can be directly represented as RSP data models using the current data partitioning techniques. In case the data is not randomized, scalable algorithms are developed to randomize big data sets on computing clusters.

The RSP data blocks in an RSP can be directly drawn as random samples of the whole big data set in data exploratory and analysis tasks. Since the RSP data blocks are generated in advance and stored on the computing cluster, randomly selecting a set of RSP data blocks is much more efficient than sampling a set of records from a distributed big data file, because a full scan of the whole file is no longer needed. In this article, we show that a small sample of a few RSP data blocks from an RSP is enough to build models and compute statistical estimates which are equivalent to those calculated from the whole big data set. We propose the asymptotic ensemble learning framework for big data analysis, which depends on ensemble methods as a general approach for building block-based ensemble models from RSP data blocks. With this framework, results can be improved incrementally without the need to load and analyze the whole big data set all at once. This approach can be generalized for statistics estimation by defining appropriate ensemble functions (e.g., averaging the means from different RSP data blocks).

In this article, we introduce the RSP data model of a big data set and show how RSP data blocks are essential for efficient and effective big data analysis. We also discuss how this new RSP data model can be employed for different data analysis tasks using the asymptotic ensemble learning framework. Finally, we summarize the implications of this new model and conclude this article with some of our current works.
3. DISTRIBUTED DATA-PARALLEL COMPUTING

As data volume in different application areas goes beyond the petabyte scale, divide-and-conquer is taken as a general strategy to process big data on computing clusters, considering the recent advancements in distributed and parallel computing technology. In this strategy, a big data file is chunked into small non-overlapping data blocks and distributed on the nodes of a computing cluster using a distributed file system such as the Hadoop distributed file system (HDFS) [9]. Then, data-parallel computations are run on these blocks considering data locality. After that, intermediate results from the processed blocks are integrated to produce the final result for the whole data set. This is usually done using the MapReduce computing model [2] adopted by the mainstream big data frameworks such as Apache Hadoop¹, Apache Spark², and Microsoft R Server³.

As a unified engine for big data processing, Apache Spark has been adopted in a variety of applications in both academia and industry [12][7] as a new generation engine after Hadoop's MapReduce. It uses a new data abstraction and in-memory computation model, the resilient distributed datasets (RDDs) [11], where collections of objects (e.g., records) are partitioned across a cluster, kept in memory and processed in parallel. Similarly, Microsoft R Server addresses the in-memory limitations of the open source statistical system, R, by adding parallel and chunk-wise distributed processing across multiple cores and nodes. It comes with the proprietary eXternal data frame (XDF) format and a framework for parallel external memory algorithms (PEMAs) for statistical analysis and machine learning. Both Apache Spark and Microsoft R Server operate on data stored on HDFS. A fundamental operation is importing such data into XDF format or RDDs, and then running data-parallel operations.

Although the current frameworks employ the data-parallel model to run scalable algorithms on computing clusters, analyzing a whole big data set may exceed the available computing resources. There are technical solutions for this problem, such as loading and processing blocks in batches according to the available resources, but this still requires analyzing each block in the data set, which leads to longer computation time. Furthermore, current data partitioning techniques simply cut data sets into blocks without considering the probability distributions of data in these blocks. This can lead to statistically incorrect or biased results in some data analysis tasks. We argue that solving big data analysis problems requires solutions which consider not only the computational aspects of big data, but also the statistical ones. To address this issue, we propose the RSP data model.

¹http://hadoop.apache.org/
²https://spark.apache.org/
³https://www.microsoft.com/en-us/cloud-platform/r-server
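To make the divide-and-conquer pattern described above concrete, the following minimal Python sketch mimics it on a single machine (the blocks and numbers here are hypothetical stand-ins for a real distributed job, not part of the paper's implementation): a partial result is computed on each block, and the intermediate results are then integrated into the final result.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a distributed data file: a list of data blocks.
blocks = [rng.normal(loc=10.0, scale=3.0, size=50_000) for _ in range(20)]

# "Map" step: compute a partial (sum, count) on each block, as a
# data-parallel framework would do on each node.
partials = [(block.sum(), block.size) for block in blocks]

# "Reduce" step: integrate the intermediate results into the final result.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
print(f"global mean over {total_count} records: {total_sum / total_count:.4f}")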
4. RSP OF BIG DATA

The RSP data model is a new model to represent a big data set as a set of RSP data blocks. It depends on two fundamental concepts: random sample and partition of a data set. First, we define these basic concepts and show how they are applied to the RSP data models of big data sets. Then, we present a formal definition of the RSP data model. For convenience of discussion, we do not consider the set anisotropy.

Random Sample of a Big Data Set.

Sampling is an essential technique for big data analysis when the volume of data goes beyond the available computing resources. Random samples are widely used in statistics to explore the statistical properties and distributions of big data, calculate statistics estimates, and build regression and classification models. In big data analysis, we define a random sample of a big data set as follows:

Definition 1 (Random Sample): Let D = {x1, x2, · · · , xN} be a big data set of N records. Let Dn be a subset of D containing n records chosen from D using a random process. Dn is a random sample of D if

E[F̃n(x)] = F(x), for n ≤ N,

where F̃n(x) and F(x) denote the sample distribution functions of Dn and D, respectively, and E[F̃n(x)] denotes the expectation of F̃n(x).

According to the law of large numbers, we consider that a big data set is a random sample of the population in a certain application domain. For example, a big data set of customers is a random sample of the customer population in a company. Thus, we can use such a data set to estimate the distribution of all customers in the company. For big data analysis, a random sample of a big data set D is often used to investigate the statistical properties of D when D is too big to be analyzed using the available computing resources. A random sample is taken from D using a random sampling process as stated in Lemma 1. However, if D is distributed on a computing cluster, taking a random sample from D itself is a computationally expensive process because a full scan of D needs to be conducted on the distributed nodes of the computing cluster. To avoid the random sampling process on D in big data analysis, we can generate a set of random samples from D in advance and save these random samples to be used in the analysis of D. This idea is materialized in the RSP data model.
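As an illustration of Definition 1, the following Python/NumPy sketch (synthetic data; all names are hypothetical, not from the paper's prototype) draws a record level random sample and compares its sample distribution function with that of the whole data set at a few points:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a big data set D: one million records of one feature.
D = rng.exponential(scale=2.0, size=1_000_000)

def ecdf(data, x):
    # Sample distribution function: fraction of records not greater than x.
    return np.mean(data <= x)

# Draw a random sample Dn of n records without replacement.
n = 10_000
Dn = rng.choice(D, size=n, replace=False)

# F~n(x) should track F(x) closely; E[F~n(x)] = F(x) holds under random sampling.
for x in [0.5, 1.0, 2.0, 5.0]:
    print(f"x={x}:  F(x)={ecdf(D, x):.4f}  F~n(x)={ecdf(Dn, x):.4f}")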
Partition of a Big Data Set.

On distributed file systems, a big data set D is divided into small data blocks which are distributed on the nodes of a computing cluster. Each block contains a subset of records from D. From a mathematical point of view, this set of blocks forms a partition of D. In mathematics, a partition of a set is a grouping of the set's elements into non-empty subsets, in such a way that every element is included in one and only one of the subsets⁴.

Definition 2 (Partition of Data Set): Let D = {x1, x2, · · · , xN} be a data set containing N objects. Let T be an operation which divides D into a set of subsets T = {D1, D2, · · · , DK}. T is called a partition of data set D if

(1) D1 ∪ D2 ∪ · · · ∪ DK = D;
(2) Di ∩ Dj = ∅, when i, j ∈ {1, 2, · · · , K} and i ≠ j.

Accordingly, T is called a partition operation on D and each Dk (k = 1, 2, · · · , K) is called a data block of D.

According to this definition, many partitions can be generated from a data set D. For example, an HDFS file is a particular partition of D generated by a partitioning operation T which sequentially cuts the original data file D into data blocks {D1, D2, · · · , DK}. Even with the same partitioning operation, different partitions can be generated using different partitioning parameters, such as the size of each block or the number of records in each block. However, a key issue with this sequential partitioning is that the data blocks may not have statistical properties similar to those of the big data set. In fact, it is theoretically possible and practically required to find alternatives for generating a partition of a data set which satisfies certain application requirements or holds some statistical properties, e.g., each data block being a random sample of D.

⁴https://en.wikipedia.org/wiki/Partition_of_a_set

Random Sample Partition of a Big Data Set.

We define the Random Sample Partition (RSP) to represent a big data set as a family of non-overlapping random samples.

Definition 3 (Random Sample Partition): Let D = {x1, x2, · · · , xN} be a big data set which is a random sample of a population and assume F(x) is the sample distribution function (s.d.f.) of D. Let T be a partition operation on D and T = {D1, D2, · · · , DK} be a partition of D accordingly. T is called a random sample partition of D if

E[F̃k(x)] = F(x) for each k = 1, 2, · · · , K,

where F̃k(x) denotes the sample distribution function of Dk and E[F̃k(x)] denotes its expectation. Accordingly, each Dk is called an RSP data block of D and T is called an RSP operation on D. In the next section, we discuss the partitioning process to generate RSPs from big data sets.
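Definition 3 suggests a direct construction for an in-memory data set: randomly permute the records, then cut the permuted sequence into K equal blocks (this is exactly the construction proved correct in Lemma 1 below, and the basis of the two-stage algorithm in the next section). A minimal Python/NumPy sketch, assuming a one-feature data set that fits in memory:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data set D of N = K * delta records (one feature).
K, delta = 100, 1_000
D = rng.lognormal(mean=0.0, sigma=1.0, size=K * delta)

# Shuffle D with a random permutation, then cut it sequentially into K
# non-overlapping blocks of delta records each: the blocks cover D, are
# disjoint, and each one is a random sample of D.
T = rng.permutation(D).reshape(K, delta)

# Sanity check: a block's mean and its fraction of records <= 1.0 should
# be close to the same quantities computed on the whole data set.
print("whole data set:", D.mean(), np.mean(D <= 1.0))
print("block 0:      ", T[0].mean(), np.mean(T[0] <= 1.0))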
5. TWO-STAGE DATA PARTITIONING

Given a big data set in HDFS, a partitioning algorithm is required to convert the HDFS file into an RSP data model of K RSP data blocks, each with n records, which are also stored in HDFS. In this case, the partitioning algorithm consists of two main steps, data chunking and data randomization, as discussed below. The pseudocode is given in Algorithm 1.

• Data Chunking: D is divided into P data blocks. We call these blocks the original data blocks; they are not necessarily randomized. In the current big data frameworks, this operation is straightforward and readily available.

• Data Randomization: a slice of δ records is randomly selected without replacement from each of the P original data blocks to form a new RSP data block. First, each original data block is randomized locally. Then, the randomized block is further chunked into K sub-blocks, each with δ records. After that, a new RSP block is created by selecting one sub-block from each of the randomized original blocks and combining these selected sub-blocks together. The last step is repeated K times to produce the required number of RSP data blocks, where each RSP data block has n = Pδ records.

The number of records in a slice, δ, can be determined from P and n (e.g., δ = n/P to distribute data evenly over all blocks). The number of RSP data blocks K, or the number of records n in an RSP data block, is selected depending on both the available computing resources and the target analysis tasks, so that a single RSP data block can be processed efficiently on one node (or core).
Algorithm 1 Partitioning Algorithm
Input:
- D: big data set;
- P: number of original data blocks of D;
- n: number of records in an RSP data block;
- K: number of required RSP data blocks;
Method:
δ = n/P;
Divide D into P data blocks;
for i = 1 to P do
    Randomize block Di;
    Sequentially cut the randomized Di into K sub-blocks, each with δ records;
end for
for k = 1 to K do
    Dk = ∅;
    for i = 1 to P do
        Select one sub-block from Di without replacement;
        Append the sub-block to Dk;
    end for
    Save Dk as an RSP data block;
end for
T = {D1, D2, · · · , DK};
Output:
- T: a set of RSP data blocks;

Figure 1: Time for creating RSPs from synthesized numerical datasets with 100 features using Apache Spark on a computing cluster of 5 nodes (each node has 24 cores, 128 GB RAM and 12.5 TB disk storage). The algorithm was run on 10 different sizes of data, from 100 million records (1,000 blocks) to 1 billion records (10,000 blocks). The storage size is approximately 100 GB for 100 million records and 1 TB for 1 billion records.

A prototype of this algorithm was implemented using Apache Spark's RDD API. As a preliminary test on a small cluster of 5 nodes (each node has 24 cores, 128 GB RAM and 12.5 TB disk storage), Figure (1) shows the partitioning time for synthesized numerical data. The partitioning time increases almost linearly with the number of records in the data set, which shows the scalability of such partitioning algorithms. In this experiment, the number of RSP data blocks is the same as the number of original blocks, and each block contains 100,000 records. However, we also found that the partitioning time did not vary much when the number of RSP data blocks differs from the number of original blocks. Thus, this kind of algorithm can be used to fulfill different requirements, as the number of records in a block is an essential factor for some analysis tasks.
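For illustration, here is a minimal single-machine sketch of Algorithm 1 in Python/NumPy. It assumes a one-dimensional, in-memory data set whose length is divisible by P · K; the paper's prototype instead operates on HDFS data through Spark's RDD API.

import numpy as np

def rsp_partition(D, P, K, rng):
    # In-memory sketch of Algorithm 1: build K RSP blocks from P original
    # blocks of the data set D. Assumes len(D) == P * K * delta.
    delta = len(D) // (P * K)          # slice size: delta = n / P
    originals = np.array_split(D, P)   # data chunking into P original blocks
    sub_blocks = []
    for block in originals:
        shuffled = rng.permutation(block)              # local randomization
        sub_blocks.append(shuffled.reshape(K, delta))  # K sub-blocks of delta records
    # RSP block k combines the k-th sub-block of every original block.
    return [np.concatenate([sb[k] for sb in sub_blocks]) for k in range(K)]

rng = np.random.default_rng(1)
D = np.sort(rng.normal(size=1_200_000))  # deliberately non-randomized order
rsp = rsp_partition(D, P=30, K=40, rng=rng)
print(len(rsp), len(rsp[0]))             # 40 blocks with n = 30,000 records each
print(D.mean(), rsp[0].mean())           # a block's mean tracks the global mean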
6. PROBABILITY DISTRIBUTIONS OF RSP DATA BLOCKS

The key idea behind an RSP is partitioning a big data set into RSP data blocks which can be used directly as random samples of the whole data set. Thus, it is essential to investigate the statistical properties of these RSP data blocks. We used hypothesis testing and exploratory data analysis in our previous empirical studies [8] to compare RSP data blocks. Here, we discuss the underlying theory of using RSP data blocks as random samples of the whole data set.
Each block in an RSP should follow a data distribution similar to the whole data set. For a categorical feature, records that belong to the same category should be evenly distributed over all blocks. A good example is the label feature in a classification task. For example, Figure (2.a) shows the distribution of a label feature (from the HIGGS data set⁵) in the whole data set and in some randomly selected RSP data blocks. This also applies to continuous features, as shown in Figure (2.b). In addition, the similarity between RSP data blocks and samples drawn by simple random sampling, or the whole data set if possible, can be computed using quantitative measures such as the MMD measure [5].

Figure 2: Probability distribution in data blocks and the whole data set.

⁵https://archive.ics.uci.edu/ml/datasets/HIGGS

Selecting a block from an RSP of D is equivalent to drawing a random sample directly from the big data set D. If D is a randomized data set, cutting a block of n records is equivalent to randomly selecting these n records. This also applies when partitioning a non-randomized data set using Algorithm 1, because each RSP data block is formed from randomly selected records from all the original blocks. Thus, each RSP data block from an RSP data model is equivalent to a simple random sample from D. The underlying theory of RSP data blocks is given in Lemma 1 and Theorem 1.

Lemma 1: Suppose D = {x1, x2, · · · , xKδ} is a data set. Randomly choose a permutation of the sequence 1, 2, · · · , Kδ, denoted as τ = {τ1, τ2, · · · , τKδ}. For each i = 1, · · · , K, set Di = {x_{τ_{δ(i−1)+1}}, x_{τ_{δ(i−1)+2}}, · · · , x_{τ_{δi}}}; then each Di is an RSP data block of D. Below, we prove this for one-dimensional data. It also applies to high-dimensional cases with minor modifications.

Proof: Suppose the sample distribution functions of D and Di are F(x) and Fi(x) (i = 1, 2, · · · , K), respectively. For any real number x, let M denote the number of records in D whose values are not greater than x. Then, it is easy to show that M = Kδ · F(x). If si denotes the number of records in Di whose values are not greater than x, then si follows a hypergeometric distribution and

E(si) = Σ_{j=1}^{δ} j · P{si = j} = Σ_{j=1}^{δ} j · (C_M^j · C_{Kδ−M}^{δ−j}) / C_{Kδ}^{δ} = (M / C_{Kδ}^{δ}) · C_{Kδ−1}^{δ−1} = δ · F(x).

So, the expectation of Fi(x) is

E[Fi(x)] = E(si) / δ = F(x).

Thus, each Di is an RSP data block of D.
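The expectation step in this proof can be checked numerically: si is hypergeometric, so its mean is available in closed form. A quick sketch, assuming SciPy is available (the numbers are arbitrary):

from scipy.stats import hypergeom

# Setting from the proof: D has K*delta records, M of which are <= x,
# and a block Di holds delta records drawn without replacement.
K, delta = 50, 200
F_x = 0.3                        # assumed value of F(x) for the whole data set
M = int(K * delta * F_x)         # records in D not greater than x

# s_i follows a hypergeometric distribution; its mean is delta * F(x),
# hence E[F_i(x)] = E(s_i) / delta = F(x).
s_i = hypergeom(K * delta, M, delta)   # population size, successes, draws
print(s_i.mean(), delta * F_x)         # both print 60.0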
Theorem 1: Suppose data set A has N1 records and data set B has N2 records. If A1 is an RSP data block of A with n1 records and B1 is an RSP data block of B with n2 records, then A1 ∪ B1 is an RSP data block of A ∪ B if n1/n2 = N1/N2.

Proof: Suppose the sample distribution functions of A, B, A1 and B1 are F1(x), F2(x), F̃1(x) and F̃2(x), respectively. We have E[F̃1(x)] = F1(x) and E[F̃2(x)] = F2(x). For any real number x, the number of records of A1 ∪ B1 whose values are not greater than x is n1·F̃1(x) + n2·F̃2(x). Therefore, the sample distribution function of A1 ∪ B1 is:

F̃(x) = (n1·F̃1(x) + n2·F̃2(x)) / (n1 + n2).

Similarly, the sample distribution function of A ∪ B is:

F(x) = (N1·F1(x) + N2·F2(x)) / (N1 + N2).

The expectation of F̃(x) is

E[F̃(x)] = E[(n1·F̃1(x) + n2·F̃2(x)) / (n1 + n2)]
         = (n1·E[F̃1(x)] + n2·E[F̃2(x)]) / (n1 + n2)
         = ((n1/n2)·F1(x) + F2(x)) / ((n1/n2) + 1)
         = ((N1/N2)·F1(x) + F2(x)) / ((N1/N2) + 1)
         = F(x).

Therefore, A1 ∪ B1 is an RSP data block of A ∪ B.
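A small Monte Carlo illustration of Theorem 1 (a sketch under simplifying assumptions: simple random samples stand in for RSP data blocks, and the data is synthetic):

import numpy as np

rng = np.random.default_rng(3)

# Two data sets with different distributions and sizes N1, N2.
A = rng.normal(0.0, 1.0, size=400_000)
B = rng.exponential(2.0, size=200_000)

# Blocks of A and B whose sizes keep the ratio n1/n2 = N1/N2 = 2.
A1 = rng.choice(A, size=4_000, replace=False)
B1 = rng.choice(B, size=2_000, replace=False)

# The merged block should track the distribution of the merged data set.
union_full = np.concatenate([A, B])
union_blocks = np.concatenate([A1, B1])
for x in [-1.0, 0.0, 1.0, 3.0]:
    print(x, np.mean(union_full <= x), np.mean(union_blocks <= x))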
7. BLOCK LEVEL SAMPLE: A SAMPLE OF RANDOM SAMPLE DATA BLOCKS

As each block in an RSP data model is a random sample of the big data set D, the RSP data model can be used as a reduced representative space to directly draw samples of RSP data blocks instead of drawing samples of individual records from D. In contrast to the classical record level sampling, we call this sampling method block level sampling: blocks are sampled without replacement and with equal probability. Thus, a sample of RSP data blocks is a set of non-overlapping random samples from D. This is more efficient, especially when K << N, because it avoids the full scan of D each time a random sample is needed.

Definition 4 (Block Level Sample): Let T be an RSP of a big data set D. S = {D1, D2, · · · , Dg}, where g < K, is a block level sample if its RSP data blocks are randomly selected from T without replacement and with equal probability.

The block-sampling operation is called separately in each analysis process so that samples of RSP data blocks are selected without replacement, i.e., without repeating a block either in the same sample or in other samples of the same analysis process. The number of RSP data blocks g in a block sample depends on the available computing resources, so that each selected block can be loaded and analyzed locally on one node or core. In such a case, a sample of RSP data blocks is analyzed in one batch in a perfectly parallel manner. This sampling process can be refined to select blocks depending on the availability of nodes in the computing cluster, which can lead to better scheduling algorithms.
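A block level sample in the sense of Definition 4 takes only a few lines of Python; the RSP T below is a hypothetical in-memory stand-in for blocks that would normally sit in HDFS:

import numpy as np

rng = np.random.default_rng(11)

# Stand-in RSP of K blocks (e.g., as produced by the sketch of Algorithm 1).
K = 40
T = [rng.normal(size=30_000) for _ in range(K)]

# Block level sample: g blocks drawn from T without replacement and with
# equal probability; no record level scan of D is needed.
g = 5
chosen = rng.choice(K, size=g, replace=False)
S = [T[k] for k in chosen]
print("selected block indices:", chosen)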
Block level sampling can be employed for designing data analysis pipelines, for example, analyzing big data in batches (i.e., each batch uses one sample of RSP data blocks) to stepwise collect and improve statistical models and estimates rather than trying to load the data set all at once. Further statistical investigation can be done on the selected RSP data blocks. For example, exploratory data analysis techniques can be used to visualize and compare data distributions between RSP data blocks and classical random samples. Quantitative measures can also be used, such as the maximum mean discrepancy (MMD) measure [5] for the similarity between data distributions and Hotelling's T-Square test for the difference between means of features. This kind of statistical testing is fundamental for many data analysis tasks. In the following sections, we describe how block level samples can be used effectively and efficiently for estimating statistics and building ensemble models.
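As a sketch of such a batch-wise pipeline (synthetic data; the ensemble function here is simply averaging block means, as suggested earlier for statistics estimation), each batch of RSP data blocks refines a running estimate without ever scanning all of D at once:

import numpy as np

rng = np.random.default_rng(21)

# Stand-in RSP: K blocks of a skewed data set.
K, n = 100, 10_000
D = rng.lognormal(0.0, 1.0, size=K * n)
T = rng.permutation(D).reshape(K, n)

# Process batches of g blocks, updating the estimate after each batch.
g, order = 5, rng.permutation(K)
block_means = []
for start in range(0, 20, g):                        # four batches, 20 blocks in total
    batch = order[start:start + g]
    block_means.extend(T[k].mean() for k in batch)   # one mean per block
    print(f"after {len(block_means)} blocks: estimate = {np.mean(block_means):.4f}")
print(f"true mean on all of D:   {D.mean():.4f}")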
12. ACKNOWLEDGMENTS

The first and second authors contributed equally to this paper, which was supported by grants from the National Natural Science Foundation of China (Nos. 61473194 and 61503252) and the China Postdoctoral Science Foundation (No. 2016T90799).
13. REFERENCES

[1] C. C. Aggarwal. Data Mining: The Textbook. Springer Publishing Company, Incorporated, 2015.

[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, Jan. 2008.

[3] J. Gantz and D. Reinsel. Extracting value from chaos, 2011.

[4] Í. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing approximations to MapReduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 383–397. ACM, 2015.

[5] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, Mar. 2012.

[6] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan. The big data bootstrap. In ICML, 2012.

[7] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang. Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3):145–164, 2016.

[8] S. Salloum, J. Z. Huang, and Y. He. Empirical analysis of asymptotic ensemble learning for big data. In Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT '16, pages 8–17, New York, NY, USA, 2016. ACM.

[9] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.

[10] M. Vojnovic, F. Xu, and J. Zhou. Sampling based range partition methods for big data analytics. Technical Report MSR-TR-2012-18, Microsoft Research, February 2012.

[11] M. Zaharia, M. Chowdhury, T. Das, A. Dave, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), pages 2–2, Apr. 2012.

[12] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, et al. Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, 2016.