
Big Data Dimensional Analysis

Vijay Gadepally & Jeremy Kepner


MIT Lincoln Laboratory, Lexington, MA 02420
{vijayg, jeremy}@ll.mit.edu

This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

Abstract—The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. One of the main challenges associated with big data variety is automatically understanding the underlying structures and patterns of the data. Such an understanding is required as a pre-requisite to the application of advanced analytics to the data. Further, big data sets often contain anomalies and errors that are difficult to know a priori. Current approaches to understanding data structure are drawn from traditional database ontology design. These approaches are effective, but often require too much human involvement to be practical for the volume, velocity and variety of data encountered by big data systems. Dimensional Data Analysis (DDA) is a proposed technique that allows big data analysts to quickly understand the overall structure of a big dataset and determine anomalies. DDA exploits structures that exist in a wide class of data to quickly determine the nature of the data and its statistical anomalies. DDA leverages existing schemas that are employed in big data databases today. This paper presents DDA, applies it to a number of data sets, and measures its performance. The overhead of DDA is low, and it can be applied to existing big data systems without greatly impacting their computing requirements.

Keywords—Big Data, Data Analytics, Dimensional Analysis

I. INTRODUCTION

The challenges associated with big data are commonly referred to as the 3 V's of Big Data: Volume, Velocity and Variety [1]. The 3 V's provide a guide to the largest outstanding challenges associated with working with big data systems. Big data volume stresses the storage, memory and compute capacity of a computing system and requires access to a computing cloud. Big data velocity stresses the rate at which data can be absorbed and meaningful answers produced. Big data variety makes it difficult to develop algorithms and tools that can address the large variety of input data.

The MIT SuperCloud infrastructure [2] is designed to address the challenge of big data volume. To address big data velocity concerns, MIT Lincoln Laboratory worked with various U.S. government agencies to develop the Common Big Data Architecture and its associated Apache Accumulo database. Finally, to address big data variety problems, MIT Lincoln Laboratory developed the D4M technology and its associated schema [3], which is widely used across the Accumulo community.

While these techniques and technologies continue to evolve with the increase in each of the V's of big data, analysts who work with data to extract meaningful knowledge have realized that the ability to quantify low-level parameters of big data can be an important first step in an analysis pipeline. For example, in the case of machine learning, removing extraneous dimensions or erroneous records allows the algorithms to focus on meaningful data. Thus, the first step for a machine learning analyst is manually cleaning the big dataset or performing dimensionality reduction through techniques such as random projections [4] or sketching [5]. Such tasks require a coherent understanding of the data set, which can also provide insight into any weaknesses that may be present in the data set. Further, detailed analysis of each data set is required to determine any internal patterns that may exist.

The process for analyzing a big data set can often be summarized as follows:

1) Learn about data structure through Dimensional Data Analysis (DDA);
2) Determine a background model of the big data;
3) Use the data structure and background model for feature extraction, dimensionality reduction, or noise removal;
4) Perform advanced analytics; and
5) Explore results and data.

These steps can be adapted to a wide variety of data and borrow heavily from the processes and tools developed for the signal processing community. The first two steps in this process provide a high level view of a given data set - a very important step to ensure that data inconsistencies are known prior to complex analytics that may obscure the existence of noise. Traditionally, this view is obtained as a byproduct of standard database ontology techniques, whereby a database analyst or data architect examines the data in detail prior to assembling the database schema. In big data systems, data can change quickly or whole new classes of data can appear unexpectedly, and it is not feasible for this level of analysis to be employed. The D4M (d4m.mit.edu) schema addresses part of this problem by allowing a big data system to absorb and index a wide range of data with only a handful of tables that are consistent across different classes of data. An added byproduct of the D4M schema is that common structures emerge that can be exploited to quickly or automatically characterize data.

Dimensional Data Analysis (DDA) is a technique to learn about data structure and can be used as a first step with a new or unknown big data set. This technique can be used to gain an understanding of corpus structure, important dimensions, and data corruptions (if present).

The article is organized as follows. Section II describes the MIT SuperCloud technologies designed to mitigate the challenges associated with the 3 V's of big data. Section III provides the mathematical and application aspects of dimensional analysis. In order to illustrate the value of dimensional analysis, two applications are described in Section IV along with performance measurements for the application of this approach. Finally, Section V concludes the article and discusses future work.
II. BIG DATA AND MIT SUPERCLOUD

The growing gap between data generation and users has influenced the movement towards cloud computing, which can offer centralized large scale computing, storage and communication networks. Currently, there are four multibillion dollar ecosystems that dominate the cloud computing environment: enterprise clouds, big data clouds, SQL database clouds, and supercomputing clouds. The MIT SuperCloud infrastructure was developed to allow the co-existence of all four cloud ecosystems on the same hardware without sacrificing performance or functionality. The MIT SuperCloud uses the Common Big Data Architecture (CBDA), an architectural abstraction that describes the flow of information within such systems as well as the variety of users, data sources, and system requirements. The CBDA has gained widespread adoption and uses the NSA-developed Accumulo database [6], which has demonstrated high performance capabilities (capable of hundreds of millions of database entries per second) and has been used in a variety of applications [7]. These technologies and other tools discussed in this section are used to develop the dimensional analysis technique.

A. Big Data Pipeline

A big data pipeline is a distilled view of the CBDA meant to describe the system components and interconnections involved in most big data systems. The pipeline in Figure 1 has been applied to diverse big data problems such as health care, social media, defense applications, intelligence reports, building management systems, etc.

The generalized five step system was created after observing numerous big data systems in which the following steps are performed (application specific names may differ, but the component's purpose is usually the same):

1) Raw Data Acquisition: Retrieve raw data from external sensors or sources.
2) Parse Data: Raw data is often in a format which needs to be parsed. Store results on a distributed filesystem.
3) Ingest Data: If using a database, ingest parsed data into the database.
4) Query or Scan for Data: Use either the database or filesystem to find information.
5) Analyze Data: Perform complex analytics, visualizations, etc. for knowledge discovery.

B. Associative Arrays

In order to perform complex analytics on databases, it is necessary to develop a mathematical formulation for big data types. Associations between multidimensional entities (tuples) using number/string keys and number/string values can be stored in data structures called associative arrays. Associative arrays are used as the building block for big data types and consist of a collection of key-value pairs.

Formally, an associative array A with possible keys {k1, k2, ..., kd}, denoted A(k), is a partial function that maps two spaces S^d and S:

    A(k) : S^d → S,    k = (k1, ..., kd),    ki ∈ Si

where A(k) is a partial function from d keys to one value:

    A(k) = vi    when a value is defined for the key tuple k
    A(k) = φ     otherwise

Associative arrays support a variety of linear algebraic operations such as summation, union, intersection, and multiplication. Summation of two associative arrays that do not have any common row or column key, for example, performs a concatenation. In the D4M schema, a table in the Accumulo database is an associative array.

C. D4M & D4M Schema

NoSQL databases such as Accumulo have become a popular alternative to traditional database management systems such as SQL. Such databases require a database schema, which can be difficult to design due to big data variety. Big data variety challenges the tools and algorithms developed to process big data sets. The promise of big data is the ability to correlate diverse and heterogeneous data sources to reduce the time to insight. Correlating this data requires putting each format into a common frame of reference so that like entities can be compared. D4M [3] allows vast quantities of highly diverse data to automatically be ingested into a simple common schema that allows every unique element to be quickly queried and correlated.

Within the CBDA, the D4M environment is used to support prototyping of analytic solutions for big data applications. D4M applies the concepts of linear algebra and signal processing to databases through associative arrays; provides a data schema capable of representing most data; and provides a low barrier to entry through a computing API implemented in MATLAB and GNU Octave.

The D4M 2.0 Schema [8] provides a four-table solution that can be used to represent most data values. The four-table solution allows diverse data to be stored in a simple, common format with just a handful of tables that can automatically be generated from the data with no human intervention. Following the schema described in [8], a dense database can be converted by "exploding" each data entry into an associative array where each unique column-value pair is a column. Once in sparse matrix form, the full machinery of linear algebraic graph processing [9, 10] and detection theory can be applied. For example, multiplying two associative arrays correlates the two.

III. DIMENSIONAL DATA ANALYSIS

Dimensional Data Analysis (DDA) provides a principled way to develop a coherent understanding of underlying data structures, data inconsistencies, data patterns, data formatting, etc. Over time, a database may develop inconsistencies or errors, often due to a variety of reasons. Further, it is very common to perform advanced analytics on a database, often looking for small artifacts in the data which may be of interest. In these cases, it is important for a user to understand the information content of their database.
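Before moving to the formal definitions below, a small example may help make the "exploded" associative-array representation of the D4M schema concrete. The following is a minimal sketch in GNU Octave/MATLAB using an ordinary sparse matrix plus row/column key lists rather than the D4M API; the records and key names are hypothetical.

% Three hypothetical records, already split into 'entity|value' keys.
recs = { {'user|alice', 'time|2014-07-01'}, ...
         {'user|bob',   'time|2014-07-01'}, ...
         {'user|alice', 'time|2014-07-02'} };

% Unique exploded column keys become the columns of the sparse array.
cols = unique([recs{:}]);

% E(i,j) = 1 when record i contains the exploded key cols{j}.
E = sparse(numel(recs), numel(cols));
for r = 1:numel(recs)
    [tf, j] = ismember(recs{r}, cols);
    E(r, j(tf)) = 1;
end

disp(full(E));   % 3x4 matrix of 0/1 entries
disp(cols);      % {'time|2014-07-01','time|2014-07-02','user|alice','user|bob'}

In this picture, the columns sharing a prefix such as 'user|' form one sub-associative array Ei of the database E in the notation that follows.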
Fig. 1. Working with big data usually implies working with these 5 steps.

Fig. 2. A database can be represented as the sum (concatenation) of a series of sub-associative arrays that correspond to different ideal or vestigial arrays.

First, it is necessary to define the components of a database in formal terms. A database can be represented by a model that is described as a sum of sparse associative arrays (Figure 2). Consider a database E represented as the sum of sparse sub-associative arrays Ei:

    E = ∑_{i=1}^{n} Ei

where i corresponds to the different entities that comprise the n entities of E. Each Ei has the following properties/definitions:

• N is the number of rows in the whole database (number of rows in associative array E).
• Ni is the number of rows in database entity i with at least one value (number of rows in associative array Ei).
• Mi is the number of unique values in database column i (number of columns in associative array Ei).
• Vi is the number of values in database column i (number of non-zero values in associative array Ei).

With these definitions, the following global sums hold:

    N ≤ ∑_i Ni
    M = ∑_i Mi
    V = ∑_i Vi

where N, M, and V correspond to the number of rows, columns and values in database E, respectively.

Theoretically, each sub-associative array Ei can be typed as an ideal or vestigial array depending on its properties:

• Identity (I): sub-associative array Ei in which the number of rows and columns are of the same order: Ni ∼ Mi.
• Authoritative (A): sub-associative array Ei in which the number of rows is significantly smaller than the number of columns: Ni ≪ Mi.
• Organizational (O): sub-associative array Ei in which the number of rows is significantly greater than the number of columns: Ni ≫ Mi.
• Vestigial (δ): sub-associative array Ei in which the number of rows and columns are both significantly small: Ni ∼ 1, Mi ∼ 1.

Conceptually, data collection for each of the entities is intended to follow the structure of the ideal models. However, due to inconsistencies and changes over time, they may develop vestigial qualities or differ from the intended ideal array. By comparing a given sub-associative array to the structures described above, it is possible to learn about a given database and recognize inconsistencies or errors.

A. Performing DDA

Consider a database E. In a real system, E is a large sparse associative array representation of all the data in a database using the schema described in the previous sections. Suppose that E is made up of k entities, such that:

    E = ∑_{i=1}^{k} Ei
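Continuing the earlier sketch (ordinary GNU Octave/MATLAB on the toy E and cols variables, not the D4M API), the dimensions of each sub-associative array Ei can be extracted as follows; the entity names are simply the prefixes of the exploded column keys.

% Entity of each column is the prefix before the '|' delimiter.
prefix = cell(size(cols));
for j = 1:numel(cols)
    prefix{j} = strtok(cols{j}, '|');
end
entities = unique(prefix);

% For each entity i, restrict E to that entity's columns (Ei) and
% record the triple (Ni, Mi, Vi).
for i = 1:numel(entities)
    Ei = E(:, strcmp(prefix, entities{i}));
    Ni = nnz(any(Ei, 2));   % rows with at least one value
    Mi = nnz(any(Ei, 1));   % unique column (value) keys present
    Vi = nnz(Ei);           % number of stored values
    fprintf('%-6s N=%d M=%d V=%d\n', entities{i}, Ni, Mi, Vi);
end

On the toy corpus this prints N=3, M=2, V=3 for the user entity; on a real table the same loop produces the triples that Algorithm 1 formalizes below.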
In a real database, these entities typically relate to various dimensions in the dataset. For example, entities may be time stamp, username, building ID number, etc. Each of the associative arrays corresponding to Ei is referred to as a sub-associative array. Dimensional analysis compares the structure of each Ei with the intended structural model. This process consists of the steps described in Algorithm 1.

Data: DB represented by sparse associative array E
Result: Dimensions of sub-associative arrays corresponding to entities
for entity i in k do
    read sub-associative array Ei ∈ E;
    if number of rows in Ei ≥ 1 then
        Ni = number of rows in Ei;
        Mi = number of unique columns in Ei;
        Vi = number of values in Ei;
    else
        go to next entity;
    end
end
Algorithm 1: Dimensional Analysis Algorithm

Using the algorithm above, let the dimensions of each sub-associative array Ei be contained in the 3-tuple (Ni, Mi, Vi), corresponding to the number of rows, columns and values in each sub-associative array, which corresponds to a single entity.

B. Using DDA Results

Once the tuples corresponding to each entity are collected for a database E, one can compare the dimensions with the ideal and vestigial arrays described in Section III to determine the approximate intended structural model for each entity.

Once the intended structural model for an entity is determined, it is possible to highlight interesting patterns, anomalies, formatting, and inconsistencies. For example:

• Authoritative (A): Important entity values (such as usernames, words, etc.) are highlighted by:
      Ei ∗ 1_{N×1} > 1
      1_{1×N} ∗ Ei > 1

• Identity (I): Misconfigured or non-standard entity values are highlighted by:
      Ei ∗ 1_{N×1} > 1
      1_{1×N} ∗ Ei > 1
      I_{N×N} − Ei ≠ 0_{N×N}

• Organizational (O): The mapping structure of a sub-associative array is highlighted by counts and correlations in which:
      Ei ∗ 1_{N×1} ≫ 1
      Ei^T ∗ Ej ≫ 1
      1_{1×N} ∗ Ei = 1

• Vestigial (δ): Erroneous or misconfigured entries can typically be determined by inspecting Ei.

The difference between a sub-associative array and an intended model such as those above provides valuable information about failed processes, corrupted or junk data, non-working sensors, etc. Further, the actual dimensions of each sub-associative array can provide information about the structure of a database that enables a high level understanding of a particular data dimension.

IV. APPLICATION EXAMPLES

In this section, we provide two example data sets and the results obtained through DDA. This section is meant to illustrate the concepts described before.

A. Geo Tweets Corpus

Social media analysis is a growing area of interest in the big data community. Very often, large amounts of data are collected through a variety of data generation processes, and it is necessary to learn about the low-level structural information behind such data. Twitter is a microblog that allows up to 140-character "tweets" by a registered user. Each tweet is published by Twitter and is available via a publicly accessible API. Many tweets contain geo-tagged information if enabled by the user. A prototype Twitter dataset containing 2.02 million tweets was used for the dimensional analysis.

1) Dimensional Analysis Procedure: The process outlined in the previous section was used to perform dimensional analysis on a set of Twitter data with the intent of finding any anomalies, special accounts, etc. The database consists of 2.02 million rows and values distributed across 10 different entities such as latitude, longitude, userID, username, etc. The associative array representation of the Twitter corpus is shown in Figure 3. The 10 dimensions or entities of the database that make up the full dataset are also shown.

2) DDA Results: Dimensional analysis of the dataset can be performed by running Algorithm 1 on each of the entities i in the k possible entities. For example, E7 = ETime is the associative array in which all of the column keys correspond to time stamps. Thus, the triple (NTime, VTime, MTime) is the number of entries with a time stamp, the number of time stamp entries in the corpus, and the number of unique time stamp values, respectively. Performing Algorithm 1 on each of the 10 entities yields the results described in Table I.

Using the definitions from Section III, we can quickly determine important characteristics. For example, to find the most popular users, we can look at the entries where Euser ∗ 1_{N×1} > 1. Using D4M, this computation can easily be performed with associative arrays to yield the most popular users. Performing this analysis on the full 2.02 million tweet dataset represented by an associative array E:

% Extract Associative Array Euser
>>Euser = E(:,StartsWith('user|,'));
%Add up count of all users
>>Acommon = sum(Euser, 1);
%Display most common users
>>display(Acommon>150);
(1,user|SFBayRoadAlerts) 258
(1,user|akhbarhurra) 177
(1,user|attir_midzi) 159
(1,user|verkehr_bw) 300

The results above indicate that there are 4 users who have greater than 150 tweets in the dataset.
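For readers without the D4M toolbox, roughly the same query can be sketched with ordinary sparse-matrix operations on an exploded matrix E and column-key list cols as in the earlier sketches; the threshold and variable names are illustrative, not the authors' code.

% Column sums count how many records contain each exploded key.
counts = full(sum(E, 1));

% Keep only 'user|' columns whose count exceeds the threshold.
thresh  = 150;
isuser  = strncmp(cols, 'user|', 5);
popular = find(isuser & counts > thresh);

for j = popular
    fprintf('%s  %d\n', cols{j}, counts(j));
end

On the toy three-record example nothing clears the threshold; on the full corpus this plays the same role as the Acommon > 150 display above.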
Fig. 3. Associative array representation of Twitter data. E1, E2, ..., E10 represent the concatenated associative arrays Ei that constitute all the entities in the full dataset. Each blue dot corresponds to a value of 1 in the associative array representation.

TABLE I. Dimensional analysis performed on 2.02 million tweets

Entity      Ni        Vi        Mi        Structure Type
latlon      1624984   1625197   1506465   Identity
lat         1624984   1625192   1504469   Identity
lon         1625061   1625725   1504619   Identity
place       1741337   1741516   1504619   Identity
retweetID   636455    636644    627163    Identity
reuserID    720624    722148    676616    Identity
time        2020000   2020000   35176     Organization
userID      2020000   2020000   1711141   Identity
user        2020000   2020000   1711143   Identity
word        1976746   17180314  7838862   Authority
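The Structure Type column in Table I follows from comparing Ni with Mi as defined in Section III. The following sketch reproduces the typing for a few Table I rows; the ratio thresholds are our own illustrative choice, since the paper does not specify numeric cutoffs.

% (Ni, Mi) pairs taken from Table I.
entity = {'latlon', 'time', 'userID', 'word'};
N = [1624984, 2020000, 2020000, 1976746];
M = [1506465,   35176, 1711141, 7838862];

thresh = 3;   % illustrative cutoff for "significantly" larger/smaller
for i = 1:numel(entity)
    r = N(i) / M(i);
    if M(i) <= 1
        t = 'Vestigial';        % essentially a single value; not seen in Table I
    elseif r > thresh
        t = 'Organizational';   % many more rows than columns
    elseif r < 1/thresh
        t = 'Authoritative';    % many more columns than rows
    else
        t = 'Identity';         % rows and columns of the same order
    end
    fprintf('%-7s Ni/Mi = %7.2f -> %s\n', entity{i}, r, t);
end

With this cutoff the four rows come out as Identity, Organizational, Identity and Authoritative, matching the labels in Table I.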

3) DDA Performance: Ingesting data into a database is often an expensive and time consuming process. One of the features of DDA is the ability to potentially reduce the amount of information that needs to be stored. Figure 4 describes the relative time taken by DDA compared to data ingest.

From this comparison, it is clear that DDA takes a fraction of the time compared to ingest. By using DDA, one may be able to remove entries that need to be ingested, thus reducing the overall ingest time.

Fig. 4. Relative performance between DDA and ingesting data: DDA and ingestion times for Twitter data (time for DDA/data ingest in seconds) versus the number of tweets, from 50,000 to 2,000,000.

B. HPC Scheduler Log Files

Another application in which dimensional analysis was tested is HPC scheduler log files. LLSuperCloud uses the Grid Engine scheduler for dispatching jobs. For each job that is finished, an accounting record is written to an accounting file. These records can be used in the future to generate statistics about accounts, system usage, etc. Each line in the accounting file represents an individual job that has completed.

1) DDA Procedure: The process outlined in the previous section was used to perform dimensional analysis on a set of SGE accounting data with the intent of finding any anomalies, special accounts, etc. The database consists of approximately 11.5 million entries with 27 entities each. A detailed description of the entities in the SGE accounting file can be found at [11].

The associative array representation of the SGE corpus is shown in Figure 5. The 27 "dimensions" of data that make up the full data set are shown in Figure 5.

2) DDA Results: After performing dimensional analysis on the SGE corpus, the results are tallied for inspection. A subset of the results is shown in Table II. It is interesting to note that there are many accounting file entries that are not collected and have only default values. For example, the field "defaultdepartment" contains only one unique value in the entire dataset - "default". For an individual wishing to perform more advanced analytics on the dataset, this is an important result and can be used to reduce the dimensionality of each data point.
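Single-valued fields such as "defaultdepartment" are exactly the Vestigial case (Mi ∼ 1) from Section III and can be flagged automatically before further analytics. A hedged sketch, reusing the E, cols, prefix and entities variables from the earlier GNU Octave/MATLAB sketches (not the D4M API):

% Identify columns that belong to an entity with at most one unique value;
% such columns add no information and can be dropped before analytics.
keep = true(1, numel(cols));
for i = 1:numel(entities)
    sel = strcmp(prefix, entities{i});   % columns of entity i
    Mi  = nnz(any(E(:, sel), 1));        % unique values observed for entity i
    if Mi <= 1
        keep(sel) = false;               % entity is vestigial; drop it
    end
end

Ereduced    = E(:, keep);
colsreduced = cols(keep);
fprintf('Dropped %d of %d exploded columns as vestigial.\n', ...
        nnz(~keep), numel(cols));

Applied to the SGE corpus, a loop of this kind would drop fields like Account, Default Department and Priority from Table II, shrinking each record before ingest or machine learning.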
A D4M code snippet to find the most common job names is shown below.

% Extract Associative Array Ejobname
>>Ejobname = E(:,StartsWith('job_name|,'));
%Add up count of all job names
>>Acommon = sum(Ejobname, 1);
%Display most common job names
>>display(Acommon>1000000);
(1,job_name|rolling_pipeline.sh) 2762791
(1,job_name|run_blast.sh) 1256422
(1,job_name|run_blast_parser.sh) 1162522

Interestingly, of the 27 dimensions of data in the SGE log files, 8 of the entities are not actually recorded. This information can be very important to one interested in performing advanced analytics on a dataset in which nearly one third of the data is unchanging.

Fig. 5. Associative array representation of SGE accounting data: the concatenated associative arrays Ei that make up the full dataset.

TABLE II. Dimensional analysis performed on 11.5 million Sun Grid Engine accounting entries. Only selected entries are shown of the 27 total entities collected.

Entity               Ni         Vi         Mi        Structure Type
Account              11446187   11446187   1         Vestigial
CPU Hours            11446187   11446187   2752964   Identity
Default Department   11446187   11446187   1         Vestigial
Job Name             11446187   11446187   90491     Organization
Job Number           11446187   11446187   485212    Identity
Memory Usage         11446187   11446187   5241559   Identity
Priority             11446187   11446187   1         Vestigial
Task Number          11446187   11446187   7491889   Identity
User Name            11446187   11446187   8388      Organization

V. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a process, called Dimensional Data Analysis, for understanding the structural characteristics of a database. Using DDA, a researcher can learn a great deal about the hidden patterns, structural characteristics and possible errors of a large unknown database. DDA consists of representing a dataset using associative arrays and performing a comparison between the constituent associative arrays and intended ideal database arrays. Deviations from the intended model can highlight important details or incorrect information.

We recommend that the DDA technique be the first step of an analytic pipeline. The common next steps in an analytic pipeline, such as background modeling, feature extraction, machine learning and visual analytics, depend heavily on the quality of the input data.

Next steps for this work include developing an automated mechanism to perform background modeling of big datasets, and the application of detection theory to big data sets.

ACKNOWLEDGMENT

The authors would like to thank the LLGrid team at MIT Lincoln Laboratory for their support and expertise in setting up the computing environment.

REFERENCES

[1] D. Laney, "3D data management: Controlling data volume, velocity and variety," META Group Research Note, vol. 6, 2001.
[2] A. Reuther, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, A. Prout, et al., "LLSuperCloud: Sharing HPC systems for diverse rapid prototyping," in High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pp. 1-6, IEEE, 2013.
[3] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz, et al., "Dynamic distributed dimensional data model (D4M) database and computation system," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 5349-5352, IEEE, 2012.
[4] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245-250, ACM, 2001.
[5] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604-613, ACM, 1998.
[6] "Apache Accumulo," https://accumulo.apache.org/.
[7] C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, D. O'Gwynn, et al., "Driving big data with big compute," in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pp. 1-6, IEEE, 2012.
[8] J. Kepner, C. Anderson, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, D. O'Gwynn, et al., "D4M 2.0 schema: A general purpose high performance schema for the Accumulo database," in High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pp. 1-6, IEEE, 2013.
[9] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra, vol. 22. SIAM, 2011.
[10] J. Kepner, D. Ricke, and D. Hutchinson, "Taming biological big data with D4M," Lincoln Laboratory Journal, vol. 20, no. 1, 2013.
[11] "Ubuntu manpage: Sun Grid Engine accounting file format," http://manpages.ubuntu.com/manpages/lucid/man5/sge_accounting.5.html.
