Big Data Dimensional Analysis
Fig. 2. A database can be represented as the sum (concatenation) of a series of sub-associative arrays that correspond to different ideal or vestigial arrays.
First, it is necessary to define the components of a database in formal terms.

A database can be represented by a model that is described as a sum of sparse associative arrays (Figure 2). Consider a database $E$ represented as the sum of sparse sub-associative arrays $E_i$:

$$E = \sum_{i=1}^{n} E_i$$

where $i$ indexes the $n$ entities that comprise $E$. Each $E_i$ has the following properties/definitions:

• $N$ is the number of rows in the whole database (number of rows in associative array $E$).
• $N_i$ is the number of rows in database entity $i$ with at least one value (number of rows in associative array $E_i$).
• $M_i$ is the number of unique values in database column $i$ (number of columns in associative array $E_i$).
• $V_i$ is the number of values in database column $i$ (number of non-zero values in associative array $E_i$).

With these definitions, the following global sums hold:

$$N \le \sum_i N_i, \qquad M = \sum_i M_i, \qquad V = \sum_i V_i$$

where each sum runs over all entities $i$, and $N$, $M$, and $V$ correspond to the number of rows, columns, and values in database $E$, respectively.

Theoretically, each sub-associative array ($E_i$) can be typed as an ideal or vestigial array depending on its properties:

• Identity (I): Sub-associative array $E_i$ in which the number of rows and columns are of the same order:
  $$N_i \sim M_i$$
• Authoritative (A): Sub-associative array $E_i$ in which the number of rows is significantly smaller than the number of columns:
  $$N_i \ll M_i$$
• Organizational (O): Sub-associative array $E_i$ in which the number of rows is significantly greater than the number of columns:
  $$N_i \gg M_i$$
• Vestigial ($\delta$): Sub-associative array $E_i$ in which the number of rows and the number of columns are both very small:
  $$N_i \sim 1, \qquad M_i \sim 1$$

Conceptually, data collection for each of the entities is intended to follow the structure of ideal models. However, due to inconsistencies and changes over time, the entities may develop vestigial qualities or differ from the intended ideal array. By comparing a given sub-associative array to the structures described above, it is possible to learn about a given database and recognize inconsistencies or errors.

A. Performing DDA

Consider a database $E$. In a real system, $E$ is a large sparse associative array representation of all the data in a database using the schema described in the previous sections. Suppose that $E$ is made up of $k$ entities, such that:

$$E = \sum_{i=1}^{k} E_i$$
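As an illustration of this decomposition, the following is a minimal Python sketch rather than the D4M implementation used in this work. It assumes the database is available as a list of unique (row key, column key, value) triples whose column keys have the form `entity|value`; the toy data and the helper names `split_by_entity` and `dimensions` are hypothetical. The sketch splits $E$ into its sub-associative arrays, counts $(N_i, M_i, V_i)$ for each, and checks the global sums stated above.

```python
from collections import defaultdict

# Hypothetical toy database: each triple is (row key, "entity|value" column key, 1).
# Triples are assumed unique, i.e. no (row, column) pair appears twice.
triples = [
    ("t1", "user|alice", 1), ("t1", "time|2014-01-01", 1), ("t1", "lat|42.36", 1),
    ("t2", "user|bob", 1),   ("t2", "time|2014-01-01", 1),
    ("t3", "user|alice", 1), ("t3", "time|2014-01-02", 1), ("t3", "lat|40.71", 1),
]

def split_by_entity(triples, sep="|"):
    """Group triples into sub-associative arrays E_i, keyed by entity name."""
    sub_arrays = defaultdict(list)
    for row, col, val in triples:
        entity = col.split(sep, 1)[0]  # text before the separator names the entity
        sub_arrays[entity].append((row, col, val))
    return dict(sub_arrays)

def dimensions(triples):
    """Return (rows with at least one value, unique columns, non-zero values)."""
    rows = {r for r, _, v in triples if v != 0}
    cols = {c for _, c, v in triples if v != 0}
    vals = sum(1 for _, _, v in triples if v != 0)
    return len(rows), len(cols), vals

sub_arrays = split_by_entity(triples)
dims = {name: dimensions(t) for name, t in sub_arrays.items()}
N, M, V = dimensions(triples)

# Global sums: N <= sum of N_i, M = sum of M_i, V = sum of V_i.
assert N <= sum(n_i for n_i, _, _ in dims.values())
assert M == sum(m_i for _, m_i, _ in dims.values())
assert V == sum(v_i for _, _, v_i in dims.values())

for name, (n_i, m_i, v_i) in dims.items():
    print(name, (n_i, m_i, v_i))
```

The inequality for $N$ is strict in this toy data because the same row carries values for several entities, so it is counted once in $N$ but once per entity in the sum of the $N_i$.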
In a real database, these entities typically relate to various dimensions in the dataset. For example, entities may be time stamp, username, building id number, etc. Each of the associative arrays corresponding to $E_i$ is referred to as a sub-associative array. Dimensional analysis compares the structure of each $E_i$ with the intended structural model. This process consists of the steps described in Algorithm 1.

    Data: DB represented by sparse associative array $E$
    Result: Dimensions of sub-associative arrays corresponding to entities
    for each entity $i$ of the $k$ entities do
        read sub-associative array $E_i \in E$;
        if number of rows in $E_i \ge 1$ then
            $N_i$ = number of rows in $E_i$;
            $M_i$ = number of unique columns in $E_i$;
            $V_i$ = number of values in $E_i$;
        else
            go to next entity;
        end
    end
    Algorithm 1: Dimensional Analysis Algorithm

Using the algorithm above, let the dimensions of each sub-associative array ($E_i$) be contained in the 3-tuple $(N_i, M_i, V_i)$, corresponding to the number of rows, columns, and values in each sub-associative array, which in turn corresponds to a single entity.

B. Using DDA Results

Once the tuples corresponding to each entity are collected for a database $E$, one can compare the dimensions with the ideal and vestigial arrays described in Section III to determine the approximate intended structural model for each entity.

Once the intended structural model for an entity is determined, it is possible to highlight interesting patterns, anomalies, formatting, and inconsistencies. For example:

• Authoritative (A): Important entity values (such as usernames, words, etc.) are highlighted by:
  $$E_i \ast \mathbf{1}_{N \times 1} > 1$$
  $$\mathbf{1}_{1 \times N} \ast E_i > 1$$
• Identity (I): Misconfigured or non-standard entity values are highlighted by:
  $$E_i \ast \mathbf{1}_{N \times 1} > 1$$
  $$\mathbf{1}_{1 \times N} \ast E_i > 1$$
  $$I_{N \times N} - E_i \ne 0_{N \times N}$$
• Organizational (O): The mapping structure of a sub-associative array is highlighted by counts and correlations in which:
  $$E_i \ast \mathbf{1}_{N \times 1} \gg 1$$
  $$E_i^{T} \ast E_j \gg 1$$
  $$\mathbf{1}_{1 \times N} \ast E_i = 1$$
• Vestigial ($\delta$): Erroneous or misconfigured entries can typically be determined by inspecting $E_i$.

The difference between a sub-associative array and an intended model such as those above provides valuable information about failed processes, corrupted or junk data, non-working sensors, etc. Further, the actual dimensions of each sub-associative array can provide information about the structure of a database that enables a high-level understanding of a particular data dimension.

IV. APPLICATION EXAMPLES

In this section, we provide two example data sets and the results obtained through DDA. This section is meant to illustrate the concepts described before.

A. Geo Tweets Corpus

Social media analysis is a growing area of interest in the big data community. Very often, large amounts of data are collected through a variety of data generation processes, and it is necessary to learn about the low-level structural information behind such data. Twitter is a microblog that allows "tweets" of up to 140 characters by a registered user. Each tweet is published by Twitter and is available via a publicly accessible API. Many tweets contain geo-tagged information if enabled by the user. A prototype Twitter dataset containing 2.02 million tweets was used for the dimensional analysis.

1) Dimensional Analysis Procedure: The process outlined in the previous section was used to perform dimensional analysis on a set of Twitter data with the intent of finding any anomalies, special accounts, etc. The database consists of 2.02 million rows and values distributed across 10 different entities such as latitude, longitude, userID, username, etc. The associative array representation of the Twitter corpus is shown in Figure 3. The 10 dimensions or entities of the database that make up the full dataset are also shown.

2) DDA Results: Dimensional analysis of the dataset can be performed by applying Algorithm 1 to each entity $i$ of the $k$ possible entities. For example, $E_7 = E_{Time}$ is the associative array in which all of the column keys correspond to time stamps. Thus, the triple $(N_{Time}, M_{Time}, V_{Time})$ gives the number of entries with a time stamp, the number of unique time stamp values, and the number of time stamp entries in the corpus, respectively. Performing Algorithm 1 on each of the 10 entities yields the results described in Table I.

Using the definitions in Section III, we can quickly identify important entity values. For example, to find the most popular users, we can look at the entries where $E_{user} \ast \mathbf{1}_{N \times 1} > 1$. Using D4M, this computation can easily be performed with associative arrays to yield the most popular users. Performing this analysis on the full 2.02 million tweet dataset represented by an associative array $E$:

    % Extract associative array Euser
    >> Euser = E(:,StartsWith('user|,'));
    % Add up count of all users
    >> Acommon = sum(Euser, 1);
    % Display most common users
    >> display(Acommon>150);
    (1,user|SFBayRoadAlerts)   258
    (1,user|akhbarhurra)       177
    (1,user|attir_midzi)       159
    (1,user|verkehr_bw)        300

The results above indicate that there are four users with more than 150 tweets in the dataset.
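For readers without access to D4M, the column-sum query above and the comparison of $(N_i, M_i)$ against the array types of Section III can be approximated in plain Python. This is a sketch under stated assumptions, not the method used to produce the results in this paper: the sub-associative array is assumed to be a list of unique (row key, column key, value) triples, the toy data is invented, and the cut-offs `ratio` and `small` in the hypothetical `classify` helper are arbitrary stand-ins for "significantly smaller/greater" and "of order 1", which the text does not quantify.

```python
from collections import Counter

def column_counts(triples):
    """Column sums of a 0/1 sub-associative array: occurrences per column key."""
    return Counter(col for _, col, val in triples if val != 0)

def frequent_columns(triples, threshold):
    """Columns whose count exceeds `threshold` (analogous to sum(Euser, 1) > threshold)."""
    return {col: n for col, n in column_counts(triples).items() if n > threshold}

def classify(n_i, m_i, ratio=10, small=1):
    """Rough typing of a sub-array from (N_i, M_i).

    `ratio` and `small` are assumed cut-offs, not values from the paper.
    """
    if n_i <= small and m_i <= small:
        return "vestigial"        # N_i ~ 1 and M_i ~ 1
    if m_i >= ratio * n_i:
        return "authoritative"    # N_i << M_i
    if n_i >= ratio * m_i:
        return "organizational"   # N_i >> M_i
    return "identity"             # N_i ~ M_i

# Hypothetical toy E_user: five tweets by alice, one each by bob and carol.
E_user = [(f"t{k}", "user|alice", 1) for k in range(5)]
E_user += [("t5", "user|bob", 1), ("t6", "user|carol", 1)]

print(frequent_columns(E_user, threshold=2))

n_user = len({row for row, _, _ in E_user})
m_user = len({col for _, col, _ in E_user})
print(classify(n_user, m_user))
```

Because the type boundaries depend entirely on the chosen cut-offs, the label returned by `classify` should be read as a first hint about the intended structural model of an entity, to be confirmed by inspecting the sub-associative array itself.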
Fig. 3. Associative array representation of Twitter data. $E_1, E_2, \ldots, E_{10}$ represent the concatenated associative arrays $E_i$ that constitute all the entities in the full dataset. Each blue dot corresponds to a value of 1 in the associative array representation.