Scaling Up To Billions of Cells with DataSpread: Supporting Large Spreadsheets With Databases
a system, DataSpread, that can not only efficiently support operations on billions of records, but naturally incorporates relational database features such as expressiveness and collaboration support. DataSpread uses a standard relational database as a backend (currently PostgreSQL, but nothing ties us to that database), with a web-based spreadsheet system [3] as the frontend. By using a standard relational database, with no modifications to the underlying engine, we can seamlessly leverage improvements to the database, while allowing the same data to be used by other applications. This allows a clean encapsulation and separation of front-end and back-end code, and also admits portability and a simpler design. DataSpread is fully functional — the DataSpread resources, along with video and code, can be found at dataspread.github.io. We demonstrated a primitive version of DataSpread at the VLDB conference last year [11].

While there have been many attempts at combining spreadsheets and relational database functionality, ultimately, all of these attempts fall short because they do not let spreadsheet users perform ad-hoc data manipulation operations [47, 48, 28]. Other work supports expressive and intuitive querying modalities without addressing scalability issues [9, 5, 20], addressing an orthogonal problem. There have been efforts that enhance spreadsheets or databases without combining them [38]. Furthermore, while there has been work on array-based databases, most of these systems do not support edits: for instance, SciDB [13] supports an append-only, no-overwrite data model. We describe related work in more detail in Section 8.

Rest of the Paper. The outline of the rest of the paper is as follows.
• We begin with an empirical study of four real spreadsheet datasets, plus an online user survey, targeted at understanding how spreadsheets are used for data analysis, in Section 2.
• Then, in Section 3, we introduce the notion of a conceptual data model for spreadsheet data, as well as the set of operations we wish to support on this data model.
• In Section 4, we propose three primitive data models for supporting the conceptual data model within a database, along with a hybrid data model that combines the benefits of these primitive data models. We demonstrate that identifying the optimal hybrid data model is NP-HARD, but we can develop a PTIME dynamic programming algorithm that allows us to find an approximately optimal solution.
• Then, in Section 5, we motivate the need for, and develop indexing solutions for, positional mapping—a method for reducing the impact of cascading updates for inserts and deletes on all our data models.
• We give a brief overview of the system architecture from the perspective of our data models in Section 6. We also describe how we seamlessly support standard relational operations in DataSpread.
• We perform experiments to evaluate our data models and positional mapping schemes in Section 7, and discuss related work in Section 8.

2. SPREADSHEET USAGE IN PRACTICE
In this section, we empirically evaluate how spreadsheets are used for data management. We use the insights from this evaluation to both motivate the design decisions for DataSpread and develop a realistic workload for spreadsheet usage. To the best of our knowledge, no such evaluation, focused on the usage of spreadsheets for data analytics, has been performed in the literature.
We focus on two aspects: (a) structure: identifying how users structure and manage data on a spreadsheet, and (b) operations: understanding the common spreadsheet operations that users perform.
To study these two aspects, we first retrieve a large collection of real spreadsheets from four disparate sources, and quantitatively analyze them on different metrics. We supplement this quantitative analysis with a small-scale user survey to understand the spectrum of operations frequently performed. The latter is necessary since we do not have a readily available trace of user operations from the real spreadsheets (e.g., indicating how often users add rows or columns, or edit formulae).
We first describe our methodology for both these evaluations, before diving into our findings for the two aspects.

2.1 Methodology
As described above, we have two forms of evaluation of spreadsheet use: the first via an analysis of spreadsheets, and the second via interviews of spreadsheet users. The datasets can be found at dataspread.github.io.

2.1.1 Real Spreadsheet Datasets
For our evaluation of real spreadsheets, we assemble four datasets from a wide variety of sources.
Internet. This dataset was generated by crawling the web for Excel (.xls) files, using a search engine, across a wide variety of domains. As a result, these 53k spreadsheets vary widely in content, ranging from tabular data to images.
ClueWeb09. This dataset of 26k spreadsheets was generated by extracting .xls file URLs from the ClueWeb09 [15] web crawl dataset.
Enron. This dataset was generated by extracting 18k spreadsheets from the Enron email dataset [26]. These spreadsheets were used to exchange data within the Enron corporation.
Academic. This dataset was collected from an academic institution; this academic institution used these spreadsheets to manage internal data about course workloads of instructors, salaries of staff, and student performance.
We list these four datasets along with some statistics in Table 1. Since the first two datasets are from the open web, they are primarily meant for data publication: as a result, only about 29% and 42% of these sheets (column 3) contain formulae, with the formulae occupying less than 3% of the total number of non-empty cells for both datasets (column 5). The third dataset is from a corporation, and is primarily meant for data exchange, with a similarly low fraction of 39% of these sheets containing formulae, and 3.35% of the non-empty cells containing formulae. The fourth dataset is from an academic institution, and is primarily meant for data analysis, with a high fraction of 91% of the sheets containing formulae, and 23.26% of the non-empty cells containing formulae.

2.1.2 User Survey
To evaluate the kinds of operations performed on spreadsheets, we solicited participants for a qualitative user survey: we recruited thirty participants from industry who exclusively use spreadsheets for data management and analysis. This survey was conducted via an online form, with the participants answering a small number of multiple-choice and free-form questions, followed by the authors aggregating the responses.

2.2 Structure Evaluation
We now use our spreadsheet datasets to understand how data is laid out on spreadsheets.
Across Spreadsheets: Data Density. First, we study how similar real spreadsheets are to relational data conforming to a specific tabular structure. To study this, we estimate the density of each spreadsheet, defined as the ratio of the filled-in cells to the total number of cells—specified by the minimum bounding rectangular box enclosing the filled-in cells—within a spreadsheet.
Table 1: Real spreadsheet datasets and statistics.

Dataset   | Sheets | Sheets with formulae | Sheets with > 20% formulae | % of formulae | Sheets with < 50% density | Sheets with < 20% density
Internet  | 52311  | 29.15%               | 20.26%                     | 1.30%         | 22.53%                    | 6.21%
ClueWeb09 | 26148  | 42.21%               | 27.13%                     | 2.89%         | 46.71%                    | 23.80%
Enron     | 17765  | 39.72%               | 30.42%                     | 3.35%         | 50.06%                    | 24.76%
Academic  | 636    | 91.35%               | 71.26%                     | 23.26%        | 90.72%                    | 60.53%
Figure 1: Data Density — (a) Internet (b) ClueWeb09 (c) Enron (d) Academic
We depict the results in the last two columns of Table 1, and in Figure 1, which depicts the distribution of this ratio. We note that spreadsheets within the Internet, ClueWeb09, and Enron datasets are typically dense, i.e., more than 50% of the spreadsheets have density greater than 0.5. On the other hand, for the Academic dataset, we note that a high proportion (greater than 60%) of spreadsheets have density values less than 0.2. This low density is because the latter dataset embeds a number of formulae and uses forms to report data in a user-accessible interface. Thus, we have:

Takeaway 1: Real spreadsheets vary widely in their density, ranging from highly sparse to highly dense, necessitating data models that can adapt to such variations.

Within a Spreadsheet: Tabular regions. For the spreadsheets that are sparse, we further analyzed them to evaluate whether there are regions within these spreadsheets with high density—essentially indicating that these regions can be regarded as tables. To identify these tabular regions, we first constructed a graph consisting of filled-in cells within each spreadsheet, where two cells (i.e., nodes) have an edge between them if they are adjacent either vertically or horizontally. We then computed the connected components on this graph. We declare a connected component to be a tabular region if it spans at least two columns and five rows, and has an overall density of at least 0.7, defined as before as the ratio of the filled-in cells to the total number of cells in the minimum bounding rectangle encompassing the connected component. In Table 2, for each dataset, we list the total number of tabular regions identified (column 2), the number of filled-in cells covered by these regions (column 3), and the fraction of the total filled-in cells that are captured within these tabular regions (column 4).

Table 2: Tabular regions identified within the spreadsheets.

Dataset        | Tables | Table cells | %Coverage
Internet Crawl | 67,374 | 124,698,013 | 66.03
ClueWeb09      | 37,164 | 52,257,649  | 67.68
Enron          | 9,733  | 8,135,241   | 60.98
Academic       | 286    | 18,384      | 12.10

Popularity: Formulae Usage. We begin by studying how often formulae are used within spreadsheets. On examining Table 1, we find that there is a high variance in the fraction of cells that are formulae (column 5), ranging from 1.3% to 23.26%. We note that the academic institution dataset embeds a high fraction of formulae, indicating that the spreadsheets in that case are used primarily for data management and analysis as opposed to data sharing or publication. Despite that, all of the datasets have a substantial fraction of spreadsheets where the formulae occupy more than 20% of the cells (column 4)—20.26% and higher for all datasets.

Takeaway 3: Formulae are very common in spreadsheets, with over 20% of the spreadsheets containing a large fraction (over 1/5) of formulae, across all datasets. The high prevalence of formulae indicates that optimizing for the access patterns of formulae when developing data models is crucial.

Access: Formulae Distribution and Access Patterns. Next, we study the distribution of formulae used within spreadsheets—see Figure 2. Not surprisingly, arithmetic operations are very common across all datasets. The first three datasets have an abundance of conditional formulae through IF statements (e.g., second bar in Figure 2a)—these statements were typically used to fill in missing data or to change the data type, e.g., IF(H67=true,1.0,0.0). In contrast, the Academic dataset is dominated by formulae on numeric data. Overall, there is a wide variety of formulae that span both a small number of cell accesses (e.g., arithmetic), as well as a large number of them (e.g., SUM, VL short for VLOOKUP). The last two correspond to standard database operations such as aggregation and joins.

Table 3: Cells accessed by formulae.

Dataset   | Total Cells Accessed | Cells accessed per Formula | Components per Formula
Internet  | 2,460,371            | 334.26                     | 2.5
ClueWeb09 | 2,227,682            | 147.99                     | 1.92
Enron     | 446,667              | 143.05                     | 1.75
Academic  | 35,335               | 3.03                       | 1.54
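The tabular-region detection described above (connected components over vertically or horizontally adjacent filled cells, kept when they span at least two columns and five rows with a bounding-box density of at least 0.7) can be sketched as follows. This is a minimal illustration in Java, with the sheet represented as a boolean grid of filled cells; the class and method names are ours rather than DataSpread's.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Sketch: find dense tabular regions among the filled cells of a sheet. */
public class TabularRegions {
    public static List<int[]> findTables(boolean[][] filled) {
        int rows = filled.length, cols = filled[0].length;
        boolean[][] seen = new boolean[rows][cols];
        List<int[]> tables = new ArrayList<>();            // each entry: {minRow, minCol, maxRow, maxCol}
        int[][] dirs = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};  // vertical/horizontal adjacency
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!filled[r][c] || seen[r][c]) continue;
                // BFS over one connected component of filled cells
                int minR = r, maxR = r, minC = c, maxC = c, count = 0;
                ArrayDeque<int[]> queue = new ArrayDeque<>();
                queue.add(new int[]{r, c});
                seen[r][c] = true;
                while (!queue.isEmpty()) {
                    int[] cell = queue.poll();
                    count++;
                    minR = Math.min(minR, cell[0]); maxR = Math.max(maxR, cell[0]);
                    minC = Math.min(minC, cell[1]); maxC = Math.max(maxC, cell[1]);
                    for (int[] d : dirs) {
                        int nr = cell[0] + d[0], nc = cell[1] + d[1];
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
                                && filled[nr][nc] && !seen[nr][nc]) {
                            seen[nr][nc] = true;
                            queue.add(new int[]{nr, nc});
                        }
                    }
                }
                int height = maxR - minR + 1, width = maxC - minC + 1;
                double density = (double) count / (height * width);   // bounding-box density
                if (width >= 2 && height >= 5 && density >= 0.7) {
                    tables.add(new int[]{minR, minC, maxR, maxC});
                }
            }
        }
        return tables;
    }
}
```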
Figure 2: Formulae Distribution — (a) Internet (b) ClueWeb09 (c) Enron (d) Academic
Figure 3: Operations performed on spreadsheets.

We then counted the number of connected components in this graph, and tabulated the results in column 4 of the same table. As can be seen, even though the number of cells accessed may be large, these cells stem from a small number of connected components; as a result, we can exploit spatial locality to execute them more efficiently.

Takeaway 4: Formulae on spreadsheets access cells on the spreadsheet by position; some common formulae such as SUM or VLOOKUP access a rectangular range of cells at a time. The number of cells accessed by these formulae can be quite large, and most of these cells stem from contiguous areas of the spreadsheet.

User-Identified Operations. In addition to identifying how users structure and manage data on a spreadsheet, we now analyze the common spreadsheet operations that users perform. To this end, we conducted a small-scale online survey of 30 participants to study how users operate on spreadsheet data. This qualitative study is valuable since real spreadsheets do not reveal traces of the user operations performed on them (e.g., revealing how often users perform ad-hoc operations like scrolling, sorting, or deleting rows or columns). Our questions in this study were targeted at understanding (a) how users perform operations on the spreadsheet and (b) how users organize data on the spreadsheet.
With the goal of understanding how users perform operations on the spreadsheet, we asked each participant to answer a series of questions, where each question corresponded to whether they conducted the specific operation under consideration, on a scale of 1–5, where 1 corresponds to "never" and 5 to "frequently". For each operation, we plotted the results in a stacked bar chart in Figure 3, with the higher numbers stacked on the smaller ones as the legend indicates.
We find that all thirty participants perform scrolling, i.e., moving up and down the spreadsheet to examine the data, with 22 of them marking 5 (column 1). All participants reported editing individual cells (column 2), and many of them reported performing formula evaluation frequently (column 3). Only four of the participants marked < 4 for some form of row/column-level operations, i.e., deleting or adding one or more rows or columns at a time (column 4).

Takeaway 5: There are several common operations performed by spreadsheet users, including scrolling, row and column modification, and editing individual cells.

Our second goal for performing the study was to understand how users organize their data on a spreadsheet. We asked each participant if their data is organized in well-structured tables, or if the data is scattered throughout the spreadsheet, on a scale of 1 (not organized)–5 (highly organized)—see Figure 3. Only five participants marked < 4, which indicates that users do organize their data on a spreadsheet (column 5). We also asked about the importance of the ordering of records in the spreadsheet, on a scale of 1 (not important)–5 (highly important). Unsurprisingly, only five participants marked < 4 for this question (column 6). We also provided a free-form textual input, where multiple participants mentioned that ordering comes naturally to them and is often taken for granted while using spreadsheets.

Takeaway 6: Spreadsheet users typically try to organize their data as far as possible on the spreadsheet, and rely heavily on the ordering and presentation of the data on their spreadsheets.

3. SPREADSHEET DESIDERATA
The goal for DataSpread is to combine the ease of use and interactivity of spreadsheets, while simultaneously providing the scalability, expressiveness, and collaboration capabilities of databases. Thus, as we develop DataSpread, there are two aspects of interest: first, how do we support spreadsheet semantics over a database backend, and second, how do we support database operations within a spreadsheet. Our primary focus will be on the former, which will occupy the bulk of the paper. We return to the latter in Section 6. For now, we focus on describing the desiderata for supporting spreadsheet semantics over databases. We first describe our conceptual spreadsheet data model, and then describe the desired operations that need to be supported on this conceptual data model.

Figure 4: Sample Spreadsheet (DataSpread screenshot).

Conceptual Data Model. A spreadsheet consists of a collection of cells. A cell is referenced by two dimensions: row and column. Columns are referenced using letters A, ..., Z, AA, ...; while rows are referenced using numbers 1, 2, .... Each cell contains either a value or a formula. A value is a constant belonging to some fixed type. For example, in Figure 4, a screenshot from our working implementation of DataSpread, B2 (column B, row 2) contains the value 10. In contrast, a formula is a mathematical expression that contains values and/or cell references as arguments, to be manipulated by operators or functions. The expression corresponding to a formula eventually unrolls into a value. For example, in Figure 4, cell F2 contains the formula =AVERAGE(B2:C2)+D2+E2, which unrolls into the value 85. The value of F2 depends on the values of cells B2, C2, D2, and E2, which appear in the formula associated with F2. In addition to a value or a formula, a cell could additionally have formatting associated with it; e.g., a cell could have a specific width, or the text within a cell can have a bold font, and so on. For simplicity, we ignore formatting aspects, but these aspects can be easily captured within our representation schemes without significant changes.
Spreadsheet Operations. We now describe the operations that we aim to support on DataSpread, drawing from the operations we found in our user survey (takeaway 5). We consider the following read-only operations:
• Scrolling: This operation refers to the act of retrieving cells within a certain range of rows and columns. For instance, when a user scrolls to a specific position on the spreadsheet, we need to retrieve a rectangular range corresponding to the window that is visible to the user. Accessing an entire row or column, e.g., A:A, is a special case of a rectangular range where the column/row corresponding to the range is not bounded.
• Formula evaluation: Evaluating formulae can require accessing multiple individual cells (e.g., A1) within the spreadsheet, or ranges of cells (e.g., A1:D100).
Note that in both cases, the accesses correspond to rectangular regions of the spreadsheet. We consider the following update operations:
• Updating an existing cell: This operation corresponds to accessing a cell with a specific row and column number and changing its value. Along with cell updates, we are also required to reevaluate any formulae dependent on the cell.
• Inserting/Deleting row/column(s): This operation corresponds to inserting/deleting row/column(s) at a specific position on the spreadsheet, followed by shifting subsequent row/column(s) appropriately.
Note that, similar to the read-only operations, the update operations require updating cells corresponding to rectangular regions.
In the next section, we develop data models for representing the conceptual data model as described in this section, with an eye towards supporting the operations described above.

4. REPRESENTING SPREADSHEETS
We now address the problem of representing a spreadsheet within a relational database. For the purpose of this section and the next, we focus on representing one spreadsheet, but our techniques seamlessly carry over to the multiple-spreadsheet case; as we described earlier, we focus on the content of the spreadsheet as opposed to the formatting, as well as other spreadsheet metadata, like spreadsheet name(s), spreadsheet dimensions, and so on.
We describe the high-level problem of representation of spreadsheet data here; we will concretize this problem subsequently.

4.1 High-level Problem Description
The conceptual data model corresponds to a collection of cells, represented as C = {C1, C2, ..., Cm}; as described in the previous section, each cell Ci corresponds to a location (i.e., a specific row and column), and has some contents—either a value or a formula. Our goal is to represent and store the cells C comprising the conceptual data model via one of the physical data models P. Each T ∈ P corresponds to a collection of relational tables {T1, ..., Tp}. Each table Ti records the data in a certain portion of the spreadsheet, as we will see subsequently. Given a collection C, a physical data model T is said to be recoverable with respect to C if for each Ci ∈ C, ∃ Tj ∈ T such that Tj records the data in Ci, and ∀ k ≠ j, Tk does not record the data in Ci. Thus, our goal is to identify physical data models that are recoverable.
At the same time, we want to minimize the amount of storage required to record T within the database, i.e., we would like to minimize size(T) = Σ_{i=1}^{p} size(Ti). Moreover, we would like to minimize the time taken for accessing data using T, i.e., the access cost, which is the cost of accessing a rectangular range of cells for formulae (takeaway 4) or scrolling to specific locations (takeaway 5), which are both common operations. And we would like to minimize the time taken to perform updates, i.e., the update cost, which is the cost of updating individual cells or a range of cells, and the insertion and deletion of rows and columns.
Overall, starting from a collection of cells C, our goal is to identify a physical data model T such that: (a) T is recoverable with respect to C, and (b) T minimizes a combination of storage, access and update costs, among all T ∈ P.
We begin by considering the setting where the physical data model T has a single relational table, i.e., T = {T1}. We develop three ways of representing this table: we call them primitive data models; they are all drawn from prior work, and each works well for a specific structure of spreadsheet—this is the focus of Section 4.2. Then, we extend this to the setting where |T| > 1 by defining the notion of a hybrid data model with multiple tables, each of which uses one of the three primitive data models to represent a certain portion of the spreadsheet—this is the focus of Section 4.3. Given the high diversity of structure within spreadsheets and high skew (takeaway 2), having multiple primitive data models, and the ability to use multiple tables, gives us substantial power in representing spreadsheet data.

4.2 Primitive Data Models
Our primitive data models represent trivial solutions for spreadsheet representation with a single table. Before we describe these data models, we discuss a small wrinkle that affects all of them. To capture a cell's identity, i.e., its row and column number, we need to implicitly or explicitly record a row and column number with each cell. Say we use an attribute to capture the row number for a cell. Then, the insertion or deletion of rows requires cascading updates to the row number attribute for all subsequent rows. As it turns out, all of the data models we describe in this section suffer from performance issues arising from cascading updates, but the solution to deal with these issues is similar for all of them, and will be described in Section 5.
Also, note that the access and update cost of the various data models depends on whether the underlying database is a row store or a columnar store. For the rest of this section and the paper, we focus on a row store, such as PostgreSQL, which is what we use in practice, and is also more tailored for hybrid read-write settings.
We now describe the three primitive data models:
Row-Oriented Model (ROM). The row-oriented data model (ROM) is straightforward, and is akin to data models used in traditional relational databases. Let rmax and cmax represent the maximum row number and column number across all of the cells in C. Then, in the ROM model, we represent each row from row 1 to rmax as a separate tuple, with an attribute for each column Col1, ..., Colcmax, and an additional attribute for explicitly capturing the row identity, i.e., RowID. The schema for ROM is: ROM(RowID, Col1, ..., Colcmax)—we illustrate the ROM representation of Figure 4 in Figure 5: each entry is a pair corresponding to a value and a formula, if any. For dense spreadsheets that are tabular (takeaways 1 and 2), this data model can be quite efficient in storage and access, since it minimizes redundant information: each row number is recorded only once, independent of the number of columns. Overall, the ROM representation shines when entire rows are accessed at a time, as opposed to entire columns. It is also efficient for accessing a large range of cells at a time.
Column-Oriented Model (COM). The second representation is also straightforward, and is simply the transpose of the ROM representation. Often, we find that certain spreadsheets have many …
RowID | Col1        | ... | Col6
1     | ID, NULL    | ... | Total, NULL
2     | Alice, NULL | ... | 85, AVERAGE(B2:C2)+D2+E2
...   | ...         | ... | ...

Figure 5: ROM Data Model for Figure 4.
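To make the primitive layouts concrete, here is a rough sketch of how ROM and RCV tables might be declared on the PostgreSQL backend, expressed as DDL strings built in Java. The ROM schema follows the ROM(RowID, Col1, ..., Colcmax) description above; the RCV layout (one tuple holding a row number, column number, and value per filled cell) is inferred from the cost discussion later in this section. Table names, column types, and the helper class are illustrative, not DataSpread's actual catalog.

```java
/** Sketch of the ROM and RCV layouts as PostgreSQL DDL strings. */
public final class PrimitiveSchemas {
    /** ROM(RowID, Col1, ..., Colcmax): one tuple per spreadsheet row. */
    public static String romDdl(String table, int cmax) {
        StringBuilder ddl = new StringBuilder("CREATE TABLE " + table + " (rowid BIGINT PRIMARY KEY");
        for (int c = 1; c <= cmax; c++) {
            ddl.append(", col").append(c).append(" TEXT");   // each entry holds a value and/or formula
        }
        return ddl.append(")").toString();
    }

    /** RCV: one tuple per filled cell, keyed by (row number, column number). */
    public static String rcvDdl(String table) {
        return "CREATE TABLE " + table + " (rowid BIGINT, colid INT, value TEXT, "
             + "PRIMARY KEY (rowid, colid))";
    }
}
```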
Figure 9: (a) Counterexample (b) Weighted Representation

To see this, note that any vertical or horizontal cut that one would make at the start would cut through one of the four tables, making the decomposition impossible. Nevertheless, the hybrid data models obtained via recursive decomposition form a natural class of data models.
As it turns out, identifying the solution to Problem 1 is PTIME for the space of hybrid data models obtained via recursive decomposition. The algorithm involves dynamic programming. Informally, our algorithm makes the most optimal "cut" horizontally or vertically at every step, and proceeds recursively. We now describe the dynamic programming equations.
Consider a rectangular area formed from x1 to x2 as the top and bottom row numbers respectively, both inclusive, and from y1 to y2 as the left and right column numbers respectively, both inclusive, for some x1, x2, y1, y2. We represent the optimal cost by the function Opt(). Now, the optimal cost of representing this rectangular area, i.e., Opt((x1, y1), (x2, y2)), is the minimum of the following possibilities:
• If there is no filled cell in the rectangular area (x1, y1), (x2, y2), then we do not use any data model. Hence, we have

  Opt((x1, y1), (x2, y2)) = 0.    (2)

• Do not split, i.e., store as a ROM model (romCost()):

  romCost((x1, y1), (x2, y2)) = s1 + s2 · (r12 × c12) + s3 · c12 + s4 · r12,    (3)

  where the number of rows r12 = (x2 − x1 + 1), and the number of columns c12 = (y2 − y1 + 1).
• Perform a horizontal cut (CH):

  CH = min_{i ∈ {x1, ..., x2}} [ Opt((x1, y1), (i, y2)) + Opt((i + 1, y1), (x2, y2)) ].    (4)

• Perform a vertical cut (CV):

  CV = min_{j ∈ {y1, ..., y2}} [ Opt((x1, y1), (x2, j)) + Opt((x1, j + 1), (x2, y2)) ].    (5)

Therefore, when there are filled cells in the rectangle,

  Opt((x1, y1), (x2, y2)) = min { romCost((x1, y1), (x2, y2)), CH, CV },    (6)

else Opt((x1, y1), (x2, y2)) = 0.
The base case is when the rectangular area is of dimension 1 × 1. Here, we store the area as a ROM table if it is a filled cell. Hence, we have Opt((x1, y1), (x1, y1)) = c1 + c2 + c3 + c4 if filled, and 0 if not.
We have the following theorem:

THEOREM 2 (DYNAMIC PROGRAMMING OPTIMALITY). The optimal ROM-based hybrid data model based on recursive decomposition can be determined via dynamic programming.
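A minimal sketch, in Java, of the dynamic program in Equations 2–6, written as a memoized top-down recursion rather than a bottom-up table; the constants s1–s4 are the ROM storage constants of Equation 3, and the prefix-sum array for counting filled cells is only an implementation convenience, not something prescribed here.

```java
import java.util.HashMap;
import java.util.Map;

/** Memoized top-down version of the recursive-decomposition dynamic program. */
public class RecursiveDecomposition {
    private final double s1, s2, s3, s4;                 // ROM storage constants (Equation 3)
    private final int[][] prefix;                        // 2-D prefix sums over filled cells
    private final Map<Long, Double> memo = new HashMap<>();

    public RecursiveDecomposition(boolean[][] filled, double s1, double s2, double s3, double s4) {
        this.s1 = s1; this.s2 = s2; this.s3 = s3; this.s4 = s4;
        int n = filled.length, m = filled[0].length;
        prefix = new int[n + 1][m + 1];
        for (int r = 0; r < n; r++)
            for (int c = 0; c < m; c++)
                prefix[r + 1][c + 1] = prefix[r][c + 1] + prefix[r + 1][c] - prefix[r][c]
                        + (filled[r][c] ? 1 : 0);
    }

    /** Number of filled cells in the rectangle (x1, y1)-(x2, y2), all bounds inclusive. */
    int filledCells(int x1, int y1, int x2, int y2) {
        return prefix[x2 + 1][y2 + 1] - prefix[x1][y2 + 1] - prefix[x2 + 1][y1] + prefix[x1][y1];
    }

    /** Equation 3: cost of storing the whole rectangle as one ROM table. */
    double romCost(int x1, int y1, int x2, int y2) {
        int rows = x2 - x1 + 1, cols = y2 - y1 + 1;
        return s1 + s2 * rows * cols + s3 * cols + s4 * rows;
    }

    /** Equations 2 and 4-6: optimal cost of representing the rectangle. */
    public double opt(int x1, int y1, int x2, int y2) {
        if (filledCells(x1, y1, x2, y2) == 0) return 0;                    // Equation 2
        long key = (((long) x1 * 4096 + y1) * 4096 + x2) * 4096 + y2;      // assumes sides <= 4096
        Double cached = memo.get(key);
        if (cached != null) return cached;
        double best = romCost(x1, y1, x2, y2);                             // do not split
        for (int i = x1; i < x2; i++)                                      // horizontal cuts (Eq. 4)
            best = Math.min(best, opt(x1, y1, i, y2) + opt(i + 1, y1, x2, y2));
        for (int j = y1; j < y2; j++)                                      // vertical cuts (Eq. 5)
            best = Math.min(best, opt(x1, y1, x2, j) + opt(x1, j + 1, x2, y2));
        memo.put(key, best);                                               // Equation 6
        return best;
    }
}
```

Degenerate cuts at the rectangle boundary are skipped since they reduce to the do-not-split case, and the sketch returns only the optimal cost; recording the winning cut per rectangle would recover the actual decomposition.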
Time Complexity. Our dynamic programming algorithm runs in polynomial time with respect to the size of the spreadsheet. Let the length of the larger side of the minimum enclosing rectangle of the spreadsheet be n. Then, the number of candidate rectangles is O(n^4). For each rectangle, we have O(n) ways to perform the cut. Therefore, the running time of our algorithm is O(n^5). However, this number could be very large if the spreadsheet is massive—which is typical of the use cases we aim to tackle.
Weighted Representation. We now describe a simple optimization that helps us reduce the time complexity substantially, while preserving optimality for the cost model that we have been using so far. Notice that in many real spreadsheets, there are many rows and columns that are very similar to each other in structure, i.e., they have the same set of filled cells. We exploit this property to reduce the effective size n of the spreadsheet. Essentially, we collapse rows that have identical structure down to a single weighted row, and similarly collapse columns that have identical structure down to a single weighted column.
Consider Figure 9(b), which shows the weighted version of Figure 9(a). Here, we can collapse column B down into column A, which is now associated with weight 2; similarly, we can collapse row 2 into row 1, which is now associated with weight 2. In this manner, the effective area of the spreadsheet now becomes 5×5 as opposed to 7×9.
Now, we can apply the same dynamic programming algorithm to the weighted representation of the spreadsheet: in essence, we are avoiding making cuts "in-between" the weighted edges, thereby reducing the search space of hybrid data models. As it turns out, this does not sacrifice optimality, as the following theorem shows:

THEOREM 3 (WEIGHTED OPTIMALITY). The optimal hybrid data model obtained by recursive decomposition on the weighted spreadsheet is no worse than the optimal hybrid data model obtained by recursive decomposition on the original spreadsheet.

4.5 Greedy Decomposition Algorithms
Greedy Decomposition. To improve the running time even further, we propose a greedy heuristic that avoids the high complexity of the dynamic programming algorithm, but sacrifices somewhat on optimality. The greedy algorithm essentially repeatedly splits the spreadsheet area in a top-down manner, making a greedy, locally optimal decision, instead of systematically considering all alternatives, as in the dynamic programming algorithm. Thus, at each step, when operating on a rectangular spreadsheet area (x1, y1), (x2, y2), it identifies the operation that results in the lowest local cost. We have three alternatives: either we do not split, in which case the cost is from Equation 3, i.e., romCost((x1, y1), (x2, y2)); or we split horizontally (vertically), in which case the cost is the same as CH (CV) from Equation 4 (Equation 5), but with Opt() replaced with romCost(), since we are making a locally optimal decision. The smallest-cost decision is followed, and then we continue recursively decomposing using the same rule on the new areas, if any.
Complexity. This algorithm has a complexity of O(n^2), since each step takes O(n) and there are O(n) steps. While the greedy algorithm is sub-optimal, the local decision that it makes is optimal in the worst case, i.e., with no further information about the structure of the areas that arise as a result of the decomposition, this is the best decision to make at each step.
Aggressive Greedy Decomposition. The greedy algorithm described above stops exploration as soon as it is unable to find a cut that reduces the cost locally, based on a worst-case assumption. This may cause the algorithm to halt prematurely, even though exploring further decompositions may have helped reduce the cost. An alternative to the greedy algorithm described above is one where we don't stop subdividing, i.e., we always choose to use the best horizontal or vertical cut, and then subdivide the area based on that cut in a depth-first manner. We keep doing this until we end up with rectangular areas where all of the cells are filled in with values. (At this point, it provably doesn't benefit us to subdivide further.) After this point, we backtrack up the tree of decompositions, bottom-up, assembling the best solution that was discovered, similar to the dynamic programming approach, considering whether to not split, or perform a horizontal or vertical split.
Complexity. Like the greedy approach, the aggressive greedy approach has complexity O(n^2), but takes longer since it considers a larger space of data models than the greedy approach.
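The following sketch, continuing the dynamic-programming class above (and reusing its romCost() and filledCells() helpers), shows one step of the greedy rule from this subsection: compare not splitting against the best horizontal or vertical cut, costing each side with romCost() only, then recurse on the winning cut. The aggressive variant would instead always follow the best cut and reconcile costs while backtracking; the method names and the use of int[] for rectangles are our own choices.

```java
// Continues the RecursiveDecomposition sketch (same class; needs java.util.List).
void greedyDecompose(int x1, int y1, int x2, int y2, java.util.List<int[]> out) {
    if (filledCells(x1, y1, x2, y2) == 0) return;             // empty areas are not stored
    double noSplit = romCost(x1, y1, x2, y2);
    double bestCut = Double.MAX_VALUE;
    int bestI = -1, bestJ = -1;
    for (int i = x1; i < x2; i++) {                            // candidate horizontal cuts
        double c = localCost(x1, y1, i, y2) + localCost(i + 1, y1, x2, y2);
        if (c < bestCut) { bestCut = c; bestI = i; bestJ = -1; }
    }
    for (int j = y1; j < y2; j++) {                            // candidate vertical cuts
        double c = localCost(x1, y1, x2, j) + localCost(x1, j + 1, x2, y2);
        if (c < bestCut) { bestCut = c; bestJ = j; bestI = -1; }
    }
    if (bestCut >= noSplit) {                                  // no cut is locally cheaper: keep one ROM table
        out.add(new int[]{x1, y1, x2, y2});
    } else if (bestI >= 0) {                                   // follow the best horizontal cut
        greedyDecompose(x1, y1, bestI, y2, out);
        greedyDecompose(bestI + 1, y1, x2, y2, out);
    } else {                                                   // follow the best vertical cut
        greedyDecompose(x1, y1, x2, bestJ, out);
        greedyDecompose(x1, bestJ + 1, x2, y2, out);
    }
}

// Local cost of a candidate side: nothing if empty, otherwise a single ROM table.
double localCost(int x1, int y1, int x2, int y2) {
    return filledCells(x1, y1, x2, y2) == 0 ? 0 : romCost(x1, y1, x2, y2);
}
```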
4.6 Extensions
In this section, we describe extensions to the cost model and algorithms to handle COM and RCV tables in addition to ROM. Other extensions can be found in Appendix B, including incorporating access cost along with storage, including the costs of indexes, and dealing with situations where database systems impose limitations on the number of columns in a relation. We will describe these extensions to the cost model, and then describe the changes to the basic dynamic programming algorithm; modifications to the greedy and aggressive greedy decomposition algorithms are straightforward.
RCV and COM. The cost model can be extended in a straightforward manner to allow each rectangular area to be a ROM, COM, or an RCV table. First, note that it doesn't benefit us to have multiple RCV tables—we can simply combine all of these tables into one, and assume that we're paying a fixed up-front cost to have one RCV table. Then, the cost for a table Ti, if it is stored as a COM table, is:

  comCost(Ti) = s1 + s2 · (ri × ci) + s4 · ci + s3 · ri.

This equation is the same as Equation 1, but with the last two constants transposed. And the cost for a table Ti, if it is stored as an RCV table, is simply:

  rcvCost(Ti) = s5 × #cells,

where s5 is the cost incurred per tuple. Once we have this cost model set up, it is straightforward to apply dynamic programming once again to identify the optimal hybrid data model encompassing ROM, COM, and RCV. The only step that changes in the dynamic programming equations is Equation 3, where we have to consider the COM and RCV alternatives in addition to ROM. We have the following theorem.

THEOREM 4 (OPTIMALITY WITH ROM, COM, AND RCV). The optimal ROM, COM, and RCV-based hybrid data model based on recursive decomposition can be determined via dynamic programming.
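As an illustration of the extended cost model, the following sketch computes the ROM, COM, and RCV costs for a rectangular area and takes the cheapest. The constants use the PostgreSQL storage measurements reported later in Section 7 (8 KB per table, 1 bit per cell, 40 bytes per column, 50 bytes per row, 52 bytes per RCV tuple) and are illustrative; the fixed up-front cost of the single shared RCV table is ignored here.

```java
/** Sketch of the per-rectangle cost model of Section 4.6. */
final class TableCosts {
    // Illustrative constants, in bytes, following the Section 7 measurements.
    static final double S1 = 8 * 1024, S2 = 1.0 / 8, S3 = 40, S4 = 50, S5 = 52;

    static double romCost(int rows, int cols) { return S1 + S2 * rows * cols + S3 * cols + S4 * rows; }

    // COM is the transpose of ROM, so the per-row and per-column constants swap roles.
    static double comCost(int rows, int cols) { return S1 + S2 * rows * cols + S4 * cols + S3 * rows; }

    // RCV pays a per-tuple cost for every filled cell (one (row, column, value) tuple each).
    static double rcvCost(int filledCells) { return S5 * filledCells; }

    /** Cheapest of the three alternatives for one rectangular area. */
    static double bestCost(int rows, int cols, int filledCells) {
        return Math.min(rcvCost(filledCells), Math.min(romCost(rows, cols), comCost(rows, cols)));
    }
}
```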
5. POSITIONAL MAPPING
As discussed in Section 4, for all of the data models, storing the row and/or column numbers may result in substantial overheads during insert and delete operations due to cascading updates to all subsequent rows or columns—this could make working with large spreadsheets infeasible. In this section, we develop solutions for this problem by introducing the notion of positional mapping to eliminate the overhead of cascading updates. For our discussion we focus on row numbers; the techniques can be analogously applied to columns. To keep our discussion general, we use the term position to represent the ordinal number, i.e., either the row or column number, that captures the location of a cell along a specific dimension. In addition, row and column numbers can be dealt with independently.
Problem. We require a data structure to efficiently support positional operations without the overhead of cascading updates. In particular, we want a data structure on items (here, tuples) that can capture a specific ordering among the items and efficiently support the following operations: (a) fetch items based on a position, (b) insert items at a position, and (c) delete items from a position. The insert and delete operations require updating the positions of the subsequent items, e.g., inserting an item at the nth position requires us to first increment by one the positions of all the items that have a position greater than or equal to n, and then add the new item at the nth position. Due to the interactive nature of DataSpread, our goal is to perform these operations within a few hundred milliseconds.
Row Number as-is. We motivate the problem by demonstrating the impact of cascading updates in terms of time complexity. Storing the row numbers as-is with every tuple makes the fetch operation efficient at the expense of making the insert and delete operations inefficient. With a traditional index, e.g., a B-Tree index, the complexity of accessing an arbitrary row identified by a row number is O(log N). On the other hand, insert and delete operations require updating the row numbers of the subsequent tuples. These updates also need to be propagated in the index, and therefore this results in a worst-case complexity of O(N log N). To illustrate the impact of these complexities in practice, in Table 4(a), we display the performance of storing the row numbers as-is for two operations—fetch and insert—on a spreadsheet containing 10^6 cells. We note that, irrespective of the data model used, the performance of inserts is beyond our acceptable threshold, whereas that of the fetch operation is acceptable.

Table 4: Performance (in ms) of (a) storing row numbers as-is and (b) monotonic positional mapping.

              (a) Row Number as-is    (b) Positional Mapping
Operation     RCV       ROM           RCV       ROM
Insert        87,821    1,531         9.6       1.2
Fetch         312       244           30,621    273

Intuition. To improve the performance of inserts and deletes for ordered items, we introduce the idea of positional mapping. At its core, the idea is remarkably simple: we do not store positions but instead store what we call positional mapping keys. These positional mapping keys p are proxies that have a one-to-one mapping with the positions r, i.e., p ↔ r. Formally, positional mapping M is a bijective function that maintains the relationship between the row numbers and positional mapping keys, i.e., M(r) → p.

Figure 10: (a) Monotonic Positional Mapping (b) Index for Hierarchical Positional Mapping

Monotonic Positional Mapping. One approach towards positional mapping is to have positional mapping keys monotonically increase with position, i.e., for two arbitrary positions ri and rj, if ri > rj then M(ri) > M(rj). For example, consider the ordered list of items shown in Figure 10(a). Here, even though the positional mapping keys do not correspond to the row number, and even though there can be arbitrary differences between consecutive positional mapping keys, we can fetch the nth record by scanning the positional mapping keys in increasing order while maintaining a running counter to skip n−1 records.
The gaps between consecutive positional mapping keys reduce or even eliminate the renumbering during insert and delete operations.
… case), and we have a traditional B+tree index on this key, then the complexity of this operation is O(log N). Similarly, the complexity of inserting an item, if we know the positional mapping key—determined based on the positional mapping keys of neighboring items—is O(log N), which is the effort spent to update the underlying index.
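A minimal sketch of monotonic positional mapping as described above: rows are keyed by monotonically increasing keys that are deliberately left with gaps, so an insert normally only needs a fresh key between its neighbours (no renumbering), while a purely positional fetch has to scan keys in order, which is the trade-off visible in Table 4(b). The class, its methods, and the gap constant are illustrative, not DataSpread's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: gap-based monotonic positional mapping keys for an ordered list of rows. */
public class MonotonicPositionalMapping<T> {
    private static final long GAP = 1L << 20;              // initial spacing between keys
    private final TreeMap<Long, T> rows = new TreeMap<>(); // positional mapping key -> row payload

    /** Append a row at the end, leaving a gap after the current largest key. */
    public void append(T row) {
        long key = rows.isEmpty() ? GAP : rows.lastKey() + GAP;
        rows.put(key, row);
    }

    /** Fetch the row at 1-based position n by scanning keys in increasing order: O(N). */
    public T fetch(int n) {
        int seen = 0;
        for (Map.Entry<Long, T> e : rows.entrySet()) {
            if (++seen == n) return e.getValue();
        }
        throw new IndexOutOfBoundsException("position " + n);
    }

    /** Insert a row so that it becomes the n-th row; renumbering is needed only if no gap is left. */
    public void insertAt(int n, T row) {
        Long prev = null, next = null;
        int seen = 0;
        for (Long k : rows.keySet()) {
            if (++seen == n) { next = k; break; }
            prev = k;
        }
        long lo = (prev == null) ? 0 : prev;
        long hi = (next == null) ? lo + 2 * GAP : next;
        if (hi - lo < 2) throw new IllegalStateException("gap exhausted: renumber keys");
        rows.put(lo + (hi - lo) / 2, row);                  // midpoint key between the neighbours
    }
}
```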
Relational Operations in Spreadsheet. Since DataSpread is built on top of a traditional relational database, it can leverage the …

7. EXPERIMENTAL EVALUATION
In this section, we present an evaluation of DataSpread. Our high-level goals are to evaluate the feasibility of DataSpread to work with large spreadsheets with billions of cells; in addition, we attempt to understand the impact of the hybrid data models, and the impact of the positional mapping schemes. Recent work has identified 500ms as a yardstick of interactivity [29], and we aim to verify if DataSpread can actually meet that yardstick.

7.1 Experimental Setup
Environment. Our data models and positional mapping techniques were implemented on top of a PostgreSQL (version 9.6) database. The database was configured with default parameters. We ran all of our experiments on a workstation with the following configuration: Processor: Intel Core i7-4790K 4.0 GHz, RAM: 16 GB, Operating System: Windows 10. Our test scripts are single-threaded applications developed in Java. While we have also developed a full-fledged web-based front-end application (see Figure 4), our test scripts are independent of this front-end, so that we can isolate the back-end performance implications. We ensured fairness by clearing the appropriate cache(s) before every run.
Datasets. We evaluate our algorithms on a variety of real and synthetic datasets. Our real datasets are the ones listed in Table 1: Internet, ClueWeb09, Enron, and Academic. The first three have over 10,000 sheets each, while the last one has about 700 sheets. To test scalability, our real-world datasets are insufficient, because they are limited in scale by what current spreadsheet tools can support. Therefore, we constructed additional large synthetic spreadsheet datasets. The spreadsheets in these datasets each have between 10–100 columns, with the number of rows varying from 10^3 to 10^7, and a density between 0–1; this last quantity indicates the probability that a given cell within the spreadsheet area is filled in. Our largest synthetic dataset has a billion non-empty cells, enabling us to explicitly verify the premise of the title of this work.
We identify several goals for our experimental evaluation:
Goal 1: Impact of Hybrid Data Models on Real Datasets. We evaluate the hybrid data models selected by our algorithms against the primitive data models, when the cost model is optimized for storage. The algorithms evaluated include: ROM, COM, RCV (the primitive data models, using a single table to represent a sheet), DP (the dynamic programming algorithm from Section 4.4), and Greedy and Agg (the greedy and aggressive-greedy algorithms from Section 4.5). We evaluate these data models on both storage as well as formulae access cost, based on the formulae embedded within the spreadsheets. In addition, we evaluate the running time of the hybrid optimization algorithms DP, Greedy, and Agg.
Goal 2: Scalability on Synthetic Datasets. Since our real datasets aren't very large, we turn to synthetic datasets for testing the scalability of DataSpread. We focus on the primitive data models, i.e., ROM and RCV, coupled with positional mapping schemes, and evaluate the performance of select, update, and insert/delete on these data models on varying the number of rows, the number of columns, and the density of the dataset.
Goal 3: Impact of Positional Mapping Schemes. We evaluate the impact of our positional mapping schemes in aiding positional access on the spreadsheet. We focus on the Row-number-as-is, Monotonic, and Hierarchical positional mapping schemes applied on the ROM primitive model, and evaluate the performance of fetch, insert, and delete operations on varying the number of rows.

Figure 13: Hybrid optimization algorithms: Running time.

7.2 Impact of Hybrid Data Models
Takeaways: Hybrid data models provide substantial benefits over primitive data models, with up to 20% reductions in storage, and up to 50% reduction in formula access or evaluation time on PostgreSQL on real spreadsheet datasets, compared to the best primitive data model. While DP has better performance on storage than Greedy and Agg, it suffers from high running time; Agg is able to bridge the gap between Greedy and DP, while taking only marginally more running time than Greedy. Lastly, if we were to design a database storage engine from scratch, the hybrid data models would provide up to 50% reductions in storage compared to the best primitive data model.
The goal of this section is to evaluate our data models—both our primitive and hybrid data models—on real datasets. For each sheet within each dataset, we run the dynamic programming algorithm (denoted DP), the greedy algorithm (denoted Greedy), and the aggressive greedy algorithm (denoted Agg), which help us identify effective hybrid data models. We compare the resulting data models against the primitive data models: ROM, COM and RCV, where the entire spreadsheet is stored in a single table.
Storage Evaluation on PostgreSQL. We begin with an evaluation of storage for different data models on PostgreSQL. The costs for storage on PostgreSQL as measured by us are as follows: s1 is 8 KB, s2 is 1 bit, s3 is 40 bytes, s4 is 50 bytes, and s5 is 52 bytes. We plot the results in Figure 12(a): here, we depict the average normalized storage across sheets: for the Internet, ClueWeb09, and Enron datasets, we found RCV to have the worst performance, and hence normalized it to a cost of 100, and scaled the others accordingly; for the Academic dataset, we found COM to have the worst performance, and hence normalized it to a cost of 100, and scaled the others accordingly. For the first three datasets, recall that these datasets are primarily used for data sharing, and as a result are quite dense. As a result, the ROM and COM data models do well, using about 40% of the storage of RCV. At the same time, DP, Greedy and Agg perform roughly similarly, and better than the primitive data models, providing an additional reduction of 15–20%. On the other hand, for the last dataset, which is primarily used for computation as opposed to sharing and is very sparse, RCV does better than ROM and COM, while DP, Greedy, and Agg once again provide additional benefits.
Storage Evaluation on an Ideal Database. Note that the reason why RCV does so poorly for the first three datasets is that PostgreSQL imposes a high overhead per tuple, of 50 bytes, considerably larger than the amount of storage required to store each cell. So, to explore this further, we investigated the scenario where we had the ability to redesign our database storage engine from scratch. We consider a theoretical "ideal" cost model, where additional overheads are minimized.
For this cost model, the cost of a ROM or COM table is equal to the number of cells, plus the length and breadth of the table (to store the data, the schema, as well as positional identifiers), while the cost of an RCV row is simply 3 units (to store the data, as well as the row and column number). We plot the results in Figure 12(b) in log scale for each of the datasets—we exclude COM from this chart since it has the same performance as ROM. Here, we find that ROM has the worst cost across most of the datasets, since it no longer leverages benefits from minimizing the number of tuples. (For Internet, ROM and RCV are similar, but RCV is slightly worse.) As before, we normalize the cost of the ROM model to 100 for each sheet, and scale the others accordingly, followed by taking an average across all sheets per dataset. As an example, we find that for the ClueWeb09 corpus, RCV, DP, Greedy and Agg have normalized costs of about 36, 14, 18, and 14 respectively—with the hybrid data models more than halving the cost of RCV, and getting 1/7th the cost of ROM. Furthermore, in this ideal cost model, DP provides additional benefits relative to Greedy, and Agg ends up bringing us close to or equal to DP performance.

Figure 12: (a) Storage Comparison for PostgreSQL (b) Storage Comparison on an Ideal Database

Running Time of Hybrid Optimization Algorithm. Our next question is how long our hybrid data model optimization algorithms for DP, Greedy, and Agg take on real datasets. In Figure 13, we depict the average running time of these algorithms on the four real datasets. The results for all datasets are similar—as an example, for Enron, DP took 6.3s on average, Greedy took 45ms (a 140× reduction), while Agg took 345ms (a 20× reduction). Thus DP has the highest running time for all datasets, since it explores the entire space of models that can be obtained by recursive partitioning. Between Greedy and Agg, Greedy turns out to take less time. Note that these observations are consistent with our complexity analyses from Section 4.5. That said, Agg allows us to trade off a little bit more running time for improved performance on storage (as we saw earlier). We note that for the cases where the spreadsheets were large, we terminated DP after about 10 minutes, since we want our optimization to be relatively fast. (Note that using a similar criterion for termination, Agg and Greedy did not have to be terminated for any of the real datasets.) To be fair across all the algorithms, we excluded all of these spreadsheets from this chart—if we had included them, the difference between DP and the other algorithms would be even more stark.

Figure 14: Average access time for formulae.

Formulae Access Evaluation on PostgreSQL. Next, we wanted to evaluate whether our hybrid data models, optimized only for storage, have any impact on the access cost for formulae within the real datasets. Our hope is that the formulae embedded within spreadsheets end up focusing on "tightly coupled" tabular areas, which our hybrid data models are able to capture and store in separate tables. Among the hybrid algorithms we report Agg, since it provides the best trade-off between running time and storage costs. Given a sheet in a dataset, for each data model, we measured the time taken to evaluate the formulae in that sheet, and averaged this time across all sheets and all formulae. We plot the results for the different datasets in Figure 14 in log scale, in ms. As a concrete example, on the Internet dataset, ROM has a formula access time of 0.23, RCV has 3.17, while Agg has 0.13. Thus, Agg provides a substantial reduction of 96% over RCV and 45% over ROM—even though Agg was optimized for storage and not for formula access. This validates our design of hybrid data models to store spreadsheet data. Note that while the performance numbers for the real spreadsheet datasets are small for all data models (due to the size limitations in present spreadsheet tools), when scaling up to large datasets, and formulae that operate on these large datasets, these numbers will increase in a proportional manner, at which point it is even more important to opt for hybrid data models.

7.3 Scalability of Data Models
Takeaway: Our primitive data models, augmented with positional mapping, provide interactive (<500ms) response times on spreadsheet datasets ranging up to 1 billion cells for select, insert, and update operations.
Since our real datasets did not have any spreadsheets that are extremely large, we now evaluate the scalability of the DataSpread data models in supporting very large synthetic spreadsheets. We focus on the two primitive data models, i.e., ROM and RCV, with the spreadsheet being represented as a single table in these data models. Since we use synthetic datasets where cells are "filled in" with a certain probability, we did not involve hybrid data models, since they would (in this artificial context) typically end up preferring the ROM data model. These primitive data models are augmented with hierarchical positional mapping. We consider the performance on varying several parameters of these datasets: the density (i.e., the number of cells that are filled in), the number of rows, and the number of columns. The default values of these parameters are 1, 10^7, and 100 respectively. We repeat each operation 500 times and report the averages.
In Figure 15, we depict the charts corresponding to the average time to perform a random select operation on a region of 1000 rows and 20 columns. This is, for example, the operation that would correspond to a user scrolling to a certain position on our spreadsheet. As can be seen in Figure 15(a), ROM starts dominating RCV beyond a certain density, at which point it makes more sense to store the data as tuples that span rows instead of incurring the penalty of creating a tuple for every cell. Nevertheless, the best of these two models takes less than 150ms across sheets of varying densities. In Figure 15(b)(c), since the spreadsheet is very dense (density = 1), ROM takes less time than RCV. Overall, in all cases, even on spreadsheets with 100 columns and 10^7 rows and a density of 1, the average time to select a region is well within 500ms.
We report briefly on the update and insert performance—detailed results and charts can be found in the Appendix. Overall, for both RCV and ROM, for inserting a row, the time is well below 500ms for all of the charts; for updates of a large region, while ROM is still highly interactive, RCV ends up taking longer since thousands of queries need to be issued to the database. In practice, users won't update such a large region at a time, and we can batch these queries. We discuss this further in the appendix.
sheets end up focusing on “tightly coupled” tabular areas, which results and charts can be found in the Appendix. Overall, for both
our hybrid data models are able to capture and store in separate RCV and ROM, for inserting a row, the time is well below 500ms
11
300 300 300
RCV RCV RCV
250 ROM 250 ROM 250 ROM
Time (ms)
Time (ms)
Time (ms)
200 200 200
50 50 50
0
0.2 0.4 0.6 0.8 1.0 10 30 50 70 90 100 104 105 106 107
Sheet Density #Columns #Rows
Figure 15: Select performance vs — (a) Sheet Density (b) Column Count (c) Row Count
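For intuition on what the select operation issues against the backend, here is a sketch of the region query under the two primitive layouts, using the illustrative ROM/RCV schemas from the earlier sketch and ignoring positional mapping (i.e., assuming row numbers are stored as-is). ROM touches one tuple per row in the window, whereas RCV touches one tuple per filled cell, which is why RCV falls behind on dense sheets.

```java
/** Illustrative region queries for a 1000-row x 20-column scroll window (rows 5000-5999). */
public final class RegionQueries {
    // ROM: one tuple per spreadsheet row; the window is a simple range scan on rowid.
    static final String ROM_SELECT =
        "SELECT rowid, col1, col2, /* ... */ col20 " +
        "FROM sheet_rom WHERE rowid BETWEEN 5000 AND 5999";

    // RCV: one tuple per filled cell; the same window needs predicates on both dimensions
    // and returns up to 1000 x 20 tuples.
    static final String RCV_SELECT =
        "SELECT rowid, colid, value " +
        "FROM sheet_rcv WHERE rowid BETWEEN 5000 AND 5999 AND colid BETWEEN 1 AND 20";
}
```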
for all of the charts; for updates of a large region, while ROM is 2b. One way export of operations from spreadsheets to databases.
still highly interactive, RCV ends up taking longer since 1000s of There has been some work on exporting spreadsheet operations
queries need to be issued to the database. In practice, users won’t into database systems, such as the work from Oracle [47, 48] as
update such a large region at a time, and we can batch these queries. well as startups 1010Data [40] and AirTable [41], to improve the
We discuss this further in the appendix. performance of spreadsheets. However, the database itself has no
awareness of the existence of the spreadsheet, making the integra-
7.4 Evaluation of Positional Mapping tion superficial. In particular, positional and ordering aspects are
not captured, and user operations on the front-end, e.g., inserts,
Takeaway: Hierarchical positional mapping retains the rapid
deletes, and adding formulae, are not supported.
fetch benefits of row-number-as-is, while also providing the
rapid insert and update benefits of monotonic positional map- 2c. Using a spreadsheet to mimic a database. There has been
ping. Overall, hierarchical positional mapping is able to per- some work on using a spreadsheet as an interface for posing tradi-
form positional operations within a few milliseconds, while tional database queries. For example, Tyszkiewicz [38] describes
the other positional mapping schemes scale poorly, taking sec- how to simulate database operations in a spreadsheet. However,
onds on large datasets for certain operations. this approach loses the scalability benefits of relational databases.
Bakke et al. [9, 8, 7] support joins by depicting relations using a
We report detailed results and charts for this evaluation in Ap- nested relational model. Liu et al. [28] use spreadsheet operations
pendix D. to specify single-block SQL queries; this effort is essentially a re-
8. RELATED WORK

Our work draws on related work from multiple areas; we review papers in each of the areas, and describe how they relate to DataSpread. We discuss 1) efforts that enhance the usability of databases, 2) efforts that attempt to merge the functionality of the spreadsheet and database paradigms, but without a holistic integration, and 3) array-based database management systems. We described our vision for DataSpread in an earlier demo paper [10].

1. Making databases more usable. There has been a lot of recent work on making database interfaces more user-friendly [4, 23]. This includes recent work on gestural query and scrolling interfaces [22, 31, 33, 32, 36], visual query builders [6, 16], query sharing and recommendation tools [24, 18, 17, 25], schema-free databases [34], schema summarization [49], and visual analytics tools [14, 30, 37, 21]. However, none of these tools can replace spreadsheet software, which has the ability to analyze, view, and modify data via a direct manipulation interface [35] and has a large user base.

2a. One-way import of data from databases to spreadsheets. There are various mechanisms for importing data from databases into spreadsheets and then analyzing this data within the spreadsheet. This approach is followed by Excel's Power BI tools, including Power Pivot [45], with Power Query [46] for importing data from databases and the web or deriving additional columns, and Power View [46] for creating presentations. Zoho [42] and ExcelDB [44] (on Excel), and Blockspring [43] (on Google Sheets [39]) enable imports from a variety of sources, including databases and the web. Typically, the import is one-shot, with the data residing in the spreadsheet from that point on, negating the scalability benefits derived from the database. Indeed, Excel 2016 specifies a limit of 1M records that can be analyzed once imported, illustrating that the scalability benefits are lost; Zoho specifies a limit of 0.5M records. Furthermore, the connection to the base data is lost: any modifications made at either end are not propagated.

2b. One-way export of operations from spreadsheets to databases. There has been some work on exporting spreadsheet operations into database systems, such as the work from Oracle [47, 48] as well as the startups 1010Data [40] and AirTable [41], to improve the performance of spreadsheets. However, the database itself has no awareness of the existence of the spreadsheet, making the integration superficial. In particular, positional and ordering aspects are not captured, and user operations on the front-end, e.g., inserts, deletes, and adding formulae, are not supported.

2c. Using a spreadsheet to mimic a database. There has been some work on using a spreadsheet as an interface for posing traditional database queries. For example, Tyszkiewicz [38] describes how to simulate database operations in a spreadsheet. However, this approach loses the scalability benefits of relational databases. Bakke et al. [9, 8, 7] support joins by depicting relations using a nested relational model. Liu et al. [28] use spreadsheet operations to specify single-block SQL queries; this effort is essentially a replacement for visual query builders. Recently, Google Sheets [39] has provided the ability to use single-table SQL on its frontend, without the scalability benefits of database integration. Excel, with its Power Pivot and Power Query [46] functionality, has made moves towards supporting SQL in the front-end, with the same limitations. Like this line of work, we support SQL queries on the spreadsheet frontend, but our focus is on representing and operating on spreadsheet data within a database.

3. Array database systems. While there has been work on array-based databases, most of these systems do not support edits: for instance, SciDB [13] supports an append-only, no-overwrite data model.

9. CONCLUSIONS

We presented DataSpread, a data exploration tool that holistically unifies spreadsheets and databases with the goal of working with large datasets. We proposed three primitive data models for representing spreadsheet data within a database, along with algorithms for identifying the optimal hybrid data model arising from a recursive decomposition into one or more primitive data models. Our hybrid data models provide substantial reductions in storage (up to 20–50%) and formula evaluation (up to 50%) over the primitive data models. Our primitive and hybrid data models, coupled with positional mapping schemes, make working with very large spreadsheets (over a billion cells) interactive.

10. REFERENCES
[1] Google Sheets. https://www.google.com/sheets/about/.
[2] Microsoft Excel. http://products.office.com/en-us/excel.
[3] ZK Spreadsheet. https://www.zkoss.org/product/zkspreadsheet.
[4] S. Abiteboul, R. Agrawal, P. Bernstein, M. Carey, S. Ceri, B. Croft, D. DeWitt, M. Franklin, H. G. Molina, D. Gawlick, J. Gray, L. Haas, A. Halevy, J. Hellerstein, Y. Ioannidis, M. Kersten, M. Pazzani, M. Lesk, D. Maier, J. Naughton, H. Schek, T. Sellis, A. Silberschatz, M. Stonebraker, R. Snodgrass, J. Ullman, G. Weikum, J. Widom, and S. Zdonik. The Lowell database research self-assessment. Commun. ACM, 48(5):111–118, May 2005.
[5] A. Abouzied, J. Hellerstein, and A. Silberschatz. DataPlay: Interactive tweaking and example-driven correction of graphical database queries. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology, UIST '12, pages 207–218, New York, NY, USA, 2012. ACM.
[6] A. Abouzied, J. Hellerstein, and A. Silberschatz. DataPlay: Interactive tweaking and example-driven correction of graphical database queries. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology, pages 207–218. ACM, 2012.
[7] E. Bakke and E. Benson. The Schema-Independent Database UI: A Proposed Holy Grail and Some Suggestions. In CIDR, pages 219–222. www.cidrdb.org, 2011.
[8] E. Bakke, D. Karger, and R. Miller. A spreadsheet-based user interface for managing plural relationships in structured data. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2541–2550. ACM, 2011.
[9] E. Bakke and D. R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data, pages 1377–1392. ACM, 2016.
[10] M. Bendre, B. Sun, D. Zhang, X. Zhou, K. C.-C. Chang, and A. Parameswaran. DataSpread: Unifying databases and spreadsheets. Proc. VLDB Endow., 8(12):2000–2003, Aug. 2015.
[11] M. Bendre, B. Sun, X. Zhou, D. Zhang, K. Chang, and A. Parameswaran. DataSpread: Unifying databases and spreadsheets. In VLDB, volume 8, 2015.
[12] D. Bricklin and B. Frankston. VisiCalc 1979. Creative Computing, 10(11):122, 1984.
[13] P. G. Brown. Overview of SciDB: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 963–968, New York, NY, USA, 2010. ACM.
[14] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo. VisTrails: Visualization meets data management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 745–747. ACM, 2006.
[15] J. Callan, M. Hoy, C. Yoo, and L. Zhao. ClueWeb09 data set, 2009.
[16] T. Catarci, M. F. Costabile, S. Levialdi, and C. Batini. Visual query systems for databases: A survey. Journal of Visual Languages & Computing, 8(2):215–260, 1997.
[17] U. Cetintemel, M. Cherniack, J. DeBrabant, Y. Diao, K. Dimitriadou, A. Kalinin, O. Papaemmanouil, and S. B. Zdonik. Query Steering for Interactive Data Exploration. In CIDR, 2013.
[18] G. Chatzopoulou, M. Eirinaki, and N. Polyzotis. Query recommendations for interactive database exploration. In Scientific and Statistical Database Management, pages 3–18. Springer, 2009.
[19] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[20] D. Flax. GestureDB: An accessible & touch-guided iPad app for MySQL database browsing, 2016.
[21] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon. Google Fusion Tables: Web-centered data management and collaboration. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 1061–1066. ACM, 2010.
[22] S. Idreos and E. Liarou. dbTouch: Analytics at your Fingertips. In CIDR, 2013.
[23] H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 13–24. ACM, 2007.
[24] N. Khoussainova, M. Balazinska, W. Gatterbauer, Y. Kwon, and D. Suciu. A Case for A Collaborative Query Management System. In CIDR. www.cidrdb.org, 2009.
[25] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. Proceedings of the VLDB Endowment, 4(1):22–33, 2010.
[26] B. Klimt and Y. Yang. Introducing the Enron corpus. In CEAS, 2004.
[27] A. Lingas, R. Y. Pinter, R. L. Rivest, and A. Shamir. Minimum edge length partitioning of rectilinear polygons. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, pages 53–63, 1982.
[28] B. Liu and H. V. Jagadish. A Spreadsheet Algebra for a Direct Data Manipulation Query Interface. pages 417–428. IEEE, Mar. 2009.
[29] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Trans. Vis. Comput. Graph., 20(12):2122–2131, 2014.
[30] J. Mackinlay, P. Hanrahan, and C. Stolte. Show Me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics, 13(6):1137–1144, 2007.
[31] A. Nandi, L. Jiang, and M. Mandel. Gestural Query Specification. Proceedings of the VLDB Endowment, 7(4), 2013.
[32] A. Nandi. Querying Without Keyboards. In CIDR, 2013.
[33] A. Nandi and H. V. Jagadish. Guided interaction: Rethinking the query-result paradigm. Proceedings of the VLDB Endowment, 4(12):1466–1469, 2011.
[34] L. Qian, K. LeFevre, and H. V. Jagadish. CRIUS: User-friendly database design. Proceedings of the VLDB Endowment, 4(2):81–92, 2010.
[35] B. Shneiderman. Direct Manipulation: A Step Beyond Programming Languages. IEEE Computer, 16(8):57–69, 1983.
[36] M. Singh, A. Nandi, and H. V. Jagadish. Skimmer: Rapid scrolling of relational query results. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 181–192. ACM, 2012.
[37] C. Stolte, D. Tang, and P. Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics, 8(1):52–65, 2002.
[38] J. Tyszkiewicz. Spreadsheet as a relational database engine. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 195–206. ACM, 2010.
[39] http://google.com/sheets. Google Sheets (retrieved March 10, 2015).
[40] https://www.1010data.com/. 1010 Data (retrieved March 10, 2015).
[41] https://www.airtable.com/. Airtable (retrieved March 10, 2015).
[42] https://www.zoho.com/. Zoho Reports (retrieved March 10, 2015).
[43] http://www.blockspring.com/. Blockspring (retrieved March 10, 2015).
[44] http://www.excel-db.net/. Excel-DB (retrieved March 10, 2015).
[45] http://www.microsoft.com/en-us/download/details.aspx?id=43348. Microsoft SQL Server Power Pivot (retrieved March 10, 2015).
[46] C. Webb. Power Query for Power BI and Excel. Apress, 2014.
[47] A. Witkowski, S. Bellamkonda, T. Bozkaya, N. Folkert, A. Gupta, J. Haydu, L. Sheng, and S. Subramanian. Advanced SQL modeling in RDBMS. ACM Transactions on Database Systems (TODS), 30(1):83–121, 2005.
[48] A. Witkowski, S. Bellamkonda, T. Bozkaya, A. Naimat, L. Sheng, S. Subramanian, and A. Waingold. Query by Excel. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 1204–1215. VLDB Endowment, 2005.
[49] C. Yu and H. V. Jagadish. Schema summarization. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 319–330. VLDB Endowment, 2006.

APPENDIX
A. OPTIMAL HYBRID DATA MODELS

In this section, we demonstrate that the following problem is NP-Hard.

Problem 2 (Hybrid-ROM). Given a spreadsheet with a collection of cells C, identify the hybrid data model T with only ROM tables that minimizes cost(T).

As before, the cost model is defined as:

\[ cost(T) = \sum_{i=1}^{p} \big( s_1 + s_2 \cdot (r_i \times c_i) + s_3 \cdot c_i + s_4 \cdot r_i \big). \quad (7) \]
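As a quick illustration of this cost model (using illustrative constants rather than values calibrated in the paper), the snippet below compares two ways of covering the same L-shaped region of eight filled cells with ROM tables.

```python
# Sketch: evaluate cost(T) = sum_i (s1 + s2*(r_i*c_i) + s3*c_i + s4*r_i)
# for a decomposition given as a list of (rows, cols) ROM tables.
def cost(tables, s1=1.0, s2=1.0, s3=1.0, s4=1.0):
    return sum(s1 + s2 * r * c + s3 * c + s4 * r for r, c in tables)

# An L-shaped region: rows 1-5 of column A plus rows 1-3 of column B (8 cells).
# Decomposition 1: a 3x2 block (rows 1-3, cols A-B) plus a 2x1 block (rows 4-5, col A).
# Decomposition 2: a 5x1 block (col A) plus a 3x1 block (col B).
print(cost([(3, 2), (2, 1)]))   # 12 + 6 = 18
print(cost([(5, 1), (3, 1)]))   # 12 + 8 = 20
```

Under these constants the first decomposition is cheaper, since the wide 3x2 table amortizes the per-row overhead s4 across both columns; the optimal hybrid data model problem asks for the best such decomposition.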
The decision version of the above problem has the following structure: a value k is provided, and the goal is to test whether there is a hybrid data model with cost(T) \le k.

We reduce the minimum edge length partitioning problem [27] for rectilinear polygons to Problem 2, thereby showing that it is NP-Hard. First, a rectilinear polygon is a polygon in which all edges are aligned with either the x-axis or the y-axis. We consider the problem of partitioning a rectilinear polygon into disjoint rectangles using the minimum amount of "ink"; in other words, the minimality criterion is the total length of the edges (lines) used to form the internal partition. Notice that this does not correspond to the minimality criterion of reducing the number of components. We illustrate this in Figure 19, which is borrowed from the original paper [27]. The following decision problem was shown to be NP-Hard in [27]: given a rectilinear polygon P and a number k, is there a rectangular partitioning whose edge length does not exceed k? We now provide the reduction.

Proof for Problem 2. Consider an instance of the polygon partitioning problem with minimum edge length required to be at most k. We are given a rectilinear polygon P. We represent the polygon P in a spreadsheet by filling the cells interior to the polygon, and not filling any other cell in the spreadsheet. Let C = {C_1, C_2, ..., C_m} represent the set of all filled cells in the spreadsheet. We claim that a minimum edge length partition of the given rectilinear polygon P of length at most k exists iff, under the setting s_1 = 0, s_2 = 2|C| + 1, s_3 = s_4 = 1 of the optimal hybrid data model problem, there is a decomposition of the spreadsheet whose storage cost does not exceed k' = k + Perimeter(P)/2 + s_2|C|.

(\Leftarrow) Assume that the spreadsheet we generate using P has a decomposition into rectangles whose storage cost is at most k' = k + Perimeter(P)/2 + s_2|C|. We have to show that there exists a partition with minimum edge length of at most k. We first make the following key observations:

1. There exists a valid decomposition that does not store any blank cell. Assume the contrary, and consider a decomposition that stores a blank cell. Since such a decomposition stores at least |C| + 1 cells, its cost is at least
\[ s_2(|C| + 1) = s_2|C| + s_2 = s_2|C| + 2|C| + 1 > |C|(s_2 + 1 + 1), \]
where |C|(s_2 + 1 + 1) is the cost of storing each filled cell in a separate table. Therefore, if we have a decomposition that stores a blank cell, we also have a decomposition that does not store any blank cell and has lower cost.

2. There exists a decomposition of the spreadsheet where all the tables are disjoint. The argument is similar to the previous case, since storing the same cell twice in different tables is equivalent to storing an extra blank cell.

From the above two observations, we conclude that there exists a decomposition where all tables are disjoint and no table stores a blank cell. Therefore, this decomposition corresponds to partitioning the given spreadsheet into rectangles. We represent this partition of the spreadsheet by T = {T_1, T_2, ..., T_p}. We now show that this partition of the spreadsheet corresponds to a partitioning of the rectilinear polygon P with edge length at most k. We have

\[ cost(T) = \sum_{i=1}^{p} \big( s_1 + s_2 \cdot (r_i \times c_i) + s_3 \cdot c_i + s_4 \cdot r_i \big) = \sum_{i=1}^{p} s_1 + s_2 \sum_{i=1}^{p} (r_i \times c_i) + s_3 \sum_{i=1}^{p} c_i + s_4 \sum_{i=1}^{p} r_i. \]

Substituting s_1 = 0, s_2 = 2|C| + 1, s_3 = s_4 = 1, and noting that \sum_{i=1}^{p} (r_i \times c_i) = |C| (the tables are disjoint and store no blank cells), we get

\[ cost(T) = s_2|C| + 1 \cdot \Big( \sum_{i=1}^{p} c_i + \sum_{i=1}^{p} r_i \Big). \]

Since cost(T) \le k' = k + Perimeter(P)/2 + s_2|C|,

\[ \sum_{i=1}^{p} (r_i + c_i) \le k + \frac{Perimeter(P)}{2} \implies \sum_{i=1}^{p} \frac{Perimeter(T_i)}{2} \le k + \frac{Perimeter(P)}{2} \implies \sum_{i=1}^{p} Perimeter(T_i) \le 2 \times k + Perimeter(P). \]

Since the sum of perimeters of all the tables T_i counts the boundary of P exactly once, and every edge of the internal partition of P exactly twice, the total length of the internal partition edges is at most k.
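As a small sanity check of this construction (a worked example of our own, not part of the original proof), take P to be a 2 x 2 square of filled cells, so |C| = 4, Perimeter(P) = 8, and s_2 = 2|C| + 1 = 9. Covering P with a single 2 x 2 ROM table costs

\[ 0 + s_2 \cdot (2 \times 2) + s_3 \cdot 2 + s_4 \cdot 2 = 36 + 2 + 2 = 40, \]

while the budget is k' = k + Perimeter(P)/2 + s_2|C| = k + 4 + 36 = k + 40. The decomposition fits the budget precisely when k \ge 0, matching the fact that a rectangle needs no internal partition edges.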
Figure 16: Update range performance vs. (a) Sheet Density (b) Column Count (c) Row Count (RCV vs. ROM; time in ms)

Figure 17: Insert row performance vs. (a) Sheet Density (b) Column Count (c) Row Count (RCV vs. ROM; time in ms)

Figure 18: Positional mapping performance for (a) Select (b) Insert (c) Delete (row-number-as-is vs. monotonic vs. hierarchical; time in ms)