0% found this document useful (0 votes)

83 views16 pages

Scaling Up To Billions of Cells With: Supporting Large Spreadsheets With Databases

This document proposes DATA SPREAD, a system that combines spreadsheets and databases to enable working with massive spreadsheets. It discusses challenges in representing spreadsheet data within a database efficiently and supporting common spreadsheet operations. The authors conduct a survey of real-world spreadsheet usage to understand common operations and inform the system design. DATA SPREAD uses a standard relational database backend with a web-based spreadsheet frontend to leverage database scalability while retaining spreadsheet usability.

Uploaded by

sarthak405

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

83 views16 pages

Scaling Up To Billions of Cells With: Supporting Large Spreadsheets With Databases

Uploaded by

sarthak405

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Scaling up to Billions of Cells with DATA S PREAD:

Supporting Large Spreadsheets with Databases

Mangesh Bendre, Vipul Venkataraman, Xinyan Zhou

Kevin Chen-Chuan Chang, Aditya Parameswaran
University of Illinois at Urbana-Champaign (UIUC)
{bendre1 | vvnktrm2 | xzhou14 | kcchang | adityagp}@illinois.edu

ABSTRACT functionality, e.g., expressiveness, that databases natively provide.

Spreadsheet software is the tool of choice for ad-hoc tabular data This discussion raises the following question: can we leverage
management, manipulation, querying, and visualization with adop- relational databases to support spreadsheets at scale? That is, can
tion by billions of users. However, spreadsheets are not scalable, we retain the spreadsheet front-end that so many end-users are so
unlike database systems. We develop DATA S PREAD, a system that comfortable with, while supporting that front-end with a standard
holistically unifies databases and spreadsheets with a goal to work relational database, seamlessly leveraging the benefits of scalability
with massive spreadsheets: DATA S PREAD retains all of the ad- and expressiveness?
vantages of spreadsheets, including ease of use, ad-hoc analysis To address this question, our first challenge is to efficiently rep-
and visualization capabilities, and a schema-free nature, while also resent spreadsheet data within a database. First, notice that while
adding the scalability and collaboration abilities of traditional re- databases natively use an unordered “set” semantics, spreadsheets
lational databases. We design DATA S PREAD with a spreadsheet utilize position as a first-class primitive, thus it is not natural to
front-end and a regular relational database back-end. To integrate represent and store spreadsheet data in a database. Further, spread-
spreadsheets and databases, in this paper, we develop a storage sheets rarely obey a fixed schema — a user may paste several “tab-
and indexing engine for spreadsheet data. We first formalize and ular” or table-like regions within a spreadsheet, often interspersed
study the problem of representing and manipulating spreadsheet with empty rows or columns. Users may also embed formulae
data within a relational database. We demonstrate that identifying into spreadsheets, along with data. This means that considering
the optimal representation is NP-H ARD via a reduction from par- the entire sheet as a single relation, with rows corresponding to
titioning of rectangles; however, under certain reasonable assump- rows of the spreadsheet, and columns corresponding to columns
tions, can be solved in PTIME. We develop a collection of mech- of the spreadsheet, can be very wasteful due to sparsity. At the
anisms for representing spreadsheet data, and evaluate these repre- other extreme, we can consider only storing cells of the spreadsheet
sentations on a workload of typical data manipulation operations. that are filled-in: we can simply store a table with schema (row
We augment our mechanisms with novel positionally-aware index- number, column number, value): this can be effective for highly
ing structures that further improve performance. DATA S PREAD can sparse spreadsheets, but is wasteful for dense spreadsheets with
scale to billions of cells, returning results for common operations well-defined tabular regions. One can imagine hybrid represen-
within seconds. Lastly, to motivate our research questions, we per- tation schemes that use both “tabular” and “sparse” representation
form an extensive survey of spreadsheet use for ad-hoc tabular data schemes as well or schemas that take access patterns, e.g., via for-
management. mulae, into account. In this paper, we show that it is NP-H ARD
to identify the optimal storage representation, given a spreadsheet.
Despite this wrinkle, we characterize a certain natural subset of
1. INTRODUCTION representations for which identifying the optimal one is in fact,
Spreadsheet software, from the pioneering VisiCalc [12] to Mi- PTIME; furthermore, we identify a collection of optimization tech-
crosoft Excel [2] and Google Sheets [1], have found ubiquitous niques that further reduce the computation complexity to the point
use in ad-hoc manipulation, management, and analysis of tabular where the optimal representation can be identified in the time it
data. The billions who use spreadsheets take advantage of not only takes to make a single pass over the data.
its ad-hoc nature and flexibility but also the in-built statistical and The next challenge is in supporting operations on the spread-
visualization capabilities. Spreadsheets cater to both novice and sheet. Notice first that the most primitive operation on a spread-
advanced users, spanning businesses, universities, organizations, sheet is scrolling to an arbitrary position on a sheet. Unlike tra-
government, and home. ditional databases, where order is not a first class citizen, spread-
Yet, this mass adoption of spreadsheets breeds new challenges. sheets require positionally aware access. This motivates the need
With the increasing sizes and complexities of data sets, as well as for positional indexes; we develop and experiment with indexing
types of analyses, we see a frenzy to push the limits: users are strug- mechanisms that adapt traditional indexing schemes, to take posi-
gling to work with spreadsheet software on large datasets; they are tion into account. Furthermore, we need to support modifications to
trying to import large data sets into Excel (e.g., billions of gene- the spreadsheet. Note that even a small modification can be rather
gene interactions) and are failing at doing so. In response, spread- costly: inserting a single row can impact the row number of all sub-
sheet softwares are stretching the size of data they can support; e.g., sequent rows. How do we support such modifications efficiently?
Excel has lifted its size limits from 65k to 1 million rows, and added We develop positional mapping schemes that allow us to avoid the
Power Query and PowerPivot [46, 45] to support one-shot import of expensive computation that results from small modifications.
data from databases in 2010; Google Sheets has expanded its size By addressing the aforementioned challenges, we answer the
limit to 2 million cells. Despite these developments, these moves question of whether we can leverage relational databases to sup-
are far from the kind of scale, e.g., beyond memory limits, and port spreadsheets at scale in the affirmative in this paper. We build

1
a system, DATA S PREAD, that can not only efficiently support op- To study these two aspects, we first retrieve a large collection
erations on billions of records, but naturally incorporates relational of real spreadsheets from four disparate sources, and quantitatively
database features such as expressiveness and collaboration support. analyze them on different metrics. We supplement this quantitative
DATA S PREAD uses a standard relational database as a backend analysis with a small-scale user survey to understand the spectrum
(currently PostgreSQL, but nothing ties us to that database), with of operations frequently performed. The latter is necessary since
a web-based spreadsheet system [3] as the frontend. By using a we do not have a readily available trace of user operations from
standard relational database, with no modifications to the underly- the real spreadsheets (e.g., indicating how often users add rows or
ing engine, we can just seamlessly leverage improvements to the columns, or edit formulae.)
database, while allowing the same data to be used by other appli- We first describe our methodology for both these evaluations,
cations. This allows a clean encapsulation and separation of front- before diving into our findings for the two aspects.
end and back-end code, and also admits portability and a simpler
design. DATA S PREAD is fully functional — the DATA S PREAD re- 2.1 Methodology
sources, along with video and code can be found at dataspread. As described above, we have two forms of evaluation of spread-
github.io. We demonstrated a primitive version of DATA S PREAD sheet use: the first, via an analysis of spreadsheets, and the second,
at the VLDB conference last year [11]. via interviews of spreadsheet users. The datasets can be found at
While there have been many attempts at combining spreadsheets dataspread.github.io.
and relational database functionality, ultimately, all of these at-
tempts fall short because they do not let spreadsheet users perform 2.1.1 Real Spreadsheet Datasets
ad-hoc data manipulation operations [47, 48, 28]. Other work sup- For our evaluation of real spreadsheets, we assemble four datasets
ports expressive and intuitive querying modalities without address- from a wide variety of sources.
ing scalability issues [9, 5, 20], addressing an orthogonal problem. Internet. This dataset was generated by crawling the web for Excel
There have been efforts that enhance spreadsheets or databases (.xls) files, using a search engine, across a wide variety of domains.
without combining them [38]. Furthermore, while there has been As a result, these 53k spreadsheets vary widely in content, ranging
work on array-based databases, most of these systems do not sup- from tabular data to images.
port edits: for instance, SciDB [13] supports an append-only, no-
ClueWeb09. This dataset of 26k spreadsheets was generated by ex-
overwrite data model. We describe related work in more detail in
tracting .xls file URLs from the ClueWeb09 [15] web crawl dataset.
Section 8.
Enron. This dataset was generated by extracting 18k spreadsheets
Rest of the Paper. The outline of the rest of the paper is as follows.
from the Enron email dataset [26]. These spreadsheets were used
• We begin with an empirical study of four real spreadsheet to exchange data within the Enron corporation.
datasets, plus an on-line user survey, targeted at understand-
Academic. This dataset was collected from an academic institu-
ing how spreadsheets are used for data analysis in Section 2.
tion; this academic institution used these spreadsheets to manage
• Then, in Section 3, we introduce the notion of a conceptual internal data about course workloads of instructors, salaries of staff,
data model for spreadsheet data, as well as the set of opera- and student performance.
tions we wish to support on this data model.
We list these four datasets along with some statistics in Table 1.
• In Section 4, we propose three primitive data models for sup- Since the first two datasets are from the open web, they are primar-
porting the conceptual data model within a database, along ily meant for data publication: as a result, only about 29% and 42%
with a hybrid data model that combines the benefits of these of these sheets (column 3) contain formulae, with the formulae oc-
primitive data models. We demonstrate that identifying the cupying less than 3% of the total number of non-empty cells for
optimal hybrid data model is NP-H ARD, but we can develop both datasets (column 5). The third dataset is from a corporation,
a PTIME dynamic programming algorithm that allows us to and is primarily meant for data exchange, with a similarly low frac-
find an approximately optimal solution. tion of 39% of these sheets containing formulae, and 3.35% of the
• Then, in Section 5, we motivate the need for, and develop non-empty cells containing formulae. The fourth dataset is from
indexing solutions for positional mapping—a method for re- an academic institution, and is primarily meant for data analysis,
ducing the impact of cascading updates for inserts and deletes with a high fraction of 91% of the sheets containing formulae, and
on all our data models. 23.26% of the non-empty cells containing formulae.
• We give a brief overview of the system architecture from the
perspective of our data models in Section 6. We also describe 2.1.2 User Survey
how we seamlessly support standard relational operations in To evaluate the kinds of operations performed on spreadsheets,
DATA S PREAD. we solicited participants for a qualitative user survey: we recruited
• We perform experiments to evaluate our data models and po- thirty participants from the industry who exclusively use spread-
sitional mapping schemes in Section 7, and discuss related sheets for data management and analysis. This survey was con-
work in Section 8. ducted via an online form, with the participants answering a small
number of multiple-choice and free-form questions, followed by
the authors aggregating the responses.
2. SPREADSHEET USAGE IN PRACTICE
In this section, we empirically evaluate how spreadsheets are 2.2 Structure Evaluation
used for data management. We use the insights from this evalu- We now use our spreadsheet datasets to understand how data is
ation to both motivate the design decisions for DATA S PREAD, and laid out on spreadsheets.
develop a realistic workload for spreadsheet usage. To the best of Across Spreadsheets: Data Density. First, we study how similar
our knowledge, no such evaluation, focused on the usage of spread- real spreadsheets are to relational data conforming to a specific tab-
sheets for data analytics, has been performed in the literature. ular structure. To study this, we estimate the density of each spread-
We focus on two aspects: (a) structure: identifying how users sheet, defined as the ratio of the filled-in cells to the total number of
structure and manage data on a spreadsheet, and (b) operations: un- cells—specified by the minimum bounding rectangular box enclos-
derstanding the common spreadsheet operations that users perform. ing the filled-in cells—within a spreadsheet. We depict the results

2
Dataset Sheets Sheets with formulae Sheets with > 20% formulae % of formulae Sheets with < 50% density Sheets with < 20% density
Internet 52311 29.15% 20.26% 1.30% 22.53% 6.21%
ClueWeb09 26148 42.21% 27.13% 2.89% 46.71% 23.8%
Enron 17765 39.72% 30.42% 3.35% 50.06% 24.76%
Academic 636 91.35% 71.26% 23.26% 90.72% 60.53%

Table 1: Spreadsheet Datasets: Preliminary Statistics

20 K 6K 4K 300

15 K 3K
4K 200
#Sheets

#Sheets

#Sheets
10 K 2K
2K 100
5K 1K

0 0 0 0
0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1
Density Density Density Density
Figure 1: Data Density — (a) Internet (b) ClueWeb09 (c) Enron (d) Academic

in the last two columns of Table 1, and in Figure 1, which depicts Popularity: Formulae Usage. We begin by studying how often
the distribution of this ratio. We note that spreadsheets within Inter- formulae are used within spreadsheets. On examining Table 1, we
net, Clueweb09, and Enron datasets are typically dense, i.e., more find that there is a high variance in the fraction of cells that are for-
than 50% of the spreadsheets have density greater than 0.5. On the mulae (column 5), ranging from 1.3% to 23.26%. We note that the
other hand, for the Academic dataset, we note a high proportion academic institution dataset embeds a high fraction of formulae,
(greater than 60%) of spreadsheets have density values less than indicating that the spreadsheets in that case are used primarily for
0.2. This low density is because the latter dataset embeds a num- data management and analysis as opposed to data sharing or pub-
ber of formulae and use forms to report data in a user-accessible lication. Despite that, all of the datasets have a substantial fraction
interface. Thus, we have: of spreadsheets where the formulae occupy more than 20% of the
Takeaway 1: Real spreadsheets vary widely in their density, cells (column 4)—20.26% and higher for all datasets.
ranging from highly sparse to highly dense, necessitating data Takeaway 3: Formulae are very common in spreadsheets, with
models that can adapt to such variations. over 20% of the spreadsheets containing a large fraction of
Within a Spreadsheet: Tabular regions. For the spreadsheets that over 15 of formulae, across all datasets. The high prevalence
are sparse, we further analyzed them to evaluate whether there are of formulae indicates that optimizing for the access patterns
regions within these spreadsheets with high density—essentially of formulae when developing data models is crucial.
indicating that these regions can be regarded as tables. To iden-
tify these tabular regions, we first constructed a graph consisting of Access: Formulae Distribution and Access Patterns. Next, we
filled-in cells within each spreadsheet, where two cells (i.e., nodes) study the distribution of formulae used within spreadsheets—see
have an edge between them if they are adjacent either vertically or Figure 2. Not surprisingly, arithmetic operations are very com-
horizontally. We then computed the connected components on this mon across all datasets. The first three datasets have an abundance
graph. We declare a connected component to be a tabular region if of conditional formulae through IF statements (e.g., second bar in
it spans at least two columns and five rows, and has an overall den- Figure 2a)—these statements were typically used to fill in missing
sity of at least 0.7, defined as before as the ratio of the filled-in cells data or to change the data type, e.g., IF(H67=true,1.0,0.0). In contrast,
to the total number of cells in the minimum bounding rectangle en- the Academic dataset is dominated by formulae on numeric data.
compassing the connected component. In Table 2, for each dataset, Overall, there is a wide variety of formulae that span both a small
we list the total number of tabular regions identified (column 2), number of cell accesses (e.g., arithmetic), as well as a large number
the number of filled-in cells covered by these regions (column 3), of them (e.g., SUM, VL short for VLOOKUP). The last two correspond
and the fraction of the total filled-in cells that are captured within to standard database operations such as aggregation and joins.
these tabular regions (column 4). Dataset Total Cells Cells accessed Components
Dataset Tables Table cells %Coverage Accessed per Formula Per Formula
Internet Crawl 67,374 124,698,013 66.03 Internet 2,460,371 334.26 2.5
ClueWeb09 37,164 52,257,649 67.68 ClueWeb09 2,227,682 147.99 1.92
Enron 9,733 8,135,241 60.98 Enron 446,667 143.05 1.75
Academic 286 18,384 12.10 Academic 35,335 3.03 1.54

Table 2: Tabular Regions in Spreadsheets. Table 3: Cells accessed by formulae.

For the Internet Crawl, ClueWeb09, and Enron datasets, we observe To gain a better understanding of how much effort is necessary to
that greater than 60% of the cells are part of tabular regions. We execute these formulae, we measure the number of cells accessed
also note that for the Academic dataset, where the sheets are rather by each formula. Then, we tabulate the average number of cells
sparse, there still are a modest number of regions that are tabular accesses per formula in column 3 of Table 3 for each dataset. As
(286 across 636 sheets). we can see in the table, the average number of cells accesses per
formula is not small—with up to 300+ cells per formula for the In-
Takeaway 2: Even within a single spreadsheet, there is often
ternet dataset, and about 140+ cells per formula for the Enron and
high skew, with areas of both high density and low density,
ClueWeb09 datasets. The Academic dataset has a smaller average
indicating the need for fine-grained data models that can treat
number—many of these formulae correspond to derived columns
these regions differently.
that access a small number of cells at a time. Next, we wanted to
check if the accesses made by these formulae were spread across
2.3 Operation Evaluation the spreadsheet, or could exploit spatial locality. To measure this,
We now move onto evaluating the operations performed on spread- we considered the set of cells accessed by each formula, and then
sheets, both the formulae embedded with spreadsheets, as well as generated the corresponding graph of these accessed cells as de-
other data manipulation, viewing and modification operations. scribed in the previous subsection for computing the number of

3
2.4 M 2.0 M 0.4 M 30 K

2.0 M 1.6 M
0.3 M
#Formulae

#Formulae

#Formulae
1.6 M 20 K
1.2 M
1.2 M 0.2 M
0.8 M
0.8 M 10 k
0.1 M
0.4 M 0.4 M

0 0 0 0
ARITH IF LN BLANK SUM VL ... ARITH IF SUM NUM SEARCH AND ARITH SUM IF BLANK AND RAND ... ARITH SUM LOG ROUND LN FLOOR ...
Formula Formula Formula Formula
Figure 2: Formulae Distribution — (a) Internet (b) ClueWeb09 (c) Enron (d) Academic
30
1 Our second goal for performing the study was to understand
2
3 how users organize their data on a spreadsheet. We asked each
20
4 participant if their data is organized in well-structured tables, or if
Usage

5
the data scattered throughout the spreadsheet, on a scale of 1 (not
10 organized)–5 (highly organized)—see Figure 3. Only five partici-
pants marked < 4 which indicates that users do organize their data
on a spreadsheet (column 5). We also asked the importance of or-
0
Scrolling Changing Formula Row/Col Tabular Ordering dering of records in the spreadsheet on a scale of 1 (not important)–
Spreadsheet Operations 5 (highly important). Unsurprisingly, only five participants marked
< 4 for this question (column 6). We also provided a free-form
Figure 3: Operations performed on spreadsheets. textual input where multiple participants mentioned that ordering
tabular regions. We then counted the number of connected com- comes naturally to them and is often taken for granted while using
ponents in this graph, and tabulated the results in column 4 in the spreadsheets.
same table. As can be seen, even though the number of cells ac- Takeaway 6: Spreadsheet users typically try to organize their
cessed may be large, these cells stem from a small number of con- data as far as possible on the spreadsheet, and rely heavily
nected components; as a result, we can exploit spatial locality to on the ordering and presentation of the data on their spread-
execute them more efficiently. sheets.
Takeaway 4: Formulae on spreadsheets access cells on the
spreadsheet by position; some common formulae such as SUM 3. SPREADSHEET DESIDERATA
or VLOOKUP access a rectangular range of cells at a time. The goal for DATA S PREAD is to combine the ease of use and
The number of cells accessed by these formulae can be quite interactivity of spreadsheets, while simultaneously providing the
large, and most of these cells stem from contiguous areas of scalability, expressiveness, and collaboration capabilities of databases.
the spreadsheet. Thus, as we develop DATA S PREAD, having two aspects of inter-
est: first, how do we support spreadsheet semantics over a database
User-Identified Operations. In addition to identifying how users backend, and second, how do we support database operations within
structure and manage data on a spreadsheet, we now analyze the a spreadsheet. Our primary focus will be on the former, which
common spreadsheet operations that users perform. To this end, we will occupy the bulk of the paper. We return to the latter in Sec-
conducted a small-scale online survey of 30 participants to study tion 6. For now, we focus on describing the desiderata for support-
how users operate on spreadsheet data. This qualitative study is ing spreadsheet semantics over databases. We first describe our
valuable since real spreadsheets do not reveal traces of user oper- conceptual spreadsheet data model, and then describe the desired
ations performed on them (e.g., revealing how often users perform operations that need to be supported on this conceptual data model.
ad-hoc operations like scrolling, sorting, deleting rows or columns). Conceptual Data Model. A spreadsheet consists of a collection
Our questions in this study were targeted at understanding (a) how of cells. A cell is referenced by two dimensions: row and column.
users perform operations on the spreadsheet and (b) how users or- Columns are referenced using letters A, . . ., Z, AA, . . .; while rows
ganize data on the spreadsheet. are referenced using numbers 1, 2, . . . Each cell contains either a
With the goal of understanding how users perform operations on value, or a formula. A value is a constant belonging to some fixed
the spreadsheet, we asked each participant to answer a series of type. For example, in Figure 4 a screenshot from our working im-
questions where each question corresponded to whether they con- plementation of DATA S PREAD, B2 (column B, row 2) contains the
ducted the specific operation under consideration on a scale of 1–5, value 10. In contrast, a formula is a mathematical expression that
where 1 corresponds to “never” and 5 to “frequently”. For each contains values and/or cell references as arguments, to be manipu-
operation, we plotted the results in a stacked bar chart in Figure 3, lated by operators or functions. The expression corresponding to a
with the higher numbers stacked on the smaller ones like the legend formula eventually unrolls into a value. For example, in Figure 4,
indicates. cell F2 contains the formula =AVERAGE(B2:C2)+D2+E2, which unrolls
We find that all the thirty participants perform scrolling, i.e., into the value 85. The value of F2 depends on the value of cells B2,
moving up and down the spreadsheet to examine the data, with C2, D2, and E2, which appear in the formula associated with F2.
22 of them marking 5 (column 1). All participants reported to have In addition to a value or a formula, a cell could also additionally
performed editing of individual cells (column 2), and many of them have formatting associated with it; e.g., a cell could have a specific
reported to have performed formula evaluation frequently (column width, or the text within a cell can have bold font, and so on. For
3). Only four of the participants marked < 4 for some form of simplicity, we ignore formatting aspects, but these aspects can be
row/column-level operations, i.e., deleting or adding one or more easily captured within our representation schemes without signifi-
rows or columns at a time (column 4). cant changes.
Spreadsheet Operations. We now describe the operations that we
Takeaway 5: There are several common operations performed
aim to support on DATA S PREAD, drawing from the operations we
by spreadsheet users including scrolling, row and column
found in our user survey (takeaway 5). We consider the following
modification, and editing individual cells.
read-only operations:

4
Pp
minimize size(T ) = i=1 size(Ti ). Moreover, we would like
to minimize the time taken for accessing data using T , i.e., the
access cost, which is the cost of accessing a rectangular range of
cells for formulae (takeaway 4) or scrolling to specific locations
(takeaway 5), which are both common operations. And we would
like to minimize the time taken to perform updates, i.e., the update
cost, which is the cost of updating individual cells or a range of
cells, and the insertion and deletion of rows and columns.
Figure 4: Sample Spreadsheet (DATA S PREAD screenshot). Overall, starting from a collection of cells C, our goal is to iden-
tify a physical data model T such that: (a) T is recoverable with
• Scrolling: This operation refers to the act of retrieving cells respect to C, and (b) T minimizes a combination of storage, access
within a certain range of rows and columns. For instance, and update costs, among all T 2 P.
when a user scrolls to a specific position on the spreadsheet, We begin by considering the setting where the physical data
we need to retrieve a rectangular range corresponding to the model T has a single relational table, i.e., T = {T1 }. We develop
window that is visible to the user. Accessing an entire row or three ways of representing this table: we call them primitive data
column, e.g., A:A, is a special case of rectangular range where models, and are all drawn from prior work, each of which work
the column/row corresponding to the range is not bounded. well for a specific structure of spreadsheet—this is the focus of
• Formula evaluation: Evaluating formulae can require ac- Section 4.2. Then, we extend this to the setting where |T | > 1 by
cessing multiple individual cells (e.g., A1) within the spread- defining the notion of a hybrid data model with multiple tables each
sheet or ranges of cells (e.g., A1:D100). of which uses one of the three primitive data models to represent a
Note that in both cases, the accesses correspond to rectangular re- certain portion of the spreadsheet—this is the focus of Section 4.3.
gions of the spreadsheet. We consider the following four update Given the high diversity of structure within spreadsheets and high
operations: skew (takeaway 2), having multiple primitive data models, and the
• Updating an existing cell: This operation corresponds to ability to use multiple tables, gives us substantial power in repre-
accessing a cell with a specific row and column number and senting spreadsheet data.
changing its value. Along with cell updates, we are also re-
quired to reevaluate any formulae dependent on the cell.
4.2 Primitive Data Models
Our primitive data models represent trivial solutions for spread-
• Inserting/Deleting row/column(s): This operation corresp-
sheet representation with a single table. Before we describe these
onds to inserting/deleting row/column(s) at a specific posi-
data models, we discuss a small wrinkle that affects all of these
tion on the spreadsheet, followed by shifting subsequent row/-
models. To capture a cell’s identity, i.e., its row and column num-
column(s) appropriately.
ber, we need to implicitly or explicitly record a row and column
Note that, similar to read-only operations, the update operations number with each cell. Say we use an attribute to capture the row
require updating cells corresponding to rectangular regions. number for a cell. Then, the insertion or deletion of rows requires
In the next section, we develop data models for representing the cascading updates to the row number attribute for all subsequent
conceptual data model as described in this section, with an eye to- rows. As it turns out, all of the data models we describe in this
wards supporting the operations described above. section suffer from performance issues arising from cascading up-
dates, but the solution to deal with these issues is similar for of
4. REPRESENTING SPREADSHEETS these all of them, and will be described in Section 5.
We now address the problem of representing a spreadsheet within Also, note that the access and update cost of various data mod-
a relational database. For the purpose of this section and the next, els depends on whether the underlying database is a row store or a
we focus on representing one spreadsheet, but our techniques seam- columnar store. For the rest of this section and the paper, we fo-
lessly carry over to the multiple spreadsheet case; like we described cus on a row store, such as PostgreSQL, which is what we use in
earlier, we focus on the content of the spreadsheet as opposed to the practice, and is also more tailored for hybrid read-write settings.
formatting, as well as other spreadsheet metadata, like spreadsheet We now describe the three primitive data models:
name(s), spreadsheet dimensions, and so on. Row-Oriented Model (ROM). The row-oriented data model (ROM)
We describe the high-level problem of representation of spread- is straightforward, and is akin to data models used in traditional re-
sheet data here; we will concretize this problem subsequently. lational databases. Let rmax and cmax represent the maximum row
number and column number across all of the cells in C. Then, in the
4.1 High-level Problem Description ROM model, we represent each row from row 1 to rmax as a sep-
The conceptual data model corresponds to a collection of cells, arate tuple, with an attribute for each column Col1 . . ., Colcmax ,
represented as C = {C1 , C2 , . . . , Cm }; as described in the previ- and an additional attribute for explicitly capturing the row iden-
ous section, each cell Ci corresponds to a location (i.e., a specific tity, i.e., RowID. The schema for ROM is: ROM(RowID, Col1,
row and column), and has some contents—either a value or a for- . . ., Colcmax )—we illustrate the ROM representation of Figure 4
mula. Our goal is to represent and store the cells C comprising in Figure 5: each entry is a pair corresponding to a value and a
the conceptual data model, via one of the physical data models, formula, if any. For dense spreadsheets that are tabular (takeaways
P. Each T 2 P corresponds to a collection of relational tables 1 and 2), this data model can be quite efficient in storage and ac-
{T1 , . . . , Tp }. Each table Ti records the data in a certain portion of cess, since it minimizes redundant information: each row number is
the spreadsheet, as we will see subsequently. Given a collection C, recorded only once, independent of the number of columns. Over-
a physical data model T is said to be recoverable with respect to C all, the ROM representation shines when entire rows are accessed
if for each Ci 2 C, 9Tj 2 T such that Tj records the data in Ci , at a time, as opposed to entire columns. It is also efficient for ac-
and 8k 6= j, Tk does not record the data in Ci . Thus, our goal is to cessing a large range of cells at a time.
identify physical data models that are recoverable. Column-Oriented Model (COM). The second representation is
At the same time, we want to minimize the amount of storage also straightforward, and is simply the transpose of the ROM rep-
required to record T within the database, i.e., we would like to resentation. Often, we find that certain spreadsheets have many

5
3 2
RowID Col1 ... Col6
1 ID, NULL ... Total, NULL A B C D E F G H I
2 Alice, NULL ... 85, AVERAGE(B2:C2)+D2+E2 1 ✕ ✕ ✕ ✕
... ... ... ...
2 ✕ ✕ ✕
Figure 5: ROM Data Model for Figure 4.
3 ✕ ✕ ✕

columns and relatively few rows, necessitating such a representa- 4 ✕ ✕ ✕ 1

tion. The schema for COM is: COM(ColID, Row1, . . ., Rowrmax ). 5 ✕ ✕ ✕
The COM representation of Figure 4 is provided in Figure 6. Like 6 ✕ ✕ ✕ ✕
ROM, COM shines for dense data; while ROM shines for row- 7 ✕ ✕ ✕
oriented operations, COM shines for column-oriented operations. 5 4
ColID Row1 ... Row5
Figure 8: Hybrid Data Model and its Recursive Decomposition
p
1 ID,NULL ... Dave,NULL X
2 HW1,NULL ... 8,NULL cost(T ) = s1 + s2 · (ri ⇥ ci ) + s3 · ci + s4 · ri . (1)
... ... ... ... i=1
Figure 6: COM Data Model for Figure 4.
Here, the constant s1 is the cost of initializing a new table, as well
Row-Column-Value Model (RCV). The Row-Column-Value Model as storing table-related metadata, while the constant s2 is the cost of
(RCV) is inspired by key-value stores, where the Row-Column storing each individual cell (empty or not) in the ROM table. Note
number pair is treated as the key, i.e., the row and column identi- that the non-empty cells that have content may require even more
fiers are explicitly captured as two attributes. The schema for RCV space than s2 ; however this is a constant cost that does not depend
is RCV(RowID, ColID, V alue). The RCV representation for Fig- on the specific hybrid data model instance, and hence is excluded
ure 4 is provided in Figure 7. For sparse spreadsheets that are often from the cost above. The constant s3 is the cost corresponding
found in practice (takeaway 1 and 2), this model is quite efficient in to each column, while s4 is the cost corresponding to each row.
storage and access since it records only the cells that are filled in, The former is necessary to record schema information per column,
but for dense spreadsheets, it incurs the additional cost of record- while the latter is necessary to record the row information in the
ing and retrieving both the row and column number for each cell RowID attribute. Overall, while the specific costs si may differ
as compared to ROM and COM, and has a much larger number of quite a bit across different database systems, what is clear is that all
tuples. RCV is also efficient when it comes to retrieving specific of these different costs matter.
cells at a time.
Formal Problem. We are now ready to state our formal problem
RowID ColID Value below.
1 1 ID, NULL P ROBLEM 1 (H YBRID -ROM). Given a spreadsheet with a
... ... ..., ... collection of cells C, identify the hybrid data model T with only
2 2 10, NULL ROM tables that minimizes cost(T ).
... ... ..., ...
2 6 85, AVERAGE(B2:C2)+D2+E2
Unfortunately, Problem 1 is NP-H ARD, via a reduction from the
... ... ..., ... minimum edge length partitioning problem [27] of rectilinear poly-
Figure 7: RCV Data Model for Figure 4. gons—the problem of finding a partitioning of a polygon whose
edges are aligned to the X and Y axes, into rectangles, while min-
4.3 Hybrid Data Model: Intractability imizing the total sum of the perimeter of the resulting rectangles.
So far, we developed three primitive data models, that represent
T HEOREM 1 (I NTRACTABILITY ). Problem 1 is NP-H ARD.
reasonable extremes if we are to represent and store a spreadsheet We formally show the hardness of the problem in Appendix A.
within a single table in a database system. If, however, we do not
limit data models to have a single table, we may be able to develop 4.4 Optimal Recursive Decomposition
even better solutions by combining the benefits of the three prim- Instead of directly solving Problem 1, which is intractable, we
itive data models, and decomposing the spreadsheet into multiple instead aim to make it tractable, by reducing the search space of
tables each of which is represented by one of the primitive data solutions. In particular, we focus on hybrid data models that can
models. We call these data models as hybrid data models. be obtained by recursive decomposition. Recursive decomposition
D EFINITION 1 (H YBRID DATA M ODELS ). Given a collection is a process where we repeatedly subdivide the spreadsheet area
of cells C, we define hybrid data models to the space of physical from [1 . . . rmax , 1 . . . cmax ] by using a vertical cut between two
data models that are formed using a collection of tables T such that columns or a horizontal cut between two rows, and then recurse
the T is recoverable with respect to C, and further, each Ti 2 T is on the two areas that are formed. As an example, in Figure 8, we
either a ROM, COM, or an RCV table. can make a cut along line 1 horizontally, giving us two regions
As an example, for the spreadsheet in Figure 8, we might want from rows 1 to 4 and rows 5 to 6. We can then cut the top portion
the dense areas, i.e., B1:D4 and D5:G7, represented via a ROM or along line 2 vertically, followed by line 3, separating out one table
COM table each and the remaining area, specifically, H1 and I2 to B1:D4. By cutting the bottom portion along line 4 and line 5, we
be represented by an RCV table. can separate out the table D5:G7. Further cuts can help us carve out
Cost Model. Next, the question is how do we model the cost for tables out of H1 or I2, not depicted here.
a specific hybrid data model. As discussed earlier, the storage, the As the example illustrates, recursive decomposition is very pow-
access cost, and the update cost all impact our choice of hybrid data erful, since it captures a broad space of hybrid models; basically
model. For the purpose of this section, we will focus on exclusively anything that can be obtained via recursive cuts along the x and y
on the storage. We will generalize to the access cost in Appendix B. axis. Now, a natural question is: what sorts of hybrid data models
The update cost will be the focus of the next section. Furthermore, cannot be composed via recursive decomposition? We present an
our focus will now be on ROM tables; we will generalize to RCV example in Figure 9(a).
and COM tables in Section 4.6. O BSERVATION 1 (C OUNTEREXAMPLE ). In Figure 9(a), the
Given a hybrid data model T = {T1 , . . . , Tp }, where each ROM tables: A1:B4, D1:I2, A6:F7, H4:I7 can never be obtained via recursive
table Ti has ri rows and ci columns, the cost of T is defined as decomposition.

6
A B C D E F G H I 2 1 3 1 2
1 ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
A C D G H Time Complexity. Our dynamic programming algorithm runs in
2 ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
1 ✕
polynomial time with respect to the size of the spreadsheet. Let
2 ✕ ✕ ✕
3 ✕ ✕ the length of the larger side of the minimum enclosing rectangle
1 3 ✕
4 ✕ ✕ ✕ ✕ of the spreadsheet is of size n. Then, the number of candidate
1 4 ✕ ✕
5 ✕ ✕ rectangles is O(n4 ). For each rectangle, we have O(n) ways to
1 5 ✕
6 ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ perform the cut. Therefore, the running time of our algorithm is
7 ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕
2 6 ✕ ✕ ✕ ✕
O(n5 ). However, this number could be very large if the spread-
Figure 9: (a) Counterexample (b) Weighted Representation sheet is massive—which typical of the use-cases we aim to tackle.
Weighted Representation. We now describe a simple optimiza-
To see this, note that any vertical or horizontal cut that one would tion that helps us reduce the time complexity substantially, while
make at the start would cut through one of the four tables, mak- preserving optimality for the cost model that we have been using
ing the decomposition impossible. Nevertheless, the hybrid data so far. Notice that in many real spreadsheets, there are many rows
models obtained via recursive decomposition form a natural class and columns that are very similar to each other in structure, i.e.,
of data models. they have the same set of filled cells. We exploit this property to
As it turns out, identifying the solution to Problem 1 is PTIME reduce the effective size n of the spreadsheet. Essentially, we col-
for the space of hybrid data models obtained via recursive decom- lapse rows that have identical structure down to a single weighted
position. The algorithm involves dynamic programming. Infor- row, and similarly collapse columns that have identical structure
mally, our algorithm makes the most optimal “cut” horizontally or down to a single weighted column.
vertically at every step, and proceeds recursively. We now describe Consider Figure 9(b) which shows the weighted version of Fig-
the dynamic programming equations. ure 9(a). Here, we can collapse column B down into column A,
Consider a rectangular area formed from x1 to x2 as the top and which is now associated with weight 2; similarly, we can collapse
bottom row numbers respectively, both inclusive, and from y1 to y2 row 2 into row 1, which is now associated with weight 2. In this
as the left and right column numbers respectively, both inclusive, manner, the effective area of the spreadsheet now becomes 5⇥5 as
for some x1 , x2 , y1 , y2 . We represent the optimal cost by the func- opposed to 7⇥9.
tion Opt(). Now, the optimal cost of representing this rectangular Now, we can apply the same dynamic programming algorithm
area, i.e., Opt((x1 , y1 ), (x2 , y2 )), is the minimum of the following to the weighted representation of the spreadsheet: in essence, we
possibilities: are avoiding making cuts “in-between” the weighted edges, thereby
• If there is no filled cell in the rectangular area (x1 , y1 ), (x2 , y2 ), reducing the search space of hybrid data models. As it turns out,
then we do not use any data model. Hence, we have this does not sacrifice optimality, as the following theorem shows:
T HEOREM 3 (W EIGHTED O PTIMALITY ). The optimal hybrid
Opt((x1 , y1 ), (x2 , y2 )) = 0 (2) data model obtained by recursive decomposition on the weighted
spreadsheet is no worse than the optimal hybrid data model ob-
• Do not split, i.e., store as a ROM model (romCost()): tained by recursive decomposition on the original spreadsheet.
romCost((x1 , y1 ), (x2 , y2 )) = s1 + s2 · (r12 ⇥ c12 )
4.5 Greedy Decomposition Algorithms
+ s3 · c12 + s4 · r12 , (3)
Greedy Decomposition. To improve the running time even fur-
where number of rows r12 = (x2 x1 + 1), and the number ther, we propose a greedy heuristic that avoids the high complexity
of columns c12 = (y2 y1 + 1). of the dynamic programming algorithm, but sacrifices somewhat on
• Perform a horizontal cut (CH ): optimality. The greedy algorithm essentially repeatedly splits the
spreadsheet area in a top-down manner, making a greedy locally
CH = min Opt((x1 , y1 ), (i, y2 )) optimal decision, instead of systematically considering all alterna-
i2{x1 ,...,x2 } tives, like in the dynamic programming algorithm. Thus, at each
+ Opt((i + 1, y1 ), (x2 , y2 )). (4) step, when operating on a rectangular spreadsheet area (x1 , y1 ), (x2 , y2 ),
it identifies the operation that results in the lowest local cost. We
• Perform a vertical cut (CV ): have three alternatives: Either we do not split, in which case the
cost is from Equation 3, i.e., romCost((x1 , y1 ), (x2 , y2 )). Or we
CV = min Opt((x1 , y1 ), (x2 , j)) split horizontally (vertically), in which case the cost is the same as
j2{y1 ,...,y2 }
CH (CV ) from Equation 4 (Equation 5), but with Opt() replaced
+ Opt((x1 , j + 1), (x2 , y2 )). (5) with romCost(), since we are making a locally optimal decision.
The smallest cost decision is followed, and then we continue recur-
Therefore, when there are filled cells in the rectangle, sively decomposing using the same rule on the new areas, if any.
Complexity. This algorithm has a complexity of O(n2 ), since each
Opt((x1 , y1 ), (x2 , y2 )) = step takes O(n) and there are O(n) steps. While the greedy algo-
min romCost((x1 , y1 ) , (x2 , y2 )), CH , CV . (6) rithm is sub-optimal, the local decision that it makes is optimal in
the worst case, i.e., with no further information about the structure
else Opt((x1 , y1 ), (x2 , y2 )) = 0. of the areas that arise as a result of the decomposition, this is the
The base case is when the rectangular area is of dimension 1 ⇥ 1. best decision to make at each step.
Here, we store the area as a ROM table if it is a filled cell. Hence, Aggressive Greedy Decomposition. The greedy algorithm de-
we have, Opt((x1 , y1 ), (x1 , y1 )) = c1 + c2 + c3 + c4 , if filled, scribed above stops exploration as soon as it is unable to find a
and 0 if not. cut that reduces the cost locally, based on a worst case assumption.
We have the following theorem: This may cause the algorithm to halt prematurely, even though ex-
T HEOREM 2 (DYNAMIC P ROGRAMMING O PTIMALITY ). The ploring further decompositions may have helped reduce the cost.
optimal ROM-based hybrid data model based on recursive decom- An alternative to the greedy algorithm described above is one where
position can be determined via dynamic programming. we don’t stop subdividing, i.e., we always choose to use the best

7
Key Data
horizontal or vertical cut, and then subdivide the area based on that 1 100 abc
3 4
cut in a depth-first manner. We keep doing this until we end up with 2 200 aui
rectangular areas where all of the cells are filled in with values. (At 3 250 ois
this point, it provably doesn’t benefit us to subdivide further.) After 4 300 kov Non-Leaf Nodes
333 rte
this point, we backtrack up the tree of decompositions, bottom-up, 5
1 2 3 1
(counts &
6 350 pos children pointers)
assembling the best solution that was discovered, similar to the dy- 400 iks
7
namic programming approach, considering whether to not split, or 8 500 bhg
perform a horizontal or vertical split. 600 kis Leaf Nodes
... ...
9
5 1 3 4 6 7 8
Complexity. Like the greedy approach, the aggressive greedy ap- (values)

proach has complexity O(n2 ), but takes longer since it considers a Figure 10: (a) Monotonic Positional Mapping (b) Index for Hierar-
larger space of data models than the greedy approach. chical Positional Mapping
4.6 Extensions particular, we want a data structure on items (here tuples) that can
In this section, we describe extensions to the cost model and capture a specific ordering among the items and efficiently support
algorithms to handle COM and RCV tables in addition to ROM. the following operations: (a) fetch items based on a position, (b) in-
Other extensions can be found in Appendix B, including incorpo- sert items at a position, and (c) delete items from a position. The
rating access cost along with storage, including the costs of indexes, insert and delete operations require updating the positions of the
and dealing with situations when database systems impose limita- subsequent items, e.g., inserting an item at the nth position requires
tions on the number of columns in a relation. We will describe these us to first increment by one the positions of all the items that have
extensions to the cost model, and then describe the changes to the a position greater than or equal to n, and then add the new item
basic dynamic programming algorithm; modifications to the greedy at the nth position. Due to the interactive nature of DATA S PREAD,
and aggressive greedy decomposition algorithms are straightfor- our goal is to perform these operations within a few hundred mil-
ward. liseconds.
RCV and COM. The cost model can be extended in a straightfor- Row Number as-is. We motivate the problem by demonstrating
ward manner to allow each rectangular area to be a ROM, COM, the impact of cascading updates in terms of time complexity. Stor-
or an RCV table. First, note that it doesn’t benefit us to have mul- ing the row numbers as-is with every tuple makes the fetch opera-
tiple RCV tables—we can simply combine all of these tables into tion efficient at the expense of making the insert and delete opera-
one, and assume that we’re paying a fixed up-front cost to have one tions inefficient. With a traditional index, e.g., a B-Tree index, the
RCV table. Then, the cost for a table Ti , if it is stored as a COM complexity to access an arbitrary row identified by a row number is
table is: O(log N ). On the other hand, insert and delete operations require
comCost(Ti ) = s1 + s2 · (ri ⇥ ci ) + s4 · ci + s3 · ri . updating the row numbers of the subsequent tuples. These updates
also need to be propagated in the index, and therefore it results in a
This equation is the same as Equation 1, but with the last two con- worst case complexity of O(N log N ). To illustrate the impact of
stants transposed. And the cost for a table Ti , if it is stored as an these complexities on practice, in Table 4(a), we display the perfor-
RCV table is simply: mance of storing the row numbers as-is for two operations—fetch
rcvCost(Ti ) = s5 ⇥ #cells. and insert—on a spreadsheet containing 106 cells. We note that
where s5 is the cost incurred per tuple. Once we have this cost irrespective of the data model used, the performance of inserts is
model set up, it is straightforward to apply dynamic programming beyond our acceptable threshold whereas that of the fetch opera-
once again to identify the optimal hybrid data model encompassing tion is acceptable.
ROM, COM, and RCV. The only step that changes in the dynamic Row Number as-is Positional Mapping
programming equations is Equation 3, where we have to consider Operation RCV ROM Operation RCV ROM
the COM and RCV alternatives in addition to ROM. We have the Insert 87,821 1,531 Insert 9.6 1.2
following theorem. Fetch 312 244 Fetch 30,621 273
T HEOREM 4 (O PTIMALITY WITH ROM, COM, AND RCV).
The optimal ROM, COM, and RCV-based hybrid data model based Table 4: The performance of (in ms) (a) storing Row Number as-is
on recursive decomposition can be determined via dynamic pro- (b) Monotonic Positional Mapping.
gramming.
Intuition. To improve the performance of inserts and deletes for
ordered items, we introduce the idea of positional mapping. At its
5. POSITIONAL MAPPING core, the idea is remarkably simple: we do not store positions but
As discussed in Section 4, for all of the data models, storing the instead store what we call positional mapping keys. These posi-
row and/or column numbers may result in substantial overheads tional mapping keys p are proxies that have a one-to-one mapping
during insert and delete operations due to cascading updates to all with the positions r, i.e., p r. Formally, positional mapping M
subsequent rows or columns—this could make working with large is a bijective function that maintains the relationship between the
spreadsheets infeasible. In this section, we develop solutions for row numbers and positional mapping keys, i.e., M(r) ! p.
this problem by introducing the notion of positional mapping to Monotonic Positional Mapping. One approach towards positional
eliminate the overhead of cascading updates. For our discussion mapping is to have positional mapping keys monotonically increase
we focus on row numbers; the techniques can be analogously ap- with position, i.e., for two arbitrary positions ri and rj , if ri > rj
plied to columns. To keep our discussion general, we use the term then M(rj ) > M(ri ). For example, consider the ordered list of
position to represent the ordinal number, i.e., either row or column items shown in Figure 10(a). Here, even though the positional map-
number, that captures the location of the cell along a specific di- ping keys do not correspond to the row number, and even though
mension. In addition, row and column numbers can be dealt with there can be arbitrary differences between consecutive positional
independently. mapping keys, we can fetch the nth record by scanning the posi-
Problem. We require a data structure to efficiently support positional mapping keys in an increasing order while maintaining a run-
tional operations without the overhead of cascading updates. In ning counter to skip n-1 records. The gaps between the consecutive

8
User
positional mapping keys reduce or even eliminate the renumbering Interface Ajax
Responses
Ajax
Requests
during insert and delete operations. Web Browser

Thus, monotonic positional mapping trades-off the performance

of the fetch operation for making insert and delete operations effi- Execution View Controller
cient. To fetch the nth item, in the absence of the stored position we Engine

need to scan n items, i.e., the average time complexity is O(N ),

where N is the total number of items. If we know the positional
mapping key of the item we are fetching (which is often not the Evaluator Dependency Parser

case), and we have a traditional B+tree index on this key, then the LRU Cell Cache
complexity of this operation is O(log N ). Similarly, the complex-
ity of inserting an item if we know the positional mapping key, Hybrid Translator

determined based on the positional mapping keys of neighboring ROM COM RCV
items, is O(log N ), which is the effort spent to update the under- Translator Translator Translator

lying indexing structure. In Table 4(b), we experimentally observe Positional Mapper

that benefits from monotonic positional mapping for the insert op-
erations come at the expense of the fetch operation, leading to un- Storage Database
acceptable latencies.
Hierarchical Positional Mapping. We now describe a scheme, ti- Spreadsheet Data Pos. Index
ROM COM RCV
tled hierarchical positional mapping, that enhances monotonic po- Metadata
Hybrid
Optimizer
sitional mapping, by adding a new indexing structure that allevi-
ates the cost of insert and delete operations, while not sacrificing Figure 11: DATA S PREAD Architecture
the performance of the fetch operation. This new indexing struc-
cally evaluate our positional mapping schemes in Section 7.
ture adapts classical work on order-statistic trees [19]. Just like a
typical B+Tree is used to capture the mapping from keys to the cor- Operation on nth record.
responding records, we can use the same structure to map positions Positional Mapping Method Fetch Insert/Delete
to positional mapping keys. Here, instead of storing a key we store Row Number as-is O(log N ) O(N )
the count of elements stored within the entire sub-tree. The leaf Monotonic Positional Mapping O(N ) O(log N )
nodes store the values, while the remaining nodes store pointers to Hierarchical Positional Mapping O(log N ) O(log N )
the children along with counts. Table 5: Complexity of different positional mapping methods.
For the positional mapping shown in Figure 10(a), we show the
corresponding hierarchical positional mapping index structure in
Figure 10(b). Similar to a B+tree of order m, our structure satis- 6. DATA S PREAD ARCHITECTURE
fies the following invariants. (a) Every node has at most ⌃ m⌥ chil- We have implemented DATA S PREAD as a web-based tool on
dren. (b) Every non-leaf node (except root) as at-least m 2
chil- top of a PostgreSQL relational database implementing the Model-
dren. (c) All leaf nodes appear at the same level. Again similar to View-Controller approach. The system currently supports basic
B+tree, we ensure the invariants by either splitting a node into two spreadsheet operations, e.g., scrolling to arbitrary positions, inser-
when the number of children overflow or merging two nodes into tion of rows or columns, and formulae insert and evaluation, on
one when the number of children underflow. This ensures that the large spreadsheets that are persisted in the PostgreSQL database.
height of the tree is at most logdm/2e N . Figure 11 illustrates DATA S PREAD’s architecture, which at a
Hierarchical Positional Mapping: Fetch. Our hierarchical index- high level can be divided into three main layers, i.e., (a) user inter-
ing structure makes accessing the item at the nth position efficient, face, (b) execution engine, and (c) storage. The user interface layer
using the following steps: (i) We start from the root node. (ii) At a consists of a spreadsheet widget, which presents a spreadsheet on a
node, we identify the child node to traverse next, by subtracting the web-based interface to users and records the interactions on it. The
count associated with the children iteratively from n, left to right, execution engine layer is a web application developed in Java that
as long as the remainder is positive. This step adjusts the value resides on an application server. The controller accepts user inter-
of n; we then move one level down in the tree to that child node. actions in form of events and identifies the corresponding actions,
(iii) We repeat the previous step until we reach a leaf node, after e.g., a formula update is sent to the formula parser, an update to a
which we extract the nth element from this node. Now, we have cell is sent to the cell cache. The dependency graph captures the
the key with which to probe a traditional B+tree index on the posi- formula dependencies between the cells and aids in triggering the
tional mapping keys, as in monotonic positional mapping. Overall, computation of dependent cells. The positional mapper translates
the complexity of this operation is O(log N ). the row and column numbers into the corresponding stored identi-
Hierarchical Positional Mapping: Insert/Delete. Insert and delete fiers and vice versa. The ROM, COM, RCV, and hybrid translators
operations require updating the counts associated with all of the use their corresponding spreadsheet representations and provide a
nodes that fall on the path between the root and the leaf node cor- “collection of cells” abstraction to the upper layers. This collection
responding to the position that is being updated. As before, we of cells are then cached in memory via an LRU cell cache. The stor-
first identify the leaf node as discussed for a fetch operation, fol- age layer consists of a relational database, which is responsible for
lowed by updating the item at the leaf node, and traversing back up persisting data. This data is persisted using a combination of ROM,
the tree to the root. Simultaneously, we use the traditional B+tree COM and RCV data models (as described in Section 4) along with
index on the positional mapping keys to update the corresponding positional indexes, which map row and column numbers to corre-
positional mapping key. Once again, the complexity of this opera- sponding stored identifiers (as described in Section 5), and meta-
tion is O(log N ). data, which records information about the hybrid data model, and
In Table 5, we contrast the complexity of the hierarchical posi- which tables are responsible for handling which rectangular areas
tional mapping scheme against other positional mapping schemes, on the spreadsheet. The hybrid optimizer determines the optimal
and demonstrate that it dominates the other schemes. We empiri- hybrid data model and is responsible for migrating data across dif-
ferent tables and primitive data models.

9
Relational Operations in Spreadsheet. Since DATA S PREAD is 10000
DP
Greedy
built on top of a traditional relational database, it can leverage the Agg

SQL engine of the database and seamlessly support SQL queries on

Avg time (ms)

the front-end spreadsheet interface. We describe how we support 1000

standard relational operations in more detail in Appendix C.

100

7. EXPERIMENTAL EVALUATION
In this section, we present an evaluation of DATA S PREAD. Our 10
high-level goals are to evaluate the feasibility of DATA S PREAD to Internet ClueWeb09 Enron Academic

work with large spreadsheets with billions of cells; in addition, we Figure 13: Hybrid optimization algorithms: Running time.
attempt to understand the impact of the hybrid data models, and ROM primitive model, and evaluate the performance of fetch, in-
the impact of the positional mapping schemes. Recent work has sert, and delete operations on varying the number of rows.
identified 500ms as a yardstick of interactivity [29], and we aim to
verify if DATA S PREAD can actually meet that yardstick. 7.2 Impact of Hybrid Data Models
7.1 Experimental Setup Takeaways: Hybrid data models provide substantial benefits
over primitive data models, with up to 20% reductions in stor-
Environment. Our data models and positional mapping techniques
age, and up to 50% reduction in formula access or evalua-
were implemented on top of a PostgreSQL (version: 9.6) database.
tion time on PostgreSQL on real spreadsheet datasets, com-
The database was configured with default parameters. We run all of
pared to the best primitive data model. While DP has better
our experiments on a workstation with the following configuration:
performance on storage than Greedy and Agg, it suffers from
Processor: Intel Core i7-4790K 4.0 GHz, RAM: 16 GB, Op-
high running time; Agg is able to bridge the gap between
erating System: Windows 10. Our test scripts are single-threaded
Greedy and DP, while taking only marginally more running
applications developed in Java. While we have also developed
time than Greedy. Lastly, if we were to design a database stor-
a full-fledged web-based front-end application (see Figure 4), our
age engine from scratch, the hybrid data models would provide
test scripts are independent of this front-end, so that we can iso-
up to 50% reductions in storage compared to the best primi-
late the back-end performance implications. We ensured fairness
tive data model.
by clearing the appropriate cache(s) before every run.
The goal of this section is to evaluate our data models—both our
Datasets. We evaluate our algorithms on a variety of real and syn-
primitive and hybrid data models—on real datasets. For each sheet
thetic datasets. Our real datasets are the ones listed in Table 1:
within each dataset, we run the dynamic programming algorithm
Internet, ClueWeb09, Enron, and Academic. The first three have
(denoted DP), the greedy algorithm (denoted Greedy), and the ag-
over 10,000 sheets each while the last one has about 700 sheets.
gressive greedy algorithm (denoted Agg) that help us identify ef-
To test scalability, our real-world datasets are insufficient, because
fective hybrid data models. We compare the resulting data models
they are limited in scale by what current spreadsheet tools can sup-
against the primitive data models: ROM, COM and RCV, where
port. Therefore, we constructed additional large synthetic spread-
the entire spreadsheet is stored in a single table.
sheet datasets. The spreadsheets in the datasets each have between
10–100 columns, with the number of rows varying from 103 to 107 , Storage Evaluation on PostgreSQL. We begin with an evaluation
and a density between 0–1; this last quantity indicates the proba- of storage for different data models on PostgreSQL. The costs for
bility that a given cell within the spreadsheet area is filled-in. Our storage on PostgreSQL as measured by us is as follows: s1 is 8
largest synthetic dataset has a billion non-empty cells, enabling us KB, s2 is 1 bit, s3 is 40 bytes, s4 is 50 bytes, and s5 is 52 bytes.
to explicitly verify the premise of the title of this work. We plot the results in Figure 12(a): here, we depict the average
normalized storage across sheets: for the Internet, ClueWeb09, and
We identify several goals for our experimental evaluation: Enron datasets, we found RCV to have the worst performance, and
Goal 1: Impact of Hybrid Data Models on Real Datasets. We hence normalized it to a cost of 100, and scaled the others accord-
evaluate the hybrid data models selected by our algorithms against ingly; for the Academic datasets, we found COM to have the worst
the primitive data models, when the cost model is optimized for performance, and hence normalized it to a cost of 100, and scaled
storage. The algorithms evaluated include: ROM, COM, RCV (the the others accordingly. For the first three datasets, recall that these
primitive data models, using a single table to represent a sheet), datasets are primarily used for data sharing, and as a result are quite
DP (the dynamic programming algorithm from Section 4.4), and dense. As a result, the ROM and COM data models do well, using
Greedy and Agg (the greedy and aggressive-greedy algorithms from about 40% of the storage of RCV. At the same time, DP, Greedy
Section 4.5). We evaluate these data models on both storage, as and Agg perform roughly similarly, and better than the primitive
well as formulae access cost, based on the formulae embedded data models, providing an additional reduction of 15-20%. On the
within the spreadsheets. In addition, we evaluate the running time other hand, the last dataset, which is primarily used for computa-
of the hybrid optimization algorithms for DP, Greedy, and Agg. tion as opposed to sharing, and is very sparse, RCV does better
Goal 2: Scalability on Synthetic Datasets. Since our real datasets than ROM and COM, while DP, Greedy, and Agg once again pro-
aren’t very large, we turn to synthetic datasets for testing out the vide additional benefits.
scalability of DATA S PREAD. We focus on the primitive data mod- Storage Evaluation on an Ideal Database. Note that the reason
els, i.e., ROM and RCV, coupled with positional mapping schemes, why RCV does so poorly for the first three datasets is because Post-
and evaluate the performance of select, update, and insert/delete greSQL imposes a high overhead per tuple, of 50 bytes, consider-
on these data models on varying the number of rows, number of ably larger than the amount of storage required to store each cell.
columns, and the density of the dataset. So, to explore this further, we investigated the scenario if we had
Goal 3: Impact of Positional Mapping Schemes. We evaluate the ability to redesign our database storage engine from scratch. We
the impact of our positional mapping schemes in aiding positional consider a theoretical “ideal” cost model, where additional over-
access on the spreadsheet. We focus on Row-number-as-is, Mono- heads are minimized. For this cost model, the cost of a ROM or
tonic, and Hierarchical positional mapping schemes applied on the COM table is equal to the number of cells, plus the length and

10
100 ROM
100

Normalized Storage

Normalized Storage
COM
80
RCV
DP
60
Greedy 10
40 Agg

20
0 1
Internet ClueWeb09 Enron Academic Internet ClueWeb09 Enron Academic

Figure 12: (a) Storage Comparison for PostgreSQL (b) Storage Comparison on an Ideal Database

ROM tables. For this evaluation, we focused on Agg, since it provided

Formulae Access Time (ms)

RCV
10 Agg the best trade-off between running time and storage costs. Given
a sheet in a dataset, for each data model, we measured the time
taken to evaluate the formulae in that sheet, and averaged this time
1 across all sheets and all formulae. We plot the results for different
datasets in Figure 14 in log scale in ms. As a concrete example, on
the Internet dataset, ROM has a formula access time of 0.23, RCV
0.1 has 3.17, while Agg has 0.13. Thus, Agg provides a substantial re-
Internet ClueWeb09 Enron Academic
duction of 96% over RCV and 45% over ROM—even though Agg
Figure 14: Average access time for formulae was optimized for storage and not for formula access. This vali-
breadth of the table (to store the data, the schema, as well as posi-
dates our design of hybrid data models to store spreadsheet data.
tional identifiers), while the cost of an RCV row is simply 3 units
Note that while the performance numbers for the real spreadsheet
(to store the data, as well as the row and column number). We plot
datasets are small for all data models (due to the size limitations
the results in Figure 12(b) in log scale for each of the datasets—we
in present spreadsheet tools) when scaling up to large datasets, and
exclude COM for this chart since it has the same performance as
formulae that operate on these large datasets, these numbers will
ROM. Here, we find that ROM has the worst cost across most of
increase in a proportional manner, at which point it is even more
the datasets since it no longer leverages benefits from minimizing
important to opt for hybrid data models.
the number of tuples. (For Internet, ROM and RCV are similar, but
RCV is slightly worse.) As before, we normalize the cost of the 7.3 Scalability of Data Models
ROM model to 100 for each sheet, and scaled the others accord-
ingly, followed by taking an average across all sheets per dataset. Takeaway: Our primitive data models, augmented with posi-
As an example, we find that for the ClueWeb09 corpus, RCV, DP, tional mapping provide interactive (<500ms) response time
Greedy and Agg have normalized costs of about 36, 14, 18, and on spreadsheet datasets ranging up to 1 billion cells for se-
14 respectively—with the hybrid data models more than halving lect, insert, and update operations.
th
the cost of RCV, and getting 17 the cost of ROM. Furthermore, Since our real datasets did not have any spreadsheets that are ex-
in this ideal cost model, DP provides additional benefits relative tremely large, we now evaluate the scalability of the DATA S PREAD
to Greedy, and Agg ends up bringing us close to or equal to DP data models in supporting very large synthetic spreadsheets. We fo-
performance. cus on the two primitive data models i.e., ROM and RCV, with the
Running Time of Hybrid Optimization Algorithm. Our next spreadsheet being represented as a single table in these data mod-
question is how long our hybrid data model optimization algo- els. Since we use synthetic datasets where cells are “filled in” with
rithms for DP, Greedy, and Agg, take on real datasets. In Figure 13, a certain probability, we did not involve hybrid data models, since
we depict the average running time of these algorithms on the four they would (in this artificial context) typically end up preferring the
real datasets. The results for all datasets are similar—as an ex- ROM data model. These primitive data models are augmented with
ample, for Enron, DP took 6.3s on average, Greedy took 45ms (a hierarchical positional mapping. We consider the performance on
140⇥ reduction), while Agg took 345ms (a 20⇥ reduction). Thus varying several parameters of these datasets: the density (i.e., the
DP has the highest running time for all datasets, since it explores number of cells that are filled in), the number of rows, and the num-
the entire space of models that can be obtained by recursive par- ber of columns. The default values of these parameters are 1, 107
titioning. Between Greedy and Agg, Greedy turns out to take less and 100 respectively. We repeat each operation 500 times and re-
time. Note that these observations are consistent with our complex- port the averages.
ity analyses from Section 4.5. That said, Agg allows us to trade off In Figure 15, we depict the charts corresponding to average time
a little bit more running time for improved performance on storage to perform a random select operation on a region of 1000 rows and
(as we saw earlier). We note that for the cases where the spread- 20 columns. This is, for example, the operation that would corre-
sheets were large, we terminated DP after about 10 minutes, since spond to a user scrolling to a certain position on our spreadsheet.
we want our optimization to be relatively fast. (Note that using a As can be seen in Figure 15(a), ROM starts dominating RCV be-
similar criterion for termination, Agg and Greedy did not have to yond a certain density, at which point it makes more sense to store
be terminated for any of the real datasets.) To be fair across all the the data in as tuples that span rows instead of incurring the penalty
algorithms, we excluded all of these spreadsheets from this chart— of creating a tuple for every cell. Nevertheless, the best of these
if we had included them, the difference between DP and the other two models takes less than 150ms across sheets of varying densi-
algorithms would be even more stark. ties. In Figure 15(b)(c), since the spreadsheet is very dense (density
Formulae Access Evaluation on PostgreSQL. Next, we wanted = 1), ROM takes less time than RCV. Overall, in all cases, even on
to evaluate if our hybrid data models, optimized only on storage, spreadsheets with 100 columns and 107 rows and a density of 1,
have any impact on the access cost for formulae within the real the average time to select a region is well within 500ms.
datasets. Our hope is that the formulae embedded within spread- We report briefly on the update and insert performance—detailed
sheets end up focusing on “tightly coupled” tabular areas, which results and charts can be found in the Appendix. Overall, for both
our hybrid data models are able to capture and store in separate RCV and ROM, for inserting a row, the time is well below 500ms

11
300 300 300
RCV RCV RCV
250 ROM 250 ROM 250 ROM
Time (ms)

Time (ms)

Time (ms)
200 200 200

150 150 150

100 100 100

50 50 50

0
0.2 0.4 0.6 0.8 1.0 10 30 50 70 90 100 104 105 106 107
Sheet Density #Columns #Rows

Figure 15: Select performance vs — (a) Sheet Density (b) Column Count (c) Row Count

for all of the charts; for updates of a large region, while ROM is 2b. One way export of operations from spreadsheets to databases.
still highly interactive, RCV ends up taking longer since 1000s of There has been some work on exporting spreadsheet operations
queries need to be issued to the database. In practice, users won’t into database systems, such as the work from Oracle [47, 48] as
update such a large region at a time, and we can batch these queries. well as startups 1010Data [40] and AirTable [41], to improve the
We discuss this further in the appendix. performance of spreadsheets. However, the database itself has no
awareness of the existence of the spreadsheet, making the integra-
7.4 Evaluation of Positional Mapping tion superficial. In particular, positional and ordering aspects are
not captured, and user operations on the front-end, e.g., inserts,
Takeaway: Hierarchical positional mapping retains the rapid
deletes, and adding formulae, are not supported.
fetch benefits of row-number-as-is, while also providing the
rapid insert and update benefits of monotonic positional map- 2c. Using a spreadsheet to mimic a database. There has been
ping. Overall, hierarchical positional mapping is able to per- some work on using a spreadsheet as an interface for posing tradi-
form positional operations within a few milliseconds, while tional database queries. For example, Tyszkiewicz [38] describes
the other positional mapping schemes scale poorly, taking sec- how to simulate database operations in a spreadsheet. However,
onds on large datasets for certain operations. this approach loses the scalability benefits of relational databases.
Bakke et al. [9, 8, 7] support joins by depicting relations using a
We report detailed results and charts for this evaluation in Ap- nested relational model. Liu et al. [28] use spreadsheet operations
pendix D. to specify single-block SQL queries; this effort is essentially a re-
placement for visual query builders. Recently, Google Sheets [39]
8. RELATED WORK has provided the ability to use single-table SQL on its frontend,
Our work draws on related work from multiple areas; we re- without availing of the scalability benefits of database integration.
view papers in each of the areas, and describe how they relate to Excel, with its Power Pivot and Power Query [46] functionality
DATA S PREAD. We discuss 1) efforts that enhance the usability of has made moves towards supporting SQL in the front-end, with the
databases, 2) those that attempt to merge the functionality of the same limitations. Like this line of work, we support SQL queries
spreadsheet and database paradigms, but without a holistic inte- on the spreadsheet frontend, but our focus is on representing and
gration, and 3) using array-based database management systems. operating on spreadsheet data within a database.
We described our vision for DATA S PREAD in an earlier demo pa- 3. Array database systems. While there has been work on array-
per [10]. based databases, most of these systems do not support edits: for
1. Making databases more usable. There has been a lot of re- instance, SciDB [13] supports an append-only, no-overwrite data
cent work on making database interfaces more user friendly [4, model.
23]. This includes recent work on gestural query and scrolling in-
terfaces [22, 31, 33, 32, 36], visual query builders [6, 16], query 9. CONCLUSIONS
sharing and recommendation tools [24, 18, 17, 25], schema-free
We presented DATA S PREAD, a data exploration tool that holisti-
databases [34], schema summarization [49], and visual analytics
cally unifies spreadsheets and databases with a goal towards work-
tools [14, 30, 37, 21]. However, none of these tools can replace
ing with large datasets. We proposed three primitive data models
spreadsheet software which has the ability to analyze, view, and
for representing spreadsheet data within a database, along with al-
modify data via a direct manipulation interface [35] and has a large
gorithms for identifying the optimal hybrid data model arising from
user base.
recursive decomposition to give one or more primitive data models.
2a. One way import of data from databases to spreadsheets. Our hybrid data models provide substantial reductions in terms of
There are various mechanisms for importing data from databases to storage (up to 20–50%) and formula evaluation (up to 50%) over
spreadsheets, and then analyzing this data within the spreadsheet. the primitive data models. Our primitive and hybrid data models,
This approach is followed by Excel’s Power BI tools, including coupled with positional mapping schemes, make working with very
Power Pivot [45], with Power Query [46] for exporting data from large spreadsheets—over a billion cells—interactive.
databases and the web or deriving additional columns and Power
View [46] to create presentations; and Zoho [42] and ExcelDB [44]
(on Excel), and Blockspring [43] (on Google Sheets [39]) enabling 10. REFERENCES
the import from a variety of sources including the databases and [1] Google sheets. https://www.google.com/sheets/about/.
the web. Typically, the import is one-shot, with the data residing in [2] Microsoft excel. http://products.office.com/en-us/excel.
the spreadsheet from that point on, negating the scalability benefits [3] ZK Spreadsheet.
derived from the database. Indeed, Excel 2016 specifies a limit of https://www.zkoss.org/product/zkspreadsheet.
1M records that can be analyzed once imported, illustrating that the [4] S. Abiteboul, R. Agrawal, P. Bernstein, M. Carey, S. Ceri,
scalability benefits are lost; Zoho specifies a limit of 0.5M records. B. Croft, D. DeWitt, M. Franklin, H. G. Molina, D. Gawlick,
Furthermore, the connection to the base data is lost: any modifica- J. Gray, L. Haas, A. Halevy, J. Hellerstein, Y. Ioannidis,
tions made at either end are not propagated. M. Kersten, M. Pazzani, M. Lesk, D. Maier, J. Naughton,

12
H. Schek, T. Sellis, A. Silberschatz, M. Stonebraker, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon.
R. Snodgrass, J. Ullman, G. Weikum, J. Widom, and Google fusion tables: web-centered data management and
S. Zdonik. The lowell database research self-assessment. collaboration. In Proceedings of the 2010 ACM SIGMOD
Commun. ACM, 48(5):111–118, May 2005. International Conference on Management of data, pages
[5] A. Abouzied, J. Hellerstein, and A. Silberschatz. Dataplay: 1061–1066. ACM, 2010.
Interactive tweaking and example-driven correction of [22] S. Idreos and E. Liarou. dbTouch: Analytics at your
graphical database queries. In Proceedings of the 25th Fingertips. In CIDR, 2013.
Annual ACM Symposium on User Interface Software and [23] H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian,
Technology, UIST ’12, pages 207–218, New York, NY, USA, Y. Li, A. Nandi, and C. Yu. Making database systems usable.
2012. ACM. In Proceedings of the 2007 ACM SIGMOD international
[6] A. Abouzied, J. Hellerstein, and A. Silberschatz. DataPlay: conference on Management of data, pages 13–24. ACM,
interactive tweaking and example-driven correction of 2007.
graphical database queries. In Proceedings of the 25th [24] N. Khoussainova, M. Balazinska, W. Gatterbauer, Y. Kwon,
annual ACM symposium on User interface software and and D. Suciu. A Case for A Collaborative Query
technology, pages 207–218. ACM, 2012. Management System. In CIDR. www.cidrdb.org, 2009.
[7] E. Bakke and E. Benson. The Schema-Independent Database [25] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu.
UI: A Proposed Holy Grail and Some Suggestions. In CIDR, SnipSuggest: Context-aware autocompletion for SQL.
pages 219–222. www.cidrdb.org, 2011. Proceedings of the VLDB Endowment, 4(1):22–33, 2010.
[8] E. Bakke, D. Karger, and R. Miller. A spreadsheet-based user [26] B. Klimt and Y. Yang. Introducing the enron corpus. In
interface for managing plural relationships in structured data. CEAS, 2004.
In Proceedings of the SIGCHI conference on human factors [27] A. Lingas, R. Y. Pinter, R. L. Rivest, and A. Shamir.
in computing systems, pages 2541–2550. ACM, 2011. Minimum edge length partitioning of rectilinear polygons. In
[9] E. Bakke and D. R. Karger. Expressive query construction Proceedings - Annual Allerton Conference on
through direct manipulation of nested relational results. In Communication, Control, and Computing, pages 53–63,
Proceedings of the 2016 International Conference on 1982.
Management of Data, pages 1377–1392. ACM, 2016. [28] B. Liu and H. V. Jagadish. A Spreadsheet Algebra for a
[10] M. Bendre, B. Sun, D. Zhang, X. Zhou, K. C.-C. Chang, and Direct Data Manipulation Query Interface. pages 417–428.
A. Parameswaran. Dataspread: Unifying databases and IEEE, Mar. 2009.
spreadsheets. Proc. VLDB Endow., 8(12):2000–2003, Aug. [29] Z. Liu and J. Heer. The effects of interactive latency on
2015. exploratory visual analysis. IEEE Trans. Vis. Comput.
[11] M. Bendre, B. Sun, X. Zhou, D. Zhang, K. Chang, and Graph., 20(12):2122–2131, 2014.
A. Parameswaran. Dataspread: Unifying databases and [30] J. Mackinlay, P. Hanrahan, and C. Stolte. Show me:
spreadsheets. In VLDB, volume 8, 2015. Automatic presentation for visual analysis. Visualization and
[12] D. Bricklin and B. Frankston. Visicalc 1979. Creative Computer Graphics, IEEE Transactions on,
Computing, 10(11):122, 1984. 13(6):1137–1144, 2007.
[13] P. G. Brown. Overview of scidb: Large scale array storage, [31] A. N. L. J. M. Mandel, A. Nandi, and L. Jiang. Gestural
processing and analysis. In Proceedings of the 2010 ACM Query Specification. Proceedings of the VLDB Endowment,
SIGMOD International Conference on Management of Data, 7(4), 2013.
SIGMOD ’10, pages 963–968, New York, NY, USA, 2010. [32] A. Nandi. Querying Without Keyboards. In CIDR, 2013.
ACM. [33] A. Nandi and H. V. Jagadish. Guided interaction: Rethinking
[14] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. the query-result paradigm. Proceedings of the VLDB
Silva, and H. T. Vo. VisTrails: visualization meets data Endowment, 4(12):1466–1469, 2011.
management. In Proceedings of the 2006 ACM SIGMOD [34] L. Qian, K. LeFevre, and H. V. Jagadish. CRIUS:
international conference on Management of data, pages user-friendly database design. Proceedings of the VLDB
745–747. ACM, 2006. Endowment, 4(2):81–92, 2010.
[15] J. Callan, M. Hoy, C. Yoo, and L. Zhao. Clueweb09 data set, [35] B. Shneiderman. Direct Manipulation: A Step Beyond
2009. Programming Languages. IEEE Computer, 16(8):57–69,
[16] T. Catarci, M. F. Costabile, S. Levialdi, and C. Batini. Visual 1983.
query systems for databases: A survey. Journal of Visual [36] M. Singh, A. Nandi, and H. V. Jagadish. Skimmer: rapid
Languages & Computing, 8(2):215–260, 1997. scrolling of relational query results. In Proceedings of the
[17] U. Cetintemel, M. Cherniack, J. DeBrabant, Y. Diao, 2012 ACM SIGMOD International Conference on
K. Dimitriadou, A. Kalinin, O. Papaemmanouil, and S. B. Management of Data, pages 181–192. ACM, 2012.
Zdonik. Query Steering for Interactive Data Exploration. In [37] C. Stolte, D. Tang, and P. Hanrahan. Polaris: A system for
CIDR, 2013. query, analysis, and visualization of multidimensional
[18] G. Chatzopoulou, M. Eirinaki, and N. Polyzotis. Query relational databases. Visualization and Computer Graphics,
recommendations for interactive database exploration. In IEEE Transactions on, 8(1):52–65, 2002.
Scientific and Statistical Database Management, pages 3–18. [38] J. Tyszkiewicz. Spreadsheet as a relational database engine.
Springer, 2009. In Proceedings of the 2010 ACM SIGMOD International
[19] C. E. L. Cormen, Thomas H. and R. L. Rivest. Introduction Conference on Management of data, pages 195–206. ACM,
to Algorithms. Cambridge. MA: MIT, 1990. 2010.
[20] D. Flax. Gesturedb: An accessible & touch-guided ipad app [39] http:/google.com/sheets. Google Sheets (retrieved March 10,
for mysql database browsing. 2016. 2015).
[21] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, [40] https://www.1010data.com/. 1010 Data (retrieved March 10,

13
2015). in the spreadsheet. We claim that a minimum edge length parti-
[41] https://www.airtable.com/. Airtable (retrieved March 10, tion of the given rectilinear polygon P of length at most k exists
2015). iff the following setting of the optimal hybrid data model problem:
[42] https://www.zoho.com/. Zoho Reports (retrieved March 10, s1 = 0, s2 = 2|C|+1, s3 = s4 = 1, where the storage cost should
2015). not exceed k0 = k + Perimeter(P
2
)
+ s2 |C| for some decomposition
[43] http://www.blockspring.com/. Blockspring (retrieved March of the spreadsheet.
10, 2015). ) Let us assume that the spreadsheet we generate using P has
[44] http://www.excel-db.net/. Excel-DB (retrieved March 10, a decomposition of rectangles whose storage cost is less than k0 =
2015). k + Perimeter(P
2
)
+ s2 |C|. We have to show that there exists a
[45] http://www.microsoft.com/en-us/download/details.aspx?id= partition with minimum edge length of at most k. We first make
43348. Microsoft sql server power pivot (retrieved march 10, the following key observations:
2015). 1. There exists a valid decomposition that doesn’t store any blank
[46] C. Webb. Power Query for Power BI and Excel. Apress, cell. Let’s assume the contrary and consider a decomposition
2014. that stores a blank cell. Since we are now storing |C| + 1 cells
[47] A. Witkowski, S. Bellamkonda, T. Bozkaya, N. Folkert, at minimum,
A. Gupta, J. Haydu, L. Sheng, and S. Subramanian.
Advanced SQL modeling in RDBMS. ACM Transactions on k0 > s2 (|C| + 1) = |C|s2 + s2 = |C|s2 + 2|C| + 1
Database Systems (TODS), 30(1):83–121, 2005. k0 > |C|(s2 + 1 + 1)
[48] A. Witkowski, S. Bellamkonda, T. Bozkaya, A. Naimat, | {z }
storing each cell in a separate table
L. Sheng, S. Subramanian, and A. Waingold. Query by excel.
In Proceedings of the 31st international conference on Very Therefore, if we have a decomposition that stores a blank cell,
large data bases, pages 1204–1215. VLDB Endowment, we also have a decomposition that does not store any blank
2005. cell and has lower cost.
[49] C. Yu and H. V. Jagadish. Schema summarization. In
2. There exists a decomposition of the spreadsheet where all the
Proceedings of the 32nd international conference on Very
tables are disjoint. The argument is similar to the previous
large data bases, pages 319–330. VLDB Endowment, 2006.
case since storing the same cell twice in different tables is
equivalent to storing an extra blank cell.
APPENDIX From our above two observations, we conclude that there exists
A. OPTIMAL HYBRID DATA MODELS a decomposition where all tables are disjoint, and no table stores a
blank cell. Therefore, this decomposition corresponds to partition-
In this section, we demonstrate that the following problem is ing the given spreadsheet into rectangles. We represent this parti-
NP-H ARD. tion of the spreadsheet by T = {T1 , T2 , . . . , Tp }. We now show
P ROBLEM 2 (H YBRID -ROM). Given a spreadsheet with a that this partition of the spreadsheet corresponds to a partitioning
collection of cells C, identify the hybrid data model T with only of the rectilinear polygon P with edge-length less than k.
ROM tables that minimizes cost(T ).
p
X
As before, the cost model is defined as: cost(T ) = s1 + s2 · (ri ⇥ ci ) + s3 · ci + s4 · ri
p
X i=1
cost(T ) = s1 + s2 · (ri ⇥ ci ) + s3 · ci + s4 · ri . (7) X p p
X p
X p
X
i=1 = s1 + s2 ·(ri ⇥ ci ) + s3 ci + s4 ri
i=1 i=1 i=1 i=1
The decision version of the above problem has the following struc-
ture: a value k is provided, and the goal is to test whether there is a substituting s1 = 0, s2 = 2|C| + 1, s3 = s4 = 1, we get:
hybrid data model with cost(T )  k.
We reduce the minimum edge length partitioning problem [27] p p p
!
X X X
of rectilinear polygons to Problem 2, thereby showing that it is NP- = 0 + s2 |C| + 1 · ci + ri
Hard. First, a rectilinear polygon is a polygon in which all edges i=1 i=1 i=1
are either aligned with the x-axis or the y- axis. We consider the
problem of partitioning a rectilinear polygon into disjoint rectan- since cost(T )  k0 = k + Perimeter(P )
+ s2 |C|,
2
gles using the minimum amount of “ink”. In other words, the min-
imality criterion is the total length of the edges (lines) used to form p p
!
X X
the internal partition. Notice that this doesn’t correspond to the cost(T ) = s2 |C| + 1 · ci + ri
minimality criterion of reducing the number of components. We i=1 i=1
illustrate this in Figure 19, which is borrowed from the original p
X Perimeter(P )
paper [27]. The following decision problem was shown to be NP- =) (ri + ci )  k +
Hard in [27]: Given any rectilinear polygon P and a number k, is i=1
2
there a rectangular partitioning whose edge length does not exceed Xp
Perimeter(Ti ) Perimeter(P )
k? We now provide the reduction. =) k+
i=1
2 2
P ROOF FOR P ROBLEM 2. Consider an instance of the polygon p
partitioning problem with minimum edge length required to be at X
=) Perimeter(Ti )  2 ⇥ k + Perimeter(P )
most k. We are given a rectilinear polygon P . We now repre-
i=1
sent the polygon P in a spreadsheet by filling the cells interior
of the polygon, and not filling any other cell in the spreadsheet. Since, the sum of perimeters of all the tables Ti counts the bound-
Let C = {C1 , C2 , . . . , Cm } represent the set of all filled cells ary of P exactly once, and the edge length partition of P exactly

14
RCV RCV RCV
ROM ROM ROM
1000 1000 1000

Time (ms)
Time (ms)

Time (ms)
100 100 100

10 10
0.2 0.4 0.6 0.8 1.0 30 50 70 90 100 104 105 106 107
Sheet Density #Columns #Rows

Figure 16: Update range performance vs (a) Sheet Density (b) Column Count (c) Row Count
RCV RCV RCV
100 ROM 100 ROM 100 ROM
Time(ms)

Time(ms)

Time(ms)
10 10 10

1 1 1
0.2 0.4 0.6 0.8 1.0 10 30 50 70 90 100 104 105 106 107
Sheet Density #Column #rows

Figure 17: Insert row performance vs (a) Sheet Density (b) Column Count (c) Row Count

twice, the partition of the spreadsheet T = {T1 , T2 , . . . , Tp } corre-

sponds to an edge-length partitioning of the given rectilinear poly-
gon P with edge-lengh less than k.
( Let us assume that the given rectilinear polygon P has a min-
imum edge length partition of length at most k. We have to show
that there exists a decomposition of the spreadsheet whose storage
cost is at most k0 = k + Perimeter(P
2
)
+ s2 |C|. Let us represent the
set of rectangles that corresponds to an edge length partition of P
of at most k as T = {T1 , T2 , . . . , Tp }. We shall use the partition Figure 19: Minimum number of rectangles (– – –) does not coin-
T of P as the decomposition of the spreadsheet itself: cide with minimum edge length (· · · )
p
X B. HYBRID DATA MODEL: EXTENSIONS
cost(T ) = s1 + s2 · (ri ⇥ ci ) + s3 · ci + s4 · ri In this section, we discuss a number of extensions to the cost
i=1
model of the hybrid data model. We will describe these exten-
p p p p
X X X X sions to the cost model, and then describe the changes to the basic
= s1 + s2 ·(ri ⇥ ci ) + s3 ci + s4 ri dynamic programming algorithm; modifications to the greedy and
i=1 i=1 i=1 i=1
aggressive greedy decomposition algorithms are straightforward.
substituting s1 = 0, s2 = 2|C| + 1, s3 = s4 = 1, we get: Access Cost. So far, within our cost model, we have only been fo-
cusing on storage. As it turns out, our cost model can be extended
!
X p
Xp
Xp in a straightforward manner to handle access cost — both scrolling-
= 0 + s2 |C| + 1 · ci + ri based operations, and formulae, and our dynamic programming al-
i=1 i=1 i=1 gorithms can similarly be extended to handle access cost without
Xp
X p any substantial changes. We focus on formulae since they are often
Perimeter(Ti )
= s2 |C| + (ri + ci ) = s2 |C| + the more substantial cost of the two; scrolling-based operations can
i=1 i=1
2 be similarly handled. For formulae, there are multiple aspects that
Pp contribute to the time for access: the number of tables accessed, and
since i=1 Perimeter(Ti ) = 2 ⇥ k + Perimeter(P ), we have: within each table, since data is retrieved at a tuple level, the num-
ber of tuples that need to be accessed, and the size of these tuples.
Perimeter(P ) 0 Once again, each of these aspects can be captured within the cost
cost(T ) = s2 |C| + k + =k model via constants similar to s1 , . . . , s5 , and can be seamlessly
2
=) cost(T ) = k0 incorporated into the dynamic programming algorithm. Thus, we
have:
Therefore, the decomposition of the spreadsheet using T corre- T HEOREM 5 (O PTIMALITY WITH ACCESS C OST ). The opti-
sponds to a decomposition whose storage cost equals k0 . Note that mal ROM, COM, and RCV-based hybrid data model based on re-
our reduction can be done in polynomial time. Therefore we can cursive decomposition, across both storage and access cost, can be
solve the minimum length partitioning problem in polynomial time determined via dynamic programming.
if we have a polynomial time solution to the optimal storage prob-
lem. However, since it is shown in [27] that the minimum length Size Limitations of Present Databases. Current databases im-
partitioning problem is NP-Hard, the optimal hybrid data model pose limitations on the number of columns within a relation1 ; since
problem is NP-Hard. This completes our proof. spreadsheets often have an arbitrarily large number of rows and
1
Oracle column number limitations: https://docs.oracle.com/cd/B19306_01/server.

15
Row number as-is Row num as-is Row num as-is
Monotonic Monotonic Monotonic
100 Hierarchical
100 Hierarchical
100 Hierarchical
Time (ms)

Time (ms)

Time (ms)
10 10 10

1 1 1

0.1 0.1 0.1

103 104 105 106 107 103 104 105 106 107 103 104 105 106 107
DataSize DataSize DataSize

Figure 18: Positional mapping performance for (a) Select (b) Insert (c) Delete

columns (sometimes 10s of thousands each), we need to be care- D. ADDITIONAL EXPERIMENTS

ful when trying to capture a spreadsheet area within a collection of
tables that are represented in a database. D.1 Scalability of Inserts and Deletes
This is relatively straightforward to capture in our context: in the We now supplement our evaluation of the scalability of selects
case where we don’t split (Equation 3), if the number of columns in the main body of the paper with an evaluation of the scalabil-
is too large to be acceptable, we simply return 1 as the cost. ity of inserts and updates for the primitive data models on a syn-
T HEOREM 6 (O PTIMALITY WITH S IZE C ONSTRAINTS ). The thetic dataset. Figures 16 and 17 depict the corresponding charts
storage optimal ROM, COM, and RCV-based hybrid data model, for updating a region of 100 rows and 20 columns, and inserting
with the constraint that no tables violate size constraints, based on one row of 100 columns for the primitive data models. In Fig-
recursive decomposition, can be determined via dynamic program- ures 16, we find that the update time taken for RCV is a lot higher
ming. than the time for inserts or selects. This is because in this bench-
Incorporating the Costs of Indexes. Within our cost model, it is mark, DATA S PREAD assumes that the entire region update happens
straightforward to incorporate the costs associated with storage of at once, and fires 100 ⇥ 20 = 2000 update queries one at a time to
indexes, since the size of the indexes are typically proportional to the underlying database, to update each individual cell. In practice,
the number of tuples for a given table, and the cost of instantiating users may only update a small number of cells at a time; and fur-
an index is another fixed constant cost. Since our cost model is ther, we may be able to batch these queries or issue them in parallel
general, by suitably reweighting one or more of s1 , s2 , s3 , s4 , we to further save time. In Figures 17, we find that like in Figures 16,
can capture this aspect within our cost model, and apply the same the time taken for updates on ROM is faster than RCV since it
dynamic programming algorithm. only needs to issue one query, while RCV needs to issue multiple
T HEOREM 7 (O PTIMALITY WITH I NDEXES ). The storage op- queries. However, in this case, since the number of queries issued
timal ROM-based hybrid data model, with the costs of indexes in- is small, the response time is always within 100ms.
cluded, based on recursive decomposition, can be determined via
dynamic programming.
D.2 Impact of Positional Mapping
We now compare the performance of our different positional
mapping methods as described in Section 5. Specifically, we con-
C. RELATIONAL OPERATIONS SUPPORT trast between (i) storing row-number-as is (denoted row-number-
In addition to standard spreadsheet operations, DATA S PREAD as-is), (ii) monotonic positional mapping (denoted monotonic), and
benefits from being built on a standard relational database, and as (iii) hierarchical positional mapping (denoted hierarchical). As de-
a result, seamlessly supports standard relational operations as well. scribed previously, we operate on a dense dataset ranging from 103
To support relational operations from the spreadsheet interface, and to 107 rows, with 100 columns, all of whose cells are filled. The
in particular to enable table declaration and query execution, we in- evaluation was performed on a single ROM table that captures all
troduce two functions in our system, namely DBTable and DBSQL. of the data on the sheet; evaluations for other primitive data models
DBTable enables a user to declare a portion of the spreadsheet are similar. Figure 18 displays the average time taken to perform a
front-end as a database table. Here, the displayed table cells re- fetch, insert, and delete of a single (random) row, averaged across
flect the content of the database table. This is a cue for the hybrid 1000 iterations.
optimizer to “force” this region to be stored as a separate ROM ta- We see that the storing the row number as-is performs well for
ble. Note that there is a two-way synchronization for such a table, the fetch operation. However, the time for insert and delete opera-
i.e., any updates to the table from the front-end is reflected at the tions increases rapidly with the data size, due to cascading updates
back-end and vice versa. of subsequent rows; thus, beyond a data size of 105 , row number-
DBSQL enables a user to execute arbitrary SQL queries combin- as-is is no longer interactive (> 500ms) for insert and delete. On
ing data present on the spreadsheet, and other tables present in the the other hand, the response time of the monotonic positional map-
relational database. To support positional addressing or referenc- ping for fetch operation increases rapidly with data size. This is
ing of spreadsheet data using DBSQL, we introduce two functions: again expected, as we need to search linearly through the positional
RangeValue and RangeTable. RangeValue allows a user to refer a scalar mapping keys to retrieve the required records—making it infeasible
value contained in a cell, e.g., SELECT FROM Actors WHERE ActorId = to use on large datasets. Lastly, we find that hierarchical positional
RangeValue(A1); here, RangeValue(A1) refers to the value of cell A1. mapping performs well for all operations and performance does not
RangeTable allows a user to refer to a range, and perform operations get degrade even with data sizes of 109 tuples. In comparison with
on this range like a database table. This enables any range on a the other schemes, hierarchical positional mapping performs all the
spreadsheet to be treated as a table, e.g., SELECT FROM Actors NATU- three aforementioned operations in few milliseconds, which makes
RAL JOIN RangeTable(A1:D100). it the practical choice for positional mapping for DATA S PREAD.
102/b14237/limits003.htm#i288032; MySQL column limitations: https://dev.mysql.
com/doc/mysql-reslimits-excerpt/5.5/en/column-count-limit.html; PostgreSQL col-
umn limitations: https://www.postgresql.org/about/

THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
From Everand
THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE: "THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE"
AJIT DASH
2/5 (2)
Semarchy XDM Ebook
No ratings yet
Semarchy XDM Ebook
46 pages
Exam
No ratings yet
Exam
12 pages
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
From Everand
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
Wei Liu
No ratings yet
Learn SAP BI in 24 Hours
From Everand
Learn SAP BI in 24 Hours
Alex Nordeen
3/5 (1)
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
Handstar Inc.: Project Management Prof. Rupesh Kumar Pati
100% (1)
Handstar Inc.: Project Management Prof. Rupesh Kumar Pati
8 pages
Mitigating Security Risks in Ussd-Based Mobile Payment Applications
100% (1)
Mitigating Security Risks in Ussd-Based Mobile Payment Applications
7 pages
Developing Analytic Talent: Becoming a Data Scientist
From Everand
Developing Analytic Talent: Becoming a Data Scientist
Vincent Granville
3/5 (7)
Power BI DAX: A Guide to Using Basic Functions in Data Analysis
From Everand
Power BI DAX: A Guide to Using Basic Functions in Data Analysis
Kiet Huynh
No ratings yet
Hadoop Ecosystem for Big Data
From Everand
Hadoop Ecosystem for Big Data
Dr. Zemelak Goraga
No ratings yet
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
From Everand
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
JAMIE POWERS
No ratings yet
Efficient Parallel Computing with Dask: Definitive Reference for Developers and Engineers
From Everand
Efficient Parallel Computing with Dask: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Redshift Essentials: Definitive Reference for Developers and Engineers
From Everand
Redshift Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
DATABASE From the conceptual model to the final application in Access, Visual Basic, Pascal, Html and Php: Inside, examples of applications created with Access, Visual Studio, Lazarus and Wamp
From Everand
DATABASE From the conceptual model to the final application in Access, Visual Basic, Pascal, Html and Php: Inside, examples of applications created with Access, Visual Studio, Lazarus and Wamp
Olga Maria Stefania Cucaro
No ratings yet
Principles of MapReduce Systems: Definitive Reference for Developers and Engineers
From Everand
Principles of MapReduce Systems: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
From Spreadsheets To Relational Databases and Back: Abstract
No ratings yet
From Spreadsheets To Relational Databases and Back: Abstract
21 pages
Learn Hadoop in 24 Hours
From Everand
Learn Hadoop in 24 Hours
Alex Nordeen
No ratings yet
Data Structures Explained: A Practical Guide with Examples
From Everand
Data Structures Explained: A Practical Guide with Examples
William E. Clark
No ratings yet
SQL Demystified: A Beginner's Roadmap to Data Retrieval and Management
From Everand
SQL Demystified: A Beginner's Roadmap to Data Retrieval and Management
Kaushal Mehta
No ratings yet
C++ Data Structures Explained: A Practical Guide with Examples
From Everand
C++ Data Structures Explained: A Practical Guide with Examples
William E. Clark
No ratings yet
Database And Computer Management: SERIES 1, #3
From Everand
Database And Computer Management: SERIES 1, #3
Elias Mutegi
No ratings yet
Citus for Scalable PostgreSQL Systems: The Complete Guide for Developers and Engineers
From Everand
Citus for Scalable PostgreSQL Systems: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
The Power of Big Data: Transforming Industries and Shaping the Future
From Everand
The Power of Big Data: Transforming Industries and Shaping the Future
Tom Henricksen
No ratings yet
C Data Structures and Algorithms: Implementing Efficient ADTs
From Everand
C Data Structures and Algorithms: Implementing Efficient ADTs
Larry Jones
No ratings yet
DBMS MASTER: Become Pro in Database Management System
From Everand
DBMS MASTER: Become Pro in Database Management System
Ummed Singh
No ratings yet
Advanced Apache Tez Techniques: Definitive Reference for Developers and Engineers
From Everand
Advanced Apache Tez Techniques: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Practical Data Strategies and Recipes
From Everand
Practical Data Strategies and Recipes
Tom Henricksen
No ratings yet
Databases: System Concepts, Designs, Management, and Implementation
From Everand
Databases: System Concepts, Designs, Management, and Implementation
Jonathan Rigdon
No ratings yet
Exploring the Fundamentals of Database Management Systems: Business strategy books, #2
From Everand
Exploring the Fundamentals of Database Management Systems: Business strategy books, #2
SANJIVAN SAINI
No ratings yet
Mastering DuckDB: High-Performance Analytics Made Easy
From Everand
Mastering DuckDB: High-Performance Analytics Made Easy
Robert Johnson
No ratings yet
Automatic Web Spreadsheet Data Extraction
No ratings yet
Automatic Web Spreadsheet Data Extraction
8 pages
Model-Based Programming Environments For Spreadsheets (Jacome Cunha) (2014)
No ratings yet
Model-Based Programming Environments For Spreadsheets (Jacome Cunha) (2014)
22 pages
Basic Concepts in Data Structures
From Everand
Basic Concepts in Data Structures
K.Meenendranath Reddy
No ratings yet
Graph Layout Support for Model-Driven Engineering
From Everand
Graph Layout Support for Model-Driven Engineering
Miro Spönemann
No ratings yet
Sqoop Essentials: Definitive Reference for Developers and Engineers
From Everand
Sqoop Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
StarPU: Parallel Computing and Task Scheduling Techniques
From Everand
StarPU: Parallel Computing and Task Scheduling Techniques
Richard Johnson
No ratings yet
Topic 7
No ratings yet
Topic 7
16 pages
Mastering Data Structures and Algorithms in Python & Java
From Everand
Mastering Data Structures and Algorithms in Python & Java
Sachin Naha
No ratings yet
Teradata Architecture and SQL Essentials: Definitive Reference for Developers and Engineers
From Everand
Teradata Architecture and SQL Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Structures I Essentials
From Everand
Data Structures I Essentials
Dennis Smolarski
No ratings yet
Spreadsheet Monograph
No ratings yet
Spreadsheet Monograph
11 pages
Efficient Data Science Workflows with Vaex: Definitive Reference for Developers and Engineers
From Everand
Efficient Data Science Workflows with Vaex: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Iceberg Table Formats and Analytics: Definitive Reference for Developers and Engineers
From Everand
Iceberg Table Formats and Analytics: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
1985-5 Spreadsheet Analysis of Logs CWLS Crain
No ratings yet
1985-5 Spreadsheet Analysis of Logs CWLS Crain
14 pages
Application Design: Key Principles For Data-Intensive App Systems
From Everand
Application Design: Key Principles For Data-Intensive App Systems
Rob Botwright
No ratings yet
Tableau 8.2 Training Manual: From Clutter to Clarity
From Everand
Tableau 8.2 Training Manual: From Clutter to Clarity
Larry Keller
No ratings yet
Practical TimescaleDB Solutions: Definitive Reference for Developers and Engineers
From Everand
Practical TimescaleDB Solutions: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
From Everand
Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
Byron Ellis
No ratings yet
Mastering Database Design
From Everand
Mastering Database Design
Ted Noreux
No ratings yet
Mastering Algorithms and Data Structures
From Everand
Mastering Algorithms and Data Structures
Manish Soni
No ratings yet
Learn C++
From Everand
Learn C++
Aishik Dutta
No ratings yet
The Beginner’s Guide to Databases & SQL
From Everand
The Beginner’s Guide to Databases & SQL
Steven Mcananey
No ratings yet
Data Lakes & Pipelines: A Modern Azure Guide
From Everand
Data Lakes & Pipelines: A Modern Azure Guide
Kameron Hussain
No ratings yet
Concise Oracle Database For People Who Has No Time
From Everand
Concise Oracle Database For People Who Has No Time
Billy Aung Myint
No ratings yet
Learn SQL in 24 Hours
From Everand
Learn SQL in 24 Hours
Alex Nordeen
5/5 (4)
Using Excel and Excel VBA For Preliminary Analysis in Big Data Research
No ratings yet
Using Excel and Excel VBA For Preliminary Analysis in Big Data Research
28 pages
Preparing Data for Analysis with JMP
From Everand
Preparing Data for Analysis with JMP
Robert Carver
No ratings yet
Comprehensive Guide to Glue for Scientific Data Exploration: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to Glue for Scientific Data Exploration: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Spreadsheets Are Code
No ratings yet
Spreadsheets Are Code
10 pages
Database Management System
From Everand
Database Management System
Manish Soni
No ratings yet
Dgraph Essentials: The Complete Guide for Developers and Engineers
From Everand
Dgraph Essentials: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Structured Query Language Simplified: Efficient and Effective Database Management
From Everand
Structured Query Language Simplified: Efficient and Effective Database Management
Angela White
No ratings yet
PrestoDB in Practice: Definitive Reference for Developers and Engineers
From Everand
PrestoDB in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Elements of The Theory of Computation 2nd Solution Manual
No ratings yet
Elements of The Theory of Computation 2nd Solution Manual
2 pages
New Doc 2020-03-18 18.59.23 PDF
No ratings yet
New Doc 2020-03-18 18.59.23 PDF
2 pages
COVID-19 Testing Laboratories
No ratings yet
COVID-19 Testing Laboratories
1 page
Opti PDF
No ratings yet
Opti PDF
31 pages
Take Home Assignment
No ratings yet
Take Home Assignment
2 pages
Binary Number System
No ratings yet
Binary Number System
87 pages
Ijso Result
No ratings yet
Ijso Result
5 pages
General: Units and Dimensions, Dimensional Analysis Least Count, Significant Figures
No ratings yet
General: Units and Dimensions, Dimensional Analysis Least Count, Significant Figures
9 pages
Com 9 and 10 q2 w2
No ratings yet
Com 9 and 10 q2 w2
5 pages
W2L4 - DBMS Concepts
No ratings yet
W2L4 - DBMS Concepts
29 pages
ITR - Report13 Vishal
No ratings yet
ITR - Report13 Vishal
23 pages
User Experience and Usability in Agriculture - Sel
No ratings yet
User Experience and Usability in Agriculture - Sel
10 pages
LIB 500 Base & LIB 510 Operator's Manual: 1MRS751424-MUM
No ratings yet
LIB 500 Base & LIB 510 Operator's Manual: 1MRS751424-MUM
32 pages
Francis Samonte Profile
No ratings yet
Francis Samonte Profile
9 pages
Professional Snapshot: Project-1#
No ratings yet
Professional Snapshot: Project-1#
4 pages
Operating System Overview Objectives and Functions
No ratings yet
Operating System Overview Objectives and Functions
3 pages
GaBBA - CPTR 105 - Introduction To Computers - Rev3
No ratings yet
GaBBA - CPTR 105 - Introduction To Computers - Rev3
5 pages
csc101 Revision Questions
No ratings yet
csc101 Revision Questions
34 pages
Online Food Order
No ratings yet
Online Food Order
95 pages
Visual Basic 2013-0049
No ratings yet
Visual Basic 2013-0049
2 pages
SQL Server 2019 Editions Datasheet
No ratings yet
SQL Server 2019 Editions Datasheet
3 pages
Exam 150 Questions
No ratings yet
Exam 150 Questions
482 pages
Unit Ii LM CC
No ratings yet
Unit Ii LM CC
21 pages
Spherical Roller Thrust Bearings: Dimensions
No ratings yet
Spherical Roller Thrust Bearings: Dimensions
3 pages
Music Player Using Python: Submitted By: Mayank Kumar (1808210088)
No ratings yet
Music Player Using Python: Submitted By: Mayank Kumar (1808210088)
18 pages
Level-V IT Service Management Latest
No ratings yet
Level-V IT Service Management Latest
57 pages
The Ultimate Raster Graphics Software List 8 Programs To Consider
No ratings yet
The Ultimate Raster Graphics Software List 8 Programs To Consider
11 pages
Microsoft Azure
100% (1)
Microsoft Azure
13 pages
Getting Started With Simulis Thermodynamics: Use Case 2: Create, Dispatch and Install A Simulis Thermodynamic Package
No ratings yet
Getting Started With Simulis Thermodynamics: Use Case 2: Create, Dispatch and Install A Simulis Thermodynamic Package
15 pages
CXD 210 3I Course - Description
No ratings yet
CXD 210 3I Course - Description
3 pages
SKF NU 1044 ML Specification
No ratings yet
SKF NU 1044 ML Specification
5 pages
File System Vs DBMS
100% (2)
File System Vs DBMS
13 pages
Housekeeping Unusued or Old Tablespace
No ratings yet
Housekeeping Unusued or Old Tablespace
3 pages
Viewpoint:: Bacnet Versus Lonworks
No ratings yet
Viewpoint:: Bacnet Versus Lonworks
3 pages

Scaling Up To Billions of Cells With: Supporting Large Spreadsheets With Databases

Uploaded by

Scaling Up To Billions of Cells With: Supporting Large Spreadsheets With Databases

Uploaded by

Scaling up to Billions of Cells with DATA S PREAD:

Supporting Large Spreadsheets with Databases

Mangesh Bendre, Vipul Venkataraman, Xinyan Zhou

ABSTRACT functionality, e.g., expressiveness, that databases natively provide.

Table 1: Spreadsheet Datasets: Preliminary Statistics

Table 2: Tabular Regions in Spreadsheets. Table 3: Cells accessed by formulae.

columns and relatively few rows, necessitating such a representa- 4 ✕ ✕ ✕ 1

Thus, monotonic positional mapping trades-off the performance

need to scan n items, i.e., the average time complexity is O(N ),

lying indexing structure. In Table 4(b), we experimentally observe Positional Mapper

SQL engine of the database and seamlessly support SQL queries on

Avg time (ms)

standard relational operations in more detail in Appendix C.

ROM tables. For this evaluation, we focused on Agg, since it provided

150 150 150

100 100 100

twice, the partition of the spreadsheet T = {T1 , T2 , . . . , Tp } corre-

0.1 0.1 0.1

columns (sometimes 10s of thousands each), we need to be care- D. ADDITIONAL EXPERIMENTS

You might also like