Thesis

Declarative nested data transformations at scale and biomedical applications

Abstract:

While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.

This work proposes the TraNCE compilation framework, which translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. Biomedical case studies are presented that outline research and clinical applications for the platform, including data integration support for building feature sets for classification. An extensive experimental evaluation is provided using both synthetic and real-world datasets from the biomedical domain. The evaluation shows that the system is capable of outperforming the common alternative, based on “flattening” complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
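
To make the idea of a shredded representation more concrete, the following is a minimal Python sketch. It is not the TraNCE implementation; the record layout, the field names (`pid`, `samples`, `gene`, `count`), and the helper functions `shred` and `unshred` are all hypothetical and chosen only for illustration. The sketch assumes that shredding replaces each inner collection with a label and stores the inner rows in a separate flat collection keyed by that label, so that every piece is flat and can be partitioned independently.

```python
from collections import defaultdict


def shred(patients):
    """Split a nested collection into two flat collections.

    Returns (top, samples_dict):
      top          - one flat row per patient; the inner bag is replaced by a label
      samples_dict - one flat row per inner element, tagged with its parent's label
    The integer label scheme is an assumption made for this sketch.
    """
    top = []
    samples_dict = []
    for label, patient in enumerate(patients):
        top.append({"pid": patient["pid"], "samples": label})
        for sample in patient["samples"]:
            samples_dict.append({"label": label, **sample})
    return top, samples_dict


def unshred(top, samples_dict):
    """Rebuild the nested collection by grouping inner rows by label."""
    by_label = defaultdict(list)
    for row in samples_dict:
        inner = {k: v for k, v in row.items() if k != "label"}
        by_label[row["label"]].append(inner)
    return [{"pid": t["pid"], "samples": by_label[t["samples"]]} for t in top]


# Hypothetical input: the second patient has an empty inner collection.
patients = [
    {"pid": "p1", "samples": [{"gene": "TP53", "count": 3},
                              {"gene": "BRCA1", "count": 1}]},
    {"pid": "p2", "samples": []},
]

top, samples_dict = shred(patients)
restored = unshred(top, samples_dict)
```

In this sketch the patient with no samples still appears in `top` and contributes no rows to `samples_dict`, whereas a flattened table would need a padded row of nulls to keep that patient. Because the inner rows live in their own flat collection, a large inner collection can in principle be spread across partitions rather than being tied to its parent row.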

Authors


Division:
MPLS
Department:
Computer Science
Role:
Author

Contributors

Role:
Supervisor
ORCID:
0000-0003-2964-0880
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford
