Declarative nested data transformations at scale and biomedical applications

Thesis

Abstract:: While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to translate collection programs by hand, which is non-trivial, or to employ automated flattening procedures; both lead to performance problems. These challenges worsen for nested collections with skewed cardinalities, where neither handcrafted rewriting nor automated flattening can enforce load balancing across partitions.

This work proposes the TraNCE compilation framework, which translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be evaluated efficiently. The framework combines query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. Biomedical case studies outline research and clinical applications for the platform, including data integration support for building feature sets for classification. An extensive experimental evaluation is provided using both synthetic and real-world datasets from the biomedical domain. The evaluation shows that the system can outperform the common alternative, which is based on “flattening” complex data structures, and that it runs efficiently in cases where alternative approaches are unable to complete at all.
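The contrast the abstract draws between flattening and shredding can be illustrated with a minimal Python sketch. This is not TraNCE code or its API: the sample data, the field names, and the helpers `flatten`, `shred` and `unshred` are all hypothetical, and integer labels stand in for whatever label scheme the framework actually uses. The sketch only shows the general idea that a nested collection can be stored as a flat top-level relation plus a flat dictionary of inner rows keyed by label, and that the nested form can be recovered from the two.

```python
from collections import defaultdict

# Hypothetical nested input: each sample carries an inner bag of mutations.
samples = [
    {"sample": "s1", "mutations": [{"gene": "TP53", "impact": 0.9},
                                   {"gene": "BRCA1", "impact": 0.4}]},
    {"sample": "s2", "mutations": []},
    {"sample": "s3", "mutations": [{"gene": "TP53", "impact": 0.7}]},
]


def flatten(nested):
    """Flattening: one row per (sample, mutation) pair.

    Outer fields are repeated for every inner row, and an empty inner bag
    needs a placeholder row with nulls so the sample is not lost.
    """
    rows = []
    for s in nested:
        if not s["mutations"]:
            rows.append({"sample": s["sample"], "gene": None, "impact": None})
        for m in s["mutations"]:
            rows.append({"sample": s["sample"], **m})
    return rows


def shred(nested):
    """Shredding: a flat top-level relation plus a flat dictionary.

    The inner bag is replaced by a label (here, an assumed integer id);
    the inner rows live in a separate flat relation keyed by that label.
    """
    top, mutation_dict = [], []
    for label, s in enumerate(nested):
        top.append({"sample": s["sample"], "mutations": label})
        for m in s["mutations"]:
            mutation_dict.append({"label": label, **m})
    return top, mutation_dict


def unshred(top, mutation_dict):
    """Rebuild the nested collection by grouping dictionary rows by label."""
    by_label = defaultdict(list)
    for row in mutation_dict:
        by_label[row["label"]].append(
            {k: v for k, v in row.items() if k != "label"}
        )
    return [{"sample": t["sample"], "mutations": by_label[t["mutations"]]}
            for t in top]


flat_rows = flatten(samples)
top, mutation_dict = shred(samples)
```

In the shredded form both `top` and `mutation_dict` are flat, so each can be partitioned independently of the other, whereas the flattened form repeats the outer fields once per inner row.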

Files:: JSmithThesisFinal.pdf

(Dissemination version, 6.4MB)

Division:

MPLS

Department:

Computer Science

Role:

Author

Role:

Supervisor

ORCID:

0000-0003-2964-0880

Language:: English
Keywords:: nested data

unnesting

shredding

genomic analysis

distributed processing platforms

nested relational calculus
Subjects:: Genomics

Data processing

Query languages (Computer science)

Big data
Deposit date:: 2022-06-01

Licence:: Terms and Conditions of Use for Oxford University Research Archive
