Thesis
Declarative nested data transformations at scale and biomedical applications
- Abstract:
While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
This work proposes the TraNCE compilation framework, which translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. Biomedical case studies are presented that outline research and clinical applications for the platform, including data integration support for building feature sets for classification. An extensive experimental evaluation is provided using both synthetic and real-world datasets from the biomedical domain. The evaluation shows that the system is capable of outperforming the common alternative, based on “flattening” complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
Bibliographic Details
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
Item Description
- Language:
- English
- Deposit date:
- 2022-06-01
Terms of use
- Copyright holder:
- Smith, J
- Copyright date:
- 2021