A hybrid approach for efficient unique column combination discovery
T Papenbrock, F Naumann - 2017 - dl.gi.de
Abstract
Unique column combinations (UCCs) are groups of attributes in relational datasets whose combined values occur in no more than one record. Hence, they indicate keys and serve data management tasks such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records.
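As a minimal illustration of the definition (not taken from the paper), the check below tests whether a set of columns forms a UCC in a small in-memory table. The function name `is_unique`, the list-of-dicts table layout, and the sample rows are assumptions made for this sketch.

```python
from typing import Hashable, Mapping, Sequence


def is_unique(rows: Sequence[Mapping[str, Hashable]], columns: Sequence[str]) -> bool:
    """Return True if no combination of values in `columns` occurs in more than one row."""
    seen = set()
    for row in rows:
        # Project the row onto the candidate columns.
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # the same value combination appeared before
        seen.add(key)
    return True


# Hypothetical example data, invented for illustration only.
rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "Paris"},
]

print(is_unique(rows, ["first"]))          # "Ada" repeats, so not a UCC
print(is_unique(rows, ["first", "city"]))  # every (first, city) pair is distinct
```

A single scan with a hash set is enough for one candidate; the difficulty the abstract describes comes from the number of candidate column combinations, which grows exponentially with the number of attributes.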
We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: a hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC not only outperforms all existing approaches but also scales to much larger datasets.
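The abstract does not spell out the algorithm's internals, so the sketch below is only a toy illustration of the general idea of pairing a cheap approximation step with an exact validation step. It is not the paper's algorithm. It enumerates candidates level by level, rejects a candidate early if a small random sample of rows already contains a duplicate, and validates the survivors against all rows. The names `minimal_uccs`, `has_duplicates`, `sample_size`, and `seed`, as well as the example data, are assumptions made for this sketch.

```python
import random
from itertools import combinations


def has_duplicates(rows, columns):
    """Return True if two rows agree on every column in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return True
        seen.add(key)
    return False


def minimal_uccs(rows, columns, sample_size=100, seed=0):
    """Find all minimal UCCs by brute-force level-wise search (exponential in len(columns))."""
    rng = random.Random(seed)
    # Approximation data: a small subset of the rows.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    found = []
    for size in range(1, len(columns) + 1):
        for candidate in combinations(columns, size):
            # A superset of a known UCC is unique but not minimal.
            if any(set(ucc) <= set(candidate) for ucc in found):
                continue
            # Cheap check: a duplicate in the sample proves non-uniqueness.
            if has_duplicates(sample, candidate):
                continue
            # Exact check: the candidate must be unique over all rows.
            if not has_duplicates(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical example data, invented for illustration only.
rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "Paris"},
]

print(minimal_uccs(rows, ["first", "last", "city"]))
```

The sample can only rule candidates out, never confirm them, which is why every surviving candidate still goes through the full scan. Processing smaller combinations first guarantees that each reported combination is minimal.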