-
Let them have CAKES: A Cutting-Edge Algorithm for Scalable, Efficient, and Exact Search on Big Data
Authors:
Morgan E. Prior,
Thomas J. Howard III,
Oliver McLaughlin,
Terrence Ferguson,
Najib Ishaq,
Noah M. Daniels
Abstract:
The ongoing Big Data explosion has created a demand for efficient and scalable algorithms for similarity search.
Most recent work has focused on \textit{approximate} $k$-NN search, and while this may be sufficient for some applications, \textit{exact} $k$-NN search would be ideal for many applications.
We present CAKES, a set of three novel, exact algorithms for $k$-NN search.
CAKES's algorithms are generic over \textit{any} distance function, and they \textit{do not} scale with the cardinality or embedding dimension of the dataset, but rather with its metric entropy and fractal dimension.
We test these claims on datasets from the ANN-Benchmarks suite under commonly-used distance functions, as well as on a genomic dataset with Levenshtein distance and a radio-frequency dataset with Dynamic Time Warping distance.
We demonstrate that CAKES exhibits near-constant scaling with cardinality on data conforming to the manifold hypothesis, and has perfect recall on data in \textit{metric} spaces.
We also demonstrate that CAKES exhibits significantly higher recall than state-of-the-art $k$-NN search algorithms when the distance function is not a metric.
Additionally, we show that indexing and tuning time for CAKES is an order of magnitude, or more, faster than state-of-the-art approaches.
We conclude that CAKES is a highly efficient and scalable algorithm for exact $k$-NN search on Big Data.
We provide a Rust implementation of CAKES.
Submitted 9 January, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
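For readers unfamiliar with the terms in the CAKES abstract, the following is a minimal sketch of what "exact $k$-NN search, generic over any distance function" means. It is a brute-force linear-scan baseline and not the CAKES algorithms themselves, which the abstract says avoid scaling with cardinality. The names `levenshtein` and `knn_linear` are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def knn_linear(
    query: T,
    data: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
) -> List[Tuple[float, int]]:
    """Exact k-NN by exhaustive scan: returns (distance, index) pairs.

    Assumption: this is only a baseline. It computes one distance per item,
    so its cost grows linearly with the size of the dataset.
    """
    scored = ((distance(query, item), idx) for idx, item in enumerate(data))
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "bitten", "knitting"]
    print(knn_linear("kitten", words, 2, levenshtein))
```

Because `distance` is passed in as a callable, the same scan works for Levenshtein distance on strings, Euclidean distance on vectors, or any other user-defined function, and the result is exact by construction since every item is examined.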
-
MEDFORD: A human and machine readable metadata markup language
Authors:
Polina Shpilker,
John Freeman,
Hailey McKelvie,
Jill Ashey,
Jay-Miguel Fonticella,
Hollie Putnam,
Jane Greenberg,
Lenore J. Cowen,
Alva Couch,
Noah M. Daniels
Abstract:
Reproducibility of research is essential for science. However, in the way modern computational biology research is done, it is easy to lose track of small, but extremely critical, details. Key details, such as the specific version of a software package used or the iteration of a genome, can easily be lost in the shuffle, or perhaps not noted at all. Much work is being done on the database and storage side of things, ensuring that there exists a space to store experiment-specific details, but current mechanisms for recording details are cumbersome for scientists to use. We propose a new metadata description language, named MEDFORD, in which scientists can record all details relevant to their research. Human-readable, easily-editable, and templatable, MEDFORD serves as a collection point for all notes that a researcher could find relevant to their research, be it for internal use or for future replication. MEDFORD has been applied to coral research, documenting research from RNA-seq analyses to photo collections.
Submitted 16 June, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Clustered Hierarchical Anomaly and Outlier Detection Algorithms
Authors:
Najib Ishaq,
Thomas J. Howard III,
Noah M. Daniels
Abstract:
Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold mapping technique in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters as selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), exploring various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies this knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional "big data" anomaly-detection problems, and generalize across datasets and distance functions. Source code for CLAM and CHAODA is freely available on GitHub at https://github.com/URI-ABD/clam.
Submitted 21 November, 2021; v1 submitted 9 February, 2021;
originally announced March 2021.
-
Clustered Hierarchical Entropy-Scaling Search of Astronomical and Biological Data
Authors:
Najib Ishaq,
George Student,
Noah M. Daniels
Abstract:
Both astronomy and biology are experiencing explosive growth of data, resulting in a "big data" problem that stands in the way of a "big data" opportunity for discovery. One common question asked of such data is that of approximate search ($\rho$-nearest neighbors search). We present a hierarchical search algorithm for such data sets that takes advantage of particular geometric properties apparent in both astronomical and biological data sets, namely the metric entropy and fractal dimensionality of the data. We present CHESS (Clustered Hierarchical Entropy-Scaling Search), a search tool with virtually no loss in specificity or sensitivity, demonstrating a $13.6\times$ speedup over linear search on the Sloan Digital Sky Survey's APOGEE data set and a $68\times$ speedup on the GreenGenes 16S metagenomic data set, as well as asymptotically fewer distance comparisons on APOGEE when compared to the FALCONN locality-sensitive hashing library. CHESS demonstrates an asymptotic complexity not directly dependent on data set size, and is in practice at least an order of magnitude faster than linear search by performing fewer distance comparisons. Unlike locality-sensitive hashing approaches, CHESS can work with any user-defined distance function. CHESS also allows for implicit data compression, which we demonstrate on the APOGEE data set. We also discuss an extension allowing for efficient $k$-nearest neighbors search.
Submitted 10 November, 2019; v1 submitted 22 August, 2019;
originally announced August 2019.
-
Entropy-scaling search of massive biological data
Authors:
Y. William Yu,
Noah M. Daniels,
David Christian Danko,
Bonnie Berger
Abstract:
Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
Submitted 21 September, 2015; v1 submitted 18 March, 2015;
originally announced March 2015.
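The entropy-scaling abstract above describes search time that scales with the number of covering hyperspheres rather than with the number of points. The following is a minimal sketch of that coarse-then-fine idea, assuming a metric distance, hand-picked cluster centers, and Euclidean points. It is not the authors' implementation, and the names `build_clusters` and `range_search` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return math.dist(a, b)


def build_clusters(
    data: Sequence[Point], centers: Sequence[Point], distance: Distance
) -> List[Dict]:
    """Assign each point to its nearest center and track each cluster radius."""
    clusters = [{"center": c, "radius": 0.0, "members": []} for c in centers]
    for p in data:
        best = min(clusters, key=lambda cl: distance(p, cl["center"]))
        best["members"].append(p)
        best["radius"] = max(best["radius"], distance(p, best["center"]))
    return clusters


def range_search(
    query: Point, r: float, clusters: List[Dict], distance: Distance
) -> List[Point]:
    """Return all points within distance r of the query.

    Coarse step: one distance computation per cluster center.
    Fine step: scan only clusters that could contain a hit.
    """
    hits: List[Point] = []
    for cl in clusters:
        # Triangle inequality: every member m of the cluster satisfies
        # d(query, m) >= d(query, center) - radius, so if that lower bound
        # exceeds r, the whole cluster can be skipped without missing a hit.
        if distance(query, cl["center"]) > r + cl["radius"]:
            continue
        hits.extend(m for m in cl["members"] if distance(query, m) <= r)
    return hits


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
    clusters = build_clusters(points, [(0.0, 0.0), (10.0, 10.0)], euclidean)
    print(range_search((0.5, 0.5), 1.0, clusters, euclidean))
```

The pruning test is only valid when the distance function obeys the triangle inequality, which is consistent with the CAKES abstract's statement that perfect recall is obtained on data in metric spaces.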
-
Remote Homology Detection in Proteins Using Graphical Models
Authors:
Noah M. Daniels
Abstract:
Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.
We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.
Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.
We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.
Submitted 23 April, 2013;
originally announced April 2013.