Search | arXiv e-print repository

Polynomial Time Algorithms for Integer Programming and Unbounded Subset Sum in the Total Regime

Authors: Divesh Aggarwal, Antoine Joux, Miklos Santha, Karol Węgrzycki

Abstract: The Unbounded Subset Sum (USS) problem is an NP-hard computational problem where the goal is to decide whether there exist non-negative integers $x_1, \ldots, x_n$ such that $x_1 a_1 + \ldots + x_n a_n = b$, where $a_1 < \cdots < a_n < b$ are distinct positive integers with $\text{gcd}(a_1, \ldots, a_n)$ dividing $b$. The problem can be solved in pseudopolynomial time, while specialized cases, suc… ▽ More The Unbounded Subset Sum (USS) problem is an NP-hard computational problem where the goal is to decide whether there exist non-negative integers $x_1, \ldots, x_n$ such that $x_1 a_1 + \ldots + x_n a_n = b$, where $a_1 < \cdots < a_n < b$ are distinct positive integers with $\text{gcd}(a_1, \ldots, a_n)$ dividing $b$. The problem can be solved in pseudopolynomial time, while specialized cases, such as when $b$ exceeds the Frobenius number of $a_1, \ldots, a_n$ simplify to a total problem where a solution always exists. This paper explores the concept of totality in USS. The challenge in this setting is to actually find a solution, even though we know its existence is guaranteed. We focus on the instances of USS where solutions are guaranteed for large $b$. We show that when $b$ is slightly greater than the Frobenius number, we can find the solution to USS in polynomial time. We then show how our results extend to Integer Programming with Equalities (ILPE), highlighting conditions under which ILPE becomes total. We investigate the diagonal Frobenius number, which is the appropriate generalization of the Frobenius number to this context. In this setting, we give a polynomial-time algorithm to find a solution of ILPE. The bound obtained from our algorithmic procedure for finding a solution almost matches the recent existential bound of Bach, Eisenbrand, Rothvoss, and Weismantel (2024). △ Less

Submitted 11 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: 12 pages

arXiv:2405.13000 [pdf, other]

RAGE Against the Machine: Retrieval-Augmented LLM Explanations

Authors: Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jaroslaw Szlichta

Abstract: This paper demonstrates RAGE, an interactive tool for explaining Large Language Models (LLMs) augmented with retrieval capabilities; i.e., able to query external sources and pull relevant information into their input context. Our explanations are counterfactual in the sense that they identify parts of the input context that, when removed, change the answer to the question posed to the LLM. RAGE in… ▽ More This paper demonstrates RAGE, an interactive tool for explaining Large Language Models (LLMs) augmented with retrieval capabilities; i.e., able to query external sources and pull relevant information into their input context. Our explanations are counterfactual in the sense that they identify parts of the input context that, when removed, change the answer to the question posed to the LLM. RAGE includes pruning methods to navigate the vast space of possible explanations, allowing users to view the provenance of the produced answers. △ Less

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted by ICDE 2024 (Demonstration Track)

arXiv:2404.07354 [pdf, other]

FairEM360: A Suite for Responsible Entity Matching

Authors: Nima Shahbazi, Mahdi Erfanian, Abolfazl Asudeh, Fatemeh Nargesian, Divesh Srivastava

Abstract: Entity matching is one the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced by the matcher at this stage can contribute to promoting fairness in downstream tasks. This demonstration showcases FairEM360, a framework for 1) auditing the o… ▽ More Entity matching is one the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced by the matcher at this stage can contribute to promoting fairness in downstream tasks. This demonstration showcases FairEM360, a framework for 1) auditing the output of entity matchers across a wide range of fairness measures and paradigms, 2) providing potential explanations for the underlying reasons for unfairness, and 3) providing resolutions for the unfairness issues through an exploratory process with human-in-the-loop feedback, utilizing an ensemble of matchers. We aspire for FairEM360 to contribute to the prioritization of fairness as a key consideration in the evaluation of EM pipelines. △ Less

Submitted 18 July, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2403.00526 [pdf, other]

Data Quality Assessment: Challenges and Opportunities

Authors: Sedir Mohammed, Hazar Harmouch, Felix Naumann, Divesh Srivastava

Abstract: Data-oriented applications, their users, and even the law require data of high quality. Research has broken down the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation, to name but a few. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessin… ▽ More Data-oriented applications, their users, and even the law require data of high quality. Research has broken down the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation, to name but a few. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessing data quality in all of its dimensions is largely absent, and with it the ability to gauge the success of any data cleaning effort. It is our vision to establish a systematic and comprehensive framework for the (numeric) assessment of data quality for a given dataset and its intended use. Such a framework must cover the various facets that influence data quality, as well as the many types of data quality dimensions. In particular, we identify five facets that serve as a foundation of data quality assessment. For each facet, we outline the challenges and opportunities that arise when trying to actually assign quality scores to data and create a data quality profile for it, along with a wide range of technologies needed for this purpose. △ Less

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.14863 [pdf, other]

Evaluation of a semi-autonomous attentive listening system with takeover prompting

Authors: Haruki Kawai, Divesh Lala, Koji Inoue, Keiko Ochi, Tatsuya Kawahara

Abstract: The handling of communication breakdowns and loss of engagement is an important aspect of spoken dialogue systems, particularly for chatting systems such as attentive listening, where the user is mostly speaking. We presume that a human is best equipped to handle this task and rescue the flow of conversation. To this end, we propose a semi-autonomous system, where a remote operator can take contro… ▽ More The handling of communication breakdowns and loss of engagement is an important aspect of spoken dialogue systems, particularly for chatting systems such as attentive listening, where the user is mostly speaking. We presume that a human is best equipped to handle this task and rescue the flow of conversation. To this end, we propose a semi-autonomous system, where a remote operator can take control of an autonomous attentive listening system in real-time. In order to make human intervention easy and consistent, we introduce automatic detection of low interest and engagement to provide explicit takeover prompts to the remote operator. We implement this semi-autonomous system which detects takeover points for the operator and compare it to fully tele-operated and fully autonomous attentive listening systems. We find that the semi-autonomous system is generally perceived more positively than the autonomous system. The results suggest that identifying points of conversation when the user starts to lose interest may help us improve a fully autonomous dialogue system. △ Less

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.12770 [pdf, other]

Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara

Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorpora… ▽ More In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing Japanese EmpatheticDialogues dataset - a textual-based dialogue dataset consisting of 8 emotional categories from Plutchik's wheel of emotions - the Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both random baseline and the ChatGPT performance, in term of F1-score, in all modules. Further validation of our model's efficacy is confirmed in its application to the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, by surpassing both random baseline and the ChatGPT. This consistent performance across both textual and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

arXiv:2401.04867 [pdf, other]

An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems

Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

Abstract: Establishing evaluation schemes for spoken dialogue systems is important, but it can also be challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviors. In this paper, to this… ▽ More Establishing evaluation schemes for spoken dialogue systems is important, but it can also be challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviors. In this paper, to this end, we investigate the relationship between user behaviors and subjective evaluation scores in social dialogue tasks: attentive listening, job interview, and first-meeting conversation. The results reveal that in dialogue tasks where user utterances are primary, such as attentive listening and job interview, indicators like the number of utterances and words play a significant role in evaluation. Observing disfluency also can indicate the effectiveness of formal tasks, such as job interview. On the other hand, in dialogue tasks with high interactivity, such as first-meeting conversation, behaviors related to turn-taking, like average switch pause length, become more important. These findings suggest that selecting appropriate user behaviors can provide valuable insights for objective evaluation in each social dialogue task. △ Less

Submitted 23 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

arXiv:2311.15064 [pdf, other]

Recursive lattice reduction -- A framework for finding short lattice vectors

Authors: Divesh Aggarwal, Thomas Espitau, Spencer Peters, Noah Stephens-Davidowitz

Abstract: We propose a new framework called recursive lattice reduction for finding short non-zero vectors in a lattice or for finding dense sublattices of a lattice. At a high level, the framework works by recursively searching for dense sublattices of dense sublattices (or their duals). Eventually, the procedure encounters a recursive call on a lattice $\mathcal{L}$ with relatively low rank $k$, at which… ▽ More We propose a new framework called recursive lattice reduction for finding short non-zero vectors in a lattice or for finding dense sublattices of a lattice. At a high level, the framework works by recursively searching for dense sublattices of dense sublattices (or their duals). Eventually, the procedure encounters a recursive call on a lattice $\mathcal{L}$ with relatively low rank $k$, at which point we simply use a known algorithm to find a short non-zero vector in $\mathcal{L}$. We view our framework as complementary to basis reduction algorithms, which similarly work to reduce an $n$-dimensional lattice problem with some approximation factor $γ$ to an exact lattice problem in dimension $k < n$, with a tradeoff between $γ$, $n$, and $k$. Our framework provides an alternative and arguably simpler perspective, which in particular can be described without explicitly referencing any specific basis of the lattice, Gram-Schmidt vectors, or even projection (though implementations of algorithms in this framework will likely make use of such things). We present a number of specific instantiations of our framework. Our main concrete result is a reduction that matches the tradeoff between $γ$, $n$, and $k$ achieved by the best-known basis reduction algorithms (in terms of the Hermite factor, up to low-order terms) across all parameter regimes. In fact, this reduction also can be used to find dense sublattices with any rank $\ell$ satisfying $\min\{\ell,n-\ell\} \leq n-k+1$, using only an oracle for SVP (or even just Hermite SVP) in $k$ dimensions, which is itself a novel result (as far as the authors know). We also show a very simple reduction that achieves the same tradeoff in quasipolynomial time. Finally, we present an automated approach for searching for algorithms in this framework that (provably) achieve better approximations with fewer oracle calls. △ Less

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2308.11020 [pdf, other]

doi 10.1145/3610661.3617151

Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

Abstract: This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evalu… ▽ More This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evaluate the robot's human-likeness based on observable user behaviors indirectly, thus enhancing objectivity and reproducibility. To begin, we created an annotated dataset of human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then conducted an analysis to determine the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method. △ Less

Submitted 25 September, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted by 25th ACM International Conference on Multimodal Interaction (ICMI '23), Late-Breaking Results

arXiv:2307.02726 [pdf, other]

Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching

Authors: Nima Shahbazi, Nikola Danevski, Fatemeh Nargesian, Abolfazl Asudeh, Divesh Srivastava

Abstract: Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation… ▽ More Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation of a variety of EM techniques in this paper. We generated two social datasets from publicly available datasets for the purpose of auditing EM through the lens of fairness. Our findings underscore potential unfairness under two common conditions in real-world societies: (i) when some demographic groups are overrepresented, and (ii) when names are more similar in some groups compared to others. Among our many findings, it is noteworthy to mention that while various fairness definitions are valuable for different settings, due to EM's class imbalance nature, measures such as positive predictive value parity and true positive rate parity are, in general, more capable of revealing EM unfairness. △ Less

Submitted 5 July, 2023; originally announced July 2023.

Comments: Accepted to VLDB'23

arXiv:2302.04983 [pdf, other]

doi 10.1109/ICDE55515.2023.00285

CREDENCE: Counterfactual Explanations for Document Ranking

Authors: Joel Rorseth, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava, Jaroslaw Szlichta

Abstract: Towards better explainability in the field of information retrieval, we present CREDENCE, an interactive tool capable of generating counterfactual explanations for document rankers. Embracing the unique properties of the ranking problem, we present counterfactual explanations in terms of document perturbations, query perturbations, and even other documents. Additionally, users may build and test t… ▽ More Towards better explainability in the field of information retrieval, we present CREDENCE, an interactive tool capable of generating counterfactual explanations for document rankers. Embracing the unique properties of the ranking problem, we present counterfactual explanations in terms of document perturbations, query perturbations, and even other documents. Additionally, users may build and test their own perturbations, and extract insights about their query, documents, and ranker. △ Less

Submitted 9 February, 2023; originally announced February 2023.

Comments: Accepted by ICDE 2023 (Demonstration Track)

arXiv:2211.11693 [pdf, other]

Lattice Problems Beyond Polynomial Time

Authors: Divesh Aggarwal, Huck Bennett, Zvika Brakerski, Alexander Golovnev, Rajendra Kumar, Zeyong Li, Spencer Peters, Noah Stephens-Davidowitz, Vinod Vaikuntanathan

Abstract: We study the complexity of lattice problems in a world where algorithms, reductions, and protocols can run in superpolynomial time, revisiting four foundational results: two worst-case to average-case reductions and two protocols. We also show a novel protocol. 1. We prove that secret-key cryptography exists if $\widetilde{O}(\sqrt{n})$-approximate SVP is hard for $2^{\varepsilon n}$-time algori… ▽ More We study the complexity of lattice problems in a world where algorithms, reductions, and protocols can run in superpolynomial time, revisiting four foundational results: two worst-case to average-case reductions and two protocols. We also show a novel protocol. 1. We prove that secret-key cryptography exists if $\widetilde{O}(\sqrt{n})$-approximate SVP is hard for $2^{\varepsilon n}$-time algorithms. I.e., we extend to our setting (Micciancio and Regev's improved version of) Ajtai's celebrated polynomial-time worst-case to average-case reduction from $\widetilde{O}(n)$-approximate SVP to SIS. 2. We prove that public-key cryptography exists if $\widetilde{O}(n)$-approximate SVP is hard for $2^{\varepsilon n}$-time algorithms. This extends to our setting Regev's celebrated polynomial-time worst-case to average-case reduction from $\widetilde{O}(n^{1.5})$-approximate SVP to LWE. In fact, Regev's reduction is quantum, but ours is classical, generalizing Peikert's polynomial-time classical reduction from $\widetilde{O}(n^2)$-approximate SVP. 3. We show a $2^{\varepsilon n}$-time coAM protocol for $O(1)$-approximate CVP, generalizing the celebrated polynomial-time protocol for $O(\sqrt{n/\log n})$-CVP due to Goldreich and Goldwasser. These results show complexity-theoretic barriers to extending the recent line of fine-grained hardness results for CVP and SVP to larger approximation factors. (This result also extends to arbitrary norms.) 4. We show a $2^{\varepsilon n}$-time co-non-deterministic protocol for $O(\sqrt{\log n})$-approximate SVP, generalizing the (also celebrated!) polynomial-time protocol for $O(\sqrt{n})$-CVP due to Aharonov and Regev. 5. We give a novel coMA protocol for $O(1)$-approximate CVP with a $2^{\varepsilon n}$-time verifier. All of the results described above are special cases of more general theorems that achieve time-approximation factor tradeoffs. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.08526 [pdf, other]

Alzheimer's Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners

Authors: Yuanchao Li, Catherine Lai, Divesh Lala, Koji Inoue, Tatsuya Kawahara

Abstract: As the aging of society continues to accelerate, Alzheimer's Disease (AD) has received more and more attention from not only medical but also other fields, such as computer science, over the past decade. Since speech is considered one of the effective ways to diagnose cognitive decline, AD detection from speech has emerged as a hot topic. Nevertheless, such approaches fail to tackle several key is… ▽ More As the aging of society continues to accelerate, Alzheimer's Disease (AD) has received more and more attention from not only medical but also other fields, such as computer science, over the past decade. Since speech is considered one of the effective ways to diagnose cognitive decline, AD detection from speech has emerged as a hot topic. Nevertheless, such approaches fail to tackle several key issues: 1) AD is a complex neurocognitive disorder which means it is inappropriate to conduct AD detection using utterance information alone while ignoring dialogue information; 2) Utterances of AD patients contain many disfluencies that affect speech recognition yet are helpful to diagnosis; 3) AD patients tend to speak less, causing dialogue breakdown as the disease progresses. This fact leads to a small number of utterances, which may cause detection bias. Therefore, in this paper, we propose a novel AD detection architecture consisting of two major modules: an ensemble AD detector and a proactive listener. This architecture can be embedded in the dialogue system of conversational robots for healthcare. △ Less

Submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted for HRI2022 Late-Breaking Report

arXiv:2211.04568 [pdf, ps, other]

Towards Algorithmic Fairness in Space-Time: Filling in Black Holes

Authors: Cheryl Flynn, Aritra Guha, Subhabrata Majumdar, Divesh Srivastava, Zhengyi Zhou

Abstract: New technologies and the availability of geospatial data have drawn attention to spatio-temporal biases present in society. For example: the COVID-19 pandemic highlighted disparities in the availability of broadband service and its role in the digital divide; the environmental justice movement in the United States has raised awareness to health implications for minority populations stemming from h… ▽ More New technologies and the availability of geospatial data have drawn attention to spatio-temporal biases present in society. For example: the COVID-19 pandemic highlighted disparities in the availability of broadband service and its role in the digital divide; the environmental justice movement in the United States has raised awareness to health implications for minority populations stemming from historical redlining practices; and studies have found varying quality and coverage in the collection and sharing of open-source geospatial data. Despite the extensive literature on machine learning (ML) fairness, few algorithmic strategies have been proposed to mitigate such biases. In this paper we highlight the unique challenges for quantifying and addressing spatio-temporal biases, through the lens of use cases presented in the scientific literature and media. We envision a roadmap of ML strategies that need to be developed or adapted to quantify and overcome these challenges -- including transfer learning, active learning, and reinforcement learning techniques. Further, we discuss the potential role of ML in providing guidance to policy makers on issues related to spatial fairness. △ Less

Submitted 8 November, 2022; originally announced November 2022.

arXiv:2211.04385 [pdf, other]

Why we couldn't prove SETH hardness of the Closest Vector Problem for even norms!

Authors: Divesh Aggarwal, Rajendra Kumar

Abstract: Recent work [BGS17,ABGS19] has shown SETH hardness of CVP in the $\ell_p$ norm for any $p$ that is not an even integer. This result was shown by giving a Karp reduction from $k$-SAT on $n$ variables to CVP on a lattice of rank $n$. In this work, we show a barrier towards proving a similar result for CVP in the $\ell_p$ norm where $p$ is an even integer. We show that for any $c>0$, if for every… ▽ More Recent work [BGS17,ABGS19] has shown SETH hardness of CVP in the $\ell_p$ norm for any $p$ that is not an even integer. This result was shown by giving a Karp reduction from $k$-SAT on $n$ variables to CVP on a lattice of rank $n$. In this work, we show a barrier towards proving a similar result for CVP in the $\ell_p$ norm where $p$ is an even integer. We show that for any $c>0$, if for every $k > 0$, there exists an efficient reduction that maps a $k$-SAT instance on $n$ variables to a CVP instance for a lattice of rank at most $n^{c}$ in the Euclidean norm, then $\mathsf{coNP} \subset \mathsf{NP/Poly}$. We prove a similar result for CVP for all even norms under a mild additional promise that the ratio of the distance of the target from the lattice and the shortest non-zero vector in the lattice is bounded by $exp(n^{O(1)})$. Furthermore, we show that for any $c> 0$, and any even integer $p$, if for every $k > 0$, there exists an efficient reduction that maps a $k$-SAT instance on $n$ variables to a $SVP_p$ instance for a lattice of rank at most $n^{c}$, then $\mathsf{coNP} \subset \mathsf{NP/Poly}$. The result for SVP does not require any additional promise. While prior results have indicated that lattice problems in the $\ell_2$ norm (Euclidean norm) are easier than lattice problems in other norms, this is the first result that shows a separation between these problems. We achieve this by using a result by Dell and van Melkebeek [JACM, 2014] on the impossibility of the existence of a reduction that compresses an arbitrary $k$-SAT instance into a string of length $\mathcal{O}(n^{k-ε})$ for any $ε>0$. In addition to CVP, we also show that the same result holds for the Subset-Sum problem using similar techniques. △ Less

Submitted 25 November, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: Added: Instance compression of exact-CVP

arXiv:2203.12978 [pdf, other]

Effective Explanations for Entity Resolution Models

Authors: Tommaso Teofili, Donatella Firmani, Nick Koudas, Vincenzo Martello, Paolo Merialdo, Divesh Srivastava

Abstract: Entity resolution (ER) aims at matching records that refer to the same real-world entity. Although widely studied for the last 50 years, ER still represents a challenging data management problem, and several recent works have started to investigate the opportunity of applying deep learning (DL) techniques to solve this problem. In this paper, we study the fundamental problem of explainability of t… ▽ More Entity resolution (ER) aims at matching records that refer to the same real-world entity. Although widely studied for the last 50 years, ER still represents a challenging data management problem, and several recent works have started to investigate the opportunity of applying deep learning (DL) techniques to solve this problem. In this paper, we study the fundamental problem of explainability of the DL solution for ER. Understanding the matching predictions of an ER solution is indeed crucial to assess the trustworthiness of the DL model and to discover its biases. We treat the DL model as a black box classifier and - while previous approaches to provide explanations for DL predictions are agnostic to the classification task. we propose the CERTA approach that is aware of the semantics of the ER problem. Our approach produces both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip the prediction. CERTA builds on a probabilistic framework that aims at computing the explanations evaluating the outcomes produced by using perturbed copies of the input records. We experimentally evaluate CERTA's explanations of state-of-the-art ER solutions based on DL models using publicly available datasets, and demonstrate the effectiveness of CERTA over recently proposed methods for this problem. △ Less

Submitted 1 April, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

arXiv:2202.13354 [pdf, ps, other]

doi 10.3390/e24081038

Quantum secure non-malleable codes in the split-state model

Authors: Divesh Aggarwal, Naresh Goud Boddu, Rahul Jain

Abstract: Non-malleable-codes introduced by Dziembowski, Pietrzak and Wichs [DPW18] encode a classical message $S$ in a manner such that tampering the codeword results in the decoder either outputting the original message $S$ or a message that is unrelated/independent of $S$. Providing such non-malleable security for various tampering function families has received significant attention in recent years. We… ▽ More Non-malleable-codes introduced by Dziembowski, Pietrzak and Wichs [DPW18] encode a classical message $S$ in a manner such that tampering the codeword results in the decoder either outputting the original message $S$ or a message that is unrelated/independent of $S$. Providing such non-malleable security for various tampering function families has received significant attention in recent years. We consider the well-studied (2-part) split-state model, in which the message $S$ is encoded into two parts $X$ and $Y$, and the adversary is allowed to arbitrarily tamper with each $X$ and $Y$ individually. We consider the security of non-malleable-codes in the split-state model when the adversary is allowed to make use of arbitrary entanglement to tamper the parts $X$ and $Y$. We construct explicit quantum secure non-malleable-codes in the split-state model. Our construction of quantum secure non-malleable-codes is based on the recent construction of quantum secure $2$-source non-malleable-extractors by Boddu, Jain and Kapshikar [BJK21]. △ Less

Submitted 8 June, 2023; v1 submitted 27 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2109.03097. text overlap with arXiv:1611.09248 by other authors

arXiv:2111.04157 [pdf, ps, other]

Extractors: Low Entropy Requirements Colliding With Non-Malleability

Authors: Divesh Aggarwal, Eldon Chung, Maciej Obremski

Abstract: The known constructions of negligible error (non-malleable) two-source extractors can be broadly classified in three categories: (1) Constructions where one source has min-entropy rate about $1/2$, the other source can have small min-entropy rate, but the extractor doesn't guarantee non-malleability. (2) Constructions where one source is uniform, and the other can have small min-entropy rate,… ▽ More The known constructions of negligible error (non-malleable) two-source extractors can be broadly classified in three categories: (1) Constructions where one source has min-entropy rate about $1/2$, the other source can have small min-entropy rate, but the extractor doesn't guarantee non-malleability. (2) Constructions where one source is uniform, and the other can have small min-entropy rate, and the extractor guarantees non-malleability when the uniform source is tampered. (3) Constructions where both sources have entropy rate very close to $1$ and the extractor guarantees non-malleability against the tampering of both sources. We introduce a new notion of collision resistant extractors and in using it we obtain a strong two source non-malleable extractor where we require the first source to have $0.8$ entropy rate and the other source can have min-entropy polylogarithmic in the length of the source. We show how the above extractor can be applied to obtain a non-malleable extractor with output rate $\frac 1 2$, which is optimal. We also show how, by using our extractor and extending the known protocol, one can obtain a privacy amplification secure against memory tampering where the size of the secret output is almost optimal. △ Less

Submitted 9 June, 2023; v1 submitted 7 November, 2021; originally announced November 2021.

arXiv:2108.02084 [pdf, other]

doi 10.14778/3476249.3476280

Real-World Trajectory Sharing with Local Differential Privacy

Authors: Teddy Cunningham, Graham Cormode, Hakan Ferhatosmanoglu, Divesh Srivastava

Abstract: Sharing trajectories is beneficial for many real-world applications, such as managing disease spread through contact tracing and tailoring public services to a population's travel patterns. However, public concern over privacy and data protection has limited the extent to which this data is shared. Local differential privacy enables data sharing in which users share a perturbed version of their da… ▽ More Sharing trajectories is beneficial for many real-world applications, such as managing disease spread through contact tracing and tailoring public services to a population's travel patterns. However, public concern over privacy and data protection has limited the extent to which this data is shared. Local differential privacy enables data sharing in which users share a perturbed version of their data, but existing mechanisms fail to incorporate user-independent public knowledge (e.g., business locations and opening times, public transport schedules, geo-located tweets). This limitation makes mechanisms too restrictive, gives unrealistic outputs, and ultimately leads to low practical utility. To address these concerns, we propose a local differentially private mechanism that is based on perturbing hierarchically-structured, overlapping $n$-grams (i.e., contiguous subsequences of length $n$) of trajectory data. Our mechanism uses a multi-dimensional hierarchy over publicly available external knowledge of real-world places of interest to improve the realism and utility of the perturbed, shared trajectories. Importantly, including real-world public data does not negatively affect privacy or efficiency. Our experiments, using real-world data and a range of queries, each with real-world application analogues, demonstrate the superiority of our approach over a range of alternative methods. △ Less

Submitted 4 August, 2021; originally announced August 2021.

Journal ref: PVLDB, 14(11): 2283 - 2295, 2021

arXiv:2106.02766 [pdf, ps, other]

Quantum Measurement Adversary

Authors: Divesh Aggarwal, Naresh Goud Boddu, Rahul Jain, Maciej Obremski

Abstract: Multi-source-extractors are functions that extract uniform randomness from multiple (weak) sources of randomness. Quantum multi-source-extractors were considered by Kasher and Kempe (for the quantum-independent-adversary and the quantum-bounded-storage-adversary), Chung, Li and Wu (for the general-entangled-adversary) and Arnon-Friedman, Portmann and Scholz (for the quantum-Markov-adversary). One… ▽ More Multi-source-extractors are functions that extract uniform randomness from multiple (weak) sources of randomness. Quantum multi-source-extractors were considered by Kasher and Kempe (for the quantum-independent-adversary and the quantum-bounded-storage-adversary), Chung, Li and Wu (for the general-entangled-adversary) and Arnon-Friedman, Portmann and Scholz (for the quantum-Markov-adversary). One of the main objectives of this work is to unify all the existing quantum multi-source adversary models. We propose two new models of adversaries: 1) the quantum-measurement-adversary (qm-adv), which generates side-information using entanglement and on post-measurement and 2) the quantum-communication-adversary (qc-adv), which generates side-information using entanglement and communication between multiple sources. We show that, 1. qm-adv is the strongest adversary among all the known adversaries, in the sense that the side-information of all other adversaries can be generated by qm-adv. 2. The (generalized) inner-product function (in fact a general class of two-wise independent functions) continues to work as a good extractor against qm-adv with matching parameters as that of Chor and Goldreich. 3. A non-malleable-extractor proposed by Li (against classical-adversaries) continues to be secure against quantum side-information. This result implies a non-malleable-extractor result of Aggarwal, Chung, Lin and Vidick with uniform seed. We strengthen their result via a completely different proof to make the non-malleable-extractor of Li secure against quantum side-information even when the seed is not uniform. 4. A modification (working with weak sources instead of uniform sources) of the Dodis and Wichs protocol for privacy-amplification is secure against active quantum adversaries. This strengthens on a recent result due to Aggarwal, Chung, Lin and Vidick which uses uniform sources. △ Less

Submitted 6 June, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

arXiv:2106.02325 [pdf, other]

ERICA: An Empathetic Android Companion for Covid-19 Quarantine

Authors: Etsuko Ishii, Genta Indra Winata, Samuel Cahyawijaya, Divesh Lala, Tatsuya Kawahara, Pascale Fung

Abstract: Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects… ▽ More Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects of the user interface, a web-based virtual agent called Nora vs. the android ERICA via a video call. The experimental results show that the android offers a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Comments: Accepted in SIGDIAL 2021

arXiv:2105.06058 [pdf, other]

DataExposer: Exposing Disconnect between Data and Systems

Authors: Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava

Abstract: As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debuggi… ▽ More As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debugging, which aims to find bugs in the mechanism (source code or runtime conditions), our goal is to debug the data to identify potential sources of disconnect between the assumptions about the data and the systems that operate on that data. Specifically, we seek which properties of the data cause a data-driven system to malfunction. We propose DataExposer, a framework to identify data properties, called profiles, that are the root causes of performance degradation or failure of a system that operates on the data. Such identification is necessary to repair the system and resolve the disconnect between data and system. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataExposer alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataExposer reports causally verified root causes, in terms of data profiles, of the system malfunction. We empirically evaluate DataExposer on three real-world and several synthetic data-driven systems that fail on datasets due to a diverse set of reasons. In all cases, DataExposer identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques. △ Less

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2105.00403 [pdf, other]

Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview

Authors: Tatsuya Kawahara, Koji Inoue, Divesh Lala

Abstract: Following the success of spoken dialogue systems (SDS) in smartphone assistants and smart speakers, a number of communicative robots are developed and commercialized. Compared with the conventional SDSs designed as a human-machine interface, interaction with robots is expected to be in a closer manner to talking to a human because of the anthropomorphism and physical presence. The goal or task of… ▽ More Following the success of spoken dialogue systems (SDS) in smartphone assistants and smart speakers, a number of communicative robots are developed and commercialized. Compared with the conventional SDSs designed as a human-machine interface, interaction with robots is expected to be in a closer manner to talking to a human because of the anthropomorphism and physical presence. The goal or task of dialogue may not be information retrieval, but the conversation itself. In order to realize human-level "long and deep" conversation, we have developed an intelligent conversational android ERICA. We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating. To allow for spontaneous, incremental multiple utterances, a robust turn-taking model is implemented based on TRP (transition-relevance place) prediction, and a variety of backchannels are generated based on time frame-wise prediction instead of IPU-based prediction. We have realized an open-domain attentive listening system with partial repeats and elaborating questions on focus words as well as assessment responses. It has been evaluated with 40 senior people, engaged in conversation of 5-7 minutes without a conversation breakdown. It was also compared against the WOZ setting. We have also realized a job interview system with a set of base questions followed by dynamic generation of elaborating questions. It has also been evaluated with student subjects, showing promising results. △ Less

Submitted 2 May, 2021; originally announced May 2021.

Comments: 7 pages, 5 figures, 1 table

arXiv:2104.06576 [pdf, ps, other]

Dimension-Preserving Reductions Between SVP and CVP in Different $p$-Norms

Authors: Divesh Aggarwal, Yanlin Chen, Rajendra Kumar, Zeyong Li, Noah Stephens-Davidowitz

Abstract: $ \newcommand{\SVP}{\textsf{SVP}} \newcommand{\CVP}{\textsf{CVP}} \newcommand{\eps}{\varepsilon} $We show a number of reductions between the Shortest Vector Problem and the Closest Vector Problem over lattices in different $\ell_p$ norms ($\SVP_p$ and $\CVP_p$ respectively). Specifically, we present the following $2^{\eps m}$-time reductions for $1 \leq p \leq q \leq \infty… ▽ More $ \newcommand{\SVP}{\textsf{SVP}} \newcommand{\CVP}{\textsf{CVP}} \newcommand{\eps}{\varepsilon} $We show a number of reductions between the Shortest Vector Problem and the Closest Vector Problem over lattices in different $\ell_p$ norms ($\SVP_p$ and $\CVP_p$ respectively). Specifically, we present the following $2^{\eps m}$-time reductions for $1 \leq p \leq q \leq \infty$, which all increase the rank $n$ and dimension $m$ of the input lattice by at most one: $\bullet$ a reduction from $\widetilde{O}(1/\eps^{1/p})γ$-approximate $\SVP_q$ to $γ$-approximate $\SVP_p$; $\bullet$ a reduction from $\widetilde{O}(1/\eps^{1/p}) γ$-approximate $\CVP_p$ to $γ$-approximate $\CVP_q$; and $\bullet$ a reduction from $\widetilde{O}(1/\eps^{1+1/p})$-$\CVP_q$ to $(1+\eps)$-unique $\SVP_p$ (which in turn trivially reduces to $(1+\eps)$-approximate $\SVP_p$). The last reduction is interesting even in the case $p = q$. In particular, this special case subsumes much prior work adapting $2^{O(m)}$-time $\SVP_p$ algorithms to solve $O(1)$-approximate $\CVP_p$. In the (important) special case when $p = q$, $1 \leq p \leq 2$, and the $\SVP_p$ oracle is exact, we show a stronger reduction, from $O(1/\eps^{1/p})\text{-}\CVP_p$ to (exact) $\SVP_p$ in $2^{\eps m}$ time. For example, taking $\eps = \log m/m$ and $p = 2$ gives a slight improvement over Kannan's celebrated polynomial-time reduction from $\sqrt{m}\text{-}\CVP_2$ to $\SVP_2$. We also note that the last two reductions can be combined to give a reduction from approximate-$\CVP_p$ to $\SVP_q$ for any $p$ and $q$, regardless of whether $p \leq q$ or $p > q$. Our techniques combine those from the recent breakthrough work of Eisenbrand and Venzin (which showed how to adapt the current fastest known algorithm for these problems in the $\ell_2$ norm to all $\ell_p$ norms) together with sparsification-based techniques. △ Less

Submitted 13 April, 2021; originally announced April 2021.

arXiv:2101.11259 [pdf, other]

Alaska: A Flexible Benchmark for Data Integration Tasks

Authors: Valter Crescenzi, Andrea De Angelis, Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, Divesh Srivastava

Abstract: Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task… ▽ More Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on real-world dataset to support seamlessly multiple tasks (and their variants) of the data integration pipeline. The dataset consists of ~70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions. △ Less

Submitted 3 February, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

arXiv:2101.02174 [pdf, other]

Efficient Discovery of Approximate Order Dependencies

Authors: Reza Karegar, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava, Jaroslaw Szlichta

Abstract: Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing method… ▽ More Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods for AODs, and prove that it is correct and has optimal runtime. By replacing the validation step in a leading algorithm for approximate OD discovery with ours, we achieve orders-of-magnitude improvements in performance. △ Less

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2012.04117 [pdf, other]

Local Dampening: Differential Privacy for Non-numeric Queries via Local Sensitivity

Authors: Victor A. E. Farias, Felipe T. Brito, Cheryl Flynn, Javam C. Machado, Subhabrata Majumdar, Divesh Srivastava

Abstract: Differential privacy is the state-of-the-art formal definition for data release under strong privacy guarantees. A variety of mechanisms have been proposed in the literature for releasing the output of numeric queries (e.g., the Laplace mechanism and smooth sensitivity mechanism). Those mechanisms guarantee differential privacy by adding noise to the true query's output. The amount of noise added… ▽ More Differential privacy is the state-of-the-art formal definition for data release under strong privacy guarantees. A variety of mechanisms have been proposed in the literature for releasing the output of numeric queries (e.g., the Laplace mechanism and smooth sensitivity mechanism). Those mechanisms guarantee differential privacy by adding noise to the true query's output. The amount of noise added is calibrated by the notions of global sensitivity and local sensitivity of the query that measure the impact of the addition or removal of an individual on the query's output. Mechanisms that use local sensitivity add less noise and, consequently, have a more accurate answer. However, although there has been some work on generic mechanisms for releasing the output of non-numeric queries using global sensitivity (e.g., the Exponential mechanism), the literature lacks generic mechanisms for releasing the output of non-numeric queries using local sensitivity to reduce the noise in the query's output. In this work, we remedy this shortcoming and present the local dampening mechanism. We adapt the notion of local sensitivity for the non-numeric setting and leverage it to design a generic non-numeric mechanism. We provide theoretical comparisons to the exponential mechanism and show under which conditions the local dampening mechanism is more accurate than the exponential mechanism. We illustrate the effectiveness of the local dampening mechanism by applying it to three diverse problems: (i) percentile selection problem. We report the p-th element in the database; (ii) Influential node analysis. Given an influence metric, we release the top-k most influential nodes while preserving the privacy of the relationship between nodes in the network; (iii) Decision tree induction. We provide a private adaptation to the ID3 algorithm to build decision trees from a given tabular dataset. △ Less

Submitted 14 April, 2022; v1 submitted 7 December, 2020; originally announced December 2020.

arXiv:2010.08886 [pdf, other]

PPL Bench: Evaluation Framework For Probabilistic Programming Languages

Authors: Sourabh Kulkarni, Kinjal Divesh Shah, Nimar Arora, Xiaoyan Wang, Yucen Lily Li, Nazanin Khosravani Tehrani, Michael Tingley, David Noursi, Narjes Torabi, Sepehr Akhavan Masouleh, Eric Lippert, Erik Meijer

Abstract: We introduce PPL Bench, a new benchmark for evaluating Probabilistic Programming Languages (PPLs) on a variety of statistical models. The benchmark includes data generation and evaluation code for a number of models as well as implementations in some common PPLs. All of the benchmark code and PPL implementations are available on Github. We welcome contributions of new models and PPLs and as well a… ▽ More We introduce PPL Bench, a new benchmark for evaluating Probabilistic Programming Languages (PPLs) on a variety of statistical models. The benchmark includes data generation and evaluation code for a number of models as well as implementations in some common PPLs. All of the benchmark code and PPL implementations are available on Github. We welcome contributions of new models and PPLs and as well as improvements in existing PPL implementations. The purpose of the benchmark is two-fold. First, we want researchers as well as conference reviewers to be able to evaluate improvements in PPLs in a standardized setting. Second, we want end users to be able to pick the PPL that is most suited for their modeling application. In particular, we are interested in evaluating the accuracy and speed of convergence of the inferred posterior. Each PPL only needs to provide posterior samples given a model and observation data. The framework automatically computes and plots growth in predictive log-likelihood on held out data in addition to reporting other common metrics such as effective sample size and $\hat{r}$. △ Less

Submitted 17 October, 2020; originally announced October 2020.

Comments: 6 pages, PROBPROG 2020

arXiv:2007.09556 [pdf, ps, other]

A $2^{n/2}$-Time Algorithm for $\sqrt{n}$-SVP and $\sqrt{n}$-Hermite SVP, and an Improved Time-Approximation Tradeoff for (H)SVP

Authors: Divesh Aggarwal, Zeyong Li, Noah Stephens-Davidowitz

Abstract: We show a $2^{n/2+o(n)}$-time algorithm that finds a (non-zero) vector in a lattice $\mathcal{L} \subset \mathbb{R}^n$ with norm at most $\tilde{O}(\sqrt{n})\cdot \min\{λ_1(\mathcal{L}), \det(\mathcal{L})^{1/n}\}$, where $λ_1(\mathcal{L})$ is the length of a shortest non-zero lattice vector and $\det(\mathcal{L})$ is the lattice determinant. Minkowski showed that… ▽ More We show a $2^{n/2+o(n)}$-time algorithm that finds a (non-zero) vector in a lattice $\mathcal{L} \subset \mathbb{R}^n$ with norm at most $\tilde{O}(\sqrt{n})\cdot \min\{λ_1(\mathcal{L}), \det(\mathcal{L})^{1/n}\}$, where $λ_1(\mathcal{L})$ is the length of a shortest non-zero lattice vector and $\det(\mathcal{L})$ is the lattice determinant. Minkowski showed that $λ_1(\mathcal{L}) \leq \sqrt{n} \det(\mathcal{L})^{1/n}$ and that there exist lattices with $λ_1(\mathcal{L}) \geq Ω(\sqrt{n}) \cdot \det(\mathcal{L})^{1/n}$, so that our algorithm finds vectors that are as short as possible relative to the determinant (up to a polylogarithmic factor). The main technical contribution behind this result is new analysis of (a simpler variant of) an algorithm from arXiv:1412.7994, which was only previously known to solve less useful problems. To achieve this, we rely crucially on the ``reverse Minkowski theorem'' (conjectured by Dadush arXiv:1606.06913 and proven by arXiv:1611.05979), which can be thought of as a partial converse to the fact that $λ_1(\mathcal{L}) \leq \sqrt{n} \det(\mathcal{L})^{1/n}$. Previously, the fastest known algorithm for finding such a vector was the $2^{.802n + o(n)}$-time algorithm due to [Liu, Wang, Xu, and Zheng, 2011], which actually found a non-zero lattice vector with length $O(1) \cdot λ_1(\mathcal{L})$. Though we do not show how to find lattice vectors with this length in time $2^{n/2+o(n)}$, we do show that our algorithm suffices for the most important application of such algorithms: basis reduction. In particular, we show a modified version of Gama and Nguyen's slide-reduction algorithm [Gama and Nguyen, STOC 2008], which can be combined with the algorithm above to improve the time-length tradeoff for shortest-vector algorithms in nearly all regimes, including the regimes relevant to cryptography. △ Less

Submitted 18 July, 2020; originally announced July 2020.

arXiv:2005.14326 [pdf, other]

doi 10.1007/s00778-021-00656-7

Efficient and Effective ER with Progressive Blocking

Authors: Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

Abstract: Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this pap… ▽ More Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as a guidance at every round. We formally prove that pBlocking converges efficiently ($O(n log^2 n)$ time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process up to 60%. △ Less

Submitted 16 March, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

Comments: Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal (2021)

arXiv:2005.14068 [pdf, other]

Discovering Domain Orders through Order Dependencies

Authors: Reza Karegar, Melicaalsadat Mirsafian, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava, Jaroslaw Szlichta

Abstract: Much real-world data come with explicitly defined domain orders; e.g., lexicographic order for strings, numeric for integers, and chronological for time. Our goal is to discover implicit domain orders that we do not already know; for instance, that the order of months in the Chinese Lunar calendar is Corner < Apricot < Peach. To do so, we enhance data profiling methods by discovering implicit doma… ▽ More Much real-world data come with explicitly defined domain orders; e.g., lexicographic order for strings, numeric for integers, and chronological for time. Our goal is to discover implicit domain orders that we do not already know; for instance, that the order of months in the Chinese Lunar calendar is Corner < Apricot < Peach. To do so, we enhance data profiling methods by discovering implicit domain orders in data through order dependencies. We enumerate tractable special cases and proceed towards the most general case, which we prove is NP-complete. We show that the general case nevertheless can be effectively handled by a SAT solver. We also devise an interestingness measure to rank the discovered implicit domain orders, which we validate with a user study. Based on an extensive suite of experiments with real-world data, we establish the efficacy of our algorithms, and the utility of the domain orders discovered by demonstrating significant added value in three applications (data profiling, query optimization, and data mining). △ Less

Submitted 7 September, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

arXiv:2005.11654 [pdf, ps, other]

A Note on the Concrete Hardness of the Shortest Independent Vectors Problem in Lattices

Authors: Divesh Aggarwal, Eldon Chung

Abstract: Blömer and Seifert showed that $\mathsf{SIVP}_2$ is NP-hard to approximate by giving a reduction from $\mathsf{CVP}_2$ to $\mathsf{SIVP}_2$ for constant approximation factors as long as the $\mathsf{CVP}$ instance has a certain property. In order to formally define this requirement on the $\mathsf{CVP}$ instance, we introduce a new computational problem called the Gap Closest Vector Problem with B… ▽ More Blömer and Seifert showed that $\mathsf{SIVP}_2$ is NP-hard to approximate by giving a reduction from $\mathsf{CVP}_2$ to $\mathsf{SIVP}_2$ for constant approximation factors as long as the $\mathsf{CVP}$ instance has a certain property. In order to formally define this requirement on the $\mathsf{CVP}$ instance, we introduce a new computational problem called the Gap Closest Vector Problem with Bounded Minima. We adapt the proof of Blömer and Seifert to show a reduction from the Gap Closest Vector Problem with Bounded Minima to $\mathsf{SIVP}$ for any $\ell_p$ norm for some constant approximation factor greater than $1$. In a recent result, Bennett, Golovnev and Stephens-Davidowitz showed that under Gap-ETH, there is no $2^{o(n)}$-time algorithm for approximating $\mathsf{CVP}_p$ up to some constant factor $γ\geq 1$ for any $1 \leq p \leq \infty$. We observe that the reduction in their paper can be viewed as a reduction from $\mathsf{Gap3SAT}$ to the Gap Closest Vector Problem with Bounded Minima. This, together with the above mentioned reduction, implies that, under Gap-ETH, there is no $2^{o(n)}$-time algorithm for approximating $\mathsf{SIVP}_p$ up to some constant factor $γ\geq 1$ for any $1 \leq p \leq \infty$. △ Less

Submitted 31 October, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

arXiv:2004.04908 [pdf, ps, other]

Designing Precise and Robust Dialogue Response Evaluators

Authors: Tianyu Zhao, Divesh Lala, Tatsuya Kawahara

Abstract: Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental… ▽ More Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in https://github.com/ZHAOTING/dialog-processing. △ Less

Submitted 24 April, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

Comments: Accepted at ACL 2020

arXiv:2002.07955 [pdf, ps, other]

Improved Classical and Quantum Algorithms for the Shortest Vector Problem via Bounded Distance Decoding

Authors: Divesh Aggarwal, Yanlin Chen, Rajendra Kumar, Yixin Shen

Abstract: The most important computational problem on lattices is the Shortest Vector Problem (SVP). In this paper, we present new algorithms that improve the state-of-the-art for provable classical/quantum algorithms for SVP. We present the following results. $\bullet$ A new algorithm for SVP that provides a smooth tradeoff between time complexity and memory requirement. For any positive integer… ▽ More The most important computational problem on lattices is the Shortest Vector Problem (SVP). In this paper, we present new algorithms that improve the state-of-the-art for provable classical/quantum algorithms for SVP. We present the following results. $\bullet$ A new algorithm for SVP that provides a smooth tradeoff between time complexity and memory requirement. For any positive integer $4\leq q\leq \sqrt{n}$, our algorithm takes $q^{13n+o(n)}$ time and requires $poly(n)\cdot q^{16n/q^2}$ memory. This tradeoff which ranges from enumeration ($q=\sqrt{n}$) to sieving ($q$ constant), is a consequence of a new time-memory tradeoff for Discrete Gaussian sampling above the smoothing parameter. $\bullet$ A quantum algorithm for SVP that runs in time $2^{0.950n+o(n)}$ and requires $2^{0.5n+o(n)}$ classical memory and poly(n) qubits. In Quantum Random Access Memory (QRAM) model this algorithm takes only $2^{0.835n+o(n)}$ time and requires a QRAM of size $2^{0.293n+o(n)}$, poly(n) qubits and $2^{0.5n}$ classical space. This improves over the previously fastest classical (which is also the fastest quantum) algorithm due to [ADRS15] that has a time and space complexity $2^{n+o(n)}$. $\bullet$ A classical algorithm for SVP that runs in time $2^{1.669n+o(n)}$ time and $2^{0.5n+o(n)}$ space. This improves over an algorithm of [CCL18] that has the same space complexity. The time complexity of our classical and quantum algorithms are obtained using a known upper bound on a quantity related to the lattice kissing number which is $2^{0.402n}$. We conjecture that for most lattices this quantity is a $2^{o(n)}$. Assuming that this is the case, our classical algorithm runs in time $2^{1.292n+o(n)}$, our quantum algorithm runs in time $2^{0.750n+o(n)}$ and our quantum algorithm in QRAM model runs in time $2^{0.667n+o(n)}$. △ Less

Submitted 10 May, 2022; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: Faster Quantum Algorithm for SVP in QRAM, 42 pages

arXiv:2001.05567 [pdf, other]

Newtonian Monte Carlo: single-site MCMC meets second-order gradient methods

Authors: Nimar S. Arora, Nazanin Khosravani Tehrani, Kinjal Divesh Shah, Michael Tingley, Yucen Lily Li, Narjes Torabi, David Noursi, Sepehr Akhavan Masouleh, Eric Lippert, Erik Meijer

Abstract: Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in which a single coordinate in the state space is modified in each step. Structured relational models are a good candidate for this style of inference. In the single-site context, second order methods become feasible because the typical cubic costs associated with these methods is now restricted to the dimension of each coordinate.… ▽ More Single-site Markov Chain Monte Carlo (MCMC) is a variant of MCMC in which a single coordinate in the state space is modified in each step. Structured relational models are a good candidate for this style of inference. In the single-site context, second order methods become feasible because the typical cubic costs associated with these methods is now restricted to the dimension of each coordinate. Our work, which we call Newtonian Monte Carlo (NMC), is a method to improve MCMC convergence by analyzing the first and second order gradients of the target density to determine a suitable proposal density at each point. Existing first order gradient-based methods suffer from the problem of determining an appropriate step size. Too small a step size and it will take a large number of steps to converge, while a very large step size will cause it to overshoot the high density region. NMC is similar to the Newton-Raphson update in optimization where the second order gradient is used to automatically scale the step size in each dimension. However, our objective is to find a parameterized proposal density rather than the maxima. As a further improvement on existing first and second order methods, we show that random variables with constrained supports don't need to be transformed before taking a gradient step. We demonstrate the efficiency of NMC on a number of different domains. For statistical models where the prior is conjugate to the likelihood, our method recovers the posterior quite trivially in one step. However, we also show results on fairly large non-conjugate models, where NMC performs better than adaptive first order methods such as NUTS or other inexact scalable inference methods such as Stochastic Variational Inference or bootstrapping. △ Less

Submitted 15 January, 2020; originally announced January 2020.

Comments: StarAI has a 6 page limit excluding references

arXiv:1911.02440 [pdf, other]

Fine-grained hardness of CVP(P) -- Everything that we can prove (and nothing else)

Authors: Divesh Aggarwal, Huck Bennett, Alexander Golovnev, Noah Stephens-Davidowitz

Abstract: We show a number of fine-grained hardness results for the Closest Vector Problem in the $\ell_p$ norm ($\mathrm{CVP}_p$), and its approximate and non-uniform variants. First, we show that $\mathrm{CVP}_p$ cannot be solved in $2^{(1-\varepsilon)n}$ time for all $p \notin 2\mathbb{Z}$ and $\varepsilon > 0$, assuming the Strong Exponential Time Hypothesis (SETH). Second, we extend this by showing tha… ▽ More We show a number of fine-grained hardness results for the Closest Vector Problem in the $\ell_p$ norm ($\mathrm{CVP}_p$), and its approximate and non-uniform variants. First, we show that $\mathrm{CVP}_p$ cannot be solved in $2^{(1-\varepsilon)n}$ time for all $p \notin 2\mathbb{Z}$ and $\varepsilon > 0$, assuming the Strong Exponential Time Hypothesis (SETH). Second, we extend this by showing that there is no $2^{(1-\varepsilon)n}$-time algorithm for approximating $\mathrm{CVP}_p$ to within a constant factor $γ$ for such $p$ assuming a "gap" version of SETH, with an explicit relationship between $γ$, $p$, and the arity $k = k(\varepsilon)$ of the underlying hard CSP. Third, we show the same hardness result for (exact) $\mathrm{CVP}_p$ with preprocessing (assuming non-uniform SETH). For exact "plain" $\mathrm{CVP}_p$, the same hardness result was shown in [Bennett, Golovnev, and Stephens-Davidowitz FOCS 2017] for all but finitely many $p \notin 2\mathbb{Z}$, where the set of exceptions depended on $\varepsilon$ and was not explicit. For the approximate and preprocessing problems, only very weak bounds were known prior to this work. We also show that the restriction to $p \notin 2\mathbb{Z}$ is in some sense inherent. In particular, we show that no "natural" reduction can rule out even a $2^{3n/4}$-time algorithm for $\mathrm{CVP}_2$ under SETH. For this, we prove that the possible sets of closest lattice vectors to a target in the $\ell_2$ norm have quite rigid structure, which essentially prevents them from being as expressive as $3$-CNFs. We prove these results using techniques from many different fields, including complex analysis, functional analysis, additive combinatorics, and discrete Fourier analysis. E.g., along the way, we give a new (and tighter) proof of Szemerédi's cube lemma for the boolean cube. △ Less

Submitted 7 August, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

arXiv:1910.08678 [pdf, other]

Effective Discovery of Meaningful Outlier Relationships

Authors: Aline Bessa, Juliana Freire, Divesh Srivastava, Tamraparni Dasu

Abstract: We propose PODS (Predictable Outliers in Data-trendS), a method that, given a collection of temporal data sets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been informally framed in terms of explainability. Next, since outliers are rare and it is difficult to determine whether… ▽ More We propose PODS (Predictable Outliers in Data-trendS), a method that, given a collection of temporal data sets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been informally framed in terms of explainability. Next, since outliers are rare and it is difficult to determine whether their relationships are meaningful, we develop a new criterion that does so by checking if these relationships could have been predicted from non-outliers, i.e., if we could see the outlier relationships coming. Finally, searching for meaningful outlier relationships between every pair of data sets in a large data collection is computationally infeasible. To address that, we propose an indexing strategy that prunes irrelevant comparisons across data sets, making the approach scalable. We present the results of an experimental evaluation using real data sets and different baselines, which demonstrates the effectiveness, robustness, and scalability of our approach. △ Less

Submitted 8 April, 2020; v1 submitted 18 October, 2019; originally announced October 2019.

arXiv:1909.02629 [pdf, other]

Random Sampling for Group-By Queries

Authors: Trong Duc Nguyen, Ming-Hung Shih, Sai Sree Parvathaneni, Bojian Xu, Divesh Srivastava, Srikanta Tirthapura

Abstract: Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after… ▽ More Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after filtering through a predicate. The challenge with group-by queries is that a sampling method cannot focus on optimizing the quality of a single answer (e.g. the mean of selected data), but must simultaneously optimize the quality of a set of answers (one per group). We present CVOPT, a query- and data-driven sampling framework for a set of group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm (e.g. $\ell_2$ or $\ell_\infty$) of the coefficients of variation (CVs) of different answers, and constructs a stratified sample that provably optimizes the metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can be tuned to cases when partial information about a query workload is known, such as a data warehouse where queries are scheduled to run periodically. Our experimental results show that CVOPT outperforms current state-of-the-art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are 5x smaller than competing approaches, under the same space budget. △ Less

Submitted 12 September, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

arXiv:1908.03724 [pdf, other]

Slide Reduction, Revisited---Filling the Gaps in SVP Approximation

Authors: Divesh Aggarwal, Jianwei Li, Phong Q. Nguyen, Noah Stephens-Davidowitz

Abstract: We show how to generalize Gama and Nguyen's slide reduction algorithm [STOC '08] for solving the approximate Shortest Vector Problem over lattices (SVP). As a result, we show the fastest provably correct algorithm for $δ$-approximate SVP for all approximation factors $n^{1/2+\varepsilon} \leq δ\leq n^{O(1)}$. This is the range of approximation factors most relevant for cryptography. We show how to generalize Gama and Nguyen's slide reduction algorithm [STOC '08] for solving the approximate Shortest Vector Problem over lattices (SVP). As a result, we show the fastest provably correct algorithm for $δ$-approximate SVP for all approximation factors $n^{1/2+\varepsilon} \leq δ\leq n^{O(1)}$. This is the range of approximation factors most relevant for cryptography. △ Less

Submitted 10 August, 2019; originally announced August 2019.

arXiv:1905.11948 [pdf, other]

ABC of Order Dependencies

Authors: Pei Li, Michael Bohlen, Jaroslaw Szlichta, Divesh Srivastava

Abstract: We enhance constrained-based data quality with approximate band conditional order dependencies (abcODs). Band ODs model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. The class of abcODs generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately (abODs)… ▽ More We enhance constrained-based data quality with approximate band conditional order dependencies (abcODs). Band ODs model the semantics of attributes that are monotonically related with small variations without there being an intrinsic violation of semantics. The class of abcODs generalizes band ODs to make them more relevant to real-world applications by relaxing them to hold approximately (abODs) with some exceptions and conditionally (bcODs) on subsets of the data. We study the problem of automatic dependency discovery over a hierarchy of abcODs. First, we propose a more efficient algorithm to discover abODs than in recent prior work. The algorithm is based on a new optimization to compute a longest monotonic band (longest subsequence of tuples that satisfy a band OD) through dynamic programming by decreasing the runtime from O(n^2) to O(n \log n) time. We then illustrate that while the discovery of bcODs is relatively straightforward, there exist codependencies between approximation and conditioning that make the problem of abcOD discovery challenging. The naive solution is prohibitively expensive as it considers all possible segmentations of tuples resulting in exponential time complexity. To reduce the search space, we devise a dynamic programming algorithm for abcOD discovery that determines the optimal solution in O(n^3 \log n) complexity. To further optimize the performance, we adapt the algorithm to cheaply identify consecutive tuples that are guaranteed to belong to the same band--this improves the performance significantly in practice without losing optimality. While unidirectional abcODs are most common in practice, for generality we extend our algorithms with both ascending and descending orders to discover bidirectional abcODs. Finally, we perform a thorough experimental evaluation of our techniques over real-world and synthetic datasets. △ Less

Submitted 28 February, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.02010 [pdf, other]

Errata Note: Discovering Order Dependencies through Order Compatibility

Authors: Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava, Jaroslaw Szlichta

Abstract: A number of extensions to the classical notion of functional dependencies have been proposed to express and enforce application semantics. One of these extensions is that of order dependencies (ODs), which express rules involving order. The article entitled "Discovering Order Dependencies through Order Compatibility" by Consonni et al., published in the EDBT conference proceedings in March 2019, i… ▽ More A number of extensions to the classical notion of functional dependencies have been proposed to express and enforce application semantics. One of these extensions is that of order dependencies (ODs), which express rules involving order. The article entitled "Discovering Order Dependencies through Order Compatibility" by Consonni et al., published in the EDBT conference proceedings in March 2019, investigates the OD discovery problem. They claim to prove that their OD discovery algorithm, OCDDISCOVER, is complete, as well as being significantly more efficient in practice than the state-of-the-art. They further claim that the implementation of the existing FASTOD algorithm (ours)-we shared our code base with the authors-which they benchmark against is flawed, as OCDDISCOVER and FASTOD report different sets of ODs over the same data sets. In this rebuttal, we show that their claim of completeness is, in fact, not true. Built upon their incorrect claim, OCDDISCOVER's pruning rules are overly aggressive, and prune parts of the search space that contain legitimate ODs. This is the reason their approach appears to be "faster" in practice. Finally, we show that Consonni et al. misinterpret our set-based canonical form for ODs, leading to an incorrect claim that our FASTOD implementation has an error. △ Less

Submitted 6 May, 2019; originally announced May 2019.

Comments: 5

arXiv:1812.10942 [pdf, other]

Answering Range Queries Under Local Differential Privacy

Authors: Tejas Kulkarni, Graham Cormode, Divesh Srivastava

Abstract: Counting the fraction of a population having an input within a specified interval i.e. a \emph{range query}, is a fundamental data analysis primitive. Range queries can also be used to compute other interesting statistics such as \emph{quantiles}, and to build prediction models. However, frequently the data is subject to privacy concerns when it is drawn from individuals, and relates for example t… ▽ More Counting the fraction of a population having an input within a specified interval i.e. a \emph{range query}, is a fundamental data analysis primitive. Range queries can also be used to compute other interesting statistics such as \emph{quantiles}, and to build prediction models. However, frequently the data is subject to privacy concerns when it is drawn from individuals, and relates for example to their financial, health, religious or political status. In this paper, we introduce and analyze methods to support range queries under the local variant of differential privacy, an emerging standard for privacy-preserving data analysis. The local model requires that each user releases a noisy view of her private data under a privacy guarantee. While many works address the problem of range queries in the trusted aggregator setting, this problem has not been addressed specifically under untrusted aggregation (local DP) model even though many primitives have been developed recently for estimating a discrete distribution. We describe and analyze two classes of approaches for range queries, based on hierarchical histograms and the Haar wavelet transform. We show that both have strong theoretical accuracy guarantees on variance. In practice, both methods are fast and require minimal computation and communication resources. Our experiments show that the wavelet approach is most accurate in high privacy settings, while the hierarchical approach dominates for weaker privacy requirements. △ Less

Submitted 31 December, 2018; v1 submitted 28 December, 2018; originally announced December 2018.

arXiv:1801.09039 [pdf, other]

Variance-Optimal Offline and Streaming Stratified Random Sampling

Authors: Trong Duc Nguyen, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, Bojian Xu

Abstract: Stratified random sampling (SRS) is a fundamental sampling technique that provides accurate estimates for aggregate queries using a small size sample, and has been used widely for approximate query processing. A key question in SRS is how to partition a target sample size among different strata. While Neyman allocation provides a solution that minimizes the variance of an estimate using this sampl… ▽ More Stratified random sampling (SRS) is a fundamental sampling technique that provides accurate estimates for aggregate queries using a small size sample, and has been used widely for approximate query processing. A key question in SRS is how to partition a target sample size among different strata. While Neyman allocation provides a solution that minimizes the variance of an estimate using this sample, it works under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. This assumption may not hold in general: one or more strata may be bounded, and may not contain a large number of data points, even though the total data size may be large. We first present VOILA, an offline method for allocating sample sizes to strata in a variance-optimal manner, even for the case when one or more strata may be bounded. We next consider SRS on streaming data that are continuously arriving. We show a lower bound, that any streaming algorithm for SRS must have (in the worst case) a variance that is Ω(r) factor away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS that is locally variance-optimal in its allocation of sample sizes to different strata. Our result from experiments on real and synthetic data show that VOILA can have significantly (1.4 to 50.0 times) smaller variance than Neyman allocation. The streaming algorithm S-VOILA results in a variance that is typically close to VOILA, which was given the entire input beforehand. △ Less

Submitted 20 February, 2018; v1 submitted 27 January, 2018; originally announced January 2018.

arXiv:1801.02358 [pdf, ps, other]

Improved algorithms for the Shortest Vector Problem and the Closest Vector Problem in the infinity norm

Authors: Divesh Aggarwal, Priyanka Mukhopadhyay

Abstract: Blomer and Naewe[BN09] modified the randomized sieving algorithm of Ajtai, Kumar and Sivakumar[AKS01] to solve the shortest vector problem (SVP). The algorithm starts with $N = 2^{O(n)}$ randomly chosen vectors in the lattice and employs a sieving procedure to iteratively obtain shorter vectors in the lattice. The running time of the sieving procedure is quadratic in $N$. We study this problem f… ▽ More Blomer and Naewe[BN09] modified the randomized sieving algorithm of Ajtai, Kumar and Sivakumar[AKS01] to solve the shortest vector problem (SVP). The algorithm starts with $N = 2^{O(n)}$ randomly chosen vectors in the lattice and employs a sieving procedure to iteratively obtain shorter vectors in the lattice. The running time of the sieving procedure is quadratic in $N$. We study this problem for the special but important case of the $\ell_\infty$ norm. We give a new sieving procedure that runs in time linear in $N$, thereby significantly improving the running time of the algorithm for SVP in the $\ell_\infty$ norm. As in [AKS02,BN09], we also extend this algorithm to obtain significantly faster algorithms for approximate versions of the shortest vector problem and the closest vector problem (CVP) in the $\ell_\infty$ norm. We also show that the heuristic sieving algorithms of Nguyen and Vidick[NV08] and Wang et al.[WLTB11] can also be analyzed in the $\ell_{\infty}$ norm. The main technical contribution in this part is to calculate the expected volume of intersection of a unit ball centred at origin and another ball of a different radius centred at a uniformly random point on the boundary of the unit ball. This might be of independent interest. △ Less

Submitted 15 May, 2018; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: Changed the title

arXiv:1712.00942 [pdf, other]

doi 10.1145/3188745.3188840

(Gap/S)ETH Hardness of SVP

Authors: Divesh Aggarwal, Noah Stephens-Davidowitz

Abstract: $ \newcommand{\problem}[1]{\ensuremath{\mathrm{#1}} } \newcommand{\SVP}{\problem{SVP}} \newcommand{\ensuremath}[1]{#1} $We prove the following quantitative hardness results for the Shortest Vector Problem in the $\ell_p$ norm ($\SVP_p$), where $n$ is the rank of the input lattice. $\bullet$ For "almost all" $p > p_0 \approx 2.1397$, there no $2^{n/C_p}$-time algorithm for $\SVP_p… ▽ More $ \newcommand{\problem}[1]{\ensuremath{\mathrm{#1}} } \newcommand{\SVP}{\problem{SVP}} \newcommand{\ensuremath}[1]{#1} $We prove the following quantitative hardness results for the Shortest Vector Problem in the $\ell_p$ norm ($\SVP_p$), where $n$ is the rank of the input lattice. $\bullet$ For "almost all" $p > p_0 \approx 2.1397$, there no $2^{n/C_p}$-time algorithm for $\SVP_p$ for some explicit constant $C_p > 0$ unless the (randomized) Strong Exponential Time Hypothesis (SETH) is false. $\bullet$ For any $p > 2$, there is no $2^{o(n)}$-time algorithm for $\SVP_p$ unless the (randomized) Gap-Exponential Time Hypothesis (Gap-ETH) is false. Furthermore, for each $p > 2$, there exists a constant $γ_p > 1$ such that the same result holds even for $γ_p$-approximate $\SVP_p$. $\bullet$ There is no $2^{o(n)}$-time algorithm for $\SVP_p$ for any $1 \leq p \leq 2$ unless either (1) (non-uniform) Gap-ETH is false; or (2) there is no family of lattices with exponential kissing number in the $\ell_2$ norm. Furthermore, for each $1 \leq p \leq 2$, there exists a constant $γ_p > 1$ such that the same result holds even for $γ_p$-approximate $\SVP_p$. △ Less

Submitted 4 December, 2017; originally announced December 2017.

Journal ref: STOC 2018

arXiv:1711.02952 [pdf, other]

Marginal Release Under Local Differential Privacy

Authors: Tejas Kulkarni, Graham Cormode, Divesh Srivastava

Abstract: Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the s… ▽ More Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the strong model of local differential privacy. We prove the first tight theoretical bounds on the accuracy of marginals compiled under each approach, perform empirical evaluation to confirm these bounds, and evaluate them for tasks such as modeling and correlation testing. Our results show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals. △ Less

Submitted 8 November, 2017; originally announced November 2017.

arXiv:1710.00608 [pdf, other]

Constrained Differential Privacy for Count Data

Authors: Graham Cormode, Tejas Kulkarni, Divesh Srivastava

Abstract: Concern about how to aggregate sensitive user data without compromising individual privacy is a major barrier to greater availability of data. The model of differential privacy has emerged as an accepted model to release sensitive information while giving a statistical guarantee for privacy. Many different algorithms are possible to address different target functions. We focus on the core problem… ▽ More Concern about how to aggregate sensitive user data without compromising individual privacy is a major barrier to greater availability of data. The model of differential privacy has emerged as an accepted model to release sensitive information while giving a statistical guarantee for privacy. Many different algorithms are possible to address different target functions. We focus on the core problem of count queries, and seek to design mechanisms to release data associated with a group of n individuals. Prior work has focused on designing mechanisms by raw optimization of a loss function, without regard to the consequences on the results. This can leads to mechanisms with undesirable properties, such as never reporting some outputs (gaps), and overreporting others (spikes). We tame these pathological behaviors by introducing a set of desirable properties that mechanisms can obey. Any combination of these can be satisfied by solving a linear program (LP) which minimizes a cost function, with constraints enforcing the properties. We focus on a particular cost function, and provide explicit constructions that are optimal for certain combinations of properties, and show a closed form for their cost. In the end, there are only a handful of distinct optimal mechanisms to choose between: one is the well-known (truncated) geometric mechanism; the second a novel mechanism that we introduce here, and the remainder are found as the solution to particular LPs. These all avoid the bad behaviors we identify. We demonstrate in a set of experiments on real and synthetic data which is preferable in practice, for different combinations of data distributions, constraints, and privacy parameters. △ Less

Submitted 2 October, 2017; originally announced October 2017.

arXiv:1710.00557 [pdf, ps, other]

A Quantum-Proof Non-Malleable Extractor, With Application to Privacy Amplification against Active Quantum Adversaries

Authors: Divesh Aggarwal, Kai-Min Chung, Han-Hsuan Lin, Thomas Vidick

Abstract: In privacy amplification, two mutually trusted parties aim to amplify the secrecy of an initial shared secret $X$ in order to establish a shared private key $K$ by exchanging messages over an insecure communication channel. If the channel is authenticated the task can be solved in a single round of communication using a strong randomness extractor; choosing a quantum-proof extractor allows one to… ▽ More In privacy amplification, two mutually trusted parties aim to amplify the secrecy of an initial shared secret $X$ in order to establish a shared private key $K$ by exchanging messages over an insecure communication channel. If the channel is authenticated the task can be solved in a single round of communication using a strong randomness extractor; choosing a quantum-proof extractor allows one to establish security against quantum adversaries. In the case that the channel is not authenticated, Dodis and Wichs (STOC'09) showed that the problem can be solved in two rounds of communication using a non-malleable extractor, a stronger pseudo-random construction than a strong extractor. We give the first construction of a non-malleable extractor that is secure against quantum adversaries. The extractor is based on a construction by Li (FOCS'12), and is able to extract from source of min-entropy rates larger than $1/2$. Combining this construction with a quantum-proof variant of the reduction of Dodis and Wichs, shown by Cohen and Vidick (unpublished), we obtain the first privacy amplification protocol secure against active quantum adversaries. △ Less

Submitted 14 February, 2018; v1 submitted 2 October, 2017; originally announced October 2017.

arXiv:1709.10257 [pdf, other]

Detection of social signals for recognizing engagement in human-robot interaction

Authors: Divesh Lala, Koji Inoue, Pierrick Milhorat, Tatsuya Kawahara

Abstract: Detection of engagement during a conversation is an important function of human-robot interaction. The level of user engagement can influence the dialogue strategy of the robot. Our motivation in this work is to detect several behaviors which will be used as social signal inputs for a real-time engagement recognition model. These behaviors are nodding, laughter, verbal backchannels and eye gaze. W… ▽ More Detection of engagement during a conversation is an important function of human-robot interaction. The level of user engagement can influence the dialogue strategy of the robot. Our motivation in this work is to detect several behaviors which will be used as social signal inputs for a real-time engagement recognition model. These behaviors are nodding, laughter, verbal backchannels and eye gaze. We describe models of these behaviors which have been learned from a large corpus of human-robot interactions with the android robot ERICA. Input data to the models comes from a Kinect sensor and a microphone array. Using our engagement recognition model, we can achieve reasonable performance using the inputs from automatic social signal detection, compared to using manual annotation as input. △ Less

Submitted 29 September, 2017; originally announced September 2017.

Comments: AAAI Fall Symposium on Natural Communication for Human-Robot Collaboration, 2017

arXiv:1709.01535 [pdf, ps, other]

Just Take the Average! An Embarrassingly Simple $2^n$-Time Algorithm for SVP (and CVP)

Authors: Divesh Aggarwal, Noah Stephens-Davidowitz

Abstract: We show a $2^{n+o(n)}$-time (and space) algorithm for the Shortest Vector Problem on lattices (SVP) that works by repeatedly running an embarrassingly simple "pair and average" sieving-like procedure on a list of lattice vectors. This matches the running time (and space) of the current fastest known algorithm, due to Aggarwal, Dadush, Regev, and Stephens-Davidowitz (ADRS, in STOC, 2015), with a fa… ▽ More We show a $2^{n+o(n)}$-time (and space) algorithm for the Shortest Vector Problem on lattices (SVP) that works by repeatedly running an embarrassingly simple "pair and average" sieving-like procedure on a list of lattice vectors. This matches the running time (and space) of the current fastest known algorithm, due to Aggarwal, Dadush, Regev, and Stephens-Davidowitz (ADRS, in STOC, 2015), with a far simpler algorithm. Our algorithm is in fact a modification of the ADRS algorithm, with a certain careful rejection sampling step removed. The correctness of our algorithm follows from a more general "meta-theorem," showing that such rejection sampling steps are unnecessary for a certain class of algorithms and use cases. In particular, this also applies to the related $2^{n + o(n)}$-time algorithm for the Closest Vector Problem (CVP), due to Aggarwal, Dadush, and Stephens-Davidowitz (ADS, in FOCS, 2015), yielding a similar embarrassingly simple algorithm for $γ$-approximate CVP for any $γ= 1+2^{-o(n/\log n)}$. (We can also remove the rejection sampling procedure from the $2^{n+o(n)}$-time ADS algorithm for exact CVP, but the resulting algorithm is still quite complicated.) △ Less

Submitted 5 September, 2017; originally announced September 2017.

Journal ref: SOSA 2018

Showing 1–50 of 68 results for author: Divesh