Showing 1–48 of 48 results for author: Tang, N

Searching in archive cs.
  1. arXiv:2408.05109  [pdf, other]

    cs.DB

    A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?

    Authors: Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang

    Abstract: Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering its e…

    Submitted 9 August, 2024; originally announced August 2024.

  2. arXiv:2406.17559  [pdf, other]

    cs.CV

    Minimal Interaction Edge Tuning: A New Paradigm for Visual Adaptation

    Authors: Ningyuan Tang, Minghao Fu, Jianxin Wu

    Abstract: The rapid scaling of large vision pretrained models makes fine-tuning tasks more and more difficult on edge devices with low computational resources. We explore a new visual adaptation paradigm called edge tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers. The fine-tuning is carried out on edge devices with small networks which require lo…

    Submitted 25 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 9 pages

  3. arXiv:2406.11131  [pdf, other]

    cs.CL cs.AI cs.DB

    Are Large Language Models a Good Replacement of Taxonomies?

    Authors: Yushi Sun, Hao Xin, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen

    Abstract: Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask…

    Submitted 20 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: Accepted by VLDB 2024

  4. arXiv:2406.11033  [pdf, other]

    cs.DB cs.AI

    HAIChart: Human and AI Paired Visualization System

    Authors: Yupeng Xie, Yuyu Luo, Guoliang Li, Nan Tang

    Abstract: The growing importance of data visualization in business intelligence and data science emphasizes the need for tools that can efficiently generate meaningful visualizations from large datasets. Existing tools fall into two main categories: human-powered tools (e.g., Tableau and PowerBI), which require intensive expert involvement, and AI-powered automated tools (e.g., Draco and Table2Charts), whic…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 16 pages, 14 figures, 7 tables

  5. arXiv:2406.07815  [pdf, other]

    cs.CL cs.AI

    Are Large Language Models Good Statisticians?

    Authors: Yizhang Zhu, Shiyin Du, Boyan Li, Yuyu Luo, Nan Tang

    Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,6…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 31 pages, 10 figures, 19 tables. Work in progress

  6. arXiv:2406.04744  [pdf, other]

    cs.CL

    CRAG -- Comprehensive RAG Benchmark

    Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, et al. (2 additional authors not shown)

    Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the knowledge deficiencies of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering bench…

    Submitted 7 June, 2024; originally announced June 2024.

  7. The Dawn of Natural Language to SQL: Are We Fully Ready?

    Authors: Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang

    Abstract: Translating users' natural language questions into SQL queries (i.e., NL2SQL) significantly lowers the barriers to accessing relational databases. The emergence of Large Language Models has introduced a novel paradigm in NL2SQL tasks, enhancing capabilities dramatically. However, this raises a critical question: Are we fully prepared to deploy NL2SQL models in production? To address the posed qu…

    Submitted 27 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: VLDB 2024

  8. arXiv:2405.18573  [pdf, other]

    cs.SE

    Programmer Visual Attention During Context-Aware Code Summarization

    Authors: Aakash Bansal, Robert Wallace, Zachary Karas, Ningzhi Tang, Yu Huang, Toby Jia-Jun Li, Collin McMillan

    Abstract: Abridged: Programmer attention represents the visual focus of programmers on parts of the source code in pursuit of programming tasks. We conducted an in-depth human study with XY Java programmers, where each programmer generated summaries for 40 methods from five large Java projects over five one-hour sessions. We used eye-tracking equipment to map the visual attention of programmers while they w…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: 10 pages, 4 figures, 4 tables. This is a pre-print submitted to IEEE Transactions on Software Engineering for review

  9. arXiv:2405.17039  [pdf, other]

    cs.CL cs.LG

    BWArea Model: Learning World Model, Inverse Dynamics, and Policy for Controllable Language Generation

    Authors: Chengxing Jia, Pengyuan Wang, Ziniu Li, Yi-Chen Li, Zhilong Zhang, Nan Tang, Yang Yu

    Abstract: Large language models (LLMs) have catalyzed a paradigm shift in natural language processing, yet their limited controllability poses a significant challenge for downstream applications. We aim to address this by drawing inspiration from the neural mechanisms of the human brain, specifically Broca's and Wernicke's areas, which are crucial for language generation and comprehension, respectively. In…

    Submitted 27 May, 2024; originally announced May 2024.

  10. arXiv:2405.16113  [pdf, other]

    cs.LG

    Enabling On-Device Learning via Experience Replay with Efficient Dataset Condensation

    Authors: Gelei Xu, Ningzhi Tang, Jun Xia, Wei Jin, Yiyu Shi

    Abstract: Upon deployment to edge devices, it is often desirable for a model to further learn from streaming data to improve accuracy. However, extracting representative features from such data is challenging because it is typically unlabeled, non-independent and identically distributed (non-i.i.d.), and is seen only once. To mitigate this issue, a common strategy is to maintain a small data buffer on the ed…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: 9 pages, 10 figures

  11. arXiv:2405.16081  [pdf, other]

    cs.SE cs.HC

    A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions

    Authors: Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, Toby Jia-Jun Li

    Abstract: The increasing use of large language model (LLM)-powered code generation tools, such as GitHub Copilot, is transforming software engineering practices. This paper investigates how developers validate and repair code generated by Copilot and examines the impact of code provenance awareness during these processes. We conducted a lab study with 28 participants, who were tasked with validating and rep…

    Submitted 25 May, 2024; originally announced May 2024.

  12. arXiv:2405.07001   

    cs.CL cs.AI cs.CV

    Evaluating Task-based Effectiveness of MLLMs on Charts

    Authors: Yifan Wu, Lutao Yan, Yuyu Luo, Yunhai Wang, Nan Tang

    Abstract: In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. Firstly, we conduct systematic evaluations to understand the capabilit…

    Submitted 17 June, 2024; v1 submitted 11 May, 2024; originally announced May 2024.

    Comments: The experimental part needs to be revised. Withdraw this version

  13. arXiv:2404.09248  [pdf, other]

    cs.LG cs.AI cs.CL

    Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

    Authors: Jing-Cheng Pang, Si-Hang Yang, Kaiyuan Li, Jiaji Zhang, Xiong-Hui Chen, Nan Tang, Yang Yu

    Abstract: Reinforcement learning (RL) trains agents to accomplish complex tasks through environmental interaction data, but its capacity is also limited by the scope of the available data. To obtain a knowledgeable agent, a promising approach is to leverage the knowledge from large language models (LLMs). Despite previous studies combining LLMs with RL, seamless integration of the two components remains cha…

    Submitted 14 April, 2024; originally announced April 2024.

  14. arXiv:2403.17285  [pdf, other]

    stat.ML cs.LG

    An Analysis of Switchback Designs in Reinforcement Learning

    Authors: Qianglin Wen, Chengchun Shi, Ying Yang, Niansheng Tang, Hongtu Zhu

    Abstract: This paper offers a detailed investigation of switchback designs in A/B testing, which alternate between baseline and new policies over time. Our aim is to thoroughly evaluate the effects of these designs on the accuracy of their resulting average treatment effect (ATE) estimators. We propose a novel "weak signal analysis" framework, which substantially simplifies the calculations of the mean squa…

    Submitted 25 March, 2024; originally announced March 2024.

  15. arXiv:2402.04009  [pdf, other]

    cs.CV cs.AI

    Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning

    Authors: Ningyuan Tang, Minghao Fu, Ke Zhu, Jianxin Wu

    Abstract: In finetuning a large pretrained model to downstream tasks, parameter-efficient fine-tuning (PEFT) methods can effectively finetune pretrained models with few trainable parameters, but suffer from high GPU memory consumption and slow training speed. Because learnable parameters from these methods are entangled with the pretrained model, gradients related to the frozen pretrained model's parameters…

    Submitted 6 February, 2024; originally announced February 2024.

  16. arXiv:2402.03719  [pdf, other]

    cs.CL cs.AI

    Empowering Language Models with Active Inquiry for Deeper Understanding

    Authors: Jing-Cheng Pang, Heng-Bo Fan, Pengyuan Wang, Jia-Hao Xiao, Nan Tang, Si-Hang Yang, Chengxing Jia, Sheng-Jun Huang, Yang Yu

    Abstract: The rise of large language models (LLMs) has revolutionized the way that we interact with artificial intelligence systems through natural language. However, LLMs often misinterpret user queries because of their uncertain intention, leading to less helpful responses. In natural human interactions, clarification is sought through targeted questioning to uncover obscure information. Thus, in this pap…

    Submitted 6 February, 2024; originally announced February 2024.

  17. arXiv:2312.03987  [pdf, other]

    cs.CL cs.AI

    Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration

    Authors: Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du

    Abstract: Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions on ER rely on pre-trained language models (PLMs), which require fine-tuning on a lot of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters, whic…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: 14 pages, 7 figures

  18. arXiv:2310.00749  [pdf, other]

    cs.DB cs.LG

    SEED: Domain-Specific Data Curation With Large Language Models

    Authors: Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, Michael Cafarella

    Abstract: Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g. writing domain-specific code or…

    Submitted 24 April, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: preprint, 20 pages, 4 figures

  19. arXiv:2307.02796  [pdf, other]

    cs.DB cs.CL cs.LG

    VerifAI: Verified Generative AI

    Authors: Nan Tang, Chenyu Yang, Ju Fan, Lei Cao, Yuyu Luo, Alon Halevy

    Abstract: Generative AI has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. Such inaccuracies can have serious consequences such as inaccurate decision-making, the spread of false information, privacy violations, legal liabilities, and more. Although efforts to address these risks are underway, including explainable AI and responsible AI practices s…

    Submitted 10 October, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: 8 pages, 4 figures

  20. arXiv:2306.08891  [pdf, other]

    cs.CL

    Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation

    Authors: Zihui Gu, Ju Fan, Nan Tang, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Sam Madden, Xiaoyong Du

    Abstract: Zero-shot NL2SQL is crucial in achieving natural language to SQL that is adaptive to new environments (e.g., new databases, new linguistic phenomena or SQL structures) with zero annotated NL2SQL samples from such environments. Existing approaches either fine-tune pre-trained language models (PLMs) based on annotated data or use prompts to guide fixed large language models (LLMs) such as ChatGPT. P…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Work in progress

  21. arXiv:2304.03540  [pdf, other]

    cs.DB cs.AI cs.HC cs.LG

    ChatPipe: Orchestrating Data Preparation Program by Optimizing Human-ChatGPT Interactions

    Authors: Sibei Chen, Hanbing Liu, Weiting Jin, Xiangyu Sun, Xiaoyao Feng, Ju Fan, Xiaoyong Du, Nan Tang

    Abstract: Orchestrating a high-quality data preparation program is essential for successful machine learning (ML), but it is known to be time and effort consuming. Despite the impressive capabilities of large language models like ChatGPT in generating programs by interacting with users through natural language prompts, there are still limitations. Specifically, a user must provide specific prompts to iterat…

    Submitted 7 April, 2023; originally announced April 2023.

  22. arXiv:2303.16909  [pdf, other]

    cs.DB cs.AI

    RetClean: Retrieval-Based Data Cleaning Using Foundation Models and Data Lakes

    Authors: Mohammad Shahmeer Ahmad, Zan Ahmad Naeem, Mohamed Eltabakh, Mourad Ouzzani, Nan Tang

    Abstract: Can foundation models (such as ChatGPT) clean your data? In this proposal, we demonstrate that indeed ChatGPT can assist in data cleaning by suggesting corrections for specific cells in a data table (scenario 1). However, ChatGPT may struggle with datasets it has never encountered before (e.g., local enterprise data) or when the user requires an explanation of the source of the suggested clean val…

    Submitted 29 March, 2023; originally announced March 2023.

  23. arXiv:2303.09055  [pdf, other]

    cs.CV

    TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization

    Authors: Tuan N. Tang, Kwonyoung Kim, Kwanghoon Sohn

    Abstract: Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks to the extracted video clip features such as employing complex self-attention mechanisms. In this paper, we present the simplest method ever to a…

    Submitted 15 March, 2023; originally announced March 2023.

  24. arXiv:2302.10900  [pdf, other]

    cs.LG cs.AI cs.IR

    Semi-decentralized Federated Ego Graph Learning for Recommendation

    Authors: Liang Qu, Ningzhi Tang, Ruiqi Zheng, Quoc Viet Hung Nguyen, Zi Huang, Yuhui Shi, Hongzhi Yin

    Abstract: Collaborative filtering (CF) based recommender systems are typically trained based on personal interaction data (e.g., clicks and purchases) that could be naturally represented as ego graphs. However, most existing recommendation methods collect these ego graphs from all users to compose a global graph to obtain high-order collaborative information between users and items, and these centralized CF…

    Submitted 9 February, 2023; originally announced February 2023.

  25. arXiv:2211.04905  [pdf, other]

    cs.CV

    SimOn: A Simple Framework for Online Temporal Action Localization

    Authors: Tuan N. Tang, Jungin Park, Kwonyoung Kim, Kwanghoon Sohn

    Abstract: Online Temporal Action Localization (On-TAL) aims to immediately provide action instances from untrimmed streaming videos. The model is not allowed to utilize future frames and any processing techniques to modify past predictions, making On-TAL much more challenging. In this paper, we propose a simple yet effective framework, termed SimOn, that learns to predict action instances using the popular…

    Submitted 7 November, 2022; originally announced November 2022.

  26. arXiv:2211.02816  [pdf, other]

    cs.CL cs.AI cs.LG

    PASTA: Table-Operations Aware Fact Verification via Sentence-Table Cloze Pre-training

    Authors: Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, Xiaoyong Du

    Abstract: Fact verification has attracted a lot of research attention recently, e.g., in journalism, marketing, and policymaking, as misinformation and disinformation online can sway one's opinion and affect one's actions. While fact-checking is a hard task in general, in many cases, false statements can be easily debunked based on analytics over tables with reliable information. Hence, table-based fact ver…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: EMNLP 2022

    MSC Class: 68T50 ACM Class: I.2.7; I.2.6

  27. New Decoding of Reed-Solomon Codes Based on FFT and Modular Approach

    Authors: Nianqi Tang, Yunghsiang S. Han

    Abstract: Decoding algorithms for Reed-Solomon (RS) codes are of great interest for both practical and theoretical reasons. In this paper, an efficient algorithm, called the modular approach (MA), is devised for solving the Welch-Berlekamp (WB) key equation. By taking the MA as the key equation solver, we propose a new decoding algorithm for systematic RS codes. For $(n,k)$ RS codes, where $n$ is the code…

    Submitted 22 July, 2022; originally announced July 2022.

  28. arXiv:2206.06908  [pdf, other]

    cs.SD eess.AS

    LPCSE: Neural Speech Enhancement through Linear Predictive Coding

    Authors: Yang Liu, Na Tang, Xiaoli Chu, Yang Yang, Jun Wang

    Abstract: The increasingly stringent requirement on quality-of-experience in 5G/B5G communication systems has led to the emerging neural speech enhancement techniques, which however have been developed in isolation from the existing expert-rule based models of speech pronunciation and distortion, such as the classic Linear Predictive Coding (LPC) speech model because it is difficult to integrate the models…

    Submitted 22 June, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

  29. arXiv:2204.03281  [pdf, other]

    cs.IR

    Single-shot Embedding Dimension Search in Recommender System

    Authors: Liang Qu, Yonghong Ye, Ningzhi Tang, Lixin Zhang, Yuhui Shi, Hongzhi Yin

    Abstract: As a crucial component of most modern deep recommender systems, feature embedding maps high-dimensional sparse user/item features into low-dimensional dense embeddings. However, these embeddings are usually assigned a unified dimension, which suffers from the following issues: (1) high memory usage and computation cost. (2) sub-optimal performance due to inferior dimension assignments. In order to…

    Submitted 14 April, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

  30. arXiv:2110.01133  [pdf, other]

    cs.IT

    Lifetime Maximization for UAV-Enabled Cognitive-NOMA IoT Networks: Joint Location, Power, and Decoding Order Optimization

    Authors: Na Tang

    Abstract: This paper investigates a cognitive unmanned aerial vehicle (UAV) enabled Internet of Things (IoT) network, where secondary/cognitive IoT devices upload their data to the UAV hub following a non-orthogonal multiple access (NOMA) protocol in the spectrum of the primary network. We aim to maximize the minimum lifetime of IoT devices by jointly optimizing the UAV location, transmit power, and decodin…

    Submitted 29 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  31. arXiv:2108.10520  [pdf, other]

    cs.CV

    Improving Object Detection by Label Assignment Distillation

    Authors: Chuong H. Nguyen, Thuy C. Nguyen, Tuan N. Tang, Nam L. H. Phan

    Abstract: Label assignment in object detection aims to assign targets, foreground or background, to sampled regions in an image. Unlike labeling for image classification, this problem is not well defined due to the object's bounding box. In this paper, we investigate the problem from a perspective of distillation, hence we call it Label Assignment Distillation (LAD). Our initial motivation is very simple, we u…

    Submitted 19 October, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

    Comments: To appear in WACV 2022

  32. arXiv:2106.06649  [pdf, other]

    cs.CV

    1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation

    Authors: Thuy C. Nguyen, Tuan N. Tang, Nam LH. Phan, Chuong H. Nguyen, Masayuki Yamazaki, Masao Yamanaka

    Abstract: Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously. Extended from image set applications, video data additionally induces the temporal information, which, if handled appropriately, is very useful to identify and predict object motions. In this work, we design a unified model to mutually learn these tasks. Specifically, we propo…

    Submitted 8 July, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to CVPR 2021 Workshop

  33. arXiv:2012.02469  [pdf, other]

    cs.LG cs.DB

    RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation

    Authors: Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, Mourad Ouzzani

    Abstract: Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising auto-encoder for tuple-to-X models (X could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the or…

    Submitted 31 March, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

  34. arXiv:2007.04856  [pdf, other]

    cs.RO

    Computing High-Quality Clutter Removal Solutions for Multiple Robots

    Authors: Wei N. Tang, Shuai D. Han, Jingjin Yu

    Abstract: We investigate the task and motion planning problem of clearing clutter from a workspace with limited ingress/egress access for multiple robots. We call the problem multi-robot clutter removal (MRCR). Targeting practical applications where motion planning is non-trivial but is not a bottleneck, we focus on finding high-quality solutions for feasible MRCR instances, which depends on the ability to…

    Submitted 9 July, 2020; originally announced July 2020.

  35. arXiv:1911.11876  [pdf, other]

    cs.DB

    Dataset-On-Demand: Automatic View Search and Presentation for Data Discovery

    Authors: Raul Castro Fernandez, Nan Tang, Mourad Ouzzani, Michael Stonebraker, Samuel Madden

    Abstract: Many data problems are solved when the right view of a combination of datasets is identified. Finding such a view is challenging because of the many tables spread across many databases, data lakes, and cloud storage in modern organizations. Finding relevant tables, and identifying how to combine them is a difficult and time-consuming process that hampers users' productivity. In this paper, we de…

    Submitted 26 November, 2019; originally announced November 2019.

  36. arXiv:1908.10583  [pdf, other]

    cs.SI cs.DB cs.DS

    Efficient Algorithms for Approximate Single-Source Personalized PageRank Queries

    Authors: Sibo Wang, Renchi Yang, Runhui Wang, Xiaokui Xiao, Zhewei Wei, Wenqing Lin, Yin Yang, Nan Tang

    Abstract: Given a graph $G$, a source node $s$ and a target node $t$, the personalized PageRank (PPR) of $t$ with respect to $s$ is the probability that a random walk starting from $s$ terminates at $t$. An important variant of the PPR query is single-source PPR (SSPPR), which enumerates all nodes in $G$, and returns the top-$k$ nodes with the highest PPR values with respect to a given source $s$. PPR in ge…

    Submitted 28 August, 2019; originally announced August 2019.

    Comments: Accepted in the ACM Transactions on Database Systems (TODS)

  37. arXiv:1906.06574  [pdf, ps, other]

    cs.DB

    Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation

    Authors: Ji Sun, Dong Deng, Ihab Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

    Abstract: An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The traditional wisdom is to sequentially apply the hum…

    Submitted 15 June, 2019; originally announced June 2019.

  38. arXiv:1905.13530  [pdf, other]

    cs.RO

    Taming Combinatorial Challenges in Optimal Clutter Removal Tasks

    Authors: Wei N. Tang, Jingjin Yu

    Abstract: We examine an important combinatorial challenge in clearing clutter using a mobile robot equipped with a manipulator, seeking to compute an optimal object removal sequence for minimizing the task completion time, assuming that each object is grasped once and then subsequently removed. On the structural side, we establish that such an optimal sequence can be NP-hard to compute, even when no two obj…

    Submitted 31 May, 2019; originally announced May 2019.

  39. arXiv:1903.03229  [pdf, other]

    cs.PL cs.DB

    Deductive Optimization of Relational Data Storage

    Authors: John K. Feser, Samuel Madden, Nan Tang, Armando Solar-Lezama

    Abstract: Optimizing the physical data storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express a wide range of physical database layouts, going well beyond the row- and column-based methods that are widely used in database management systems. We use deductive synthesis to turn a high-level relational representation of a database query int…

    Submitted 5 February, 2020; v1 submitted 7 March, 2019; originally announced March 2019.

  40. arXiv:1903.00984  [pdf, other]

    cs.RO

    Tight Robot Packing in the Real World: A Complete Manipulation Pipeline with Robust Primitives

    Authors: Rahul Shome, Wei N. Tang, Changkyu Song, Chaitanya Mitash, Hristiyan Kourtev, Jingjin Yu, Abdeslam Boularias, Kostas E. Bekris

    Abstract: Many order fulfillment applications in logistics, such as packing, involve picking objects from unstructured piles before tightly arranging them in bins or shipping containers. Desirable robotic solutions in this space need to be low-cost, robust, easily deployable and simple to control. The current work proposes a complete pipeline for solving packing tasks for cuboid objects, given access only t…

    Submitted 30 September, 2021; v1 submitted 3 March, 2019; originally announced March 2019.

  41. arXiv:1809.11084  [pdf, other]

    cs.DB cs.LG stat.ML

    Reuse and Adaptation for Entity Resolution through Transfer Learning

    Authors: Saravanan Thirumuruganathan, Shameem A Puthiya Parambath, Mourad Ouzzani, Nan Tang, Shafiq Joty

    Abstract: Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide the state-of-the-art results. Considerable human effort goes into feature engineering and training data creation. In this paper, we investigate a new problem: Given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classif…

    Submitted 28 September, 2018; originally announced September 2018.

  42. arXiv:1803.01384  [pdf, other]

    cs.DB

    Data Curation with Deep Learning [Vision]

    Authors: Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan

    Abstract: Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest, hardest, yet inevitable data management problems. Despite decades of efforts from both researchers and practitioners, it is still one of the most time consuming and least enjoyable work of data scientists. In most organizations, data curation plays an important role so as to fully unlock the value of…

    Submitted 24 March, 2019; v1 submitted 4 March, 2018; originally announced March 2018.

  43. DeepER -- Deep Entity Resolution

    Authors: Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, Nan Tang

    Abstract: Entity resolution (ER) is a key data integration problem. Despite the efforts in 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word…

    Submitted 18 November, 2019; v1 submitted 2 October, 2017; originally announced October 2017.

    Comments: Accepted to PVLDB 2018 as "Distributed Representations of Tuples for Entity Resolution". This version corrects a minor issue in Example 4 pointed out by Andrew Borthwick and Matthias Boehm

  44. arXiv:1709.10436  [pdf, other]

    cs.DB

    Unsupervised String Transformation Learning for Entity Consolidation

    Authors: Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Guoliang Li, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

    Abstract: Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management…

    Submitted 30 July, 2018; v1 submitted 29 September, 2017; originally announced September 2017.

  45. arXiv:1609.04745  [pdf, other]

    cs.RO

    A Portable, 3D-Printing Enabled Multi-Vehicle Platform for Robotics Research and Education

    Authors: Jingjin Yu, Shuai D Han, Wei N Tang, Daniela Rus

    Abstract: microMVP is an affordable, portable, and open source micro-scale mobile robot platform designed for robotics research and education. As a complete and unique multi-vehicle platform enabled by 3D printing and the maker culture, microMVP can be easily reproduced and requires little maintenance: a set of six micro vehicles, each measuring $8\times 5\times 6$ cubic centimeters and weighing under…

    Submitted 29 May, 2017; v1 submitted 15 September, 2016; originally announced September 2016.

    Comments: Updated author list and paper

  46. arXiv:1510.02219  [pdf, ps, other]

    cs.DB

    On Summarizing Graph Streams

    Authors: Nan Tang, Qing Chen, Prasenjit Mitra

    Abstract: Graph streams, which refer to the graph with edges being updated sequentially in a form of a stream, have wide applications such as cyber security, social networks and transportation networks. This paper studies the problem of summarizing graph streams. Specifically, given a graph stream G, directed or undirected, the objective is to summarize G as S with much smaller (sublinear) space, linear con…

    Submitted 8 October, 2015; originally announced October 2015.

  47. arXiv:1001.5130  [pdf, ps, other]

    q-bio.GN cs.CE q-bio.QM

    BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies

    Authors: Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L. S. Tang, Weichuan Yu

    Abstract: Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named 'BOolean Operation based Screening and Testing' (BOOST). To di…

    Submitted 28 January, 2010; originally announced January 2010.

    Comments: Submitted

  48. A Tighter Bound for the Determinization of Visibly Pushdown Automata

    Authors: Nguyen Van Tang

    Abstract: Visibly pushdown automata (VPAs), introduced by Alur and Madhusudan in 2004, are a subclass of pushdown automata whose stack behavior is completely determined by the input symbol according to a fixed partition of the input alphabet. Since their introduction, VPAs have been shown to be useful in various contexts, e.g., as a specification formalism for verification and as an automaton model for processing XML s…

    Submitted 17 November, 2009; originally announced November 2009.

    ACM Class: F.4.3; F.4.1

    Journal ref: EPTCS 10, 2009, pp. 62-76