Showing 1–19 of 19 results for author: Chan, D M

  1. arXiv:2407.13766  [pdf, other]

    cs.CV

    Visual Haystacks: Answering Harder Questions About Sets of Images

    Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

    Abstract: Recent advancements in Large Multimodal Models (LMMs) have made significant progress in the field of single-image visual question answering. However, these models face substantial challenges when tasked with queries that span extensive collections of images, similar to real-world scenarios like searching through large photo albums, finding specific information across the internet, or monitoring en…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Project page: https://visual-haystacks.github.io

  2. arXiv:2407.06576  [pdf, other]

    cs.CL cs.AI

    Virtual Personas for Language Models via an Anthology of Backstories

    Authors: Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph Suh, Widyadewi Soedarmadji, Eran Kohen Behar, David M. Chan

    Abstract: Large language models (LLMs) are trained from vast repositories of text authored by millions of distinct authors, reflecting an enormous diversity of human traits. While these models bear the potential to be used as approximations of human subjects in behavioral studies, prior efforts have been limited in steering model responses to match individual human users. In this work, we introduce "Antholo…

    Submitted 9 July, 2024; originally announced July 2024.

  3. arXiv:2404.02904  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    ALOHa: A New Measure for Hallucination in Captioning Models

    Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell

    Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverage…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: To appear at NAACL 2024

  4. arXiv:2401.05314  [pdf, other]

    eess.AS cs.CL cs.CV cs.SD

    ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

    Authors: Kevin Cai, Chonghua Liu, David M. Chan

    Abstract: The Internet's wealth of content, with up to 60% published in English, starkly contrasts the global population, where only 18.8% are English speakers, and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated processes for dubbing of video - replacing the audio track of a video with a translated alternative - remains a complex an…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: To appear in ICASSP 2024

  5. arXiv:2401.02417  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

    Authors: David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister

    Abstract: While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute significant numbers of failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions, but these systems usually fail to learn from these…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: To appear in ICASSP 2024

  6. arXiv:2312.14378  [pdf, other]

    cs.LG cs.SD eess.AS

    Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

    Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

    Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowle…

    Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

  7. arXiv:2302.01328  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    IC3: Image Captioning by Committee Consensus

    Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

    Abstract: If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in th…

    Submitted 19 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: To Appear at EMNLP 2023

  8. arXiv:2301.02736  [pdf, other]

    eess.AS cs.LG cs.SD

    Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition

    Authors: David M. Chan, Shalini Ghosh, Ariya Rastrow, Björn Hoffmeister

    Abstract: Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains a challenging task, primarily due to reduced data availability (necessitating increased data collection), and rapidly shifting data distributions (requiring more frequent model fine-tuning). In this work, we investigate the potential of leveragin…

    Submitted 6 January, 2023; originally announced January 2023.

  9. arXiv:2209.07518  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Distribution Aware Metrics for Conditional Natural Language Generation

    Authors: David M Chan, Yiming Ni, David A Ross, Sudheendra Vijayanarasimhan, Austin Myers, John Canny

    Abstract: Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersi…

    Submitted 29 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

  10. arXiv:2206.08353  [pdf, other]

    cs.LG stat.ML

    Towards Understanding How Machines Can Learn Causal Overhypotheses

    Authors: Eliza Kosoy, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, Alison Gopnik

    Abstract: Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the "blicket detector" environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the k…

    Submitted 16 June, 2022; originally announced June 2022.

  11. arXiv:2205.09872  [pdf, other]

    eess.AS cs.LG cs.SD

    Content-Context Factorized Representations for Automated Speech Recognition

    Authors: David M. Chan, Shalini Ghosh

    Abstract: Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may consist not only of information about the spoken language content, but also may contain information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected a…

    Submitted 15 September, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: Presented at Interspeech 2022 (On-Site Oral Presentation)

  12. arXiv:2205.06253  [pdf, other]

    cs.CV cs.CL

    What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

    Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

    Abstract: While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In th…

    Submitted 12 January, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  13. arXiv:2202.10430  [pdf, other]

    cs.LG cs.AI cs.NE

    Learning Causal Overhypotheses through Exploration in Children and Computational Models

    Authors: Eliza Kosoy, Adrian Liu, Jasmine Collins, David M Chan, Jessica B Hamrick, Nan Rosemary Ke, Sandy H Huang, Bryanna Kaufmann, John Canny, Alison Gopnik

    Abstract: Despite recent progress in reinforcement learning (RL), RL algorithms for exploration still remain an active area of research. Existing methods often focus on state-based metrics, which do not consider the underlying causal structures of the environment, and while recent research has begun to explore RL environments for causal learning, these environments primarily leverage causal information thro…

    Submitted 21 February, 2022; originally announced February 2022.

  14. arXiv:2110.09890  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Modal Pre-Training for Automated Speech Recognition

    Authors: David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

    Abstract: Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise…

    Submitted 15 September, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Presented at ICASSP 2022

  15. arXiv:2104.01263  [pdf, other]

    cs.CV

    A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery

    Authors: Aatif Jiwani, Shubhrakanti Ganguly, Chao Ding, Nan Zhou, David M. Chan

    Abstract: Urban areas consume over two-thirds of the world's energy and account for more than 70 percent of global CO2 emissions. As stated in IPCC's Global Warming of 1.5C report, achieving carbon neutrality by 2050 requires a clear understanding of urban geometry. High-quality building footprint generation from satellite images can accelerate this predictive process and empower municipal decision-making a…

    Submitted 18 November, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: 11 pages, 5 figures. Code available at https://github.com/aatifjiwani/rgb-footprint-extract/

  16. arXiv:2007.13913  [pdf, other]

    cs.CV cs.CL cs.LG

    Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

    Authors: David M. Chan, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

    Abstract: Automatic video captioning aims to train models to generate text descriptions for all segments in a video, however, the most effective approaches require large amounts of manual annotation which is slow and expensive. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we bo…

    Submitted 2 December, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Published at the 15th Asian Conference on Computer Vision (ACCV 2020)

  17. arXiv:2005.02880  [pdf, other]

    cs.AI

    Exploring Exploration: Comparing Children with RL Agents in Unified Environments

    Authors: Eliza Kosoy, Jasmine Collins, David M. Chan, Sandy Huang, Deepak Pathak, Pulkit Agrawal, John Canny, Alison Gopnik, Jessica B. Hamrick

    Abstract: Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn. In turn, this early learning supports more robust generalization and intelligent behavior later in life. While much work has gone into developing methods for exploration in machine learning, artificial agents have not yet reached the hig…

    Submitted 1 July, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

    Comments: Published as a workshop paper at "Bridging AI and Cognitive Science" (ICLR 2020)

  18. arXiv:1812.04604  [pdf, other]

    cs.CV cs.AI

    Diagnostic Visualization for Deep Neural Networks Using Stochastic Gradient Langevin Dynamics

    Authors: Biye Jiang, David M. Chan, Tianhao Zhang, John F. Canny

    Abstract: The internal states of most deep neural networks are difficult to interpret, which makes diagnosis and debugging during training challenging. Activation maximization methods are widely used, but lead to multiple optima and are hard to interpret (appear noise-like) for complex neurons. Image-based methods use maximally-activating image regions which are easier to interpret, but do not provide pixel…

    Submitted 11 December, 2018; originally announced December 2018.

  19. arXiv:1807.11824  [pdf, other]

    cs.LG cs.PF stat.ML

    t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data

    Authors: David M. Chan, Roshan Rao, Forrest Huang, John F. Canny

    Abstract: Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples. Existing visualization methods which employ dimensionality reduction to two or three dimensions are often inefficient and/or ineffective for these datasets. This paper introduces t-SNE-CUDA, a GPU-accelerated implementation of t-distributed Symmetric…

    Submitted 31 July, 2018; originally announced July 2018.

    Comments: To appear in HPML 2018 High Performance Machine Learning Workshop (Accepted, 2018)