Showing 1–21 of 21 results for author: Koh, J Y

Searching in archive cs.
  1. arXiv:2407.01476

    cs.AI cs.CL cs.LG

    Tree Search for Language Model Agents

    Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov

    Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 11 pages. Models and code available at https://jykoh.com/search-agents

  2. arXiv:2406.12814

    cs.LG cs.CL cs.CR cs.CV

    Adversarial Attacks on Multimodal Agents

    Authors: Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

    Abstract: Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-base…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 19 pages

  3. arXiv:2406.00505

    cs.CV

    Improving Text Generation on Images with Synthetic Captions

    Authors: Jun Young Koh, Sang Hyun Park, Joy Song

    Abstract: The recent emergence of latent diffusion models such as SDXL and SD 1.5 has shown significant capability in generating highly detailed and realistic images. Despite their remarkable ability to produce images, generating accurate text within images still remains a challenging task. In this paper, we examine the validity of fine-tuning approaches in generating legible text within the image. We propo…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 9 pages, 12 figures

  4. arXiv:2404.07554

    cs.CV cs.AI

    CAT: Contrastive Adapter Training for Personalized Image Generation

    Authors: Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song

    Abstract: The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: CVPRW 2024

  5. arXiv:2402.17553

    cs.AI cs.CL cs.CV cs.HC

    OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

    Authors: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov

    Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They coul…

    Submitted 21 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

  6. arXiv:2401.13649

    cs.LG cs.CL cs.CV

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

    Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augmen…

    Submitted 5 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa

  7. arXiv:2310.07478

    cs.AI

    Multimodal Graph Learning for Generative Tasks

    Authors: Minji Yoon, Jing Yu Koh, Bryan Hooi, Ruslan Salakhutdinov

    Abstract: Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize: for example, from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption pairs, or audio-text pairs. However, in most real-world settings, entities of different modalit…

    Submitted 12 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

  8. arXiv:2305.17216

    cs.CL cs.CV cs.LG

    Generating Images with Multimodal Language Models

    Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

    Abstract: We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to…

    Submitted 13 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023. Project page: http://jykoh.com/gill

  9. arXiv:2302.06833

    cs.CV

    VQ3D: Learning a 3D-Aware Generative Model on ImageNet

    Authors: Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun

    Abstract: Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder…

    Submitted 14 February, 2023; originally announced February 2023.

    Comments: 15 pages. For visual results, please visit the project webpage at http://kylesargent.github.io/vq3d

  10. arXiv:2301.13823

    cs.CL cs.AI cs.CV cs.LG

    Grounding Language Models to Images for Multimodal Inputs and Outputs

    Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

    Abstract: We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the langu…

    Submitted 13 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: Published in ICML 2023. Project page: https://jykoh.com/fromage

  11. arXiv:2210.03112

    cs.LG cs.CL cs.CV cs.RO

    A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

    Authors: Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

    Abstract: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial langua…

    Submitted 17 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: CVPR 2023

  12. arXiv:2206.10789

    cs.CV cs.LG

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

    Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in a…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Preprint

  13. arXiv:2204.02960

    cs.CV cs.AI cs.LG

    Simple and Effective Synthesis of Indoor 3D Scenes

    Authors: Jing Yu Koh, Harsh Agrawal, Dhruv Batra, Richard Tucker, Austin Waters, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

    Abstract: We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an ima…

    Submitted 1 December, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: AAAI 2023

  14. arXiv:2110.04627

    cs.CV cs.LG

    Vector-quantized Image Modeling with Improved VQGAN

    Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu

    Abstract: Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregres…

    Submitted 4 June, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

    Comments: Accepted in ICLR 2022

  15. arXiv:2105.08756

    cs.CV cs.LG

    Pathdreamer: A World Model for Indoor Navigation

    Authors: Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

    Abstract: People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-re…

    Submitted 16 August, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: In ICCV 2021

  16. arXiv:2104.06697

    cs.CV

    Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

    Authors: Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong

    Abstract: Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to des…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted as a conference paper at ICLR 2021

  17. arXiv:2101.04702

    cs.CV

    Cross-Modal Contrastive Learning for Text-to-Image Generation

    Authors: Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

    Abstract: The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and int…

    Submitted 14 April, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

    Comments: CVPR 2021

  18. arXiv:2011.03775

    cs.CV cs.AI

    Text-to-Image Generation Grounded by Fine-Grained User Attention

    Authors: Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

    Abstract: Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used t…

    Submitted 30 March, 2021; v1 submitted 7 November, 2020; originally announced November 2020.

    Comments: To appear in WACV 2021

  19. arXiv:2002.02634

    cs.CV

    SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information

    Authors: Jing Yu Koh, Duc Thanh Nguyen, Quang-Trung Truong, Sai-Kit Yeung, Alexander Binder

    Abstract: Fully-automatic execution is the ultimate goal for many Computer Vision applications. However, this objective is not always realistic in tasks associated with high failure costs, such as medical applications. For these tasks, semi-automatic methods allowing minimal effort from users to guide computer algorithms are often preferred due to desirable accuracy and performance. Inspired by the practica…

    Submitted 17 July, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

    Comments: ECCV 2020

  20. arXiv:1606.09187

    cs.CV cs.NE stat.ML

    Object Boundary Detection and Classification with Image-level Labels

    Authors: Jing Yu Koh, Wojciech Samek, Klaus-Robert Müller, Alexander Binder

    Abstract: Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images which is a particularly tedious task. We propose a novel strategy for solving this task, when pixel-level annotations are not available, performing it in an almost zero-shot ma…

    Submitted 25 June, 2017; v1 submitted 29 June, 2016; originally announced June 2016.

    Comments: 12 pages, 2 figures, accepted for GCPR 2017 - 39th German Conference on Pattern Recognition

  21. Geo-spatial Location Spoofing Detection for Internet of Things

    Authors: Jing Yang Koh, Ido Nevat, Derek Leong, Wai-Choong Wong

    Abstract: We develop a new location spoofing detection algorithm for geo-spatial tagging and location-based services in the Internet of Things (IoT), called Enhanced Location Spoofing Detection using Audibility (ELSA) which can be implemented at the backend server without modifying existing legacy IoT systems. ELSA is based on a statistical decision theory framework and uses two-way time-of-arrival (TW-TOA)…

    Submitted 28 March, 2017; v1 submitted 17 February, 2016; originally announced February 2016.

    Comments: A shortened version of this work was accepted to the IEEE IoT Journal (IoT-J) on 08-Feb-2016

    Journal ref: IEEE Internet of Things Journal, vol. 3, no. 6, pp. 971-978, Dec. 2016