Localized symbolic knowledge distillation for visual commonsense models

JS Park, J Hessel, K Chandu… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Abstract
Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model which allows users to specify (multiple) regions-as-input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
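The data-collection pipeline the abstract describes can be illustrated with a minimal sketch. Here `caption_global`, `caption_region`, `llm`, and `critic` are hypothetical callables standing in for the VL captioners, the prompted LLM, and the separately trained critic model; none of these names, prompts, or the filtering threshold come from the paper, and the actual system may differ substantially.

```python
# A minimal sketch, assuming stand-in interfaces for the paper's components:
# literal captioners produce global and per-region descriptions, an LLM is
# prompted for region-grounded commonsense, and a critic filters the output.
from typing import Callable, Iterable

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) bounding box

def collect_localized_knowledge(
    image: bytes,
    boxes: Iterable[Box],
    caption_global: Callable[[bytes], str],
    caption_region: Callable[[bytes, Box], str],
    llm: Callable[[str], str],
    critic: Callable[[str], float],
    keep_threshold: float = 0.5,  # arbitrary placeholder cutoff
) -> list[str]:
    """Generate localized commonsense statements for one image, keeping
    only candidates the critic scores at or above `keep_threshold`."""
    global_desc = caption_global(image)
    region_descs = [
        f"Region {i} {box}: {caption_region(image, box)}"
        for i, box in enumerate(boxes)
    ]
    # Prompt the LLM with the literal global + region descriptions and ask
    # for commonsense inferences grounded in the numbered regions.
    prompt = (
        "Image description: " + global_desc + "\n"
        + "\n".join(region_descs) + "\n"
        + "List commonsense inferences about the numbered regions:"
    )
    candidates = llm(prompt).splitlines()
    # The critic model filters out low-quality generations.
    return [c for c in candidates if c.strip() and critic(c) >= keep_threshold]

# Example with stub components (for illustration only):
examples = collect_localized_knowledge(
    image=b"",
    boxes=[(0, 0, 50, 50)],
    caption_global=lambda img: "two people at a table",
    caption_region=lambda img, box: "a person holding a cup",
    llm=lambda p: "Region 0 is likely drinking coffee.",
    critic=lambda s: 0.9,
)
```

Because only images (no aligned human-written text) enter this loop, the corpus can be expanded automatically, which is what makes the subsequent distillation into a region-as-input VL model scalable.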