VISCO

Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

1University of California Los Angeles,   2Stanford
Paper arXiv Code


Data

Introduction

The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement, but a systematic analysis of such capabilities is still lacking.

To bridge this gap, we propose VISCO, the first benchmark to extensively analyze the critique and correction capabilities of LVLMs. VISCO features dense and fine-grained critique, where LVLMs are required to evaluate the correctness of each step in the chain-of-thought, and then provide natural language explanations to support their judgments.

We have conducted an extensive evaluation of 24 LVLMs. Our results show that human-written critiques significantly help correction, while model-generated critiques are less helpful and sometimes detrimental to performance. This showcases the potential of self-improvement and also reveals that critique is the crucial bottleneck. We identify three failure patterns in critique: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. We further explore a LookBack strategy that improves critique performance by revisiting the image and verifying each piece of information.

Leaderboard


| # | Model | Source | Critique: VISCore 🏅 | Critique: Ans. F1 | Critique: Step F1 | Critique: Ex. F1 | Correction Gain (Human Critique) | Correction Gain (Model Critique) |
|---|-------|--------|----------------------|-------------------|-------------------|------------------|----------------------------------|----------------------------------|
| - | Human* | - | 86.47 | 100.0 | 90.6 | 71.4 | - | - |
| 1 | GPT-4o-2024-08-06 🥇 | Link | 52.36 | 63.0 | 57.2 | 39.8 | 76.2 | 28.8 |
| 2 | Claude-3.5-Sonnet-20240620 🥈 | Link | 51.28 | 61.8 | 58.1 | 37.6 | 73.7 | 25.6 |
| 3 | Gemini-1.5-Pro 🥉 | Link | 45.01 | 55.6 | 51.2 | 32.0 | 78.0 | 24.9 |
| 4 | LLaVA-Critic-72B | Link | 42.60 | 53.9 | 50.9 | 28.2 | 58.9 | 15.4 |
| 5 | Qwen2-VL-72B | Link | 37.44 | 49.2 | 41.9 | 25.5 | 31.5 | -2.1 |
| 6 | Llama-3.2-90B | Link | 36.40 | 46.8 | 42.5 | 24.3 | 66.4 | 4.4 |
| 7 | Molmo-72B | Link | 35.59 | 49.4 | 39.8 | 22.9 | 53.1 | 1.4 |
| 8 | LLaVA-OV-72B | Link | 35.27 | 47.1 | 42.0 | 22.2 | 33.4 | -10.2 |
| 9 | NVLM-72B | Link | 33.07 | 44.0 | 38.6 | 21.3 | 42.2 | 1.7 |
| 10 | InternVL2-40B | Link | 28.48 | 41.6 | 31.4 | 17.7 | 47.6 | 9.9 |
| 11 | InternVL2-76B | Link | 26.38 | 37.7 | 28.6 | 17.0 | 72.7 | 11.7 |
| 12 | InternVL2-26B | Link | 25.20 | 39.4 | 30.2 | 13.4 | 59.3 | 6.0 |
| 13 | InternVL2-8B | Link | 23.33 | 37.1 | 31.1 | 11.0 | 52.7 | 5.4 |
| 14 | LLaVA-v1.6-7B | Link | 21.80 | 44.6 | 33.6 | 6.9 | 40.3 | -8.7 |
| 15 | Qwen2-VL-7B | Link | 21.71 | 43.0 | 30.6 | 7.8 | 50.8 | 5.5 |
| 16 | LLaVA-v1.6-13B | Link | 21.02 | 40.2 | 32.8 | 7.1 | 40.2 | -7.2 |
| 17 | LLaVA-Critic-7B | Link | 20.02 | 32.0 | 28.7 | 8.8 | 19.3 | -11.4 |
| 18 | Prometheus-Vision-13B | Link | 19.32 | 38.0 | 37.8 | 5.0 | -† | -† |
| 19 | Prometheus-Vision-7B | Link | 17.67 | 37.6 | 35.8 | 4.1 | -† | -† |
| 20 | Molmo-7B | Link | 13.43 | 35.5 | 22.0 | 3.1 | 49.1 | 1.8 |
| 21 | Llama-3.2-11B | Link | 11.44 | 29.4 | 21.1 | 2.4 | 34.8 | -11.7 |
| 22 | LLaVA-v1.6-34B | Link | 11.05 | 23.6 | 14.3 | 4.0 | 39.2 | -1.7 |
| 23 | LLaVA-OV-7B | Link | 7.53 | 14.5 | 14.9 | 2.0 | 22.4 | -9.1 |
| 24 | DeepSeek-VL-7B | Link | 7.53 | 21.8 | 15.7 | 1.2 | 2.3 | -17.9 |
| - | Random | - | - | 37.9 | 32.0 | - | - | - |
Human performance*: measured by expert human annotators on 265 data points rather than the full test set, due to annotation cost.
Negative correction gain: correction brings more harm than benefit to task performance.
No correction results†: after specialized training, Prometheus-Vision models lack general question-answering capability and cannot perform correction.
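The leaderboard's critique metrics (VISCore, Ans. F1, Step F1, Ex. F1) are defined in the paper. Purely as a generic illustration (not VISCO's exact scoring), the sketch below shows how a step-level F1 over binary correctness judgments could be computed with scikit-learn.

```python
from sklearn.metrics import f1_score

# Gold and predicted per-step correctness labels (True = step is correct).
gold = [True, False, True, True]
pred = [True, False, False, True]

# A generic macro-F1 over the two classes; VISCO's exact metric
# definitions (Ans./Step/Ex. F1, VISCore) are specified in the paper.
print(f1_score(gold, pred, average="macro"))
```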

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

VISCO Dataset

VISCO is the first benchmark to evaluate the critique and correction capabilities of LVLMs. It covers 8 tasks drawn from 18 source datasets, across two main categories: (1) reasoning tasks, such as math and science reasoning, and (2) perception tasks, such as text recognition, spatial relationship understanding, and hallucination prevention.

We sample model responses with chains of thought from 7 LVLMs and ask expert human annotators to produce ground-truth critiques. We collect dense and fine-grained critiques, with a binary correctness label for each step, followed by a natural-language explanation when the step is incorrect. In total, we collect 1,645 pairs of questions and LVLM-generated answers, comprising 5,604 step-wise annotations.
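As a concrete illustration of this annotation format, a single instance might look like the sketch below. The field names are assumptions made for illustration only, not the released schema.

```python
# Illustrative only: field names are assumed, not the released VISCO schema.
example = {
    "question": "How many red cubes are to the left of the sphere?",
    "answer": "2",  # LVLM-generated final answer
    "chain_of_thought": [
        "Step 1: There are three red cubes in the image.",
        "Step 2: Two of the red cubes are to the left of the sphere.",
        "Step 3: Therefore the answer is 2.",
    ],
    # Dense, fine-grained critique: a binary correctness label per step,
    # plus a natural-language explanation for each incorrect step.
    "step_labels": [False, True, True],
    "step_explanations": [
        "The image shows only two red cubes, not three.",
        None,
        None,
    ],
}
```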

You can download the dataset from Hugging Face Datasets.
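A minimal loading sketch with the `datasets` library, assuming a standard Hugging Face dataset layout; the repository id below is a placeholder, not the actual path.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the actual VISCO repo on Hugging Face.
ds = load_dataset("<hf-org>/VISCO", split="test")
print(ds[0])
```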

Distribution of categories, tasks, and datasets in VISCO.

Statistics of VISCO.

Examples

Experiment Results

Key takeaways:

  1. Critique capability typically emerges in ~70B LVLMs; LVLMs with <70B parameters often perform even worse than random guessing.
  2. Specialized critic training, as in Prometheus-Vision and LLaVA-Critic, improves critique performance.
  3. As shown below, when human critiques are available, top LVLMs can correct over 70% of errors, and fine-grained critiques consistently bring improvements.
  4. In contrast, model-generated critiques are less effective and can sometimes even hurt performance.

Correction performance given human-generated or model-generated critiques with different granularity.

Critique Failure Patterns

LookBack: Improved Baseline

Method Overview

For each step in the chain-of-thought, LookBack (1) extracts a set of atomic pieces of information that need verification, phrased as questions, (2) answers each question individually based on the image, and (3) produces the final critique accordingly.
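A minimal sketch of this three-stage pipeline, assuming a generic `lvlm(image, prompt)` callable; the prompts, parsing, and aggregation are simplified placeholders rather than the exact LookBack implementation.

```python
from typing import Callable, Dict, List

def lookback_critique(lvlm: Callable[[str, str], str],
                      image: str,
                      steps: List[str]) -> List[Dict[str, str]]:
    """Sketch of LookBack: for each step, (1) extract atomic facts to verify,
    (2) verify each fact against the image, (3) aggregate into a step critique."""
    critiques = []
    for step in steps:
        # (1) Extract the atomic pieces of information that need verification.
        facts = lvlm(image, "List, one per line, the atomic facts in this step "
                            f"that must be checked against the image:\n{step}")
        questions = [q.strip() for q in facts.splitlines() if q.strip()]
        # (2) Verify each piece individually by looking back at the image.
        verifications = [lvlm(image, f"Based only on the image, is this true? {q}")
                         for q in questions]
        # (3) Produce the final critique for this step from the verification results.
        critique = lvlm(image, "Given these verification results, judge whether the "
                               "step is correct and explain any error:\n"
                               + "\n".join(verifications) + f"\nStep: {step}")
        critiques.append({"step": step, "critique": critique})
    return critiques
```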


Results

LookBack significantly enhances critique performance on four leading LVLMs. When applied to the correction task, the critiques generated by LookBack bring notable improvements.


Qualitative Example

BibTeX

@misc{wu2024viscobenchmarkingfinegrainedcritique,
      title={VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning}, 
      author={Xueqing Wu and Yuheng Ding and Bingxuan Li and Pan Lu and Da Yin and Kai-Wei Chang and Nanyun Peng},
      year={2024},
      eprint={2412.02172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02172}, 
}