VISCO

Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

1University of California Los Angeles,   2Stanford
Paper arXiv Code


Data

Introduction

The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement, but a systematic analysis of such capabilities is still lacking.

To bridge this gap, we propose VISCO, the first benchmark to extensively analyze the critique and correction capabilities of LVLMs. VISCO features dense and fine-grained critique, where LVLMs are required to evaluate the correctness of each step in the chain-of-thought, and then provide natural language explanations to support their judgments.

We have conducted an extensive evaluation of 24 LVLMs. Our results show that human-written critiques significantly help correction, while model-generated critiques are less helpful and sometimes detrimental to performance. This showcases the potential of self-improvement and also reveals that critique is the crucial bottleneck. We identify three failure patterns in critique: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. We further explore a LookBack strategy that improves critique performance by revisiting the image and verifying each piece of information.

Leaderboard


| # | Model | Source | Critique: VISCore 🏅 | Critique: Ans. F1 | Critique: Step F1 | Critique: Ex. F1 | Correction Gain (Human Critique) | Correction Gain (Model Critique) |
|---|-------|--------|----------------------|-------------------|-------------------|------------------|----------------------------------|----------------------------------|
| - | Human* | - | 86.47 | 100.0 | 90.6 | 71.4 | - | - |
| 1 | GPT-4o-2024-08-06 🥇 | Link | 52.36 | 63.0 | 57.2 | 39.8 | 76.2 | 28.8 |
| 2 | Claude-3.5-Sonnet-20240620 🥈 | Link | 51.28 | 61.8 | 58.1 | 37.6 | 73.7 | 25.6 |
| 3 | Gemini-1.5-Pro 🥉 | Link | 45.01 | 55.6 | 51.2 | 32.0 | 78.0 | 24.9 |
| 4 | LLaVA-Critic-72B | Link | 42.60 | 53.9 | 50.9 | 28.2 | 58.9 | 15.4 |
| 5 | Qwen2-VL-72B | Link | 37.44 | 49.2 | 41.9 | 25.5 | 31.5 | -2.1 |
| 6 | Llama-3.2-90B | Link | 36.40 | 46.8 | 42.5 | 24.3 | 66.4 | 4.4 |
| 7 | Molmo-72B | Link | 35.59 | 49.4 | 39.8 | 22.9 | 53.1 | 1.4 |
| 8 | LLaVA-OV-72B | Link | 35.27 | 47.1 | 42.0 | 22.2 | 33.4 | -10.2 |
| 9 | NVLM-72B | Link | 33.07 | 44.0 | 38.6 | 21.3 | 42.2 | 1.7 |
| 10 | InternVL2-40B | Link | 28.48 | 41.6 | 31.4 | 17.7 | 47.6 | 9.9 |
| 11 | InternVL2-76B | Link | 26.38 | 37.7 | 28.6 | 17.0 | 72.7 | 11.7 |
| 12 | InternVL2-26B | Link | 25.20 | 39.4 | 30.2 | 13.4 | 59.3 | 6.0 |
| 13 | InternVL2-8B | Link | 23.33 | 37.1 | 31.1 | 11.0 | 52.7 | 5.4 |
| 14 | LLaVA-v1.6-7B | Link | 21.80 | 44.6 | 33.6 | 6.9 | 40.3 | -8.7 |
| 15 | Qwen2-VL-7B | Link | 21.71 | 43.0 | 30.6 | 7.8 | 50.8 | 5.5 |
| 16 | LLaVA-v1.6-13B | Link | 21.02 | 40.2 | 32.8 | 7.1 | 40.2 | -7.2 |
| 17 | LLaVA-Critic-7B | Link | 20.02 | 32.0 | 28.7 | 8.8 | 19.3 | -11.4 |
| 18 | Prometheus-Vision-13B | Link | 19.32 | 38.0 | 37.8 | 5.0 | -† | -† |
| 19 | Prometheus-Vision-7B | Link | 17.67 | 37.6 | 35.8 | 4.1 | -† | -† |
| 20 | Molmo-7B | Link | 13.43 | 35.5 | 22.0 | 3.1 | 49.1 | 1.8 |
| 21 | Llama-3.2-11B | Link | 11.44 | 29.4 | 21.1 | 2.4 | 34.8 | -11.7 |
| 22 | LLaVA-v1.6-34B | Link | 11.05 | 23.6 | 14.3 | 4.0 | 39.2 | -1.7 |
| 23 | LLaVA-OV-7B | Link | 7.53 | 14.5 | 14.9 | 2.0 | 22.4 | -9.1 |
| 24 | DeepSeek-VL-7B | Link | 7.53 | 21.8 | 15.7 | 1.2 | 2.3 | -17.9 |
| - | Random | - | - | 37.9 | 32.0 | - | - | - |
Human performance*: measured by expert human annotators on 265 data points rather than the full test set, due to annotation cost.
Negative correction gain: correction brings more harm than benefit to task performance.
No correction results†: after specialized training, Prometheus-Vision models lack general question-answering capability and cannot perform correction.
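The leaderboard's critique metrics (VISCore, Ans. F1, Step F1, Ex. F1) are defined in the paper. Purely as a generic illustration (not VISCO's exact scoring), the sketch below shows how a step-level F1 over binary correctness judgments could be computed with scikit-learn.

```python
from sklearn.metrics import f1_score

# Gold and predicted per-step correctness labels (True = step is correct).
gold = [True, False, True, True]
pred = [True, False, False, True]

# A generic macro-F1 over the two classes; VISCO's exact metric
# definitions (Ans./Step/Ex. F1, VISCore) are specified in the paper.
print(f1_score(gold, pred, average="macro"))
```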

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

VISCO Dataset

VISCO is the first benchmark to evaluate the critique and correction capabilities of LVLMs. It covers 8 tasks drawn from 18 source datasets, across two main categories: (1) reasoning tasks, such as math and science reasoning, and (2) perception tasks, such as text recognition, spatial relationship understanding, and hallucination prevention.

We sample model responses with chains of thought from 7 LVLMs and ask expert human annotators to produce ground-truth critiques. We collect dense and fine-grained critiques, with a binary correctness label for each step, followed by a natural-language explanation when the step is incorrect. In total, we collect 1,645 pairs of questions and LVLM-generated answers, comprising 5,604 step-wise annotations.
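As a concrete illustration of this annotation format, a single instance might look like the sketch below. The field names are assumptions made for illustration only, not the released schema.

```python
# Illustrative only: field names are assumed, not the released VISCO schema.
example = {
    "question": "How many red cubes are to the left of the sphere?",
    "answer": "2",  # LVLM-generated final answer
    "chain_of_thought": [
        "Step 1: There are three red cubes in the image.",
        "Step 2: Two of the red cubes are to the left of the sphere.",
        "Step 3: Therefore the answer is 2.",
    ],
    # Dense, fine-grained critique: a binary correctness label per step,
    # plus a natural-language explanation for each incorrect step.
    "step_labels": [False, True, True],
    "step_explanations": [
        "The image shows only two red cubes, not three.",
        None,
        None,
    ],
}
```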

You can download the dataset from Hugging Face Datasets.
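A minimal loading sketch with the `datasets` library, assuming a standard Hugging Face dataset layout; the repository id below is a placeholder, not the actual path.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the actual VISCO repo on Hugging Face.
ds = load_dataset("<hf-org>/VISCO", split="test")
print(ds[0])
```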

Distribution of categories, tasks, and datasets in VISCO.

Statistics of VISCO.

Examples

Experiment Results

Key takeaways:

  1. Critique capability typically emerges in ~70B LVLMs; LVLMs with <70B parameters often perform even worse than random guessing.
  2. Specialized critic training, as in Prometheus-Vision and LLaVA-Critic, improves critique performance.
  3. As shown below, when human critiques are available, top LVLMs can correct over 70% of errors, and fine-grained critiques consistently bring improvements.
  4. In contrast, model-generated critiques are less effective and can sometimes even hurt performance.

Correction performance given human-generated or model-generated critiques with different granularity.

Critique Failure Patterns

LookBack: Improved Baseline

Method Overview

For each step in the chain-of-thought, LookBack (1) extracts a set of atomic pieces of information that need verification, phrased as questions, (2) answers each question individually based on the image, and (3) produces the final critique accordingly.
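A minimal sketch of this three-stage pipeline, assuming a generic `lvlm(image, prompt)` callable; the prompts, parsing, and aggregation are simplified placeholders rather than the exact LookBack implementation.

```python
from typing import Callable, Dict, List

def lookback_critique(lvlm: Callable[[str, str], str],
                      image: str,
                      steps: List[str]) -> List[Dict[str, str]]:
    """Sketch of LookBack: for each step, (1) extract atomic facts to verify,
    (2) verify each fact against the image, (3) aggregate into a step critique."""
    critiques = []
    for step in steps:
        # (1) Extract the atomic pieces of information that need verification.
        facts = lvlm(image, "List, one per line, the atomic facts in this step "
                            f"that must be checked against the image:\n{step}")
        questions = [q.strip() for q in facts.splitlines() if q.strip()]
        # (2) Verify each piece individually by looking back at the image.
        verifications = [lvlm(image, f"Based only on the image, is this true? {q}")
                         for q in questions]
        # (3) Produce the final critique for this step from the verification results.
        critique = lvlm(image, "Given these verification results, judge whether the "
                               "step is correct and explain any error:\n"
                               + "\n".join(verifications) + f"\nStep: {step}")
        critiques.append({"step": step, "critique": critique})
    return critiques
```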


Results

LookBack significantly enhances critique performance on four leading LVLMs. When applied to the correction task, the critiques generated by LookBack bring notable improvements.


Qualitative Example

BibTeX

@misc{wu2024viscobenchmarkingfinegrainedcritique,
      title={VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning}, 
      author={Xueqing Wu and Yuheng Ding and Bingxuan Li and Pan Lu and Da Yin and Kai-Wei Chang and Nanyun Peng},
      year={2024},
      eprint={2412.02172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02172}, 
}