V2P-Bench

Abstract

Large Vision-Language Models (LVLMs) have recently made significant progress in video understanding. However, current benchmarks rely uniformly on text prompts for evaluation, which often require complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address it, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation.

Perception: OA (Object Attribute), HA (Human Attribute), OD (Object Direction), FM (Feature Mapping)
Temporal: FT (Forward Temporal), RT (Reverse Temporal), AS (Action Sequence)
Reasoning: CR (Causal Relationship), PU (Plot Understanding), CI (Counterfactual Inference)
Spatial: SR (Spatial Relationship)
Counting: GC (General Counting)
The best results are highlighted in bold and the second-best are underlined.

#	Model	Perception				Reasoning			Temporal			Spatial	Counting	Avg
#	Model	OA	HA	OD	FM	CR	PU	CI	FT	RT	AS	SR	GC	Avg
1	Gemini-1.5-Pro 🥇	70.9	74.8	34.8	58.8	80.7	76.8	48.6	70.4	70.5	46.4	70.0	56.1	67.9
2	GPT-4o 🥈	76.6	68.9	41.3	60.8	67.0	73.3	67.6	68.1	70.5	50.0	54.0	48.4	65.4
3	Gemini-1.5-Flash 🥉	61.2	64.4	28.3	52.6	72.3	64.2	37.8	58.0	54.5	35.1	56.0	52.1	57.3
4	GPT-4o-Mini	68.8	61.0	30.4	49.0	65.1	63.6	32.4	48.3	56.8	41.1	62.0	45.3	56.3
5	mPLUG-Owl3-7B	57.4	59.7	39.1	43.9	60.7	58.4	27.0	61.3	75.0	38.6	50.0	37.9	54.3
6	LLaVA-Video-7B	64.3	54.9	32.6	56.1	50.0	59.5	48.6	47.9	54.5	49.1	52.0	36.8	52.6
7	Qwen2-VL-7B	49.6	54.9	32.6	47.4	58.0	57.2	70.3	54.6	52.3	28.1	48.0	32.6	50.7
8	LLaVA-OV-7B	59.7	54.5	32.6	36.8	46.4	59.0	35.1	53.8	59.1	36.8	50.0	32.6	49.9
9	MiniCPM-V 2.6-8B	50.4	51.8	17.4	49.1	53.6	61.8	37.8	49.6	50.0	31.6	48.0	27.4	48.0
10	LLaVA-NeXT-INST-IT-7B	57.4	58.4	26.1	42.4	43.0	49.2	31.6	49.2	42.2	26.3	42.0	27.4	46.3
11	LLaVA-NeXT-7B	56.6	55.6	34.8	52.5	43.0	48.6	31.6	42.6	42.2	28.1	42.0	30.5	46.0
12	VideoLLaMA2-7B	47.3	45.8	26.1	45.6	41.1	52.0	35.1	44.5	50.0	29.8	44.0	32.6	43.4
13	InternVL2.5-8B	50.4	48.2	26.1	57.9	37.5	47.4	51.4	40.3	38.6	36.8	30.0	31.6	43.2
14	PLLaVA-7B	45.7	48.2	21.7	45.6	39.3	54.9	24.3	47.1	45.5	28.1	40.0	28.4	43.0
15	InternVL2-8B	48.1	47.8	23.9	35.1	42.9	51.4	59.5	42.0	36.4	28.1	46.0	24.2	42.7
16	ShareGPT4Video-8B	40.3	43.1	21.7	45.6	40.2	45.7	51.4	43.7	40.9	24.6	46.0	30.5	40.6
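
As a quick sanity check on how the Avg column relates to the per-dimension scores, the sketch below recomputes an unweighted (macro) mean over the 12 dimensions for the top two rows. The scores are copied from the table above; the helper name `macro_mean` is illustrative, and the reading that Avg is computed over all QA pairs rather than over dimensions is an assumption, since the page does not state the formula.

```python
# Per-dimension accuracies copied from the leaderboard rows above
# (column order: OA, HA, OD, FM, CR, PU, CI, FT, RT, AS, SR, GC),
# paired with the Avg value reported in the same row.
ROWS = {
    "Gemini-1.5-Pro": (
        [70.9, 74.8, 34.8, 58.8, 80.7, 76.8, 48.6, 70.4, 70.5, 46.4, 70.0, 56.1],
        67.9,
    ),
    "GPT-4o": (
        [76.6, 68.9, 41.3, 60.8, 67.0, 73.3, 67.6, 68.1, 70.5, 50.0, 54.0, 48.4],
        65.4,
    ),
}


def macro_mean(scores):
    """Unweighted mean: every dimension counts equally, regardless of size."""
    return sum(scores) / len(scores)


for model, (scores, reported_avg) in ROWS.items():
    print(f"{model}: macro mean = {macro_mean(scores):.1f}, reported Avg = {reported_avg:.1f}")
```

For both models the macro mean comes out a few points below the reported Avg, which would be consistent with Avg being weighted by the number of QA pairs in each dimension. That is only an inference from the arithmetic, not something the page confirms.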

Overview

Video source and category distribution in V2P-Bench

Performance of different models on V2P-Bench by dimension

Distribution of QA dimensions and video durations.

Various Visual Prompt Types

While annotating the QA pairs, annotators are also required to perform visual prompt annotations.
To align with real-world distributions, we adopt a fully manual approach for annotating the video frames.
We predefine the following visual prompt types: rectangle, mask contour, ellipse, triangle, scribble, point, arrow, and Set-of-Mark (SoM).
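
To make the annotation concrete, here is a minimal sketch of how one visual prompt annotation could be represented. Only the eight prompt type names come from the text above; the record layout, the field names (`frame_index`, `points`), and the minimum point counts in `MIN_POINTS` are hypothetical and do not describe the released annotation format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class VisualPromptType(Enum):
    """The eight prompt types listed on this page."""
    RECTANGLE = "rectangle"
    MASK_CONTOUR = "mask contour"
    ELLIPSE = "ellipse"
    TRIANGLE = "triangle"
    SCRIBBLE = "scribble"
    POINT = "point"
    ARROW = "arrow"
    SOM = "SoM"


# Hypothetical minimum number of (x, y) points needed to draw each type.
MIN_POINTS = {
    VisualPromptType.RECTANGLE: 2,     # two opposite corners
    VisualPromptType.MASK_CONTOUR: 3,  # closed outline
    VisualPromptType.ELLIPSE: 2,       # corners of the bounding box
    VisualPromptType.TRIANGLE: 3,      # three vertices
    VisualPromptType.SCRIBBLE: 2,      # free-form stroke
    VisualPromptType.POINT: 1,
    VisualPromptType.ARROW: 2,         # tail and head
    VisualPromptType.SOM: 1,           # position of the mark
}


@dataclass
class VisualPromptAnnotation:
    """Hypothetical record for one manually drawn prompt on one frame."""
    frame_index: int
    prompt_type: VisualPromptType
    points: List[Tuple[float, float]]  # pixel coordinates

    def is_valid(self) -> bool:
        """Check the frame index and that enough points were supplied."""
        return self.frame_index >= 0 and len(self.points) >= MIN_POINTS[self.prompt_type]


box = VisualPromptAnnotation(12, VisualPromptType.RECTANGLE, [(40.0, 60.0), (180.0, 220.0)])
print(box.prompt_type.value, box.is_valid())
```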

More Results

Results on two data formats. Retrieval refers to the sequential input of the original video, questions, and visual prompt frames; Needle refers to embedding the visual prompt frames into the video.
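
The two data formats described in this caption can be sketched as follows. Frames are stood in for by plain strings, and the function names are illustrative. Two details are assumptions not stated on the page: that "embedding" in the Needle format means replacing the original frame at the same index with its annotated version, and that the question follows the video in the Needle format.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, object]  # (segment kind, payload)


def build_retrieval_input(
    video_frames: List[str], question: str, prompt_frames: Dict[int, str]
) -> List[Segment]:
    """Retrieval: original video, then the question, then the prompt frames."""
    ordered_prompts = [prompt_frames[i] for i in sorted(prompt_frames)]
    return [
        ("video", list(video_frames)),
        ("text", question),
        ("prompt_frames", ordered_prompts),
    ]


def build_needle_input(
    video_frames: List[str], question: str, prompt_frames: Dict[int, str]
) -> List[Segment]:
    """Needle: prompt frames are embedded into the video itself.

    Assumption: each annotated frame replaces the original frame at its index.
    """
    merged = list(video_frames)  # copy so the caller's list is not modified
    for index, annotated in prompt_frames.items():
        merged[index] = annotated
    return [("video", merged), ("text", question)]


frames = ["f0", "f1", "f2", "f3"]
prompts = {2: "f2+rectangle"}  # frame index -> annotated frame
question = "What is the person marked by the rectangle doing?"

print(build_retrieval_input(frames, question, prompts))
print(build_needle_input(frames, question, prompts))
```

In the Retrieval layout the model has to match the separately supplied prompt frames back to the video, whereas in the Needle layout it has to find the annotated frames inside the video stream.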

Evaluation results on V2P-Bench across durations. The best results are in bold and the second-best are underlined.

Visualization Examples

📃 BibTeX


@article{zhao2025v2p,
  title={V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction},
  author={Zhao, Yiming and Zeng, Yu and Qi, Yukun and Liu, YaoYang and Chen, Lin and Chen, Zehui and Bao, Xikun and Zhao, Jie and Zhao, Feng},
  journal={arXiv preprint arXiv:2503.17736},
  year={2025}
}
