V2P-Bench: Evaluating Video-Language Understanding with
Visual Prompts for Better Human-Model Interaction


University of Science and Technology of China

πŸ”₯What's New

  • [2025.03.20] 🌟 We have released V2P-Bench, a comprehensive benchmark specifically designed to evaluate the video understanding capabilities of LVLMs in human-model interaction scenarios.

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in video understanding recently. However, current benchmarks rely on text prompts for evaluation, which often require complex referential language and fail to provide precise spatial and temporal references, diminishing the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly below the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation.


V2P-Bench Leaderboard

Perception: OA (Object Attribute), HA (Human Attribute), OD (Object Direction), FM (Feature Mapping)
Reasoning: CR (Causal Relationship), PU (Plot Understanding), CI (Counterfactual Inference)
Temporal: FT (Forward Temporal), RT (Reverse Temporal), AS (Action Sequence)
Spatial: SR (Spatial Relationship)
Counting: GC (General Counting)
The best results are highlighted in bold and the second-best are underlined.
| # | Model | OA | HA | OD | FM | CR | PU | CI | FT | RT | AS | SR | GC | Avg |
|---|-------|----|----|----|----|----|----|----|----|----|----|----|----|-----|
| 1 | Gemini-1.5-Pro πŸ₯‡ | 70.9 | 74.8 | 34.8 | 58.8 | 80.7 | 76.8 | 48.6 | 70.4 | 70.5 | 46.4 | 70.0 | 56.1 | 67.9 |
| 2 | GPT-4o πŸ₯ˆ | 76.6 | 68.9 | 41.3 | 60.8 | 67.0 | 73.3 | 67.6 | 68.1 | 70.5 | 50.0 | 54.0 | 48.4 | 65.4 |
| 3 | Gemini-1.5-Flash πŸ₯‰ | 61.2 | 64.4 | 28.3 | 52.6 | 72.3 | 64.2 | 37.8 | 58.0 | 54.5 | 35.1 | 56.0 | 52.1 | 57.3 |
| 4 | GPT-4o-Mini | 68.8 | 61.0 | 30.4 | 49.0 | 65.1 | 63.6 | 32.4 | 48.3 | 56.8 | 41.1 | 62.0 | 45.3 | 56.3 |
| 5 | mPLUG-Owl3-7B | 57.4 | 59.7 | 39.1 | 43.9 | 60.7 | 58.4 | 27.0 | 61.3 | 75.0 | 38.6 | 50.0 | 37.9 | 54.3 |
| 6 | LLaVA-Video-7B | 64.3 | 54.9 | 32.6 | 56.1 | 50.0 | 59.5 | 48.6 | 47.9 | 54.5 | 49.1 | 52.0 | 36.8 | 52.6 |
| 7 | Qwen2-VL-7B | 49.6 | 54.9 | 32.6 | 47.4 | 58.0 | 57.2 | 70.3 | 54.6 | 52.3 | 28.1 | 48.0 | 32.6 | 50.7 |
| 8 | LLaVA-OV-7B | 59.7 | 54.5 | 32.6 | 36.8 | 46.4 | 59.0 | 35.1 | 53.8 | 59.1 | 36.8 | 50.0 | 32.6 | 49.9 |
| 9 | MiniCPM-V 2.6-8B | 50.4 | 51.8 | 17.4 | 49.1 | 53.6 | 61.8 | 37.8 | 49.6 | 50.0 | 31.6 | 48.0 | 27.4 | 48.0 |
| 10 | LLaVA-NeXT-INST-IT-7B | 57.4 | 58.4 | 26.1 | 42.4 | 43.0 | 49.2 | 31.6 | 49.2 | 42.2 | 26.3 | 42.0 | 27.4 | 46.3 |
| 11 | LLaVA-NeXT-7B | 56.6 | 55.6 | 34.8 | 52.5 | 43.0 | 48.6 | 31.6 | 42.6 | 42.2 | 28.1 | 42.0 | 30.5 | 46.0 |
| 12 | VideoLLaMA2-7B | 47.3 | 45.8 | 26.1 | 45.6 | 41.1 | 52.0 | 35.1 | 44.5 | 50.0 | 29.8 | 44.0 | 32.6 | 43.4 |
| 13 | InternVL2.5-8B | 50.4 | 48.2 | 26.1 | 57.9 | 37.5 | 47.4 | 51.4 | 40.3 | 38.6 | 36.8 | 30.0 | 31.6 | 43.2 |
| 14 | PLLaVA-7B | 45.7 | 48.2 | 21.7 | 45.6 | 39.3 | 54.9 | 24.3 | 47.1 | 45.5 | 28.1 | 40.0 | 28.4 | 43.0 |
| 15 | InternVL2-8B | 48.1 | 47.8 | 23.9 | 35.1 | 42.9 | 51.4 | 59.5 | 42.0 | 36.4 | 28.1 | 46.0 | 24.2 | 42.7 |
| 16 | ShareGPT4Video-8B | 40.3 | 43.1 | 21.7 | 45.6 | 40.2 | 45.7 | 51.4 | 43.7 | 40.9 | 24.6 | 46.0 | 30.5 | 40.6 |
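
For reference, the sketch below shows one way the per-dimension accuracies and an overall score could be aggregated from per-question judgements. It is not the official evaluation code; in particular, the question-weighted Avg is an assumption, as the exact weighting is not stated on this page.

```python
# Sketch only: aggregate per-question judgements into per-dimension accuracies
# and an overall score. The question-weighted "Avg" is an assumption.
from collections import defaultdict

def aggregate_scores(results):
    """results: iterable of dicts such as {"dimension": "OA", "correct": True}."""
    per_dim = defaultdict(lambda: [0, 0])  # dimension -> [num_correct, num_total]
    for r in results:
        per_dim[r["dimension"]][0] += int(r["correct"])
        per_dim[r["dimension"]][1] += 1
    scores = {dim: 100.0 * c / t for dim, (c, t) in per_dim.items()}
    # Assumed: overall accuracy computed over all questions (question-weighted).
    total_correct = sum(c for c, _ in per_dim.values())
    total = sum(t for _, t in per_dim.values())
    scores["Avg"] = 100.0 * total_correct / total
    return scores
```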

V2P-Bench Dataset

Overview


Video source and category distribution in V2P-Bench


Performance of different models on V2P-Bench by dimension


Distribution of QA dimensions and video durations.

Various Visual Prompt Types

While annotating the QA pairs, annotators are also required to add visual prompt annotations.
To align with real-world distributions, we adopt a fully manual approach to annotating the video frames.
We predefine the following visual prompt types: rectangle, mask contour, ellipse, triangle, scribble, point, arrow, and SoM (Set-of-Mark).
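
As an illustration of these prompt types (this is not the annotation tool used for V2P-Bench; the frame path, coordinates, and colors below are hypothetical), a few of them can be overlaid on a frame with OpenCV:

```python
# Illustrative only: overlay rectangle, ellipse, point, and arrow prompts
# on a single video frame. Paths and coordinates are placeholders.
import cv2

frame = cv2.imread("frame_000123.jpg")  # hypothetical frame path
assert frame is not None, "frame not found"

# Rectangle prompt around a target object.
cv2.rectangle(frame, (120, 80), (260, 220), color=(0, 0, 255), thickness=3)

# Ellipse prompt.
cv2.ellipse(frame, (400, 150), (60, 40), 0, 0, 360, color=(0, 255, 0), thickness=3)

# Point prompt (filled circle).
cv2.circle(frame, (520, 300), 6, color=(255, 0, 0), thickness=-1)

# Arrow prompt pointing at a region of interest.
cv2.arrowedLine(frame, (50, 350), (150, 300), color=(0, 255, 255), thickness=3, tipLength=0.2)

cv2.imwrite("frame_000123_prompted.jpg", frame)
```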

Experiment Results

More Results


Results on two data formats. Retrieval refers to the sequential input of the original video, questions, and visual prompt frames; Needle refers to embedding the visual prompt frames into the video.
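
To make the two formats concrete, here is a minimal sketch of how the model inputs could be assembled. The helper names are hypothetical, and Needle is read here as replacing the sampled frames at the annotated positions with their visually prompted versions, which is an assumption about the exact embedding.

```python
# Hypothetical helpers illustrating the two input formats described above.
# Exact model APIs differ; this only shows the ordering/placement of frames.

def build_retrieval_input(video_frames, question, prompt_frames):
    """Retrieval: original video, then the question, then the visual prompt frames."""
    return {"video": video_frames, "text": question, "images": prompt_frames}

def build_needle_input(video_frames, question, prompt_frames, prompt_frame_indices):
    """Needle: visual prompt frames are embedded directly into the frame stream.
    Assumption: they replace the sampled frames at the annotated indices."""
    frames = list(video_frames)
    for idx, prompted in zip(prompt_frame_indices, prompt_frames):
        frames[idx] = prompted
    return {"video": frames, "text": question, "images": []}
```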


Evaluation results on V2P-Bench across video durations. The best results are highlighted in bold and the second-best are underlined.

Visualization Examples

πŸ“ƒ BibTeX


@article{zhao2025v2p,
  title={V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction},
  author={Zhao, Yiming and Zeng, Yu and Qi, Yukun and Liu, YaoYang and Chen, Lin and Chen, Zehui and Bao, Xikun and Zhao, Jie and Zhao, Feng},
  journal={arXiv preprint arXiv:2503.17736},
  year={2025}
}