VUDG: A Dataset for Video Understanding Domain Generalization

Ziyi Wang1, Zhi Gao1, Boxuan Yu1, Zirui Dai1, Yuxiang Song1, Qingyuan Lu1, Jin Chen1, Xinxiao Wu1,2

1Beijing Key Laboratory of Intelligent Information Technology,
School of Computer Science & Technology, Beijing Institute of Technology

2Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

Abstract

Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large video-language models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts. These results highlight the challenges posed by VUDG and the differences in the robustness of current models to data distribution shifts. We believe VUDG provides a valuable resource for promoting future research on domain-generalized video understanding.

Dataset Introduction

The training set comprises 6,337 video clips with 31,685 QA pairs (5 QA pairs per clip), and the test set contains 1,532 video clips with 4,703 QA pairs.
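As a hypothetical sketch (the annotation file schema and field names below are assumptions for illustration, not the actual VUDG release format), per-clip QA statistics like those above could be tallied from a flat list of annotation records:

```python
from collections import defaultdict

def qa_per_clip(annotations):
    """Count QA pairs per video clip.

    Each record is assumed (hypothetically) to look like:
    {"video_id": "...", "question": "...", "answer": "..."}
    """
    counts = defaultdict(int)
    for record in annotations:
        counts[record["video_id"]] += 1
    return dict(counts)

# Toy records (made up for illustration, not real VUDG data):
toy = [
    {"video_id": "clip_001", "question": "Q1", "answer": "A1"},
    {"video_id": "clip_001", "question": "Q2", "answer": "A2"},
    {"video_id": "clip_002", "question": "Q1", "answer": "A1"},
]
counts = qa_per_clip(toy)
print(len(counts), sum(counts.values()))  # 2 clips, 3 QA pairs
```

On the real training annotations, `sum(counts.values()) / len(counts)` would give the average of 5 QA pairs per clip reported above.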


Dataset Examples

Select a category from the dropdown menu and click "Preview" to view the corresponding videos and QA pairs.

Leaderboard


Citation

@article{wang2025vudg,
  title={VUDG: A Dataset for Video Understanding Domain Generalization},
  author={Wang, Ziyi and Gao, Zhi and Yu, Boxuan and Dai, Zirui and Song, Yuxiang and Lu, Qingyuan and Chen, Jin and Wu, Xinxiao},
  journal={arXiv preprint arXiv:2505.24346},
  year={2025}
}