Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong1,2,*, Yurong Mou1,*, Hangcheng Li1,2,*, Mingzhe Li2,*, Yongzhuo Yang1,*, Ming Zhang1, Qiguang Chen3, Tianyi Liang1,2, Xiaomeng Hu4, Yining Zheng1, Xinchi Chen1, Jun Zhao1,†, Xuanjing Huang1, Xipeng Qiu1,2,†
1Fudan University    2Shanghai Innovation Institute    3Central South University    4The Chinese University of Hong Kong
* Core contribution   † Corresponding authors

TL;DR

We introduce "Thinking with Video", a new paradigm that leverages video generation for multimodal reasoning. On our VideoThinkBench, Sora-2 surpasses GPT-5 by 10% on Eyeballing Puzzles and reaches 69% accuracy on MMMU.

Overview figure: Thinking with Video, with demo examples on GSM8K, Visual Puzzle, Ray Intersection, ARC-AGI-2, and Maze tasks.

Abstract

"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU).

Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Puzzles. On text-centric tasks, it achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the sources of these abilities, and we find that self-consistency and in-context learning can improve Sora-2's performance.

In summary, our findings suggest that video generation models can serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
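
To make the paradigm concrete, the sketch below shows how one evaluation loop with self-consistency could look. This is a minimal illustration, not our released code: generate_video, extract_final_frame, and read_answer_from_frame are hypothetical placeholders rather than a real Sora-2 API, while the majority vote over sampled videos mirrors the self-consistency strategy mentioned above.

from collections import Counter


def generate_video(prompt: str, seed: int) -> str:
    """Placeholder for a video-generation call (e.g., a Sora-2 endpoint). Hypothetical."""
    raise NotImplementedError


def extract_final_frame(video_path: str):
    """Placeholder: grab the last frame, where the model is asked to write its answer."""
    raise NotImplementedError


def read_answer_from_frame(frame) -> str:
    """Placeholder: read the written answer off the frame (OCR or a VLM reader)."""
    raise NotImplementedError


def think_with_video(question: str, n_samples: int = 5) -> str:
    """Sample several videos for one question and majority-vote the answers
    read from their final frames (self-consistency)."""
    prompt = (
        "Reason step by step in the video and write the final answer "
        f"clearly in the last frame. Question: {question}"
    )
    answers = []
    for seed in range(n_samples):
        video = generate_video(prompt, seed=seed)       # one reasoning "rollout"
        frame = extract_final_frame(video)              # frame holding the answer
        answers.append(read_answer_from_frame(frame))   # parse the written answer
    # Majority vote across samples (self-consistency).
    return Counter(answers).most_common(1)[0][0]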

Eyeballing Puzzles

Examples: Arc Connect, Circle Center, Circle Tangent Line, Circle Tangent Point, Isosceles Trapezoid, Midpoint, Orthocenter, Parallel, Parallelogram, Perpendicular Bisector, Ray Intersection, Ray Reflection.

Visual Puzzles

Examples: Hexagon Color Pattern Match, Grid Color Pattern Match, Shape Color Pattern Match, Rectangle Height Color Match, Color Mixing Perception & Application, Color Gradient Perception & Application, Grid Size Pattern Match, Cycle Size Pattern Match, Grid Shape & Size Pattern Match, Reflection Recognition & Application.

ARC-AGI-2

Examples: four ARC-AGI-2 tasks.

Maze

Example: Square Maze.

GSM8K

Examples: two GSM8K problems.

Multimodal Reasoning

Examples: MathVista (two), MMMU, MMBench.

Failure Cases

BibTeX

@misc{tong2025thinkingvideovideogeneration,
      title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm}, 
      author={Jingqi Tong and Yurong Mou and Hangcheng Li and Mingzhe Li and Yongzhuo Yang and Ming Zhang and Qiguang Chen and Tianyi Liang and Xiaomeng Hu and Yining Zheng and Xinchi Chen and Jun Zhao and Xuanjing Huang and Xipeng Qiu},
      year={2025},
      eprint={2511.04570},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.04570}, 
}