
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

CVPR 2026
Jingqi Tong2,4,5,*, Yurong Mou2,4,5,*, Hangcheng Li2,4,5,*, Mingzhe Li2,4,5,*, Yongzhuo Yang4,5,*, Ming Zhang4, Qiguang Chen7, Tianyi Liang2,5, Xiaomeng Hu6, Yining Zheng1,3,5, Xinchi Chen1,3,4,5,†, Jun Zhao4,†, Xuanjing Huang1,3,4, Xipeng Qiu1,2,3,4,5,†
1Institute of Trustworthy Embodied AI, Fudan University 2Shanghai Innovation Institute 3Shanghai Key Laboratory of Multimodal Embodied AI
4College of Computer Science and Artificial Intelligence, Fudan University 5OpenMOSS Team 6The Chinese University of Hong Kong 7Central South University
* Core contribution   † Corresponding authors
Paper PDF · GitHub · 🤗 Benchmark · 🤗 Daily Paper · Leaderboard · Twitter

TL;DR

We introduce "Thinking with Video", a new paradigm that leverages video generation for multimodal reasoning. On our VideoThinkBench, Sora-2 surpasses GPT-5 by 10% on Eyeballing Puzzles and reaches 69.2% accuracy on MMMU.

Thinking with Video - Research Overview (example tasks: GSM8K, Visual Puzzle, Ray Reflection, ARC-AGI-2, Maze)

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and cannot represent dynamic processes or continuous changes, and (2) text and vision remain separate modalities, which hinders unified multimodal understanding and generation. We therefore propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU).

Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the sources of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance.
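The self-consistency gain mentioned above can be illustrated with a minimal majority-vote sketch. This is an illustration of the general technique, not the paper's evaluation code: the sampling and answer-parsing steps are assumed to have already produced a list of candidate final answers, and only the aggregation is shown.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer among independent samples.

    Minimal sketch of self-consistency: query the model several
    times (e.g., generate several videos for the same question),
    parse a final answer from each, then take a majority vote.
    Ties are broken by first occurrence (Counter preserves
    insertion order).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Example: five sampled answers to a GSM8K-style question.
samples = ["42", "42", "41", "42", "40"]
print(self_consistency_vote(samples))  # → 42
```

Because each sample is independent, accuracy typically improves with the number of samples as long as the model is right more often than any single wrong answer is repeated.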

In summary, our findings suggest that video generation models can serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a promising unified multimodal reasoning paradigm.

Leaderboard on VideoThinkBench (minitest)

Vision-Centric Tasks

Video Generation Models 🎞️

Image Generation Models 🖼️

Vision-Language Models 📃

Note:
"Eyeballing Point/Line/Shape" refers to the Point, Line, and Shape Tasks in the Eyeballing Puzzles. Results for video generation models use Major Frame evaluation.
"Visual Symmetry/Gradient/Compositionality" refers to the Symmetry, Gradient, and Compositionality Tasks in the Visual Puzzles.
"Maze Square/Hexagon/Circle" refers to the Square, Hexagon, and Circle Mazes in the Maze Tasks.
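The Major Frame evaluation noted above can be sketched as follows. This reflects our reading of the protocol (parse a candidate answer from each sampled frame of the generated video, then keep the answer that appears in the most frames); `major_frame_answer` is an illustrative helper, not the paper's code, and the per-frame parser is assumed to exist upstream.

```python
from collections import Counter

def major_frame_answer(frame_answers):
    """Aggregate per-frame parsed answers into one prediction.

    frame_answers: one parsed answer per sampled frame, or None
    for frames where no answer could be parsed (e.g., early
    frames where the model is still "drawing" its solution).
    Returns the answer appearing in the most frames, or None if
    no frame yielded an answer.
    """
    counts = Counter(a for a in frame_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Early frames are blank; later frames converge on the answer.
print(major_frame_answer([None, "B", "C", "C", "C"]))  # → C
```

Voting over frames makes the score robust to transient frames, since the final answer usually dominates the tail of the generated video.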

Text-Centric Tasks

Video Generation Models 🎞️

Image Generation Models 🖼️

Vision-Language Models 📃

Leaderboard on VideoThinkBench (test)

Vision-Centric Tasks

Text-Centric Tasks

Eyeballing Puzzles

Arc Connect

Circle Center

Circle Tangent Line

Circle Tangent Point

Isosceles Trapezoid

Midpoint

Orthocenter

Parallel

Parallelogram

Perpendicular Bisector

Ray Intersection

Ray Reflection

Visual Puzzles

Hexagon Color Pattern Match

Grid Color Pattern Match

Grid Size Pattern Match

Reflection Recognition & Application

Color Gradient Perception & Application

Cycle Size Pattern Match

Shape Color Pattern Match

Rectangle Height Color Match

Color Mixing Perception & Application

Grid Shape & Size Pattern Match

ARC-AGI-2


Maze

Square Maze

GSM8K


Multimodal Reasoning

MathVista


MMMU

MMBench

Failure Cases

Acknowledgement

Core Contributor of this Website: Jiahui Lin, Hongji Chen, Junpeng Zhang

BibTeX

@article{tong2025thinking,
  title={Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm},
  author={Tong, Jingqi and Mou, Yurong and Li, Hangcheng and Li, Mingzhe and Yang, Yongzhuo and Zhang, Ming and Chen, Qiguang and Liang, Tianyi and Hu, Xiaomeng and Zheng, Yining and others},
  journal={arXiv preprint arXiv:2511.04570},
  year={2025}
}