Jiatong Guo Jiatong Guo

From End-to-End 3D Reconstruction to a Queryable 4D World

For two consecutive years, the CVPR Best Paper has been answering the same question: Can a unified feed-forward vision model directly build a geometric world representation of a scene from images or videos, and solve, in one shot, the originally fragmented 3D/4D tasks such as camera estimation, depth prediction, point cloud reconstruction, and point tracking?

Last year’s Best Paper, VGGT, gave the answer for 3D: it can be done end-to-end. VGGT puts camera intrinsics and extrinsics, depth, point clouds, and point tracks into the same Transformer. With a single forward pass, it can process up to hundreds of images and produce the results directly, largely bypassing the carefully engineered geometric computation modules in traditional SfM/MVS pipelines.

CVPR 2025 best paper - VGGT: Visual Geometry Grounded Transformer

This year’s Best Paper, D4RT, goes one step further. It is no longer limited to static multi-view images, but takes dynamic videos as input. It reformulates the problem of building a geometric world representation into the following question: given any pixel in any frame of a video, how can we determine the 3D position of that point at any moment in time and under any reference camera coordinate system?

CVPR 2026 best paper - D4RT: Dynamic 4D Reconstruction and Tracking

What makes D4RT important is not only that it achieves SOTA performance, but that it shifts the paradigm from the somewhat harder-to-accept idea of “end-to-end everything” to “complete representation + lightweight query.” First, an Encoder compresses the entire video into a Global Scene Representation. Then, a Decoder asks queries such as: Where is this pixel in frame 10? If we use the camera of frame 5 as the reference, how should the point cloud of the whole video be aligned?

More specifically, D4RT unifies the task into a single Query token, which I think is its most elegant design. This Query token contains five types of information: the normalized 2D coordinate of the pixel in the source frame, the source-frame timestep, the target timestep, the reference camera, and the embedding of the 9×9 RGB patch around that pixel. The Decoder then uses this Query token to cross-attend to the global scene representation and finally outputs a 3D point.

This has a flavor similar to the unified interface of LLMs: first compress the world into a holistic representation, then query it as needed. Perhaps the truly valuable 3D/4D models of the future will not merely output a beautiful point cloud, but answer questions about space, time, and motion: Where is the object? Where did it come from? Where will it go? Maybe this is the LLM moment for 3D vision.