Friday, December 19, 2025

On WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

This could be an interesting new world model by Tencent!

"Tencent’s WorldPlay video model keeps revisited scenes consistent

Researchers built WorldPlay to solve a core problem in interactive world models: existing systems are either fast but inconsistent, or consistent but slow. WorldPlay achieves both through three key techniques.
First, it uses dual action controls—combining keyboard inputs (which work across different scene scales) with precise camera positions (which enable accurate memory retrieval).
Second, it maintains a “reconstituted context memory” that pulls relevant past frames and uses “temporal reframing” to keep geometrically important old frames influential, essentially treating distant memories as if they’re recent.
Third, it uses “context forcing,” a distillation method that aligns what the teacher and student models remember, enabling real-time generation without losing consistency or accumulating errors.
Trained on 320,000 real and synthetic videos, the system runs at 24 FPS on 8 H800 GPUs and works across first-person and third-person views in both realistic and stylized environments. It also supports 3D reconstruction and lets users trigger events with text prompts during generation."
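
To make the memory idea in the summary above more concrete, here is a toy sketch of retrieving old frames by camera position and then "reframing" them in time so they sit right next to the recent ones. It is only an illustration of the idea as described, not WorldPlay's implementation: the names MemoryFrame, build_context, recent and retrieved are hypothetical, and the real model presumably works on latent video frames and positional encodings rather than bare 3D points.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryFrame:
    index: int        # original time step at which the frame was generated
    position: tuple   # hypothetical camera position (x, y, z) for that frame


def build_context(memory, current_position, current_index, recent=2, retrieved=2):
    """Pick the latest frames plus the older frames seen from the nearest camera
    positions, then assign them contiguous time slots ending just before
    current_index (the toy stand-in for temporal reframing)."""
    recent_frames = memory[-recent:]
    older_frames = memory[:-recent]
    # Retrieval: rank older frames by camera distance to the current viewpoint.
    by_distance = sorted(
        older_frames, key=lambda f: math.dist(f.position, current_position)
    )
    picked = sorted(by_distance[:retrieved], key=lambda f: f.index)
    context = picked + recent_frames
    # Reframing: old frames get recent-looking time slots instead of their
    # true, distant indices.
    start = current_index - len(context)
    return [(frame, start + offset) for offset, frame in enumerate(context)]


# A camera walks away from the origin and back again (assumed toy path).
xs = [0, 1, 2, 3, 4, 3, 2, 1]
memory = [MemoryFrame(i, (float(x), 0.0, 0.0)) for i, x in enumerate(xs)]

context = build_context(memory, current_position=(0.0, 0.0, 0.0), current_index=8)
print([(frame.index, slot) for frame, slot in context])
# prints [(0, 4), (1, 5), (6, 6), (7, 7)]
```

In this toy run, frames 0 and 1 were captured near the viewpoint the camera is returning to, so they are pulled back into the context and given time slots 4 and 5, directly before the two most recent frames.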

From the abstract:
"This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods.
WorldPlay draws power from three key innovations.
1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs.
2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation.
3) We also propose Context Forcing, a novel distillation method designed for memory-aware model.
Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift.
Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: this https URL and this https URL."

Credits: Data Points newsletter

