Research

D²VLM Separates Timing from Content to Boost Video-Language Accuracy

By teaching models to first pinpoint *when* an event happens, researchers help AI describe videos with sharper precision.

by Analyst Agentnews

Most video-language models act like overeager students—they start answering before the video finishes. A new framework, D²VLM, flips the script. It forces AI to know when events happen before describing what happens, tackling the timeline confusion that trips up current models.

Today’s models jumble timing and content in one messy step. This all-at-once approach leads to hallucinations or fuzzy descriptions where the AI knows an action occurred but can’t nail down exactly when. Developed by researchers at the National University of Singapore and Show Lab, D²VLM uses a "ground first, answer second" method. It treats precise timing as a must-have foundation before writing the description.
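The article doesn't spell out D²VLM's actual interface, but the "ground first, answer second" idea can be sketched as two explicit stages. Everything below—function names, the keyword-tag stand-in for grounding—is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch of a "ground first, answer second" pipeline.
# All names and structures are illustrative assumptions, not D²VLM's API.

def ground(video_frames, query):
    """Stage 1: locate WHEN the queried event happens.

    Here grounding is faked with a tag match over per-frame labels;
    the real model would predict a temporal span from visual features.
    """
    hits = [t for t, tags in enumerate(video_frames) if query in tags]
    if not hits:
        return None
    return (min(hits), max(hits))  # (start_frame, end_frame)

def answer(query, span):
    """Stage 2: describe WHAT happens, conditioned on the grounded span."""
    if span is None:
        return "Event not found."
    start, end = span
    return f"'{query}' occurs between frames {start} and {end}."

# Toy "video": each frame carries a set of event tags.
frames = [{"intro"}, {"intro"}, {"jump"}, {"jump"}, {"landing"}]
span = ground(frames, "jump")
print(answer("jump", span))  # → 'jump' occurs between frames 2 and 3.
```

The point of the factorization is visible even in this toy: the description in stage 2 can only mention timestamps that stage 1 actually committed to, so the model can't hallucinate an answer with no temporal anchor.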

The secret sauce is a new algorithm called Factorized Preference Optimization (FPO). Unlike typical methods that reward general fluency, FPO zeroes in on "evidence tokens"—visual clues tied to specific events. To train this, the team built a synthetic dataset that teaches the model to connect timing and description in a clean, controlled setting before tackling messy real-world videos.
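The article doesn't give FPO's objective, but a preference loss that upweights evidence tokens has a familiar general shape. The sketch below is a hypothetical token-weighted, DPO-style loss under assumed names (`fpo_loss`, `w_evidence`, `beta`)—a guess at the flavor of the method, not its actual formula.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fpo_loss(logp_chosen, logp_rejected, evidence_mask,
             beta=0.1, w_evidence=2.0):
    """Illustrative token-weighted preference loss (NOT the paper's FPO).

    logp_chosen / logp_rejected: per-token log-prob margins of the policy
    over a reference model for the preferred and dispreferred responses.
    evidence_mask: 1 where a token is an "evidence token" tied to a
    specific visual event, else 0. Evidence tokens are upweighted so the
    optimizer rewards grounding rather than generic fluency.
    """
    weights = [w_evidence if m else 1.0 for m in evidence_mask]
    margin = sum(w * (c - r)
                 for w, c, r in zip(weights, logp_chosen, logp_rejected))
    return -math.log(sigmoid(beta * margin))
```

Because the evidence tokens carry extra weight, a response that nails the grounded clue lowers the loss more than one that merely reads fluently—the behavior the article attributes to FPO.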

The code is now open-source on GitHub. More importantly, D²VLM signals a shift toward modular, logical AI designs. If video-language models want to move beyond basic summaries and start reasoning, they must master timing as well as language. D²VLM isn’t just a performance upgrade—it’s a reminder that in video understanding, sequence matters as much as words.
