Breakthrough in Video LLM Temporal Grounding: Continuous Decoding Paradigm Offers Optimal Efficiency-Accuracy Trade-off

While Multimodal Large Language Models (MLLMs) have made significant strides in Video Temporal Grounding (VTG), existing methods typically couple the output paradigm with differing backbones, datasets, and training protocols, which makes it difficult to isolate and evaluate the impact of the output design itself. Moreover, as VTG systems are increasingly considered for deployment on resource-constrained edge devices, the trade-off between output formulation and system-level efficiency demands systematic investigation.

To address these issues, a recent controlled empirical study compared three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. The researchers evaluated each paradigm on the same compact VLMs (SmolVLM2, FastVLM, and Molmo2) with identical datasets and LoRA fine-tuning protocols. Evaluations on established benchmarks, including Charades-STA, QVHighlights, and YouCook2, measured both localization accuracy and key system efficiency metrics: inference latency, training throughput, and parameter overhead.
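To make the contrast between the three paradigms concrete, here is a minimal illustrative sketch (not the study's actual code) of how each output formulation might be decoded into a (start, end) segment in seconds. The token format `<TIME_k>`, the constants `VIDEO_LEN` and `NUM_BINS`, and all function names are assumptions introduced for this example.

```python
import re

VIDEO_LEN = 60.0   # clip length in seconds (assumed for the example)
NUM_BINS = 100     # number of discrete time bins for temporal tokens (assumed)

def decode_text_numeral(generated_text: str) -> tuple[float, float]:
    """Text Numeral Generation: the LLM writes timestamps as plain digits,
    which must be parsed back out of free-form generated text."""
    start, end, *_ = (float(x) for x in re.findall(r"\d+(?:\.\d+)?", generated_text))
    return start, end

def decode_temporal_tokens(tokens: list[str]) -> tuple[float, float]:
    """Temporal Token Generation: special vocabulary tokens (e.g. <TIME_42>)
    index discrete time bins over the clip."""
    start_bin, end_bin = (int(re.search(r"\d+", t).group()) for t in tokens)
    return start_bin / NUM_BINS * VIDEO_LEN, end_bin / NUM_BINS * VIDEO_LEN

def decode_continuous(head_output: tuple[float, float]) -> tuple[float, float]:
    """Continuous Temporal Decoding: a lightweight regression head predicts
    normalized (start, end) in [0, 1] directly, with no autoregressive
    decoding of digits or time tokens, hence the low latency overhead."""
    start, end = head_output
    return start * VIDEO_LEN, end * VIDEO_LEN
```

The first two paradigms keep timestamps inside the token stream (free-form digits or a discrete time vocabulary), while the continuous paradigm moves them into a single regression pass, which is one plausible source of the latency advantage the study reports.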

The findings show that the choice of output formulation substantially influences both grounding accuracy and computational cost, independent of model scale. Notably, Continuous Temporal Decoding, the continuous-distribution paradigm, consistently achieved the most favorable efficiency-accuracy trade-off along the Pareto frontier, delivering robust localization while incurring minimal latency overhead. These results provide concrete guidelines for designing efficient, deployment-ready VTG systems and advance the practical use of video understanding models on edge hardware.
