StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes tokens to grow without bound, creating efficiency bottlenecks. However, existing approaches regulate only the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens and ensuring predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate that our method achieves a 15.7× kv-cache compression ratio; compared to the prior state of the art (LiveVLM), it delivers 1.2× lower peak memory and 2× faster time-to-first-token (TTFT). StreamingTOM achieves state-of-the-art accuracy among training-free methods, averaging 63.8% on offline benchmarks and reaching 55.8% accuracy with a 3.7 score on RVS. These results demonstrate that real-time streaming video understanding with bounded active memory is achievable without model retraining.
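The two stages described above can be illustrated with a minimal sketch. Note this is not the paper's implementation: the change and saliency scores below are placeholder proxies (L2 difference to the previous frame's tokens, and token norm), and the 4-bit quantization is a simple per-tensor asymmetric scheme; the paper's actual scoring, grouping, and retrieval logic may differ.

```python
import numpy as np

def select_tokens(prev_tokens, tokens, budget):
    """Causal Temporal Reduction sketch: keep a fixed per-frame budget of
    tokens, scored by adjacent-frame change plus token saliency (both
    placeholder proxies, not the paper's exact criteria)."""
    change = np.linalg.norm(tokens - prev_tokens, axis=1)   # adjacent-frame change proxy
    saliency = np.linalg.norm(tokens, axis=1)               # saliency proxy
    score = change + saliency
    keep = np.argsort(score)[-budget:]                      # top-`budget` tokens
    return tokens[keep]

def quantize4(x):
    """Online Quantized Memory sketch: per-tensor asymmetric 4-bit
    quantization (16 levels) for compact storage."""
    lo, hi = float(x.min()), float(x.max())
    scale = max((hi - lo) / 15.0, 1e-8)                     # guard against zero range
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, lo, scale

def dequantize4(q, lo, scale):
    """Restore approximate float tokens on retrieval."""
    return q.astype(np.float32) * scale + lo

# Per-frame flow: select under a fixed budget, store 4-bit, retrieve on demand.
rng = np.random.default_rng(0)
prev = rng.standard_normal((196, 64)).astype(np.float32)    # previous frame's tokens
cur = prev + 0.1 * rng.standard_normal((196, 64)).astype(np.float32)
kept = select_tokens(prev, cur, budget=32)                  # fixed per-frame budget
q, lo, scale = quantize4(kept)                              # compact 4-bit memory
restored = dequantize4(q, lo, scale)                        # dequantized for the LLM
```

Because the budget is fixed, prefill cost per frame is constant, and the 4-bit store keeps memory growth at a fraction of the full-precision cache; the reconstruction error of this simple scheme is bounded by half a quantization step.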