HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025
Tai Nguyen, Vo Ngoc Minh Anh, Duc Dat Pham, Tran Quang Vinh, Nhu Duong Thi Quynh, Le Anh Tien, Tan Duy Le, Binh T. Nguyen
Abstract
In the dynamic field of video retrieval, precise and effective search methods are crucial for managing complex datasets. We present HORUS, a novel approach based on multimodal Large Language Models (mLLMs) that advances video retrieval through two key innovations: (1) advanced multi-modal feature aggregation, which integrates text-to-image search with CLIP, free-text search over captions generated by Video-LLaMA2, and visual features from Video-LLaMA to capture temporal dynamics; and (2) GPT-based query expansion combined with an advanced filter, which addresses low-quality open-ended text queries and refines item searches by type and location within a scene. This work provides cutting-edge solutions for the VBS 2025 challenge and offers valuable insights into enhancing video search techniques.
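The abstract does not specify how the per-modality search results are combined. A minimal sketch of one plausible reading, late fusion by weighted score averaging, is shown below; the function names, the 4-dimensional toy embeddings, and the fusion weights are all hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def cosine_sim(query, items):
    # cosine similarity between one query vector and a matrix of item vectors
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    return items @ query

def fuse_scores(clip_scores, caption_scores, visual_scores,
                weights=(0.5, 0.3, 0.2)):
    # late fusion: weighted sum of per-modality similarity scores
    # (weights are illustrative, not taken from the paper)
    w_clip, w_cap, w_vis = weights
    return w_clip * clip_scores + w_cap * caption_scores + w_vis * visual_scores

# toy setup: 3 candidate video frames, 4-dim embeddings per modality
rng = np.random.default_rng(0)
query = rng.normal(size=4)
clip_emb = rng.normal(size=(3, 4))      # stands in for CLIP embeddings
caption_emb = rng.normal(size=(3, 4))   # stands in for caption-text embeddings
visual_emb = rng.normal(size=(3, 4))    # stands in for video-level features

scores = fuse_scores(cosine_sim(query, clip_emb),
                     cosine_sim(query, caption_emb),
                     cosine_sim(query, visual_emb))
ranking = np.argsort(-scores)  # best-first ordering of the 3 candidates
```

In practice the query and item embeddings would come from the respective models (CLIP, Video-LLaMA2 captions, Video-LLaMA features) rather than random vectors.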