HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025
Tai Nguyen, Vo Ngoc Minh Anh, Duc Dat Pham, Tran Quang Vinh, Nhu Duong Thi Quynh, Le Anh Tien, Tan Duy Le, Binh T. Nguyen
Abstract
In the dynamic field of video retrieval, precise and effective search methods are crucial for managing complex datasets. We present HORUS, a novel approach based on multimodal Large Language Models (mLLMs) that advances video retrieval through two key innovations: (1) advanced multi-modal feature aggregation, which integrates text-to-image search with CLIP, free-text search over captions generated by Video-LLaMA2, and visual features from Video-LLaMA to capture temporal dynamics; and (2) GPT-based query expansion combined with an advanced filter, which addresses low-quality open-ended text queries and refines item searches by type and location within a scene. This work provides cutting-edge solutions for the VBS 2025 challenge and offers valuable insights into enhancing video search techniques.
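The abstract does not specify how the per-modality search results are combined. A minimal sketch of one plausible reading, late fusion by weighted score averaging, is shown below; the function names, the 4-dimensional toy embeddings, and the fusion weights are all hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def cosine_sim(query, items):
    # cosine similarity between one query vector and a matrix of item vectors
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    return items @ query

def fuse_scores(clip_scores, caption_scores, visual_scores,
                weights=(0.5, 0.3, 0.2)):
    # late fusion: weighted sum of per-modality similarity scores
    # (weights are illustrative, not taken from the paper)
    w_clip, w_cap, w_vis = weights
    return w_clip * clip_scores + w_cap * caption_scores + w_vis * visual_scores

# toy setup: 3 candidate video frames, 4-dim embeddings per modality
rng = np.random.default_rng(0)
query = rng.normal(size=4)
clip_emb = rng.normal(size=(3, 4))      # stands in for CLIP embeddings
caption_emb = rng.normal(size=(3, 4))   # stands in for caption-text embeddings
visual_emb = rng.normal(size=(3, 4))    # stands in for video-level features

scores = fuse_scores(cosine_sim(query, clip_emb),
                     cosine_sim(query, caption_emb),
                     cosine_sim(query, visual_emb))
ranking = np.argsort(-scores)  # best-first ordering of the 3 candidates
```

In practice the query and item embeddings would come from the respective models (CLIP, Video-LLaMA2 captions, Video-LLaMA features) rather than random vectors.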