VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available (2)
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos | Apr 16, 2023 | Action Recognition, Audio Tagging | Code Available (1)
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | Apr 15, 2023 | Language Modeling, Language Modelling | Unverified (0)
Self-Supervised Video Similarity Learning | Apr 6, 2023 | ISVR, Retrieval | Code Available (1)
Perfect Match in Video Retrieval | Mar 29, 2023 | Retrieval, Video Retrieval | Unverified (0)
Free-Form Multi-Modal Multimedia Retrieval (4MR) | Mar 29, 2023 | Form, Management | Unverified (0)
Hierarchical Video-Moment Retrieval and Step-Captioning | Mar 29, 2023 | Information Retrieval, Moment Retrieval | Code Available (1)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | Mar 28, 2023 | Action Localization, Action Recognition | Unverified (0)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification, Action Recognition | Code Available (0)
Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval | Mar 28, 2023 | Action Recognition, Contrastive Learning | Unverified (0)
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | Mar 25, 2023 | Contrastive Learning, Question Answering | Code Available (1)
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations | Mar 24, 2023 | Contrastive Learning, Image Retrieval | Code Available (0)
Dialogue-to-Video Retrieval | Mar 23, 2023 | Recommendation Systems, Retrieval | Code Available (0)
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available (1)
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | Mar 17, 2023 | Retrieval, Video Retrieval | Code Available (1)
VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression | Mar 15, 2023 | Retrieval, Video Retrieval | Code Available (1)
Accommodating Audio Modality in CLIP for Multimodal Processing | Mar 12, 2023 | AudioCaps, Contrastive Learning | Code Available (0)
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | Mar 10, 2023 | Multi-Label Classification, MUlTI-LABEL-ClASSIFICATION | Unverified (0)
Improving Video Retrieval by Adaptive Margin | Mar 9, 2023 | Retrieval, Video Retrieval | Unverified (0)
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | Feb 20, 2023 | Language Modelling, Object | Unverified (0)
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning | Feb 19, 2023 | Representation Learning, Retrieval | Code Available (0)
Is Multimodal Vision Supervision Beneficial to Language? | Feb 10, 2023 | Image Retrieval, Natural Language Understanding | Code Available (0)
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available (0)
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code Available (4)
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Jan 26, 2023 | Representation Learning, Retrieval | Code Available (1)
Zorro: the masked multimodal transformer | Jan 23, 2023 | Audio Tagging, Multimodal Deep Learning | Code Available (0)
Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | Unverified (0)
UATVR: Uncertainty-Adaptive Text-Video Retrieval | Jan 16, 2023 | Retrieval, Semantic correspondence | Code Available (1)
Learning Trajectory-Word Alignments for Video-Language Tasks | Jan 5, 2023 | Question Answering, Retrieval | Unverified (0)
PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | Jan 1, 2023 | Representation Learning, Retrieval | Unverified (0)
HiVLP: Hierarchical Interactive Video-Language Pre-Training | Jan 1, 2023 | Retrieval, Self-Supervised Learning | Unverified (0)
Exploring Temporal Concurrency for Video-Language Representation Learning | Jan 1, 2023 | Dynamic Time Warping, Metric Learning | Code Available (0)
Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval | Jan 1, 2023 | Knowledge Distillation, Language Modelling | Code Available (1)
Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval | Jan 1, 2023 | Diversity, Object | Code Available (1)
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Dec 31, 2022 | Data Augmentation, Retrieval | Code Available (2)
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | Unverified (0)
TempCLR: Temporal Alignment Representation with Contrastive Learning | Dec 28, 2022 | Action Recognition, Contrastive Learning | Code Available (1)
You were saying? - Spoken Language in the V3C Dataset | Dec 15, 2022 | Retrieval, Video Retrieval | Code Available (0)
Contextual Explainable Video Representation: Human Perception-based Understanding | Dec 12, 2022 | Action Detection, Action Recognition | Code Available (0)
VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question Answering, Retrieval | Code Available (1)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | Unverified (0)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification, Action Recognition | Code Available (4)
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | Dec 2, 2022 | Image-text Retrieval, Retrieval | Unverified (0)
Normalized Contrastive Learning for Text-Video Retrieval | Nov 30, 2022 | Contrastive Learning, Cross-Modal Retrieval | Code Available (1)
Renmin University of China at TRECVID 2022: Improving Video Search by Feature Fusion and Negation Understanding | Nov 28, 2022 | Ad-hoc video search, Negation | Unverified (0)
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval | Nov 23, 2022 | Cross-Modal Retrieval, Retrieval | Code Available (1)
TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision | Nov 23, 2022 | Retrieval, Video Retrieval | Code Available (1)
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available (2)
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval | Nov 21, 2022 | All, Retrieval | Code Available (0)
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning, Representation Learning | Code Available (1)