InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification Action Recognition | Code Code Available 75
ImageBind: One Embedding Space To Bind Them All | May 9, 2023 | All Cross-Modal Retrieval | Code Code Available 55
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification Action Recognition | Code Code Available 45
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Oct 3, 2023 | Audio Classification Contrastive Learning | Code Code Available 45
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification Image Classification | Code Code Available 45
Gramian Multimodal Representation Learning and Alignment | Dec 16, 2024 | Contrastive Learning Representation Learning | Code Code Available 25
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning Audio-Visual Captioning | Code Code Available 25
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition Computational Efficiency | Code Code Available 25
Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition Fine-grained Action Recognition | Code Code Available 25
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Apr 1, 2021 | Retrieval Text Retrieval | Code Code Available 15
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Oct 7, 2023 | Automatic Speech Recognition Video Captioning | Code Code Available 15
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval Image-to-Text Retrieval | Code Code Available 15
Make Your Training Flexible: Towards Deployment-Efficient Video Models | Mar 18, 2025 | Action Classification Zero-Shot Video Retrieval | Code Code Available 15
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval | Apr 26, 2022 | Action Recognition Retrieval | Code Code Available 15
Multi-modal Transformer for Video Retrieval | Jul 21, 2020 | Natural Language Queries Retrieval | Code Code Available 15
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering Retrieval | Code Code Available 15
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Dec 17, 2021 | cross-modal alignment Entity Alignment | Code Code Available 15
Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition Linear evaluation | Code Code Available 15
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval Text Retrieval | Code Code Available 15
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling Language Modelling | Code Code Available 15
End-to-End Learning of Visual Representations from Uncurated Instructional Videos | Dec 13, 2019 | Action Localization Action Recognition | Code Code Available 15
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval | Dec 8, 2021 | Action Localization Retrieval | Code Code Available 15
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval | Jan 1, 2022 | Action Localization Retrieval | Code Code Available 15
Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification Action Recognition | Code Code Available 15
Object-aware Video-language Pre-training for Retrieval | Dec 1, 2021 | Object Retrieval | Code Code Available 15
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU Video-based Generative Performance Benchmarking | Code Code Available 15
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment Image-text Retrieval | Code Code Available 15
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Apr 22, 2021 | Action Classification Action Recognition | Code Code Available 15
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | Nov 19, 2021 | Retrieval Super-Resolution | Code Code Available 15
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification Action Recognition | Code Code Available 05
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Apr 1, 2022 | Diversity Image Captioning | Code Code Available 05
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Mar 6, 2020 | Density Estimation Noise Estimation | Code Code Available 05
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Sep 28, 2021 | Action Localization Action Segmentation | Code Code Available 05
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment TGIF-Action | — Unverified 00
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning | Jan 1, 2024 | 3D Point Cloud Classification Action Classification | — Unverified 00
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification Action Recognition | — Unverified 00
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval | Jul 11, 2022 | Representation Learning Retrieval | — Unverified 00
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering Retrieval | — Unverified 00
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment | Aug 23, 2021 | Action Segmentation Contrastive Learning | — Unverified 00
Learning Audio-Video Modalities from Image Captions | Apr 1, 2022 | Image Captioning Retrieval | — Unverified 00