| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition, Question Answering | Code Available | 0 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Formmodel | Code Available | 1 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning, Audio-Visual Captioning | Code Available | 2 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | Unverified | 0 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available | 2 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | Mar 10, 2023 | Multi-Label Classification | Unverified | 0 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code Available | 4 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | Unverified | 0 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill Mask, Language Modeling | Code Available | 1 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code Available | 2 |