| Visual Causal Scene Refinement for Video Question Answering | May 7, 2023 | Contrastive Learning; Question Answering | Code Available | 3 |
| VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation | May 4, 2023 | Decoder; Question Answering | Unverified | 0 |
| ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | May 4, 2023 | Question Answering; Spatio-temporal Scene Graphs | Code Available | 0 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following; model | Code Available | 5 |
| A Review of Deep Learning for Video Captioning | Apr 22, 2023 | Deep Learning; Dense Video Captioning | Unverified | 0 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description; Language Modelling | Code Available | 7 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question Answering; Retrieval | Code Available | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder; Question Answering | Code Available | 1 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning; Audio-Video Question Answering (AVQA) | Code Available | 2 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning; Question Answering | Code Available | 0 |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | Apr 7, 2023 | Question Answering; Question Generation | Unverified | 0 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval; Decoder | Code Available | 0 |
| Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding | Mar 28, 2023 | Action Localization; Action Recognition | Unverified | 0 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification; Action Recognition | Code Available | 0 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | Mar 25, 2023 | Contrastive Learning; Question Answering | Code Available | 1 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning; Multimodal Sentiment Analysis | Code Available | 1 |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Mar 14, 2023 | Code Generation; Video Question Answering | Code Available | 3 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | Mar 10, 2023 | Multi-Label Classification | Unverified | 0 |
| Video Question Answering Using CLIP-Guided Visual-Text Attention | Mar 6, 2023 | General Knowledge; Question Answering | Unverified | 0 |
| Contrastive Video Question Answering via Video Graph Transformer | Feb 27, 2023 | Contrastive Learning; Question Answering | Code Available | 1 |
| Connecting Vision and Language with Video Localized Narratives | Feb 22, 2023 | Question Answering; Video Narrative Grounding | Code Available | 1 |
| STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | Feb 20, 2023 | Language Modelling; Object | Unverified | 0 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency; Question Answering | Code Available | 0 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification; Image Classification | Code Available | 4 |