| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking, Language Modeling | Code Available | 2 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | Nov 9, 2023 | Action Classification, Audio Classification | Unverified | 0 |
| Modular Blended Attention Network for Video Question Answering | Nov 2, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 |
| Long Story Short: a Summarize-then-Search Method for Long Video Question Answering | Nov 2, 2023 | Diversity, Question Answering | Code Available | 0 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | Unverified | 0 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | Form, Language Modelling | Code Available | 1 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | Oct 9, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | Unverified | 0 |
| MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | Oct 6, 2023 | counterfactual, Question Answering | Unverified | 0 |
| Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts | Sep 27, 2023 | Few-shot Video Question Answering, Prompt Learning | Code Available | 1 |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Sep 27, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| ATM: Action Temporality Modeling for Video Question Answering | Sep 5, 2023 | Contrastive Learning, Optical Flow Estimation | Unverified | 0 |
| Understanding Video Scenes through Text: Insights from Text-based Video Question Answering | Sep 4, 2023 | Domain Adaptation, Question Answering | Unverified | 0 |
| Can I Trust Your Answer? Visually Grounded Video Question Answering | Sep 4, 2023 | Grounded Video Question Answering, Question Answering | Code Available | 1 |
| Distraction-free Embeddings for Robust VQA | Aug 31, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Aug 18, 2023 | Image Captioning, Text Generation | Code Available | 1 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer | Aug 16, 2023 | Decoder, Question Answering | Code Available | 1 |
| Redundancy-aware Transformer for Video Question Answering | Aug 7, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 |