| Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering | Jul 25, 2023 | graph construction; Question Answering | Unverified | 0 |
| Discovering Spatio-Temporal Rationales for Video Question Answering | Jul 22, 2023 | Question Answering; Video Question Answering | Code Available | 1 |
| Traffic-Domain Video Question Answering with Automatic Captioning | Jul 18, 2023 | Question Answering; Video Question Answering | Unverified | 0 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning; Image Generation | Code Available | 3 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | Jul 9, 2023 | Question Answering; TGIF-Frame | Code Available | 1 |
| Reading Between the Lanes: Text VideoQA on the Road | Jul 8, 2023 | Question Answering; Scene Text Recognition | Code Available | 0 |
| Read, Look or Listen? What's Needed for Solving a Multimodal Dataset | Jul 6, 2023 | Question Answering; Speaker Identification | Unverified | 0 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition; Question Answering | Code Available | 0 |
| FunQA: Towards Surprising Video Comprehension | Jun 26, 2023 | Question Answering; Text Generation | Code Available | 1 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment; Domain Generalization | Unverified | 0 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Form; model | Code Available | 1 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Jun 8, 2023 | Question Answering; VCGBench-Diverse | Code Available | 3 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering; Representation Learning | Unverified | 0 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language Modeling; Language Modelling | Code Available | 4 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning; Audio-Visual Captioning | Code Available | 2 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering; document understanding | Code Available | 1 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | May 23, 2023 | Diagnostic; Grounded Video Question Answering | Code Available | 2 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering; Retrieval | Unverified | 0 |
| Paxion: Patching Action Knowledge in Video-Language Foundation Models | May 18, 2023 | Action Understanding; Diagnostic | Code Available | 1 |
| TG-VQA: Ternary Game of Video Question Answering | May 17, 2023 | Contrastive Learning; Question Answering | Unverified | 0 |
| Is a Video worth n×n Images? A Highly Efficient Approach to Transformer-based Video Question Answering | May 16, 2023 | Question Answering; Video Question Answering | Unverified | 0 |
| Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering | May 14, 2023 | Question Answering; Semantic Role Labeling | Unverified | 0 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language Modeling; Language Modelling | Code Available | 1 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 |
| VideoChat: Chat-Centric Video Understanding | May 10, 2023 | Question Answering; Video-based Generative Performance Benchmarking | Code Available | 4 |