| BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind | Feb 12, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| YTCommentQA: Video Question Answerability in Instructional Videos | Jan 30, 2024 | Question Answering, Video Question Answering | Code Available | 0 |
| STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering | Jan 8, 2024 | Question Answering, Video Question Answering | Code Available | 0 |
| Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering | Jan 3, 2024 | Question Answering, Scheduling | Unverified | 0 |
| Language-aware Visual Semantic Distillation for Video Question Answering | Jan 1, 2024 | Answer Generation, Question Answering | Unverified | 0 |
| VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Jan 1, 2024 | Hallucination, Position | Unverified | 0 |
| On Scaling Up a Multilingual Vision and Language Model | Jan 1, 2024 | Document Understanding, In-Context Learning | Unverified | 0 |
| Cross-Modal Reasoning with Event Correlation for Video Question Answering | Dec 20, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | Unverified | 0 |
| Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens | Dec 12, 2023 | Hallucination, Position | Unverified | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | Unverified | 0 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer | Nov 28, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Characterizing Video Question Answering with Sparsified Inputs | Nov 27, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | Nov 9, 2023 | Action Classification, Audio Classification | Unverified | 0 |
| Long Story Short: a Summarize-then-Search Method for Long Video Question Answering | Nov 2, 2023 | Diversity, Question Answering | Code Available | 0 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | Counterfactual, Counterfactual Reasoning | Code Available | 0 |
| Modular Blended Attention Network for Video Question Answering | Nov 2, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | Unverified | 0 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | Unverified | 0 |