| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive Learning, Question Answering | Code Available | 1 |
| Referring Atomic Video Action Recognition | Jul 2, 2024 | Action Localization, Action Recognition | Code Available | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question Answering, Spatial Reasoning | Code Available | 1 |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | Jun 13, 2024 | All, EgoSchema | Code Available | 1 |
| Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Jun 2, 2024 | counterfactual, Counterfactual Reasoning | Code Available | 1 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery, Causal Discovery in Video Reasoning | Code Available | 1 |
| TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | Apr 1, 2024 | Question Answering, Video Question Answering | Code Available | 1 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding, Video Question Answering | Code Available | 1 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Mar 13, 2024 | Continual Learning, image-classification | Code Available | 1 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 |
| Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering | Jan 19, 2024 | Question Answering, Video Question Answering | Code Available | 1 |
| Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports | Jan 3, 2024 | Action Understanding, counterfactual | Code Available | 1 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Jan 3, 2024 | Action Detection, Human-Object Interaction Detection | Code Available | 1 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 |
| ViLA: Efficient Video-Language Alignment for Video Question Answering | Dec 13, 2023 | cross-modal alignment, Language Modeling | Code Available | 1 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video Captioning, Video Question Answering | Code Available | 1 |
| AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Nov 25, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Oct 29, 2023 | Form, Language Modelling | Code Available | 1 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | Oct 9, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Sep 27, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts | Sep 27, 2023 | Few-shot Video Question Answering, Prompt Learning | Code Available | 1 |
| Can I Trust Your Answer? Visually Grounded Video Question Answering | Sep 4, 2023 | Grounded Video Question Answering, Question Answering | Code Available | 1 |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Aug 18, 2023 | Image Captioning, Text Generation | Code Available | 1 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer | Aug 16, 2023 | Decoder, Question Answering | Code Available | 1 |
| Discovering Spatio-Temporal Rationales for Video Question Answering | Jul 22, 2023 | Question Answering, Video Question Answering | Code Available | 1 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| FunQA: Towards Surprising Video Comprehension | Jun 26, 2023 | Question Answering, Text Generation | Code Available | 1 |
| COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Jun 15, 2023 | Form, model | Code Available | 1 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering, document understanding | Code Available | 1 |
| Paxion: Patching Action Knowledge in Video-Language Foundation Models | May 18, 2023 | Action Understanding, Diagnostic | Code Available | 1 |
| Self-Chained Image-Language Model for Video Localization and Question Answering | May 11, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 |
| SViTT: Temporal Learning of Sparse Video-Text Transformers | Apr 18, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | Mar 25, 2023 | Contrastive Learning, Question Answering | Code Available | 1 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 |
| Contrastive Video Question Answering via Video Graph Transformer | Feb 27, 2023 | Contrastive Learning, Question Answering | Code Available | 1 |
| Connecting Vision and Language with Video Localized Narratives | Feb 22, 2023 | Question Answering, Video Narrative Grounding | Code Available | 1 |
| IntentQA: Context-aware Video Intent Reasoning | Jan 1, 2023 | Contrastive Learning, Video Question Answering | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |