| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Temporal Perceiving Video-Language Pre-training | Jan 18, 2023 | Action Localization, Contrastive Learning | Unverified | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks | Jan 5, 2023 | Question Answering, Retrieval | Unverified | 0 |
| Exploring Temporal Concurrency for Video-Language Representation Learning | Jan 1, 2023 | Dynamic Time Warping, Metric Learning | Code Available | 0 |
| IntentQA: Context-aware Video Intent Reasoning | Jan 1, 2023 | Contrastive Learning, Video Question Answering | Code Available | 1 |
| Knowledge Proxy Intervention for Deconfounded Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | Unverified | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | Unverified | 0 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Dec 9, 2022 | Question Answering, Retrieval | Code Available | 1 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | Unverified | 0 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification, Action Recognition | Code Available | 4 |
| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Nov 22, 2022 | All, Cross-Modal Retrieval | Code Available | 2 |
| Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning, Representation Learning | Code Available | 1 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | Unverified | 0 |
| Visual Commonsense-aware Representation Network for Video Captioning | Nov 17, 2022 | Caption Generation, Question Answering | Code Available | 1 |
| Watching the News: Towards VideoQA Models that can Read | Nov 10, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering | Nov 7, 2022 | Add - PO, Add - PQ | Code Available | 0 |
| LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling | Oct 21, 2022 | Language Modeling, Language Modelling | Unverified | 0 |
| Composing Ensembles of Pre-trained Models via Iterative Consensus | Oct 20, 2022 | Arithmetic Reasoning, Image Generation | Unverified | 0 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | Diagnostic, Multiple-choice | Code Available | 2 |
| Dense but Efficient VideoQA for Intricate Compositional Reasoning | Oct 19, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy | Oct 15, 2022 | Feature Compression, Question Answering | Code Available | 2 |
| Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Oct 12, 2022 | Contrastive Learning, Form | Code Available | 2 |
| Contrastive Video-Language Learning with Fine-grained Frame Sampling | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Locate before Answering: Answer Guided Question Localization for Video Question Answering | Oct 5, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Extending Compositional Attention Networks for Social Reasoning in Videos | Oct 3, 2022 | Question Answering, Video Question Answering | Code Available | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | Benchmarking, Image Captioning | Unverified | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering | Sep 8, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | Aug 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Video Graph Transformer for Video Question Answering | Jul 12, 2022 | Question Answering, Relation | Code Available | 1 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | Object, Relational Reasoning | Code Available | 1 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| 0/1 Deep Neural Networks via Block Coordinate Descent | Jun 19, 2022 | 10-shot image generation | Unverified | 0 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill Mask, Language Modeling | Code Available | 1 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 |
| Revealing Single Frame Bias for Video-and-Language Learning | Jun 7, 2022 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| Invariant Grounding for Video Question Answering | Jun 6, 2022 | Question Answering, Video Question Answering | Code Available | 1 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Jun 5, 2022 | Retrieval, Sentence | Code Available | 0 |
| Revisiting the "Video" in Video-Language Understanding | Jun 3, 2022 | Benchmarking, Question Answering | Code Available | 1 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Modality Alignment between Deep Representations for Effective Video-and-Language Learning | Jun 1, 2022 | Question Answering, Video Captioning | Unverified | 0 |