| A Unified Framework for Human-centric Point Cloud Video Understanding | Mar 29, 2024 | 3D Pose Estimation, Action Recognition | Unverified | 0 |
| Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality | Mar 28, 2024 | Data Augmentation, Diversity | Code Available | 0 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition, Decoder | Code Available | 2 |
| Understanding Long Videos with Multimodal Language Models | Mar 25, 2024 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | Mar 24, 2024 | Dense Video Captioning, Temporal Localization | Unverified | 0 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification, Action Recognition | Code Available | 7 |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Mar 21, 2024 | Pose Estimation, Video Understanding | Code Available | 0 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Mar 18, 2024 | Referring Video Object Segmentation, Semantic Segmentation | Code Available | 1 |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | Mar 18, 2024 | EgoSchema, Video Understanding | Unverified | 0 |
| Towards Neuro-Symbolic Video Understanding | Mar 16, 2024 | Video Understanding | Code Available | 1 |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Mar 15, 2024 | EgoSchema, Form | Code Available | 2 |
| Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding | Mar 14, 2024 | Mamba, Moment Retrieval | Code Available | 3 |
| Don't Judge by the Look: Towards Motion Coherent Video Representation | Mar 14, 2024 | Data Augmentation, Object Recognition | Code Available | 0 |
| Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions | Mar 11, 2024 | counterfactual, Video Editing | Unverified | 0 |
| VideoMamba: State Space Model for Efficient Video Understanding | Mar 11, 2024 | Action Classification, Mamba | Code Available | 5 |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | Mar 11, 2024 | Computational Efficiency, Video Understanding | Code Available | 4 |
| Beyond MOT: Semantic Multi-Object Tracking | Mar 8, 2024 | Multi-Object Tracking, Object | Code Available | 2 |
| A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives | Mar 5, 2024 | Video Understanding | Unverified | 0 |
| MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies | Mar 3, 2024 | Text Generation, Video Understanding | Unverified | 0 |
| Abductive Ego-View Accident Video Understanding for Safe Driving Perception | Mar 1, 2024 | Object, object-detection | Unverified | 0 |
| TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning | Feb 29, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs | Feb 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchema, Video Captioning | Code Available | 3 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Feb 20, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Dynamics Based Neural Encoding with Inter-Intra Region Connectivity | Feb 19, 2024 | Video Understanding | Unverified | 0 |
| Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos | Feb 16, 2024 | Decision Making, Video Understanding | Code Available | 0 |
| World Model on Million-Length Video And Language With Blockwise RingAttention | Feb 13, 2024 | 4k, Video Understanding | Code Available | 9 |
| Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning | Feb 9, 2024 | Active Learning, Video Classification | Code Available | 2 |
| Memory Consolidation Enables Long-Context Video Understanding | Feb 8, 2024 | EgoSchema, Video Understanding | Unverified | 0 |
| Spatio-temporal Prompting Network for Robust Video Feature Extraction | Feb 4, 2024 | Instance Segmentation, object-detection | Code Available | 1 |
| BehAVE: Behaviour Alignment of Video Game Encodings | Feb 2, 2024 | Diversity, FPS Games | Code Available | 1 |
| A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming | Jan 30, 2024 | Video Generation, Video Understanding | Unverified | 0 |
| Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code Available | 2 |
| Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model | Jan 29, 2024 | Action Detection, Action Localization | Unverified | 0 |
| Exploring Missing Modality in Multimodal Egocentric Datasets | Jan 21, 2024 | Action Recognition, Video Understanding | Unverified | 0 |
| Learning to Visually Connect Actions and their Effects | Jan 19, 2024 | Object Tracking, Task Planning | Unverified | 0 |
| CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding | Jan 17, 2024 | Contrastive Learning, point cloud video understanding | Unverified | 0 |
| Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization | Jan 16, 2024 | Decoder, Denoising | Unverified | 0 |
| Dr^2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning | Jan 8, 2024 | object-detection, Object Detection | Code Available | 0 |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | Jan 1, 2024 | Image Generation, Instruction Following | Unverified | 0 |
| Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning | Jan 1, 2024 | Transfer Learning, Video Understanding | Unverified | 0 |
| Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning | Jan 1, 2024 | object-detection, Object Detection | Unverified | 0 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | Code Available | 1 |
| VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Jan 1, 2024 | Spatio-Temporal Video Grounding, Video Grounding | Unverified | 0 |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Dec 31, 2023 | Spatio-Temporal Video Grounding, Video Grounding | Unverified | 0 |
| Video Understanding with Large Language Models: A Survey | Dec 29, 2023 | Survey, Video Understanding | Code Available | 4 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 |
| Open-Vocabulary Video Relation Extraction | Dec 25, 2023 | Action Classification, Action Understanding | Code Available | 1 |