| Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention | Apr 10, 2024 | Action Anticipation, Graph Neural Network | Unverified | 0 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | Unverified | 0 |
| BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | Apr 4, 2024 | Object, Video Understanding | Unverified | 0 |
| OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning | Apr 4, 2024 | Descriptive, Diversity | Unverified | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Apr 2, 2024 | Highlight Detection, Moment Retrieval | Unverified | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Mar 31, 2024 | Highlight Detection, Moment Retrieval | Unverified | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | Mar 30, 2024 | Video Understanding | Unverified | 0 |
| A Unified Framework for Human-centric Point Cloud Video Understanding | Mar 29, 2024 | 3D Pose Estimation, Action Recognition | Unverified | 0 |
| Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality | Mar 28, 2024 | Data Augmentation, Diversity | Code Available | 0 |
| Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | Mar 24, 2024 | Dense Video Captioning, Temporal Localization | Unverified | 0 |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Mar 21, 2024 | Pose Estimation, Video Understanding | Code Available | 0 |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | Mar 18, 2024 | EgoSchema, Video Understanding | Unverified | 0 |
| Don't Judge by the Look: Towards Motion Coherent Video Representation | Mar 14, 2024 | Data Augmentation, Object Recognition | Code Available | 0 |
| Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions | Mar 11, 2024 | counterfactual, Video Editing | Unverified | 0 |
| A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives | Mar 5, 2024 | Video Understanding | Unverified | 0 |
| MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies | Mar 3, 2024 | Text Generation, Video Understanding | Unverified | 0 |
| Abductive Ego-View Accident Video Understanding for Safe Driving Perception | Mar 1, 2024 | Object, object-detection | Unverified | 0 |
| TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning | Feb 29, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs | Feb 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Slot-VLM: SlowFast Slots for Video-Language Modeling | Feb 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Feb 20, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| Dynamics Based Neural Encoding with Inter-Intra Region Connectivity | Feb 19, 2024 | Video Understanding | Unverified | 0 |
| Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos | Feb 16, 2024 | Decision Making, Video Understanding | Code Available | 0 |
| Memory Consolidation Enables Long-Context Video Understanding | Feb 8, 2024 | EgoSchema, Video Understanding | Unverified | 0 |
| A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming | Jan 30, 2024 | Video Generation, Video Understanding | Unverified | 0 |
| Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model | Jan 29, 2024 | Action Detection, Action Localization | Unverified | 0 |
| Exploring Missing Modality in Multimodal Egocentric Datasets | Jan 21, 2024 | Action Recognition, Video Understanding | Unverified | 0 |
| Learning to Visually Connect Actions and their Effects | Jan 19, 2024 | Object Tracking, Task Planning | Unverified | 0 |
| CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding | Jan 17, 2024 | Contrastive Learning, point cloud video understanding | Unverified | 0 |
| Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization | Jan 16, 2024 | Decoder, Denoising | Unverified | 0 |
| Dr^2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning | Jan 8, 2024 | object-detection, Object Detection | Code Available | 0 |
| VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Jan 1, 2024 | Spatio-Temporal Video Grounding, Video Grounding | Unverified | 0 |
| Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning | Jan 1, 2024 | object-detection, Object Detection | Unverified | 0 |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | Jan 1, 2024 | Image Generation, Instruction Following | Unverified | 0 |
| Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning | Jan 1, 2024 | Transfer Learning, Video Understanding | Unverified | 0 |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Dec 31, 2023 | Spatio-Temporal Video Grounding, Video Grounding | Unverified | 0 |
| No More Shortcuts: Realizing the Potential of Temporal Self-Supervision | Dec 20, 2023 | Action Classification, Attribute | Unverified | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | Unverified | 0 |
| Learning Object State Changes in Videos: An Open-World Perspective | Dec 19, 2023 | Video Understanding | Unverified | 0 |
| Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s | Dec 17, 2023 | Semantic Segmentation, Video Semantic Segmentation | Unverified | 0 |
| X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer | Dec 12, 2023 | Action Recognition, Action Segmentation | Code Available | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | Unverified | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | Unverified | 0 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding | Dec 5, 2023 | Diversity, Graph Generation | Unverified | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding | Nov 30, 2023 | Form, Video Retrieval | Unverified | 0 |
| Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation | Nov 30, 2023 | Contrastive Learning, Domain Adaptation | Unverified | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | Unverified | 0 |