| Temporal Grounding of Activities using Multimodal Large Language Models | May 30, 2024 | Video Understanding | Unverified | 0 |
| DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | May 30, 2024 | DeepFake Detection, Mamba | Code Available | 2 |
| EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | May 30, 2024 | Action Recognition, Surgical phase recognition | Code Available | 1 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | May 29, 2024 | EgoSchema, MME | Code Available | 2 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | May 28, 2024 | Action Recognition, Video Recognition | Unverified | 0 |
| Hawk: Learning to Understand Open-World Video Anomalies | May 27, 2024 | Anomaly Detection, Question Answering | Code Available | 3 |
| Streaming Long Video Understanding with Large Language Models | May 25, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | May 23, 2024 | Action Recognition, Action Segmentation | Unverified | 0 |
| Dense Connector for MLLMs | May 22, 2024 | Video Understanding | Code Available | 2 |
| TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment | May 22, 2024 | EgoSchema, Video Understanding | Code Available | 1 |
| Anticipating Object State Changes in Long Procedural Videos | May 21, 2024 | Object, Object State Change Classification | Unverified | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | May 14, 2024 | 4k, GPU | Unverified | 0 |
| No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | May 14, 2024 | Action Detection, GPU | Code Available | 1 |
| Global Motion Understanding in Large-Scale Video Object Segmentation | May 11, 2024 | Instance Segmentation, Optical Flow Estimation | Unverified | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 |
| A Survey on Backbones for Deep Video Action Recognition | May 9, 2024 | Action Recognition, Diversity | Unverified | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| Vision Mamba: A Comprehensive Survey and Taxonomy | May 7, 2024 | Mamba, Medical Image Analysis | Code Available | 2 |
| Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation | May 6, 2024 | Action Segmentation, Skeleton Based Action Segmentation | Code Available | 0 |
| Foundation Models for Video Understanding: A Survey | May 6, 2024 | Survey, Video Understanding | Code Available | 2 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | Unverified | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | May 6, 2024 | Autonomous Vehicles, Video Understanding | Unverified | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | Unverified | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | Unverified | 0 |
| MovieChat+: Question-aware Sparse Memory for Long Video Question Answering | Apr 26, 2024 | 2k, Question Answering | Code Available | 4 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning, MVBench | Code Available | 4 |
| SFMViT: SlowFast Meet ViT in Chaotic World | Apr 25, 2024 | Action Localization, Video Understanding | Code Available | 1 |
| IPAD: Industrial Process Anomaly Detection Dataset | Apr 23, 2024 | Anomaly Detection, Video Anomaly Detection | Unverified | 0 |
| From Image to Video, what do we need in multimodal LLMs? | Apr 18, 2024 | Video Understanding | Unverified | 0 |
| Leveraging Temporal Contextualization for Video Action Recognition | Apr 15, 2024 | Action Recognition, Temporal Action Localization | Code Available | 2 |
| In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Apr 14, 2024 | Action Recognition, Hand Pose Estimation | Code Available | 0 |
| Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection | Apr 14, 2024 | Highlight Detection, Moment Retrieval | Code Available | 1 |
| Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video Captioning, Transfer Learning | Code Available | 1 |
| Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention | Apr 10, 2024 | Action Anticipation, Graph Neural Network | Unverified | 0 |
| A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos | Apr 10, 2024 | Activity Recognition, Gaze Prediction | Unverified | 0 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPU, Multiple-choice | Code Available | 3 |
| SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos | Apr 6, 2024 | Graph Generation, Relation | Code Available | 1 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | Unverified | 0 |
| BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | Apr 4, 2024 | Object, Video Understanding | Unverified | 0 |
| OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning | Apr 4, 2024 | Descriptive, Diversity | Unverified | 0 |
| LongVLM: Efficient Long Video Understanding via Large Language Models | Apr 4, 2024 | Question Answering, Video Question Answering | Code Available | 2 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language Modeling, Language Modelling | Code Available | 4 |
| SnAG: Scalable and Accurate Video Grounding | Apr 2, 2024 | Video Grounding, Video Understanding | Code Available | 4 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Apr 2, 2024 | Highlight Detection, Moment Retrieval | Unverified | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Mar 31, 2024 | Highlight Detection, Moment Retrieval | Unverified | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | Mar 30, 2024 | Video Understanding | Unverified | 0 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench, Reading Comprehension | Code Available | 2 |