| Contrastive Language Video Time Pre-training | Jun 4, 2024 | Action Recognition, Contrastive Learning | Unverified | 0 |
| 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | Jun 1, 2024 | Autonomous Driving, Panoptic Segmentation | Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| Temporal Grounding of Activities using Multimodal Large Language Models | May 30, 2024 | Video Understanding | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | May 28, 2024 | Action Recognition, Video Recognition | Unverified | 0 |
| Streaming Long Video Understanding with Large Language Models | May 25, 2024 | Question Answering, Video Understanding | Unverified | 0 |
| MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | May 23, 2024 | Action Recognition, Action Segmentation | Unverified | 0 |
| Anticipating Object State Changes in Long Procedural Videos | May 21, 2024 | Object, Object State Change Classification | Unverified | 0 |
| Open-Vocabulary Spatio-Temporal Action Detection | May 17, 2024 | Action Detection, Fine-Grained Action Detection | Unverified | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | May 14, 2024 | 4k, GPU | Unverified | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | Unverified | 0 |
| Global Motion Understanding in Large-Scale Video Object Segmentation | May 11, 2024 | Instance Segmentation, Optical Flow Estimation | Unverified | 0 |
| RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | May 11, 2024 | Image-text matching, Retrieval | Unverified | 0 |
| A Survey on Backbones for Deep Video Action Recognition | May 9, 2024 | Action Recognition, Diversity | Unverified | 0 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | May 7, 2024 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation | May 6, 2024 | Action Segmentation, Skeleton Based Action Segmentation | Code Available | 0 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | Unverified | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | May 6, 2024 | Autonomous Vehicles, Video Understanding | Unverified | 0 |
| Learning text-to-video retrieval from image captioning | Apr 26, 2024 | Image Captioning, Image Retrieval | Unverified | 0 |
| Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting | Apr 26, 2024 | Facial Expression Recognition, Multi-Task Learning | Unverified | 0 |
| IPAD: Industrial Process Anomaly Detection Dataset | Apr 23, 2024 | Anomaly Detection, Video Anomaly Detection | Unverified | 0 |
| From Image to Video, what do we need in multimodal LLMs? | Apr 18, 2024 | Video Understanding | Unverified | 0 |
| In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Apr 14, 2024 | Action Recognition, Hand Pose Estimation | Code Available | 0 |
| A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos | Apr 10, 2024 | Activity Recognition, Gaze Prediction | Unverified | 0 |