| Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Dec 6, 2023 | Action Anticipation, Form | Code Available | 1 | 5 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Dec 2, 2024 | Question Answering, Video Understanding | Code Available | 1 | 5 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 | 5 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 | 5 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 | 5 |
| PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling | May 29, 2025 | Video Understanding | Code Available | 1 | 5 |
| SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Dec 15, 2023 | Video Understanding | Code Available | 1 | 5 |
| Crossover Learning for Fast Online Video Instance Segmentation | Apr 13, 2021 | Instance Segmentation, Semantic Segmentation | Code Available | 1 | 5 |
| DEVIAS: Learning Disentangled Video Representations of Action and Scene | Nov 30, 2023 | Action Recognition, Decoder | Code Available | 1 | 5 |
| Panoptic Video Scene Graph Generation | Nov 28, 2023 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 1 | 5 |
| Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Jun 3, 2024 | Mistake Detection, Online Mistake Detection | Code Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 | 5 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 | 5 |
| AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Jan 14, 2025 | Mamba, Video Understanding | Code Available | 1 | 5 |
| Disentangle Your Dense Object Detector | Jul 7, 2021 | Disentanglement, Object | Code Available | 1 | 5 |
| Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Jan 1, 2021 | Action Recognition, Video Understanding | Code Available | 1 | 5 |
| DisTime: Distribution-based Time Representation for Video Large Language Models | May 30, 2025 | Temporal Localization, Video Understanding | Code Available | 1 | 5 |
| BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | May 5, 2022 | Action Detection, object-detection | Code Available | 1 | 5 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 | 5 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 | 5 |
| Open-Vocabulary Video Relation Extraction | Dec 25, 2023 | Action Classification, Action Understanding | Code Available | 1 | 5 |
| Panoramic Vision Transformer for Saliency Detection in 360° Videos | Sep 19, 2022 | Saliency Detection, Saliency Prediction | Code Available | 1 | 5 |
| Do Language Models Understand Time? | Dec 18, 2024 | Action Recognition, Anomaly Detection | Code Available | 1 | 5 |
| Large Scale Holistic Video Understanding | Apr 25, 2019 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment | Feb 28, 2022 | 3D Action Recognition, Action Analysis | Code Available | 1 | 5 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 | 5 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 | 5 |
| Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Nov 21, 2022 | Decoder, Retrieval | Code Available | 1 | 5 |
| PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | Aug 8, 2020 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Jun 18, 2022 | Decoder, Semantic Segmentation | Code Available | 1 | 5 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 | 5 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 | 5 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Apr 21, 2022 | Action Detection, Video Understanding | Code Available | 1 | 5 |
| Dual-path Adaptation from Image to Video Transformers | Mar 17, 2023 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 | 5 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 | 5 |
| Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Jan 1, 2025 | Action Recognition, Action Segmentation | Code Available | 1 | 5 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 | 5 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | Jun 27, 2022 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understanding, Self-Supervised Learning | Code Available | 1 | 5 |
| From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Jan 10, 2025 | Human-Object Interaction Detection, Knowledge Distillation | Code Available | 1 | 5 |
| Object-Region Video Transformers | Oct 13, 2021 | Action Detection, Action Recognition | Code Available | 1 | 5 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | Code Available | 1 | 5 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 | 5 |