| What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Mar 20, 2025 | Decoder, Graph Generation | Unverified | 0 |
| What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets | Jun 1, 2018 | Video Understanding | Unverified | 0 |
| When Work Matters: Transforming Classical Network Structures to Graph CNN | Jul 7, 2018 | Graph Classification, Video Understanding | Unverified | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| Wolf: Captioning Everything with a World Summarization Framework | Jul 26, 2024 | Autonomous Driving, Mixture-of-Experts | Unverified | 0 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | Unverified | 0 |
| WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs | Feb 6, 2025 | Video Understanding | Unverified | 0 |
| X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding | Jan 12, 2025 | Video Understanding | Unverified | 0 |
| YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset | Jan 1, 2022 | Management, Segmentation | Unverified | 0 |
| YouTube-8M Video Understanding Challenge Approach and Applications | Jun 26, 2017 | Ensemble Learning, Video Understanding | Unverified | 0 |
| ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | Nov 1, 2023 | Action Detection, Classification | Unverified | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | Unverified | 0 |
| Zero-Shot Action Recognition in Surveillance Videos | Oct 28, 2024 | Action Recognition, Video Understanding | Unverified | 0 |
| Zero-Shot Action Recognition in Videos: A Survey | Sep 13, 2019 | Action Recognition, Action Recognition In Still Images | Unverified | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | Unverified | 0 |
| Zero-shot Shark Tracking and Biometrics from Aerial Imagery | Jan 10, 2025 | Video Understanding | Unverified | 0 |
| Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network | Jun 2, 2019 | General Classification, Graph Neural Network | Unverified | 0 |
| 4D Generic Video Object Proposals | Jan 26, 2019 | Instance Segmentation, Object | Code Available | 0 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Aug 26, 2024 | Large Language Model, Video Quality Assessment | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| A Context-Aware Loss Function for Action Spotting in Soccer Videos | Dec 3, 2019 | Action Spotting, Video Understanding | Code Available | 0 |
| Learnable pooling with Context Gating for video classification | Jun 21, 2017 | Classification, Clustering | Code Available | 0 |
| Learnable Pooling Methods for Video Classification | Oct 1, 2018 | Classification, General Classification | Code Available | 0 |
| Leaping Into Memories: Space-Time Deep Feature Synthesis | Mar 17, 2023 | Diversity, Video Understanding | Code Available | 0 |
| Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing | Mar 13, 2025 | EgoSchema, Form | Code Available | 0 |
| Judging a video by its bitstream cover | Sep 14, 2023 | Video Understanding | Code Available | 0 |
| CARPe Posterum: A Convolutional Approach for Real-time Pedestrian Path Prediction | May 26, 2020 | Autonomous Vehicles, Prediction | Code Available | 0 |
| VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Jul 9, 2024 | Video Understanding | Code Available | 0 |
| Joint Event Detection and Description in Continuous Video Streams | Feb 28, 2018 | Dense Captioning, Dense Video Captioning | Code Available | 0 |
| Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning | Apr 15, 2018 | Video Captioning, Video Understanding | Code Available | 0 |
| Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Jan 25, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 0 |
| B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Dec 13, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Apr 14, 2024 | Action Recognition, Hand Pose Estimation | Code Available | 0 |
| ViP: Video Platform for PyTorch | Oct 7, 2019 | Benchmarking, Video Understanding | Code Available | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 |
| Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Jun 6, 2025 | Video Understanding | Code Available | 0 |
| ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Jun 28, 2023 | Retrieval, Video Retrieval | Code Available | 0 |
| https://arxiv.org/abs/2407.00634 | Jul 2, 2024 | Video Captioning, Video Description | Code Available | 0 |
| How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios | Oct 18, 2022 | Video Understanding | Code Available | 0 |
| HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Jun 11, 2025 | Action Recognition, Action Segmentation | Code Available | 0 |
| Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Mar 25, 2025 | Representation Learning, Video Understanding | Code Available | 0 |
| HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding | Jan 3, 2025 | Question Answering, Video Understanding | Code Available | 0 |
| The Visual Centrifuge: Model-Free Layered Video Representations | Dec 4, 2018 | Color Constancy, model | Code Available | 0 |
| The YouTube-8M Kaggle Competition: Challenges and Methods | Jun 28, 2017 | General Classification, Video Classification | Code Available | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Jun 15, 2024 | Question Answering, Video Understanding | Code Available | 0 |
| The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge | Jun 16, 2017 | General Classification, Video Classification | Code Available | 0 |
| Hierarchical Deep Recurrent Architecture for Video Understanding | Jul 11, 2017 | Classification, General Classification | Code Available | 0 |
| Temporal Tessellation: A Unified Approach for Video Analysis | Dec 21, 2016 | Action Detection, Video Captioning | Code Available | 0 |
| Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | May 19, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding | Jul 14, 2017 | Video Recognition, Video Understanding | Code Available | 0 |