| Apollo: An Exploration of Video Understanding in Large Multimodal Models | Dec 13, 2024 | MME, Video MME | —Unverified | 0 |
| APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | Jun 5, 2025 | Information Retrieval, Retrieval | —Unverified | 0 |
| Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s | Dec 17, 2023 | Semantic Segmentation, Video Semantic Segmentation | —Unverified | 0 |
| A SPIKING SEQUENTIAL MODEL: RECURRENT LEAKY INTEGRATE-AND-FIRE | Sep 25, 2019 | Text Summarization, Video Understanding | —Unverified | 0 |
| A Structured Model For Action Detection | Dec 9, 2018 | Action Detection, model | —Unverified | 0 |
| A Study On the Effects of Pre-processing On Spatio-temporal Action Recognition Using Spiking Neural Networks Trained with STDP | May 31, 2021 | Action Recognition, Spatio-temporal Action Recognition | —Unverified | 0 |
| A Survey on Backbones for Deep Video Action Recognition | May 9, 2024 | Action Recognition, Diversity | —Unverified | 0 |
| A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming | Jan 30, 2024 | Video Generation, Video Understanding | —Unverified | 0 |
| A Survey on Mamba Architecture for Vision Applications | Feb 11, 2025 | Mamba, object-detection | —Unverified | 0 |
| A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems | Feb 10, 2025 | Autonomous Driving, Edge-computing | —Unverified | 0 |
| A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos | Apr 10, 2024 | Activity Recognition, Gaze Prediction | —Unverified | 0 |
| Attend and Interact: Higher-Order Object Interactions for Video Understanding | Nov 16, 2017 | Action Classification, Action Recognition | —Unverified | 0 |
| Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification | Aug 26, 2024 | Video Classification, Video Understanding | —Unverified | 0 |
| Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion | Jan 1, 2021 | Time Series, Time Series Analysis | —Unverified | 0 |
| Audio-Visual Glance Network for Efficient Video Recognition | Aug 18, 2023 | Video Recognition, Video Understanding | —Unverified | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | —Unverified | 0 |
| Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations | Feb 21, 2022 | Answer Generation, Video Understanding | —Unverified | 0 |
| Audio-visual training for improved grounding in video-text LLMs | Jul 21, 2024 | Video Understanding | —Unverified | 0 |
| Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation | Mar 30, 2021 | Action Detection, Temporal Action Proposal Generation | —Unverified | 0 |
| A Unified Framework for Human-centric Point Cloud Video Understanding | Mar 29, 2024 | 3D Pose Estimation, Action Recognition | —Unverified | 0 |
| A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset | Nov 19, 2022 | Common Sense Reasoning, Graph Embedding | —Unverified | 0 |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | Oct 4, 2024 | Image Captioning, Video Understanding | —Unverified | 0 |
| Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training | Jul 5, 2020 | Decoder, Question Answering | —Unverified | 0 |
| Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | Dec 9, 2021 | Neural Architecture Search, Video Recognition | —Unverified | 0 |
| AVD2: Accident Video Diffusion for Accident Video Description | Feb 20, 2025 | Autonomous Driving, Scene Understanding | —Unverified | 0 |
| Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | Mar 24, 2024 | Dense Video Captioning, Temporal Localization | —Unverified | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | Sep 22, 2022 | Action Recognition, Audio Classification | —Unverified | 0 |
| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | Aug 1, 2022 | Object, Optical Flow Estimation | —Unverified | 0 |
| BEARCUBS: A benchmark for computer-using web agents | Mar 10, 2025 | Video Understanding | —Unverified | 0 |
| BERT for Large-scale Video Segment Classification with Test-time Augmentation | Dec 2, 2019 | General Classification, Video Understanding | —Unverified | 0 |
| Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation | Jul 8, 2025 | Depth Estimation, Depth Prediction | —Unverified | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | Dec 6, 2024 | GPU, Multi-Object Tracking | —Unverified | 0 |
| Beyond still images: Temporal features and input variance resilience | Nov 1, 2023 | Video Understanding | —Unverified | 0 |
| Beyond the Camera: Neural Networks in World Coordinates | Mar 12, 2020 | Action Recognition, Video Stabilization | —Unverified | 0 |
| Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding | Nov 21, 2024 | Computational Efficiency, Video Understanding | —Unverified | 0 |
| BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | Apr 4, 2024 | Object, Video Understanding | —Unverified | 0 |
| Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | May 20, 2025 | Video Understanding | —Unverified | 0 |
| Breaking the Encoder Barrier for Seamless Video-Language Understanding | Mar 24, 2025 | Decoder, Language Modeling | —Unverified | 0 |
| Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models | Jun 6, 2025 | Segmentation, Video Understanding | —Unverified | 0 |
| Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens | Jun 13, 2022 | Action Recognition, Video Understanding | —Unverified | 0 |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | Jan 8, 2025 | EgoSchema, Object Tracking | —Unverified | 0 |
| Building Scalable Video Understanding Benchmarks through Sports | Jan 17, 2023 | Video Understanding | —Unverified | 0 |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | Jun 16, 2021 | Contrastive Learning, counterfactual | —Unverified | 0 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | Mar 30, 2025 | Action Classification, Action Recognition | —Unverified | 0 |
| CAG-QIL: Context-Aware Actionness Grouping via Q Imitation Learning for Online Temporal Action Localization | Jan 1, 2021 | Action Localization, Imitation Learning | —Unverified | 0 |
| Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting | Apr 19, 2021 | Action Spotting, Camera Calibration | —Unverified | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | —Unverified | 0 |
| FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | Oct 20, 2024 | Diagnostic, Video Captioning | —Unverified | 0 |
| Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? | Nov 13, 2024 | Action Localization, Temporal Action Localization | —Unverified | 0 |