| Extending Video Masked Autoencoders to 128 frames | Nov 20, 2024 | Decoder, Video Understanding | —Unverified | 0 |
| Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding | Aug 11, 2017 | Action Detection, Action Recognition | —Unverified | 0 |
| Real-Time Segmentation Networks should be Latency Aware | Apr 6, 2020 | Autonomous Vehicles, Scene Segmentation | —Unverified | 0 |
| Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning | May 16, 2018 | Action Recognition, Atari Games | —Unverified | 0 |
| FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models | Mar 12, 2025 | Mixture-of-Experts, Question Answering | —Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | Jun 12, 2024 | Video Understanding | —Unverified | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Fine-Grain Annotation of Cricket Videos | Nov 24, 2015 | Action Recognition, Retrieval | —Unverified | 0 |
| Fine-Grained Video Captioning through Scene Graph Consolidation | Feb 23, 2025 | Caption Generation, Image Captioning | —Unverified | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | —Unverified | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Flatten: Video Action Recognition is an Image Classification task | Aug 17, 2024 | Action Recognition, image-classification | —Unverified | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | Jun 1, 2025 | Video Understanding | —Unverified | 0 |
| FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering | Dec 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions | Sep 7, 2022 | Image Generation, Text to Image Generation | —Unverified | 0 |
| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | May 22, 2025 | EgoSchema, Few-Shot Learning | —Unverified | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | —Unverified | 0 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | Apr 8, 2025 | In-Context Learning, Instruction Following | —Unverified | 0 |
| From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | Apr 8, 2025 | Game State Reconstruction, Jersey Number Recognition | —Unverified | 0 |
| From Image to Video, what do we need in multimodal LLMs? | Apr 18, 2024 | Video Understanding | —Unverified | 0 |
| From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | May 18, 2025 | Video Editing, Video Understanding | —Unverified | 0 |
| From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Mar 26, 2025 | Video Understanding | —Unverified | 0 |
| Fully Automated Hand Hygiene Monitoring in Operating Room using 3D Convolutional Neural Network | Mar 20, 2020 | Optical Flow Estimation, Transfer Learning | —Unverified | 0 |
| Future semantic segmentation of time-lapsed videos with large temporal displacement | Dec 27, 2018 | Segmentation, Semantic Segmentation | —Unverified | 0 |
| Gameplay Highlights Generation | May 12, 2025 | Event Detection, Highlight Detection | —Unverified | 0 |
| Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention | Apr 10, 2024 | Action Anticipation, Graph Neural Network | —Unverified | 0 |
| Generating the Future With Adversarial Transformers | Jul 1, 2017 | Video Understanding | —Unverified | 0 |
| Generating Videos with Scene Dynamics | Sep 8, 2016 | Action Classification, Future prediction | —Unverified | 0 |
| Generative Frame Sampler for Long Video Understanding | Mar 12, 2025 | Video Understanding | —Unverified | 0 |
| Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning | Jun 1, 2018 | Action Recognition, Representation Learning | —Unverified | 0 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | Dec 10, 2024 | cross-modal alignment, Video Understanding | —Unverified | 0 |
| Global Motion Understanding in Large-Scale Video Object Segmentation | May 11, 2024 | Instance Segmentation, Optical Flow Estimation | —Unverified | 0 |
| Global Self-Attention Networks | Jan 1, 2021 | Video Understanding | —Unverified | 0 |
| Global Self-Attention Networks for Image Recognition | Oct 6, 2020 | Video Understanding | —Unverified | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | Jun 14, 2024 | Activity Recognition, MMR total | —Unverified | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | —Unverified | 0 |
| Gradient Frequency Modulation for Visually Explaining Video Understanding Models | Nov 1, 2021 | Action Recognition, Temporal Action Localization | —Unverified | 0 |
| GraphVid: It Only Takes a Few Nodes to Understand a Video | Jul 4, 2022 | Superpixels, Video Understanding | —Unverified | 0 |
| Grounded Objects and Interactions for Video Captioning | Nov 16, 2017 | Object, Scene Understanding | —Unverified | 0 |
| Grounded Video Situation Recognition | Oct 19, 2022 | Descriptive, Structured Prediction | —Unverified | 0 |
| Grounding Action Descriptions in Videos | Jan 1, 2013 | Semantic Textual Similarity, Video Understanding | —Unverified | 0 |
| Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | Apr 20, 2025 | Action Detection, Decoder | —Unverified | 0 |
| GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | Jun 19, 2025 | Multimodal Reasoning, reinforcement-learning | —Unverified | 0 |
| GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | Jun 19, 2024 | Video Understanding | —Unverified | 0 |
| H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Mar 31, 2025 | Video Understanding | —Unverified | 0 |
| HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models | Feb 28, 2025 | Action Understanding, Text-to-Video Generation | —Unverified | 0 |
| Harnessing Object and Scene Semantics for Large-Scale Video Understanding | Jun 1, 2016 | Action Recognition, Clustering | —Unverified | 0 |
| HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | Sep 16, 2024 | Dimensionality Reduction, Video Understanding | —Unverified | 0 |