| Egocentric and Exocentric Methods: A Short Survey | Oct 27, 2024 | Action Recognition, Survey | —Unverified | 0 | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | —Unverified | 0 | 0 |
| Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition | Jun 27, 2018 | Action Recognition, Temporal Action Localization | —Unverified | 0 | 0 |
| Exploring Anchor-based Detection for Ego4D Natural Language Query | Aug 10, 2022 | Video Understanding | —Unverified | 0 | 0 |
| Exploring Missing Modality in Multimodal Egocentric Datasets | Jan 21, 2024 | Action Recognition, Video Understanding | —Unverified | 0 | 0 |
| Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022 | Nov 16, 2022 | Human-Object Interaction Detection, Object | —Unverified | 0 | 0 |
| Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding | Jan 28, 2025 | Decoder, Video Understanding | —Unverified | 0 | 0 |
| Extending Video Masked Autoencoders to 128 frames | Nov 20, 2024 | Decoder, Video Understanding | —Unverified | 0 | 0 |
| Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding | Aug 11, 2017 | Action Detection, Action Recognition | —Unverified | 0 | 0 |
| Real-Time Segmentation Networks should be Latency Aware | Apr 6, 2020 | Autonomous Vehicles, Scene Segmentation | —Unverified | 0 | 0 |
| Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning | May 16, 2018 | Action Recognition, Atari Games | —Unverified | 0 | 0 |
| FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models | Mar 12, 2025 | Mixture-of-Experts, Question Answering | —Unverified | 0 | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | Jun 12, 2024 | Video Understanding | —Unverified | 0 | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Fine-Grain Annotation of Cricket Videos | Nov 24, 2015 | Action Recognition, Retrieval | —Unverified | 0 | 0 |
| Fine-Grained Video Captioning through Scene Graph Consolidation | Feb 23, 2025 | Caption Generation, Image Captioning | —Unverified | 0 | 0 |
| CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Dec 31, 2024 | Retrieval, Text Retrieval | —Unverified | 0 | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Flatten: Video Action Recognition is an Image Classification task | Aug 17, 2024 | Action Recognition, image-classification | —Unverified | 0 | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | Jun 1, 2025 | Video Understanding | —Unverified | 0 | 0 |
| FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering | Dec 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions | Sep 7, 2022 | Image Generation, Text to Image Generation | —Unverified | 0 | 0 |
| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | May 22, 2025 | EgoSchema, Few-Shot Learning | —Unverified | 0 | 0 |
| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | Apr 8, 2025 | In-Context Learning, Instruction Following | —Unverified | 0 | 0 |
| From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | Apr 8, 2025 | Game State Reconstruction, Jersey Number Recognition | —Unverified | 0 | 0 |
| From Image to Video, what do we need in multimodal LLMs? | Apr 18, 2024 | Video Understanding | —Unverified | 0 | 0 |
| From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | May 18, 2025 | Video Editing, Video Understanding | —Unverified | 0 | 0 |
| From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Mar 26, 2025 | Video Understanding | —Unverified | 0 | 0 |
| Fully Automated Hand Hygiene Monitoring in Operating Room using 3D Convolutional Neural Network | Mar 20, 2020 | Optical Flow Estimation, Transfer Learning | —Unverified | 0 | 0 |
| Future semantic segmentation of time-lapsed videos with large temporal displacement | Dec 27, 2018 | Segmentation, Semantic Segmentation | —Unverified | 0 | 0 |
| Gameplay Highlights Generation | May 12, 2025 | Event Detection, Highlight Detection | —Unverified | 0 | 0 |
| Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention | Apr 10, 2024 | Action Anticipation, Graph Neural Network | —Unverified | 0 | 0 |
| Generating the Future With Adversarial Transformers | Jul 1, 2017 | Video Understanding | —Unverified | 0 | 0 |
| Generating Videos with Scene Dynamics | Sep 8, 2016 | Action Classification, Future prediction | —Unverified | 0 | 0 |
| Generative Frame Sampler for Long Video Understanding | Mar 12, 2025 | Video Understanding | —Unverified | 0 | 0 |
| Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning | Jun 1, 2018 | Action Recognition, Representation Learning | —Unverified | 0 | 0 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | Dec 10, 2024 | cross-modal alignment, Video Understanding | —Unverified | 0 | 0 |
| Global Motion Understanding in Large-Scale Video Object Segmentation | May 11, 2024 | Instance Segmentation, Optical Flow Estimation | —Unverified | 0 | 0 |
| Global Self-Attention Networks | Jan 1, 2021 | Video Understanding | —Unverified | 0 | 0 |
| Global Self-Attention Networks for Image Recognition | Oct 6, 2020 | Video Understanding | —Unverified | 0 | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | Jun 14, 2024 | Activity Recognition, MMR total | —Unverified | 0 | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| Gradient Frequency Modulation for Visually Explaining Video Understanding Models | Nov 1, 2021 | Action Recognition, Temporal Action Localization | —Unverified | 0 | 0 |
| GraphVid: It Only Takes a Few Nodes to Understand a Video | Jul 4, 2022 | Superpixels, Video Understanding | —Unverified | 0 | 0 |
| Grounded Objects and Interactions for Video Captioning | Nov 16, 2017 | Object, Scene Understanding | —Unverified | 0 | 0 |
| Grounded Video Situation Recognition | Oct 19, 2022 | Descriptive, Structured Prediction | —Unverified | 0 | 0 |
| Grounding Action Descriptions in Videos | Jan 1, 2013 | Semantic Textual Similarity, Video Understanding | —Unverified | 0 | 0 |