| Massively Parallel Video Networks | Jun 11, 2018 | Action Recognition, Temporal Action Localization | —Unverified | 0 |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Apr 14, 2025 | Computational Efficiency, Language Modeling | —Unverified | 0 |
| MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding | Feb 5, 2025 | Diversity, EgoSchema | —Unverified | 0 |
| Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization | Mar 12, 2025 | Temporal Localization, Video Understanding | —Unverified | 0 |
| Memory Consolidation Enables Long-Context Video Understanding | Feb 8, 2024 | EgoSchema, Video Understanding | —Unverified | 0 |
| Memory-enhanced Retrieval Augmentation for Long Video Understanding | Mar 12, 2025 | RAG, Retrieval | —Unverified | 0 |
| Memory-Guided Semantic Learning Network for Temporal Sentence Grounding | Jan 3, 2022 | Sentence, Temporal Sentence Grounding | —Unverified | 0 |
| MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD | Jun 11, 2024 | Video Recognition, Video Understanding | —Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action Classification, Navigate | —Unverified | 0 |
| Mid-level Representation for Visual Recognition | Dec 23, 2015 | object-detection, Object Detection | —Unverified | 0 |
| Mimic The Raw Domain: Accelerating Action Recognition in the Compressed Domain | Nov 19, 2019 | Action Recognition, Video Recognition | —Unverified | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | —Unverified | 0 |
| MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding | Jun 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | —Unverified | 0 |
| MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | Jun 20, 2024 | Form, Video Understanding | —Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | —Unverified | 0 |
| MM-Ego: Towards Building Egocentric Multimodal LLMs | Oct 9, 2024 | Video Understanding | —Unverified | 0 |
| Moment Quantization for Video Temporal Grounding | Apr 3, 2025 | Quantization, Video Understanding | —Unverified | 0 |
| MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval | Feb 18, 2025 | Action Recognition, Moment Retrieval | —Unverified | 0 |
| Morph: Flexible Acceleration for 3D CNN-based Video Understanding | Oct 16, 2018 | MORPH, Video Recognition | —Unverified | 0 |
| MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models | Jan 6, 2025 | Benchmarking, Feature Compression | —Unverified | 0 |
| Motion-Guided Masking for Spatiotemporal Representation Learning | Aug 24, 2023 | Domain Adaptation, Representation Learning | —Unverified | 0 |
| Motion Sensitive Contrastive Learning for Self-supervised Video Representation | Aug 12, 2022 | Contrastive Learning, Representation Learning | —Unverified | 0 |
| MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies | Mar 3, 2024 | Text Generation, Video Understanding | —Unverified | 0 |
| MovieNet: A Holistic Dataset for Movie Understanding | Jul 21, 2020 | Video Understanding | —Unverified | 0 |
| MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning | Jun 4, 2023 | Benchmarking, Contrastive Learning | —Unverified | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | —Unverified | 0 |
| MRSN: Multi-Relation Support Network for Video Action Detection | Apr 24, 2023 | Action Detection, Relation | —Unverified | 0 |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Jun 1, 2016 | Image Captioning, Sentence | —Unverified | 0 |
| Multi-kernel learning of deep convolutional features for action recognition | Jul 21, 2017 | Action Recognition, Activity Recognition | —Unverified | 0 |
| Multimodal High-order Relation Transformer for Scene Boundary Detection | Jan 1, 2023 | Boundary Detection, Decoder | —Unverified | 0 |
| Multimodal Intent Discovery from Livestream Videos | Jul 1, 2022 | Intent Discovery, Video Summarization | —Unverified | 0 |
| Multi-modal Representation Learning for Video Advertisement Content Structuring | Sep 4, 2021 | Representation Learning, Re-Ranking | —Unverified | 0 |
| Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation | Nov 30, 2023 | Contrastive Learning, Domain Adaptation | —Unverified | 0 |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | May 29, 2025 | RAG, Retrieval-augmented Generation | —Unverified | 0 |
| Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization | Jan 16, 2024 | Decoder, Denoising | —Unverified | 0 |
| Multi-Scale Contrastive Learning for Video Temporal Grounding | Dec 10, 2024 | Contrastive Learning, Data Augmentation | —Unverified | 0 |
| Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | Mar 8, 2022 | Contrastive Learning, Sentence | —Unverified | 0 |
| Multiview Transformers for Video Recognition | Jan 12, 2022 | Action Classification, Action Recognition | —Unverified | 0 |
| MVTamperBench: Evaluating Robustness of Vision-Language Models | Dec 27, 2024 | Video Understanding | —Unverified | 0 |
| Representation Learning on Visual-Symbolic Graphs for Video Understanding | May 17, 2019 | Action Classification, Action Detection | —Unverified | 0 |
| No More Shortcuts: Realizing the Potential of Temporal Self-Supervision | Dec 20, 2023 | Action Classification, Attribute | —Unverified | 0 |
| Non-local NetVLAD Encoding for Video Classification | Sep 29, 2018 | Classification, General Classification | —Unverified | 0 |
| O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning | Aug 5, 2021 | Attribute, Caption Generation | —Unverified | 0 |
| Object Dynamics Distillation for Scene Decomposition and Representation | Sep 29, 2021 | Object, Predict Future Video Frames | —Unverified | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | —Unverified | 0 |
| OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | Jul 6, 2024 | Video Understanding | —Unverified | 0 |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Mar 29, 2025 | Streaming video understanding, Video Understanding | —Unverified | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| OmniTrack: Real-time detection and tracking of objects, text and logos in video | Oct 14, 2019 | GPU, object-detection | —Unverified | 0 |