| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 |
| SPOT! Revisiting Video-Language Models for Event Understanding | Nov 21, 2023 | Attribute, Video Understanding | Unverified | 0 |
| ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab | Nov 1, 2023 | Action Recognition, Video Understanding | Unverified | 0 |
| ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | Nov 1, 2023 | Action Detection, Classification | Unverified | 0 |
| Beyond still images: Temporal features and input variance resilience | Nov 1, 2023 | Video Understanding | Unverified | 0 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | Oct 23, 2023 | Action Recognition, Descriptive | Unverified | 0 |
| Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding | Oct 19, 2023 | Relation, Video Understanding | Unverified | 0 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | Unverified | 0 |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | Oct 2, 2023 | Autonomous Driving, Language Modeling | Unverified | 0 |
| Telling Stories for Common Sense Zero-Shot Action Recognition | Sep 29, 2023 | Action Recognition, Articles | Code Available | 0 |
| M^{3}3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | Sep 26, 2023 | 2D Semantic Segmentation, Action Detection | Unverified | 0 |
| Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | Sep 25, 2023 | Anomaly Detection, Dense Video Captioning | Unverified | 0 |
| Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding | Sep 20, 2023 | Action Localization, Form | Unverified | 0 |
| Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer | Sep 19, 2023 | Anatomy, Computational Efficiency | Unverified | 0 |
| Language as the Medium: Multimodal Video Classification through text only | Sep 19, 2023 | Action Recognition, Video Classification | Unverified | 0 |
| Judging a video by its bitstream cover | Sep 14, 2023 | Video Understanding | Code Available | 0 |
| Motion-Guided Masking for Spatiotemporal Representation Learning | Aug 24, 2023 | Domain Adaptation, Representation Learning | Unverified | 0 |
| MOFO: MOtion FOcused Self-Supervision for Video Understanding | Aug 23, 2023 | Action Classification, Action Recognition | Code Available | 0 |
| Are current long-term video understanding datasets long-term? | Aug 22, 2023 | Action Recognition, Video Understanding | Code Available | 0 |
| Audio-Visual Glance Network for Efficient Video Recognition | Aug 18, 2023 | Video Recognition, Video Understanding | Unverified | 0 |
| Temporally-Adaptive Models for Efficient Video Understanding | Aug 10, 2023 | Action Classification, Action Recognition | Unverified | 0 |
| M^3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition | Aug 6, 2023 | Action Recognition, Decision Making | Unverified | 0 |
| DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation | Jul 31, 2023 | Action Segmentation, Human-Object Interaction Detection | Unverified | 0 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Jul 13, 2023 | Action Recognition, Contrastive Learning | Unverified | 0 |
| HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding | Jul 9, 2023 | Action Recognition, Action Segmentation | Code Available | 0 |
| VideoGLUE: Video General Understanding Evaluation of Foundation Models | Jul 6, 2023 | Action Recognition, Temporal Localization | Unverified | 0 |
| ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Jun 28, 2023 | Retrieval, Video Retrieval | Code Available | 0 |
| Temporal Action Proposal Generation With Action Frequency Adaptive Network | Jun 23, 2023 | Knowledge Distillation, Temporal Action Proposal Generation | Code Available | 0 |
| Learning Space-Time Semantic Correspondences | Jun 16, 2023 | Imitation Learning, Semantic correspondence | Unverified | 0 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | Jun 8, 2023 | Video Understanding | Unverified | 0 |
| MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning | Jun 4, 2023 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Teacher Agent: A Knowledge Distillation-Free Framework for Rehearsal-based Video Incremental Learning | Jun 1, 2023 | Incremental Learning, Knowledge Distillation | Code Available | 0 |
| Action Sensitivity Learning for Temporal Action Localization | May 25, 2023 | Action Localization, Moment Queries | Unverified | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | May 16, 2023 | Action Localization, Action Recognition | Unverified | 0 |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | May 16, 2023 | Emotion Classification, Question Answering | Code Available | 0 |
| Vehicle Detection and Classification without Residual Calculation: Accelerating HEVC Image Decoding with Random Perturbation Injection | May 14, 2023 | Image Reconstruction, vehicle detection | Unverified | 0 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | Apr 27, 2023 | Video Understanding | Unverified | 0 |
| MRSN: Multi-Relation Support Network for Video Action Detection | Apr 24, 2023 | Action Detection, Relation | Unverified | 0 |
| Search-Map-Search: A Frame Selection Paradigm for Action Recognition | Apr 20, 2023 | Action Recognition, Heuristic Search | Unverified | 0 |
| LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | Apr 15, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Therbligs in Action: Video Understanding through Motion Primitives | Apr 6, 2023 | Action Anticipation, Action Recognition | Unverified | 0 |
| DOAD: Decoupled One Stage Action Detection Network | Apr 1, 2023 | Action Detection, Action Recognition | Unverified | 0 |
| SVT: Supertoken Video Transformer for Efficient Video Understanding | Apr 1, 2023 | Video Understanding | Unverified | 0 |
| System-status-aware Adaptive Network for Online Streaming Video Understanding | Mar 28, 2023 | Streaming video understanding, Video Understanding | Unverified | 0 |
| Selective Structured State-Spaces for Long-Form Video Understanding | Mar 25, 2023 | Contrastive Learning, Form | Unverified | 0 |
| Leaping Into Memories: Space-Time Deep Feature Synthesis | Mar 17, 2023 | Diversity, Video Understanding | Code Available | 0 |
| Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks | Feb 24, 2023 | Classification, Data Augmentation | Unverified | 0 |
| MINOTAUR: Multi-task Video Grounding From Multimodal Queries | Feb 16, 2023 | Action Detection, Sentence | Code Available | 0 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Building Scalable Video Understanding Benchmarks through Sports | Jan 17, 2023 | Video Understanding | Unverified | 0 |