| No More Shortcuts: Realizing the Potential of Temporal Self-Supervision | Dec 20, 2023 | Action Classification, Attribute | Unverified | 0 |
| Text-Conditioned Resampler For Long Form Video Understanding | Dec 19, 2023 | EgoSchema, Form | Unverified | 0 |
| Learning Object State Changes in Videos: An Open-World Perspective | Dec 19, 2023 | Video Understanding | Unverified | 0 |
| Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s | Dec 17, 2023 | Semantic Segmentation, Video Semantic Segmentation | Unverified | 0 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 |
| SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Dec 15, 2023 | Video Understanding | Code Available | 1 |
| X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer | Dec 12, 2023 | Action Recognition, Action Segmentation | Code Available | 0 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | Unverified | 0 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | Unverified | 0 |
| Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Dec 6, 2023 | Action Anticipation, Form | Code Available | 1 |
| HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding | Dec 5, 2023 | Diversity, Graph Generation | Unverified | 0 |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| DEVIAS: Learning Disentangled Video Representations of Action and Scene | Nov 30, 2023 | Action Recognition, Decoder | Code Available | 1 |
| Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation | Nov 30, 2023 | Contrastive Learning, Domain Adaptation | Unverified | 0 |
| Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding | Nov 30, 2023 | Form, Video Retrieval | Unverified | 0 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | Nov 30, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives | Nov 30, 2023 | Video Understanding | Code Available | 2 |
| Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties | Nov 28, 2023 | In-Context Learning, Video Understanding | Code Available | 1 |
| Panoptic Video Scene Graph Generation | Nov 28, 2023 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 1 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Nov 27, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding | Nov 25, 2023 | Video Understanding | Code Available | 1 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking, Phrase Grounding | Code Available | 2 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 |
| SPOT! Revisiting Video-Language Models for Event Understanding | Nov 21, 2023 | Attribute, Video Understanding | Unverified | 0 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking, Language Modeling | Code Available | 2 |
| ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab | Nov 1, 2023 | Action Recognition, Video Understanding | Unverified | 0 |
| Beyond still images: Temporal features and input variance resilience | Nov 1, 2023 | Video Understanding | Unverified | 0 |
| ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | Nov 1, 2023 | Action Detection, Classification | Unverified | 0 |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | Oct 30, 2023 | Script Generation, Video Understanding | Code Available | 1 |
| Videoprompter: an ensemble of foundational models for zero-shot video understanding | Oct 23, 2023 | Action Recognition, Descriptive | Unverified | 0 |
| Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding | Oct 19, 2023 | Relation, Video Understanding | Unverified | 0 |
| A Survey on Video Diffusion Models | Oct 16, 2023 | Image Generation, Survey | Code Available | 4 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | Unverified | 0 |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | Oct 2, 2023 | Autonomous Driving, Language Modeling | Unverified | 0 |
| Telling Stories for Common Sense Zero-Shot Action Recognition | Sep 29, 2023 | Action Recognition, Articles | Code Available | 0 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | Benchmarking, Recommendation Systems | Code Available | 2 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Sep 27, 2023 | Action Recognition, Action Segmentation | Code Available | 1 |
| M³3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | Sep 26, 2023 | 2D Semantic Segmentation, Action Detection | Unverified | 0 |
| Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | Sep 25, 2023 | Anomaly Detection, Dense Video Captioning | Unverified | 0 |
| Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding | Sep 20, 2023 | Action Localization, Form | Unverified | 0 |
| Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer | Sep 19, 2023 | Anatomy, Computational Efficiency | Unverified | 0 |
| Language as the Medium: Multimodal Video Classification through text only | Sep 19, 2023 | Action Recognition, Video Classification | Unverified | 0 |