| Judging a video by its bitstream cover | Sep 14, 2023 | Video Understanding | Code Available | 0 |
| SoccerNet 2023 Challenges Results | Sep 12, 2023 | Action Spotting, Camera Calibration | Code Available | 1 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Aug 29, 2023 | Federated Learning, image-classification | Code Available | 1 |
| Spherical Vision Transformer for 360-degree Video Saliency Prediction | Aug 24, 2023 | Prediction, Saliency Prediction | Code Available | 1 |
| Motion-Guided Masking for Spatiotemporal Representation Learning | Aug 24, 2023 | Domain Adaptation, Representation Learning | Unverified | 0 |
| MOFO: MOtion FOcused Self-Supervision for Video Understanding | Aug 23, 2023 | Action Classification, Action Recognition | Code Available | 0 |
| Are current long-term video understanding datasets long-term? | Aug 22, 2023 | Action Recognition, Video Understanding | Code Available | 0 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understanding, Self-Supervised Learning | Code Available | 1 |
| Audio-Visual Glance Network for Efficient Video Recognition | Aug 18, 2023 | Video Recognition, Video Understanding | Unverified | 0 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 |
| Temporally-Adaptive Models for Efficient Video Understanding | Aug 10, 2023 | Action Classification, Action Recognition | Unverified | 0 |
| M^3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition | Aug 6, 2023 | Action Recognition, Decision Making | Unverified | 0 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 |
| DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation | Jul 31, 2023 | Action Segmentation, Human-Object Interaction Detection | Unverified | 0 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillation, object-detection | Code Available | 2 |
| Multimodal Distillation for Egocentric Action Recognition | Jul 14, 2023 | Action Recognition, Knowledge Distillation | Code Available | 1 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Jul 13, 2023 | Action Recognition, Contrastive Learning | Unverified | 0 |
| HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding | Jul 9, 2023 | Action Recognition, Action Segmentation | Code Available | 0 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| VideoGLUE: Video General Understanding Evaluation of Foundation Models | Jul 6, 2023 | Action Recognition, Temporal Localization | Unverified | 0 |
| ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Jun 28, 2023 | Retrieval, Video Retrieval | Code Available | 0 |
| Temporal Action Proposal Generation With Action Frequency Adaptive Network | Jun 23, 2023 | Knowledge Distillation, Temporal Action Proposal Generation | Code Available | 0 |
| An overview on the evaluated video retrieval tasks at TRECVID 2022 | Jun 22, 2023 | Ad-hoc video search, Retrieval | Code Available | 1 |
| Multi-Granularity Hand Action Detection | Jun 19, 2023 | Action Detection, Action Localization | Code Available | 1 |
| Learning Space-Time Semantic Correspondences | Jun 16, 2023 | Imitation Learning, Semantic correspondence | Unverified | 0 |
| EPIC Fields: Marrying 3D Geometry and Video Understanding | Jun 14, 2023 | 3D geometry, Neural Rendering | Code Available | 1 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Jun 12, 2023 | Action Recognition, Instruction Following | Code Available | 2 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Jun 8, 2023 | Question Answering, VCGBench-Diverse | Code Available | 3 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | Jun 8, 2023 | Video Understanding | Unverified | 0 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval, Language Modelling | Code Available | 2 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language Modeling, Language Modelling | Code Available | 4 |
| MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning | Jun 4, 2023 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Teacher Agent: A Knowledge Distillation-Free Framework for Rehearsal-based Video Incremental Learning | Jun 1, 2023 | Incremental Learning, Knowledge Distillation | Code Available | 0 |
| Action Sensitivity Learning for Temporal Action Localization | May 25, 2023 | Action Localization, Moment Queries | Unverified | 0 |
| VideoLLM: Modeling Video Sequence with Large Language Models | May 22, 2023 | Decoder, Video Understanding | Code Available | 1 |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | May 16, 2023 | Emotion Classification, Question Answering | Code Available | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | May 16, 2023 | Action Localization, Action Recognition | Unverified | 0 |
| Vehicle Detection and Classification without Residual Calculation: Accelerating HEVC Image Decoding with Random Perturbation Injection | May 14, 2023 | Image Reconstruction, vehicle detection | Unverified | 0 |
| Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | May 10, 2023 | Autonomous Vehicles, Monocular Visual Odometry | Code Available | 1 |
| VideoChat: Chat-Centric Video Understanding | May 10, 2023 | Question Answering, Video-based Generative Performance Benchmarking | Code Available | 4 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Apr 29, 2023 | Decoder, Highlight Detection | Code Available | 1 |
| Event-Free Moving Object Segmentation from Moving Ego Vehicle | Apr 28, 2023 | Autonomous Driving, Benchmarking | Code Available | 1 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | Apr 27, 2023 | Video Understanding | Unverified | 0 |
| MRSN: Multi-Relation Support Network for Video Action Detection | Apr 24, 2023 | Action Detection, Relation | Unverified | 0 |
| Search-Map-Search: A Frame Selection Paradigm for Action Recognition | Apr 20, 2023 | Action Recognition, Heuristic Search | Unverified | 0 |
| LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | Apr 15, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 |
| Therbligs in Action: Video Understanding through Motion Primitives | Apr 6, 2023 | Action Anticipation, Action Recognition | Unverified | 0 |
| SVT: Supertoken Video Transformer for Efficient Video Understanding | Apr 1, 2023 | Video Understanding | Unverified | 0 |