| CAST: Cross-Attention in Space and Time for Video Action Recognition | Nov 30, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties | Nov 28, 2023 | In-Context Learning, Video Understanding | Code Available | 1 |
| Panoptic Video Scene Graph Generation | Nov 28, 2023 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 1 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Nov 27, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding | Nov 25, 2023 | Video Understanding | Code Available | 1 |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | Oct 30, 2023 | Script Generation, Video Understanding | Code Available | 1 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 |
| End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Sep 27, 2023 | Action Recognition, Action Segmentation | Code Available | 1 |
| SoccerNet 2023 Challenges Results | Sep 12, 2023 | Action Spotting, Camera Calibration | Code Available | 1 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Aug 29, 2023 | Federated Learning, image-classification | Code Available | 1 |
| Spherical Vision Transformer for 360-degree Video Saliency Prediction | Aug 24, 2023 | Prediction, Saliency Prediction | Code Available | 1 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understanding, Self-Supervised Learning | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 |
| Multimodal Distillation for Egocentric Action Recognition | Jul 14, 2023 | Action Recognition, Knowledge Distillation | Code Available | 1 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| An overview on the evaluated video retrieval tasks at TRECVID 2022 | Jun 22, 2023 | Ad-hoc video search, Retrieval | Code Available | 1 |
| Multi-Granularity Hand Action Detection | Jun 19, 2023 | Action Detection, Action Localization | Code Available | 1 |
| EPIC Fields: Marrying 3D Geometry and Video Understanding | Jun 14, 2023 | 3D geometry, Neural Rendering | Code Available | 1 |
| VideoLLM: Modeling Video Sequence with Large Language Models | May 22, 2023 | Decoder, Video Understanding | Code Available | 1 |
| Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | May 10, 2023 | Autonomous Vehicles, Monocular Visual Odometry | Code Available | 1 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Apr 29, 2023 | Decoder, Highlight Detection | Code Available | 1 |
| Event-Free Moving Object Segmentation from Moving Ego Vehicle | Apr 28, 2023 | Autonomous Driving, Benchmarking | Code Available | 1 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 |
| Procedure-Aware Pretraining for Instructional Video Understanding | Mar 31, 2023 | Video Understanding | Code Available | 1 |
| Whether and When does Endoscopy Domain Pretraining Make Sense? | Mar 30, 2023 | Action Triplet Detection, Surgical phase recognition | Code Available | 1 |
| Streaming Video Model | Mar 30, 2023 | Action Recognition, Decoder | Code Available | 1 |
| TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition | Mar 28, 2023 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos | Mar 22, 2023 | Representation Learning, Sentence | Code Available | 1 |
| Dual-path Adaptation from Image to Video Transformers | Mar 17, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization | Mar 16, 2023 | Action Localization, Temporal Action Localization | Code Available | 1 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 |
| Boosting Single Image Super-Resolution via Partial Channel Shifting | Jan 1, 2023 | Diversity, Image Super-Resolution | Code Available | 1 |
| Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Jan 1, 2023 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Towards Smooth Video Composition | Dec 14, 2022 | Image Generation, single-image-generation | Code Available | 1 |
| MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing | Nov 28, 2022 | Activity Recognition, Few Shot Action Recognition | Code Available | 1 |
| Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Nov 21, 2022 | Decoder, Retrieval | Code Available | 1 |
| EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Nov 19, 2022 | Action Recognition, Object State Change Classification | Code Available | 1 |
| InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Nov 17, 2022 | Future Hand Prediction, Moment Queries | Code Available | 1 |
| VTC: Improving Video-Text Retrieval with User Comments | Oct 19, 2022 | Representation Learning, Retrieval | Code Available | 1 |
| EgoTaskQA: Understanding Human Tasks in Egocentric Videos | Oct 8, 2022 | Action Localization, counterfactual | Code Available | 1 |
| SoccerNet 2022 Challenges Results | Oct 5, 2022 | Action Spotting, Camera Calibration | Code Available | 1 |
| Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Sep 30, 2022 | Descriptive, Representation Learning | Code Available | 1 |
| Streaming Video Temporal Action Segmentation In Real Time | Sep 28, 2022 | Action Segmentation, Language Modelling | Code Available | 1 |
| Panoramic Vision Transformer for Saliency Detection in 360° Videos | Sep 19, 2022 | Saliency Detection, Saliency Prediction | Code Available | 1 |
| EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography | Sep 9, 2022 | Video Understanding | Code Available | 1 |
| DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations | Aug 17, 2022 | Camera Calibration, Instance Segmentation | Code Available | 1 |
| Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding | Jul 30, 2022 | point cloud video understanding, Video Understanding | Code Available | 1 |
| Static and Dynamic Concepts for Self-supervised Video Representation Learning | Jul 26, 2022 | Diversity, Representation Learning | Code Available | 1 |