SOTAVerified

Video Recognition

Video recognition is the process of obtaining, processing, and analysing data received from a visual source, specifically video.

Papers

Showing 1–50 of 307 papers

Title | Status | Hype
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations | Code | 5
Expanding Language-Image Pretrained Models for General Video Recognition | Code | 3
Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition | Code | 2
DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | Code | 2
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation | Code | 2
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | Code | 2
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition | Code | 2
AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition | Code | 2
TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device | Code | 2
Video Swin Transformer | Code | 2
Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs? | Code | 2
X3D: Expanding Architectures for Efficient Video Recognition | Code | 2
Omni-sourced Webly-supervised Learning for Video Recognition | Code | 2
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation | Code | 1
PAVE: Patching and Adapting Video Large Language Models | Code | 1
OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning | Code | 1
VideoMamba: Spatio-Temporal Selective State Space Model | Code | 1
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | Code | 1
VG4D: Vision-Language Model Goes 4D Video Recognition | Code | 1
Video Recognition in Portrait Mode | Code | 1
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos | Code | 1
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition | Code | 1
DEVIAS: Learning Disentangled Video Representations of Action and Scene | Code | 1
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data | Code | 1
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video | Code | 1
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning | Code | 1
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers | Code | 1
Audio-Visual Class-Incremental Learning | Code | 1
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation | Code | 1
What Can Simple Arithmetic Operations Do for Temporal Modeling? | Code | 1
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition | Code | 1
Implicit Temporal Modeling with Learnable Alignment for Video Recognition | Code | 1
Frame Flexible Network | Code | 1
The effectiveness of MAE pre-pretraining for billion-scale pretraining | Code | 1
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge | Code | 1
Making Vision Transformers Efficient from A Token Sparsification View | Code | 1
Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization | Code | 1
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | Code | 1
Efficient Movie Scene Detection using State-Space Transformers | Code | 1
VLG: General Video Recognition with Web Textual Knowledge | Code | 1
SVFormer: Semi-supervised Video Transformer for Action Recognition | Code | 1
Look More but Care Less in Video Recognition | Code | 1
Cluster and Aggregate: Face Recognition with Large Probe Set | Code | 1
Towards a Unified View on Visual Parameter-Efficient Transfer Learning | Code | 1
AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition | Code | 1
Rethinking Resolution in the Context of Efficient Video Recognition | Code | 1
Real-time Online Video Detection with Temporal Smoothing Transformers | Code | 1
Frozen CLIP Models are Efficient Video Learners | Code | 1

No leaderboard results yet.