| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | Aug 1, 2022 | Object, Optical Flow Estimation | Unverified | 0 |
| Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding | Jul 30, 2022 | point cloud video understanding, Video Understanding | Code Available | 1 |
| Static and Dynamic Concepts for Self-supervised Video Representation Learning | Jul 26, 2022 | Diversity, Representation Learning | Code Available | 1 |
| EgoEnv: Human-centric environment representations from egocentric video | Jul 22, 2022 | Video Understanding | Unverified | 0 |
| Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022 | Jul 22, 2022 | Object, Object State Change Classification | Unverified | 0 |
| AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding | Jul 21, 2022 | Action Recognition, Video Understanding | Unverified | 0 |
| An Efficient Spatio-Temporal Pyramid Transformer for Action Detection | Jul 21, 2022 | Action Detection, Video Understanding | Unverified | 0 |
| Spotting Temporally Precise, Fine-Grained Events in Video | Jul 20, 2022 | Action Detection, Action Spotting | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| SVGraph: Learning Semantic Graphs from Instructional Videos | Jul 16, 2022 | Graph Learning, Video Understanding | Unverified | 0 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| Federated Self-supervised Learning for Video Understanding | Jul 5, 2022 | Action Recognition, Federated Learning | Code Available | 1 |
| GraphVid: It Only Takes a Few Nodes to Understand a Video | Jul 4, 2022 | Superpixels, Video Understanding | Unverified | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Multimodal Intent Discovery from Livestream Videos | Jul 1, 2022 | Intent Discovery, Video Summarization | Unverified | 0 |
| (Un)likelihood Training for Interpretable Embedding | Jul 1, 2022 | Ad-hoc video search, Decoder | Code Available | 0 |
| Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach | Jun 30, 2022 | Boundary Detection, Generic Event Boundary Detection | Code Available | 0 |
| Technical Report for CVPR 2022 LOVEU AQTC Challenge | Jun 29, 2022 | Video Understanding | Code Available | 0 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | Jun 27, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Jun 18, 2022 | Decoder, Semantic Segmentation | Code Available | 1 |
| Multimodal Dialogue State Tracking | Jun 16, 2022 | Dialogue State Tracking, Video Understanding | Code Available | 0 |
| Stand-Alone Inter-Frame Attention in Video Models | Jun 14, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens | Jun 13, 2022 | Action Recognition, Video Understanding | Unverified | 0 |
| A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Jun 7, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey | Jun 5, 2022 | 3D Hand Pose Estimation, Domain Adaptation | Unverified | 0 |