| Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation | May 18, 2023 | Image Generation, Text to Image Generation | Code Available | 1 | 5 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Oct 17, 2023 | Benchmarking, Language Modelling | Code Available | 1 | 5 |
| DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Jun 10, 2025 | Image Captioning, Retrieval | Code Available | 1 | 5 |
| Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment | Mar 18, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search | Jan 31, 2025 | Denoising, Video Alignment | Code Available | 1 | 5 |
| Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning | Mar 28, 2022 | Action Classification, Contrastive Learning | Code Available | 1 | 5 |
| Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers | Jun 15, 2023 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion | Mar 11, 2025 | Image Matting, Video Alignment | Code Available | 1 | 5 |
| Adversarial Skill Networks: Unsupervised Robot Skill Learning from Video | Oct 21, 2019 | continuous-control, Continuous Control | Code Available | 0 | 5 |
| Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification | Nov 22, 2024 | Autonomous Driving, Text-to-Video Generation | Code Available | 0 | 5 |