| Zero-shot Text-to-Image Retrieval | 15 | 0 |
| controllable image captioning: generate image captions conditioned on control signals | 14 | 0 |
| Cross-Modal Person Re-Identification | 13 | 0 |
| Image-text Classification | 13 | 0 |
| Video to Text Retrieval | 13 | 0 |
| Sports Understanding | 11 | 0 |
| Conditional Text-to-Image Synthesis: Introducing extra conditions based on the text-to-image gene… | 10 | 0 |
| Cross-modal place recognition: text-to-point-cloud place recognition | 10 | 0 |
| Text-to-Video Editing | 9 | 0 |
| Vision-Language Segmentation | 9 | 0 |
| Cross-View Image-to-Image Translation | 8 | 0 |
| Text-to-Shape Generation | 8 | 0 |
| Grounded Video Question Answering | 7 | 0 |
| TGIF-Action | 7 | 0 |
| TGIF-Transition | 7 | 0 |
| Video-Guided Machine Translation | 7 | 0 |
| Vietnamese Visual Question Answering | 7 | 0 |
| Open-Domain Subject-to-Video: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Datase… | 6 | 0 |
| Query-focused video summarization: Model takes a long video and a query in the following forms(… | 6 | 0 |
| Factual Visual Question Answering | 5 | 0 |
| Vietnamese Image Captioning | 5 | 0 |
| Visual Question Answering (VQA) Split A | 5 | 0 |
| Visual Question Answering (VQA) Split B | 5 | 0 |
| Weakly Supervised Referring Expression Segmentation: RES with a smaller percentage of ground-truth annotations | 5 | 0 |
| Zero-shot Text-to-Video Generation | 5 | 0 |
| Document Image Quality Assessment: Image quality assessment for document images | 4 | 0 |
| Person-centric Visual Grounding: Person-centric visual grounding is the problem of linking be… | 4 | 0 |
| Semantic Image-Text Similarity | 4 | 0 |
| Text-to-video search | 4 | 0 |
| Hindi Image Captioning: The main goal of this task is to generate a caption for an i… | 3 | 0 |
| Multilingual Text-to-Image Generation | 3 | 0 |
| Visual Sentiment Prediction | 3 | 0 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | 3 | 0 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | 3 | 0 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | 3 | 0 |
| zero-shot long video breakpoint-mode question answering | 3 | 0 |
| zero-shot long video global-model question answering | 3 | 0 |
| zero-shot long video question answering | 3 | 0 |
| Zero-Shot Visual Question Answering | 3 | 0 |
| Aesthetic Image Captioning | 2 | 0 |
| Cross-lingual Text-to-Image Generation | 2 | 0 |
| Live Video Captioning: Live video captioning (LVC) involves detecting and describin… | 2 | 0 |
| Multi-lingual Text-to-Image Generation | 2 | 0 |
| Text within image generation | 2 | 0 |
| Visual Commonsense Tests: Predict 5 property types (color, shape, material, size, and … | 2 | 0 |
| Zero-Shot Cross-Lingual Visual Question Answering | 2 | 0 |
| Zero-Shot Cross-Lingual Visual Reasoning | 2 | 0 |
| zero-shot long video global-mode question answering | 2 | 0 |
| Zeroshot Video Question Answer | 2 | 0 |
| Crosslingual Text-to-Image Generation | 1 | 0 |