Papers in this area
| Task | Papers | Results |
|---|---|---|
| Sketch-to-Image Translation | 15 | 7 |
| Motion Captioning: Generating textual descriptions of human motion. | 11 | 7 |
| Text to 3D: Task involves generating 3D objects based on the text prompt… | 314 | 6 |
| Multi-modal Classification | 31 | 6 |
| Audio-visual Question Answering | 27 | 6 |
| 3D Object Captioning: 3D object captioning involves generating a natural language … | 7 | 6 |
| Long Video Retrieval (Background Removed): Retrieve the long videos given all subtitles. | 6 | 6 |
| VCGBench-Diverse: Recognizing the limited diversity in existing video conversa… | 5 | 6 |
| Visual Speech Recognition | 182 | 5 |
| Generalized Referring Expression Comprehension: Generalized Referring Expression Comprehension (GREC) allows… | 7 | 5 |
| Explanatory Visual Question Answering: Explanatory Visual Question Answering (EVQA) requires answer… | 5 | 5 |
| Text to Video Retrieval | 75 | 4 |
| Dense Captioning | 69 | 4 |
| Multimodal Text and Image Classification: Classification using both the source image and text. | 7 | 4 |
| Lip Reading: Lip Reading is a task to infer the speech content in a video… | 153 | 3 |
| Generative Visual Question Answering: Generating answers in free form to questions posed about ima… | 9 | 3 |
| Supervised Image Retrieval | 4 | 3 |
| Semi Supervised Learning for Image Captioning | 2 | 3 |
| audio-visual event localization | 26 | 2 |
| Composed Image Retrieval (CoIR): Composed Image Retrieval (CoIR) is a task that involves retriev… | 14 | 2 |
| Text-to-3D-Human Generation: Generating 3D avatars from text prompts. | 3 | 2 |
| Document Image Skew Estimation | 1 | 2 |
| Referring Expression: Referring expressions place a bounding box around the insta… | 364 | 1 |
| Multimodal Deep Learning: Multimodal deep learning is a type of deep learning that com… | 213 | 1 |
| Content-Based Image Retrieval: Content-Based Image Retrieval is a well studied problem in c… | 195 | 1 |
| Scene-Aware Dialogue | 8 | 1 |
| Grounded Multimodal Named Entity Recognition | 3 | 1 |
| X-ray Visual Question Answering | 2 | 1 |
| LMM real-life tasks | 1 | 1 |
| Multimodal Text Prediction: Multimodal text prediction is a type of natural language pro… | 1 | 1 |
| Segmented Multimodal Named Entity Recognition | 1 | 1 |
| World Knowledge | 818 | 0 |
| cross-modal alignment | 342 | 0 |
| Image-text Retrieval | 248 | 0 |
| Image-text matching: Image-Text Matching is a subtask within Cross-Modal Retrieva… | 188 | 0 |
| Referring Expression Comprehension | 167 | 0 |
| Vision-Language-Action | 157 | 0 |
| Video-Text Retrieval: Video-Text retrieval requires understanding of both video an… | 111 | 0 |
| Zero-Shot Video Question Answer: This task presents the results of Zeroshot Question Answer re… | 85 | 0 |
| 3D visual grounding | 82 | 0 |
| Visual Commonsense Reasoning | 65 | 0 |
| Visual Entailment: Visual Entailment (VE) is a task consisting of image-sente… | 56 | 0 |
| Physical Intuition | 35 | 0 |
| Cross-Modality Person Re-identification | 26 | 0 |
| 3D Question Answering (3D-QA): A 3D-QA task requires models to answer a question when given… | 22 | 0 |
| MM-Vet | 19 | 0 |
| Multimodal Unsupervised Image-To-Image Translation: Multimodal unsupervised image-to-image translation is the ta… | 17 | 0 |
| Zero-Shot Text-to-Image Generation | 16 | 0 |
| Generalized Referring Expression Segmentation: Generalized Referring Expression Segmentation (GRES), introd… | 15 | 0 |
| TGIF-Frame | 15 | 0 |