Papers in this area
| Task | Papers | Results |
|---|---|---|
| Visual Question Answering (VQA) Visual Question Answering (VQA) is a task in computer vision… | 2,167 | 727 |
| Image Captioning Image Captioning is the task of describing the content of an… | 1,878 | 422 |
| Image Retrieval Image Retrieval is a fundamental and long-standing computer … | 2,239 | 372 |
| Visual Question Answering MLLM Leaderboard | 2,177 | 334 |
| Referring Expression Segmentation The task aims at labeling the pixels of an image or video th… | 145 | 317 |
| Video Retrieval The objective of video retrieval is as follows: given a text… | 486 | 309 |
| Visual Place Recognition Visual Place Recognition is the task of matching a view of a… | 297 | 265 |
| Video Question Answering | 460 | 254 |
| Image-to-Image Translation Image-to-Image Translation is a task in computer vision and … | 1,184 | 196 |
| Visual Reasoning Ability to understand actions and reasoning associated with … | 698 | 192 |
| Vision and Language Navigation | 223 | 169 |
| Text-to-Image Generation | 1,085 | 160 |
| Zero-Shot Video Retrieval Zero-shot video retrieval is the task of retrieving relevant… | 40 | 126 |
| Cross-Modal Retrieval Cross-Modal Retrieval (CMR) is a task of retrieving items ac… | 522 | 111 |
| Visual Dialog Visual Dialog requires an AI agent to hold a meaningful dial… | 118 | 106 |
| Sign Language Recognition Sign Language Recognition is a computer vision and natural l… | 297 | 101 |
| Lipreading Lipreading is a process of extracting speech by watching lip… | 103 | 94 |
| Video Captioning Video Captioning is the task of automatically captioning a video b… | 473 | 86 |
| Multi-modal Entity Alignment | 19 | 62 |
| Moment Retrieval Moment retrieval can be defined as the task of "localizing m… | 132 | 57 |
| Document Image Classification Document image classification is the task of classifying doc… | 50 | 49 |
| Text based Person Retrieval | 49 | 42 |
| Visual Prompt Tuning Visual Prompt Tuning (VPT) only introduces a small amount of … | 70 | 40 |
| Molecule Captioning Molecular description generation entails the creation of a d… | 25 | 39 |
| Chart Question Answering Question answering on chart images | 50 | 38 |
| Text-to-Video Generation | 201 | 36 |
| Referring expression generation Generate referring expressions | 84 | 34 |
| Visual Storytelling | 115 | 33 |
| Image-to-Text Retrieval Image-text retrieval is the process of retrieving relevant i… | 59 | 33 |
| Phrase Grounding Given an image and a corresponding caption, the Phrase Groun… | 88 | 30 |
| Audio-Visual Speech Recognition Audio-visual speech recognition is the task of transcribing … | 100 | 24 |
| Dense Video Captioning Most natural videos contain numerous events. For example, in… | 76 | 24 |
| Sign Language Translation Given a video containing sign language, the task is to predi… | 153 | 23 |
| Visual Grounding Visual Grounding (VG) aims to locate the most relevant objec… | 571 | 20 |
| Multimodal Emotion Recognition This is a leaderboard for multimodal emotion recognition on … | 180 | 20 |
| Video-Adverb Retrieval The bidirectional video-adverb retrieval task aims at retrie… | 4 | 20 |
| Video Summarization Video Summarization aims to generate a short synopsis that s… | 280 | 18 |
| Natural Language Visual Grounding | 32 | 18 |
| Text-based Person Retrieval with Noisy Correspondence This is a benchmark about text-based person retrieval with n… | 6 | 18 |
| Sketch-Based Image Retrieval | 110 | 17 |
| Multimodal Machine Translation Multimodal machine translation is the task of doing machine … | 108 | 17 |
| Ad-hoc video search The Ad-hoc search task ended a 3-year cycle from 2016 to 2018 w… | 13 | 15 |
| Image Retrieval with Multi-Modal Query The problem of retrieving images from a database based on a … | 10 | 15 |
| Image/Document Clustering | 8 | 14 |
| Multimodal Reasoning Reasoning over multimodal inputs. | 302 | 13 |
| Spatio-Temporal Video Grounding Spatio-temporal video grounding is a computer vision and nat… | 22 | 10 |
| Image Paragraph Captioning Image paragraph captioning involves generating a detailed, m… | 17 | 10 |
| Video Grounding Video grounding is the task of linking spoken language descr… | 114 | 9 |
| TinyQA Benchmark++ Ultra-lightweight evaluation suite and Python package design… | 1 | 8 |
| Unsupervised Image-To-Image Translation Unsupervised image-to-image translation is the task of doing… | 124 | 7 |