SOTAVerified

Multimodal & Vision-Language

171 tasks

Papers in this area

Showing 1-10 of 10 papers

| Title | Status | Hype |
| --- | --- | --- |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | | 0 |
| Visual Place Recognition for Large-Scale UAV Applications | | 0 |
| Transformer-based Spatial Grounding: A Comprehensive Survey | | 0 |
| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | | 0 |
| Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | | 0 |
| LaViPlan : Language-Guided Visual Path Planning with RLVR | | 0 |
| Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities | | 0 |
| AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | | 0 |
| LoViC: Efficient Long Video Generation with Context Compression | | 0 |
| MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval | | 0 |
| Task | Description | Papers | Results |
| --- | --- | --- | --- |
| Visual Question Answering (VQA) | Visual Question Answering (VQA) is a task in computer vision… | 2,167 | 727 |
| Image Captioning | Image Captioning is the task of describing the content of an… | 1,878 | 422 |
| Image Retrieval | Image Retrieval is a fundamental and long-standing computer … | 2,239 | 372 |
| Visual Question Answering | MLLM Leaderboard | 2,177 | 334 |
| Referring Expression Segmentation | The task aims at labeling the pixels of an image or video th… | 145 | 317 |
| Video Retrieval | The objective of video retrieval is as follows: given a text… | 486 | 309 |
| Visual Place Recognition | Visual Place Recognition is the task of matching a view of a… | 297 | 265 |
| Video Question Answering | | 460 | 254 |
| Image-to-Image Translation | Image-to-Image Translation is a task in computer vision and … | 1,184 | 196 |
| Visual Reasoning | Ability to understand actions and reasoning associated with … | 698 | 192 |
| Vision and Language Navigation | | 223 | 169 |
| Text-to-Image Generation | The development of the brain's blood supply in an embryo inv… | 1,085 | 160 |
| Zero-Shot Video Retrieval | Zero-shot video retrieval is the task of retrieving relevant… | 40 | 126 |
| Cross-Modal Retrieval | Cross-Modal Retrieval (CMR) is a task of retrieving items ac… | 522 | 111 |
| Visual Dialog | Visual Dialog requires an AI agent to hold a meaningful dial… | 118 | 106 |
| Sign Language Recognition | Sign Language Recognition is a computer vision and natural l… | 297 | 101 |
| Lipreading | Lipreading is a process of extracting speech by watching lip… | 103 | 94 |
| Video Captioning | Video Captioning is the task of automatically captioning a video b… | 473 | 86 |
| Multi-modal Entity Alignment | | 19 | 62 |
| Moment Retrieval | Moment retrieval can be defined as the task of "localizing m… | 132 | 57 |
| Document Image Classification | Document image classification is the task of classifying doc… | 50 | 49 |
| Text based Person Retrieval | | 49 | 42 |
| Visual Prompt Tuning | Visual Prompt Tuning (VPT) only introduces a small amount of … | 70 | 40 |
| Molecule Captioning | Molecular description generation entails the creation of a d… | 25 | 39 |
| Chart Question Answering | Question answering on chart images | 50 | 38 |
| Text-to-Video Generation | My grandmother told me that when she was a student, she… | 201 | 36 |
| Referring expression generation | Generate referring expressions | 84 | 34 |
| Visual Storytelling | ( Image credit: [No Metrics Are Perfect](https://github.com/… | 115 | 33 |
| Image-to-Text Retrieval | Image-text retrieval is the process of retrieving relevant i… | 59 | 33 |
| Phrase Grounding | Given an image and a corresponding caption, the Phrase Groun… | 88 | 30 |
| Audio-Visual Speech Recognition | Audio-visual speech recognition is the task of transcribing … | 100 | 24 |
| Dense Video Captioning | Most natural videos contain numerous events. For example, in… | 76 | 24 |
| Sign Language Translation | Given a video containing sign language, the task is to predi… | 153 | 23 |
| Visual Grounding | Visual Grounding (VG) aims to locate the most relevant objec… | 571 | 20 |
| Multimodal Emotion Recognition | This is a leaderboard for multimodal emotion recognition on … | 180 | 20 |
| Video-Adverb Retrieval | The bidirectional video-adverb retrieval task aims at retrie… | 4 | 20 |
| Video Summarization | Video Summarization aims to generate a short synopsis that s… | 280 | 18 |
| Natural Language Visual Grounding | | 32 | 18 |
| Text-based Person Retrieval with Noisy Correspondence | This is a benchmark about text-based person retrieval with n… | 6 | 18 |
| Sketch-Based Image Retrieval | | 110 | 17 |
| Multimodal Machine Translation | Multimodal machine translation is the task of doing machine … | 108 | 17 |
| Ad-hoc video search | The Ad-hoc search task ended a 3-year cycle from 2016-2018 w… | 13 | 15 |
| Image Retrieval with Multi-Modal Query | The problem of retrieving images from a database based on a … | 10 | 15 |
| Image/Document Clustering | | 8 | 14 |
| Multimodal Reasoning | Reasoning over multimodal inputs. | 302 | 13 |
| Spatio-Temporal Video Grounding | Spatio-temporal video grounding is a computer vision and nat… | 22 | 10 |
| Image Paragraph Captioning | Image paragraph captioning involves generating a detailed, m… | 17 | 10 |
| Video Grounding | Video grounding is the task of linking spoken language descr… | 114 | 9 |
| TinyQA Benchmark++ | Ultra-lightweight evaluation suite and python package design… | 1 | 8 |
| Unsupervised Image-To-Image Translation | Unsupervised image-to-image translation is the task of doing… | 124 | 7 |