Multimodal & Vision-Language

Papers in this area

Showing 1–10 of 10 papers

Title	Date	Tasks	Status
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent	Jul 21, 2025	Multimodal Reasoning	Unverified
Visual Place Recognition for Large-Scale UAV Applications	Jul 20, 2025	Benchmarking, Diversity	Unverified
Transformer-based Spatial Grounding: A Comprehensive Survey	Jul 17, 2025	cross-modal alignment, Survey	Unverified
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Jul 17, 2025	Video Grounding, Video Understanding	Unverified
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark	Jul 17, 2025	Multimodal Reasoning, Pose Estimation	Unverified
LaViPlan: Language-Guided Visual Path Planning with RLVR	Jul 17, 2025	Autonomous Driving, Vision-Language-Action	Unverified
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities	Jul 17, 2025	Large Language Model, Vision and Language Navigation	Unverified
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation	Jul 17, 2025	Vision-Language-Action	Unverified
LoViC: Efficient Long Video Generation with Context Compression	Jul 17, 2025	Text-to-Video Generation, Video Generation	Unverified
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval	Jul 17, 2025	Image Retrieval, Re-Ranking	Unverified

Task	Papers	Results
Visual Question Answering (VQA) Visual Question Answering (VQA) is a task in computer vision…	2,167	727
Image Captioning Image Captioning is the task of describing the content of an…	1,878	422
Image Retrieval Image Retrieval is a fundamental and long-standing computer …	2,239	372
Visual Question Answering MLLM Leaderboard	2,177	334
Referring Expression Segmentation The task aims at labeling the pixels of an image or video th…	145	317
Video Retrieval The objective of video retrieval is as follows: given a text…	486	309
Visual Place Recognition Visual Place Recognition is the task of matching a view of a…	297	265
Video Question Answering	460	254
Image-to-Image Translation Image-to-Image Translation is a task in computer vision and …	1,184	196
Visual Reasoning Ability to understand actions and reasoning associated with …	698	192
Vision and Language Navigation	223	169
Text-to-Image Generation	1,085	160
Zero-Shot Video Retrieval Zero-shot video retrieval is the task of retrieving relevant…	40	126
Cross-Modal Retrieval Cross-Modal Retrieval (CMR) is a task of retrieving items ac…	522	111
Visual Dialog Visual Dialog requires an AI agent to hold a meaningful dial…	118	106
Sign Language Recognition Sign Language Recognition is a computer vision and natural l…	297	101
Lipreading Lipreading is a process of extracting speech by watching lip…	103	94
Video Captioning Video Captioning is the task of automatically captioning a video b…	473	86
Multi-modal Entity Alignment	19	62
Moment Retrieval Moment retrieval can be defined as the task of "localizing m…	132	57
Document Image Classification Document image classification is the task of classifying doc…	50	49
Text based Person Retrieval	49	42
Visual Prompt Tuning Visual Prompt Tuning (VPT) only introduces a small amount of …	70	40
Molecule Captioning Molecular description generation entails the creation of a d…	25	39
Chart Question Answering Question answering task on chart images	50	38
Text-to-Video Generation	201	36
Referring expression generation Generate referring expressions	84	34
Visual Storytelling	115	33
Image-to-Text Retrieval Image-text retrieval is the process of retrieving relevant i…	59	33
Phrase Grounding Given an image and a corresponding caption, the Phrase Groun…	88	30
Audio-Visual Speech Recognition Audio-visual speech recognition is the task of transcribing …	100	24
Dense Video Captioning Most natural videos contain numerous events. For example, in…	76	24
Sign Language Translation Given a video containing sign language, the task is to predi…	153	23
Visual Grounding Visual Grounding (VG) aims to locate the most relevant objec…	571	20
Multimodal Emotion Recognition This is a leaderboard for multimodal emotion recognition on …	180	20
Video-Adverb Retrieval The bidirectional video-adverb retrieval task aims at retrie…	4	20
Video Summarization Video Summarization aims to generate a short synopsis that s…	280	18
Natural Language Visual Grounding	32	18
Text-based Person Retrieval with Noisy Correspondence This is a benchmark about text-based person retrieval with n…	6	18
Sketch-Based Image Retrieval	110	17
Multimodal Machine Translation Multimodal machine translation is the task of doing machine …	108	17
Ad-hoc video search The Ad-hoc search task ended a 3-year cycle from 2016 to 2018 w…	13	15
Image Retrieval with Multi-Modal Query The problem of retrieving images from a database based on a …	10	15
Image/Document Clustering	8	14
Multimodal Reasoning Reasoning over multimodal inputs.	302	13
Spatio-Temporal Video Grounding Spatio-temporal video grounding is a computer vision and nat…	22	10
Image Paragraph Captioning Image paragraph captioning involves generating a detailed, m…	17	10
Video Grounding Video grounding is the task of linking spoken language descr…	114	9
TinyQA Benchmark++ Ultra-lightweight evaluation suite and python package design…	1	8
Unsupervised Image-To-Image Translation Unsupervised image-to-image translation is the task of doing…	124	7