SOTAVerified

Multimodal & Vision-Language

171 tasks

Papers in this area

Showing 1-10 of 10 papers

| Title | Status | Hype |
| --- | --- | --- |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | | 0 |
| Visual Place Recognition for Large-Scale UAV Applications | | 0 |
| Transformer-based Spatial Grounding: A Comprehensive Survey | | 0 |
| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | | 0 |
| Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | | 0 |
| LaViPlan : Language-Guided Visual Path Planning with RLVR | | 0 |
| Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities | | 0 |
| AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | | 0 |
| LoViC: Efficient Long Video Generation with Context Compression | | 0 |
| MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval | | 0 |
| Task | Description | Papers | Results |
| --- | --- | --- | --- |
| Zero-shot Text-to-Image Retrieval | | 15 | 0 |
| controllable image captioning | generate image captions conditioned on control signals | 14 | 0 |
| Cross-Modal Person Re-Identification | | 13 | 0 |
| Image-text Classification | | 13 | 0 |
| Video to Text Retrieval | | 13 | 0 |
| Sports Understanding | | 11 | 0 |
| Conditional Text-to-Image Synthesis | Introducing extra conditions based on the text-to-image gene… | 10 | 0 |
| Cross-modal place recognition | text-to-point-cloud place recognition | 10 | 0 |
| Text-to-Video Editing | | 9 | 0 |
| Vision-Language Segmentation | | 9 | 0 |
| Cross-View Image-to-Image Translation | | 8 | 0 |
| Text-to-Shape Generation | | 8 | 0 |
| Grounded Video Question Answering | | 7 | 0 |
| TGIF-Action | | 7 | 0 |
| TGIF-Transition | | 7 | 0 |
| Video-Guided Machine Translation | | 7 | 0 |
| Vietnamese Visual Question Answering | | 7 | 0 |
| Open-Domain Subject-to-Video | OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Datase… | 6 | 0 |
| Query focused video summarization | Model takes a long video and a query in the following forms(… | 6 | 0 |
| Factual Visual Question Answering | | 5 | 0 |
| Vietnamese Image Captioning | | 5 | 0 |
| Visual Question Answering (VQA) Split A | | 5 | 0 |
| Visual Question Answering (VQA) Split B | | 5 | 0 |
| Weakly Supervised Referring Expression Segmentation | RES with less percentage of ground truth annotations | 5 | 0 |
| Zero-shot Text-to-Video Generation | | 5 | 0 |
| Document Image Quality Assessment | Image Quality Assessment for document image | 4 | 0 |
| Person-centric Visual Grounding | Person-centric visual grounding is the problem of linking be… | 4 | 0 |
| Semantic Image-Text Similarity | | 4 | 0 |
| Text-to-video search | | 4 | 0 |
| Hindi Image Captioning | The main goal of this task is to generate a caption for an i… | 3 | 0 |
| Multilingual Text-to-Image Generation | | 3 | 0 |
| Visual Sentiment Prediction | | 3 | 0 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | | 3 | 0 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | | 3 | 0 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | | 3 | 0 |
| zero-shot long video breakpoint-mode question answering | | 3 | 0 |
| zero-shot long video global-model question answering | | 3 | 0 |
| zero-shot long video question answering | | 3 | 0 |
| Zero-Shot Visual Question Answring | | 3 | 0 |
| Aesthetic Image Captioning | | 2 | 0 |
| Cross-lingual Text-to-Image Generation | | 2 | 0 |
| Live Video Captioning | Live video captioning (LVC) involves detecting and describin… | 2 | 0 |
| Multi-lingual Text-to-Image Generation | | 2 | 0 |
| Text within image generation | | 2 | 0 |
| Visual Commonsense Tests | Predict 5 property types (color, shape, material, size, and … | 2 | 0 |
| Zero-Shot Cross-Lingual Visual Question Answering | | 2 | 0 |
| Zero-Shot Cross-Lingual Visual Reasoning | | 2 | 0 |
| zero-shot long video global-mode question answering | | 2 | 0 |
| Zeroshot Video Question Answer | | 2 | 0 |
| Crosslingual Text-to-Image Generation | | 1 | 0 |