SOTAVerified

Multimodal & Vision-Language

171 tasks

Papers in this area

Showing 1–10 of 10 papers

| Title | Status | Hype |
| --- | --- | --- |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | | 0 |
| Visual Place Recognition for Large-Scale UAV Applications | | 0 |
| Transformer-based Spatial Grounding: A Comprehensive Survey | | 0 |
| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | | 0 |
| Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | | 0 |
| LaViPlan : Language-Guided Visual Path Planning with RLVR | | 0 |
| Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities | | 0 |
| AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | | 0 |
| LoViC: Efficient Long Video Generation with Context Compression | | 0 |
| MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval | | 0 |

| Task | Description | Papers | Results |
| --- | --- | --- | --- |
| Sketch-to-Image Translation | | 15 | 7 |
| Motion Captioning | Generating textual descriptions of human motion. | 11 | 7 |
| Text to 3D | Task involves generating 3D objects based on the text prompt… | 314 | 6 |
| Multi-modal Classification | | 31 | 6 |
| Audio-visual Question Answering | | 27 | 6 |
| 3D Object Captioning | 3D object captioning involves generating a natural language … | 7 | 6 |
| Long Video Retrieval (Background Removed) | Retrieve the long videos given all subtitles. | 6 | 6 |
| VCGBench-Diverse | Recognizing the limited diversity in existing video conversa… | 5 | 6 |
| Visual Speech Recognition | | 182 | 5 |
| Generalized Referring Expression Comprehension | Generalized Referring Expression Comprehension (GREC) allows… | 7 | 5 |
| Explanatory Visual Question Answering | Explanatory Visual Question Answering (EVQA) requires answer… | 5 | 5 |
| Text to Video Retrieval | She's gone I can't find her anywhere I'm looking everywhere … | 75 | 4 |
| Dense Captioning | | 69 | 4 |
| Multimodal Text and Image Classification | Classification with both source image and text. | 7 | 4 |
| Lip Reading | Lip Reading is a task to infer the speech content in a video… | 153 | 3 |
| Generative Visual Question Answering | Generating answers in free form to questions posed about ima… | 9 | 3 |
| Supervised Image Retrieval | | 4 | 3 |
| Semi Supervised Learning for Image Captioning | | 2 | 3 |
| audio-visual event localization | | 26 | 2 |
| Composed Image Retrieval (CoIR) | Composed Image Retrieval (CoIR) is a task that involves retriev… | 14 | 2 |
| Text-to-3D-Human Generation | 3D avatar generation from text prompts. | 3 | 2 |
| Document Image Skew Estimation | | 1 | 2 |
| Referring Expression | Referring expressions places a bounding box around the insta… | 364 | 1 |
| Multimodal Deep Learning | Multimodal deep learning is a type of deep learning that com… | 213 | 1 |
| Content-Based Image Retrieval | Content-Based Image Retrieval is a well studied problem in c… | 195 | 1 |
| Scene-Aware Dialogue | | 8 | 1 |
| Grounded Multimodal Named Entity Recognition | | 3 | 1 |
| X-ray Visual Question Answering | | 2 | 1 |
| LMM real-life tasks | | 1 | 1 |
| Multimodal Text Prediction | Multimodal text prediction is a type of natural language pro… | 1 | 1 |
| Segmented Multimodal Named Entity Recognition | | 1 | 1 |
| World Knowledge | | 818 | 0 |
| cross-modal alignment | | 342 | 0 |
| Image-text Retrieval | | 248 | 0 |
| Image-text matching | Image-Text Matching is a subtask within Cross-Modal Retrieva… | 188 | 0 |
| Referring Expression Comprehension | | 167 | 0 |
| Vision-Language-Action | | 157 | 0 |
| Video-Text Retrieval | Video-Text retrieval requires understanding of both video an… | 111 | 0 |
| Zero-Shot Video Question Answer | This task presents the results of Zeroshot Question Answer re… | 85 | 0 |
| 3D visual grounding | | 82 | 0 |
| Visual Commonsense Reasoning | Image source: [Visual Commonsense Reasoning](https://papersw… | 65 | 0 |
| Visual Entailment | Visual Entailment (VE) is a task consisting of image-sente… | 56 | 0 |
| Physical Intuition | | 35 | 0 |
| Cross-Modality Person Re-identification | | 26 | 0 |
| 3D Question Answering (3D-QA) | A 3D-QA task requires models to answer a question when given… | 22 | 0 |
| MM-Vet | | 19 | 0 |
| Multimodal Unsupervised Image-To-Image Translation | Multimodal unsupervised image-to-image translation is the ta… | 17 | 0 |
| Zero-Shot Text-to-Image Generation | Image credit: [GLIDE: Towards Photorealistic Image Generatio… | 16 | 0 |
| Generalized Referring Expression Segmentation | Generalized Referring Expression Segmentation (GRES), introd… | 15 | 0 |
| TGIF-Frame | | 15 | 0 |