Multimodal & Vision-Language

Papers in this area

Title	Date	Tasks	Status
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent	Jul 21, 2025	Multimodal Reasoning	Unverified
Visual Place Recognition for Large-Scale UAV Applications	Jul 20, 2025	Benchmarking, Diversity	Unverified
Transformer-based Spatial Grounding: A Comprehensive Survey	Jul 17, 2025	cross-modal alignment, Survey	Unverified
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Jul 17, 2025	Video Grounding, Video Understanding	Unverified
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark	Jul 17, 2025	Multimodal Reasoning, Pose Estimation	Unverified
LaViPlan : Language-Guided Visual Path Planning with RLVR	Jul 17, 2025	Autonomous Driving, Vision-Language-Action	Unverified
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities	Jul 17, 2025	Large Language Model, Vision and Language Navigation	Unverified
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation	Jul 17, 2025	Vision-Language-Action	Unverified
LoViC: Efficient Long Video Generation with Context Compression	Jul 17, 2025	Text-to-Video Generation, Video Generation	Unverified
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval	Jul 17, 2025	Image Retrieval, Re-Ranking	Unverified

Task	Description	Papers	Results
Sketch-to-Image Translation		15	7
Motion Captioning	Generating textual descriptions of human motion.	11	7
Text to 3D	Task involves generating 3D objects based on the text prompt…	314	6
Multi-modal Classification		31	6
Audio-visual Question Answering		27	6
3D Object Captioning	3D object captioning involves generating a natural language …	7	6
Long Video Retrieval (Background Removed)	Retrieve the long videos given all subtitles.	6	6
VCGBench-Diverse	Recognizing the limited diversity in existing video conversa…	5	6
Visual Speech Recognition		182	5
Generalized Referring Expression Comprehension	Generalized Referring Expression Comprehension (GREC) allows…	7	5
Explanatory Visual Question Answering	Explanatory Visual Question Answering (EVQA) requires answer…	5	5
Text to Video Retrieval	She's gone I can't find her anywhere I'm looking everywhere …	75	4
Dense Captioning		69	4
Multimodal Text and Image Classification	Classification with both source image and text.	7	4
Lip Reading	Lip Reading is a task to infer the speech content in a video…	153	3
Generative Visual Question Answering	Generating answers in free form to questions posed about ima…	9	3
Supervised Image Retrieval		4	3
Semi Supervised Learning for Image Captioning		2	3
audio-visual event localization		26	2
Composed Image Retrieval (CoIR)	Composed Image Retrieval (CoIR) is a task that involves retriev…	14	2
Text-to-3D-Human Generation	Generation of 3D avatars from text prompts.	3	2
Document Image Skew Estimation		1	2
Referring Expression	Referring expressions place a bounding box around the insta…	364	1
Multimodal Deep Learning	Multimodal deep learning is a type of deep learning that com…	213	1
Content-Based Image Retrieval	Content-Based Image Retrieval is a well studied problem in c…	195	1
Scene-Aware Dialogue		8	1
Grounded Multimodal Named Entity Recognition		3	1
X-ray Visual Question Answering		2	1
LMM real-life tasks		1	1
Multimodal Text Prediction	Multimodal text prediction is a type of natural language pro…	1	1
Segmented Multimodal Named Entity Recognition		1	1
World Knowledge		818	0
cross-modal alignment		342	0
Image-text Retrieval		248	0
Image-text matching	Image-Text Matching is a subtask within Cross-Modal Retrieva…	188	0
Referring Expression Comprehension		167	0
Vision-Language-Action		157	0
Video-Text Retrieval	Video-Text retrieval requires understanding of both video an…	111	0
Zero-Shot Video Question Answer	This task presents the results of zero-shot question answer re…	85	0
3D visual grounding		82	0
Visual Commonsense Reasoning	Image source: [Visual Commonsense Reasoning](https://papersw…	65	0
Visual Entailment	Visual Entailment (VE) is a task consisting of image-sente…	56	0
Physical Intuition		35	0
Cross-Modality Person Re-identification		26	0
3D Question Answering (3D-QA)	A 3D-QA task requires models to answer a question when given…	22	0
MM-Vet		19	0
Multimodal Unsupervised Image-To-Image Translation	Multimodal unsupervised image-to-image translation is the ta…	17	0
Zero-Shot Text-to-Image Generation	Image credit: [GLIDE: Towards Photorealistic Image Generatio…	16	0
Generalized Referring Expression Segmentation	Generalized Referring Expression Segmentation (GRES), introd…	15	0
TGIF-Frame		15	0