SOTAVerified

Multimodal Reasoning

Reasoning over multimodal inputs.

Papers

Showing 251-300 of 302 papers

| Title | Status | Hype |
|---|---|---|
| Knowledge-Aware Reasoning over Multimodal Semi-structured Tables | | 0 |
| Towards Holistic Disease Risk Prediction using Small Language Models | | 0 |
| User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | | 0 |
| Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Code | 0 |
| On scalable oversight with weak LLMs judging strong LLMs | | 0 |
| Improving Multi-Agent Debate with Sparse Communication Topology | | 0 |
| POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models | | 0 |
| Multimodal Reasoning with Multimodal Knowledge Graph | | 0 |
| Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | Code | 0 |
| Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning | | 0 |
| M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models | Code | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | | 0 |
| AccidentBlip: Agent of Accident Warning based on MA-former | | 0 |
| Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V | | 0 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Code | 0 |
| Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval | | 0 |
| VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Code | 0 |
| Measuring Vision-Language STEM Skills of Neural Models | Code | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | | 0 |
| Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics | | 0 |
| BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models | | 0 |
| Question Aware Vision Transformer for Multimodal Reasoning | | 0 |
| On the generalization capacity of neural networks during generic multimodal reasoning | Code | 0 |
| Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | | 0 |
| Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | | 0 |
| Towards a Unified Multimodal Reasoning Framework | Code | 0 |
| Assessing GPT4-V on Structured Reasoning Tasks | | 0 |
| Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Code | 0 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | 0 |
| Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines | | 0 |
| AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry | | 0 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Code | 0 |
| Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Code | 0 |
| Deep Neural Networks for Visual Reasoning | | 0 |
| Reducing the Vision and Language Bias for Temporal Sentence Grounding | | 0 |
| DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation | | 0 |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Code | 0 |
| Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Code | 0 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | | 0 |
| Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | | 0 |
| TxT: Crossmodal End-to-End Learning with Transformers | | 0 |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | | 0 |
| Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues | | 0 |
| Visual Goal-Step Inference using wikiHow | Code | 0 |
| UniT: Multimodal Multitask Learning with a Unified Transformer | Code | 0 |
| DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents | | 0 |
| Sound2Sight: Generating Visual Dynamics from Sound and Context | | 0 |
| DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Code | 0 |
| Multimodal Transformer with Multi-View Visual Representation for Image Captioning | | 0 |

Benchmark Results

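| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4V | Accuracy | 24 | | Unverified |
| 2 | Gemini Pro | Accuracy | 13.2 | | Unverified |
| 3 | LLaVa-1.5-13B | Accuracy | 1.8 | | Unverified |
| 4 | LLaVa-1.5-7B | Accuracy | 1.5 | | Unverified |
| 5 | BLIP2-FLAN-T5-XXL | Accuracy | 0.9 | | Unverified |
| 6 | QWEN | Accuracy | 0.9 | | Unverified |
| 7 | CogVLM | Accuracy | 0.9 | | Unverified |
| 8 | InstructBLIP | Accuracy | 0.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT4V | Accuracy | 22.76 | | Unverified |
| 2 | Gemini Pro | Accuracy | 17.66 | | Unverified |
| 3 | Qwen-VL-Max | Accuracy | 15.59 | | Unverified |
| 4 | InternLM-XComposer2-VL | Accuracy | 14.54 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 | Acc | 30.3 | | Unverified |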