| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Apr 7, 2021 | Representation Learning, Retrieval | Code Available | 1 | 5 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | May 12, 2023 | Visual Reasoning | Code Available | 1 | 5 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 | 5 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 | 5 |
| Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | May 30, 2025 | Spatial Reasoning, Visual Reasoning | Code Available | 1 | 5 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA) | Code Available | 1 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 | 5 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 | 5 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 | 5 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 | 5 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Jan 24, 2024 | Visual Reasoning | Code Available | 1 | 5 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 | 5 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 | 5 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Apr 7, 2022 | Visual Reasoning | Code Available | 1 | 5 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 | 5 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Nov 27, 2023 | Language Modelling, Large Language Model | Code Available | 1 | 5 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 | 5 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR), Image Retrieval | Code Available | 1 | 5 |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | Jan 12, 2023 | Few-Shot Learning, Image Captioning | Code Available | 1 | 5 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Mar 21, 2023 | Visual Reasoning | Code Available | 1 | 5 |
| Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos | Jun 27, 2024 | Temporal Information Extraction, Visual Reasoning | Code Available | 1 | 5 |
| Complete 3D Scene Parsing from an RGBD Image | Oct 25, 2017 | Diversity, Retrieval | Code Available | 0 | 5 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 | 5 |
| Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Mar 1, 2024 | Disentanglement, Informativeness | Code Available | 0 | 5 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | May 6, 2022 | Diagnostic, Question Answering | Code Available | 0 | 5 |
| Raven's Progressive Matrices Completion with Latent Gaussian Process Priors | Mar 22, 2021 | Answer Selection, Gaussian Processes | Code Available | 0 | 5 |
| Prompting Large Vision-Language Models for Compositional Reasoning | Jan 20, 2024 | Retrieval, Visual Reasoning | Code Available | 0 | 5 |
| Program synthesis performance constrained by non-linear spatial relations in Synthetic Visual Reasoning Test | Nov 18, 2019 | Few-Shot Learning, Program Synthesis | Code Available | 0 | 5 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Dec 11, 2024 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| Grounded Reinforcement Learning for Visual Reasoning | May 29, 2025 | reinforcement-learning, Reinforcement Learning | Code Available | 0 | 5 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Oct 1, 2022 | Medical Visual Question Answering, Question Answering | Code Available | 0 | 5 |
| Physical Reasoning Using Dynamics-Aware Models | Feb 20, 2021 | Visual Reasoning | Code Available | 0 | 5 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | Code Available | 0 | 5 |
| GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Dec 7, 2023 | Graph Generation, Language Modelling | Code Available | 0 | 5 |
| A Survey on Multimodal Large Language Models | Jun 23, 2023 | Hallucination, In-Context Learning | Code Available | 0 | 5 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | Sep 14, 2022 | Decoder, Few-Shot Image Classification | Code Available | 0 | 5 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Jul 9, 2025 | Benchmarking, Image Retrieval | Code Available | 0 | 5 |
| Predicting Complete 3D Models of Indoor Scenes | Apr 9, 2015 | Diversity, Visual Reasoning | Code Available | 0 | 5 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Sep 10, 2023 | Coreference Resolution, Image Retrieval | Code Available | 0 | 5 |
| CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions | Jan 3, 2019 | Diagnostic, Image Segmentation | Code Available | 0 | 5 |
| Odd-One-Out Representation Learning | Dec 14, 2020 | Disentanglement, Metric Learning | Code Available | 0 | 5 |
| CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes | Sep 19, 2020 | Graph Neural Network, Visual Reasoning | Code Available | 0 | 5 |
| CLEVRER: CoLlision Events for Video REpresentation and Reasoning | Oct 3, 2019 | counterfactual, Descriptive | Code Available | 0 | 5 |
| Object Level Visual Reasoning in Videos | Jun 16, 2018 | Activity Recognition, Human Activity Recognition | Code Available | 0 | 5 |
| Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems | Jul 15, 2023 | Answer Generation, Answer Selection | Code Available | 0 | 5 |
| Attention over learned object embeddings enables complex visual reasoning | Dec 15, 2020 | Object, Video Object Tracking | Code Available | 0 | 5 |