| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | Dec 21, 2023 | Image Captioning; Image Generation | Code Available | 2 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous Driving; Decision Making | Code Available | 2 |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Dec 21, 2023 | Visual Question Answering; World Knowledge | Code Available | 2 |
| Object Attribute Matters in Visual Question Answering | Dec 20, 2023 | Attribute; Graph Neural Network | Code Available | 0 |
| Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering | Dec 20, 2023 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| Interactive Visual Task Learning for Robots | Dec 20, 2023 | Continual Learning; Novel Concepts | Unverified | 0 |
| Generative Multimodal Models are In-Context Learners | Dec 20, 2023 | In-Context Learning; Personalized Image Generation | Code Available | 3 |
| Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering | Dec 20, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering | Dec 19, 2023 | Image Retrieval; Question Answering | Code Available | 0 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object; Object Counting | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |
| OsmLocator: locating overlapping scatter marks with a non-training generative perspective | Dec 18, 2023 | Clustering; Combinatorial Optimization | Code Available | 0 |
| CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update | Dec 18, 2023 | Continual Learning; Question Answering | Unverified | 0 |
| An Evaluation of GPT-4V and Gemini in Online VQA | Dec 17, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Silkie: Preference Distillation for Large Visual Language Models | Dec 17, 2023 | Hallucination; MME | Unverified | 0 |
| p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models | Dec 17, 2023 | Image Captioning; Question Answering | Code Available | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | Dec 15, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | Dec 15, 2023 | Contrastive Learning; Earth Observation | Code Available | 3 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image Captioning; In-Context Learning | Code Available | 1 |
| WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data | Dec 15, 2023 | document understanding; Question Answering | Code Available | 1 |
| Privacy-Aware Document Visual Question Answering | Dec 15, 2023 | document understanding; Federated Learning | Code Available | 1 |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | Dec 14, 2023 | Image Captioning; Image Generation | Code Available | 1 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering | Dec 13, 2023 | Medical Visual Question Answering; Question Answering | Unverified | 0 |