| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question AnsweringFS-MEVQA | CodeCode Available | 5 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MMEOptical Character Recognition (OCR) | CodeCode Available | 2 |
| Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models | Aug 18, 2023 | Image-text matchingObject Localization | —Unverified | 0 |
| Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes | Aug 17, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Object Goal Navigation with Recursive Implicit Maps | Aug 10, 2023 | NavigateObject | —Unverified | 0 |
| Spatial Intelligence of a Self-driving Car and Rule-Based Decision Making | Aug 2, 2023 | Autonomous DrivingDecision Making | —Unverified | 0 |
| SpaceNLI: Evaluating the Consistency of Predicting Inferences in Space | Jul 5, 2023 | Natural Language InferenceNegation | CodeCode Available | 0 |
| Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | Jun 30, 2023 | Action DetectionPose Prediction | CodeCode Available | 2 |
| A Universal Semantic-Geometric Representation for Robotic Manipulation | Jun 18, 2023 | 3D geometryRobot Manipulation | CodeCode Available | 1 |
| Controllable Text-to-Image Generation with GPT-4 | May 29, 2023 | Image GenerationInstruction Following | —Unverified | 0 |