| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot Learning; Generative Visual Question Answering | Code Available | 4 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning; Image Generation | Code Available | 3 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | Oct 13, 2023 | Chart Question Answering; Cross-Modal Retrieval | Code Available | 1 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering; document understanding | Code Available | 1 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment; Domain Generalization | Unverified | 0 |