| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | Jan 17, 2024 | Natural Language Visual Grounding | CodeCode Available | 3 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | CodeCode Available | 5 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image ClassificationImage Description | CodeCode Available | 7 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question AnsweringFS-MEVQA | CodeCode Available | 5 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment RetrievalNatural Language Visual Grounding | CodeCode Available | 1 |
| Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences | Jan 20, 2023 | Coherence EvaluationGrounded language learning | —Unverified | 0 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption GenerationImage Captioning | CodeCode Available | 1 |
| TubeDETR: Spatio-Temporal Video Grounding with Transformers | Mar 30, 2022 | DecoderLanguage-Based Temporal Localization | CodeCode Available | 1 |
| CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | Dec 6, 2021 | Continuous ControlImitation Learning | CodeCode Available | 1 |
| Panoptic Narrative Grounding | Sep 10, 2021 | Natural Language Visual GroundingPanoptic Segmentation | CodeCode Available | 1 |