| Towards Visual Grounding: A Survey | Dec 28, 2024 | Phrase Grounding; Referring Expression | Code Available | 3 |
| MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Dec 28, 2023 | AutoML; CPU | Code Available | 3 |
| General Object Foundation Model for Images and Videos at Scale | Dec 14, 2023 | Instance Segmentation; Long-tail Video Object Segmentation | Code Available | 3 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 |
| Universal Instance Perception as Object Discovery and Retrieval | Mar 12, 2023 | Described Object Detection; Generalized Referring Expression Comprehension | Code Available | 3 |
| TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | May 29, 2025 | Referring Expression; Referring Expression Comprehension | Code Available | 2 |
| Frontiers in Intelligent Colonoscopy | Oct 22, 2024 | Image Captioning | Code Available | 2 |
| SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Sep 26, 2024 | Descriptive; Generalized Referring Expression Comprehension | Code Available | 2 |
| Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Jun 24, 2024 | Referring Expression; Referring Expression Comprehension | Code Available | 2 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | Object; Object Tracking | Code Available | 2 |