Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering FS-MEVQA | Code Available | 55
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning | May 17, 2025 | 2D Object Detection Object Counting | Code Available | 45
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement | Mar 9, 2025 | Domain Generalization Object Detection | Code Available | 45
GLIPv2: Unifying Localization and Vision-Language Understanding | Jun 12, 2022 | 2D Object Detection Contrastive Learning | Code Available | 45
Multi-label Cluster Discrimination for Visual Representation Learning | Jul 24, 2024 | Contrastive Learning Image-text Retrieval | Code Available | 45
General Object Foundation Model for Images and Videos at Scale | Dec 14, 2023 | Instance Segmentation Long-tail Video Object Segmentation | Code Available | 35
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | Mar 21, 2024 | Decoder Generalized Referring Expression Segmentation | Code Available | 35
Generalized Decoding for Pixel, Image, and Language | Dec 21, 2022 | Decoder Image Segmentation | Code Available | 35
Universal Instance Perception as Object Discovery and Retrieval | Mar 12, 2023 | Described Object Detection Generalized Referring Expression Comprehension | Code Available | 35
Tracking Anything with Decoupled Video Segmentation | Sep 7, 2023 | Open-Vocabulary Video Segmentation Open-World Video Segmentation | Code Available | 35
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Jun 28, 2024 | Interactive Segmentation Language Modeling | Code Available | 35
RemoteSAM: Towards Segment Anything for Earth Observation | May 23, 2025 | Attribute Earth Observation | Code Available | 35
UniVS: Unified and Universal Video Segmentation with Prompts as Queries | Feb 28, 2024 | Decoder Referring Expression Segmentation | Code Available | 35
GLaMM: Pixel Grounding Large Multimodal Model | Nov 6, 2023 | Conversational Question Answering Image Captioning | Code Available | 25
GRES: Generalized Referring Expression Segmentation | Jun 1, 2023 | Generalized Referring Expression Segmentation Referring Expression | Code Available | 25
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Jan 1, 2024 | Descriptive Object | Code Available | 25
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | Apr 4, 2024 | Contrastive Learning Referring Expression | Code Available | 25
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | Nov 26, 2024 | Language Modeling Large Language Model | Code Available | 25
Unleashing Text-to-Image Diffusion Models for Visual Perception | Mar 3, 2023 | Denoising Depth Estimation | Code Available | 25
Hierarchical Open-vocabulary Universal Image Segmentation | Jul 3, 2023 | Image Comprehension Image Segmentation | Code Available | 25
Universal Segmentation at Arbitrary Granularity with Language Instruction | Dec 4, 2023 | Referring Expression Segmentation Segmentation | Code Available | 25
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | Oct 28, 2022 | Referring Expression Segmentation Referring Video Object Segmentation | Code Available | 25
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | Dec 25, 2023 | Image Segmentation Object | Code Available | 25
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Mar 13, 2025 | Diversity Language Modeling | Code Available | 25
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | Jan 15, 2025 | Image Segmentation Referring Expression Segmentation | Code Available | 25
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Jun 27, 2023 | Image Captioning Referring Expression Segmentation | Code Available | 25
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making Interactive Segmentation | Code Available | 25
Text4Seg: Reimagining Image Segmentation as Text Generation | Oct 13, 2024 | Image Segmentation Referring Expression | Code Available | 25
F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General Knowledge Instruction Following | Code Available | 25
Language as Queries for Referring Video Object Segmentation | Jan 3, 2022 | Object Object Tracking | Code Available | 25
NExT-Chat: An LMM for Chat, Detection and Segmentation | Nov 8, 2023 | Referring Expression Referring Expression Segmentation | Code Available | 25
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation | Sep 1, 2024 | Language Modeling Language Modelling | Code Available | 25
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | Jan 15, 2025 | Reasoning Segmentation Referring Expression Segmentation | Code Available | 25
Multi-Attention Network for Compressed Video Referring Object Segmentation | Jul 26, 2022 | Object Referring Expression Segmentation | Code Available | 15
End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias Instance Segmentation | Code Available | 15
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Aug 31, 2023 | Navigate Referring Expression | Code Available | 15
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | Jan 23, 2025 | Referring Expression Segmentation Referring Video Object Segmentation | Code Available | 15
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Nov 16, 2021 | Cross-Modal Retrieval Image Captioning | Code Available | 15
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension Phrase Grounding | Code Available | 15
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Feb 14, 2023 | Decoder Image Segmentation | Code Available | 15
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | Jul 2, 2025 | Data Augmentation Generalized Referring Expression Segmentation | Code Available | 15
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | Nov 28, 2024 | Data Augmentation Image Segmentation | Code Available | 15
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation | Apr 6, 2022 | Optical Flow Estimation Referring Expression Segmentation | Code Available | 15
Image Segmentation Using Text and Image Prompts | Dec 18, 2021 | Decoder Image Segmentation | Code Available | 15
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus | Jul 4, 2022 | Referring Expression Segmentation Referring Video Object Segmentation | Code Available | 15
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | May 24, 2024 | Generalized Referring Expression Segmentation Object | Code Available | 15
GSVA: Generalized Segmentation via Multimodal Large Language Models | Dec 15, 2023 | Decoder Generalized Referring Expression Segmentation | Code Available | 15
Advancing Referring Expression Segmentation Beyond Single Image | May 21, 2023 | Co-Salient Object Detection Object | Code Available | 15
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation | Jul 21, 2023 | Decoder Image Segmentation | Code Available | 15
Local-Global Context Aware Transformer for Language-Guided Video Segmentation | Mar 18, 2022 | Referring Expression Segmentation Referring Video Object Segmentation | Code Available | 15