SOTAVerified

Referring Expression Segmentation

The task aims to label the pixels of an image or video that belong to an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must unambiguously identify a single object in the discourse or scene (the referent).

Papers

Showing 1-50 of 145 papers

Title | Status | Hype
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning | Code | 4
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement | Code | 4
Multi-label Cluster Discrimination for Visual Representation Learning | Code | 4
GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4
RemoteSAM: Towards Segment Anything for Earth Observation | Code | 3
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Code | 3
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | Code | 3
UniVS: Unified and Universal Video Segmentation with Prompts as Queries | Code | 3
General Object Foundation Model for Images and Videos at Scale | Code | 3
Tracking Anything with Decoupled Video Segmentation | Code | 3
Universal Instance Perception as Object Discovery and Retrieval | Code | 3
Generalized Decoding for Pixel, Image, and Language | Code | 3
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Code | 2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | Code | 2
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | Code | 2
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | Code | 2
Text4Seg: Reimagining Image Segmentation as Text Generation | Code | 2
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation | Code | 2
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | Code | 2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | Code | 2
Universal Segmentation at Arbitrary Granularity with Language Instruction | Code | 2
NExT-Chat: An LMM for Chat, Detection and Segmentation | Code | 2
GLaMM: Pixel Grounding Large Multimodal Model | Code | 2
Hierarchical Open-vocabulary Universal Image Segmentation | Code | 2
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Code | 2
GRES: Generalized Referring Expression Segmentation | Code | 2
Unleashing Text-to-Image Diffusion Models for Visual Perception | Code | 2
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | Code | 2
Language as Queries for Referring Video Object Segmentation | Code | 2
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | Code | 1
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code | 1
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | Code | 1
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Code | 1
RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation | Code | 1
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | Code | 1
3D-GRES: Generalized 3D Referring Expression Segmentation | Code | 1
ViLLa: Video Reasoning Segmentation with Large Language Model | Code | 1
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation | Code | 1
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | Code | 1
Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | Code | 1
Mask Grounding for Referring Image Segmentation | Code | 1
GSVA: Generalized Segmentation via Multimodal Large Language Models | Code | 1
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 1
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment | Code | 1
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Overall IoU | 85.41 | | Unverified
2 | HyperSeg | Overall IoU | 84.8 | | Unverified
3 | PSALM | Overall IoU | 83.6 | | Unverified
4 | MLCD-Seg-7B | Overall IoU | 83.6 | | Unverified
5 | HIPIE | Overall IoU | 82.8 | | Unverified
6 | EVF-SAM | Overall IoU | 82.4 | | Unverified
7 | UNINEXT-H | Overall IoU | 82.19 | | Unverified
8 | UniLSeg-100 | Overall IoU | 81.74 | | Unverified
9 | DETRIS | Overall IoU | 81 | | Unverified
10 | C3VG | Overall IoU | 80.89 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Overall IoU | 86.49 | | Unverified
2 | HyperSeg | Overall IoU | 85.7 | | Unverified
3 | MLCD-Seg-7B | Overall IoU | 85.3 | | Unverified
4 | EVF-SAM | Overall IoU | 84.2 | | Unverified
5 | HyperSeg | Overall IoU | 83.5 | | Unverified
6 | C3VG | Overall IoU | 83.18 | | Unverified
7 | MLCD-Seg-7B | Overall IoU | 82.9 | | Unverified
8 | DeRIS-L | Overall IoU | 82.34 | | Unverified
9 | DETRIS | Overall IoU | 81.9 | | Unverified
10 | MaskRIS (Swin-B, combined DB) | Overall IoU | 80.64 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MPG-SAM 2 | J&F | 73.9 | | Unverified
2 | VRS-HQ (Chat-UniVi-13B) | J&F | 71 | | Unverified
3 | GLEE-Pro | J&F | 70.6 | | Unverified
4 | UNINEXT-H | J&F | 70.1 | | Unverified
5 | ReferDINO (Swin-B) | J&F | 69.3 | | Unverified
6 | MUTR | J&F | 68.4 | | Unverified
7 | VLP (VLMo-L) | J&F | 67.6 | | Unverified
8 | UniRef-L (Swin-L) | J&F | 67.4 | | Unverified
9 | HTR (Pre-training) | J&F | 67.1 | | Unverified
10 | DsHmp (Video-Swin-Base) | J&F | 67.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Mean IoU | 78.59 | | Unverified
2 | MLCD-Seg-7B | Overall IoU | 75.6 | | Unverified
3 | HyperSeg | Overall IoU | 75.2 | | Unverified
4 | EVF-SAM | Overall IoU | 71.9 | | Unverified
5 | DETRIS | Overall IoU | 70.2 | | Unverified
6 | C3VG | Overall IoU | 68.95 | | Unverified
7 | UniLSeg-100 | Overall IoU | 68.15 | | Unverified
8 | UniLSeg-20 | Overall IoU | 66.99 | | Unverified
9 | UNINEXT-H | Overall IoU | 66.22 | | Unverified
10 | GROUNDHOG | Overall IoU | 64.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HINet | IoU overall | 0.68 | | Unverified
2 | RefVOS | IoU overall | 0.67 | | Unverified
3 | ClawCraneNet | IoU overall | 0.64 | | Unverified
4 | CMSA+CFSA | IoU overall | 0.62 | | Unverified
5 | RefVOS | IoU overall | 0.6 | | Unverified
6 | SgMg (Video-Swin-B) | AP | 0.59 | | Unverified
7 | SOC (Video-Swin-B) | AP | 0.57 | | Unverified
8 | ReferFormer (Video-Swin-B) | AP | 0.55 | | Unverified
9 | SOC (Video-Swin-T) | AP | 0.5 | | Unverified
10 | MANET | AP | 0.47 | | Unverified
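
The tables above report both "Overall IoU" and "Mean IoU", which are not directly comparable. As a minimal sketch, assuming the definitions commonly used in referring segmentation work (overall IoU as cumulative intersection over cumulative union across all samples, mean IoU as the average of per-sample IoU), the two could be computed as follows. The function names and the handling of empty masks are illustrative assumptions, not taken from any listed paper or from this site.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def overall_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: report 0.0 when no sample contains any foreground pixel.
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-sample IoU values."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction for an empty target counts as 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: one large object segmented perfectly, one small object missed.
preds = [[[1, 1, 1, 1]], [[1, 0, 0, 0]]]
targets = [[[1, 1, 1, 1]], [[0, 1, 0, 0]]]
print(overall_iou(preds, targets))  # 4 / 6
print(mean_iou(preds, targets))     # (1.0 + 0.0) / 2
```

In this toy case the overall IoU (about 0.67) is higher than the mean IoU (0.5) because the large, correctly segmented object dominates the pixel counts. This is why a "Mean IoU" entry and an "Overall IoU" entry in the same table should be compared with care.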