SOTAVerified

Referring Expression Segmentation

Referring expression segmentation aims to label the pixels of an image or video that belong to the object instance referred to by a linguistic expression. The referring expression (RE) must allow an individual object in a discourse or scene (the referent) to be identified unambiguously.

Papers

Showing 1-50 of 145 papers

Title | Status | Hype
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy | Code | 1
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | | 0
Refer to Anything with Vision-Language Prompts | | 0
RemoteSAM: Towards Segment Anything for Earth Observation | Code | 3
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning | Code | 4
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation | | 0
3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation | | 0
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Code | 0
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Code | 2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement | Code | 4
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code | 1
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | | 0
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | Code | 1
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Code | 0
Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation | Code | 2
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | Code | 2
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Code | 1
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | | 0
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | | 0
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | | 0
RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation | Code | 1
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation | Code | 1
HyperSeg: Towards Universal Visual Segmentation with Large Language Model | Code | 2
Instance-Aware Generalized Referring Expression Segmentation | | 0
SegLLM: Multi-round Reasoning Segmentation | | 0
Text4Seg: Reimagining Image Segmentation as Text Generation | Code | 2
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation | Code | 2
3D-GRES: Generalized 3D Referring Expression Segmentation | Code | 1
Multi-label Cluster Discrimination for Visual Representation Learning | Code | 4
ViLLa: Video Reasoning Segmentation with Large Language Model | Code | 1
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation | | 0
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Code | 3
GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation | | 0
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation | Code | 1
GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane | | 0
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation | Code | 0
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | Code | 1
Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation | | 0
Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding | Code | 0
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | Code | 2
Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | Code | 1
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | Code | 3
UniVS: Unified and Universal Video Segmentation with Prompts as Queries | Code | 3
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | | 0
RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner | | 0
Generalizable Entity Grounding via Assistance of Large Language Model | | 0
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Overall IoU | 85.41 | | Unverified
2 | HyperSeg | Overall IoU | 84.8 | | Unverified
3 | PSALM | Overall IoU | 83.6 | | Unverified
4 | MLCD-Seg-7B | Overall IoU | 83.6 | | Unverified
5 | HIPIE | Overall IoU | 82.8 | | Unverified
6 | EVF-SAM | Overall IoU | 82.4 | | Unverified
7 | UNINEXT-H | Overall IoU | 82.19 | | Unverified
8 | UniLSeg-100 | Overall IoU | 81.74 | | Unverified
9 | DETRIS | Overall IoU | 81 | | Unverified
10 | C3VG | Overall IoU | 80.89 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Overall IoU | 86.49 | | Unverified
2 | HyperSeg | Overall IoU | 85.7 | | Unverified
3 | MLCD-Seg-7B | Overall IoU | 85.3 | | Unverified
4 | EVF-SAM | Overall IoU | 84.2 | | Unverified
5 | HyperSeg | Overall IoU | 83.5 | | Unverified
6 | C3VG | Overall IoU | 83.18 | | Unverified
7 | MLCD-Seg-7B | Overall IoU | 82.9 | | Unverified
8 | DeRIS-L | Overall IoU | 82.34 | | Unverified
9 | DETRIS | Overall IoU | 81.9 | | Unverified
10 | MaskRIS (Swin-B, combined DB) | Overall IoU | 80.64 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MPG-SAM 2 | J&F | 73.9 | | Unverified
2 | VRS-HQ (Chat-UniVi-13B) | J&F | 71 | | Unverified
3 | GLEE-Pro | J&F | 70.6 | | Unverified
4 | UNINEXT-H | J&F | 70.1 | | Unverified
5 | ReferDINO (Swin-B) | J&F | 69.3 | | Unverified
6 | MUTR | J&F | 68.4 | | Unverified
7 | VLP (VLMo-L) | J&F | 67.6 | | Unverified
8 | UniRef-L (Swin-L) | J&F | 67.4 | | Unverified
9 | HTR (Pre-training) | J&F | 67.1 | | Unverified
10 | DsHmp (Video-Swin-Base) | J&F | 67.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | DeRIS-L | Mean IoU | 78.59 | | Unverified
2 | MLCD-Seg-7B | Overall IoU | 75.6 | | Unverified
3 | HyperSeg | Overall IoU | 75.2 | | Unverified
4 | EVF-SAM | Overall IoU | 71.9 | | Unverified
5 | DETRIS | Overall IoU | 70.2 | | Unverified
6 | C3VG | Overall IoU | 68.95 | | Unverified
7 | UniLSeg-100 | Overall IoU | 68.15 | | Unverified
8 | UniLSeg-20 | Overall IoU | 66.99 | | Unverified
9 | UNINEXT-H | Overall IoU | 66.22 | | Unverified
10 | GROUNDHOG | Overall IoU | 64.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HINet | IoU overall | 0.68 | | Unverified
2 | RefVOS | IoU overall | 0.67 | | Unverified
3 | ClawCraneNet | IoU overall | 0.64 | | Unverified
4 | CMSA+CFSA | IoU overall | 0.62 | | Unverified
5 | RefVOS | IoU overall | 0.6 | | Unverified
6 | SgMg (Video-Swin-B) | AP | 0.59 | | Unverified
7 | SOC (Video-Swin-B) | AP | 0.57 | | Unverified
8 | ReferFormer (Video-Swin-B) | AP | 0.55 | | Unverified
9 | SOC (Video-Swin-T) | AP | 0.5 | | Unverified
10 | MANET | AP | 0.47 | | Unverified
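
The benchmark results above report both Overall IoU and Mean IoU. These two metrics can rank models differently, so a small sketch of how they are commonly computed may help when comparing rows. It is illustrative only and is not taken from any listed paper's evaluation code. The function names (`intersection_and_union`, `overall_iou`, `mean_iou`) are hypothetical, masks are assumed to be same-sized binary grids, and scoring an empty union as 0.0 is an assumption, since some benchmarks handle that case differently.

```python
from typing import Sequence, Tuple

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def overall_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: an empty cumulative union scores 0.0.
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of the per-sample IoU values."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: a sample with an empty union scores 0.0.
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy 2x2 samples: the first overshoots by one pixel, the second is exact.
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(overall_iou(preds, targets))
    print(mean_iou(preds, targets))
```

Because Overall IoU pools pixels across the whole dataset, large objects weigh more heavily than small ones, whereas Mean IoU weights every sample equally. Both functions return a fraction in [0, 1]; the tables above list scores either as percentages or as fractions depending on the benchmark.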