Referring Expression Segmentation

The task aims at labeling the pixels of an image or video that represent an object instance referred by a linguistic expression. In particular, the referring expression (RE) must allow the identification of an individual object in a discourse or scene (the referent). REs unambiguously identify the target instance.

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 51–100 of 145 papers

Title	Date	Tasks	Status	Hype
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces	Dec 25, 2023	Image SegmentationObject	CodeCode Available	2
Mask Grounding for Referring Image Segmentation	Dec 19, 2023	cross-modal alignmentImage Segmentation	CodeCode Available	1
GSVA: Generalized Segmentation via Multimodal Large Language Models	Dec 15, 2023	DecoderGeneralized Referring Expression Segmentation	CodeCode Available	1
General Object Foundation Model for Images and Videos at Scale	Dec 14, 2023	Instance SegmentationLong-tail Video Object Segmentation	CodeCode Available	3
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment	Dec 13, 2023	DecoderDepth Estimation	CodeCode Available	1
Unveiling Parts Beyond Objects:Towards Finer-Granularity Referring Expression Segmentation	Dec 13, 2023	DescriptiveObject	CodeCode Available	1
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects	Dec 8, 2023	Image Captioningobject-detection	—Unverified	0
Universal Segmentation at Arbitrary Granularity with Language Instruction	Dec 4, 2023	Referring Expression SegmentationSegmentation	CodeCode Available	2
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation	Nov 30, 2023	Image CaptioningReferring Expression	CodeCode Available	0
NExT-Chat: An LMM for Chat, Detection and Segmentation	Nov 8, 2023	Referring ExpressionReferring Expression Segmentation	CodeCode Available	2
GLaMM: Pixel Grounding Large Multimodal Model	Nov 6, 2023	Conversational Question AnsweringImage Captioning	CodeCode Available	2
Towards Omni-supervised Referring Expression Segmentation	Nov 1, 2023	Referring ExpressionReferring Expression Segmentation	CodeCode Available	0
CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation	Sep 17, 2023	DecoderReferring Expression	—Unverified	0
Tracking Anything with Decoupled Video Segmentation	Sep 7, 2023	Open-Vocabulary Video SegmentationOpen-World Video Segmentation	CodeCode Available	3
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation	Aug 31, 2023	NavigateReferring Expression	CodeCode Available	1
Referring Image Segmentation Using Text Supervision	Aug 28, 2023	Image SegmentationObject Localization	CodeCode Available	1
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	Aug 24, 2023	Chart Question AnsweringFS-MEVQA	CodeCode Available	5
EAVL: Explicitly Align Vision and Language for Referring Image Segmentation	Aug 18, 2023	Image SegmentationReferring Expression Segmentation	—Unverified	0
Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation	Aug 8, 2023	Contrastive LearningObject	CodeCode Available	0
Spectrum-guided Multi-granularity Referring Video Object Segmentation	Jul 25, 2023	ObjectReferring Expression Segmentation	CodeCode Available	1
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation	Jul 21, 2023	DecoderImage Segmentation	CodeCode Available	1
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation	Jul 18, 2023	Referring Expression SegmentationReferring Video Object Segmentation	CodeCode Available	1
Hierarchical Open-vocabulary Universal Image Segmentation	Jul 3, 2023	Image ComprehensionImage Segmentation	CodeCode Available	2
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic	Jun 27, 2023	Image CaptioningReferring Expression Segmentation	CodeCode Available	2
WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation	Jun 19, 2023	cross-modal alignmentImage Segmentation	—Unverified	0
Extending CLIP's Image-Text Alignment to Referring Image Segmentation	Jun 14, 2023	Image SegmentationReferring Expression Segmentation	—Unverified	0
LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation	Jun 14, 2023	Referring Expression SegmentationReferring Video Object Segmentation	CodeCode Available	1
GRES: Generalized Referring Expression Segmentation	Jun 1, 2023	Generalized Referring Expression SegmentationReferring Expression	CodeCode Available	2
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation	May 26, 2023	cross-modal alignmentObject	CodeCode Available	1
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation	May 25, 2023	ObjectReferring Expression Segmentation	CodeCode Available	1
Advancing Referring Expression Segmentation Beyond Single Image	May 21, 2023	Co-Salient Object DetectionObject	CodeCode Available	1
Meta Compositional Referring Expression Segmentation	Apr 10, 2023	Meta-LearningReferring Expression	—Unverified	0
Zero-shot Referring Image Segmentation with Global-Local Context Features	Mar 31, 2023	Image SegmentationReferring Expression	CodeCode Available	1
Universal Instance Perception as Object Discovery and Retrieval	Mar 12, 2023	Described Object DetectionGeneralized Referring Expression Comprehension	CodeCode Available	3
Unleashing Text-to-Image Diffusion Models for Visual Perception	Mar 3, 2023	DenoisingDepth Estimation	CodeCode Available	2
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation	Feb 14, 2023	DecoderImage Segmentation	CodeCode Available	1
Segment Every Reference Object in Spatial and Temporal Spaces	Jan 1, 2023	Image SegmentationObject	—Unverified	0
Learning To Segment Every Referring Object Point by Point	Jan 1, 2023	ObjectReferring Expression	CodeCode Available	0
Generalized Decoding for Pixel, Image, and Language	Dec 21, 2022	DecoderImage Segmentation	CodeCode Available	3
Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning	Dec 17, 2022	PositionReferring Expression	—Unverified	0
A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation	Nov 15, 2022	Reference Expression GenerationReferring Expression	—Unverified	0
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation	Oct 28, 2022	Referring Expression SegmentationReferring Video Object Segmentation	CodeCode Available	2
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos	Sep 21, 2022	Action DetectionAction Recognition	CodeCode Available	0
Multi-Attention Network for Compressed Video Referring Object Segmentation	Jul 26, 2022	ObjectReferring Expression Segmentation	CodeCode Available	1
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus	Jul 4, 2022	Referring Expression SegmentationReferring Video Object Segmentation	CodeCode Available	1
GLIPv2: Unifying Localization and Vision-Language Understanding	Jun 12, 2022	2D Object DetectionContrastive Learning	CodeCode Available	4
Weakly-supervised segmentation of referring expressions	May 10, 2022	Image SegmentationReferring Expression	—Unverified	0
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation	Apr 6, 2022	Optical Flow EstimationReferring Expression Segmentation	CodeCode Available	1
ReSTR: Convolution-free Referring Image Segmentation Using Transformers	Mar 31, 2022	Image SegmentationReferring Expression Segmentation	—Unverified	0
SeqTR: A Simple yet Universal Network for Visual Grounding	Mar 30, 2022	DecoderReferring Expression	CodeCode Available	1

Show:10 25 50

← PrevPage 2 of 3Next →

All datasets RefCoCo val RefCOCO testA Refer-YouTube-VOS (2021 public validation)RefCOCO+ test B A2D Sentences RefCOCOg-val J-HMDB DAVIS 2017 (val)RefCOCOg-test RefCOCO testB PhraseCut RefCOCO

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	DeRIS-L	Overall IoU	85.41	—	Unverified
2	HyperSeg	Overall IoU	84.8	—	Unverified
3	PSALM	Overall IoU	83.6	—	Unverified
4	MLCD-Seg-7B	Overall IoU	83.6	—	Unverified
5	HIPIE	Overall IoU	82.8	—	Unverified
6	EVF-SAM	Overall IoU	82.4	—	Unverified
7	UNINEXT-H	Overall IoU	82.19	—	Unverified
8	UniLSeg-100	Overall IoU	81.74	—	Unverified
9	DETRIS	Overall IoU	81	—	Unverified
10	C3VG	Overall IoU	80.89	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	DeRIS-L	Overall IoU	86.49	—	Unverified
2	HyperSeg	Overall IoU	85.7	—	Unverified
3	MLCD-Seg-7B	Overall IoU	85.3	—	Unverified
4	EVF-SAM	Overall IoU	84.2	—	Unverified
5	HyperSeg	Overall IoU	83.5	—	Unverified
6	C3VG	Overall IoU	83.18	—	Unverified
7	MLCD-Seg-7B	Overall IoU	82.9	—	Unverified
8	DeRIS-L	Overall IoU	82.34	—	Unverified
9	DETRIS	Overall IoU	81.9	—	Unverified
10	MaskRIS (Swin-B, combined DB)	Overall IoU	80.64	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	MPG-SAM 2	J&F	73.9	—	Unverified
2	VRS-HQ (Chat-UniVi-13B)	J&F	71	—	Unverified
3	GLEE-Pro	J&F	70.6	—	Unverified
4	UNINEXT-H	J&F	70.1	—	Unverified
5	ReferDINO (Swin-B)	J&F	69.3	—	Unverified
6	MUTR	J&F	68.4	—	Unverified
7	VLP (VLMo-L)	J&F	67.6	—	Unverified
8	UniRef-L (Swin-L)	J&F	67.4	—	Unverified
9	HTR (Pre-training)	J&F	67.1	—	Unverified
10	DsHmp (Video-Swin-Base)	J&F	67.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	DeRIS-L	Mean IoU	78.59	—	Unverified
2	MLCD-Seg-7B	Overall IoU	75.6	—	Unverified
3	HyperSeg	Overall IoU	75.2	—	Unverified
4	EVF-SAM	Overall IoU	71.9	—	Unverified
5	DETRIS	Overall IoU	70.2	—	Unverified
6	C3VG	Overall IoU	68.95	—	Unverified
7	UniLSeg-100	Overall IoU	68.15	—	Unverified
8	UniLSeg-20	Overall IoU	66.99	—	Unverified
9	UNINEXT-H	Overall IoU	66.22	—	Unverified
10	GROUNDHOG	Overall IoU	64.9	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HINet	IoU overall	0.68	—	Unverified
2	RefVOS	IoU overall	0.67	—	Unverified
3	ClawCraneNet	IoU overall	0.64	—	Unverified
4	CMSA+CFSA	IoU overall	0.62	—	Unverified
5	RefVOS	IoU overall	0.6	—	Unverified
6	SgMg (Video-Swin-B)	AP	0.59	—	Unverified
7	SOC (Video-Swin-B)	AP	0.57	—	Unverified
8	ReferFormer (Video-Swin-B)	AP	0.55	—	Unverified
9	SOC (Video-Swin-T)	AP	0.5	—	Unverified
10	MANET	AP	0.47	—	Unverified