SOTAVerified

Referring Expression Comprehension

Papers

Showing 1–50 of 167 papers

| Title | Status | Hype |
| --- | --- | --- |
| VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Code | 9 |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Code | 9 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Code | 7 |
| Visual Instruction Tuning | Code | 6 |
| Improved Baselines with Visual Instruction Tuning | Code | 6 |
| Efficient Multimodal Learning from Data-centric Perspective | Code | 5 |
| Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Code | 5 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Code | 4 |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Code | 4 |
| Towards Visual Grounding: A Survey | Code | 3 |
| Universal Instance Perception as Object Discovery and Retrieval | Code | 3 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3 |
| General Object Foundation Model for Images and Videos at Scale | Code | 3 |
| MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Code | 3 |
| Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Code | 2 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code | 2 |
| Frontiers in Intelligent Colonoscopy | Code | 2 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2 |
| GREC: Generalized Referring Expression Comprehension | Code | 2 |
| SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2 |
| TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | Code | 2 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Code | 1 |
| Described Object Detection: Liberating Object Detection with Flexible Expressions | Code | 1 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Code | 1 |
| DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Code | 1 |
| RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Code | 1 |
| DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Code | 1 |
| SeqTR: A Simple yet Universal Network for Visual Grounding | Code | 1 |
| Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Code | 1 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1 |
| A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1 |
| Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Code | 1 |
| A Unified Framework for 3D Point Cloud Visual Grounding | Code | 1 |
| Explainable Neural Computation via Stack Neural Module Networks | Code | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1 |
| Correspondence Matters for Video Referring Expression Comprehension | Code | 1 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Code | 1 |
| InstructDET: Diversifying Referring Object Detection with Generalized Instructions | Code | 1 |
| Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Code | 1 |
| Compositional Attention Networks for Machine Reasoning | Code | 1 |

No leaderboard results yet.