SOTAVerified

Referring Expression Comprehension

Papers

Showing 51–100 of 167 papers

Title | Status | Hype
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Code | 1
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Code | 1
SeqTR: A Simple yet Universal Network for Visual Grounding | Code | 1
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Code | 1
Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Code | 1
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1
TransVG: End-to-End Visual Grounding with Transformers | Code | 1
Unifying Vision-and-Language Tasks via Text Generation | Code | 1
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Code | 1
UNITER: UNiversal Image-TExt Representation Learning | Code | 1
Talk2Car: Taking Control of Your Self-Driving Car | Code | 1
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Code | 1
A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Code | 1
Explainable Neural Computation via Stack Neural Module Networks | Code | 1
Compositional Attention Networks for Machine Reasoning | Code | 1
Referring Expression Instance Retrieval and A Strong End-to-End Baseline | | 0
Synthetic Visual Genome | | 0
WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation | Code | 0
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | | 0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | | 0
Exploring Spatial Language Grounding Through Referring Expressions | | 0
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | | 0
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | | 0
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | | 0
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | | 0
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | | 0
Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension | | 0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Code | 0
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression | | 0
A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Code | 0
Revisiting Multi-Modal LLM Evaluation | | 0
MaskInversion: Localized Embeddings via Optimization of Explainability Maps | | 0
Learning Visual Grounding from Generative Vision and Language Model | | 0
The Solution for the 5th GCAIAC Zero-shot Referring Expression Comprehension Challenge | | 0
M^2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension | | 0
Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO | | 0
ScanFormer: Referring Expression Comprehension by Iteratively Scanning | | 0
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | Code | 0
Text-driven Affordance Learning from Egocentric Vision | | 0
PropTest: Automatic Property Testing for Improved Visual Programming | | 0
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | | 0
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | | 0
Revisiting Counterfactual Problems in Referring Expression Comprehension | Code | 0
Compositional Zero-Shot Learning for Attribute-Based Object Reference in Human-Robot Interaction | | 0
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 0
Page 2 of 4

No leaderboard results yet.