| Dynamic Inference With Grounding Based Vision and Language Models | Jan 1, 2023 | Language Modelling, Referring Expression | Unverified | 0 |
| DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Nov 28, 2022 | Object Detection | Code Available | 1 |
| Scene-Text Oriented Referring Expression Comprehension | Nov 4, 2022 | Object Localization, Referring Expression | Code Available | 0 |
| TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation | Oct 19, 2022 | Instance Segmentation, Referring Expression | Code Available | 1 |
| VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment | Oct 9, 2022 | Object Detection | Code Available | 1 |
| Video Referring Expression Comprehension via Transformer with Content-aware Query | Oct 6, 2022 | cross-modal alignment, Referring Expression | Unverified | 0 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Sep 21, 2022 | Action Detection, Action Recognition | Code Available | 0 |
| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal Retrieval, Referring Expression | Code Available | 1 |
| One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning | Jul 31, 2022 | All, Referring Expression | Unverified | 0 |
| Correspondence Matters for Video Referring Expression Comprehension | Jul 21, 2022 | Contrastive Learning, Referring Expression | Code Available | 1 |