| Transformers in Vision: A Survey | Jan 4, 2021 | Action Recognition, Activity Recognition | —Unverified | 0 | 0 |
| Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering | Jan 1, 2022 | Generative Question Answering, Image to text | —Unverified | 0 | 0 |
| Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering | Jun 4, 2024 | Data Augmentation, Machine Translation | —Unverified | 0 | 0 |
| TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba | Feb 21, 2025 | image-classification, Image Classification | —Unverified | 0 | 0 |
| WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models | Apr 22, 2024 | Answer Generation, image-classification | —Unverified | 0 | 0 |
| Asking More Informative Questions for Grounded Retrieval | Nov 14, 2023 | Question Answering, Question Selection | —Unverified | 0 | 0 |
| Yin and Yang: Balancing and Answering Binary Visual Questions | Nov 16, 2015 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Apr 23, 2025 | Question Answering, Scene Understanding | —Unverified | 0 | 0 |
| A Concept-Centric Approach to Multi-Modality Learning | Dec 18, 2024 | Image-text matching, Question Answering | —Unverified | 0 | 0 |
| Tree Memory Networks for Modelling Long-term Temporal Dependencies | Mar 12, 2017 | Machine Translation, Part-Of-Speech Tagging | —Unverified | 0 | 0 |
| Triplet-Aware Scene Graph Embeddings | Sep 19, 2019 | Data Augmentation, Graph Embedding | —Unverified | 0 | 0 |
| Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis | Jun 21, 2024 | Attribute, Medical Visual Question Answering | —Unverified | 0 | 0 |
| TrojVLM: Backdoor Attack Against Vision Language Models | Sep 28, 2024 | Backdoor Attack, Image Captioning | —Unverified | 0 | 0 |
| As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks? | Mar 19, 2024 | Adversarial Attack, Image Captioning | —Unverified | 0 | 0 |
| TRRNet: Tiered Relation Reasoning for Compositional Visual Question Answering | Aug 1, 2020 | Object, Question Answering | —Unverified | 0 | 0 |
| TruthLens: A Training-Free Paradigm for DeepFake Detection | Mar 19, 2025 | Binary Classification, DeepFake Detection | —Unverified | 0 | 0 |
| Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering | Jul 31, 2024 | Diagnostic, Hallucination | —Unverified | 0 | 0 |
| A scoping review on multimodal deep learning in biomedical images and texts | Jul 14, 2023 | Cross-Modal Retrieval, Decision Making | —Unverified | 0 | 0 |
| Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering | Mar 29, 2018 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| TxT: Crossmodal End-to-End Learning with Transformers | Sep 9, 2021 | Multimodal Reasoning, Question Answering | —Unverified | 0 | 0 |
| UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training | Apr 1, 2021 | Image-text matching, Image-text Retrieval | —Unverified | 0 | 0 |
| U-CAM: Visual Explanation using Uncertainty based Class Activation Maps | Aug 17, 2019 | Deep Learning, Probabilistic Deep Learning | —Unverified | 0 | 0 |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question Answering, RAG | —Unverified | 0 | 0 |
| UFO: A UniFied TransfOrmer for Vision-Language Representation Learning | Nov 19, 2021 | Image Captioning, Image-text matching | —Unverified | 0 | 0 |
| UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering | Jul 6, 2023 | Diagnostic, Image Enhancement | —Unverified | 0 | 0 |