SOTAVerified

Image to text

Papers

Showing 51-100 of 246 papers

Title | Status | Hype
TrojVLM: Backdoor Attack Against Vision Language Models | - | 0
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | - | 0
Evaluating authenticity and quality of image captions via sentiment and semantic analyses | - | 0
See or Guess: Counterfactually Regularized Image Captioning | Code | 1
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation | Code | 1
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models | - | 0
Instruction Tuning-free Visual Token Complement for Multimodal LLMs | - | 0
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Code | 2
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Code | 0
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities | Code | 2
Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | - | 0
GPC: Generative and General Pathology Image Classifier | - | 0
LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval | Code | 2
15M Multimodal Facial Image-Text Dataset | - | 0
Towards a text-based quantitative and explainable histopathology image analysis | Code | 0
HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels | - | 0
Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation | - | 0
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | - | 0
A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning | Code | 0
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags | - | 0
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Code | 0
CMC-Bench: Towards a New Paradigm of Visual Signal Compression | Code | 1
Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval | - | 0
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Code | 0
AICoderEval: Improving AI Domain Code Generation of Large Language Models | - | 0
Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design | Code | 1
Faithful Chart Summarization with ChaTS-Pi | - | 0
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | - | 0
Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation | Code | 0
Libra: Building Decoupled Vision System on Large Language Models | Code | 2
Language-Oriented Semantic Latent Representation for Image Transmission | Code | 1
DOCCI: Descriptions of Connected and Contrasting Images | - | 0
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation | - | 0
Leveraging AI to Generate Audio for User-generated Content in Video Games | - | 0
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations | Code | 0
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Code | 1
Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection | - | 0
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Code | 2
OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | - | 0
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models | Code | 2
Evaluating Text-to-Visual Generation with Image-to-Text Generation | Code | 3
BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval | - | 0
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | - | 0
ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes | Code | 1
MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant | - | 0
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | - | 0
Enhancing Vision-Language Pre-training with Rich Supervisions | - | 0
Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition | - | 0
Probing Multimodal Large Language Models for Global and Local Semantic Representations | Code | 0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | - | 0

No leaderboard results yet.