
TextVQA

Papers

Showing 47 of 47 papers

Title | Status | Hype
CogVLM2: Visual Language Models for Image and Video Understanding | Code | 9
CogVLM: Visual Expert for Pretrained Language Models | Code | 5
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Code | 5
Towards VQA Models That Can Read | Code | 3
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | Code | 3
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Code | 3
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph | Code | 2
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Code | 2
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Code | 2
RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering | Code | 1
Mitigating Object Hallucinations via Sentence-Level Early Intervention | Code | 1
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption | Code | 1
LaTr: Layout-Aware Transformer for Scene-Text VQA | Code | 1
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations | Code | 1
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation | Code | 1
Structured Multimodal Attentions for TextVQA | Code | 1
Spatially Aware Multimodal Transformers for TextVQA | Code | 1
Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model | - | 0
Analysing the Robustness of Vision-Language-Models to Common Corruptions | - | 0
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | - | 0
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | - | 0
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | - | 0
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | - | 0
Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA | - | 0
FlexAttention for Efficient High-Resolution Vision-Language Models | - | 0
Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture | - | 0
HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models | - | 0
Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA | - | 0
Making the V in Text-VQA Matter | - | 0
Multiple-Question Multiple-Answer Text-VQA | - | 0
SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering | - | 0
Sentence Attention Blocks for Answer Grounding | - | 0
TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text | - | 0
TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance | - | 0
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering | - | 0
Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering | - | 0
Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Code | 0
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization | Code | 0
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Code | 0
Towards a Unified Multimodal Reasoning Framework | Code | 0
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Code | 0
InstructOCR: Instruction Boosting Scene Text Spotting | Code | 0
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps | Code | 0
Separate and Locate: Rethink the Text in Text-based Visual Question Answering | Code | 0
OmniFusion Technical Report | Code | 0
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA | Code | 0

No leaderboard results yet.