SOTAVerified

Question Answering

Question answering can be segmented into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotpotQA, bAbI, TriviaQA, and WikiQA, among many others. Question answering models are typically evaluated on metrics such as Exact Match (EM) and F1. Some recent top-performing models are T5 and XLNet.

(Image credit: SQuAD)
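The EM and F1 metrics mentioned above are computed per question and averaged over the dataset. The sketch below follows the answer normalization used by the official SQuAD evaluation script (lowercasing, stripping punctuation, articles, and extra whitespace); the function names are illustrative, not part of any particular library.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """EM is 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Normalization makes EM robust to case, articles, and punctuation:
print(exact_match("The Eiffel Tower!", "eiffel tower"))   # 1.0
# F1 rewards partial token overlap when the span is not exact:
print(f1_score("the tower in Paris", "Eiffel tower"))     # 0.4
```

On datasets with multiple reference answers per question (as in SQuAD), each metric is typically taken as the maximum over the references.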

Papers

Showing 501-550 of 10,817 papers

Titles (every entry below has Status: Code and Hype: 2):

Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
F-LMM: Grounding Frozen Large Multimodal Models
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
Evaluating LLM Reasoning in the Operations Research Domain with ORQA
FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Egocentric Video-Language Pretraining
Retrieval with Learned Similarities
Empowering Large Language Models to Set up a Knowledge Retrieval Indexer via Self-Learning
Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
Explore the Limits of Omni-modal Pretraining at Scale
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
An Embodied Generalist Agent in 3D World
Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models
Efficient One-Pass End-to-End Entity Linking for Questions
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education
EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
End-To-End Memory Networks
ANAH: Analytical Annotation of Hallucinations in Large Language Models
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
Dual Diffusion for Unified Image Generation and Understanding
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
Can AI Assistants Know What They Don't Know?
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Page 11 of 217

Benchmark Results

#   Model                                            Metric  Claimed  Verified  Status
1   IE-Net (ensemble)                                EM      90.94              Unverified
2   FPNet (ensemble)                                 EM      90.87              Unverified
3   IE-NetV2 (ensemble)                              EM      90.86              Unverified
4   SA-Net on Albert (ensemble)                      EM      90.72              Unverified
5   SA-Net-V2 (ensemble)                             EM      90.68              Unverified
6   FPNet (ensemble)                                 EM      90.6               Unverified
7   Retro-Reader (ensemble)                          EM      90.58              Unverified
8   TransNets + SFVerifier + SFEnsembler (ensemble)  EM      90.49              Unverified
9   EntitySpanFocusV2 (ensemble)                     EM      90.52              Unverified
10  EntitySpanFocus+AT (ensemble)                    EM      90.45              Unverified