SOTAVerified

Question Answering

Question answering can be divided into domain-specific tasks such as community question answering and knowledge-base question answering. Popular benchmark datasets for evaluating question answering systems include SQuAD, HotPotQA, bAbI, TriviaQA, WikiQA, and many others. Models for question answering are typically evaluated on metrics such as exact match (EM) and F1. Some recent top-performing models are T5 and XLNet.
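As a rough illustration of the EM and F1 metrics mentioned above, the sketch below scores one predicted answer string against one reference answer. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) and token-level overlap for F1; other benchmarks may define these metrics differently. The function names `normalize_answer`, `exact_match`, and `f1_score` are illustrative and not taken from this page.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the normalization commonly used for SQuAD-style scoring.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower!", "eiffel tower"))
    print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
```

When a question has several acceptable reference answers, a common convention is to take the maximum score over the references and then average over all questions; that aggregation step is left out of this sketch.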


Papers

Showing 2351-2400 of 10817 papers

Title | Status | Hype
FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering | - | 0
Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens | Code | 0
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation | Code | 1
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | - | 0
Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation | - | 0
Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators | Code | 1
Nash CoT: Multi-Path Inference with Preference Equilibrium | Code | 0
Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA | Code | 0
LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization | - | 0
Towards Understanding Domain Adapted Sentence Embeddings for Document Retrieval | - | 0
Intermediate Distillation: Data-Efficient Distillation from Black-Box LLMs for Information Retrieval | - | 0
VoCo-LLaMA: Towards Vision Compression with Large Language Models | Code | 3
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory | Code | 0
Problem-Solving in Language Model Networks | Code | 0
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | Code | 2
From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries | - | 0
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Code | 2
Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis | - | 0
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems | - | 0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | - | 0
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning | Code | 2
Mitigating Large Language Model Hallucination with Faithful Finetuning | - | 0
Extrinsic Evaluation of Cultural Competence in Large Language Models | Code | 0
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations | Code | 2
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs | - | 0
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Code | 1
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Code | 1
Soft Prompting for Unlearning in Large Language Models | Code | 1
SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation | - | 0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Code | 0
ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO | Code | 2
TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation | Code | 1
Context Graph | - | 0
Task Me Anything | Code | 2
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Code | 2
Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models? | Code | 0
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | - | 0
Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy | - | 0
Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | - | 0
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning | Code | 3
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
Hallucination Mitigation Prompts Long-term Video Understanding | Code | 0
Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities | Code | 0
Mixture-of-Subspaces in Low-Rank Adaptation | Code | 0
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers | - | 0
Towards Lifelong Dialogue Agents via Timeline-based Memory Management | - | 0
Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts | Code | 0
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IE-Net (ensemble) | EM | 90.94 | - | Unverified
2 | FPNet (ensemble) | EM | 90.87 | - | Unverified
3 | IE-NetV2 (ensemble) | EM | 90.86 | - | Unverified
4 | SA-Net on Albert (ensemble) | EM | 90.72 | - | Unverified
5 | SA-Net-V2 (ensemble) | EM | 90.68 | - | Unverified
6 | FPNet (ensemble) | EM | 90.6 | - | Unverified
7 | Retro-Reader (ensemble) | EM | 90.58 | - | Unverified
8 | EntitySpanFocusV2 (ensemble) | EM | 90.52 | - | Unverified
9 | TransNets + SFVerifier + SFEnsembler (ensemble) | EM | 90.49 | - | Unverified
10 | EntitySpanFocus+AT (ensemble) | EM | 90.45 | - | Unverified