SOTAVerified

Multiple-choice

Papers

Showing 126-150 of 1107 papers

Title | Status | Hype
ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Code | 1
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Code | 1
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Code | 1
Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | Code | 1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1
Latxa: An Open Language Model and Evaluation Suite for Basque | Code | 1
Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1
Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Code | 1
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Code | 1

No leaderboard results yet.