SOTAVerified

Multiple-choice

Papers

Showing 101–150 of 1107 papers

Title | Status | Hype
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Code | 1
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Code | 1
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Code | 1
MILU: A Multi-task Indic Language Understanding Benchmark | Code | 1
Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Code | 1
TimeSeriesExam: A time series understanding exam | Code | 1
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Code | 1
Taming Overconfidence in LLMs: Reward Calibration in RLHF | Code | 1
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Code | 1
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Code | 1
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
Boosting Healthcare LLMs Through Retrieved Context | Code | 1
Annealed Winner-Takes-All for Motion Forecasting | Code | 1
Training on the Benchmark Is Not All You Need | Code | 1
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Code | 1
Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Code | 1
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Code | 1
Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Code | 1
Evaluating language models as risk scores | Code | 1
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Code | 1
Fine-tuning Multimodal Large Language Models for Product Bundling | Code | 1
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Code | 1
ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Code | 1
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Code | 1
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Code | 1
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Code | 1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Code | 1
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Code | 1
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Code | 1
Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | Code | 1
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models | Code | 1
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Code | 1
Latxa: An Open Language Model and Evaluation Suite for Basque | Code | 1
Non-Linear Inference Time Intervention: Improving LLM Truthfulness | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Code | 1
Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Code | 1
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Code | 1

No leaderboard results yet.