SOTAVerified

Multiple-choice

Papers

Showing 551-600 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | | 0 |
| Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions | | 0 |
| QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling | Code | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | | 0 |
| Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights | | 0 |
| Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models | Code | 0 |
| LLM-as-a-Judge & Reward Model: What They Can and Cannot Do | | 0 |
| Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia | | 0 |
| Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement | | 0 |
| Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach | Code | 0 |
| COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes | Code | 0 |
| MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models | | 0 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | | 0 |
| Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning | | 0 |
| Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options | Code | 0 |
| Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models | Code | 0 |
| Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations | | 0 |
| Differentiating Choices via Commonality for Multiple-Choice Question Answering | Code | 0 |
| How Susceptible are LLMs to Influence in Prompts? | | 0 |
| Measuring Agreeableness Bias in Multimodal Models | Code | 0 |
| Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation | Code | 0 |
| Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil | | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Code | 0 |
| Winning Amazon KDD Cup'24 | | 0 |
| Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | | 0 |
| Improved Few-Shot Image Classification Through Multiple-Choice Questions | | 0 |
| Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models | | 0 |
| MIBench: Evaluating Multimodal Large Language Models over Multiple Images | | 0 |
| Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions | | 0 |
| Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment | Code | 0 |
| Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data | | 0 |
| Adversarial Databases Improve Success in Retrieval-based Large Language Models | | 0 |
| MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models | | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | | 0 |
| AstroMLab 1: Who Wins Astronomy Jeopardy!? | | 0 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | | 0 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Code | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | | 0 |
| Self-Recognition in Language Models | Code | 0 |
| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Code | 0 |
| Are Large Language Models Consistent over Value-laden Questions? | Code | 0 |
| Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Code | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | | 0 |
| DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Code | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | | 0 |
| Length Optimization in Conformal Prediction | Code | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | | 0 |
| SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | | 0 |

No leaderboard results yet.