SOTAVerified

Multiple-choice

Papers

Showing 651-700 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding | | 0 |
| POE: Process of Elimination for Multiple Choice Reasoning | Code | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | | 0 |
| StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding | Code | 0 |
| Field-testing items using artificial intelligence: Natural language processing with transformers | | 0 |
| Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | | 0 |
| Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education | | 0 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Code | 1 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Code | 0 |
| Mitigating Bias for Question Answering Models by Tracking Bias Influence | | 0 |
| OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Code | 1 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Code | 1 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | | 0 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Code | 1 |
| On the Performance of Multimodal Language Models | | 0 |
| AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval | Code | 0 |
| Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions | Code | 0 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Code | 0 |
| Fusing Models with Complementary Expertise | Code | 0 |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Code | 1 |
| Automating question generation from educational text | | 0 |
| HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | | 0 |
| Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models | Code | 0 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1 |
| Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change | | 0 |
| Language models are susceptible to incorrect patient self-diagnosis in medical applications | | 0 |
| Self-Assessment Tests are Unreliable Measures of LLM Personality | | 0 |
| SafetyBench: Evaluating the Safety of Large Language Models | Code | 2 |
| Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions | | 0 |
| Use neural networks to recognize students' handwritten letters and incorrect symbols | | 0 |
| Large Language Models Are Not Robust Multiple Choice Selectors | Code | 1 |
| An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models | | 0 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Code | 1 |
| INCEPTNET: Precise And Early Disease Detection Application For Medical Images Analyses | Code | 0 |
| Generalised Winograd Schema and its Contextuality | | 0 |
| The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2 |
| Spoken Language Intelligence of Large Language Models for Language Learning | Code | 0 |
| Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions | | 0 |
| LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Code | 1 |
| FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1 |
| A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology | | 0 |
| Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning | Code | 0 |
| ChatGPT for GTFS: Benchmarking LLMs on GTFS Understanding and Retrieval | Code | 0 |
| ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding | Code | 0 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2 |
| Distractor generation for multiple-choice questions with predictive prompting and large language models | Code | 0 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2 |
| A large language model-assisted education tool to provide feedback on open-ended responses | Code | 0 |
Page 14 of 23

No leaderboard results yet.