Multiple-choice

Papers


Showing 501–525 of 1107 papers

Title	Date	Tasks	Status
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs	Nov 12, 2024	Coreference Resolution	Code Available
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents	Nov 12, 2024	General Knowledge, Hallucination	Unverified
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability	Nov 10, 2024	Multiple-choice, Text Generation	Unverified
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators	Nov 8, 2024	Decision Making, Multiple-choice	Unverified
Quantitative Assessment of Intersectional Empathetic Bias and Understanding	Nov 8, 2024	Multiple-choice	Code Available
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding	Nov 7, 2024	Benchmarking, Multiple-choice	Unverified
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees	Nov 4, 2024	Multiple-choice, Question Answering	Unverified
Enhancing LLM Evaluations: The Garbling Trick	Nov 3, 2024	Multiple-choice	Unverified
Benchmarking Bias in Large Language Models during Role-Playing	Nov 1, 2024	Benchmarking, Fairness	Unverified
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest	Oct 27, 2024	Medical Visual Question Answering, Multiple-choice	Unverified
GPT-4o System Card	Oct 25, 2024	Multiple-choice, Spatial Reasoning	Unverified
Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare	Oct 24, 2024	Multiple-choice	Unverified
Large Language Models Still Exhibit Bias in Long Text	Oct 23, 2024	Fairness, Multiple-choice	Unverified
GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks	Oct 22, 2024	Code Generation, Code Summarization	Unverified
How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?	Oct 21, 2024	counterfactual, Decision Making	Code Available
Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S	Oct 21, 2024	Multiple-choice	Unverified
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models	Oct 18, 2024	Fairness, Multiple-choice	Unverified
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs	Oct 18, 2024	Benchmarking, Fairness	Unverified
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback	Oct 17, 2024	Fact Verification, Hallucination	Code Available
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy	Oct 17, 2024	Multiple-choice, Response Generation	Unverified
LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights	Oct 17, 2024	Legal Reasoning, Multiple-choice	Unverified
Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks	Oct 16, 2024	Instruction Following, Multiple-choice	Code Available
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers	Oct 15, 2024	Multiple-choice	Code Available
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs	Oct 15, 2024	Image Description, Multiple-choice	Code Available
Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing	Oct 14, 2024	All, Binary Classification	Unverified

