Multiple-choice

Papers


Showing 351–375 of 1107 papers

Title	Date	Tasks	Status
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models	Mar 20, 2025	Code Generation, Multiple-choice	Unverified
AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models	Mar 20, 2025	Autonomous Driving, Multiple-choice	Unverified
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation	Mar 20, 2025	Multiple-choice, Text Generation	Code Available
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models	Mar 19, 2025	Multiple-choice	Unverified
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	Mar 19, 2025	Benchmarking, Multiple-choice	Unverified
How much do LLMs learn from negative examples?	Mar 18, 2025	Multiple-choice, Question Answering	Code Available
LEAVS: An LLM-based Labeler for Abdominal CT Supervision	Mar 17, 2025	Anatomy, Large Language Model	Code Available
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data	Mar 13, 2025	Large Language Model, Math	Unverified
The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory	Mar 13, 2025	Math, Multiple-choice	Unverified
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education	Mar 13, 2025	Multiple-choice	Unverified
SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM	Mar 12, 2025	Image Segmentation, Medical Image Segmentation	Code Available
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations	Mar 10, 2025	Form, Multiple-choice	Unverified
VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models	Mar 10, 2025	Image Description, Multiple-choice	Code Available
Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words	Mar 10, 2025	Multiple-choice	Unverified
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios	Mar 8, 2025	Benchmarking, Diagnostic	Code Available
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces	Mar 8, 2025	Benchmarking, counterfactual	Unverified
Towards Conversational AI for Disease Management	Mar 8, 2025	Clinical Knowledge, Diagnostic	Unverified
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs	Mar 7, 2025	Large Language Model, Multiple-choice	Code Available
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework	Mar 7, 2025	Conformal Prediction, Medical Question Answering	Unverified
Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction	Mar 5, 2025	In-Context Learning, Multiple-choice	Code Available
Structured Outputs Enable General-Purpose LLMs to be Medical Experts	Mar 5, 2025	Clinical Knowledge, Medical Question Answering	Unverified
The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars	Mar 5, 2025	Multiple-choice, Survey	Unverified
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering	Mar 3, 2025	Business Ethics, Ethics	Unverified
When an LLM is apprehensive about its answers -- and when its uncertainty is justified	Mar 3, 2025	Math, MMLU	Code Available
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts	Feb 28, 2025	Math, Mathematical Reasoning	Unverified


No leaderboard results yet.