Multiple-choice

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 101–150 of 1107 papers

Title	Date	Tasks	Status	Hype
InstructionBench: An Instructional Video Understanding Benchmark	Apr 7, 2025	Common Sense ReasoningMultiple-choice	—Unverified	0
Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams	Apr 4, 2025	BenchmarkingManagement	—Unverified	0
From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models	Apr 4, 2025	Multiple-choice	—Unverified	0
VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence	Apr 3, 2025	Multiple-choice	CodeCode Available	0
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning	Mar 31, 2025	Multiple-choice	—Unverified	0
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	Mar 31, 2025	Logical ReasoningMultiple-choice	CodeCode Available	2
Order Independence With Finetuning	Mar 30, 2025	ARCLanguage Modeling	—Unverified	0
Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models	Mar 30, 2025	Knowledge GraphsMultiple-choice	CodeCode Available	0
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark	Mar 26, 2025	MMLUMultiple-choice	CodeCode Available	1
Language Model Uncertainty Quantification with Attention Chain	Mar 24, 2025	Computational EfficiencyLanguage Modeling	CodeCode Available	1
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering	Mar 23, 2025	BenchmarkingChart Question Answering	—Unverified	0
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark	Mar 22, 2025	Multiple-choice	—Unverified	0
SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia	Mar 21, 2025	Multiple-choice	—Unverified	0
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation	Mar 20, 2025	Multiple-choiceText Generation	CodeCode Available	0
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models	Mar 20, 2025	Multiple-choiceVideo Understanding	CodeCode Available	1
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models	Mar 20, 2025	Code GenerationMultiple-choice	—Unverified	0
AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models	Mar 20, 2025	Autonomous DrivingMultiple-choice	—Unverified	0
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models	Mar 19, 2025	Multiple-choice	—Unverified	0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	Mar 19, 2025	BenchmarkingMultiple-choice	—Unverified	0
How much do LLMs learn from negative examples?	Mar 18, 2025	Multiple-choiceQuestion Answering	CodeCode Available	0
LEAVS: An LLM-based Labeler for Abdominal CT Supervision	Mar 17, 2025	AnatomyLarge Language Model	CodeCode Available	0
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research	Mar 17, 2025	ArticlesBenchmarking	CodeCode Available	1
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data	Mar 13, 2025	Large Language ModelMath	—Unverified	0
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education	Mar 13, 2025	Multiple-choice	—Unverified	0
The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory	Mar 13, 2025	MathMultiple-choice	—Unverified	0
SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM	Mar 12, 2025	Image SegmentationMedical Image Segmentation	CodeCode Available	0
Mellow: a small audio language model for reasoning	Mar 11, 2025	Audio captioningLanguage Modeling	CodeCode Available	2
Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words	Mar 10, 2025	Multiple-choice	—Unverified	0
VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models	Mar 10, 2025	Image DescriptionMultiple-choice	CodeCode Available	0
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations	Mar 10, 2025	FormMultiple-choice	—Unverified	0
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces	Mar 8, 2025	Benchmarkingcounterfactual	—Unverified	0
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios	Mar 8, 2025	BenchmarkingDiagnostic	CodeCode Available	0
Towards Conversational AI for Disease Management	Mar 8, 2025	Clinical KnowledgeDiagnostic	—Unverified	0
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset	Mar 8, 2025	Multiple-choice	CodeCode Available	1
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework	Mar 7, 2025	Conformal PredictionMedical Question Answering	—Unverified	0
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs	Mar 7, 2025	Large Language ModelMultiple-choice	CodeCode Available	0
The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars	Mar 5, 2025	Multiple-choiceSurvey	—Unverified	0
Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction	Mar 5, 2025	In-Context LearningMultiple-choice	CodeCode Available	0
Structured Outputs Enable General-Purpose LLMs to be Medical Experts	Mar 5, 2025	Clinical KnowledgeMedical Question Answering	—Unverified	0
When an LLM is apprehensive about its answers -- and when its uncertainty is justified	Mar 3, 2025	MathMMLU	CodeCode Available	0
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering	Mar 3, 2025	Business EthicsEthics	—Unverified	0
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts	Feb 28, 2025	MathMathematical Reasoning	—Unverified	0
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology	Feb 28, 2025	Multiple-choicescientific discovery	CodeCode Available	2
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants	Feb 27, 2025	Multiple-choice	CodeCode Available	0
Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning	Feb 27, 2025	MathMedical Question Answering	—Unverified	0
ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions	Feb 26, 2025	Language ModelingLanguage Modelling	—Unverified	0
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging	Feb 25, 2025	MMLUMultiple-choice	CodeCode Available	0
SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models	Feb 25, 2025	Continual LearningGSM8K	—Unverified	0
DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning	Feb 25, 2025	ManagementMultiple-choice	—Unverified	0
Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions	Feb 25, 2025	Inductive BiasLogical Reasoning	—Unverified	0

Show:10 25 50

← PrevPage 3 of 23Next →

No leaderboard results yet.