| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | Mar 20, 2025 | Code Generation, Multiple-choice | Unverified | 0 |
| Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Mar 20, 2025 | Multiple-choice, Text Generation | Code Available | 0 |
| AutoDrive-QA: Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models | Mar 20, 2025 | Autonomous Driving, Multiple-choice | Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | Mar 19, 2025 | Multiple-choice | Unverified | 0 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | Mar 13, 2025 | Large Language Model, Math | Unverified | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | Mar 13, 2025 | Math, Multiple-choice | Unverified | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | Mar 13, 2025 | Multiple-choice | Unverified | 0 |
| SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM | Mar 12, 2025 | Image Segmentation, Medical Image Segmentation | Code Available | 0 |
| Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words | Mar 10, 2025 | Multiple-choice | Unverified | 0 |
| VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models | Mar 10, 2025 | Image Description, Multiple-choice | Code Available | 0 |
| Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | Mar 10, 2025 | Form, Multiple-choice | Unverified | 0 |
| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | Mar 8, 2025 | Benchmarking, Counterfactual | Unverified | 0 |
| Towards Conversational AI for Disease Management | Mar 8, 2025 | Clinical Knowledge, Diagnostic | Unverified | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Mar 7, 2025 | Large Language Model, Multiple-choice | Code Available | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction, Medical Question Answering | Unverified | 0 |
| Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Mar 5, 2025 | In-Context Learning, Multiple-choice | Code Available | 0 |
| The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | Mar 5, 2025 | Multiple-choice, Survey | Unverified | 0 |
| Structured Outputs Enable General-Purpose LLMs to be Medical Experts | Mar 5, 2025 | Clinical Knowledge, Medical Question Answering | Unverified | 0 |
| When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Mar 3, 2025 | Math, MMLU | Code Available | 0 |
| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | Mar 3, 2025 | Business Ethics, Ethics | Unverified | 0 |
| MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | Feb 28, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| Med-RLVR: Emerging Medical Reasoning from a 3B Base Model via Reinforcement Learning | Feb 27, 2025 | Math, Medical Question Answering | Unverified | 0 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Feb 27, 2025 | Multiple-choice | Code Available | 0 |
| ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | Feb 26, 2025 | Language Modeling | Unverified | 0 |
| SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | Feb 25, 2025 | Continual Learning, GSM8K | Unverified | 0 |
| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | Feb 25, 2025 | Inductive Bias, Logical Reasoning | Unverified | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Feb 25, 2025 | MMLU, Multiple-choice | Code Available | 0 |
| DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | Feb 25, 2025 | Management, Multiple-choice | Unverified | 0 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | Feb 23, 2025 | Multiple-choice | Unverified | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Feb 22, 2025 | Distractor Generation, Information Retrieval | Code Available | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Decision Making, Multiple-choice | Code Available | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | Feb 22, 2025 | Multiple-choice | Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | Feb 21, 2025 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | Feb 20, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | Feb 20, 2025 | Multiple-choice | Unverified | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | Feb 20, 2025 | Multiple-choice, Text Generation | Unverified | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | Feb 19, 2025 | All, Multiple-choice | Unverified | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | Feb 19, 2025 | Instruction Following, Multiple-choice | Unverified | 0 |
| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | Feb 19, 2025 | Articles, Multiple-choice | Unverified | 0 |
| Towards Geo-Culturally Grounded LLM Generations | Feb 19, 2025 | Multiple-choice, Retrieval-augmented Generation | Unverified | 0 |
| OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | Feb 18, 2025 | Large Language Model, Multiple-choice | Unverified | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Feb 18, 2025 | Math, Memorization | Unverified | 0 |
| Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | Feb 18, 2025 | Generative Question Answering, Multiple-choice | Unverified | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | Feb 17, 2025 | Multiple-choice, Question Answering | Unverified | 0 |