| Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Feb 24, 2025 | GSM8K, Math | Code Available | 2 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Feb 24, 2025 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | Feb 23, 2025 | Multiple-choice | Unverified | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | Feb 22, 2025 | Multiple-choice | Unverified | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Decision Making, Multiple-choice | Code Available | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Feb 22, 2025 | Distractor Generation, Information Retrieval | Code Available | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | Feb 21, 2025 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | Feb 20, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | Feb 20, 2025 | Multiple-choice | Unverified | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | Feb 20, 2025 | Multiple-choice, Text Generation | Unverified | 0 |
| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | Feb 19, 2025 | Articles, Multiple-choice | Unverified | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | Feb 19, 2025 | Instruction Following, Multiple-choice | Unverified | 0 |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | Feb 19, 2025 | All, Multiple-choice | Unverified | 0 |
| Towards Geo-Culturally Grounded LLM Generations | Feb 19, 2025 | Multiple-choice, Retrieval-augmented Generation | Unverified | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | Unverified | 0 |
| OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | Feb 18, 2025 | Large Language Model, Multiple-choice | Unverified | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Feb 18, 2025 | Math, Memorization | Unverified | 0 |
| Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | Feb 18, 2025 | Generative Question Answering, Multiple-choice | Unverified | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | Feb 17, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning | Feb 16, 2025 | Analogical questions, In-Context Learning | Unverified | 0 |
| Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Feb 16, 2025 | Multiple-choice | Code Available | 1 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | Unverified | 0 |
| Objective quantification of mood states using large language models | Feb 13, 2025 | Multiple-choice | Unverified | 0 |
| Truth Knows No Language: Evaluating Truthfulness Beyond English | Feb 13, 2025 | Informativeness, Machine Translation | Code Available | 0 |
| SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models | Feb 12, 2025 | Fairness, Multiple-choice | Unverified | 0 |
| A Semantic Parsing Algorithm to Solve Linear Ordering Problems | Feb 12, 2025 | Multiple-choice, Semantic Parsing | Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | Unverified | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | Feb 11, 2025 | Multiple-choice | Unverified | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | Feb 10, 2025 | MMLU, Morphological Analysis | Unverified | 0 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Feb 8, 2025 | Legal Reasoning, Multiple-choice | Code Available | 0 |
| ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning | Feb 7, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | Feb 6, 2025 | Multiple-choice, Sensitivity | Unverified | 0 |
| LLMs to Support a Domain Specific Knowledge Assistant | Feb 6, 2025 | Chatbot, Multiple-choice | Unverified | 0 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts | Feb 3, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| CoddLLM: Empowering Large Language Models for Data Analytics | Feb 1, 2025 | Multiple-choice, Synthetic Data Generation | Unverified | 0 |
| InnerThoughts: Disentangling Representations and Predictions in Large Language Models | Jan 29, 2025 | Multiple-choice, Position | Unverified | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | Jan 28, 2025 | Logical Reasoning, Multiple-choice | Unverified | 0 |
| Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | Jan 28, 2025 | Multiple-choice | Unverified | 0 |
| Attribution analysis of legal language as used by LLM | Jan 28, 2025 | Binary Classification, Multiple-choice | Unverified | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | Jan 27, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | Jan 26, 2025 | MMLU, Multiple-choice | Unverified | 0 |
| LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | Jan 25, 2025 | Information Retrieval, Multiple-choice | Unverified | 0 |
| LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion | Jan 25, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| Option-ID Based Elimination For Multiple Choice Questions | Jan 25, 2025 | Multiple-choice | Code Available | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | Unverified | 0 |
| Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources | Jan 23, 2025 | Multiple-choice | Unverified | 0 |