| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | Jan 28, 2025 | Logical Reasoning, Multiple-choice | Unverified | 0 |
| Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | Jan 28, 2025 | Multiple-choice | Unverified | 0 |
| Attribution analysis of legal language as used by LLM | Jan 28, 2025 | Binary Classification, Multiple-choice | Unverified | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | Jan 27, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | Jan 26, 2025 | MMLU, Multiple-choice | Unverified | 0 |
| LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | Jan 25, 2025 | Information Retrieval, Multiple-choice | Unverified | 0 |
| LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion | Jan 25, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| Option-ID Based Elimination For Multiple Choice Questions | Jan 25, 2025 | Multiple-choice | Code Available | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | Unverified | 0 |
| Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources | Jan 23, 2025 | Multiple-choice | Unverified | 0 |