| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Dec 19, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | Dec 19, 2024 | 8k, In-Context Learning | Code Available | 5 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | Dec 16, 2024 | Multiple-choice | Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | Unverified | 0 |
| Auto-bidding in real-time auctions via Oracle Imitation Learning (OIL) | Dec 16, 2024 | Imitation Learning, Multiple-choice | Unverified | 0 |
| Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models | Dec 15, 2024 | Multiple-choice | Unverified | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| Do LLMs Act as Repositories of Causal Knowledge? | Dec 14, 2024 | Causal Inference, Multiple-choice | Unverified | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | Dec 14, 2024 | Multiple-choice | Unverified | 0 |
| Superhuman performance of a large language model on the reasoning tasks of a physician | Dec 14, 2024 | Diagnostic, Language Modeling | Unverified | 0 |