| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Feb 18, 2025 | Math, Memorization | Unverified | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | Feb 17, 2025 | Benchmarking, Dependency Parsing | Unverified | 0 |
| Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | Feb 17, 2025 | MMLU, Natural Questions | Unverified | 0 |
| TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Feb 16, 2025 | Machine Translation, MMLU | Code Available | 1 |
| Leveraging Uncertainty Estimation for Efficient LLM Routing | Feb 16, 2025 | GSM8K, MMLU | Unverified | 0 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | Feb 16, 2025 | MedQA, MMLU | Unverified | 0 |
| ORI: O Routing Intelligence | Feb 14, 2025 | ARC, MMLU | Unverified | 0 |
| Cost-Saving LLM Cascades with Early Abstention | Feb 13, 2025 | GSM8K, MMLU | Unverified | 0 |
| Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models | Feb 12, 2025 | Mathematical Reasoning, MMLU | Unverified | 0 |
| Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon | Feb 11, 2025 | MMLU | Code Available | 0 |