| TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Jul 17, 2024 | MathMultiple-choice | CodeCode Available | 1 |
| A LLM Benchmark based on the Minecraft Builder Dialog Agent Task | Jul 17, 2024 | MathMinecraft | —Unverified | 0 |
| Reasoning with Large Language Models, a Survey | Jul 16, 2024 | Few-Shot LearningIn-Context Learning | —Unverified | 0 |
| CCoE: A Compact LLM with Collaboration of Experts | Jul 16, 2024 | Language ModellingLarge Language Model | —Unverified | 0 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | BenchmarkingMath | CodeCode Available | 1 |
| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | Jul 12, 2024 | GSM8KMath | —Unverified | 0 |
| Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors | Jul 12, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| TelecomGPT: A Framework to Build Telecom-Specfic Large Language Models | Jul 12, 2024 | Code GenerationMath | —Unverified | 0 |
| MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | Jul 11, 2024 | Contrastive LearningLanguage Modelling | CodeCode Available | 4 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | Jul 11, 2024 | GSM8KMath | —Unverified | 0 |