| Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models | Feb 16, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers | Apr 1, 2023 | Inductive Bias, Mathematical Problem-Solving | Code Available | 1 |
| MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula | Jul 1, 2024 | Mathematical Problem-Solving | Code Available | 1 |
| A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets | May 29, 2023 | Bias Detection, Code Generation | Code Available | 1 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | May 29, 2024 | Benchmarking, Dialogue Understanding | Code Available | 1 |
| MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion | Mar 20, 2025 | Data Augmentation, Mathematical Problem-Solving | Code Available | 1 |
| Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks | Apr 23, 2024 | Mathematical Problem-Solving, Question Answering | Code Available | 1 |
| Non-myopic Generation of Language Models for Reasoning and Planning | Oct 22, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 |