| TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Jul 17, 2024 | Math, Multiple-choice | Code Available | 1 |
| A LLM Benchmark based on the Minecraft Builder Dialog Agent Task | Jul 17, 2024 | Math, Minecraft | Unverified | 0 |
| CCoE: A Compact LLM with Collaboration of Experts | Jul 16, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Reasoning with Large Language Models, a Survey | Jul 16, 2024 | Few-Shot Learning, In-Context Learning | Unverified | 0 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | Benchmarking, Math | Code Available | 1 |
| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | Jul 12, 2024 | GSM8K, Math | Unverified | 0 |
| TelecomGPT: A Framework to Build Telecom-Specfic Large Language Models | Jul 12, 2024 | Code Generation, Math | Unverified | 0 |
| Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors | Jul 12, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models | Jul 11, 2024 | Language Modelling, Math | Code Available | 1 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | Jul 11, 2024 | GSM8K, Math | Unverified | 0 |
| Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | Jul 11, 2024 | GSM8K, Math | Unverified | 0 |
| MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | Jul 11, 2024 | Contrastive Learning, Language Modelling | Code Available | 4 |
| ConvNLP: Image-based AI Text Detection | Jul 9, 2024 | Domain Generalization, Math | Unverified | 0 |
| Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models | Jul 9, 2024 | Math | Code Available | 0 |
| Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns? | Jul 6, 2024 | Math | Code Available | 0 |
| Smart Vision-Language Reasoners | Jul 5, 2024 | Math, Mathematical Reasoning | Code Available | 0 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Jul 4, 2024 | Avg, GSM8K | Code Available | 1 |
| Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior | Jul 2, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | Jul 1, 2024 | Math, Mathematical Reasoning | Code Available | 2 |
| Eliminating Position Bias of Language Models: A Mechanistic Approach | Jul 1, 2024 | Math, object-detection | Code Available | 1 |
| Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Jun 30, 2024 | GSM8K, Math | Code Available | 1 |
| Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | Jun 29, 2024 | Binary Classification, GSM8K | Unverified | 0 |
| CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models | Jun 28, 2024 | Diversity, Math | Unverified | 0 |
| ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting | Jun 28, 2024 | Bilevel Optimization, Instruction Following | Unverified | 0 |
| LiveBench: A Challenging, Contamination-Limited LLM Benchmark | Jun 27, 2024 | Articles, Instruction Following | Code Available | 5 |
| DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Jun 27, 2024 | Distractor Generation, Math | Code Available | 0 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Jun 26, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 3 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 |
| Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Jun 25, 2024 | Diversity, Math | Code Available | 2 |
| Task Oriented In-Domain Data Augmentation | Jun 24, 2024 | Data Augmentation, Math | Unverified | 0 |
| Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs | Jun 24, 2024 | Instruction Following, Math | Code Available | 1 |
| Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions | Jun 20, 2024 | Active Learning, Math | Unverified | 0 |
| RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold | Jun 20, 2024 | Math, Reinforcement Learning (RL) | Code Available | 1 |
| LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Jun 20, 2024 | Binary Classification, GSM8K | Code Available | 1 |
| Towards Infinite-Long Prefix in Transformer | Jun 20, 2024 | Math, parameter-efficient fine-tuning | Code Available | 0 |
| CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Jun 20, 2024 | Code Generation, Math | Code Available | 1 |
| Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning | Jun 20, 2024 | GSM8K, Heuristic Search | Unverified | 0 |
| Adaptable Logical Control for Large Language Models | Jun 19, 2024 | Math, Text Generation | Code Available | 2 |
| Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever | Jun 19, 2024 | Math, Semantic Similarity | Unverified | 0 |
| Can LLMs Reason in the Wild with Programs? | Jun 19, 2024 | GSM8K, Math | Code Available | 0 |
| DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Jun 18, 2024 | Arithmetic Reasoning, Math | Code Available | 2 |
| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Jun 18, 2024 | All, GSM8K | Code Available | 14 |
| Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems | Jun 18, 2024 | In-Context Learning, Math | Unverified | 0 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts | Jun 17, 2024 | Math | Unverified | 0 |
| DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Jun 17, 2024 | GSM8K, Math | Code Available | 1 |
| DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Jun 17, 2024 | 16k, Language Modeling | Code Available | 9 |
| GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation | Jun 17, 2024 | Image Generation, Math | Code Available | 0 |
| Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | Jun 17, 2024 | Logical Reasoning, Math | Unverified | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |