| H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables | Jun 29, 2024 | Fact Verification, Mathematical Reasoning | Code Available | 1 |
| LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement | Jun 29, 2024 | Contrastive Learning, Mathematical Reasoning | Unverified | 0 |
| LiteSearch: Efficacious Tree Search for LLM | Jun 29, 2024 | GSM8K, Mathematical Reasoning | Unverified | 0 |
| The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic | Jun 28, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Applying RLAIF for Code Generation with API-usage in Lightweight LLMs | Jun 28, 2024 | Code Generation, Hallucination | Unverified | 0 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Jun 26, 2024 | Arithmetic Reasoning, GSM8K | Code Available | 3 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 |
| Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Jun 25, 2024 | Diversity, Math | Code Available | 2 |
| Anomaly Detection of Tabular Data Using LLMs | Jun 24, 2024 | Anomaly Detection, Long-Context Understanding | Unverified | 0 |
| Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts | Jun 24, 2024 | Mathematical Reasoning, Visual Question Answering (VQA) | Unverified | 0 |
| Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads | Jun 22, 2024 | Mathematical Reasoning | Unverified | 0 |
| LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Jun 20, 2024 | Binary Classification, GSM8K | Code Available | 1 |
| Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models | Jun 18, 2024 | Mathematical Reasoning | Code Available | 0 |
| Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning | Jun 17, 2024 | Data Augmentation, Mathematical Reasoning | Code Available | 2 |
| DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Jun 17, 2024 | 16k, Language Modeling | Code Available | 9 |
| CodeGemma: Open Code Models Based on Gemma | Jun 17, 2024 | Code Completion, Mathematical Reasoning | Unverified | 0 |
| Step-level Value Preference Optimization for Mathematical Reasoning | Jun 16, 2024 | Learning-To-Rank, Math | Code Available | 3 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | Jun 15, 2024 | Mathematical Reasoning, MMLU | Unverified | 0 |
| Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? | Jun 13, 2024 | Mathematical Reasoning, Question Answering | Code Available | 1 |
| ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models | Jun 13, 2024 | Code Generation, domain classification | Unverified | 0 |
| Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B | Jun 11, 2024 | Decision Making, GSM8K | Code Available | 5 |
| Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples | Jun 9, 2024 | ARC, Diversity | Code Available | 2 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 |
| Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions | Jun 7, 2024 | Hallucination, Mathematical Reasoning | Unverified | 0 |