| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 |
| Fine-Tuning Large Language Models on Quantum Optimization Problems for Circuit Generation | Apr 15, 2025 | Math, Quantum Machine Learning | Code Available | 1 |
| The Jailbreak Tax: How Useful are Your Jailbreak Outputs? | Apr 14, 2025 | Math | Code Available | 1 |
| Efficient Process Reward Model Training via Active Learning | Apr 14, 2025 | Active Learning, Math | Code Available | 1 |
| M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models | Apr 14, 2025 | Mamba, Math | Code Available | 1 |
| Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression | Apr 10, 2025 | Math, MMLU | Code Available | 1 |
| MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models | Apr 8, 2025 | Math, Multimodal Reasoning | Code Available | 1 |
| Large (Vision) Language Models are Unsupervised In-Context Learners | Apr 3, 2025 | GSM8K, In-Context Learning | Code Available | 1 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D Reconstruction, Benchmarking | Code Available | 1 |
| Entropy-Based Adaptive Weighting for Self-Training | Mar 31, 2025 | GSM8K, Math | Code Available | 1 |
| QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? | Mar 28, 2025 | Logical Reasoning, Math | Code Available | 1 |
| ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models | Mar 27, 2025 | Math | Code Available | 1 |
| LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation | Mar 25, 2025 | Code Completion, Language Modeling | Code Available | 1 |
| EXAONE Deep: Reasoning Enhanced Language Models | Mar 16, 2025 | Math | Code Available | 1 |
| VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search | Mar 13, 2025 | Image Retrieval, Math | Code Available | 1 |
| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | Chatbot, Language Modeling | Code Available | 1 |
| PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models | Mar 4, 2025 | GSM8K, Math | Code Available | 1 |
| FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Feb 27, 2025 | GSM8K, Math | Code Available | 1 |
| Self-Training Elicits Concise Reasoning in Large Language Models | Feb 27, 2025 | GSM8K, In-Context Learning | Code Available | 1 |
| Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? | Feb 26, 2025 | Math | Code Available | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code Completion, Math | Code Available | 1 |
| Reasoning with Reinforced Functional Token Tuning | Feb 19, 2025 | Math | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Thinking Preference Optimization | Feb 17, 2025 | Math | Code Available | 1 |