SOTAVerified

Math

Papers

Showing 151-200 of 1596 papers

Title | Status | Hype
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | Code | 2
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Code | 2
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling | Code | 2
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | Code | 2
Offline Reinforcement Learning for LLM Multi-Step Reasoning | Code | 2
ProcessBench: Identifying Process Errors in Mathematical Reasoning | Code | 2
Preference Optimization for Reasoning with Pseudo Feedback | Code | 2
LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training | Code | 2
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus | Code | 2
Flaming-hot Initiation with Regular Execution Sampling for Large Language Models | Code | 2
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch | Code | 2
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model | Code | 2
JudgeBench: A Benchmark for Evaluating LLM-based Judges | Code | 2
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization | Code | 2
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models | Code | 2
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Code | 2
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2
Steering Large Language Models between Code Execution and Textual Reasoning | Code | 2
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | Code | 2
Archon: An Architecture Search Framework for Inference-Time Techniques | Code | 2
Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Code | 2
Training Language Models to Self-Correct via Reinforcement Learning | Code | 2
VAE Explainer: Supplement Learning Variational Autoencoders with Interactive Visualization | Code | 2
CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models | Code | 2
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Code | 2
Weak-to-Strong Reasoning | Code | 2
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | Code | 2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Code | 2
Adaptable Logical Control for Large Language Models | Code | 2
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Code | 2
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models | Code | 2
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning | Code | 2
Yuan 2.0-M32: Mixture of Experts with Attention Router | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
Autoformalizing Euclidean Geometry | Code | 2
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | Code | 2
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark | Code | 2
Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards | Code | 2
Evaluating Mathematical Reasoning Beyond Accuracy | Code | 2
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Code | 2
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline | Code | 2
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision | Code | 2
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap | Code | 2
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers | Code | 2
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Code | 2
Reformatted Alignment | Code | 2
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic | Code | 2

No leaderboard results yet.