SOTAVerified

Math

Papers

Showing 351–400 of 1596 papers

Title | Status | Hype
MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents | Code | 1
Nerva: a Truly Sparse Implementation of Neural Networks | Code | 1
Toward Adaptive Reasoning in Large Language Models with Thought Rollback | Code | 1
Learning Goal-Conditioned Representations for Language Reward Models | Code | 1
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models | Code | 1
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Code | 1
Eliminating Position Bias of Language Models: A Mechanistic Approach | Code | 1
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Code | 1
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs | Code | 1
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold | Code | 1
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Code | 1
CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Code | 1
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Code | 1
Collective Constitutional AI: Aligning a Language Model with Public Input | Code | 1
DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning | Code | 1
TAIA: Large Language Models are Out-of-Distribution Data Learners | Code | 1
MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | Code | 1
JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
TANQ: An open domain dataset of table answered questions | Code | 1
VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context | Code | 1
GOLD: Geometry Problem Solver with Natural Language Description | Code | 1
PECC: Problem Extraction and Coding Challenges | Code | 1
AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation | Code | 1
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Code | 1
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
What is in Your Safe Data? Identifying Benign Data that Breaks Safety | Code | 1
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization | Code | 1
Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT Devices | Code | 1
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? | Code | 1
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models | Code | 1
Brilla AI: AI Contestant for the National Science and Maths Quiz | Code | 1
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning | Code | 1
Case-Based or Rule-Based: How Do Transformers Do the Math? | Code | 1
Stepwise Self-Consistent Mathematical Reasoning with Large Language Models | Code | 1
MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations | Code | 1
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models | Code | 1
Language Models as Science Tutors | Code | 1
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving | Code | 1
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data | Code | 1
Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation | Code | 1
MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models | Code | 1
ReGAL: Refactoring Programs to Discover Generalizable Abstractions | Code | 1
TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks | Code | 1
Over-Reasoning and Redundant Calculation of Large Language Models | Code | 1
Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning | Code | 1
