SOTAVerified

Math

Papers

Showing 251–300 of 1596 papers

Title | Status | Hype
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | Code | 1
Brilla AI: AI Contestant for the National Science and Maths Quiz | Code | 1
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem | Code | 1
Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems | Code | 1
GOLD: Geometry Problem Solver with Natural Language Description | Code | 1
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization | Code | 1
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports | Code | 1
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations | Code | 1
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving | Code | 1
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | Code | 1
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models | Code | 1
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning | Code | 1
Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models | Code | 1
Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching | Code | 1
Get an A in Math: Progressive Rectification Prompting | Code | 1
Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions | Code | 1
Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1
From Zero to Hero: Convincing with Extremely Complicated Math | Code | 1
From GAN to WGAN | Code | 1
MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents | Code | 1
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Code | 1
An In-depth Look at Gemini's Language Abilities | Code | 1
Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs | Code | 1
MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations | Code | 1
MathPrompter: Mathematical Reasoning using Large Language Models | Code | 1
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Code | 1
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Code | 1
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Code | 1
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Code | 1
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes | Code | 1
FormulaNet: A Benchmark Dataset for Mathematical Formula Detection | Code | 1
Fine-Tuning Large Language Models on Quantum Optimization Problems for Circuit Generation | Code | 1
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | Code | 1
Math Word Problem Solving with Explicit Numerical Values | Code | 1
A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level | Code | 1
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems | Code | 1
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start | Code | 1
Expression Syntax Information Bottleneck for Math Word Problems | Code | 1
Mathematical Capabilities of ChatGPT | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
MathChat: Converse to Tackle Challenging Math Problems with LLM Agents | Code | 1
MathGloss: Building mathematical glossaries from text | Code | 1
BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search | Code | 1
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
EXAONE Deep: Reasoning Enhanced Language Models | Code | 1
Explaining Datasets in Words: Statistical Models with Natural Language Parameters | Code | 1
An Early Evaluation of GPT-4V(ision) | Code | 1
Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective | Code | 1
Page 6 of 32

No leaderboard results yet.