SOTAVerified

Math

Papers

Showing 751-800 of 1596 papers

Title | Status | Hype
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Code | 1
A LLM Benchmark based on the Minecraft Builder Dialog Agent Task | - | 0
CCoE: A Compact LLM with Collaboration of Experts | - | 0
Reasoning with Large Language Models, a Survey | - | 0
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | - | 0
TelecomGPT: A Framework to Build Telecom-Specfic Large Language Models | - | 0
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors | Code | 0
AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models | Code | 1
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | - | 0
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | - | 0
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | Code | 4
ConvNLP: Image-based AI Text Detection | - | 0
Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models | Code | 0
Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns? | Code | 0
Smart Vision-Language Reasoners | Code | 0
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Code | 1
Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior | Code | 0
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | Code | 2
Eliminating Position Bias of Language Models: A Mechanistic Approach | Code | 1
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Code | 1
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | - | 0
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models | - | 0
ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting | - | 0
LiveBench: A Challenging, Contamination-Limited LLM Benchmark | Code | 5
DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Code | 0
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Code | 2
Task Oriented In-Domain Data Augmentation | - | 0
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs | Code | 1
Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions | - | 0
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold | Code | 1
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Code | 1
Towards Infinite-Long Prefix in Transformer | Code | 0
CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Code | 1
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning | - | 0
Adaptable Logical Control for Large Language Models | Code | 2
Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever | - | 0
Can LLMs Reason in the Wild with Programs? | Code | 0
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving | Code | 2
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14
Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems | - | 0
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts | - | 0
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Code | 1
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Code | 9
GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation | Code | 0
Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | - | 0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | - | 0
Page 16 of 32

No leaderboard results yet.