SOTAVerified

Logical Reasoning

Papers

Showing 150 of 747 papers

TitleStatusHype
Scaling Synthetic Data Creation with 1,000,000,000 PersonasCode11
NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?Code9
LLaVA-CoT: Let Vision Language Models Reason Step-by-StepCode7
PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented GenerationCode7
Training Compute-Optimal Large Language ModelsCode6
SGLang: Efficient Execution of Structured Language Model ProgramsCode6
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language ModelsCode5
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGICode5
Training Large Language Models to Reason in a Continuous Latent SpaceCode5
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by TencentCode5
From System 1 to System 2: A Survey of Reasoning Large Language ModelsCode5
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and ReasoningCode4
R1-Onevision:An Open-Source Multimodal Large Language Model Capable of Deep ReasoningCode4
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RLCode4
Knowledge Fusion of Large Language ModelsCode4
Reasoning with Language Model Prompting: A SurveyCode3
Chameleon: Plug-and-Play Compositional Reasoning with Large Language ModelsCode3
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic EvaluationsCode3
Advancing LLM Reasoning Generalists with Preference TreesCode3
Faithful Logical Reasoning via Symbolic Chain-of-ThoughtCode3
Measuring AI Ability to Complete Long TasksCode3
LLM4Drive: A Survey of Large Language Models for Autonomous DrivingCode3
A Survey on Large Language Model Acceleration based on KV Cache ManagementCode3
Scaling Language Models: Methods, Analysis & Insights from Training GopherCode2
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and BeyondCode2
Cumulative Reasoning with Large Language ModelsCode2
Ontology Embedding: A Survey of Methods, Applications and ResourcesCode2
Easy Problems That LLMs Get WrongCode2
MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic WorkflowCode2
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks AutomationCode2
PaLM: Scaling Language Modeling with PathwaysCode2
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical ReasoningCode2
LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process ThinkingCode2
LTNtorch: PyTorch Implementation of Logic Tensor NetworksCode2
LangBridge: Multilingual Reasoning Without Multilingual SupervisionCode2
Chain-of-Thought for Autonomous Driving: A Comprehensive Survey and Future ProspectsCode2
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical ReasoningCode2
Large Language Models are Zero-Shot ReasonersCode2
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical ProblemsCode2
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMsCode2
Flow of Reasoning:Training LLMs for Divergent Problem Solving with Minimal ExamplesCode2
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1Code2
FlashRNN: Optimizing Traditional RNNs on Modern HardwareCode2
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor GenerationCode2
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic CorpusCode2
Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-ReviewCode2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual EditingCode2
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous DrivingCode2
Evaluating the World Model Implicit in a Generative ModelCode2
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative ReasonersCode2
Show:102550
← PrevPage 1 of 15Next →

Benchmark Results

#ModelMetricClaimedVerifiedStatus
1Claude OpusDelta_NoContext28.8Unverified
2GPT-4oDelta_NoContext25.1Unverified
3Gemini 1.5 ProDelta_NoContext23.4Unverified
4GPT-4Delta_NoContext21.5Unverified
5Command R+Delta_NoContext11.6Unverified
6GPT-3.5Delta_NoContext11.2Unverified
7Mixtral 8x7BDelta_NoContext6.4Unverified
8Llama 3 8BDelta_NoContext4.9Unverified
9Llama 3 70BDelta_NoContext2.9Unverified
10Gemma 7BDelta_NoContext2.2Unverified
#ModelMetricClaimedVerifiedStatus
1PaLM 2 (few-shot, k=3, Direct)Accuracy64.8Unverified
2PaLM 2 (few-shot, k=3, CoT)Accuracy57.2Unverified
3OPT 66B (few-shot, k=3)Accuracy54Unverified
4PaLM 540B (few-shot, k=3)Accuracy53.6Unverified
5GPT-NeoX 20B (few-shot, k=3)Accuracy52.8Unverified
6BLOOM 176B (few-shot, k=3)Accuracy52.8Unverified
7Chinchilla-70B (few-shot, k=5)Accuracy52.1Unverified
8Bloomberg GPT 50B (few-shot, k=3)Accuracy50.8Unverified
9Gopher-280B (few-shot, k=5)Accuracy50.7Unverified
#ModelMetricClaimedVerifiedStatus
1PaLM 2 (few-shot, k=3, CoT)Accuracy84.9Unverified
2PaLM 2 (few-shot, k=3, Direct)Accuracy65.8Unverified
3Chinchilla-70B (few-shot, k=5)Accuracy48.7Unverified
4PaLM 540B (few-shot, k=3)Accuracy44.5Unverified
5Gopher-280B (few-shot, k=5)Accuracy40.6Unverified
6BLOOM 176B (few-shot, k=3)Accuracy40.41Unverified
7Bloomberg GPT (few-shot, k=3)Accuracy37.67Unverified
8GPT-NeoX (few-shot, k=3)Accuracy33.56Unverified
9OPT 66B (few-shot, k=3)Accuracy28.08Unverified
#ModelMetricClaimedVerifiedStatus
1PaLM 2 (few-shot, k=3, CoT)Accuracy91.2Unverified
2PaLM 2 (few-shot, k=3, Direct)Accuracy61.2Unverified
3Chinchilla-70B (few-shot, k=5)Accuracy59.7Unverified
4Gopher-280B (few-shot, k=5)Accuracy49.2Unverified
5PaLM 540B (few-shot, k=3)Accuracy38Unverified
6BLOOM 176B (few-shot, k=3)Accuracy36.8Unverified
7Bloomberg GPT (few-shot, k=3)Accuracy34.8Unverified
8OPT 66B (few-shot, k=3)Accuracy31.2Unverified
9GPT-NeoX (few-shot, k=3)Accuracy26Unverified
#ModelMetricClaimedVerifiedStatus
1PaLM 2 (few-shot, k=3, CoT)Accuracy100Unverified
2PaLM 2 (few-shot, k=3, Direct)Accuracy96.4Unverified
3PaLM 540B (few-shot, k=3)Accuracy39.6Unverified
4BLOOM 176B (few-shot, k=3)Accuracy36.8Unverified
5Chinchilla-70B (few-shot, k=5)Accuracy32Unverified
6Bloomberg GPT (few-shot, k=3)Accuracy29.2Unverified
7OPT 66B (few-shot, k=3)Accuracy23.6Unverified
8GPT-NeoX (few-shot, k=3)Accuracy21.2Unverified
9Gopher-280B (few-shot, k=5)Accuracy19Unverified
#ModelMetricClaimedVerifiedStatus
1Chinchilla-70B (few-shot, k=5)Accuracy44Unverified
2PaLM-540B (few-shot, k=5)Accuracy42.4Unverified
3PaLM-62B (few-shot, k=5)Accuracy36.5Unverified
4Gopher-280B (few-shot, k=5)Accuracy35.1Unverified
#ModelMetricClaimedVerifiedStatus
1PaLM-540B (few-shot, k=5)Accuracy73.9Unverified
2Chinchilla-70B (few-shot, k=5)Accuracy68.3Unverified
3PaLM-62B (few-shot, k=5)Accuracy65.4Unverified
4Gopher-280B (few-shot, k=5)Accuracy61Unverified
#ModelMetricClaimedVerifiedStatus
1Human benchmarkAccuracy 83.7Unverified
2RuGPT-3 LargeAccuracy 40.7Unverified
3RuGPT-3 MediumAccuracy 38Unverified
4RuGPT-3 SmallAccuracy 34Unverified
#ModelMetricClaimedVerifiedStatus
1Human benchmarkAccuracy87Unverified
2RuGPT-3 SmallAccuracy57.9Unverified
3RuGPT-3 MediumAccuracy57.2Unverified
4RuGPT-3 LargeAccuracy55.5Unverified
#ModelMetricClaimedVerifiedStatus
1Chinchilla-70B (few-shot, k=5)Accuracy72.1Unverified
2Gopher-280B (few-shot, k=5)Accuracy58.9Unverified