SOTAVerified

Logical Reasoning

Papers

Showing 251–300 of 747 papers

| Title | Status | Hype |
| --- | --- | --- |
| Towards Superior Quantization Accuracy: A Layer-sensitive Approach | | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Code | 0 |
| The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence | | 0 |
| DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL | | 0 |
| HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks | | 0 |
| Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling | | 0 |
| Three tiers of computation in transformers and in brain architectures | Code | 0 |
| DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Code | 0 |
| KGCompiler: Deep Learning Compilation Optimization for Knowledge Graph Complex Logical Query Answering | | 0 |
| HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs | | 0 |
| Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation | | 0 |
| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | | 0 |
| TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning | Code | 0 |
| Logic Haystacks: Probing LLMs Long-Context Logical Reasoning (Without Easily Identifiable Unrelated Padding) | | 0 |
| Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning | | 0 |
| Autoregressive Image Generation Guided by Chains of Thought | | 0 |
| Quantifying Logical Consistency in Transformers via Query-Key Alignment | | 0 |
| Empowering LLMs with Logical Reasoning: A Comprehensive Survey | | 0 |
| Identifying Features that Shape Perceived Consciousness in Large Language Model-based AI: A Quantitative Study of Human Responses | | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | | 0 |
| On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems | Code | 0 |
| A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | | 0 |
| SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin | | 0 |
| Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | | 0 |
| HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation | | 0 |
| Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0 |
| Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs | | 0 |
| Dialogue-based Explanations for Logical Reasoning using Structured Argumentation | | 0 |
| Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis | | 0 |
| The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models | | 0 |
| Logical Reasoning in Large Language Models: A Survey | | 0 |
| Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York | | 0 |
| Logical forms complement probability in understanding language model (and human) performance | | 0 |
| DMWM: Dual-Mind World Model with Long-Term Imagination | | 0 |
| Structural Reformation of Large Language Model Neuron Encapsulation for Divergent Information Aggregation | | 0 |
| S^2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency | | 0 |
| SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs | | 0 |
| Standard Neural Computation Alone Is Insufficient for Logical Intelligence | | 0 |
| Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs | | 0 |
| ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning | | 0 |
| Enhancing Large Language Model Efficiency via Symbolic Compression: A Formal Approach Towards Interpretability | | 0 |
| Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers | | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | | 0 |
| DBRouting: Routing End User Queries to Databases for Answerability | | 0 |
| SedarEval: Automated Evaluation using Self-Adaptive Rubrics | Code | 0 |
| A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models | | 0 |
| JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models | Code | 0 |
| VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning | | 0 |
| Assessing the Alignment of FOL Closeness Metrics with Human Judgement | Code | 0 |
| Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning | | 0 |

Benchmark Results

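| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Human benchmark | Accuracy | 83.7 | | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Human benchmark | Accuracy | 87 | | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified |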