SOTAVerified|Agents Browse Leaderboard About

Logical Reasoning

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 391–400 of 747 papers

Title	Date	Tasks	Status	Hype	Score
Towards Unifying Logical Entailment and Statistical Estimation	Feb 27, 2022	Formal LogicLogical Reasoning	—Unverified	0	0
Towards Unifying Perceptual Reasoning and Logical Reasoning	Jun 27, 2022	Bayesian InferenceLogical Reasoning	—Unverified	0	0
To What Extent Do Natural Language Understanding Datasets Correlate to Logical Reasoning? A Method for Diagnosing Logical Reasoning.	Oct 1, 2022	DiagnosticLogical Reasoning	—Unverified	0	0
Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction	Jan 28, 2025	Logical ReasoningMultiple-choice	—Unverified	0	0
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs	Feb 17, 2025	In-Context LearningLogical Reasoning	—Unverified	0	0
Transformer-based Language Models for Reasoning in the Description Logic ALCQ	Oct 12, 2024	Logical Reasoning	—Unverified	0	0
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests	Feb 20, 2025	Logical ReasoningMMLU	—Unverified	0	0
Truth Table Deep Convolutional Neural Network, A New SAT-Encodable Architecture - Application To Complete Robustness	Sep 29, 2021	Explainable Artificial Intelligence (XAI)Explanation Generation	—Unverified	0	0
A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables	Aug 18, 2022	FairnessLogical Reasoning	—Unverified	0	0
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games	Jun 11, 2025	Logical ReasoningMath	—Unverified	0	0

Show:10 25 50

← PrevPage 40 of 75Next →

All datasets LingOly BIG-bench (Formal Fallacies Syllogisms Negation)BIG-bench (Penguins In A Table)BIG-bench (Reasoning About Colored Objects)BIG-bench (Temporal Sequences)BIG-bench (Logic Grid Puzzle)BIG-bench (StrategyQA)RuWorldTree Winograd Automatic BIG-bench (Logical Fallacy Detection)

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Claude Opus	Delta_NoContext	28.8	—	Unverified
2	GPT-4o	Delta_NoContext	25.1	—	Unverified
3	Gemini 1.5 Pro	Delta_NoContext	23.4	—	Unverified
4	GPT-4	Delta_NoContext	21.5	—	Unverified
5	Command R+	Delta_NoContext	11.6	—	Unverified
6	GPT-3.5	Delta_NoContext	11.2	—	Unverified
7	Mixtral 8x7B	Delta_NoContext	6.4	—	Unverified
8	Llama 3 8B	Delta_NoContext	4.9	—	Unverified
9	Llama 3 70B	Delta_NoContext	2.9	—	Unverified
10	Gemma 7B	Delta_NoContext	2.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLM 2 (few-shot, k=3, Direct)	Accuracy	64.8	—	Unverified
2	PaLM 2 (few-shot, k=3, CoT)	Accuracy	57.2	—	Unverified
3	OPT 66B (few-shot, k=3)	Accuracy	54	—	Unverified
4	PaLM 540B (few-shot, k=3)	Accuracy	53.6	—	Unverified
5	GPT-NeoX 20B (few-shot, k=3)	Accuracy	52.8	—	Unverified
6	BLOOM 176B (few-shot, k=3)	Accuracy	52.8	—	Unverified
7	Chinchilla-70B (few-shot, k=5)	Accuracy	52.1	—	Unverified
8	Bloomberg GPT 50B (few-shot, k=3)	Accuracy	50.8	—	Unverified
9	Gopher-280B (few-shot, k=5)	Accuracy	50.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLM 2 (few-shot, k=3, CoT)	Accuracy	84.9	—	Unverified
2	PaLM 2 (few-shot, k=3, Direct)	Accuracy	65.8	—	Unverified
3	Chinchilla-70B (few-shot, k=5)	Accuracy	48.7	—	Unverified
4	PaLM 540B (few-shot, k=3)	Accuracy	44.5	—	Unverified
5	Gopher-280B (few-shot, k=5)	Accuracy	40.6	—	Unverified
6	BLOOM 176B (few-shot, k=3)	Accuracy	40.41	—	Unverified
7	Bloomberg GPT (few-shot, k=3)	Accuracy	37.67	—	Unverified
8	GPT-NeoX (few-shot, k=3)	Accuracy	33.56	—	Unverified
9	OPT 66B (few-shot, k=3)	Accuracy	28.08	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLM 2 (few-shot, k=3, CoT)	Accuracy	91.2	—	Unverified
2	PaLM 2 (few-shot, k=3, Direct)	Accuracy	61.2	—	Unverified
3	Chinchilla-70B (few-shot, k=5)	Accuracy	59.7	—	Unverified
4	Gopher-280B (few-shot, k=5)	Accuracy	49.2	—	Unverified
5	PaLM 540B (few-shot, k=3)	Accuracy	38	—	Unverified
6	BLOOM 176B (few-shot, k=3)	Accuracy	36.8	—	Unverified
7	Bloomberg GPT (few-shot, k=3)	Accuracy	34.8	—	Unverified
8	OPT 66B (few-shot, k=3)	Accuracy	31.2	—	Unverified
9	GPT-NeoX (few-shot, k=3)	Accuracy	26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLM 2 (few-shot, k=3, CoT)	Accuracy	100	—	Unverified
2	PaLM 2 (few-shot, k=3, Direct)	Accuracy	96.4	—	Unverified
3	PaLM 540B (few-shot, k=3)	Accuracy	39.6	—	Unverified
4	BLOOM 176B (few-shot, k=3)	Accuracy	36.8	—	Unverified
5	Chinchilla-70B (few-shot, k=5)	Accuracy	32	—	Unverified
6	Bloomberg GPT (few-shot, k=3)	Accuracy	29.2	—	Unverified
7	OPT 66B (few-shot, k=3)	Accuracy	23.6	—	Unverified
8	GPT-NeoX (few-shot, k=3)	Accuracy	21.2	—	Unverified
9	Gopher-280B (few-shot, k=5)	Accuracy	19	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Chinchilla-70B (few-shot, k=5)	Accuracy	44	—	Unverified
2	PaLM-540B (few-shot, k=5)	Accuracy	42.4	—	Unverified
3	PaLM-62B (few-shot, k=5)	Accuracy	36.5	—	Unverified
4	Gopher-280B (few-shot, k=5)	Accuracy	35.1	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLM-540B (few-shot, k=5)	Accuracy	73.9	—	Unverified
2	Chinchilla-70B (few-shot, k=5)	Accuracy	68.3	—	Unverified
3	PaLM-62B (few-shot, k=5)	Accuracy	65.4	—	Unverified
4	Gopher-280B (few-shot, k=5)	Accuracy	61	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Human benchmark	Accuracy	83.7	—	Unverified
2	RuGPT-3 Large	Accuracy	40.7	—	Unverified
3	RuGPT-3 Medium	Accuracy	38	—	Unverified
4	RuGPT-3 Small	Accuracy	34	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Human benchmark	Accuracy	87	—	Unverified
2	RuGPT-3 Small	Accuracy	57.9	—	Unverified
3	RuGPT-3 Medium	Accuracy	57.2	—	Unverified
4	RuGPT-3 Large	Accuracy	55.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Chinchilla-70B (few-shot, k=5)	Accuracy	72.1	—	Unverified
2	Gopher-280B (few-shot, k=5)	Accuracy	58.9	—	Unverified