SOTAVerified

Logical Reasoning

Papers

Showing 301–350 of 747 papers

Title | Status | Hype
Improving Multi-hop Logical Reasoning in Knowledge Graphs with Context-Aware Query Representation Learning | Code | 0
Large Language Models are Limited in Out-of-Context Knowledge Reasoning | Code | 0
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages | Code | 0
Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples | Code | 2
LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning | - | 0
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | - | 0
LogiCode: an LLM-Driven Framework for Logical Anomaly Detection | Code | 1
On the Hardness of Probabilistic Neurosymbolic Learning | Code | 0
Evaluating the World Model Implicit in a Generative Model | Code | 2
Bi-Chainer: Automated Large Language Models Reasoning with Bidirectional Chaining | - | 0
How Truncating Weights Improves Reasoning in Language Models | - | 0
Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks | - | 0
Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities | Code | 0
A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI | - | 0
Logical Reasoning with Relation Network for Inductive Knowledge Graph Completion | - | 0
Brainstorming Brings Power to Large Language Models of Knowledge Reasoning | - | 0
A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters | Code | 0
Easy Problems That LLMs Get Wrong | Code | 2
PathReasoner: Modeling Reasoning Path with Equivalent Extension for Logical Question Answering | - | 0
Faithful Logical Reasoning via Symbolic Chain-of-Thought | Code | 3
Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? | - | 0
RLSF: Reinforcement Learning via Symbolic Feedback | - | 0
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving | Code | 2
Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning | - | 0
LLM+Reasoning+Planning for supporting incomplete user queries in presence of APIs | - | 0
STAR: A Benchmark for Situated Reasoning in Real-World Videos | - | 0
MetaReflection: Learning Instructions for Language Agents using Past Reflections | - | 0
MathDivide: Improved mathematical reasoning by large language models | - | 0
Logical Negation Augmenting and Debiasing for Prompt-based Methods | - | 0
Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics | Code | 0
Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning | - | 0
SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications | - | 0
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | - | 0
Aligning Knowledge Graphs Provided by Humans and Generated from Neural Networks in Specific Tasks | Code | 0
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models | Code | 1
Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs | - | 0
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Code | 2
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | - | 0
Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding | - | 0
I-Design: Personalized LLM Interior Designer | - | 0
Language Model Guided Interpretable Video Action Reasoning | Code | 0
Advancing LLM Reasoning Generalists with Preference Trees | Code | 3
Classifying Conspiratorial Narratives At Scale: False Alarms and Erroneous Connections | Code | 0
Sphere Neural-Networks for Rational Reasoning | - | 0
LeanReasoner: Boosting Complex Logical Reasoning with Lean | Code | 1
Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs | - | 0
Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts | - | 0
Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern Organizations | Code | 0
Learning Guided Automated Reasoning: A Brief Survey | - | 0
Fuzzy Datalog^ over Arbitrary t-Norms | - | 0
Page 7 of 15

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | - | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | - | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | - | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | - | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | - | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | - | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | - | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | - | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | - | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | - | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | - | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | - | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | - | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | - | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | - | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | - | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | - | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | - | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | - | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | - | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | - | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | - | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | - | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | - | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | - | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | - | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | - | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | - | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | - | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | - | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | - | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | - | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | - | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | - | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | - | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | - | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | - | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | - | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | - | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | - | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | - | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | - | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | - | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | - | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | - | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | - | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | - | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | - | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | - | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | - | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | - | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | - | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | - | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | - | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | - | Unverified