SOTAVerified

Logical Reasoning

Papers

Showing 401-450 of 747 papers

| Title | Status | Hype |
| --- | --- | --- |
| Why should we ever automate moral decision making? | | 0 |
| Analyzing Large language models chatbots: An experimental approach using a probability test | | 0 |
| Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games | | 0 |
| Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring | | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | | 0 |
| Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism | | 0 |
| LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic | | 0 |
| Large Language Models Are Cross-Lingual Knowledge-Free Reasoners | Code | 0 |
| Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models | Code | 0 |
| Imperative Learning: A Self-supervised Neuro-Symbolic Learning Framework for Robot Autonomy | | 0 |
| Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference | | 0 |
| Pathformer: Recursive Path Query Encoding for Complex Logical Query Answering | | 0 |
| The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing | | 0 |
| Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models | Code | 0 |
| Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | | 0 |
| Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars | Code | 0 |
| City-LEO: Toward Transparent City Management Using LLM with End-to-End Optimization | | 0 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0 |
| Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ? | Code | 0 |
| Large Language Models are Limited in Out-of-Context Knowledge Reasoning | Code | 0 |
| Improving Multi-hop Logical Reasoning in Knowledge Graphs with Context-Aware Query Representation Learning | Code | 0 |
| LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages | Code | 0 |
| LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning | | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | | 0 |
| On the Hardness of Probabilistic Neurosymbolic Learning | Code | 0 |
| How Truncating Weights Improves Reasoning in Language Models | | 0 |
| Bi-Chainer: Automated Large Language Models Reasoning with Bidirectional Chaining | | 0 |
| Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks | | 0 |
| Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities | Code | 0 |
| A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI | | 0 |
| Logical Reasoning with Relation Network for Inductive Knowledge Graph Completion | | 0 |
| Brainstorming Brings Power to Large Language Models of Knowledge Reasoning | | 0 |
| A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters | Code | 0 |
| PathReasoner: Modeling Reasoning Path with Equivalent Extension for Logical Question Answering | | 0 |
| Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? | | 0 |
| RLSF: Reinforcement Learning via Symbolic Feedback | | 0 |
| Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning | | 0 |
| LLM+Reasoning+Planning for supporting incomplete user queries in presence of APIs | | 0 |
| STAR: A Benchmark for Situated Reasoning in Real-World Videos | | 0 |
| MetaReflection: Learning Instructions for Language Agents using Past Reflections | | 0 |
| MathDivide: Improved mathematical reasoning by large language models | | 0 |
| Logical Negation Augmenting and Debiasing for Prompt-based Methods | | 0 |
| Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics | Code | 0 |
| Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning | | 0 |
| SuperCLUE-Fin: Graded Fine-Grained Analysis of Chinese LLMs on Diverse Financial Tasks and Applications | | 0 |
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | 0 |
| Aligning Knowledge Graphs Provided by Humans and Generated from Neural Networks in Specific Tasks | Code | 0 |
| Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs | | 0 |
| Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding | | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | | 0 |

Benchmark Results

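| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified |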

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified |
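
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified |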
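
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified |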
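
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Human benchmark | Accuracy | 83.7 | | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | | Unverified |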
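
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Human benchmark | Accuracy | 87 | | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified |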

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified |