SOTAVerified

Logical Reasoning

Papers

Showing 701-747 of 747 papers

Title | Status | Hype
On the Hardness of Probabilistic Neurosymbolic Learning | Code | 0
On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems | Code | 0
Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash | Code | 0
ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding | Code | 0
Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Code | 0
Ontology Reasoning with Deep Neural Networks | Code | 0
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles | Code | 0
Zero-Shot Classification by Logical Reasoning on Natural Language Explanations | Code | 0
Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks | Code | 0
Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data | Code | 0
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models | Code | 0
Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts? | Code | 0
Adaptive Rectification Sampling for Test-Time Compute Scaling | Code | 0
Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern Organizations | Code | 0
Improving Certified Robustness via Statistical Learning with Logical Reasoning | Code | 0
A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters | Code | 0
Empower Nested Boolean Logic via Self-Supervised Curriculum Learning | Code | 0
POE: Process of Elimination for Multiple Choice Reasoning | Code | 0
Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension | Code | 0
Deep Manifold Learning for Reading Comprehension and Logical Reasoning Tasks with Polytuplet Loss | Code | 0
Empowering Few-Shot Recommender Systems with Large Language Models -- Enhanced Representations | Code | 0
Probabilistic Sufficient Explanations | Code | 0
Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter? | Code | 0
Three tiers of computation in transformers and in brain architectures | Code | 0
Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification | Code | 0
What Makes Reading Comprehension Questions Difficult? | Code | 0
V-LoL: A Diagnostic Dataset for Visual Logical Learning | Code | 0
EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning | Code | 0
Volta at SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables using TAPAS and Transfer Learning | Code | 0
Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision? | Code | 0
Can recursive neural tensor networks learn logical reasoning? | Code | 0
Sudoku-Bench: Evaluating creative reasoning with Sudoku variants | Code | 0
Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics | Code | 0
Double Equivariance for Inductive Link Prediction for Both New Nodes and New Relation Types | Code | 0
Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | Code | 0
SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis | Code | 0
Document-level Biomedical Relation Extraction Based on Multi-Dimensional Fusion Information and Multi-Granularity Logical Reasoning | Code | 0
Query Structure Modeling for Inductive Logical Reasoning Over Knowledge Graphs | Code | 0
Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study | Code | 0
Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities | Code | 0
DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Code | 0
DeepLogic: Towards End-to-End Differentiable Logical Reasoning | Code | 0
Matrix Shuffle-Exchange Networks for Hard 2D Tasks | Code | 0
A Neural-Symbolic Approach to Natural Language Understanding | Code | 0
Semantic RL with Action Grammars: Data-Efficient Learning of Hierarchical Task Abstractions | Code | 0
Reasoning Capabilities and Invariability of Large Language Models | Code | 0
Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 |  | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 |  | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 |  | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 |  | Unverified
5 | Command R+ | Delta_NoContext | 11.6 |  | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 |  | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 |  | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 |  | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 |  | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 |  | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 |  | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 |  | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 |  | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 |  | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 |  | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 |  | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 |  | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 |  | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 |  | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 |  | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 |  | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 |  | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 |  | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 |  | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 |  | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 |  | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 |  | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 |  | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 |  | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 |  | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 |  | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 |  | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 |  | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 |  | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 |  | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 |  | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 |  | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 |  | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 |  | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 |  | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 |  | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 |  | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 |  | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 |  | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 |  | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 |  | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 |  | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 |  | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 |  | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 |  | Unverified
4 | RuGPT-3 Small | Accuracy | 34 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 |  | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 |  | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 |  | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 |  | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 |  | Unverified