SOTAVerified

Logical Reasoning

Papers

Showing 451-500 of 747 papers

Title | Status | Hype
On the Potential of CLIP for Compositional Logical Reasoning | | 0
OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving? | | 0
Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation | | 0
P3: A Policy-Driven, Pace-Adaptive, and Diversity-Promoted Framework for data pruning in LLM Training | | 0
Pathformer: Recursive Path Query Encoding for Complex Logical Query Answering | | 0
PathReasoner: Modeling Reasoning Path with Equivalent Extension for Logical Question Answering | | 0
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains | | 0
Physics of Language Models: Part 3.2, Knowledge Manipulation | | 0
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs | | 0
POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications | | 0
Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | | 0
ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering | | 0
Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent | | 0
Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling | | 0
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving | | 0
Puzzle Solving using Reasoning of Large Language Models: A Survey | | 0
Quantifying Adaptability in Pre-trained Language Models with 500 Tasks | | 0
Quantifying Logical Consistency in Transformers via Query-Key Alignment | | 0
Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach | | 0
Quantum Structure in Cognition and the Foundations of Human Reasoning | | 0
Quantum Structure of Negation and Conjunction in Human Thought | | 0
Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding | | 0
Reasoning Algorithmically in Graph Neural Networks | | 0
Reasoning-Aware Query-Focused Summarization over Multi-Table Data | | 0
Reasoning in Neurosymbolic AI | | 0
Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts | | 0
Reasoning in Vector Space: An Exploratory Study of Question Answering | | 0
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation | | 0
Reasoning Like Program Executors | | 0
Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification | | 0
Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs | | 0
Reasoning over Logically Interacted Conditions for Question Answering | | 0
Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning | | 0
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog | | 0
Reduced Implication-bias Logic Loss for Neuro-Symbolic Learning | | 0
Retrieval-Augmented Neural Response Generation Using Logical Reasoning and Relevance Scoring | | 0
Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | | 0
Reverse Thinking Makes LLMs Stronger Reasoners | | 0
RLSF: Reinforcement Learning via Symbolic Feedback | | 0
Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning | | 0
S^2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency | | 0
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | | 0
Scales and Hedges in a Logic with Analogous Semantics | | 0
Scallop: A Language for Neurosymbolic Programming | | 0
Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning | | 0
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning | | 0
Hint of Thought prompting: an explainable and zero-shot approach to reasoning tasks with LLMs | | 0
SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment | | 0
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning | | 0
Page 10 of 15

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified