SOTAVerified

Logical Reasoning

Papers

Showing 201-250 of 747 papers

Title | Status | Hype
LLM-Aided Efficient Hardware Design Automation | | 0
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks | | 0
Aligning CodeLLMs with Direct Preference Optimization | | 0
MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures | Code | 0
Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology | | 0
Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation | Code | 0
From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition | Code | 0
Exploiting LLMs' Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval | | 0
"Let's Argue Both Sides": Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities | | 0
Boosting Deductive Reasoning with Step Signals In RLHF | | 0
Transformer-based Language Models for Reasoning in the Description Logic ALCQ | | 0
A Systematic Assessment of OpenAI o1-Preview for Higher Order Thinking in Education | | 0
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains | | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | | 0
HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction | Code | 0
KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data | | 0
Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning | Code | 1
Automatic Curriculum Expert Iteration for Reliable LLM Reasoning | Code | 1
Think Beyond Size: Adaptive Prompting for More Effective Reasoning | | 0
Can Transformers Reason Logically? A Study in SAT Solving | | 0
Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? | Code | 0
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles | Code | 2
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | Code | 1
Latent Feature Mining for Predictive Model Enhancement with Large Language Models | | 0
Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification | Code | 0
Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model | | 0
Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review | Code | 2
GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning | | 0
CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning | | 0
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning | Code | 1
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data | Code | 0
Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation | | 0
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models | Code | 0
Judgment of Thoughts: Courtroom of the Binary Logical Reasoning in Large Language Models | | 0
Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification | Code | 0
LTNtorch: PyTorch Implementation of Logic Tensor Networks | Code | 2
Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension | Code | 0
GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion | | 0
Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data | Code | 0
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning | Code | 0
ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering | | 0
Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving | | 0
Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator | | 0
KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models | | 0
CauseJudger: Identifying the Cause with LLMs for Abductive Logical Reasoning | | 0
Action is the primary key: a categorical framework for episode description and logical reasoning | | 0
VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning | Code | 1
Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0
LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments | | 0
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Code | 1
Page 5 of 15

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified