SOTAVerified

Logical Reasoning

Papers

Showing 101–150 of 747 papers

Title | Status | Hype
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs | Code | 1
Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Code | 1
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
LogiCode: an LLM-Driven Framework for Logical Anomaly Detection | Code | 1
HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models | Code | 1
Advancing Abductive Reasoning in Knowledge Graphs through Complex Logical Hypothesis Generation | Code | 1
Neural Collaborative Reasoning | Code | 1
LogiCoT: Logical Chain-of-Thought Instruction-Tuning | Code | 1
Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming | Code | 1
Certified Deductive Reasoning with Language Models | Code | 1
Chain of Images for Intuitively Reasoning | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Code | 1
Automatic Curriculum Expert Iteration for Reliable LLM Reasoning | Code | 1
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models | Code | 1
A Neuro-vector-symbolic Architecture for Solving Raven's Progressive Matrices | Code | 1
Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text | Code | 1
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models | Code | 1
ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models | Code | 1
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Code | 1
CHECKWHY: Causal Fact Verification via Argument Structure | Code | 1
Logical Message Passing Networks with One-hop Inference on Atomic Formulas | Code | 1
Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace | Code | 1
Logical Neural Networks | Code | 1
AbductionRules: Training Transformers to Explain Unexpected Inputs | Code | 1
Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning | Code | 1
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers | Code | 1
Logic and the 2-Simplicial Transformer | Code | 1
LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Code | 1
Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic | Code | 1
Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning | Code | 1
LeanReasoner: Boosting Complex Logical Reasoning with Lean | Code | 1
Learning to Reason via Mixture-of-Thought for Logical Reasoning | Code | 1
Conditional and Modal Reasoning in Large Language Models | Code | 1
Large Language Models for Planning: A Comprehensive and Systematic Survey | Code | 1
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation | Code | 1
ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers | Code | 1
Discriminative Reasoning for Document-level Relation Extraction | Code | 1
Complex Logical Reasoning over Knowledge Graphs using Large Language Models | Code | 1
Deductive Verification of Chain-of-Thought Reasoning | Code | 1
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles | Code | 1
Improving Large Language Models in Event Relation Logical Prediction | Code | 1
Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning | Code | 1
DAGN: Discourse-Aware Graph Network for Logical Reasoning | Code | 1
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
COLLIE: Systematic Construction of Constrained Text Generation Tasks | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified