SOTAVerified

Logical Reasoning

Papers

Showing 551-600 of 747 papers

Title | Status | Hype
Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | | 0
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs | | 0
Transformer-based Language Models for Reasoning in the Description Logic ALCQ | | 0
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | | 0
Truth Table Deep Convolutional Neural Network, A New SAT-Encodable Architecture - Application To Complete Robustness | | 0
A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables | | 0
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games | | 0
Type-dependent Prompt CycleQAG : Cycle Consistency for Multi-hop Question Generation | | 0
Unifying Neural Learning and Symbolic Reasoning for Spinal Medical Report Generation | | 0
Unifying Structure Reasoning and Language Model Pre-training for Complex Reasoning | | 0
Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator | | 0
Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring | | 0
VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning | | 0
VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models | | 0
Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving | | 0
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | | 0
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | | 0
Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation | | 0
What is the Title of this Paper? Solving logic puzzles using algorithms | | 0
What Makes Machine Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types | | 0
What Makes Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types | | 0
Why should we ever automate moral decision making? | | 0
XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation | | 0
WatME: Towards Lossless Watermarking Through Lexical Redundancy | | 0
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning | | 0
Logic Pre-Training of Language Models | | 0
LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models | | 0
LogiGAN: Learning Logical Reasoning via Adversarial Pre-training | | 0
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI | | 0
Lp : A Logic for Statistical Information | | 0
Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning | | 0
MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation | | 0
Mapping Ontologies Using Ontologies: Cross-lingual Semantic Role Information Transfer | | 0
MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning | | 0
MathDivide: Improved mathematical reasoning by large language models | | 0
Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning | | 0
Medical idioms for clinical Bayesian network development | | 0
MediSee: Reasoning-based Pixel-level Perception in Medical Images | | 0
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | | 0
MetaReflection: Learning Instructions for Language Agents using Past Reflections | | 0
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | | 0
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning | | 0
Mixed Logical and Probabilistic Reasoning for Planning and Explanation Generation in Robotics | | 0
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs | | 0
Modeling Associative Reasoning Processes | | 0
CogReact: A Reinforced Framework to Model Human Cognitive Reaction Modulated by Dynamic Intervention | | 0
Modeling Human Decision-making: An Overview of the Brussels Quantum Approach | | 0
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation | | 0
MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM | | 0
MUC-driven Feature Importance Measurement and Adversarial Analysis for Random Forest | | 0
Page 12 of 15

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified