SOTAVerified

Logical Reasoning

Papers

Showing 51-100 of 747 papers

Title | Status | Hype
HF4Rec: Human-Like Feedback-Driven Optimization Framework for Explainable Recommendation | | 0
Context-Awareness and Interpretability of Rare Occurrences for Discovery and Formalization of Critical Failure Modes | | 0
LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models | | 0
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy | | 0
LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection | | 0
Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration | Code | 1
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning | | 0
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving | | 0
MediSee: Reasoning-based Pixel-level Perception in Medical Images | | 0
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | | 0
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles | Code | 0
MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Code | 0
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Code | 1
Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification | | 0
Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent | | 0
Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition | | 0
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
Adaptive Rectification Sampling for Test-Time Compute Scaling | Code | 0
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2
VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models | | 0
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? | Code | 1
Negation: A Pink Elephant in the Large Language Models' Room? | | 0
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning | | 0
Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning | | 0
A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives | | 0
(G)I-DLE: Generative Inference via Distribution-preserving Logit Exclusion with KL Divergence Minimization for Constrained Decoding | | 0
Enhancing Retrieval Systems with Inference-Time Logical Reasoning | | 0
MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow | Code | 2
LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning | | 0
From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models | | 0
Bridging Technology and Humanities: Evaluating the Impact of Large Language Models on Social Sciences Research with DeepSeek-R1 | | 0
Measuring AI Ability to Complete Long Tasks | Code | 3
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack | | 0
3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o | | 0
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation | | 0
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models | | 0
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | Code | 4
Towards Superior Quantization Accuracy: A Layer-sensitive Approach | | 0
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Code | 0
The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence | | 0
HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks | | 0
DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL | | 0
Three tiers of computation in transformers and in brain architectures | Code | 0
Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling | | 0
DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Code | 0
KGCompiler: Deep Learning Compilation Optimization for Knowledge Graph Complex Logical Query Answering | Code | 0
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs | | 0
Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation | | 0
Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Code | 2
TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning | Code | 0
Page 2 of 15

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified