SOTAVerified

Logical Reasoning

Papers

Showing 151–200 of 747 papers

| Title | Status | Hype |
|---|---|---|
| KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities | — | 0 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Code | 4 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | — | 0 |
| A Survey on Large Language Model Acceleration based on KV Cache Management | Code | 3 |
| Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation | Code | 1 |
| StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs | — | 0 |
| Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework | Code | 0 |
| Formal Language Knowledge Corpus for Retrieval Augmented Generation | — | 0 |
| Logical Consistency of Large Language Models in Fact-checking | — | 0 |
| SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models | Code | 0 |
| WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | Code | 1 |
| Reasoning-Aware Query-Focused Summarization over Multi-Table Data | — | 0 |
| RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios | Code | 1 |
| Federated In-Context LLM Agent Learning | — | 0 |
| Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic | — | 0 |
| FlashRNN: Optimizing Traditional RNNs on Modern Hardware | Code | 2 |
| Training Large Language Models to Reason in a Continuous Latent Space | Code | 5 |
| Can OpenAI o1 outperform humans in higher-order cognitive thinking? | — | 0 |
| Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of Turn-taking in Murder Mystery Games | Code | 0 |
| Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models | — | 0 |
| MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM | — | 0 |
| ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Code | 1 |
| Reverse Thinking Makes LLMs Stronger Reasoners | — | 0 |
| SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment | — | 0 |
| Learning for Long-Horizon Planning via Neuro-Symbolic Abductive Imitation | Code | 0 |
| Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs | — | 0 |
| Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning | — | 0 |
| HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator | — | 0 |
| Object-centric proto-symbolic behavioural reasoning from pixels | Code | 0 |
| Interactive Visual Assessment for Text-to-Image Generation Models | — | 0 |
| XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation | — | 0 |
| Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus | Code | 2 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | — | 0 |
| Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm | — | 0 |
| Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash | Code | 0 |
| LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Code | 7 |
| Building Trustworthy AI: Transparent AI Systems via Large Language Models, Ontologies, and Logical Reasoning (TranspNet) | — | 0 |
| Symbolic-AI-Fusion Deep Learning (SAIF-DL): Encoding Knowledge into Training with Answer Set Programming Loss Penalties by a Novel Loss Function Approach | — | 0 |
| Knowledge Authoring with Factual English, Rules, and Actions | — | 0 |
| OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving? | — | 0 |
| How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis | — | 0 |
| Formal Logic-guided Robust Federated Learning against Poisoning Attacks | — | 0 |
| The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units | Code | 1 |
| Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent | Code | 5 |
| LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Code | 1 |
| On Memorization of Large Language Models in Logical Reasoning | — | 0 |
| Leveraging LLMs for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach | Code | 0 |
| Neuro-symbolic Learning Yielding Logical Constraints | Code | 1 |
| Combining Domain-Specific Models and LLMs for Automated Disease Phenotyping from Survey Data | — | 0 |
| Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs | — | 0 |
Page 4 of 15

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude Opus | Delta_NoContext | 28.8 | — | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | — | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | — | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | — | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | — | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | — | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | — | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | — | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | — | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | — | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | — | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | — | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | — | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | — | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | — | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | — | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | — | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | — | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | — | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | — | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | — | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | — | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | — | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | — | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | — | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | — | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | — | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | — | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | — | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | — | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | — | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | — | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | — | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | — | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | — | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | — | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | — | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 83.7 | — | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | — | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | — | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 87 | — | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | — | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | — | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | — | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | — | Unverified |