SOTAVerified

Logical Reasoning

Papers

Showing 51-100 of 747 papers

Title | Status | Hype
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | Code | 2
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Code | 2
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Code | 1
Large Language Models for Planning: A Comprehensive and Systematic Survey | Code | 1
Do Large Language Models Excel in Complex Logical Reasoning with Formal Language? | Code | 1
NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning | Code | 1
Learning to Reason via Mixture-of-Thought for Logical Reasoning | Code | 1
Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs | Code | 1
Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration | Code | 1
Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization | Code | 1
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models | Code | 1
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation | Code | 1
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation | Code | 1
WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | Code | 1
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios | Code | 1
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Code | 1
The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units | Code | 1
LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Code | 1
Neuro-symbolic Learning Yielding Logical Constraints | Code | 1
Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning | Code | 1
Automatic Curriculum Expert Iteration for Reliable LLM Reasoning | Code | 1
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | Code | 1
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning | Code | 1
VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning | Code | 1
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Code | 1
CHECKWHY: Causal Fact Verification via Argument Structure | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning | Code | 1
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models | Code | 1
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Code | 1
PUZZLES: A Benchmark for Neural Algorithmic Reasoning | Code | 1
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | Code | 1
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners | Code | 1
LogiCode: an LLM-Driven Framework for Logical Anomaly Detection | Code | 1
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models | Code | 1
LeanReasoner: Boosting Complex Logical Reasoning with Lean | Code | 1
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials | Code | 1
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models | Code | 1
Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs | Code | 1
The Quantified Boolean Bayesian Network: Theory and Experiments with a Logical Graphical Model | Code | 1
Conditional and Modal Reasoning in Large Language Models | Code | 1
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions | Code | 1
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models | Code | 1
TEILP: Time Prediction over Knowledge Graphs via Logical Reasoning | Code | 1
Advancing Abductive Reasoning in Knowledge Graphs through Complex Logical Hypothesis Generation | Code | 1
Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified