SOTAVerified

Logical Reasoning

Papers

Showing 251–300 of 747 papers

Title | Status | Hype
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack | | 0
City-LEO: Toward Transparent City Management Using LLM with End-to-End Optimization | | 0
BTPK-based interpretable method for NER tasks based on Talmudic Public Announcement Logic | | 0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | | 0
Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding | | 0
Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic | | 0
Bridging Technology and Humanities: Evaluating the Impact of Large Language Models on Social Sciences Research with DeepSeek-R1 | | 0
Deceptive AI systems that give explanations are more convincing than honest AI systems and can amplify belief in misinformation | | 0
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games | | 0
KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data | | 0
Dynamic In-Context Learning from Nearest Neighbors for Bundle Generation | | 0
Brainstorming Brings Power to Large Language Models of Knowledge Reasoning | | 0
Judgment of Thoughts: Courtroom of the Binary Logical Reasoning in Large Language Models | | 0
Do Large Language Models Understand Logic or Just Mimick Context? | | 0
A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models | | 0
KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models | | 0
Knowledge Authoring for Rules and Actions | | 0
Knowledge-based Reasoning and Learning under Partial Observability in Ad Hoc Teamwork | | 0
LaMOuR: Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning | | 0
Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology | | 0
Do Large Language Models Mirror Cognitive Language Processing? | | 0
Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought | | 0
Does Entity Abstraction Help Generative Transformers Reason? | | 0
Boosting Deductive Reasoning with Step Signals In RLHF | | 0
DMWM: Dual-Mind World Model with Long-Term Imagination | | 0
A Probabilistic Model for Discriminative and Neuro-Symbolic Semi-Supervised Learning | | 0
Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning | | 0
Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution | | 0
Is writing style predictive of scientific fraud? | | 0
APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning | | 0
Bi-Chainer: Automated Large Language Models Reasoning with Bidirectional Chaining | | 0
A Densely Connected Criss-Cross Attention Network for Document-level Relation Extraction | | 0
Discrete JEPA: Learning Discrete Token Representations without Reconstruction | | 0
Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models | | 0
Is ChatGPT a Good Personality Recognizer? A Preliminary Study | | 0
Discourse-Aware Graph Networks for Textual Logical Reasoning | | 0
Dialogue-based Explanations for Logical Reasoning using Structured Argumentation | | 0
Beyond LLMs: Advancing the Landscape of Complex Reasoning | | 0
Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI | | 0
Detection-based Intermediate Supervision for Visual Question Answering | | 0
Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study | | 0
Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs | | 0
JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning | | 0
DetectGPT-SC: Improving Detection of Text Generated by Large Language Models through Self-Consistency with Masked Predictions | | 0
Bayes Meets Entailment and Prediction: Commonsense Reasoning with Non-monotonicity, Paraconsistency and Predictive Accuracy | | 0
Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model | | 0
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback | | 0
Interleaved Reasoning for Large Language Models via Reinforcement Learning | | 0
Towards Geometry Problem Solving in the Large Model Era: A Survey | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified
2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified
3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified
4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified
5 | Command R+ | Delta_NoContext | 11.6 | | Unverified
6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified
7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified
8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified
9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified
10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified
2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified
3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified
5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified
7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified
8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified
4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified
9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified
5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified
6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified
8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified
9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified
2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified
3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified
4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified
5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified
6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified
7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified
8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified
2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified
2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 83.7 | | Unverified
2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified
4 | RuGPT-3 Small | Accuracy | 34 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Human benchmark | Accuracy | 87 | | Unverified
2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified
3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified
4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified