SOTAVerified

Logical Reasoning

Papers

Showing 401–450 of 747 papers

| Title | Status | Hype |
|---|---|---|
| Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation | Code | 2 |
| Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games | | 0 |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Code | 5 |
| Generation of Explanations for Logic Reasoning | | 0 |
| Enhancing Logical Reasoning in Large Language Models to Facilitate Legal Applications | | 0 |
| De-fine: Decomposing and Refining Visual Programs with Auto-Feedback | | 0 |
| WatME: Towards Lossless Watermarking Through Lexical Redundancy | | 0 |
| FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models | | 0 |
| Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs | Code | 1 |
| A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning | Code | 0 |
| Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study | Code | 0 |
| From Complex to Simple: Unraveling the Cognitive Tree for Reasoning with Small Language Models | | 0 |
| Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding | | 0 |
| Let's Reinforce Step by Step | | 0 |
| Language Models can be Logical Solvers | | 0 |
| Chain of Images for Intuitively Reasoning | Code | 1 |
| COOL: A Constraint Object-Oriented Logic Programming Language and its Neural-Symbolic Compilation System | | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | | 0 |
| Rule Learning as Machine Translation using the Atomic Knowledge Bank | Code | 0 |
| LLM4Drive: A Survey of Large Language Models for Autonomous Driving | Code | 3 |
| Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis | Code | 0 |
| Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace | Code | 1 |
| DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy | Code | 1 |
| Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings | Code | 0 |
| POE: Process of Elimination for Multiple Choice Reasoning | Code | 0 |
| Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention | Code | 0 |
| DetectGPT-SC: Improving Detection of Text Generated by Large Language Models through Self-Consistency with Masked Predictions | | 0 |
| Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism | | 0 |
| Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts | Code | 1 |
| LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers | Code | 1 |
| Retrieval-Augmented Neural Response Generation Using Logical Reasoning and Relevance Scoring | | 0 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Code | 1 |
| Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning | Code | 1 |
| Improving Large Language Models in Event Relation Logical Prediction | Code | 1 |
| GLoRE: Evaluating Logical Reasoning of Large Language Models | Code | 1 |
| The potential of large language models for improving probability learning: A study on ChatGPT3.5 and first-year computer engineering students | | 0 |
| Empower Nested Boolean Logic via Self-Supervised Curriculum Learning | Code | 0 |
| DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers | Code | 0 |
| Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance | Code | 0 |
| Learning Reliable Logical Rules with SATNet | | 0 |
| Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models | | 0 |
| DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | | 0 |
| Physics of Language Models: Part 3.2, Knowledge Manipulation | | 0 |
| EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning | Code | 0 |
| Do PLMs Know and Understand Ontological Knowledge? | Code | 1 |
| HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models | Code | 1 |
| On the Potential of CLIP for Compositional Logical Reasoning | | 0 |
| LR-XFL: Logical Reasoning-based Explainable Federated Learning | Code | 0 |
| Human Comprehensible Active Learning of Genome-Scale Metabolic Networks | | 0 |
| LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles | Code | 1 |
Page 9 of 15

Benchmark Results

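| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified |
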
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified |

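| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified |
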
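| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified |
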
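| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 83.7 | | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | | Unverified |
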
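| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 87 | | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified |
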
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified |