SOTAVerified

Logical Reasoning

Papers

Showing 351-400 of 747 papers

| Title | Status | Hype |
|---|---|---|
| SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs | | 0 |
| Symbol Correctness in Deep Neural Networks Containing Symbolic Layers | | 0 |
| Symbolic-AI-Fusion Deep Learning (SAIF-DL): Encoding Knowledge into Training with Answer Set Programming Loss Penalties by a Novel Loss Function Approach | | 0 |
| Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models | | 0 |
| System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection | | 0 |
| Table-based Fact Verification with Self-adaptive Mixture of Experts | | 0 |
| Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach | | 0 |
| TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | | 0 |
| TensorLog: Deep Learning Meets Probabilistic DBs | | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | | 0 |
| The Dark Side of Explanations: Poisoning Recommender Systems with Counterfactual Examples | | 0 |
| The General Theory of General Intelligence: A Pragmatic Patternist Perspective | | 0 |
| The Good, The Bad, and Why: Unveiling Emotions in Generative AI | | 0 |
| The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models | | 0 |
| The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing | | 0 |
| The potential of large language models for improving probability learning: A study on ChatGPT3.5 and first-year computer engineering students | | 0 |
| The RatioLog Project: Rational Extensions of Logical Reasoning | | 0 |
| The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence | | 0 |
| The theory of quantitative trading | | 0 |
| Think Beyond Size: Adaptive Prompting for More Effective Reasoning | | 0 |
| Thinking Like an Expert:Multimodal Hypergraph-of-Thought (HoT) Reasoning to boost Foundation Modals | | 0 |
| Time-aware Self-Attention Meets Logic Reasoning in Recommender Systems | | 0 |
| TimeLogic: A Temporal Logic Benchmark for Video QA | | 0 |
| Knowledge-based and Data-driven Reasoning and Learning for Ad Hoc Teamwork | | 0 |
| Towards a Theory of Intentions for Human-Robot Collaboration | | 0 |
| Towards Better Response Times and Higher-Quality Queries in Interactive Knowledge Base Debugging | | 0 |
| Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation | | 0 |
| SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding | | 0 |
| Towards Generalist Prompting for Large Language Models by Mental Models | | 0 |
| Towards Human-Compatible XAI: Explaining Data Differentials with Concept Induction over Background Knowledge | | 0 |
| Towards Ideal Semantics for Analyzing Stream Reasoning | | 0 |
| Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models | | 0 |
| Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models | | 0 |
| Towards Superior Quantization Accuracy: A Layer-sensitive Approach | | 0 |
| Towards Unifying Logical Entailment and Statistical Estimation | | 0 |
| Towards Unifying Perceptual Reasoning and Logical Reasoning | | 0 |
| To What Extent Do Natural Language Understanding Datasets Correlate to Logical Reasoning? A Method for Diagnosing Logical Reasoning. | | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | | 0 |
| Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs | | 0 |
| Transformer-based Language Models for Reasoning in the Description Logic ALCQ | | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | | 0 |
| Truth Table Deep Convolutional Neural Network, A New SAT-Encodable Architecture - Application To Complete Robustness | | 0 |
| A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables | | 0 |
| TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games | | 0 |
| Type-dependent Prompt CycleQAG : Cycle Consistency for Multi-hop Question Generation | | 0 |
| Unifying Neural Learning and Symbolic Reasoning for Spinal Medical Report Generation | | 0 |
| Unifying Structure Reasoning and Language Model Pre-training for Complex Reasoning | | 0 |
| Unleash LLMs Potential for Recommendation by Coordinating Twin-Tower Dynamic Semantic Token Generator | | 0 |
| Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude Opus | Delta_NoContext | 28.8 | | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 83.7 | | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 87 | | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | | Unverified |