Logical Reasoning
Datasets: LingOly, BIG-bench (Formal Fallacies Syllogisms Negation), BIG-bench (Penguins In A Table), BIG-bench (Reasoning About Colored Objects), BIG-bench (Temporal Sequences), BIG-bench (Logic Grid Puzzle), BIG-bench (StrategyQA), RuWorldTree, Winograd Automatic, BIG-bench (Logical Fallacy Detection)
Benchmark Results
LingOly

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Claude Opus | Delta_NoContext | 28.8 | — | Unverified |
| 2 | GPT-4o | Delta_NoContext | 25.1 | — | Unverified |
| 3 | Gemini 1.5 Pro | Delta_NoContext | 23.4 | — | Unverified |
| 4 | GPT-4 | Delta_NoContext | 21.5 | — | Unverified |
| 5 | Command R+ | Delta_NoContext | 11.6 | — | Unverified |
| 6 | GPT-3.5 | Delta_NoContext | 11.2 | — | Unverified |
| 7 | Mixtral 8x7B | Delta_NoContext | 6.4 | — | Unverified |
| 8 | Llama 3 8B | Delta_NoContext | 4.9 | — | Unverified |
| 9 | Llama 3 70B | Delta_NoContext | 2.9 | — | Unverified |
| 10 | Gemma 7B | Delta_NoContext | 2.2 | — | Unverified |
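The `Delta_NoContext` metric above is, as LingOly defines it, the drop in a model's exact-match score when the puzzle's linguistic context is removed: a large delta suggests the model genuinely uses the provided context, while a delta near zero hints at memorization. A minimal sketch of that computation (function and variable names here are illustrative, not the benchmark's API):

```python
def no_context_delta(score_with_context: float, score_without_context: float) -> float:
    """Delta_NoContext: accuracy with the puzzle's context minus
    accuracy with the context stripped out (both on a 0-100 scale).

    High delta  -> answers depend on the context (reasoning).
    Delta ~ 0   -> answers survive without the puzzle (memorization).
    """
    return score_with_context - score_without_context

# Hypothetical example with made-up scores:
delta = no_context_delta(60.0, 31.2)
print(round(delta, 1))  # 28.8
```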
BIG-bench (Formal Fallacies Syllogisms Negation)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 64.8 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 57.2 | — | Unverified |
| 3 | OPT 66B (few-shot, k=3) | Accuracy | 54 | — | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | — | Unverified |
| 5 | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | — | Unverified |
| 7 | Chinchilla-70B (few-shot, k=5) | Accuracy | 52.1 | — | Unverified |
| 8 | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | — | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | — | Unverified |
BIG-bench (Penguins In A Table)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 84.9 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 65.8 | — | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 48.7 | — | Unverified |
| 4 | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | — | Unverified |
| 5 | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | — | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | — | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | — | Unverified |
| 9 | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | — | Unverified |
BIG-bench (Reasoning About Colored Objects)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 91.2 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 61.2 | — | Unverified |
| 3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 59.7 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | — | Unverified |
| 5 | PaLM 540B (few-shot, k=3) | Accuracy | 38 | — | Unverified |
| 6 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | — | Unverified |
| 7 | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | — | Unverified |
| 8 | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | — | Unverified |
| 9 | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | — | Unverified |
BIG-bench (Temporal Sequences)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM 2 (few-shot, k=3, CoT) | Accuracy | 100 | — | Unverified |
| 2 | PaLM 2 (few-shot, k=3, Direct) | Accuracy | 96.4 | — | Unverified |
| 3 | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | — | Unverified |
| 4 | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | — | Unverified |
| 5 | Chinchilla-70B (few-shot, k=5) | Accuracy | 32 | — | Unverified |
| 6 | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | — | Unverified |
| 7 | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | — | Unverified |
| 8 | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | — | Unverified |
| 9 | Gopher-280B (few-shot, k=5) | Accuracy | 19 | — | Unverified |
BIG-bench (Logic Grid Puzzle)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 44 | — | Unverified |
| 2 | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | — | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | — | Unverified |
BIG-bench (StrategyQA)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | — | Unverified |
| 2 | Chinchilla-70B (few-shot, k=5) | Accuracy | 68.3 | — | Unverified |
| 3 | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | — | Unverified |
| 4 | Gopher-280B (few-shot, k=5) | Accuracy | 61 | — | Unverified |
RuWorldTree

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 83.7 | — | Unverified |
| 2 | RuGPT-3 Large | Accuracy | 40.7 | — | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 38 | — | Unverified |
| 4 | RuGPT-3 Small | Accuracy | 34 | — | Unverified |
Winograd Automatic

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Human benchmark | Accuracy | 87 | — | Unverified |
| 2 | RuGPT-3 Small | Accuracy | 57.9 | — | Unverified |
| 3 | RuGPT-3 Medium | Accuracy | 57.2 | — | Unverified |
| 4 | RuGPT-3 Large | Accuracy | 55.5 | — | Unverified |
BIG-bench (Logical Fallacy Detection)

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 72.1 | — | Unverified |
| 2 | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | — | Unverified |