SOTAVerified

StrategyQA

StrategyQA aims to measure the ability of models to answer questions that require multi-step implicit reasoning.

Source: BIG-bench

Papers

Showing 1-25 of 40 papers

| Title | Status | Hype |
| --- | --- | --- |
| Training Compute-Optimal Large Language Models | Code | 6 |
| Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Code | 4 |
| PaLM: Scaling Language Modeling with Pathways | Code | 2 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Code | 2 |
| Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation | Code | 1 |
| CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring Commonsense Reasoning and Long-Tail Knowledge | Code | 1 |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | Code | 1 |
| AutoReason: Automatic Few-Shot Reasoning Decomposition | Code | 1 |
| Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning | Code | 1 |
| Improving Planning with Large Language Models: A Modular Agentic Architecture | Code | 1 |
| Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks | Code | 1 |
| Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast | Code | 1 |
| Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies | Code | 1 |
| Visconde: Multi-document QA with GPT-3 and Neural Reranking | Code | 1 |
| Voting or Consensus? Decision-Making in Multi-Agent Debate | Code | 0 |
| Distilling Reasoning Capabilities into Smaller Language Models | Code | 0 |
| Rationale-Aware Answer Verification by Pairwise Self-Evaluation | Code | 0 |
| DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability | Code | 0 |
| Tailoring Self-Rationalizers with Multi-Reward Distillation | Code | 0 |
| Teaching Smaller Language Models To Generalise To Unseen Compositional Questions | Code | 0 |
| Meta-prompting Optimized Retrieval-augmented Generation | | 0 |
| A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions | | 0 |
| Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval | | 0 |
| Better Retrieval May Not Lead to Better Question Answering | | 0 |
| Deduction under Perturbed Evidence: Probing Student Simulation Capabilities of Large Language Models | | 0 |

No leaderboard results yet.