SOTAVerified

Decision Making

Papers

Showing 251-275 of 12311 papers

Title | Status | Hype
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation | Code | 1
STeCa: Step-level Trajectory Calibration for LLM Agent Learning | Code | 1
Multi-Objective Causal Bayesian Optimization | Code | 1
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence | Code | 1
RobustX: Robust Counterfactual Explanations Made Easy | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements | Code | 1
Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | Code | 1
SegX: Improving Interpretability of Clinical Image Diagnosis with Segmentation-based Enhancement | Code | 1
Habitizing Diffusion Planning for Efficient and Effective Decision Making | Code | 1
RTBAgent: A LLM-based Agent System for Real-Time Bidding | Code | 1
Vintix: Action Model via In-Context Reinforcement Learning | Code | 1
Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
A Survey of World Models for Autonomous Driving | Code | 1
NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes | Code | 1
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning | Code | 1
ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction | Code | 1
Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies | Code | 1
MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments | Code | 1
Plancraft: an evaluation dataset for planning with LLM agents | Code | 1
Modality-Projection Universal Model for Comprehensive Full-Body Medical Imaging Segmentation | Code | 1
Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning | Code | 1
Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion | Code | 1
LegalAgentBench: Evaluating LLM Agents in Legal Domain | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SRLA | Average Remaining Cycles | 6.4 |  | Unverified