SOTAVerified

Decision Making

Papers

Showing 251-300 of 12311 papers

Title | Status | Hype
Multi-Objective Causal Bayesian Optimization | Code | 1
STeCa: Step-level Trajectory Calibration for LLM Agent Learning | Code | 1
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence | Code | 1
RobustX: Robust Counterfactual Explanations Made Easy | Code | 1
Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements | Code | 1
Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | Code | 1
SegX: Improving Interpretability of Clinical Image Diagnosis with Segmentation-based Enhancement | Code | 1
Habitizing Diffusion Planning for Efficient and Effective Decision Making | Code | 1
RTBAgent: A LLM-based Agent System for Real-Time Bidding | Code | 1
Vintix: Action Model via In-Context Reinforcement Learning | Code | 1
Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs | Code | 1
A Survey of World Models for Autonomous Driving | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes | Code | 1
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning | Code | 1
Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies | Code | 1
ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction | Code | 1
MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments | Code | 1
Plancraft: an evaluation dataset for planning with LLM agents | Code | 1
Modality-Projection Universal Model for Comprehensive Full-Body Medical Imaging Segmentation | Code | 1
Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning | Code | 1
Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion | Code | 1
LegalAgentBench: Evaluating LLM Agents in Legal Domain | Code | 1
CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models | Code | 1
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
A Generative Framework for Probabilistic, Spatiotemporally Coherent Downscaling of Climate Simulation | Code | 1
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning | Code | 1
Explainable Fuzzy Neural Network with Multi-Fidelity Reinforcement Learning for Micro-Architecture Design Space Exploration | Code | 1
WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | Code | 1
Digital Transformation in the Water Distribution System based on the Digital Twins Concept | Code | 1
SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot | Code | 1
AI-Driven Day-to-Day Route Choice | Code | 1
BIMCaP: BIM-based AI-supported LiDAR-Camera Pose Refinement | Code | 1
A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1
AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant | Code | 1
Large-scale moral machine experiment on large language models | Code | 1
BayesianFitForecast: A User-Friendly R Toolbox for Parameter Estimation and Forecasting with Ordinary Differential Equations | Code | 1
Semantic-Aware Resource Management for C-V2X Platooning via Multi-Agent Reinforcement Learning | Code | 1
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models | Code | 1
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback | Code | 1
DiffLight: A Partial Rewards Conditioned Diffusion Model for Traffic Signal Control with Missing Data | Code | 1
Toward Conditional Distribution Calibration in Survival Prediction | Code | 1
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting | Code | 1
Reflection-Bench: probing AI intelligence with reflection | Code | 1
A Comprehensive Evaluation of Cognitive Biases in LLMs | Code | 1
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | Code | 1
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | Code | 1
Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SRLA | Average Remaining Cycles | 6.4 | | Unverified