SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 14801–14850 of 474278 papers

Title | Status | Hype
Steering Evaluation-Aware Language Models to Act Like They Are Deployed | | 1
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization | | 1
GISA: A Benchmark for General Information-Seeking Assistant | | 1
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark | | 1
AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios | | 1
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning | | 1
V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval | | 1
MediX-R1: Open Ended Medical Reinforcement Learning | | 1
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants | | 1
TodoEvolve: Learning to Architect Agent Planning Systems | | 1
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL | | 1
Modular Neural Image Signal Processing | | 1
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer | | 1
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size | | 1
HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification | | 1
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model | | 1
COREA: Coupled Relightable 3D Gaussians and SDFs for Efficient Normal Alignment | | 1
Sharing State Between Prompts and Programs | | 1
MIST: Mutual Information Estimation Via Supervised Training | | 1
Learning Personalized Agents from Human Feedback | | 1
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents | | 1
Prism: Spectral-Aware Block-Sparse Attention | | 1
General Agent Evaluation | | 1
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing | | 1
AlphaApollo: A System for Deep Agentic Reasoning | | 1
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering | | 1
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation | | 1
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation | | 1
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence | | 1
Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models | | 1
Reinforced Fast Weights with Next-Sequence Prediction | | 1
RubricBench: Aligning Model-Generated Rubrics with Human Standards | | 1
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition | | 1
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding | | 1
MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning | | 1
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs | | 1
World Models That Know When They Don't Know - Controllable Video Generation with Calibrated Uncertainty | | 1
SR-Scientist: Scientific Equation Discovery With Agentic AI | | 1
MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents | | 1
BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? | | 1
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining | | 1
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks | | 1
The Geometry of Reasoning: Flowing Logics in Representation Space | | 1
Next Visual Granularity Generation | | 1
Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning | | 1
GameDevBench: Evaluating Agentic Capabilities Through Game Development | | 1
Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings | | 1
Learning to Configure Agentic AI Systems | | 1
Panoramic Affordance Prediction | | 1
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories | | 1