SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 8,426–8,450 of 474,278 papers

Title | Status | Hype
UltraGen: High-Resolution Video Generation with Hierarchical Attention |  | 0
Robustness Assessment and Enhancement of Text Watermarking for Google's SynthID | Code | 0
Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations |  | 0
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping |  | 0
Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces | Code | 0
Steering Autoregressive Music Generation with Recursive Feature Machines |  | 0
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation | Code | 0
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence |  | 0
Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards | Code | 0
RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models | Code | 0
Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation | Code | 0
Glyph: Scaling Context Windows via Visual-Text Compression | Code | 0
IMB: An Italian Medical Benchmark for Question Answering | Code | 0
DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP | Code | 0
RAISE: A Unified Framework for Responsible AI Scoring and Evaluation | Code | 0
A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics | Code | 0
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents | Code | 0
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training | Code | 0
Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models | Code | 0
Actor-Free Continuous Control via Structurally Maximizable Q-Functions | Code | 0
ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks | Code | 0
Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring | Code | 0
BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem | Code | 0
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks | Code | 0
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning | Code | 0