SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers | 248,326 code links | 4,818 tasks

Papers

Showing 2476–2500 of 661,570 papers

Title | Status | Hype
Distilling LLM Agent into Small Models with Retrieval and Code Tools | Code | 3
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality | Code | 3
CLIMB: Class-imbalanced Learning Benchmark on Tabular Data | Code | 3
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems | Code | 3
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | Code | 3
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | Code | 3
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO | Code | 3
LaViDa: A Large Diffusion Language Model for Multimodal Understanding | Code | 3
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL | Code | 3
Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning | Code | 3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | Code | 3
Training-Free Efficient Video Generation via Dynamic Token Carving | Code | 3
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space | Code | 3
Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search | Code | 3
Efficient Agent Training for Computer Use | Code | 3
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking | Code | 3
General-Reasoner: Advancing LLM Reasoning Across All Domains | Code | 3
RLVR-World: Training World Models with Reinforcement Learning | Code | 3
MLZero: A Multi-Agent System for End-to-end Machine Learning Automation | Code | 3
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem | Code | 3
This Time is Different: An Observability Perspective on Time Series Foundation Models | Code | 3
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery | Code | 3
Thinkless: LLM Learns When to Think | Code | 3
ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning | Code | 3
Harnessing the Universal Geometry of Embeddings | Code | 3
Page 100 of 26463