SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

659,983 papers · 248,104 code links · 4,818 tasks

Papers

Showing 2451–2500 of 659,983 papers

Title | Status | Hype
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models | Code | 3
EXP-Bench: Can AI Conduct AI Research Experiments? | Code | 3
MathArena: Evaluating LLMs on Uncontaminated Math Competitions | Code | 3
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model | Code | 3
MAGREF: Masked Guidance for Any-Reference Video Generation | Code | 3
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge | Code | 3
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction | Code | 3
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Code | 3
TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning | Code | 3
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning | Code | 3
NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation | Code | 3
Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers | Code | 3
Learning to Reason without External Rewards | Code | 3
PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints | Code | 3
syftr: Pareto-Optimal Generative AI | Code | 3
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation | Code | 3
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | Code | 3
FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields | Code | 3
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline | Code | 3
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts | Code | 3
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data | Code | 3
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | Code | 3
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Code | 3
RemoteSAM: Towards Segment Anything for Earth Observation | Code | 3
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality | Code | 3
CLIMB: Class-imbalanced Learning Benchmark on Tabular Data | Code | 3
Distilling LLM Agent into Small Models with Retrieval and Code Tools | Code | 3
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics | Code | 3
Training-Free Efficient Video Generation via Dynamic Token Carving | Code | 3
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems | Code | 3
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | Code | 3
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | Code | 3
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO | Code | 3
LaViDa: A Large Diffusion Language Model for Multimodal Understanding | Code | 3
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL | Code | 3
Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning | Code | 3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | Code | 3
Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search | Code | 3
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space | Code | 3
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem | Code | 3
Efficient Agent Training for Computer Use | Code | 3
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking | Code | 3
General-Reasoner: Advancing LLM Reasoning Across All Domains | Code | 3
RLVR-World: Training World Models with Reinforcement Learning | Code | 3
MLZero: A Multi-Agent System for End-to-end Machine Learning Automation | Code | 3
This Time is Different: An Observability Perspective on Time Series Foundation Models | Code | 3
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery | Code | 3
Thinkless: LLM Learns When to Think | Code | 3
ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning | Code | 3
Harnessing the Universal Geometry of Embeddings | Code | 3