SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 7976–8000 of 177,340 papers

Title | Status | Hype
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues | Code | 2
PPFlow: Target-aware Peptide Design with Torsional Flow Matching | Code | 2
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems | Code | 2
Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories | Code | 2
FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition | Code | 2
UMBRAE: Unified Multimodal Brain Decoding | Code | 2
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder | Code | 2
LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models | Code | 2
RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! | Code | 2
TensorNet: Cartesian Tensor Representations for Efficient Learning of Molecular Potentials | Code | 2
Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement | Code | 2
FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models | Code | 2
Pix2Poly: A Sequence Prediction Method for End-to-end Polygonal Building Footprint Extraction from Remote Sensing Imagery | Code | 2
Multi-View Mesh Reconstruction with Neural Deferred Shading | Code | 2
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction | Code | 2
Room impulse response reconstruction with physics-informed deep learning | Code | 2
Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video | Code | 2
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Code | 2
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate | Code | 2
Multi-Programming Language Sandbox for LLMs | Code | 2
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling | Code | 2
ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation | Code | 2
How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients | Code | 2
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | Code | 2
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Code | 2
Page 320 of 7094