SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 6226-6250 of 474,278 papers

Title | Status | Hype
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | Code | 2
MetaOpenFOAM 2.0: Large Language Model Driven Chain of Thought for Automating CFD Simulation and Post-Processing | Code | 2
RaySplats: Ray Tracing based Gaussian Splatting | Code | 2
TRADES: Generating Realistic Market Simulations with Diffusion Models | Code | 2
mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval | Code | 2
Visual Autoregressive Modeling for Image Super-Resolution | Code | 2
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving | Code | 2
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling | Code | 2
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
An Adversarial Approach to Register Extreme Resolution Tissue Cleared 3D Brain Images | Code | 2
Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping | Code | 2
Efficient Reasoning with Hidden Thinking | Code | 2
Diverse Preference Optimization | Code | 2
Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss | Code | 2
Track-On: Transformer-based Online Point Tracking with Memory | Code | 2
GuardReasoner: Towards Reasoning-based LLM Safeguards | Code | 2
General Scene Adaptation for Vision-and-Language Navigation | Code | 2
Closing the Gap Between Synthetic and Ground Truth Time Series Distributions via Neural Mapping | Code | 2
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | Code | 2
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders | Code | 2
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Code | 2
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | Code | 2
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs | Code | 2