SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 9076-9100 of 177340 papers

Title | Status | Hype
MetaOpenFOAM: an LLM-based multi-agent framework for CFD | Code | 2
PyGen: A Collaborative Human-AI Approach to Python Package Creation | Code | 2
Disentangling Memory and Reasoning Ability in Large Language Models | Code | 2
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective | Code | 2
vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation | Code | 2
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models | Code | 2
TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting | Code | 2
Lost & Found: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs | Code | 2
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models | Code | 2
CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking | Code | 2
FLAIR: VLM with Fine-grained Language-informed Image Representations | Code | 2
ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario | Code | 2
SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning | Code | 2
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Code | 2
JPC: Flexible Inference for Predictive Coding Networks in JAX | Code | 2
MESA: Effective Matching Redundancy Reduction by Semantic Area Segmentation | Code | 2
DriveMM: All-in-One Large Multimodal Model for Autonomous Driving | Code | 2
MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction | Code | 2
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Code | 2
MR-GDINO: Efficient Open-World Continual Object Detection | Code | 2
Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark | Code | 2
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Code | 2
Test-time Computing: from System-1 Thinking to System-2 Thinking | Code | 2
TakuNet: an Energy-Efficient CNN for Real-Time Inference on Embedded UAV systems in Emergency Response Scenarios | Code | 2
Russian Financial Statements Database: A firm-level collection of the universe of financial statements | Code | 2
Page 364 of 7094