SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 6126-6150 of 474278 papers

Title | Status | Hype
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems | Code | 2
NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM | Code | 2
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training | Code | 2
FinMTEB: Finance Massive Text Embedding Benchmark | Code | 2
RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation | Code | 2
Hierarchical Expert Prompt for Large-Language-Model: An Approach Defeat Elite AI in TextStarCraft II for the First Time | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Code | 2
D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security | Code | 2
MonoForce: Learnable Image-conditioned Physics Engine | Code | 2
A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations | Code | 2
Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning | Code | 2
Process Reward Models for LLM Agents: Practical Framework and Directions | Code | 2
Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal | Code | 2
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles | Code | 2
KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG | Code | 2
Unlocking the Potential of Classic GNNs for Graph-level Tasks: Simple Architectures Meet Excellence | Code | 2
DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra | Code | 2
Diffusion Models for Molecules: A Survey of Methods and Tasks | Code | 2
TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Code | 2
A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis | Code | 2
Digi-Q: Learning Q-Value Functions for Training Device-Control Agents | Code | 2
Harnessing Vision Models for Time Series Analysis: A Survey | Code | 2
DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References | Code | 2
CoT-Valve: Length-Compressible Chain-of-Thought Tuning | Code | 2
Page 246 of 18972