SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 9001–9025 of 474,278 papers

Title | Status | Hype
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs | Code | 0
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework | Code | 0
MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis | Code | 0
Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol | Code | 0
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics | Code | 0
Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR | Code | 0
Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation | Code | 0
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability | Code | 0
RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems | Code | 0
CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts | Code | 0
Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference | Code | 0
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation | Code | 0
On the Representations of Entities in Auto-regressive Large Language Models | Code | 0
SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests | Code | 0
Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem | Code | 0
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation | Code | 0
RepDL: Bit-level Reproducible Deep Learning Training and Inference | Code | 0
Repairing Regex Vulnerabilities via Localization-Guided Instructions | Code | 0
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping |  | 0
More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration | Code | 0
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models |  | 0
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities |  | 0
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning |  | 0
First Try Matters: Revisiting the Role of Reflection in Reasoning Models |  | 0
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling |  | 0