The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 9001–9025 of 474278 papers

Title	Date	Status
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs	Oct 10, 2025	Code Available
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework	Oct 10, 2025	Code Available
MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis	Oct 10, 2025	Code Available
Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol	Oct 10, 2025	Code Available
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics	Oct 10, 2025	Code Available
Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR	Oct 10, 2025	Code Available
Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation	Oct 10, 2025	Code Available
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability	Oct 10, 2025	Code Available
RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems	Oct 10, 2025	Code Available
CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts	Oct 10, 2025	Code Available
Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference	Oct 10, 2025	Code Available
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation	Oct 10, 2025	Code Available
On the Representations of Entities in Auto-regressive Large Language Models	Oct 10, 2025	Code Available
SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests	Oct 10, 2025	Code Available
Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem	Oct 10, 2025	Code Available
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation	Oct 10, 2025	Code Available
RepDL: Bit-level Reproducible Deep Learning Training and Inference	Oct 10, 2025	Code Available
Repairing Regex Vulnerabilities via Localization-Guided Instructions	Oct 10, 2025	Code Available
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping	Oct 9, 2025	Unverified
More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration	Oct 9, 2025	Code Available
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models	Oct 9, 2025	Unverified
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities	Oct 9, 2025	Unverified
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning	Oct 9, 2025	Unverified
First Try Matters: Revisiting the Role of Reflection in Reasoning Models	Oct 9, 2025	Unverified
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling	Oct 9, 2025	Unverified