The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers, 248,326 code links, 4,818 tasks

Papers

Showing 14,701–14,750 of 474,278 papers

Title	Date	Status	Hype
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs	Feb 24, 2026	Unverified	1
Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting	Feb 24, 2026	Unverified	1
Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data	Feb 24, 2026	Unverified	1
GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing	Feb 24, 2026	Unverified	1
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking	Feb 24, 2026	Unverified	1
Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding	Feb 24, 2026	Unverified	1
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding	Feb 23, 2026	Unverified	1
MIST: Mutual Information Estimation Via Supervised Training	Feb 23, 2026	Unverified	1
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation	Feb 23, 2026	Unverified	1
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations	Feb 22, 2026	Unverified	1
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models	Feb 22, 2026	Unverified	1
WildOS: Open-Vocabulary Object Search in the Wild	Feb 22, 2026	Unverified	1
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning	Feb 21, 2026	Unverified	1
Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum	Feb 20, 2026	Unverified	1
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs	Feb 19, 2026	Unverified	1
Learning Personalized Agents from Human Feedback	Feb 18, 2026	Unverified	1
Reinforced Fast Weights with Next-Sequence Prediction	Feb 18, 2026	Unverified	1
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook	Feb 18, 2026	Unverified	1
Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs	Feb 18, 2026	Unverified	1
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning	Feb 18, 2026	Unverified	1
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models	Feb 18, 2026	Unverified	1
ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization	Feb 17, 2026	Unverified	1
Avey-B	Feb 17, 2026	Unverified	1
SR-Scientist: Scientific Equation Discovery With Agentic AI	Feb 17, 2026	Unverified	1
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs	Feb 17, 2026	Unverified	1
MARS: Modular Agent with Reflective Search for Automated AI Research	Feb 17, 2026	Unverified	1
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing	Feb 16, 2026	Unverified	1
Revisiting the Platonic Representation Hypothesis: An Aristotelian View	Feb 16, 2026	Unverified	1
Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models	Feb 16, 2026	Unverified	1
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)	Feb 16, 2026	Unverified	1
Privileged Information Distillation for Language Models	Feb 16, 2026	Unverified	1
Image Generation with a Sphere Encoder	Feb 16, 2026	Unverified	1
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem	Feb 16, 2026	Unverified	1
Efficient Test-Time Scaling for Small Vision-Language Models	Feb 16, 2026	Unverified	1
Self-Improving World Modelling with Latent Actions	Feb 15, 2026	Unverified	1
Scaling Behavior of Discrete Diffusion Language Models	Feb 15, 2026	Unverified	1
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses	Feb 15, 2026	Unverified	1
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence	Feb 14, 2026	Unverified	1
GISA: A Benchmark for General Information-Seeking Assistant	Feb 13, 2026	Unverified	1
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents	Feb 13, 2026	Unverified	1
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise	Feb 13, 2026	Unverified	1
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision	Feb 13, 2026	Unverified	1
Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models	Feb 13, 2026	Unverified	1
Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion	Feb 12, 2026	Unverified	1
The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context	Feb 12, 2026	Unverified	1
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling	Feb 12, 2026	Unverified	1
Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching	Feb 12, 2026	Unverified	1
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark	Feb 12, 2026	Unverified	1
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment	Feb 12, 2026	Unverified	1
DeepSight: An All-in-One LM Safety Toolkit	Feb 12, 2026	Unverified	1