The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

659,983 papers, 248,104 code links, 4,818 tasks

Papers


Showing 2,451–2,500 of 659,983 papers

Title	Date	Tasks	Status	Hype
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models	Jun 1, 2025		Code Available	3
EXP-Bench: Can AI Conduct AI Research Experiments?	May 30, 2025		Code Available	3
MathArena: Evaluating LLMs on Uncontaminated Math Competitions	May 29, 2025	Math, Mathematical Reasoning	Code Available	3
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model	May 29, 2025	Large Language Model, scientific discovery	Code Available	3
MAGREF: Masked Guidance for Any-Reference Video Generation	May 29, 2025	Human-Domain Subject-to-Video, Open-Domain Subject-to-Video	Code Available	3
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge	May 29, 2025	text-to-speech, Text to Speech	Code Available	3
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction	May 29, 2025	Question Answering	Code Available	3
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models	May 29, 2025	Autonomous Driving, Diagnostic	Code Available	3
TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning	May 29, 2025	In-Context Learning, State Space Models	Code Available	3
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning	May 28, 2025	RAG	Code Available	3
NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation	May 27, 2025	Computational Efficiency, Graph Neural Network	Code Available	3
Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers	May 26, 2025	Information Retrieval	Code Available	3
Learning to Reason without External Rewards	May 26, 2025	Code Generation, reinforcement-learning	Code Available	3
PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints	May 26, 2025	Deep Learning	Code Available	3
syftr: Pareto-Optimal Generative AI	May 26, 2025	Bayesian Optimization, RAG	Code Available	3
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation	May 26, 2025	Decoder, Language Modeling	Code Available	3
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction	May 26, 2025	3D Reconstruction, Spatial Reasoning	Code Available	3
FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields	May 26, 2025	Contrastive Learning	Code Available	3
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline	May 25, 2025	Speech Extraction, Speech Separation	Code Available	3
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts	May 25, 2025	Chart Understanding, Question Answering	Code Available	3
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data	May 24, 2025	Image Stylization	Code Available	3
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation	May 24, 2025	Benchmarking, Chart Understanding	Code Available	3
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning	May 24, 2025	GPU, Reinforcement Learning (RL)	Code Available	3
RemoteSAM: Towards Segment Anything for Earth Observation	May 23, 2025	Attribute, Earth Observation	Code Available	3
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality	May 23, 2025	In-Context Learning, Token Reduction	Code Available	3
CLIMB: Class-imbalanced Learning Benchmark on Tabular Data	May 23, 2025		Code Available	3
Distilling LLM Agent into Small Models with Retrieval and Code Tools	May 23, 2025	Action Generation, Domain Generalization	Code Available	3
OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics	May 23, 2025	Chart Understanding, object-detection	Code Available	3
Training-Free Efficient Video Generation via Dynamic Token Carving	May 22, 2025	Denoising, Video Generation	Code Available	3
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems	May 22, 2025		Code Available	3
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models	May 22, 2025	Benchmarking, Fairness	Code Available	3
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning	May 22, 2025	Reinforcement Learning (RL)	Code Available	3
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO	May 22, 2025	Reinforcement Learning (RL)	Code Available	3
LaViDa: A Large Diffusion Language Model for Multimodal Understanding	May 22, 2025	Instruction Following, Language Modeling	Code Available	3
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL	May 22, 2025	Natural Language Understanding, Reinforcement Learning (RL)	Code Available	3
Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning	May 22, 2025		Code Available	3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models	May 22, 2025	Benchmarking, Instruction Following	Code Available	3
Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search	May 21, 2025	Information Retrieval	Code Available	3
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space	May 21, 2025		Code Available	3
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem	May 20, 2025	Mathematical Reasoning, scientific discovery	Code Available	3
Efficient Agent Training for Computer Use	May 20, 2025		Code Available	3
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking	May 20, 2025	Benchmarking	Code Available	3
General-Reasoner: Advancing LLM Reasoning Across All Domains	May 20, 2025	All, Math	Code Available	3
RLVR-World: Training World Models with Reinforcement Learning	May 20, 2025	reinforcement-learning, Reinforcement Learning	Code Available	3
MLZero: A Multi-Agent System for End-to-end Machine Learning Automation	May 20, 2025	AutoML, Code Generation	Code Available	3
This Time is Different: An Observability Perspective on Time Series Foundation Models	May 20, 2025	Decoder, Multivariate Time Series Forecasting	Code Available	3
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery	May 19, 2025	Navigate, scientific discovery	Code Available	3
Thinkless: LLM Learns When to Think	May 19, 2025	GSM8K, Math	Code Available	3
ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning	May 19, 2025	Machine Translation, reinforcement-learning	Code Available	3
Harnessing the Universal Geometry of Embeddings	May 18, 2025	Attribute	Code Available	3