The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers · 248,326 code links · 4,818 tasks

Papers


Showing 3401–3425 of 661570 papers

Title	Date	Tasks	Status	Hype
WebCanvas: Benchmarking Web Agents in Online Environments	Jun 18, 2024	AI Agent, Benchmarking	Code Available	3
Refusal in Language Models Is Mediated by a Single Direction	Jun 17, 2024	Instruction Following	Code Available	3
HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model	Jun 17, 2024	Computational Efficiency, Earth Observation	Code Available	3
Unveiling Encoder-Free Vision-Language Models	Jun 17, 2024	Decoder, Inductive Bias	Code Available	3
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement	Jun 17, 2024	speech-recognition, Speech Recognition	Code Available	3
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models	Jun 17, 2024	Document Classification, Visual Grounding	Code Available	3
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning	Jun 17, 2024	Language Modeling, Language Modelling	Code Available	3
An Imitative Reinforcement Learning Framework for Autonomous Dogfight	Jun 17, 2024	Imitation Learning, reinforcement-learning	Code Available	3
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding	Jun 16, 2024		Code Available	3
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference	Jun 16, 2024		Code Available	3
Step-level Value Preference Optimization for Mathematical Reasoning	Jun 16, 2024	Learning-To-Rank, Math	Code Available	3
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models	Jun 16, 2024	Hallucination, Hallucination Evaluation	Code Available	3
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph	Jun 16, 2024	Drug Design, Fairness	Code Available	3
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology	Jun 16, 2024	Code Generation	Code Available	3
IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization	Jun 15, 2024	GPU, Image Manipulation	Code Available	3
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs	Jun 14, 2024	Benchmarking, Knowledge Graphs	Code Available	3
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning	Jun 14, 2024	Offline RL	Code Available	3
CarLLaVA: Vision language models for camera-only closed-loop driving	Jun 14, 2024	Autonomous Driving, Bench2Drive	Code Available	3
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation	Jun 14, 2024	Audio-Visual Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	3
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	Jun 13, 2024	Dense Video Captioning, MVBench	Code Available	3
Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation	Jun 13, 2024	Multi-agent Reinforcement Learning	Code Available	3
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks	Jun 13, 2024	Benchmarking	Code Available	3
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models	Jun 13, 2024	Math, object-detection	Code Available	3
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation	Jun 13, 2024	Video Generation, Video Prediction	Code Available	3
MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning	Jun 13, 2024	Instruction Following, Math	Code Available	3