SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers | 248,326 code links | 4,818 tasks

Papers

Showing 3401-3425 of 661,570 papers

Title | Status | Hype
WebCanvas: Benchmarking Web Agents in Online Environments | Code | 3
Refusal in Language Models Is Mediated by a Single Direction | Code | 3
HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model | Code | 3
Unveiling Encoder-Free Vision-Language Models | Code | 3
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Code | 3
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | Code | 3
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning | Code | 3
An Imitative Reinforcement Learning Framework for Autonomous Dogfight | Code | 3
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding | Code | 3
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Code | 3
Step-level Value Preference Optimization for Mathematical Reasoning | Code | 3
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | Code | 3
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph | Code | 3
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology | Code | 3
IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization | Code | 3
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs | Code | 3
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning | Code | 3
CarLLaVA: Vision language models for camera-only closed-loop driving | Code | 3
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Code | 3
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation | Code | 3
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Code | 3
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Code | 3
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | Code | 3
MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning | Code | 3