SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers, 248,326 code links, 4,818 tasks

Papers

Showing 9551–9575 of 474278 papers

Title | Status | Hype
Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI | Code | 0
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech |  | 0
Rethinking Entropy Regularization in Large Reasoning Models |  | 0
LayerD: Decomposing Raster Graphic Designs into Layers |  | 0
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning | Code | 0
Who's Your Judge? On the Detectability of LLM-Generated Judgments |  | 0
GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts |  | 0
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time |  | 0
Visual Jigsaw Post-Training Improves MLLMs |  | 0
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model |  | 0
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution |  | 0
Where LLM Agents Fail and How They can Learn From Failures | Code | 0
Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models | Code | 0
CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity |  | 0
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing |  | 0
Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding | Code | 0
Efficiently Attacking Memorization Scores | Code | 0
DiTraj: training-free trajectory control for video diffusion transformer |  | 0
Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs |  | 0
Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis | Code | 0
Interpretable 3D Neural Object Volumes for Robust Conceptual Reasoning | Code | 0
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games | Code | 0
Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models | Code | 0
OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing | Code | 0
Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling | Code | 0
Page 383 of 18972