SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 14801-14850 of 474278 papers

Title | Status | Hype
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models | | 1
OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis | | 1
SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking | | 1
Same or Not? Enhancing Visual Perception in Vision-Language Models | | 1
When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs | | 1
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently | | 1
Evaluating and Steering Modality Preferences in Multimodal Large Language Model | | 1
SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization | | 1
LIVE: Long-horizon Interactive Video World Modeling | | 1
ObjEmbed: Towards Universal Multimodal Object Embeddings | | 1
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling | | 1
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents | | 1
Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch | | 1
PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models | | 1
VLS: Steering Pretrained Robot Policies via Vision-Language Models | | 1
DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents | | 1
Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL | | 1
Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network | | 1
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts | | 1
Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability | | 1
SWE-Exp: Experience-Driven Software Issue Resolution | | 1
LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | | 1
Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion | | 1
How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing | | 1
Show, Don't Tell: Morphing Latent Reasoning into Image Generation | | 1
CUA-Skill: Develop Skills for Computer Using Agent | | 1
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models | | 1
Glance and Focus Reinforcement for Pan-cancer Screening | | 1
WideSeek: Advancing Wide Research via Multi-Agent Scaling | | 1
FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning | | 1
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry | | 1
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding | | 1
LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents | | 1
Rethinking Selective Knowledge Distillation | | 1
HalluHard: A Hard Multi-Turn Hallucination Benchmark | | 1
Language-based Trial and Error Falls Behind in the Era of Experience | | 1
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control | | 1
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents | | 1
TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation | | 1
Segment Any Events with Language | | 1
Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry | | 1
DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset | | 1
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers | | 1
Mano: Restriking Manifold Optimization for LLM Training | | 1
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression | | 1
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature | | 1
DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report | | 1
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation | | 1
AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts | | 1
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation | | 1
Page 297 of 9486