SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 14701-14750 of 474,278 papers

Title | Status | Hype
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | | 1
Vector Quantization using Gaussian Variational Autoencoder | | 1
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning | | 1
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently | | 1
SWE-Exp: Experience-Driven Software Issue Resolution | | 1
LIVE: Long-horizon Interactive Video World Modeling | | 1
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation | | 1
OpenAutoNLU: Open Source AutoML Library for NLU | | 1
Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models | | 1
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models | | 1
Evaluating and Steering Modality Preferences in Multimodal Large Language Model | | 1
How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing | | 1
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration | | 1
Matryoshka Gaussian Splatting | | 1
LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | | 1
Mano: Restriking Manifold Optimization for LLM Training | | 1
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges | | 1
When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs | | 1
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following | | 1
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty | | 1
Scaling Behavior of Discrete Diffusion Language Models | | 1
MARS: Modular Agent with Reflective Search for Automated AI Research | | 1
RISE-Video: Can Video Generators Decode Implicit World Rules? | | 1
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling | | 1
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression | | 1
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature | | 1
ObjEmbed: Towards Universal Multimodal Object Embeddings | | 1
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models | | 1
DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report | | 1
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents | | 1
DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset | | 1
Show, Don't Tell: Morphing Latent Reasoning into Image Generation | | 1
Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts | | 1
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs | | 1
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models | | 1
FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use | | 1
Learning Self-Correction in Vision-Language Models via Rollout Augmentation | | 1
How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition | | 1
Image Generation with a Sphere Encoder | | 1
Can Vision-Language Models Solve the Shell Game? | | 1
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening | | 1
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon | | 1
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs | | 1
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration | | 1
WildOS: Open-Vocabulary Object Search in the Wild | | 1
Chain of World: World Model Thinking in Latent Motion | | 1
ContextBench: A Benchmark for Context Retrieval in Coding Agents | | 1
Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing | | 1
Mamba-FCS: Joint Spatio-Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing | | 1
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning | | 1
Page 295 of 9486