SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 14651 to 14700 of 474278 papers

Title | Status | Hype
TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events | | 1
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models | | 1
VLS: Steering Pretrained Robot Policies via Vision-Language Models | | 1
Privileged Information Distillation for Language Models | | 1
Learning Self-Correction in Vision-Language Models via Rollout Augmentation | | 1
Large Multimodal Models as General In-Context Classifiers | | 1
Coarse-Guided Visual Generation via Weighted h-Transform Sampling | | 1
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions | | 1
DREAM: Where Visual Understanding Meets Text-to-Image Generation | | 1
How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing | | 1
AlphaApollo: A System for Deep Agentic Reasoning | | 1
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents | | 1
Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding | | 1
Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models | | 1
ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems | | 1
Stereo World Model: Camera-Guided Stereo Video Generation | | 1
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants | | 1
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning | | 1
SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation | | 1
Rethinking Selective Knowledge Distillation | | 1
Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry | | 1
Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning | | 1
Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models | | 1
LatentMem: Customizing Latent Memory for Multi-Agent Systems | | 1
Mano: Restriking Manifold Optimization for LLM Training | | 1
Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing | | 1
Demystifing Video Reasoning | | 1
Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs | | 1
MediX-R1: Open Ended Medical Reinforcement Learning | | 1
Show, Don't Tell: Morphing Latent Reasoning into Image Generation | | 1
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets | | 1
Safety Alignment of LMs via Non-cooperative Games | | 1
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening | | 1
Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data | | 1
Same or Not? Enhancing Visual Perception in Vision-Language Models | | 1
MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models | | 1
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation | | 1
Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models | | 1
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers | | 1
Reinforced Fast Weights with Next-Sequence Prediction | | 1
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining | | 1
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem | | 1
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction | | 1
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models | | 1
General Agent Evaluation | | 1
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG | | 1
Glance and Focus Reinforcement for Pan-cancer Screening | | 1
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following | | 1
AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts | | 1
Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings | | 1
Page 294 of 9486