SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers | 248,326 code links | 4,818 tasks

Papers

Showing 6201–6250 of 661570 papers

Title | Status | Hype
Sparse Autoencoders for Hypothesis Generation | Code | 2
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering | Code | 2
Seeing World Dynamics in a Nutshell | Code | 2
CTR-Driven Advertising Image Generation with Multimodal Large Language Models | Code | 2
Honegumi: An Interface for Accelerating the Adoption of Bayesian Optimization in the Experimental Sciences | Code | 2
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search | Code | 2
STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2
Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs | Code | 2
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance | Code | 2
On the Guidance of Flow Matching | Code | 2
Reviving The Classics: Active Reward Modeling in Large Language Model Alignment | Code | 2
Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation | Code | 2
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization | Code | 2
Compressed Image Generation with Denoising Diffusion Codebook Models | Code | 2
Efficient Diffusion Models: A Survey | Code | 2
Towards Robust and Generalizable Lensless Imaging with Modular Learned Reconstruction | Code | 2
Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding | Code | 2
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles | Code | 2
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer | Code | 2
Preference Leakage: A Contamination Problem in LLM-as-a-judge | Code | 2
When Do LLMs Help With Node Classification? A Comprehensive Analysis | Code | 2
LEAD: Large Foundation Model for EEG-Based Alzheimer's Disease Detection | Code | 2
FlexCloud: Direct, Modular Georeferencing and Drift-Correction of Point Cloud Maps | Code | 2
Segment Anything for Histopathology | Code | 2
MetaOpenFOAM 2.0: Large Language Model Driven Chain of Thought for Automating CFD Simulation and Post-Processing | Code | 2
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | Code | 2
PyMOLfold: Interactive Protein and Ligand Structure Prediction in PyMOL | Code | 2
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling | Code | 2
RaySplats: Ray Tracing based Gaussian Splatting | Code | 2
Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping | Code | 2
Efficient Reasoning with Hidden Thinking | Code | 2
Visual Autoregressive Modeling for Image Super-Resolution | Code | 2
mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval | Code | 2
An Adversarial Approach to Register Extreme Resolution Tissue Cleared 3D Brain Images | Code | 2
TRADES: Generating Realistic Market Simulations with Diffusion Models | Code | 2
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving | Code | 2
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
Diverse Preference Optimization | Code | 2
Track-On: Transformer-based Online Point Tracking with Memory | Code | 2
Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss | Code | 2
GuardReasoner: Towards Reasoning-based LLM Safeguards | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
General Scene Adaptation for Vision-and-Language Navigation | Code | 2
Closing the Gap Between Synthetic and Ground Truth Time Series Distributions via Neural Mapping | Code | 2
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Code | 2
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders | Code | 2
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | Code | 2
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs | Code | 2
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | Code | 2
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Code | 2