SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers · 248,326 code links · 4,818 tasks

Papers

Showing 6101-6150 of 661,570 papers

Title | Status | Hype
Rethinking Diverse Human Preference Learning through Principal Component Analysis | Code | 2
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning | Code | 2
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking | Code | 2
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors | Code | 2
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Code | 2
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects | Code | 2
UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design | Code | 2
A Survey of Personalized Large Language Models: Progress and Future Directions | Code | 2
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs | Code | 2
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation | Code | 2
Continuous Diffusion Model for Language Modeling | Code | 2
PUGS: Zero-shot Physical Understanding with Gaussian Splatting | Code | 2
SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL | Code | 2
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages | Code | 2
JoLT: Joint Probabilistic Predictions on Tabular Data Using LLMs | Code | 2
Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration | Code | 2
Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization | Code | 2
Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More | Code | 2
Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment | Code | 2
Idiosyncrasies in Large Language Models | Code | 2
Diffusion Models without Classifier-free Guidance | Code | 2
LLM Agents Making Agent Tools | Code | 2
X-IL: Exploring the Design Space of Imitation Learning Policies | Code | 2
Image Inversion: A Survey from GANs to Diffusion and Beyond | Code | 2
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening | Code | 2
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems | Code | 2
FinMTEB: Finance Massive Text Embedding Benchmark | Code | 2
NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM | Code | 2
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training | Code | 2
Hierarchical Expert Prompt for Large-Language-Model: An Approach Defeat Elite AI in TextStarCraft II for the First Time | Code | 2
MasRouter: Learning to Route LLMs for Multi-Agent Systems | Code | 2
RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation | Code | 2
D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security | Code | 2
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Code | 2
Process Reward Models for LLM Agents: Practical Framework and Directions | Code | 2
A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations | Code | 2
MonoForce: Learnable Image-conditioned Physics Engine | Code | 2
Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal | Code | 2
Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning | Code | 2
DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References | Code | 2
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles | Code | 2
DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra | Code | 2
Digi-Q: Learning Q-Value Functions for Training Device-Control Agents | Code | 2
Diffusion Models for Molecules: A Survey of Methods and Tasks | Code | 2
A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis | Code | 2
CoT-Valve: Length-Compressible Chain-of-Thought Tuning | Code | 2
Harnessing Vision Models for Time Series Analysis: A Survey | Code | 2
KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG | Code | 2
TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Code | 2
Unlocking the Potential of Classic GNNs for Graph-level Tasks: Simple Architectures Meet Excellence | Code | 2