SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 16101-16150 of 474278 papers

Title | Status | Hype
Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models | - | 0
Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts | Code | 1
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs | - | 0
VINCIE: Unlocking In-context Image Editing from Video | - | 0
Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning | - | 0
Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | - | 0
Can We Infer Confidential Properties of Training Data from LLMs? | - | 0
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation | - | 0
Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models | - | 0
Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages | - | 0
PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier | - | 0
Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty | - | 0
PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models | - | 0
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers | - | 0
Provably Learning from Language Feedback | - | 0
PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs | - | 0
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | - | 0
Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering | - | 0
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? | - | 0
Build the web for agents, not agents for the web | - | 0
MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning | - | 0
Demystifying Spectral Feature Learning for Instrumental Variable Regression | - | 0
Meta-learning Representations for Learning from Multiple Annotators | - | 0
The Gittins Index: A Design Principle for Decision-Making Under Uncertainty | - | 0
Rethinking Losses for Diffusion Bridge Samplers | - | 0
Robustly Improving LLM Fairness in Realistic Settings via Interpretability | Code | 0
Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes | - | 0
Collaborative Min-Max Regret in Grouped Multi-Armed Bandits | - | 0
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization | Code | 1
CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training | Code | 0
Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning | Code | 0
SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis | Code | 2
"Check My Work?": Measuring Sycophancy in a Simulated Educational Context | Code | 0
Code Execution as Grounded Supervision for LLM Reasoning | Code | 0
NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors | Code | 0
Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims | Code | 0
AutoMind: Adaptive Knowledgeable Agent for Automated Data Science | Code | 2
Size-adaptive Hypothesis Testing for Fairness | Code | 0
Detecting Sockpuppetry on Wikipedia Using Meta-Learning | Code | 0
Discrete Audio Tokens: More Than a Survey! | - | 0
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers | - | 0
Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment | Code | 0
An Analysis of Datasets, Metrics and Models in Keyphrase Generation | Code | 0
Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration | Code | 0
VQC-MLPNet: An Unconventional Hybrid Quantum-Classical Architecture for Scalable and Robust Quantum Machine Learning | - | 0
Dynamic Epistemic Friction in Dialogue | - | 0
ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization | Code | 0
Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers | Code | 0
AC/DC: LLM-based Audio Comprehension via Dialogue Continuation | - | 0
ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs | - | 0