SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 17,751-17,800 of 474,278 papers

| Title | Status | Hype |
| --- | --- | --- |
| ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests | | 0 |
| MesaNet: Sequence Modeling by Locally Optimal Test-Time Training | | 0 |
| MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models | Code | 0 |
| Kernel k-Medoids as General Vector Quantization | | 0 |
| DeePoly: A High-Order Accuracy Scientific Machine Learning Framework for Function Approximation and Solving PDEs | Code | 1 |
| Learning normalized image densities via dual score matching | Code | 0 |
| SeedEdit 3.0: Fast and High-Quality Generative Image Editing | | 0 |
| OpenGT: A Comprehensive Benchmark For Graph Transformers | Code | 1 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | | 0 |
| Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design | | 0 |
| Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning | | 0 |
| Influence Functions for Edge Edits in Non-Convex Graph Neural Networks | | 0 |
| FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing | | 0 |
| Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts? | Code | 0 |
| Reliably detecting model failures in deployment without labels | Code | 0 |
| Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation | | 0 |
| Information Locality as an Inductive Bias for Neural Language Models | Code | 0 |
| Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation | | 0 |
| SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? | | 0 |
| MuSciClaims: Multimodal Scientific Claim Verification | | 0 |
| Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning | | 0 |
| TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering | | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | | 0 |
| ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT | | 0 |
| SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View | | 0 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | | 0 |
| Does It Make Sense to Speak of Introspection in Large Language Models? | | 0 |
| RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation | | 0 |
| IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation | | 0 |
| Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies | | 0 |
| Improving Low-Resource Morphological Inflection via Self-Supervised Objectives | | 0 |
| Seeing the Invisible: Machine learning-Based QPI Kernel Extraction via Latent Alignment | | 0 |
| TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation | Code | 1 |
| AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model | Code | 2 |
| Exploring Diffusion Transformer Designs via Grafting | Code | 2 |
| ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development | Code | 7 |
| A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search | Code | 2 |
| Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning | Code | 1 |
| Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts | Code | 1 |
| Kinetics: Rethinking Test-Time Scaling Laws | Code | 2 |
| LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs | Code | 0 |
| Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos | Code | 2 |
| TreeRPO: Tree Relative Policy Optimization | Code | 0 |
| iN2V: Bringing Transductive Node Embeddings to Inductive Graphs | Code | 0 |
| Practical Manipulation Model for Robust Deepfake Detection | Code | 0 |
| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | | 0 |
| MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments | Code | 1 |
| Survey on the Evaluation of Generative Models in Music | | 0 |
| Learning Beyond Experience: Generalizing to Unseen State Space with Reservoir Computing | Code | 0 |
| MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements | | 0 |
Page 356 of 9486