SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 19551–19600 of 474278 papers

Title | Status | Hype
Continuous Chain of Thought Enables Parallel Exploration and Reasoning | - | 0
Understanding Mode Connectivity via Parameter Space Symmetry | - | 0
SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA | - | 0
MuLoCo: Muon is a practical inner optimizer for DiLoCo | - | 0
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | - | 0
Differential Information: An Information-Theoretic Perspective on Preference Optimization | - | 0
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness | - | 0
JAPAN: Joint Adaptive Prediction Areas with Normalising-Flows | - | 0
Stable Thompson Sampling: Valid Inference via Variance Inflation | - | 0
Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments | - | 0
Emergent Risk Awareness in Rational Agents under Resource Constraints | - | 0
Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models | - | 0
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | - | 0
Are Reasoning Models More Prone to Hallucination? | - | 0
From Chat Logs to Collective Insights: Aggregative Question Answering | - | 0
Comparative of Genetic Fuzzy regression techniques for aeroacoustic phenomenons | - | 0
StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs | - | 0
LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments | - | 0
EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models | - | 0
Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data | - | 0
Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios | - | 0
PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics | - | 0
Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring | - | 0
Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models | - | 0
Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors | - | 0
Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt | - | 0
Characterizing the Expressivity of Transformer Language Models | - | 0
ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs | - | 0
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation | - | 0
ATLAS: Learning to Optimally Memorize the Context at Test Time | - | 0
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | - | 0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | - | 0
GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning | - | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | - | 0
TRAP: Targeted Redirecting of Agentic Preferences | - | 0
Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models | - | 0
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model | Code | 3
ZeroGUI: Automating Online GUI Learning at Zero Human Cost | Code | 2
Normalizing Flows are Capable Models for RL | Code | 1
DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning | Code | 1
MAGREF: Masked Guidance for Any-Reference Video Generation | Code | 3
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | Code | 2
Context Robust Knowledge Editing for Language Models | Code | 1
Video Editing for Audio-Visual Dubbing | Code | 0
A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization | Code | 0
TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning | Code | 1
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Code | 1
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | Code | 0
Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging | Code | 0
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | Code | 2