SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 19,201 to 19,250 of 474,278 papers

Title | Status | Hype
From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning | - | 0
How much do language models memorize? | - | 0
KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval | - | 0
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL | Code | 2
Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning | - | 0
ByzFL: Research Framework for Robust Federated Learning | Code | 1
PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations | Code | 2
The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models | Code | 1
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning | Code | 1
STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models | Code | 1
Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting | Code | 1
Learning Safety Constraints for Large Language Models | Code | 1
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists? | Code | 0
WikiGap: Promoting Epistemic Equity by Surfacing Knowledge Gaps Between English Wikipedia and other Language Editions | Code | 0
MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge | Code | 0
RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation | Code | 0
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | Code | 7
Invariant Link Selector for Spatial-Temporal Out-of-Distribution Problem | Code | 0
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration | Code | 0
Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly Retraining | Code | 0
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors | Code | 0
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | Code | 1
Multi-criteria Rank-based Aggregation for Explainable AI | Code | 0
Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin | Code | 1
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | Code | 0
Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion | Code | 0
Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings | Code | 1
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion | - | 0
The Gaussian Mixing Mechanism: Renyi Differential Privacy via Gaussian Sketches | Code | 0
RealDrive: Retrieval-Augmented Driving with Diffusion Models | - | 0
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders | - | 0
LLM Inference Enhanced by External Knowledge: A Survey | Code | 0
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches | Code | 0
Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings | Code | 1
TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores | Code | 2
Statistical mechanics of extensive-width Bayesian neural networks near interpolation | Code | 0
EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation | Code | 0
Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation | Code | 1
Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning | Code | 0
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability | Code | 0
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | Code | 0
LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews | Code | 0
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text | Code | 0
When Large Multimodal Models Confront Evolving Knowledge:Challenges and Pathways | Code | 2
Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer | Code | 1
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards | Code | 5
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | Code | 1
ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving | - | 0
Federated Foundation Model for GI Endoscopy Images | - | 0