SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 17701 to 17750 of 474278 papers

| Title | Status | Hype |
| --- | --- | --- |
| Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay | Code | 1 |
| FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation | Code | 1 |
| Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification | - | 0 |
| Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks | Code | 0 |
| StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation | Code | 0 |
| Rethinking Contrastive Learning in Session-based Recommendation | Code | 0 |
| Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching | Code | 0 |
| MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines | Code | 0 |
| Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models | Code | 5 |
| Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models | Code | 0 |
| Dissecting Long Reasoning Models: An Empirical Study | Code | 0 |
| Composing Agents to Minimize Worst-case Risk | Code | 0 |
| Tuning the Right Foundation Models is What you Need for Partial Label Learning | Code | 1 |
| ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation | - | 0 |
| Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis | - | 0 |
| Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels | - | 0 |
| Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning | - | 0 |
| A MISMATCHED Benchmark for Scientific Natural Language Inference | Code | 0 |
| Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding | Code | 0 |
| VideoMolmo: Spatio-Temporal Grounding Meets Pointing | Code | 2 |
| Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents | Code | 0 |
| Identifying Reliable Evaluation Metrics for Scientific Text Revision | Code | 0 |
| Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models | Code | 1 |
| HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training | Code | 0 |
| Controlling Summarization Length Through EOS Token Weighting | - | 0 |
| TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages | - | 0 |
| Quantifying Cross-Modality Memorization in Vision-Language Models | - | 0 |
| DSG-World: Learning a 3D Gaussian World Model from Dual State Videos | - | 0 |
| Stable Vision Concept Transformers for Medical Diagnosis | - | 0 |
| MARBLE: Material Recomposition and Blending in CLIP-Space | - | 0 |
| ProRefine: Inference-time Prompt Refinement with Textual Feedback | - | 0 |
| UNO: Unlearning via Orthogonalization in Generative models | Code | 0 |
| Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning | Code | 0 |
| Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Code | 0 |
| ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition | - | 0 |
| Prompting LLMs: Length Control for Isometric Machine Translation | - | 0 |
| OpenAg: Democratizing Agricultural Intelligence | - | 0 |
| Search Arena: Analyzing Search-Augmented LLMs | Code | 2 |
| BSBench: will your LLM find the largest prime number? | Code | 0 |
| Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers | - | 0 |
| RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion | - | 0 |
| SAM-aware Test-time Adaptation for Universal Medical Image Segmentation | Code | 0 |
| A Reasoning-Based Approach to Cryptic Crossword Clue Solving | Code | 0 |
| FedAPM: Federated Learning via ADMM with Partial Model Personalization | Code | 0 |
| Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion | Code | 0 |
| Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit | - | 0 |
| Contrastive Flow Matching | Code | 2 |
| Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation | - | 0 |
| LSM-2: Learning from Incomplete Wearable Sensor Data | - | 0 |
| Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback | - | 0 |
Page 355 of 9486