SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 17651-17700 of 474278 papers

Title | Status | Hype
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models | Code | 1
Multi-Objective Causal Bayesian Optimization | Code | 1
SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images | Code | 1
Exploiting Deblurring Networks for Radiance Fields | Code | 1
Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning | Code | 1
CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models | Code | 1
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Code | 1
CLIPPER: Compression enables long-context synthetic data generation | Code | 1
Unstructured Evidence Attribution for Long Context Query Focused Summarization | Code | 1
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models | Code | 1
Pre-training Graph Neural Networks on Molecules by Using Subgraph-Conditioned Graph Information Bottleneck | Code | 1
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps | Code | 1
Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition | Code | 1
CDGS: Confidence-Aware Depth Regularization for 3D Gaussian Splatting | Code | 1
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following | Code | 1
MedFuncta: Modality-Agnostic Representations Based on Efficient Neural Fields | Code | 1
I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search | Code | 1
NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization | Code | 1
Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs | Code | 1
Noisy Test-Time Adaptation in Vision-Language Models | Code | 1
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models | Code | 1
How to Get Your LLM to Generate Challenging Problems for Evaluation | Code | 1
Pursuing Top Growth with Novel Loss Function | Code | 1
H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging | Code | 1
Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis | Code | 1
FlowAgent: Achieving Compliance and Flexibility for Workflow Agents | Code | 1
Improving LLM-powered Recommendations with Personalized Information | Code | 1
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data | Code | 1
PeerQA: A Scientific Question Answering Dataset from Peer Reviews | Code | 1
Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning | Code | 1
RobustX: Robust Counterfactual Explanations Made Easy | Code | 1
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Code | 1
Judging the Judges: A Collection of LLM-Generated Relevance Judgements | Code | 1
Reasoning with Reinforced Functional Token Tuning | Code | 1
Deep Learning for VWAP Execution in Crypto Markets: Beyond the Volume Curve | Code | 1
Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging | Code | 1
Spiking Point Transformer for Point Cloud Classification | Code | 1
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization | Code | 1
SPEX: Scaling Feature Interaction Explanations for LLMs | Code | 1
2.5D U-Net with Depth Reduction for 3D CryoET Object Identification | Code | 1
Which Attention Heads Matter for In-Context Learning? | Code | 1
Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition | Code | 1
Lost in Sequence: Do Large Language Models Understand Sequential Recommendation? | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models | Code | 1
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence | Code | 1
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions | Code | 1
Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems | Code | 1
Learning-Guided Rolling Horizon Optimization for Long-Horizon Flexible Job-Shop Scheduling | Code | 1
A Cognitive Writing Perspective for Constrained Long-Form Text Generation | Code | 1
Page 354 of 9486