SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers | 248,326 code links | 4,818 tasks

Papers

Showing 15801-15850 of 474278 papers

Title | Status | Hype
"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding | Code | 1
Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models | Code | 1
CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution | Code | 1
A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability | Code | 1
Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities | Code | 1
MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction | Code | 1
Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification | Code | 1
DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery | Code | 1
Do Language Models Use Their Depth Efficiently? | Code | 1
ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models | Code | 1
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Code | 1
EEG-to-Text Translation: A Model for Deciphering Human Brain Activity | Code | 1
Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI | Code | 1
Large Language Models for Data Synthesis | Code | 1
UniSim: A Unified Simulator for Time-Coarsened Dynamics of Biomolecules | Code | 1
Let's Verify Math Questions Step by Step | Code | 1
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
Physics-Guided Learning of Meteorological Dynamics for Weather Downscaling and Forecasting | Code | 1
FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer | Code | 1
Social Sycophancy: A Broader Understanding of LLM Sycophancy | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning | Code | 1
U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding | Code | 1
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Speculative Decoding Reimagined for Multimodal Large Language Models | Code | 1
Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs | Code | 1
A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations | Code | 1
DIFF: Dual Side-Information Filtering and Fusion for Sequential Recommendation | Code | 1
CLEVER: A Curated Benchmark for Formally Verified Code Generation | Code | 1
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking | Code | 1
KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models | Code | 1
Diving into the Fusion of Monocular Priors for Generalized Stereo Matching | Code | 1
Reasoning Models Better Express Their Confidence | Code | 1
Deep Koopman operator framework for causal discovery in nonlinear dynamical systems | Code | 1
Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials | Code | 1
DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management | Code | 1
Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning | Code | 1
Training-Free Watermarking for Autoregressive Image Generation | Code | 1
Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis | Code | 1
Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection | Code | 1
R2MED: A Benchmark for Reasoning-Driven Medical Retrieval | Code | 1
WebNovelBench: Placing LLM Novelists on the Web Novel Distribution | Code | 1
Decoupling Classifier for Boosting Few-shot Object Detection and Instance Segmentation | Code | 1
Unlocking the Power of SAM 2 for Few-Shot Segmentation | Code | 1
PRL: Prompts from Reinforcement Learning | Code | 1
ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories | Code | 1
Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning | Code | 1