SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 15351-15400 of 474,278 papers

Title | Status | Hype
Rethinking Machine Unlearning in Image Generation Models | Code | 1
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Code | 1
TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression | Code | 1
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation | Code | 1
EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models | Code | 1
OD3: Optimization-free Dataset Distillation for Object Detection | Code | 1
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models | Code | 1
WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue | Code | 1
Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation | Code | 1
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean | Code | 1
STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework | Code | 1
EfficientFER: EfficientNetv2 Based Deep Learning Approach for Facial Expression Recognition | Code | 1
Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis | Code | 1
scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics | Code | 1
SEMNAV: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation | Code | 1
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost | Code | 1
GLoSS: Generative Language Models with Semantic Search for Sequential Recommendation | Code | 1
TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery | Code | 1
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability | Code | 1
IF-GUIDE: Influence Function-Guided Detoxification of LLMs | Code | 1
AIMSCheck: Leveraging LLMs for AI-Assisted Review of Modern Slavery Statements Across Jurisdictions | Code | 1
SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model | Code | 1
Crowdsourcing MUSHRA Tests in the Age of Generative Speech Technologies: A Comparative Analysis of Subjective and Objective Testing Methods | Code | 1
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Code | 1
Protap: A Benchmark for Protein Modeling on Realistic Downstream Applications | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
PFMBench: Protein Foundation Model Benchmark | Code | 1
IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory | Code | 1
Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs | Code | 1
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities | Code | 1
MIRROR: Cognitive Inner Monologue Between Conversational Turns for Persistent Reflection and Reasoning in Conversational LLMs | Code | 1
Look mom, no experimental data! Learning to score protein-ligand interactions from simulations | Code | 1
A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning for Any Atlas and Disorder | Code | 1
An LLM Agent for Functional Bug Detection in Network Protocols | Code | 1
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | Code | 1
PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements | Code | 1
dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation | Code | 1
SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models | Code | 1
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments | Code | 1
Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG | Code | 1
Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection | Code | 1
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? | Code | 1
Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning | Code | 1
Timing is Important: Risk-aware Fund Allocation based on Time-Series Forecasting | Code | 1
Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting | Code | 1
Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data | Code | 1
A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings | Code | 1
Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors | Code | 1
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | Code | 1
Page 308 of 9486