SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

661,570 papers · 248,326 code links · 4,818 tasks

Papers

Showing 9,851-9,900 of 661,570 papers

Title | Status | Hype
Stable Neural Stochastic Differential Equations in Analyzing Irregular Time Series Data | Code | 2
tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2
HyperFast: Instant Classification for Tabular Data | Code | 2
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | Code | 2
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues | Code | 2
PALO: A Polyglot Large Multimodal Model for 5B People | Code | 2
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching | Code | 2
ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents | Code | 2
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding | Code | 2
Geometry-Informed Neural Networks | Code | 2
Full-Atom Peptide Design with Geometric Latent Diffusion | Code | 2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models | Code | 2
D-Flow: Differentiating through Flows for Controlled Generation | Code | 2
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems | Code | 2
Deep Generative Models for Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions | Code | 2
Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent | Code | 2
Coercing LLMs to do and reveal (almost) anything | Code | 2
VOOM: Robust Visual Object Odometry and Mapping using Hierarchical Landmarks | Code | 2
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models | Code | 2
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | Code | 2
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain | Code | 2
A Touch, Vision, and Language Dataset for Multimodal Alignment | Code | 2
Transformer tricks: Precomputing the first layer | Code | 2
EMO-SUPERB: An In-depth Look at Speech Emotion Recognition | Code | 2
RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models | Code | 2
Revitalizing Multivariate Time Series Forecasting: Learnable Decomposition with Inter-Series Dependencies and Intra-Series Variations Modeling | Code | 2
Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations | Code | 2
RhythmFormer: Extracting Patterned rPPG Signals based on Periodic Sparse Attention | Code | 2
Me LLaMA: Foundation Large Language Models for Medical Applications | Code | 2
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing | Code | 2
Event-Based Motion Magnification | Code | 2
UnlearnCanvas: Stylized Image Dataset for Enhanced Machine Unlearning Evaluation in Diffusion Models | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
A Critical Evaluation of AI Feedback for Aligning Large Language Models | Code | 2
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models | Code | 2
EmoBench: Evaluating the Emotional Intelligence of Large Language Models | Code | 2
The Revolution of Multimodal Large Language Models: A Survey | Code | 2
Reformatted Alignment | Code | 2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
EVOR: Evolving Retrieval for Code Generation | Code | 2
Generative Semi-supervised Graph Anomaly Detection | Code | 2
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs | Code | 2
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic | Code | 2
Spatio-Temporal Few-Shot Learning via Diffusive Neural Network Generation | Code | 2
Universal Physics Transformers: A Framework For Efficiently Scaling Neural Operators | Code | 2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Pan-Mamba: Effective pan-sharpening with State Space Model | Code | 2
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships | Code | 2
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models | Code | 2
Page 198 of 13,232