SOTAVerified

The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers, 248,326 code links, 4,818 tasks

Papers

Showing 17551 to 17600 of 474,278 papers

Title | Status | Hype
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation | Code | 1
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning | Code | 1
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought | Code | 1
Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs | Code | 1
Multi-Perspective Data Augmentation for Few-shot Object Detection | Code | 1
LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena | Code | 1
Training Consistency Models with Variational Noise Coupling | Code | 1
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models | Code | 1
Can Multimodal LLMs Perform Time Series Anomaly Detection? | Code | 1
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric | Code | 1
ReFocus: Reinforcing Mid-Frequency and Key-Frequency Modeling for Multivariate Time Series Forecasting | Code | 1
Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch | Code | 1
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought | Code | 1
LongAttn: Selecting Long-context Training Data via Token-level Attention | Code | 1
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions | Code | 1
Training a Generally Curious Agent | Code | 1
Function-Space Learning Rates | Code | 1
Hallucination Detection in LLMs Using Spectral Features of Attention Maps | Code | 1
CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement | Code | 1
HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization | Code | 1
Posterior Inference with Diffusion Models for High-dimensional Black-box Optimization | Code | 1
COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs | Code | 1
MAD-AD: Masked Diffusion for Unsupervised Brain Anomaly Detection | Code | 1
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance | Code | 1
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective | Code | 1
Towards Hierarchical Rectified Flow | Code | 1
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding | Code | 1
FADE: Why Bad Descriptions Happen to Good Features | Code | 1
Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement | Code | 1
LongSafety: Evaluating Long-Context Safety of Large Language Models | Code | 1
MambaFlow: A Novel and Flow-guided State Space Model for Scene Flow Estimation | Code | 1
Predicting the Energy Landscape of Stochastic Dynamical System via Physics-informed Self-supervised Learning | Code | 1
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference | Code | 1
LLM-QE: Improving Query Expansion by Aligning Large Language Models with Ranking Preferences | Code | 1
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam | Code | 1
CipherPrune: Efficient and Scalable Private Transformer Inference | Code | 1
JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning | Code | 1
AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation | Code | 1
Code Summarization Beyond Function Level | Code | 1
A Reverse Mamba Attention Network for Pathological Liver Segmentation | Code | 1
OptionZero: Planning with Learned Options | Code | 1
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale | Code | 1
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression | Code | 1
FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis | Code | 1
Automatic Input Rewriting Improves Translation with Large Language Models | Code | 1
Towards Optimal Adversarial Robust Reinforcement Learning with Infinity Measurement Error | Code | 1
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Code | 1
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | Code | 1