The Open Verification Layer for ML Research

Community benchmark tracking and reproducibility verification. Built for researchers and autonomous research agents.

474,278 papers · 248,326 code links · 4,818 tasks

Papers

Showing 17,551–17,600 of 474,278 papers

Title	Date	Tasks	Status	Hype
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation	Feb 25, 2025	Legal Reasoning	Code Available	1
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning	Feb 25, 2025		Code Available	1
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought	Feb 25, 2025	Emotion Recognition, Language Modeling	Code Available	1
Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs	Feb 25, 2025	Benchmarking, Chunking	Code Available	1
Multi-Perspective Data Augmentation for Few-shot Object Detection	Feb 25, 2025	Data Augmentation, Few-Shot Object Detection	Code Available	1
LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena	Feb 25, 2025		Code Available	1
Training Consistency Models with Variational Noise Coupling	Feb 25, 2025	Image Generation	Code Available	1
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models	Feb 25, 2025	Fact Checking	Code Available	1
Can Multimodal LLMs Perform Time Series Anomaly Detection?	Feb 25, 2025	Anomaly Detection, Irregular Time Series	Code Available	1
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric	Feb 24, 2025	Diversity	Code Available	1
ReFocus: Reinforcing Mid-Frequency and Key-Frequency Modeling for Multivariate Time Series Forecasting	Feb 24, 2025	Multivariate Time Series Forecasting, Time Series	Code Available	1
Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns	Feb 24, 2025	Contrastive Learning, Graph Matching	Code Available	1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models	Feb 24, 2025	Logical Reasoning, Multiple-choice	Code Available	1
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch	Feb 24, 2025		Code Available	1
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought	Feb 24, 2025	Mathematical Reasoning, Misinformation	Code Available	1
LongAttn: Selecting Long-context Training Data via Token-level Attention	Feb 24, 2025	Sentence	Code Available	1
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions	Feb 24, 2025	Language Modeling, Language Modelling	Code Available	1
Training a Generally Curious Agent	Feb 24, 2025	Decision Making, Efficient Exploration	Code Available	1
Function-Space Learning Rates	Feb 24, 2025		Code Available	1
Hallucination Detection in LLMs Using Spectral Features of Attention Maps	Feb 24, 2025	Hallucination	Code Available	1
CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement	Feb 24, 2025	Autonomous Driving, Camera Calibration	Code Available	1
HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization	Feb 24, 2025	Diversity, Fact Verification	Code Available	1
Posterior Inference with Diffusion Models for High-dimensional Black-box Optimization	Feb 24, 2025	Bayesian Optimization, Uncertainty Quantification	Code Available	1
COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs	Feb 24, 2025		Code Available	1
MAD-AD: Masked Diffusion for Unsupervised Brain Anomaly Detection	Feb 24, 2025	Anatomy, Anomaly Detection	Code Available	1
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance	Feb 24, 2025		Code Available	1
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective	Feb 24, 2025		Code Available	1
Towards Hierarchical Rectified Flow	Feb 24, 2025		Code Available	1
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding	Feb 24, 2025	cross-modal alignment, Visual Grounding	Code Available	1
FADE: Why Bad Descriptions Happen to Good Features	Feb 24, 2025		Code Available	1
Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement	Feb 24, 2025		Code Available	1
LongSafety: Evaluating Long-Context Safety of Large Language Models	Feb 24, 2025		Code Available	1
MambaFlow: A Novel and Flow-guided State Space Model for Scene Flow Estimation	Feb 24, 2025	Autonomous Driving, Decoder	Code Available	1
Predicting the Energy Landscape of Stochastic Dynamical System via Physics-informed Self-supervised Learning	Feb 24, 2025	Self-Supervised Learning	Code Available	1
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference	Feb 24, 2025		Code Available	1
LLM-QE: Improving Query Expansion by Aligning Large Language Models with Ranking Preferences	Feb 24, 2025	Hallucination, Information Retrieval	Code Available	1
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam	Feb 24, 2025		Code Available	1
CipherPrune: Efficient and Scalable Private Transformer Inference	Feb 24, 2025	Privacy Preserving	Code Available	1
JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning	Feb 24, 2025	Legal Reasoning	Code Available	1
AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation	Feb 23, 2025	Image Segmentation, Segmentation	Code Available	1
Code Summarization Beyond Function Level	Feb 23, 2025	Code Summarization, Few-Shot Learning	Code Available	1
A Reverse Mamba Attention Network for Pathological Liver Segmentation	Feb 23, 2025	Computational Efficiency, Liver Segmentation	Code Available	1
OptionZero: Planning with Learned Options	Feb 23, 2025	Atari Games	Code Available	1
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale	Feb 23, 2025		Code Available	1
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression	Feb 23, 2025	Efficient Neural Network, Quantization	Code Available	1
FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis	Feb 23, 2025	Sentence, Sentence Embedding	Code Available	1
Automatic Input Rewriting Improves Translation with Large Language Models	Feb 23, 2025	Machine Translation, Text Simplification	Code Available	1
Towards Optimal Adversarial Robust Reinforcement Learning with Infinity Measurement Error	Feb 23, 2025	Adversarial Robustness, Deep Reinforcement Learning	Code Available	1
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning	Feb 23, 2025	Benchmarking	Code Available	1
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing	Feb 23, 2025	Inductive Bias, Large Language Model	Code Available	1