Benchmarking

Papers

Showing 51–100 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchopt: Reproducible, efficient and collaborative optimization benchmarks	Jun 27, 2022	Benchmarking, image-classification	Code Available	4
RecBole 2.0: Towards a More Up-to-Date Recommendation Library	Jun 15, 2022	Benchmarking, Data Augmentation	Code Available	4
Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets	Mar 9, 2022	Benchmarking, Graph Regression	Code Available	4
TabArena: A Living Benchmark for Machine Learning on Tabular Data	Jun 20, 2025	Benchmarking	Code Available	3
ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications	Jun 14, 2025	Benchmarking	Code Available	3
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation	May 24, 2025	Benchmarking, Chart Understanding	Code Available	3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models	May 22, 2025	Benchmarking, Instruction Following	Code Available	3
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models	May 22, 2025	Benchmarking, Fairness	Code Available	3
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking	May 20, 2025	Benchmarking	Code Available	3
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking	May 16, 2025	Benchmarking, Management	Code Available	3
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization	May 9, 2025	Benchmarking	Code Available	3
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites	Apr 15, 2025	Autonomous Web Navigation, Benchmarking	Code Available	3
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs	Mar 26, 2025	Benchmarking	Code Available	3
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining	Mar 23, 2025	3DGS, Benchmarking	Code Available	3
nnInteractive: Redefining 3D Promptable Segmentation	Mar 11, 2025	Benchmarking, Interactive Segmentation	Code Available	3
Robust Latent Matters: Boosting Image Generation with Sampling Error	Mar 11, 2025	Benchmarking, Image Generation	Code Available	3
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection	Feb 27, 2025	Action Detection, Benchmarking	Code Available	3
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction	Feb 26, 2025	Benchmarking, Time Series	Code Available	3
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks	Feb 7, 2025	Benchmarking	Code Available	3
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	Code Available	3
Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications	Dec 3, 2024	Benchmarking, Disaster Response	Code Available	3
Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and Forecasts	Nov 14, 2024	Benchmarking	Code Available	3
General Geospatial Inference with a Population Dynamics Foundation Model	Nov 11, 2024	Benchmarking, Graph Neural Network	Code Available	3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent	Nov 5, 2024	Benchmarking, Hallucination	Code Available	3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Oct 31, 2024	Benchmarking	Code Available	3
XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM	Oct 31, 2024	3DGS, Benchmarking	Code Available	3
OGBench: Benchmarking Offline Goal-Conditioned RL	Oct 26, 2024	Benchmarking, reinforcement-learning	Code Available	3
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances	Oct 24, 2024	Benchmarking, Image to Video Generation	Code Available	3
VoiceBench: Benchmarking LLM-Based Voice Assistants	Oct 22, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	3
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory	Oct 14, 2024	Benchmarking, Large Language Model	Code Available	3
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making	Oct 9, 2024	Benchmarking, Decision Making	Code Available	3
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents	Oct 3, 2024	Autonomous Driving, Backdoor Attack	Code Available	3
OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models	Oct 2, 2024	Benchmarking	Code Available	3
The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark	Sep 26, 2024	Anomaly Detection, Benchmarking	Code Available	3
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems	Sep 2, 2024	Benchmarking, Instruction Following	Code Available	3
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework	Aug 2, 2024	Benchmarking, Dataset Generation	Code Available	3
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents	Jul 26, 2024	Benchmarking, Code Generation	Code Available	3
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation	Jul 24, 2024	Benchmarking, Human Animation	Code Available	3
AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking	Jul 23, 2024	Benchmarking, Transfer Learning	Code Available	3
Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation	Jul 9, 2024	Benchmarking, Domain Adaptation	Code Available	3
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation	Jul 1, 2024	Benchmarking, RAG	Code Available	3
HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis	Jun 23, 2024	Benchmarking, Representation Learning	Code Available	3
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation	Jun 19, 2024	Benchmarking, Image Generation	Code Available	3
WebCanvas: Benchmarking Web Agents in Online Environments	Jun 18, 2024	AI Agent, Benchmarking	Code Available	3
TSI-Bench: Benchmarking Time Series Imputation	Jun 18, 2024	Benchmarking, Deep Learning	Code Available	3
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs	Jun 14, 2024	Benchmarking, Knowledge Graphs	Code Available	3
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks	Jun 13, 2024	Benchmarking	Code Available	3
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks	Jun 12, 2024	Benchmarking, Chatbot	Code Available	3
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents	Jun 10, 2024	Benchmarking, scientific discovery	Code Available	3
TopoBench: A Framework for Benchmarking Topological Deep Learning	Jun 9, 2024	Benchmarking, Deep Learning	Code Available	3

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified