Benchmarking

Papers

Showing 101–150 of 5548 papers

Title	Date	Tasks	Status	Hype
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs	Jun 7, 2024	Benchmarking, Decoder	Code Available	3
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild	Jun 7, 2024	Benchmarking, Chatbot	Code Available	3
MLVU: Benchmarking Multi-task Long Video Understanding	Jun 6, 2024	Benchmarking, Video Understanding	Code Available	3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving	May 27, 2024	Autonomous Driving, Benchmarking	Code Available	3
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset	May 17, 2024	16k, Benchmarking	Code Available	3
Are EEG-to-Text Models Working?	May 10, 2024	Benchmarking, EEG	Code Available	3
ACEGEN: Reinforcement learning of generative chemical agents for drug discovery	May 7, 2024	Benchmarking, Decision Making	Code Available	3
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension	Apr 25, 2024	Benchmarking, Multiple-choice	Code Available	3
DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection	Apr 19, 2024	Benchmarking, DeepFake Detection	Code Available	3
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases	Apr 19, 2024	Benchmarking, Retrieval	Code Available	3
Advancing LLM Reasoning Generalists with Preference Trees	Apr 2, 2024	Benchmarking, Code Generation	Code Available	3
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection	Mar 19, 2024	Anomaly Detection, Benchmarking	Code Available	3
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework	Mar 19, 2024	Benchmarking, Financial Analysis	Code Available	3
Recurrent Drafter for Fast Speculative Decoding in Large Language Models	Mar 14, 2024	Benchmarking, Knowledge Distillation	Code Available	3
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries	Jan 27, 2024	Benchmarking, RAG	Code Available	3
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents	Jan 24, 2024	Benchmarking	Code Available	3
Benchmarking LLMs via Uncertainty Quantification	Jan 23, 2024	Benchmarking, Uncertainty Quantification	Code Available	3
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation	Jan 22, 2024	Benchmarking, Diagnostic	Code Available	3
SEED-Bench: Benchmarking Multimodal Large Language Models	Jan 1, 2024	Benchmarking, Image Generation	Code Available	3
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One	Dec 10, 2023	All, Benchmarking	Code Available	3
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion	Nov 4, 2023	Benchmarking, Imitation Learning	Code Available	3
CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving	Oct 11, 2023	Autonomous Driving, Benchmarking	Code Available	3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis	Oct 9, 2023	Benchmarking, Multivariate Time Series Forecasting	Code Available	3
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation	Oct 4, 2023	3D Generation, Benchmarking	Code Available	3
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation	Sep 29, 2023	3D Human Pose Estimation, 3D Human Reconstruction	Code Available	3
Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions	Aug 28, 2023	Benchmarking, Formation Energy	Code Available	3
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning	Jun 5, 2023	Benchmarking	Code Available	3
TorchBench: Benchmarking PyTorch with High API Surface Coverage	Apr 27, 2023	Benchmarking, GPU	Code Available	3
Highly Accurate Quantum Chemical Property Prediction with Uni-Mol+	Mar 16, 2023	Benchmarking, Graph Regression	Code Available	3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning	Jan 26, 2023	Benchmarking, Deep Reinforcement Learning	Code Available	3
AER: Auto-Encoder with Regression for Time Series Anomaly Detection	Dec 27, 2022	Anomaly Detection, Benchmarking	Code Available	3
CORL: Research-oriented Deep Offline Reinforcement Learning Library	Oct 13, 2022	Benchmarking, D4RL	Code Available	3
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks	Apr 16, 2022	Benchmarking, Instruction Following	Code Available	3
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs	Mar 14, 2022	Benchmarking, Graph Embedding	Code Available	3
CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms	Nov 16, 2021	Benchmarking, Deep Reinforcement Learning	Code Available	3
Personalized Benchmarking with the Ludwig Benchmarking Toolkit	Nov 8, 2021	Benchmarking, Hyperparameter Optimization	Code Available	3
Benchmarking Multimodal AutoML for Tabular Data with Text Fields	Nov 4, 2021	AutoML, Benchmarking	Code Available	3
A Survey on Performance Metrics for Object-Detection Algorithms	Jul 21, 2020	Benchmarking, Object	Code Available	3
Benchmarking Automatic Machine Learning Frameworks	Aug 17, 2018	Automated Feature Engineering, AutoML	Code Available	3
mlpack 3: a fast, flexible machine learning library	Jun 18, 2018	Benchmarking, BIG-bench Machine Learning	Code Available	3
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering	Jul 15, 2025	Benchmarking, Instruction Following	Code Available	2
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning	Jul 4, 2025	Benchmarking, Graph Generation	Code Available	2
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning	Jun 24, 2025	Benchmarking, Drug Discovery	Code Available	2
TAB: Unified Benchmarking of Time Series Anomaly Detection Methods	Jun 22, 2025	Anomaly Detection, Benchmarking	Code Available	2
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models	Jun 17, 2025	Benchmarking, Language Modeling	Code Available	2
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks	Jun 13, 2025	Benchmarking, Large Language Model	Code Available	2
SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis	Jun 12, 2025	Benchmarking, Dialogue Generation	Code Available	2
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments	Jun 11, 2025	Benchmarking	Code Available	2
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories	Jun 5, 2025	Benchmarking, Optical Character Recognition	Code Available	2
GSCodec Studio: A Modular Framework for Gaussian Splat Compression	Jun 2, 2025	Benchmarking	Code Available	2

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified