Benchmarking

Papers

Showing 101–125 of 5548 papers

Title	Date	Tasks	Status	Hype
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs	Jun 7, 2024	Benchmarking, Decoder	Code Available	3
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild	Jun 7, 2024	Benchmarking, Chatbot	Code Available	3
MLVU: Benchmarking Multi-task Long Video Understanding	Jun 6, 2024	Benchmarking, Video Understanding	Code Available	3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving	May 27, 2024	Autonomous Driving, Benchmarking	Code Available	3
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset	May 17, 2024	16k, Benchmarking	Code Available	3
Are EEG-to-Text Models Working?	May 10, 2024	Benchmarking, EEG	Code Available	3
ACEGEN: Reinforcement learning of generative chemical agents for drug discovery	May 7, 2024	Benchmarking, Decision Making	Code Available	3
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension	Apr 25, 2024	Benchmarking, Multiple-choice	Code Available	3
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases	Apr 19, 2024	Benchmarking, Retrieval	Code Available	3
DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection	Apr 19, 2024	Benchmarking, DeepFake Detection	Code Available	3
Advancing LLM Reasoning Generalists with Preference Trees	Apr 2, 2024	Benchmarking, Code Generation	Code Available	3
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework	Mar 19, 2024	Benchmarking, Financial Analysis	Code Available	3
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection	Mar 19, 2024	Anomaly Detection, Benchmarking	Code Available	3
Recurrent Drafter for Fast Speculative Decoding in Large Language Models	Mar 14, 2024	Benchmarking, Knowledge Distillation	Code Available	3
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries	Jan 27, 2024	Benchmarking, RAG	Code Available	3
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents	Jan 24, 2024	Benchmarking	Code Available	3
Benchmarking LLMs via Uncertainty Quantification	Jan 23, 2024	Benchmarking, Uncertainty Quantification	Code Available	3
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation	Jan 22, 2024	Benchmarking, Diagnostic	Code Available	3
SEED-Bench: Benchmarking Multimodal Large Language Models	Jan 1, 2024	Benchmarking, Image Generation	Code Available	3
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One	Dec 10, 2023	All, Benchmarking	Code Available	3
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion	Nov 4, 2023	Benchmarking, Imitation Learning	Code Available	3
CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving	Oct 11, 2023	Autonomous Driving, Benchmarking	Code Available	3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis	Oct 9, 2023	Benchmarking, Multivariate Time Series Forecasting	Code Available	3
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation	Oct 4, 2023	3D Generation, Benchmarking	Code Available	3
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation	Sep 29, 2023	3D Human Pose Estimation, 3D Human Reconstruction	Code Available	3

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified