Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 226–250 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Learning Transferable Visual Models From Natural Language Supervision	Feb 26, 2021	Action RecognitionBenchmarking	CodeCode Available	2	5
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing	Apr 3, 2025	BenchmarkingLogical Reasoning	CodeCode Available	2	5
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models	Dec 11, 2023	BenchmarkingEmotional Intelligence	CodeCode Available	2	5
EvalGIM: A Library for Evaluating Generative Image Models	Dec 13, 2024	BenchmarkingDiversity	CodeCode Available	2	5
Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework	Sep 17, 2024	BenchmarkingFederated Learning	CodeCode Available	2	5
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models	Jul 1, 2024	BenchmarkingFairness	CodeCode Available	2	5
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters	May 27, 2024	BenchmarkingGSM8K	CodeCode Available	2	5
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark	Jan 1, 2024	Age EstimationBenchmarking	CodeCode Available	2	5
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology	Feb 6, 2024	AllBenchmarking	CodeCode Available	2	5
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving	Dec 19, 2024	Autonomous DrivingBenchmarking	CodeCode Available	2	5
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	CodeCode Available	2	5
EffiBench: Benchmarking the Efficiency of Automatically Generated Code	Feb 3, 2024	BenchmarkingCode Completion	CodeCode Available	2	5
EasyTPP: Towards Open Benchmarking Temporal Point Processes	Jul 16, 2023	BenchmarkingPoint Processes	CodeCode Available	2	5
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models	Oct 13, 2024	BenchmarkingGraph Generation	CodeCode Available	2	5
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	Jun 13, 2024	BenchmarkingGPU	CodeCode Available	2	5
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	Jun 24, 2024	BenchmarkingImage Generation	CodeCode Available	2	5
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model	Sep 30, 2022	BenchmarkingBlind Docking	CodeCode Available	2	5
Fast Vision Transformers with HiLo Attention	May 26, 2022	BenchmarkingEfficient ViTs	CodeCode Available	2	5
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)	Jan 14, 2023	Benchmarking	CodeCode Available	2	5
A Content-Driven Micro-Video Recommendation Dataset at Scale	Sep 27, 2023	BenchmarkingRecommendation Systems	CodeCode Available	2	5
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations	Jul 1, 2024	Benchmarkingdocument understanding	CodeCode Available	2	5
Deep Visual Geo-localization Benchmark	Apr 7, 2022	BenchmarkingData Augmentation	CodeCode Available	2	5
A large annotated medical image dataset for the development and evaluation of segmentation algorithms	Feb 25, 2019	BenchmarkingSegmentation	CodeCode Available	2	5
Datasets and Benchmarks for Offline Safe Reinforcement Learning	Jun 15, 2023	Autonomous DrivingBenchmarking	CodeCode Available	2	5
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer	Mar 21, 2025	BenchmarkingVideo Generation	CodeCode Available	2	5

Show:10 25 50

← PrevPage 10 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified