SOTAVerified

Benchmarking

Papers

Showing 251–260 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
A Content-Driven Micro-Video Recommendation Dataset at Scale	Sep 27, 2023	Benchmarking, Recommendation Systems	Code Available	2	5
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models	Oct 13, 2024	Benchmarking, Graph Generation	Code Available	2	5
EffiBench: Benchmarking the Efficiency of Automatically Generated Code	Feb 3, 2024	Benchmarking, Code Completion	Code Available	2	5
nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark	Jan 1, 2025	Benchmarking, Image Segmentation	Code Available	2	5
Authorship Obfuscation in Multilingual Machine-Generated Text Detection	Jan 15, 2024	Adversarial Robustness, Benchmarking	Code Available	2	5
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	Jun 24, 2024	Benchmarking, Image Generation	Code Available	2	5
AutoPenBench: Benchmarking Generative Agents for Penetration Testing	Oct 4, 2024	Benchmarking	Code Available	2	5
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering	Jul 15, 2025	Benchmarking, Instruction Following	Code Available	2	5
Deep Visual Geo-localization Benchmark	Apr 7, 2022	Benchmarking, Data Augmentation	Code Available	2	5
PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips	Sep 7, 2023	Benchmarking, Knowledge Graphs	Code Available	2	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified