SOTAVerified

Benchmarking

Papers

Showing 521–530 of 5548 papers

Title	Date	Tasks	Status	Hype
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification	Apr 29, 2025	Benchmarking, Code Generation	Code Available	1
TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks	Apr 29, 2025	Benchmarking, Misinformation	Code Available	1
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks	Apr 29, 2025	Anomaly Detection, Benchmarking	Unverified	0
LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs	Apr 29, 2025	Benchmarking, Face Generation	Unverified	0
SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories	Apr 29, 2025	Benchmarking, Code Generation	Unverified	0
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation	Apr 29, 2025	Benchmarking, Fairness	Code Available	0
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models	Apr 29, 2025	Benchmarking, Dataset Generation	Code Available	0
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets	Apr 28, 2025	Articles, Benchmarking	Unverified	0
BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics	Apr 28, 2025	Benchmarking	Unverified	0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution	Apr 28, 2025	Benchmarking, Image Attribution	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified