SOTAVerified

Benchmarking

Papers

Showing 1161–1170 of 5548 papers

Title	Date	Tasks	Status	Hype
Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts	Dec 20, 2024	Benchmarking, Optical Character Recognition	Code Available	0
AI-generated Image Quality Assessment in Visual Communication	Dec 20, 2024	Benchmarking, Image Quality Assessment	Code Available	0
Generative CKM Construction using Partially Observed Data with Diffusion Model	Dec 19, 2024	Benchmarking	Code Available	1
TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation	Dec 19, 2024	Benchmarking, Description-guided molecule generation	Code Available	1
Pitfalls of topology-aware image segmentation	Dec 19, 2024	Benchmarking, Image Segmentation	Unverified	0
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving	Dec 19, 2024	Autonomous Driving, Benchmarking	Code Available	2
Autonomous Microscopy Experiments through Large Language Model Agents	Dec 18, 2024	Benchmarking, Experimental Design	Code Available	1
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks	Dec 18, 2024	Benchmarking	Code Available	1
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge	Dec 18, 2024	Benchmarking, World Knowledge	Code Available	0
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning	Dec 18, 2024	Benchmarking, Position	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified