SOTAVerified

Benchmarking

Papers

Showing 2421–2430 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning	Jan 22, 2025	Benchmarking	Code Available	0	5
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider	Apr 26, 2025	Benchmarking, GPU	Code Available	0	5
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset	Feb 8, 2024	Benchmarking	Code Available	0	5
Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring	Feb 10, 2025	Benchmarking	Code Available	0	5
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data	Jan 31, 2024	Benchmarking, Change Detection	Code Available	0	5
Large-scale Ridesharing DARP Instances Based on Real Travel Demand	May 30, 2023	Benchmarking	Code Available	0	5
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction	Jul 9, 2024	Benchmarking	Code Available	0	5
Strong and Simple Baselines for Multimodal Utterance Embeddings	May 14, 2019	Benchmarking	Code Available	0	5
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available	0	5
GenderBench: Evaluation Suite for Gender Biases in LLMs	May 17, 2025	Benchmarking	Code Available	0	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified