SOTAVerified

Benchmarking

Papers

Showing 2411–2420 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Do LLM Evaluators Prefer Themselves for a Reason?	Apr 4, 2025	Benchmarking, Code Generation	Code Available	0	5
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning	Jan 22, 2025	Benchmarking	Code Available	0	5
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset	Feb 8, 2024	Benchmarking	Code Available	0	5
Flexible Generation of Preference Data for Recommendation Analysis	Jul 23, 2024	Benchmarking, Recommendation Systems	Code Available	0	5
HATE-ITA: New Baselines for Hate Speech Detection in Italian	Jul 1, 2022	Benchmarking, Hate Speech Detection	Code Available	0	5
Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization	Aug 29, 2024	Benchmarking, Diversity	Code Available	0	5
Evaluating Shallow and Deep Neural Networks for Network Intrusion Detection Systems in Cyber Security	Oct 8, 2018	Benchmarking, BIG-bench Machine Learning	Code Available	0	5
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses	May 19, 2023	Benchmarking, Form	Code Available	0	5
Strong and Simple Baselines for Multimodal Utterance Embeddings	May 14, 2019	Benchmarking	Code Available	0	5
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data	Feb 22, 2024	Benchmarking	Code Available	0	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified