Benchmarking

Papers

Showing 3041–3050 of 5548 papers

Title	Date	Tasks	Status	Hype
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available	0
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models	Jun 17, 2024	Benchmarking, Survey	Unverified	0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content	Jun 17, 2024	Benchmarking, General Knowledge	Code Available	0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning	Jun 16, 2024	Benchmarking, Math	Unverified	0
Evaluating the Performance of Large Language Models via Debates	Jun 16, 2024	Benchmarking	Unverified	0
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex	Jun 16, 2024	Benchmarking, Object Recognition	Unverified	0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters	Jun 16, 2024	Benchmarking, Instance Segmentation	Code Available	0
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences	Jun 16, 2024	Benchmarking, Spatial Reasoning	Unverified	0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models	Jun 16, 2024	Benchmarking	Code Available	0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment	Jun 16, 2024	Action Understanding, Benchmarking	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified