Benchmarking

Papers

Showing 151–160 of 5548 papers

Title	Date	Tasks	Status	Hype
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models	Jun 5, 2025	Benchmarking, Diversity	Unverified	0
FRED: The Florence RGB-Event Drone Dataset	Jun 5, 2025	Benchmarking, Trajectory Forecasting	Unverified	0
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation	Jun 5, 2025	Benchmarking	Code Available	0
Refer to Anything with Vision-Language Prompts	Jun 5, 2025	Benchmarking, Generalized Referring Expression Segmentation	Unverified	0
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos	Jun 5, 2025	Benchmarking, Mathematical Reasoning	Unverified	0
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories	Jun 5, 2025	Benchmarking, Optical Character Recognition	Code Available	2
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx	Jun 5, 2025	2D Pose Estimation, Benchmarking	Unverified	0
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems	Jun 5, 2025	Benchmarking, RAG	Unverified	0
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model	Jun 5, 2025	Benchmarking, Language Modeling	Unverified	0
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values	Jun 5, 2025	Benchmarking	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified