SOTAVerified

Benchmarking

Papers

Showing 11–20 of 5548 papers

Title	Date	Tasks	Status	Hype
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks	Jul 14, 2025	Benchmarking, Code Generation	Unverified	0
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop	Jul 14, 2025	Benchmarking	Unverified	0
MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking	Jul 14, 2025	Benchmarking, Language Modeling	Unverified	0
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models	Jul 13, 2025	Attribute, Benchmarking	Code Available	0
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift	Jul 12, 2025	Benchmarking, Transfer Learning	Unverified	0
Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible	Jul 10, 2025	Adversarial Attack, Benchmarking	Code Available	0
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning	Jul 9, 2025	Benchmarking, Image Retrieval	Code Available	0
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset	Jul 9, 2025	Benchmarking, Decision Making	Unverified	0
A Systematic Analysis of Hybrid Linear Attention	Jul 8, 2025	Benchmarking, Language Modeling	Unverified	0
SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations	Jul 8, 2025	6D Pose Estimation, 6D Pose Estimation using RGB	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified