SOTAVerified

Benchmarking

Papers

Showing 1751–1760 of 5548 papers

Title	Date	Tasks	Status	Hype
Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance	May 22, 2025	Benchmarking, Prompt Engineering	Unverified	0
Experimental robustness benchmark of quantum neural network on a superconducting quantum processor	May 22, 2025	Adversarial Attack, Adversarial Robustness	Unverified	0
DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes	May 22, 2025	Benchmarking, RAG	Unverified	0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	May 22, 2025	Benchmarking	Unverified	0
BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research	May 22, 2025	Benchmarking	Unverified	0
BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text	May 22, 2025	Benchmarking, RAG	Unverified	0
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks	May 22, 2025	Benchmarking, Spatial Reasoning	Unverified	0
Edge-First Language Model Inference: Models, Metrics, and Tradeoffs	May 22, 2025	Benchmarking, Language Modeling	Unverified	0
CUB: Benchmarking Context Utilisation Techniques for Language Models	May 22, 2025	Benchmarking, Fact Checking	Unverified	0
Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE	May 22, 2025	Benchmarking, Time Series	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified