Benchmarking

Papers

Showing 331–340 of 5548 papers

Title	Date	Tasks	Status	Hype
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	May 22, 2025	Benchmarking	Unverified	0
SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation	May 21, 2025	Benchmarking, Code Generation	Unverified	0
NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction	May 21, 2025	Benchmarking, Hallucination	Unverified	0
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation	May 21, 2025	Benchmarking, RAG	Unverified	0
Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets	May 21, 2025	Benchmarking, Diagnostic	Unverified	0
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning	May 21, 2025	Benchmarking, Imitation Learning	Unverified	0
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models	May 21, 2025	Benchmarking	Code Available	0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering	May 21, 2025	Benchmarking, Language Modeling	Code Available	0
AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals	May 21, 2025	Benchmarking, Chatbot	Unverified	0
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems	May 21, 2025	Benchmarking, Math	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified