SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 281–290 of 5548 papers

Title	Date	Tasks	Status	Hype
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset	May 24, 2025	BenchmarkingRAG	CodeCode Available	0
Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs	May 24, 2025	Benchmarking	—Unverified	0
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions	May 24, 2025	Benchmarking	CodeCode Available	2
Benchmarking and Rethinking Knowledge Editing for Large Language Models	May 24, 2025	Benchmarkingknowledge editing	CodeCode Available	0
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation	May 24, 2025	ArticlesBenchmarking	—Unverified	0
LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges	May 24, 2025	BenchmarkingMathematical Reasoning	CodeCode Available	0
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models	May 24, 2025	BenchmarkingVideo Grounding	—Unverified	0
Benchmarking Poisoning Attacks against Retrieval-Augmented Generation	May 24, 2025	BenchmarkingQuestion Answering	—Unverified	0
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation	May 24, 2025	BenchmarkingChart Understanding	CodeCode Available	3
SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs	May 24, 2025	Benchmarking	CodeCode Available	0

Show:10 25 50

← PrevPage 29 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified