SOTAVerified

Benchmarking

Papers

Showing 276-300 of 5548 papers

Title | Status | Hype
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding |  | 0
EnvSDD: Benchmarking Environmental Sound Deepfake Detection |  | 0
Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking |  | 0
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments |  | 0
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | Code | 0
Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs |  | 0
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation |  | 0
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | Code | 0
Benchmarking and Rethinking Knowledge Editing for Large Language Models | Code | 0
SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs | Code | 0
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models |  | 0
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | Code | 3
Benchmarking Poisoning Attacks against Retrieval-Augmented Generation |  | 0
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection |  | 0
MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation |  | 0
Benchmark for Antibody Binding Affinity Maturation and Design |  | 0
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding |  | 0
3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | Code | 0
A Position Paper on the Automatic Generation of Machine Learning Leaderboards | Code | 0
SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | Code | 0
PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints |  | 0
PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language |  | 0
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified