SOTAVerified

Benchmarking

Papers

Showing 151-175 of 5548 papers

Title | Status | Hype
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
VERINA: Benchmarking Verifiable Code Generation | Code | 2
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | Code | 2
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | Code | 2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
MINERVA: Evaluating Complex Video Reasoning | Code | 2
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook | Code | 2
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Code | 2
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Code | 2
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Code | 2
TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Code | 2
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Code | 2
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Code | 2
Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Code | 2
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Code | 2
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified