SOTAVerified

Benchmarking

Papers

Showing 626–650 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Code | 0 |
| Leveraging State Space Models in Long Range Genomics | | 0 |
| Generative Adversarial Networks with Limited Data: A Survey and Benchmarking | | 0 |
| Riemannian Geometry for the classification of brain states with intracortical brain-computer interfaces | | 0 |
| Cross-functional transferability in universal machine learning interatomic potentials | | 0 |
| A Solid-State Nanopore Signal Generator for Training Machine Learning Models | | 0 |
| Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search | | 0 |
| Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression | Code | 0 |
| Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Code | 0 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1 |
| A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0 |
| Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | | 0 |
| Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems | Code | 0 |
| Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0 |
| Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency | | 0 |
| Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0 |
| Evaluating AI Recruitment Sourcing Tools by Human Preference | Code | 0 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1 |
| Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge | | 0 |
| Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms | | 0 |
| Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |