SOTAVerified

Benchmarking

Papers

Showing 1351–1375 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| A Computed Tomography Vertebral Segmentation Dataset with Anatomical Variations and Multi-Vendor Scanner Data | Code | 1 |
| DFGC 2021: A DeepFake Game Competition | Code | 1 |
| DFGC 2022: The Second DeepFake Game Competition | Code | 1 |
| Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1 |
| A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1 |
| DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Code | 1 |
| Online Learning with Optimism and Delay | Code | 1 |
| SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Code | 1 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1 |
| Initial recommendations for performing, benchmarking, and reporting single-cell proteomics experiments | Code | 1 |
| Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1 |
| Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Code | 1 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Code | 1 |
| OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion | Code | 1 |
| Benchmarking Image Retrieval for Visual Localization | Code | 1 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1 |
| IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding | Code | 1 |
| OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents | Code | 1 |
| IMUPoser: Full-Body Pose Estimation using IMUs in Phones, Watches, and Earbuds | Code | 1 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1 |
| DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models | Code | 1 |
| RGB-D Indiscernible Object Counting in Underwater Scenes | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |