SOTAVerified

Benchmarking

Papers

Showing 3551–3575 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | | 0 |
| A Metadata-Driven Approach to Understand Graph Neural Networks | | 0 |
| Domain Generalization in Computational Pathology: Survey and Guidelines | | 0 |
| LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | | 0 |
| Evaluating LLP Methods: Challenges and Approaches | Code | 0 |
| Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Code | 0 |
| On General Language Understanding | | 0 |
| OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Code | 0 |
| OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | | 0 |
| RDBench: ML Benchmark for Relational Databases | | 0 |
| ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair | | 0 |
| XFEVER: Exploring Fact Verification across Languages | Code | 0 |
| Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | | 0 |
| BLESS: Benchmarking Large Language Models on Sentence Simplification | Code | 0 |
| Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | | 0 |
| XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Code | 0 |
| DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Code | 0 |
| A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Code | 0 |
| Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Code | 0 |
| Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package | | 0 |
| Almost Equivariance via Lie Algebra Convolutions | | 0 |
| Benchmarking GPUs on SVBRDF Extractor Model | | 0 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |