SOTAVerified

Benchmarking

Papers

Showing 2801-2850 of 5548 papers

Title | Status | Hype
Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0
An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction | | 0
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval | Code | 0
Decentralized Federated Learning on the Edge over Wireless Mesh Networks | | 0
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia | Code | 0
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs | | 0
SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization | | 0
UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | | 0
A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain | | 0
Next-generation MRD assays: do we have the tools to evaluate them properly? | | 0
In Search of Lost Online Test-time Adaptation: A Survey | Code | 1
What's In My Big Data? | Code | 2
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | | 0
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
Domain Generalization in Computational Pathology: Survey and Guidelines | | 0
A Metadata-Driven Approach to Understand Graph Neural Networks | | 0
Re-evaluating Retrosynthesis Algorithms with Syntheseus | Code | 1
LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | | 0
Evaluating LLP Methods: Challenges and Approaches | Code | 0
Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Code | 0
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Code | 0
On General Language Understanding | | 0
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | | 0
Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | | 0
RDBench: ML Benchmark for Relational Databases | | 0
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair | | 0
XFEVER: Exploring Fact Verification across Languages | Code | 0
MLFMF: Data Sets for Machine Learning for Mathematical Formalization | Code | 1
BLESS: Benchmarking Large Language Models on Sentence Simplification | Code | 0
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | | 0
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Code | 0
XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Code | 0
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
Fast hyperboloid decision tree algorithms | Code | 1
Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Code | 0
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Code | 0
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1
Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package | | 0
Benchmarking GPUs on SVBRDF Extractor Model | | 0
Almost Equivariance via Lie Algebra Convolutions | | 0
OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Code | 1
Formalizing and Benchmarking Prompt Injection Attacks and Defenses | Code | 2
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Code | 1
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Code | 1
Object-aware Inversion and Reassembly for Image Editing | Code | 1
Page 57 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified