SOTAVerified

Benchmarking

Papers

Showing 2801-2825 of 5548 papers

Title | Status | Hype
Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0
An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction | | 0
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval | Code | 0
Decentralized Federated Learning on the Edge over Wireless Mesh Networks | | 0
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia | Code | 0
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs | | 0
SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization | | 0
UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | | 0
A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain | | 0
Next-generation MRD assays: do we have the tools to evaluate them properly? | | 0
In Search of Lost Online Test-time Adaptation: A Survey | Code | 1
What's In My Big Data? | Code | 2
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | | 0
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
Domain Generalization in Computational Pathology: Survey and Guidelines | | 0
A Metadata-Driven Approach to Understand Graph Neural Networks | | 0
Re-evaluating Retrosynthesis Algorithms with Syntheseus | Code | 1
LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | | 0
Evaluating LLP Methods: Challenges and Approaches | Code | 0
Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Code | 0
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Code | 0
On General Language Understanding | | 0
OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified