SOTAVerified

Benchmarking

Papers

Showing 3501-3550 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Large Language Models as Automated Aligners for benchmarking Vision-Language Models | | 0 |
| An Empirical Investigation into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification | | 0 |
| Dialogue Quality and Emotion Annotations for Customer Support Conversations | Code | 0 |
| Learning Dynamic Selection and Pricing of Out-of-Home Deliveries | Code | 0 |
| Automated 3D Tumor Segmentation using Temporal Cubic PatchGAN (TCuP-GAN) | | 0 |
| Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI | Code | 0 |
| A projected nonlinear state-space model for forecasting time series signals | Code | 0 |
| Benchmarking Toxic Molecule Classification using Graph Neural Networks and Few Shot Learning | | 0 |
| Benchmarking bias: Expanding clinical AI model card to incorporate bias reporting of social and non-social factors | | 0 |
| Deep State-Space Model for Predicting Cryptocurrency Price | | 0 |
| Segment Together: A Versatile Paradigm for Semi-Supervised Medical Image Segmentation | | 0 |
| Demonstrating Almost Linear Time Complexity of Bus Admittance Matrix-Based Distribution Network Power Flow: An Empirical Approach | | 0 |
| Holistic Inverse Rendering of Complex Facade via Aerial 3D Scanning | | 0 |
| LABCAT: Locally adaptive Bayesian optimization using principal-component-aligned trust regions | Code | 0 |
| Benchmarking Feature Extractors for Reinforcement Learning-Based Semiconductor Defect Localization | | 0 |
| Benchmarking Machine Learning Models for Quantum Error Correction | | 0 |
| Predicting the Probability of Collision of a Satellite with Space Debris: A Bayesian Machine Learning Approach | | 0 |
| Social Bias Probing: Fairness Benchmarking for Language Models | | 0 |
| Domain Aligned CLIP for Few-shot Classification | | 0 |
| Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Code | 0 |
| Model Agnostic Explainable Selective Regression via Uncertainty Estimation | | 0 |
| Benchmarking Individual Tree Mapping with Sub-meter Imagery | | 0 |
| On Using Distribution-Based Compositionality Assessment to Evaluate Compositional Generalisation in Machine Translation | Code | 0 |
| The Disagreement Problem in Faithfulness Metrics | | 0 |
| Uncertainty estimation of machine learning spatial precipitation predictions from satellite data | | 0 |
| MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | | 0 |
| Connecting the Dots: Graph Neural Network Powered Ensemble and Classification of Medical Images | Code | 0 |
| Identification of vortex in unstructured mesh with graph neural networks | | 0 |
| SeaTurtleID2022: A long-span dataset for reliable sea turtle re-identification | | 0 |
| Prompt Sketching for Large Language Models | | 0 |
| An efficiency analysis of Spanish airports | | 0 |
| A Comprehensive Summarization and Evaluation of Feature Refinement Modules for CTR Prediction | Code | 0 |
| DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding | Code | 0 |
| Benchmarking Deep Facial Expression Recognition: An Extensive Protocol with Balanced Dataset in the Wild | | 0 |
| Benchmarking Differential Evolution on a Quantum Simulator | | 0 |
| Exploitation-Guided Exploration for Semantic Embodied Navigation | | 0 |
| Benchmarking a Benchmark: How Reliable is MS-COCO? | | 0 |
| Learning Disentangled Speech Representations | | 0 |
| Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval | Code | 0 |
| Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Code | 0 |
| An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction | | 0 |
| Investigating Deep-Learning NLP for Automating the Extraction of Oncology Efficacy Endpoints from Scientific Literature | | 0 |
| Use of Deep Neural Networks for Uncertain Stress Functions with Extensions to Impact Mechanics | | 0 |
| Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia | Code | 0 |
| Decentralized Federated Learning on the Edge over Wireless Mesh Networks | | 0 |
| Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs | | 0 |
| SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization | | 0 |
| A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain | | 0 |
| Next-generation MRD assays: do we have the tools to evaluate them properly? | | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |