SOTAVerified

Benchmarking

Papers

Showing 2476-2500 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Machine learning classification of non-Markovian noise disturbing quantum dynamics | Code | 0 |
| Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0 |
| Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Code | 0 |
| Flexible Generation of Preference Data for Recommendation Analysis | Code | 0 |
| Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0 |
| Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0 |
| Are Large Language Models Good at Utility Judgments? | Code | 0 |
| Benchmarking performance of object detection under image distortions in an uncontrolled environment | Code | 0 |
| DispaRisk: Auditing Fairness Through Usable Information | Code | 0 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0 |
| Exploring Context Generalizability in Citywide Crowd Mobility Prediction: An Analytic Framework and Benchmark | Code | 0 |
| Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents | Code | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0 |
| GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | Code | 0 |
| Exploring Model-based Planning with Policy Networks | Code | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Code | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0 |
| A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified |