SOTAVerified

Benchmarking

Papers

Showing 1751–1775 of 5548 papers

Title | Status | Hype
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis | | 0
CMOS based image cytometry for detection of phytoplankton in ballast water | | 0
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment | | 0
Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos | | 0
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios | | 0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data | | 0
Benchmarking Causal Study to Interpret Large Language Models for Source Code | | 0
A new dataset of dog breed images and a benchmark for fine-grained classification | | 0
Benchmarking Attention Mechanisms and Consistency Regularization Semi-Supervised Learning for Post-Flood Building Damage Assessment in Satellite Images | | 0
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models | | 0
Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | | 0
Discriminative Link Prediction using Local Links, Node Features and Community Structure | | 0
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | | 0
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance | | 0
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools | | 0
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | | 0
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings | | 0
Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis | | 0
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices | | 0
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | | 0
DiPCo -- Dinner Party Corpus | | 0
LAraBench: Benchmarking Arabic AI with Large Language Models | | 0
ChemTime: Rapid and Early Classification for Multivariate Time Series Classification of Chemical Sensors | | 0
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | | 0
An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified