SOTAVerified

Benchmarking

Papers

Showing 1751-1800 of 5548 papers

Title | Status | Hype
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | | 0
CMOS based image cytometry for detection of phytoplankton in ballast water | | 0
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment | | 0
Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics | | 0
CityLearn v2: Energy-flexible, resilient, occupant-centric, and carbon-aware management of grid-interactive communities | | 0
Addressing the Real-world Class Imbalance Problem in Dermatology | | 0
CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry | | 0
A new dataset of dog breed images and a benchmark for fine-grained classification | | 0
Benchmarking Automated Review Response Generation for the Hospitality Domain | | 0
Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields | | 0
Benchmarking Automated Machine Learning Methods for Price Forecasting Applications | | 0
CIMLA: Interpretable AI for inference of differential causal networks | | 0
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis | | 0
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance | | 0
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis | | 0
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | | 0
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings | | 0
Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos | | 0
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios | | 0
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | | 0
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data | | 0
Benchmarking Attention Mechanisms and Consistency Regularization Semi-Supervised Learning for Post-Flood Building Damage Assessment in Satellite Images | | 0
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models | | 0
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | | 0
DLUE: Benchmarking Document Language Understanding | | 0
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools | | 0
Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis | | 0
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices | | 0
LAraBench: Benchmarking Arabic AI with Large Language Models | | 0
Cognitive Model Priors for Predicting Human Decisions | | 0
Coherent Feed Forward Quantum Neural Network | | 0
Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks | | 0
ChemTime: Rapid and Early Classification for Multivariate Time Series Classification of Chemical Sensors | | 0
An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | | 0
Diverse Community Data for Benchmarking Data Privacy Algorithms | | 0
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | | 0
An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction | | 0
Colonoscopy 3D Video Dataset with Paired Depth from 2D-3D Registration | | 0
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | | 0
ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task | | 0
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics | | 0
Distribution-Based Invariant Deep Networks for Learning Meta-Features | | 0
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories | | 0
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics | | 0
ChatGPT Alternative Solutions: Large Language Models Survey | | 0
Commute Graph Neural Networks | | 0
An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets | | 0
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | | 0
Distributed Training Large-Scale Deep Architectures | | 0
Sensitivity analysis and experimental evaluation of PID-like continuous sliding mode control | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified