SOTAVerified

Benchmarking

Papers

Showing 1151-1175 of 5548 papers

Title | Status | Hype
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Code | 1
Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1
Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets | Code | 1
A global analysis of metrics used for measuring performance in natural language processing | Code | 1
A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Code | 1
Benchmarking Multidomain English-Indonesian Machine Translation | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Code | 1
EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Benchmarking and scaling of deep learning models for land cover image classification | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified