SOTAVerified

Benchmarking

Papers

Showing 1451-1475 of 5548 papers

Title | Status | Hype
Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Code | 1
Just Rank: Rethinking Evaluation with Word and Sentence Similarities | Code | 1
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1
Beyond neural scaling laws: beating power law scaling via data pruning | Code | 1
Bag of Tricks for Adversarial Training | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization | Code | 1
Beyond Normal: On the Evaluation of Mutual Information Estimators | Code | 1
Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Code | 1
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | Code | 1
RobustBench: a standardized adversarial robustness benchmark | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
Exploring Large Language Models for Classical Philology | Code | 1
EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Code | 1
RobFR: Benchmarking Adversarial Robustness on Face Recognition | Code | 1
Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems | Code | 1
MatTools: Benchmarking Large Language Models for Materials Science Tools | Code | 1
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1
Benchmarking Knowledge-driven Zero-shot Learning | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified