SOTAVerified

Benchmarking

Papers

Showing 5126–5150 of 5548 papers

Title | Status | Hype
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
On the Fragility of Active Learners for Text Classification | Code | 0
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
On the Loss of Context-awareness in General Instruction Fine-tuning | Code | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
SNaC: Coherence Error Detection for Narrative Summarization | Code | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
Using Motif Transitions for Temporal Graph Generation | Code | 0
Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Code | 0
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
Word Embeddings for the Construction Domain | Code | 0
What Actions are Needed for Understanding Human Actions in Videos? | Code | 0
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Code | 0
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers | Code | 0
On the Use of ArXiv as a Dataset | Code | 0
On the use of automatically generated synthetic image datasets for benchmarking face recognition | Code | 0
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Code | 0
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | Code | 0
On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified