SOTAVerified

Benchmarking

Papers

Showing 951-1000 of 5548 papers

Title | Status | Hype
Benchmarking Differential Privacy and Federated Learning for BERT Models | Code | 1
Accelerated and interpretable oblique random survival forests | Code | 1
Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed | Code | 1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Code | 1
Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
Monash University, UEA, UCR Time Series Extrinsic Regression Archive | Code | 1
Benchmarking Econometric and Machine Learning Methodologies in Nowcasting | Code | 1
Benchmarking Robustness of 3D Object Detection to Common Corruptions | Code | 1
Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis | Code | 1
Mukayese: Turkish NLP Strikes Back | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
Benchmarking Omni-Vision Representation through the Lens of Visual Realms | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
Benchmarking: Past, Present and Future | Code | 1
EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions | Code | 1
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms | Code | 1
FedScale: Benchmarking Model and System Performance of Federated Learning at Scale | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
Benchmarking Object Detectors with COCO: A New Path Forward | Code | 1
Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Code | 1
Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Code | 1
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
Benchmarking Offline Reinforcement Learning on Real-Robot Hardware | Code | 1
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1
Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers | Code | 1
Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition | Code | 1
Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Code | 1
Detecting beats in the photoplethysmogram: benchmarking open-source algorithms | Code | 1
MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery | Code | 1
Multi-Stream Cellular Test-Time Adaptation of Real-Time Models Evolving in Dynamic Environments | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
Explainable Benchmarking for Iterative Optimization Heuristics | Code | 1
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1
NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks | Code | 1
NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search | Code | 1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity | Code | 1
DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information | Code | 1
Protein Structure Tokenization: Benchmarking and New Recipe | Code | 1
Benchmarking Neural Network Generalization for Grammar Induction | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified