SOTAVerified

Benchmarking

Papers

Showing 951-1000 of 5548 papers

Title | Status | Hype
Benchmarking Differential Privacy and Federated Learning for BERT Models | Code | 1
Accelerated and interpretable oblique random survival forests | Code | 1
A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
The Effect of Domain and Diacritics in Yorùbá-English Neural Machine Translation | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
MetaFormer and CNN Hybrid Model for Polyp Image Segmentation | Code | 1
Benchmarking: Past, Present and Future | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
MIMII DG: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection for Domain Generalization Task | Code | 1
Benchmarking Econometric and Machine Learning Methodologies in Nowcasting | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models | Code | 1
MIRFLEX: Music Information Retrieval Feature Library for Extraction | Code | 1
Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
MLonMCU: TinyML Benchmarking with Fast Retargeting | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
MMDetection: Open MMLab Detection Toolbox and Benchmark | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
MNIST-C: A Robustness Benchmark for Computer Vision | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
MONICA: Benchmarking on Long-tailed Medical Image Classification | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation | Code | 1
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
Page 20 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified