SOTAVerified

Benchmarking

Papers

Showing 5151–5200 of 5548 papers

Title | Status | Hype
On Using Distribution-Based Compositionality Assessment to Evaluate Compositional Generalisation in Machine Translation | Code | 0
Are Large Language Models Good at Utility Judgments? | Code | 0
Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Code | 0
Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Code | 0
VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | Code | 0
Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0
OpenBioLink: A benchmarking framework for large-scale biomedical link prediction | Code | 0
DispaRisk: Auditing Fairness Through Usable Information | Code | 0
A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Code | 0
Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Code | 0
Large Scale Clustering with Variational EM for Gaussian Mixture Models | Code | 0
AI Sound Recognition on Asthma Medication Adherence: Evaluation With the RDA Benchmark Suite | Code | 0
Dialogue Quality and Emotion Annotations for Customer Support Conversations | Code | 0
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | Code | 0
OpenDenoising: an Extensible Benchmark for Building Comparative Studies of Image Denoisers | Code | 0
OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Code | 0
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Code | 0
Towards Biologically Plausible and Private Gene Expression Data Generation | Code | 0
DFEE: Interactive DataFlow Execution and Evaluation Kit | Code | 0
Towards causal benchmarking of bias in face analysis algorithms | Code | 0
SORCE: Small Object Retrieval in Complex Environments | Code | 0
Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings | Code | 0
Recognizing Object Affordances to Support Scene Reasoning for Manipulation Tasks | Code | 0
CleanPatrick: A Benchmark for Image Data Cleaning | Code | 0
Detecting critical treatment effect bias in small subgroups | Code | 0
AI-generated Image Quality Assessment in Visual Communication | Code | 0
SOSD: A Benchmark for Learned Indexes | Code | 0
OpenML Benchmarking Suites | Code | 0
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Code | 0
Design and implementation of intelligent packet filtering in IoT microcontroller-based devices | Code | 0
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection | Code | 0
Dermatological Diagnosis Explainability Benchmark for Convolutional Neural Networks | Code | 0
Depth Functions for Partial Orders with a Descriptive Analysis of Machine Learning Algorithms | Code | 0
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark | Code | 0
Towards Efficient and Scalable Training of Differentially Private Deep Learning | Code | 0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0
Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | Code | 0
Delta-Influence: Unlearning Poisons via Influence Functions | Code | 0
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware | Code | 0
Defense-friendly Images in Adversarial Attacks: Dataset and Metrics for Perturbation Difficulty | Code | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
Deep Reinforcement Learning for General Video Game AI | Code | 0
DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding | Code | 0
Operation-Level Performance Benchmarking of Graph Neural Networks for Scientific Applications | Code | 0
DeepOBS: A Deep Learning Optimizer Benchmark Suite | Code | 0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0
OptIForest: Optimal Isolation Forest for Anomaly Detection | Code | 0
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified