SOTAVerified

Benchmarking

Papers

Showing 1351–1400 of 5548 papers

Title | Status | Hype
A framework for benchmarking class-out-of-distribution detection and its application to ImageNet | Code | 1
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
Machine Learning Methods for Brain Network Classification: Application to Autism Diagnosis using Cortical Morphological Networks | Code | 1
Machine Learning with Knowledge Constraints for Process Optimization of Open-Air Perovskite Solar Cell Manufacturing | Code | 1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Code | 1
DependEval: Benchmarking LLMs for Repository Dependency Understanding | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient | Code | 1
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts | Code | 1
Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling | Code | 1
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified