SOTAVerified

Benchmarking

Papers

Showing 851-900 of 5548 papers

Title | Status | Hype
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Code | 1
EntQA: Entity Linking as Question Answering | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
An Empirical Study on Google Research Football Multi-agent Scenarios | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Code | 1
AIPerf: Automated machine learning as an AI-HPC benchmark | Code | 1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1
4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction | Code | 1
Benchmarking Batch Deep Reinforcement Learning Algorithms | Code | 1
Enhancing Biomedical Relation Extraction with Directionality | Code | 1
Enhancing Ligand Pose Sampling for Molecular Docking | Code | 1
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | Code | 1
A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
End-to-end Knowledge Retrieval with Multi-modal Queries | Code | 1
Enhancing spatial and textual analysis with EUPEG: an extensible and unified platform for evaluating geoparsers | Code | 1
Knodle: Modular Weakly Supervised Learning with PyTorch | Code | 1
SHARP: Environment and Person Independent Activity Recognition with Commodity IEEE 802.11 Access Points | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Code | 1
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Code | 1
A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Code | 1
Recent Advances on Neural Network Pruning at Initialization | Code | 1
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Code | 1
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Code | 1
EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified