SOTAVerified

Benchmarking

Papers

Showing 3151-3200 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks | | 0 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2 |
| Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1 |
| OptIForest: Optimal Isolation Forest for Anomaly Detection | Code | 0 |
| Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase | Code | 1 |
| GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection | Code | 1 |
| On-orbit model training for satellite imagery with label proportions | Code | 0 |
| On Evaluation of Document Classification using RVL-CDIP | | 0 |
| VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution | Code | 1 |
| Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features | Code | 1 |
| Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted? | | 0 |
| A Comprehensive Study on the Robustness of Image Classification and Object Detection in Remote Sensing: Surveying and Benchmarking | | 0 |
| IMP-MARL: a Suite of Environments for Large-scale Infrastructure Management Planning via MARL | Code | 1 |
| Diverse Community Data for Benchmarking Data Privacy Algorithms | | 0 |
| Geometric Deep Learning for Structure-Based Drug Design: A Survey | Code | 1 |
| Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Code | 0 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Code | 1 |
| causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery | Code | 1 |
| OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems | Code | 2 |
| Benchmarking Robustness of Deep Reinforcement Learning approaches to Online Portfolio Management | | 0 |
| Fairness Index Measures to Evaluate Bias in Biometric Recognition | | 0 |
| Using Motif Transitions for Temporal Graph Generation | Code | 0 |
| OpenDataVal: a Unified Benchmark for Data Valuation | Code | 1 |
| Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking | Code | 1 |
| Formal Covariate Benchmarking to Bound Omitted Variable Bias | | 0 |
| MA-BBOB: Many-Affine Combinations of BBOB Functions for Evaluating AutoML Approaches in Noiseless Numerical Black-Box Optimization Contexts | | 0 |
| CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1 |
| Benchmarking Deep Learning Architectures for Urban Vegetation Point Cloud Semantic Segmentation from MLS | | 0 |
| Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization | | 0 |
| Convolutional and Deep Learning based techniques for Time Series Ordinal Classification | | 0 |
| LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning | Code | 1 |
| ALP: Action-Aware Embodied Learning for Perception | | 0 |
| Acoustic Identification of Ae. aegypti Mosquitoes using Smartphone Apps and Residual Convolutional Neural Networks | Code | 0 |
| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1 |
| Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials | Code | 1 |
| DISC: a Dataset for Integrated Sensing and Communication in mmWave Systems | | 0 |
| Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens | | 0 |
| FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1 |
| PaReprop: Fast Parallelized Reversible Backpropagation | Code | 1 |
| DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning | | 0 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Code | 2 |
| Re-Benchmarking Pool-Based Active Learning for Binary Classification | Code | 0 |
| MLonMCU: TinyML Benchmarking with Fast Retargeting | Code | 1 |
| Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1 |
| KoLA: Carefully Benchmarking World Knowledge of Large Language Models | Code | 1 |
| One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support | Code | 0 |
| BED: Bi-Encoder-Based Detectors for Out-of-Distribution Detection | Code | 0 |
| Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion | | 0 |
| Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models | Code | 1 |
Page 64 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |