SOTAVerified

Benchmarking

Papers

Showing 951-1000 of 5548 papers

Title | Status | Hype
IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation | Code | 1
A Comprehensive Overview of Large Language Models | Code | 1
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1
Benchmarking Algorithms for Federated Domain Generalization | Code | 1
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection | Code | 1
SCENEREPLICA: Benchmarking Real-World Robot Manipulation by Creating Replicable Scenes | Code | 1
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1
Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features | Code | 1
Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase | Code | 1
GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection | Code | 1
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution | Code | 1
IMP-MARL: a Suite of Environments for Large-scale Infrastructure Management Planning via MARL | Code | 1
Geometric Deep Learning for Structure-Based Drug Design: A Survey | Code | 1
causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery | Code | 1
Beyond Normal: On the Evaluation of Mutual Information Estimators | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking | Code | 1
OpenDataVal: a Unified Benchmark for Data Valuation | Code | 1
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning | Code | 1
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1
MLonMCU: TinyML Benchmarking with Fast Retargeting | Code | 1
Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials | Code | 1
PaReprop: Fast Parallelized Reversible Backpropagation | Code | 1
KoLA: Carefully Benchmarking World Knowledge of Large Language Models | Code | 1
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models | Code | 1
AQuA: A Benchmarking Tool for Label Quality Assessment | Code | 1
NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics | Code | 1
Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling | Code | 1
TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain | Code | 1
Spatially Resolved Gene Expression Prediction from H&E Histology Images via Bi-modal Contrastive Learning | Code | 1
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models | Code | 1
Multilingual Conceptual Coverage in Text-to-Image Models | Code | 1
Improving and Benchmarking Offline Reinforcement Learning Algorithms | Code | 1
End-to-end Knowledge Retrieval with Multi-modal Queries | Code | 1
Accurate and Efficient Structural Ensemble Generation of Macrocyclic Peptides using Internal Coordinate Diffusion | Code | 1
IDToolkit: A Toolkit for Benchmarking and Developing Inverse Design Algorithms in Nanophotonics | Code | 1
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models | Code | 1
Decoding the Underlying Meaning of Multimodal Hateful Memes | Code | 1
Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks | Code | 1
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | Code | 1
ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment | Code | 1
Exploring Large Language Models for Classical Philology | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified