SOTAVerified

Benchmarking

Papers

Showing 876-900 of 5548 papers

Title | Status | Hype
Clinical Prompt Learning with Frozen Language Models | Code | 1
4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | Code | 1
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
IOHprofiler: A Benchmarking and Profiling Tool for Iterative Optimization Heuristics | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Code | 1
Benchmarking Classical and Learning-Based Multibeam Point Cloud Registration | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified