SOTAVerified

Benchmarking

Papers

Showing 2451–2475 of 5548 papers

Title | Status | Hype
The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition | | 0
Efficient Lifelong Model Evaluation in an Era of Rapid Progress | Code | 1
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks | Code | 2
FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking | Code | 0
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Code | 0
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection | | 0
Beacon, a lightweight deep reinforcement learning benchmark library for flow control | Code | 1
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies | | 0
Benchmarking Data Science Agents | Code | 1
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns | Code | 0
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images | | 0
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Partial Rankings of Optimizers | Code | 0
Benchmarking LLMs on the Semantic Overlap Summarization Task | | 0
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | | 0
Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems | | 0
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Code | 0
PST-Bench: Tracing and Benchmarking the Source of Publications | Code | 1
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | | 0
E(3)-equivariant models cannot learn chirality: Field-based molecular generation | | 0
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | | 0
Benchmarking Observational Studies with Experimental Data under Right-Censoring | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified