Benchmarking

Papers


Showing 2451–2475 of 5548 papers

Title	Date	Tasks	Status	Hype
The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition	Feb 29, 2024	Action Unit Detection, Arousal Estimation	Unverified	0
Efficient Lifelong Model Evaluation in an Era of Rapid Progress	Feb 29, 2024	Benchmarking, GPU	Code Available	1
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks	Feb 29, 2024	Benchmarking, Disentanglement	Code Available	2
FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking	Feb 28, 2024	Benchmarking, Inductive Learning	Code Available	0
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions	Feb 28, 2024	Benchmarking, Multiple-choice	Code Available	1
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models	Feb 28, 2024	Benchmarking, Hallucination	Code Available	0
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection	Feb 27, 2024	Benchmarking	Unverified	0
Beacon, a lightweight deep reinforcement learning benchmark library for flow control	Feb 27, 2024	Benchmarking, CPU	Code Available	1
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies	Feb 27, 2024	Benchmarking, Systematic Generalization	Unverified	0
Benchmarking Data Science Agents	Feb 27, 2024	Benchmarking, Code Generation	Code Available	1
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns	Feb 27, 2024	Benchmarking, Binary Classification	Code Available	0
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images	Feb 27, 2024	Benchmarking, Defect Detection	Unverified	0
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data	Feb 27, 2024	Benchmarking	Code Available	1
Partial Rankings of Optimizers	Feb 26, 2024	Benchmarking	Code Available	0
Benchmarking LLMs on the Semantic Overlap Summarization Task	Feb 26, 2024	Benchmarking, Document Summarization	Unverified	0
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset	Feb 26, 2024	Benchmarking, Cross-Lingual Transfer	Unverified	0
Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems	Feb 26, 2024	Benchmarking, Evolutionary Algorithms	Unverified	0
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs	Feb 25, 2024	Benchmarking, Chatbot	Code Available	0
PST-Bench: Tracing and Benchmarking the Source of Publications	Feb 25, 2024	Benchmarking	Code Available	1
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs	Feb 24, 2024	Benchmarking, Knowledge Graphs	Unverified	0
E(3)-equivariant models cannot learn chirality: Field-based molecular generation	Feb 24, 2024	Benchmarking, Graph Neural Network	Unverified	0
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs	Feb 23, 2024	Benchmarking, slot-filling	Code Available	1
ToMBench: Benchmarking Theory of Mind in Large Language Models	Feb 23, 2024	Benchmarking, Multiple-choice	Code Available	2
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving	Feb 23, 2024	Benchmarking, Decision Making	Unverified	0
Benchmarking Observational Studies with Experimental Data under Right-Censoring	Feb 23, 2024	Benchmarking	Unverified	0



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified