Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 2451–2500 of 5548 papers

Title	Date	Tasks	Status	Hype
Efficient Lifelong Model Evaluation in an Era of Rapid Progress	Feb 29, 2024	BenchmarkingGPU	CodeCode Available	1
The 6th Affective Behavior Analysis in-the-wild (ABAW) Competition	Feb 29, 2024	Action Unit DetectionArousal Estimation	—Unverified	0
Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks	Feb 29, 2024	BenchmarkingDisentanglement	CodeCode Available	2
FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking	Feb 28, 2024	BenchmarkingInductive Learning	CodeCode Available	0
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions	Feb 28, 2024	BenchmarkingMultiple-choice	CodeCode Available	1
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models	Feb 28, 2024	BenchmarkingHallucination	CodeCode Available	0
The Seeker's Dilemma: Realistic Formulation and Benchmarking for Hardware Trojan Detection	Feb 27, 2024	Benchmarking	—Unverified	0
Beacon, a lightweight deep reinforcement learning benchmark library for flow control	Feb 27, 2024	BenchmarkingCPU	CodeCode Available	1
Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies	Feb 27, 2024	BenchmarkingSystematic Generalization	—Unverified	0
The KANDY Benchmark: Incremental Neuro-Symbolic Learning and Reasoning with Kandinsky Patterns	Feb 27, 2024	BenchmarkingBinary Classification	CodeCode Available	0
Benchmarking Data Science Agents	Feb 27, 2024	BenchmarkingCode Generation	CodeCode Available	1
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images	Feb 27, 2024	BenchmarkingDefect Detection	—Unverified	0
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data	Feb 27, 2024	Benchmarking	CodeCode Available	1
Benchmarking LLMs on the Semantic Overlap Summarization Task	Feb 26, 2024	BenchmarkingDocument Summarization	—Unverified	0
Partial Rankings of Optimizers	Feb 26, 2024	Benchmarking	CodeCode Available	0
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset	Feb 26, 2024	BenchmarkingCross-Lingual Transfer	—Unverified	0
Performance Comparison of Surrogate-Assisted Evolutionary Algorithms on Computational Fluid Dynamics Problems	Feb 26, 2024	BenchmarkingEvolutionary Algorithms	—Unverified	0
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs	Feb 25, 2024	BenchmarkingChatbot	CodeCode Available	0
PST-Bench: Tracing and Benchmarking the Source of Publications	Feb 25, 2024	Benchmarking	CodeCode Available	1
E(3)-equivariant models cannot learn chirality: Field-based molecular generation	Feb 24, 2024	BenchmarkingGraph Neural Network	—Unverified	0
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs	Feb 24, 2024	BenchmarkingKnowledge Graphs	—Unverified	0
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs	Feb 23, 2024	Benchmarkingslot-filling	CodeCode Available	1
ToMBench: Benchmarking Theory of Mind in Large Language Models	Feb 23, 2024	BenchmarkingMultiple-choice	CodeCode Available	2
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving	Feb 23, 2024	BenchmarkingDecision Making	—Unverified	0
Benchmarking Observational Studies with Experimental Data under Right-Censoring	Feb 23, 2024	Benchmarking	—Unverified	0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data	Feb 22, 2024	Benchmarking	CodeCode Available	0
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning	Feb 22, 2024	Benchmarking	CodeCode Available	1
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning	Feb 21, 2024	BenchmarkingRepresentation Learning	CodeCode Available	1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment	Feb 21, 2024	Adversarial RobustnessBenchmarking	CodeCode Available	1
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms	Feb 21, 2024	BenchmarkingHate Speech Detection	CodeCode Available	0
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models	Feb 21, 2024	BenchmarkingForm	CodeCode Available	0
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models	Feb 21, 2024	Benchmarking	—Unverified	0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models	Feb 21, 2024	BenchmarkingImage to text	—Unverified	0
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers	Feb 20, 2024	Benchmarking	—Unverified	0
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning	Feb 20, 2024	Atomic number classificationBenchmarking	CodeCode Available	1
Benchmarking Retrieval-Augmented Generation for Medicine	Feb 20, 2024	BenchmarkingInformation Retrieval	CodeCode Available	4
CausalGym: Benchmarking causal interpretability methods on linguistic tasks	Feb 19, 2024	BenchmarkingInterpretability Techniques for Deep Learning	CodeCode Available	2
Synthetic location trajectory generation using categorical diffusion models	Feb 19, 2024	BenchmarkingDecision Making	CodeCode Available	0
FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation	Feb 19, 2024	BenchmarkingChatbot	—Unverified	0
Event-Based Motion Magnification	Feb 19, 2024	BenchmarkingMotion Detection	CodeCode Available	2
Class-incremental Learning for Time Series: Benchmark and Evaluation	Feb 19, 2024	Activity RecognitionBenchmarking	CodeCode Available	2
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies	Feb 19, 2024	Benchmarking	CodeCode Available	0
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark	Feb 18, 2024	Benchmarking	CodeCode Available	2
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation	Feb 18, 2024	BenchmarkingLanguage Modeling	CodeCode Available	1
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence	Feb 17, 2024	BenchmarkingForm	CodeCode Available	2
VATr++: Choose Your Words Wisely for Handwritten Text Generation	Feb 16, 2024	BenchmarkingText Generation	—Unverified	0
Learning Disentangled Audio Representations through Controlled Synthesis	Feb 16, 2024	BenchmarkingDisentanglement	—Unverified	0
Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data	Feb 15, 2024	BenchmarkingFederated Learning	—Unverified	0
Large-scale Benchmarking of Metaphor-based Optimization Heuristics	Feb 15, 2024	BenchmarkingExperimental Design	—Unverified	0
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse	Feb 15, 2024	BenchmarkingModel Editing	CodeCode Available	0

Show:10 25 50

← PrevPage 50 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified