Benchmarking

Papers

Showing 2476–2500 of 5548 papers

Title	Date	Tasks	Status	Hype
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data	Feb 22, 2024	Benchmarking	Code Available	0
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning	Feb 22, 2024	Benchmarking	Code Available	1
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning	Feb 21, 2024	Benchmarking, Representation Learning	Code Available	1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment	Feb 21, 2024	Adversarial Robustness, Benchmarking	Code Available	1
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms	Feb 21, 2024	Benchmarking, Hate Speech Detection	Code Available	0
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models	Feb 21, 2024	Benchmarking, Form	Code Available	0
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models	Feb 21, 2024	Benchmarking	Unverified	0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models	Feb 21, 2024	Benchmarking, Image to text	Unverified	0
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers	Feb 20, 2024	Benchmarking	Unverified	0
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning	Feb 20, 2024	Atomic number classification, Benchmarking	Code Available	1
Benchmarking Retrieval-Augmented Generation for Medicine	Feb 20, 2024	Benchmarking, Information Retrieval	Code Available	4
CausalGym: Benchmarking causal interpretability methods on linguistic tasks	Feb 19, 2024	Benchmarking, Interpretability Techniques for Deep Learning	Code Available	2
Synthetic location trajectory generation using categorical diffusion models	Feb 19, 2024	Benchmarking, Decision Making	Code Available	0
FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation	Feb 19, 2024	Benchmarking, Chatbot	Unverified	0
Event-Based Motion Magnification	Feb 19, 2024	Benchmarking, Motion Detection	Code Available	2
Class-incremental Learning for Time Series: Benchmark and Evaluation	Feb 19, 2024	Activity Recognition, Benchmarking	Code Available	2
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies	Feb 19, 2024	Benchmarking	Code Available	0
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark	Feb 18, 2024	Benchmarking	Code Available	2
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation	Feb 18, 2024	Benchmarking, Language Modeling	Code Available	1
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence	Feb 17, 2024	Benchmarking, Form	Code Available	2
VATr++: Choose Your Words Wisely for Handwritten Text Generation	Feb 16, 2024	Benchmarking, Text Generation	Unverified	0
Learning Disentangled Audio Representations through Controlled Synthesis	Feb 16, 2024	Benchmarking, Disentanglement	Unverified	0
Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data	Feb 15, 2024	Benchmarking, Federated Learning	Unverified	0
Large-scale Benchmarking of Metaphor-based Optimization Heuristics	Feb 15, 2024	Benchmarking, Experimental Design	Unverified	0
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse	Feb 15, 2024	Benchmarking, Model Editing	Code Available	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified