SOTAVerified

Benchmarking

Papers

Showing 2476–2500 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Code | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Code | 0 |
| PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Code | 0 |
| CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models | | 0 |
| A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models | | 0 |
| KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers | | 0 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Code | 4 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2 |
| Synthetic location trajectory generation using categorical diffusion models | Code | 0 |
| FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation | | 0 |
| Event-Based Motion Magnification | Code | 2 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2 |
| AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies | Code | 0 |
| Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Code | 2 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1 |
| PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2 |
| VATr++: Choose Your Words Wisely for Handwritten Text Generation | | 0 |
| Learning Disentangled Audio Representations through Controlled Synthesis | | 0 |
| Benchmarking federated strategies in Peer-to-Peer Federated learning for biomedical data | | 0 |
| Large-scale Benchmarking of Metaphor-based Optimization Heuristics | | 0 |
| The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |