Benchmarking

Papers

Showing 976–1000 of 5548 papers

Title	Date	Tasks	Status	Hype
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models	Feb 6, 2025	Benchmarking, Emotional Intelligence	Unverified	0
Verifiable Format Control for Large Language Model Generations	Feb 6, 2025	Benchmarking, Instruction Following	Unverified	0
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization	Feb 6, 2025	Benchmarking, Uncertainty Quantification	Unverified	0
LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset	Feb 6, 2025	Benchmarking, Computed Tomography (CT)	Unverified	0
Large Language Models for Multi-Robot Systems: A Survey	Feb 6, 2025	Action Generation, Benchmarking	Code Available	1
SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning	Feb 6, 2025	Benchmarking, Data Poisoning	Code Available	2
Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples	Feb 6, 2025	Benchmarking, DeepFake Detection	Code Available	0
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data	Feb 6, 2025	Benchmarking, Time Series	Code Available	0
Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications	Feb 5, 2025	Benchmarking, Feature Engineering	Unverified	0
TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics	Feb 5, 2025	Benchmarking, Link Prediction	Code Available	0
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf	Feb 5, 2025	Benchmarking, Scheduling	Unverified	0
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation	Feb 5, 2025	Benchmarking, Large Language Model	Code Available	2
Optimal PMU Placement for Kalman Filtering of DAE Power System Models	Feb 5, 2025	Benchmarking, State Estimation	Unverified	0
Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials	Feb 5, 2025	Benchmarking	Unverified	0
PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design	Feb 5, 2025	Benchmarking, Prompt Engineering	Code Available	1
xai_evals: A Framework for Evaluating Post-Hoc Local Explanation Methods	Feb 5, 2025	Benchmarking	Unverified	0
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation	Feb 4, 2025	Benchmarking, Classification	Unverified	0
Dynamic benchmarking framework for LLM-based conversational data capture	Feb 4, 2025	Benchmarking	Unverified	0
Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation	Feb 4, 2025	Benchmarking, Information Retrieval	Code Available	4
Evalita-LLM: Benchmarking Large Language Models on Italian	Feb 4, 2025	Benchmarking, Multiple-choice	Unverified	0
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models	Feb 4, 2025	Benchmarking, Decision Making	Unverified	0
A comparison of translation performance between DeepL and Supertext	Feb 4, 2025	Benchmarking, Machine Translation	Code Available	0
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets	Feb 4, 2025	All, Benchmarking	Code Available	0
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities	Feb 3, 2025	Benchmarking, Large Language Model	Unverified	0
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation	Feb 3, 2025	Benchmarking, Fairness	Unverified	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified