Benchmarking

Papers

Showing 1676–1700 of 5548 papers

Title	Date	Tasks	Status
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services	May 29, 2025	Benchmarking, Information Retrieval	Code Available
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns	May 29, 2025	Benchmarking	Unverified
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems	May 29, 2025	Benchmarking	Unverified
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking	May 29, 2025	Benchmarking, Graph Question Answering	Unverified
PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow	May 28, 2025	Benchmarking	Unverified
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese	May 28, 2025	Benchmarking	Code Available
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate	May 28, 2025	Benchmarking	Unverified
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3	May 28, 2025	Benchmarking, Efficient Exploration	Unverified
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates	May 28, 2025	Benchmarking, Diversity	Unverified
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data	May 28, 2025	Benchmarking, Drug Discovery	Code Available
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective	May 28, 2025	Benchmarking, Memorization	Code Available
TabularQGAN: A Quantum Generative Model for Tabular Data	May 28, 2025	Benchmarking, Generative Adversarial Network	Unverified
Jailbreak Distillation: Renewable Safety Benchmarking	May 28, 2025	Benchmarking, Diversity	Unverified
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis	May 28, 2025	Benchmarking	Code Available
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators	May 28, 2025	Benchmarking, Chatbot	Code Available
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval	May 28, 2025	Benchmarking, Recommendation Systems	Unverified
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning	May 27, 2025	Benchmarking	Code Available
Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter	May 27, 2025	Benchmarking	Code Available
VideoMarkBench: Benchmarking Robustness of Video Watermarking	May 27, 2025	Benchmarking	Code Available
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge	May 27, 2025	Benchmarking, Multiple-choice	Unverified
Gauss-Ramanujan Functions: Constructions, Properties, and Applications in Communications and Signal Processing	May 27, 2025	Benchmarking	Unverified
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes	May 27, 2025	Benchmarking, Denoising	Unverified
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs	May 27, 2025	Benchmarking, Question Selection	Code Available
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding	May 27, 2025	Benchmarking, Change Detection	Unverified
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering	May 27, 2025	Benchmarking, Question Answering	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified