Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–250 of 5548 papers

Title	Date	Tasks	Status	Hype
SORCE: Small Object Retrieval in Complex Environments	May 30, 2025	BenchmarkingImage Retrieval	CodeCode Available	0
GenSpace: Benchmarking Spatially-Aware Image Generation	May 30, 2025	BenchmarkingImage Generation	—Unverified	0
Segmenting France Across Four Centuries	May 30, 2025	BenchmarkingImage-to-Image Translation	CodeCode Available	0
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents	May 30, 2025	BenchmarkingBlocking	CodeCode Available	2
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals	May 30, 2025	BenchmarkingEarth Observation	—Unverified	0
Benchmarking Foundation Models for Zero-Shot Biometric Tasks	May 30, 2025	AttributeBenchmarking	—Unverified	0
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs	May 30, 2025	Benchmarking	CodeCode Available	0
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization	May 30, 2025	BenchmarkingCryptanalysis	—Unverified	0
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation	May 30, 2025	BenchmarkingMachine Translation	—Unverified	0
Automated Structured Radiology Report Generation	May 30, 2025	Benchmarking	—Unverified	0
ByzFL: Research Framework for Robust Federated Learning	May 30, 2025	BenchmarkingFederated Learning	CodeCode Available	1
Bench4KE: Benchmarking Automated Competency Question Generation	May 30, 2025	BenchmarkingQuestion Generation	CodeCode Available	1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation	May 30, 2025	AllBenchmarking	CodeCode Available	1
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models	May 30, 2025	Benchmarking	—Unverified	0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs	May 29, 2025	BenchmarkingFairness	CodeCode Available	0
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge	May 29, 2025	Benchmarking	—Unverified	0
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking	May 29, 2025	BenchmarkingGraph Question Answering	—Unverified	0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns	May 29, 2025	Benchmarking	—Unverified	0
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation	May 29, 2025	BenchmarkingImage Generation	—Unverified	0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services	May 29, 2025	BenchmarkingInformation Retrieval	CodeCode Available	0
Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems	May 29, 2025	Benchmarking	—Unverified	0
VERINA: Benchmarking Verifiable Code Generation	May 29, 2025	BenchmarkingCode Generation	CodeCode Available	2
LLM Performance for Code Generation on Noisy Tasks	May 29, 2025	BenchmarkingCode Generation	CodeCode Available	0
Toward Memory-Aided World Models: Benchmarking via Spatial Consistency	May 29, 2025	BenchmarkingMinecraft	CodeCode Available	1
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective	May 28, 2025	BenchmarkingMemorization	CodeCode Available	0
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators	May 28, 2025	BenchmarkingChatbot	CodeCode Available	0
Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking	May 28, 2025	Benchmarking	CodeCode Available	1
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval	May 28, 2025	BenchmarkingRecommendation Systems	—Unverified	0
StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis	May 28, 2025	Benchmarking	CodeCode Available	0
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments	May 28, 2025	BenchmarkingRed Teaming	CodeCode Available	1
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates	May 28, 2025	BenchmarkingDiversity	—Unverified	0
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking	May 28, 2025	BenchmarkingText Spotting	CodeCode Available	1
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3	May 28, 2025	BenchmarkingEfficient Exploration	—Unverified	0
PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow	May 28, 2025	Benchmarking	—Unverified	0
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese	May 28, 2025	Benchmarking	CodeCode Available	0
Jailbreak Distillation: Renewable Safety Benchmarking	May 28, 2025	BenchmarkingDiversity	—Unverified	0
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data	May 28, 2025	BenchmarkingDrug Discovery	CodeCode Available	0
TabularQGAN: A Quantum Generative Model for Tabular Data	May 28, 2025	BenchmarkingGenerative Adversarial Network	—Unverified	0
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate	May 28, 2025	Benchmarking	—Unverified	0
SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem	May 28, 2025	Benchmarking	CodeCode Available	1
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering	May 27, 2025	BenchmarkingQuestion Answering	CodeCode Available	0
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes	May 27, 2025	BenchmarkingDenoising	—Unverified	0
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization	May 27, 2025	Benchmarking	CodeCode Available	1
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs	May 27, 2025	BenchmarkingQuestion Selection	CodeCode Available	0
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms	May 27, 2025	Bayesian OptimizationBenchmarking	CodeCode Available	2
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding	May 27, 2025	BenchmarkingChange Detection	—Unverified	0
FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation	May 27, 2025	BenchmarkingDecision Making	CodeCode Available	1
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge	May 27, 2025	BenchmarkingMultiple-choice	—Unverified	0
Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter	May 27, 2025	Benchmarking	CodeCode Available	0
Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning	May 27, 2025	Benchmarking	CodeCode Available	0

Show:10 25 50

← PrevPage 5 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified