Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 476–500 of 5548 papers

Title	Date	Tasks	Status	Hype
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs	Apr 11, 2025	BenchmarkingImage Generation	CodeCode Available	1
Evolutionary Generation of Random Surreal Numbers for Benchmarking	Apr 9, 2025	Benchmarking	CodeCode Available	1
An Empirical Study of GPT-4o Image Generation Capabilities	Apr 8, 2025	BenchmarkingImage Generation	CodeCode Available	1
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models	Apr 8, 2025	BenchmarkingVisual Reasoning	CodeCode Available	1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization	Apr 6, 2025	BenchmarkingCombinatorial Optimization	CodeCode Available	1
A Survey of Pathology Foundation Model: Progress and Future Directions	Apr 5, 2025	BenchmarkingMultiple Instance Learning	CodeCode Available	1
Generative Evaluation of Complex Reasoning in Large Language Models	Apr 3, 2025	BenchmarkingMemorization	CodeCode Available	1
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing	Apr 2, 2025	3D ReconstructionBenchmarking	CodeCode Available	1
SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers	Mar 31, 2025	Benchmarking	CodeCode Available	1
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos	Mar 28, 2025	BenchmarkingQuestion Answering	CodeCode Available	1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs	Mar 27, 2025	AttributeBenchmarking	CodeCode Available	1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling	Mar 27, 2025	BenchmarkingDeep Learning	CodeCode Available	1
NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios	Mar 25, 2025	BenchmarkingOffline RL	CodeCode Available	1
The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs	Mar 25, 2025	BenchmarkingScene Segmentation	CodeCode Available	1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models	Mar 25, 2025	BenchmarkingImage Captioning	CodeCode Available	1
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery	Mar 24, 2025	BenchmarkingHumanitarian	CodeCode Available	1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness	Mar 24, 2025	BenchmarkingSemantic Segmentation	CodeCode Available	1
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks	Mar 23, 2025	BenchmarkingHallucination	CodeCode Available	1
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction	Mar 22, 2025	BenchmarkingVideo Understanding	CodeCode Available	1
QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs	Mar 20, 2025	BenchmarkingPhysics-informed machine learning	CodeCode Available	1
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination	Mar 20, 2025	BenchmarkingLarge Language Model	CodeCode Available	1
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System	Mar 18, 2025	BenchmarkingIn-Context Learning	CodeCode Available	1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos	Mar 17, 2025	BenchmarkingQuestion Answering	CodeCode Available	1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research	Mar 17, 2025	ArticlesBenchmarking	CodeCode Available	1
GNNs as Predictors of Agentic Workflow Performances	Mar 14, 2025	BenchmarkingPosition	CodeCode Available	1

Show:10 25 50

← PrevPage 20 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified