SOTAVerified

Benchmarking

Papers

Showing 476–500 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Code | 1 |
| Evolutionary Generation of Random Surreal Numbers for Benchmarking | Code | 1 |
| An Empirical Study of GPT-4o Image Generation Capabilities | Code | 1 |
| V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Code | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1 |
| A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Code | 1 |
| SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers | Code | 1 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Code | 1 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1 |
| A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1 |
| NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios | Code | 1 |
| The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Code | 1 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Code | 1 |
| V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Code | 1 |
| QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs | Code | 1 |
| The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination | Code | 1 |
| JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System | Code | 1 |
| Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Code | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1 |
| GNNs as Predictors of Agentic Workflow Performances | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |