SOTAVerified

Benchmarking

Papers

Showing 491–500 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery	Mar 24, 2025	Benchmarking, Humanitarian	Code Available	1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness	Mar 24, 2025	Benchmarking, Semantic Segmentation	Code Available	1
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks	Mar 23, 2025	Benchmarking, Hallucination	Code Available	1
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction	Mar 22, 2025	Benchmarking, Video Understanding	Code Available	1
QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs	Mar 20, 2025	Benchmarking, Physics-informed machine learning	Code Available	1
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination	Mar 20, 2025	Benchmarking, Large Language Model	Code Available	1
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System	Mar 18, 2025	Benchmarking, In-Context Learning	Code Available	1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos	Mar 17, 2025	Benchmarking, Question Answering	Code Available	1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research	Mar 17, 2025	Articles, Benchmarking	Code Available	1
GNNs as Predictors of Agentic Workflow Performances	Mar 14, 2025	Benchmarking, Position	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified