SOTAVerified

Benchmarking

Papers

Showing 311-320 of 5548 papers

Title | Status | Hype
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Code | 2
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Code | 2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified