SOTAVerified

Benchmarking

Papers

Showing 1051-1075 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1 |
| Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1 |
| DocuMint: Docstring Generation for Python using Small Language Models | Code | 1 |
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1 |
| 3D Common Corruptions and Data Augmentation | Code | 1 |
| CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1 |
| AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1 |
| CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1 |
| New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1 |
| Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1 |
| Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks | Code | 1 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Code | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1 |
| Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1 |
| CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1 |
| Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |