SOTAVerified

Benchmarking

Papers

Showing 401-425 of 5548 papers

Title | Status | Hype
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents | | 0
Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | | 0
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | Code | 1
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking | Code | 3
Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | | 0
VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | Code | 0
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | Code | 0
CleanPatrick: A Benchmark for Image Data Cleaning | Code | 0
Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark | | 0
MatTools: Benchmarking Large Language Models for Materials Science Tools | Code | 1
Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets | | 0
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | Code | 0
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection | Code | 1
Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | | 0
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally | Code | 1
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization | | 0
GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics | Code | 0
On the Evaluation of Engineering Artificial General Intelligence | | 0
Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests | Code | 1
DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs | | 0
JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation | | 0
Real-World fNIRS-Based Brain-Computer Interfaces: Benchmarking Deep Learning and Classical Models in Interactive Gaming | | 0
PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language | Code | 0
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified