SOTAVerified

Benchmarking

Papers

Showing 101-125 of 5548 papers

Title | Status | Hype
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Code | 3
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Code | 3
MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | Code | 3
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3
Are EEG-to-Text Models Working? | Code | 3
ACEGEN: Reinforcement learning of generative chemical agents for drug discovery | Code | 3
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Code | 3
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases | Code | 3
DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection | Code | 3
Advancing LLM Reasoning Generalists with Preference Trees | Code | 3
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework | Code | 3
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection | Code | 3
Recurrent Drafter for Fast Speculative Decoding in Large Language Models | Code | 3
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Code | 3
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Code | 3
Benchmarking LLMs via Uncertainty Quantification | Code | 3
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | Code | 3
SEED-Bench: Benchmarking Multimodal Large Language Models | Code | 3
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One | Code | 3
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion | Code | 3
CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Code | 3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Code | 3
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Code | 3
Page 5 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified