SOTAVerified

Benchmarking

Papers

Showing 176–200 of 5548 papers

Title | Status | Hype
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Code | 2
SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Code | 2
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Code | 2
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Code | 2
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Code | 2
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Code | 2
nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark | Code | 2
An OpenMind for 3D medical vision self-supervised learning | Code | 2
XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
Open Universal Arabic ASR Leaderboard | Code | 2
NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
Video Quality Assessment: A Comprehensive Survey | Code | 2
Commit0: Library Generation from Scratch | Code | 2
OpenQDC: Open Quantum Data Commons | Code | 2
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Code | 2
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | Code | 2
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | Code | 2
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
PC-Gym: Benchmark Environments For Process Control Problems | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified