SOTAVerified

Benchmarking

Papers

Showing 1031-1040 of 5548 papers

Title | Status | Hype
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Code | 3
The Karp Dataset | - | 0
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Code | 2
Enhancing Biomedical Relation Extraction with Directionality | Code | 1
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | - | 0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | - | 0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale | - | 0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | - | 0
Leveraging LLMs to Create a Haptic Devices' Recommendation System | - | 0
Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified