SOTAVerified

Benchmarking

Papers

Showing 1031–1040 of 5548 papers

Title	Date	Tasks	Status	Hype
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	Code Available	3
The Karp Dataset	Jan 24, 2025	Benchmarking, Mathematical Reasoning	Unverified	0
Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video	Jan 24, 2025	3D Reconstruction, Benchmarking	Code Available	2
Enhancing Biomedical Relation Extraction with Directionality	Jan 23, 2025	Benchmarking, Document-level Relation Extraction	Code Available	1
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning	Jan 23, 2025	Benchmarking, image-classification	Unverified	0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain	Jan 23, 2025	Benchmarking, Domain Adaptation	Unverified	0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale	Jan 23, 2025	Benchmarking	Unverified	0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF	Jan 22, 2025	Benchmarking, Hallucination	Unverified	0
Leveraging LLMs to Create a Haptic Devices' Recommendation System	Jan 22, 2025	Benchmarking	Unverified	0
Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities	Jan 22, 2025	Benchmarking, Referring Expression	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified