SOTAVerified

Benchmarking

Papers

Showing 261–270 of 5548 papers

Title | Status | Hype
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Code | 2
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Code | 2
Benchmarking and Improving Detail Image Caption | Code | 2
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models | Code | 2
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning | Code | 2
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Code | 2
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs | Code | 2
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified