SOTAVerified

Benchmarking

Papers

Showing 1526–1550 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Active Evaluation Acquisition for Efficient LLM Benchmarking | | 0 |
| Manual Verbalizer Enrichment for Few-Shot Text Classification | | 0 |
| Benchmarking of a new data splitting method on volcanic eruption data | | 0 |
| Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems | | 0 |
| Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild | Code | 1 |
| Rule-based Data Selection for Large Language Models | | 0 |
| Precise Model Benchmarking with Only a Few Observations | | 0 |
| MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Code | 2 |
| Named Clinical Entity Recognition Benchmark | Code | 0 |
| TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models | Code | 0 |
| Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1 |
| ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | | 0 |
| dattri: A Library for Efficient Data Attribution | Code | 2 |
| Adjusting Pretrained Backbones for Performativity | Code | 0 |
| Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | | 0 |
| PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms | | 0 |
| Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning | Code | 1 |
| Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels | | 0 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Code | 0 |
| How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension | | 0 |
| ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities | | 0 |
| Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | | 0 |
| Benchmarking the Fidelity and Utility of Synthetic Relational Data | | 0 |
| PersoBench: Benchmarking Personalized Response Generation in Large Language Models | Code | 0 |
| Ward: Provable RAG Dataset Inference via LLM Watermarks | | 0 |
Page 62 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |