SOTAVerified

Benchmarking

Papers

Showing 1101–1125 of 5548 papers

Title | Status | Hype
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1
2.5D Visual Relationship Detection | Code | 1
Benchmarking Robustness to Adversarial Image Obfuscations | Code | 1
Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1
Benchmarking Robustness of Machine Reading Comprehension Models | Code | 1
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code | Code | 1
Benchmarking saliency methods for chest X-ray interpretation | Code | 1
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Code | 1
Benchmarking Simulation-Based Inference | Code | 1
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
Grad DFT: a software library for machine learning enhanced density functional theory | Code | 1
Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains | Code | 1
Page 45 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified