SOTAVerified

Benchmarking

Papers

Showing 1101–1125 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1 |
| New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1 |
| 2.5D Visual Relationship Detection | Code | 1 |
| Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1 |
| Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1 |
| A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1 |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1 |
| Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1 |
| Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1 |
| Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1 |
| Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1 |
| CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Code | 1 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1 |
| CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |