SOTAVerified

Benchmarking

Papers

Showing 1226–1250 of 5548 papers

Title | Status | Hype
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
HINT3: Raising the bar for Intent Detection in the Wild | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
Hierarchical graph neural nets can capture long-range interactions | Code | 1
Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection | Code | 1
A Reinforcement Learning Environment for Multi-Service UAV-enabled Wireless Systems | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
Benchmarking Language Models for Code Syntax Understanding | Code | 1
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1
HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Code | 1
HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Code | 1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Code | 1
Page 50 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified