SOTAVerified

Benchmarking

Papers

Showing 2301–2325 of 5548 papers

Title | Status | Hype
Unraveling the Capabilities of Language Models in News Summarization | Code | 0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0
Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research | | 0
Benchmarking Quantum Convolutional Neural Networks for Signal Classification in Simulated Gamma-Ray Burst Detection | | 0
Making Sense of Data in the Wild: Data Analysis Automation at Scale | | 0
Transfer of Knowledge through Reverse Annealing: A Preliminary Analysis of the Benefits and What to Share | | 0
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding | | 0
A Benchmarking Environment for Worker Flexibility in Flexible Job Shop Scheduling Problems | | 0
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation | | 0
Benchmarking Quantum Reinforcement Learning | Code | 0
CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry | | 0
Beyond Benchmarks: On The False Promise of AI Regulation | | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets? | | 0
Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study | | 0
Benchmarking global optimization techniques for unmanned aerial vehicle path planning | | 0
The Karp Dataset | | 0
Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems | | 0
AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | | 0
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale | | 0
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | | 0
CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization | | 0
RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified