SOTAVerified

Benchmarking

Papers

Showing 2826-2850 of 5548 papers

Title | Status | Hype
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study | Code | 0
Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture | | 0
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | | 0
Revisiting Safe Exploration in Safe Reinforcement learning | | 0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | | 0
A practical generalization metric for deep networks benchmarking | | 0
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | | 0
Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | | 0
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Code | 0
Understanding the User: An Intent-Based Ranking Dataset | | 0
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | | 0
Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Code | 0
Benchmarking foundation models as feature extractors for weakly-supervised computational pathology | | 0
Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | | 0
VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Code | 0
Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts | | 0
Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation | | 0
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | | 0
Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper | | 0
BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | | 0
FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Code | 0
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences | | 0
Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study | | 0
Comparative Analysis: Violence Recognition from Videos using Transfer Learning | Code | 0
DHP Benchmark: Are LLMs Good NLG Evaluators? | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified