SOTAVerified

Benchmarking

Papers

Showing 3001-3025 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | | 0 |
| CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization | | 0 |
| GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets | | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | | 0 |
| CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | | 0 |
| MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Code | 0 |
| Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | | 0 |
| Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Code | 0 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | | 0 |
| Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | | 0 |
| Beyond Optimism: Exploration With Partially Observable Rewards | Code | 0 |
| FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Code | 0 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | | 0 |
| DASB -- Discrete Audio and Speech Benchmark | | 0 |
| Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Code | 0 |
| Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | | 0 |
| Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | | 0 |
| Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | | 0 |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Code | 0 |
| The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Code | 0 |
| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | | 0 |
| Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN | Code | 0 |
| Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | | 0 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |