SOTAVerified

Benchmarking

Papers

Showing 3001–3050 of 5548 papers

Title | Status | Hype
PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | | 0
CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization | | 0
GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets | | 0
Position: Benchmarking is Limited in Reinforcement Learning Research | | 0
CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | | 0
MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Code | 0
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | | 0
Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Code | 0
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | | 0
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | | 0
Beyond Optimism: Exploration With Partially Observable Rewards | Code | 0
FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Code | 0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | | 0
DASB -- Discrete Audio and Speech Benchmark | | 0
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Code | 0
Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | | 0
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | | 0
Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | | 0
QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Code | 0
The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Code | 0
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | | 0
Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN | Code | 0
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | | 0
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | | 0
Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | | 0
M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Code | 0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | | 0
Exploring and Benchmarking the Planning Capabilities of Large Language Models | | 0
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | | 0
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Code | 0
Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
The Liouville Generator for Producing Integrable Expressions | | 0
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | | 0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | | 0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0
Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | | 0
Benchmarking of LLM Detection: Comparing Two Competing Approaches | | 0
Standardizing Structural Causal Models | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | | 0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Code | 0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | | 0
Evaluating the Performance of Large Language Models via Debates | | 0
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | | 0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | | 0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
Page 61 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified