SOTAVerified

Benchmarking

Papers

Showing 3026-3050 of 5548 papers

Title | Status | Hype
Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | - | 0
M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Code | 0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | - | 0
Exploring and Benchmarking the Planning Capabilities of Large Language Models | - | 0
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | - | 0
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Code | 0
Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
The Liouville Generator for Producing Integrable Expressions | - | 0
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | - | 0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | - | 0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0
Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | - | 0
Benchmarking of LLM Detection: Comparing Two Competing Approaches | - | 0
Standardizing Structural Causal Models | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | - | 0
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Code | 0
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | - | 0
Evaluating the Performance of Large Language Models via Debates | - | 0
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | - | 0
Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Code | 0
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | - | 0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Code | 0
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified