SOTAVerified

Benchmarking

Papers

Showing 2001–2025 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | | 0 |
| A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations | Code | 2 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | | 0 |
| BeHonest: Benchmarking Honesty in Large Language Models | Code | 1 |
| Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN | Code | 0 |
| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | | 0 |
| Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | | 0 |
| M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Code | 0 |
| GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | Code | 3 |
| Exploring and Benchmarking the Planning Capabilities of Large Language Models | | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0 |
| Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | | 0 |
| GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2 |
| TSI-Bench: Benchmarking Time Series Imputation | Code | 3 |
| WebCanvas: Benchmarking Web Agents in Online Environments | Code | 3 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0 |
| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Code | 0 |
| MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | | 0 |
| OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Code | 2 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | | 0 |
| Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Code | 1 |
| A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0 |
Page 81 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |