Benchmarking

Papers

Showing 2001–2025 of 5548 papers

Title	Date	Tasks	Status	Hype
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective	Jun 19, 2024	Benchmarking, Continual Pretraining	Unverified	0
A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations	Jun 19, 2024	Benchmarking	Code Available	2
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration	Jun 19, 2024	Benchmarking, Distractor Generation	Unverified	0
BeHonest: Benchmarking Honesty in Large Language Models	Jun 19, 2024	Benchmarking, Misinformation	Code Available	1
Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN	Jun 19, 2024	Benchmarking, Intrusion Detection	Code Available	0
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models	Jun 19, 2024	Benchmarking, Open-Domain Question Answering	Unverified	0
Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications	Jun 19, 2024	Benchmarking, Machine Reading Comprehension	Unverified	0
M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere	Jun 19, 2024	Benchmarking, Spatio-Temporal Forecasting	Code Available	0
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation	Jun 19, 2024	Benchmarking, Image Generation	Code Available	3
Exploring and Benchmarking the Planning Capabilities of Large Language Models	Jun 18, 2024	Benchmarking, In-Context Learning	Unverified	0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions	Jun 18, 2024	Benchmarking, Multiple-choice	Code Available	0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance	Jun 18, 2024	Benchmarking	Unverified	0
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models	Jun 18, 2024	Benchmarking, Depth Estimation	Code Available	2
TSI-Bench: Benchmarking Time Series Imputation	Jun 18, 2024	Benchmarking, Deep Learning	Code Available	3
WebCanvas: Benchmarking Web Agents in Online Environments	Jun 18, 2024	AI Agent, Benchmarking	Code Available	3
Automatic benchmarking of large multimodal models via iterative experiment programming	Jun 18, 2024	Benchmarking, Language Modeling	Code Available	0
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	Jun 18, 2024	Benchmarking, World Knowledge	Code Available	0
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts	Jun 18, 2024	Articles, Benchmarking	Unverified	0
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI	Jun 18, 2024	Benchmarking, scientific discovery	Code Available	2
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models	Jun 17, 2024	Benchmarking, counterfactual	Unverified	0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States	Jun 17, 2024	Benchmarking, Contrastive Learning	Unverified	0
Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading	Jun 17, 2024	Autonomous Vehicles, Benchmarking	Unverified	0
Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking	Jun 17, 2024	Benchmarking, Demand Forecasting	Code Available	1
A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models	Jun 17, 2024	Benchmarking, Survey	Unverified	0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified