Benchmarking

Papers

Showing 1651–1675 of 5548 papers

Title	Date	Tasks	Status
Benchmarking Neural Speech Codec Intelligibility with SITool	Jun 2, 2025	Benchmarking, Diagnostic	Unverified
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices	Jun 2, 2025	Benchmarking	Unverified
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists	Jun 2, 2025	Benchmarking, Form	Unverified
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness	Jun 1, 2025	Benchmarking, Management	Code Available
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book	Jun 1, 2025	Benchmarking	Code Available
ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models	Jun 1, 2025	Benchmarking, Relational Reasoning	Unverified
The iNaturalist Sounds Dataset	May 31, 2025	Benchmarking	Unverified
Benchmarking Foundation Models for Zero-Shot Biometric Tasks	May 30, 2025	Attribute, Benchmarking	Unverified
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals	May 30, 2025	Benchmarking, Earth Observation	Unverified
GenSpace: Benchmarking Spatially-Aware Image Generation	May 30, 2025	Benchmarking, Image Generation	Unverified
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation	May 30, 2025	Benchmarking, Machine Translation	Unverified
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs	May 30, 2025	Benchmarking	Code Available
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents	May 30, 2025	Benchmarking, Code Repair	Unverified
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework	May 30, 2025	Benchmarking	Code Available
SORCE: Small Object Retrieval in Complex Environments	May 30, 2025	Benchmarking, Image Retrieval	Code Available
Segmenting France Across Four Centuries	May 30, 2025	Benchmarking, Image-to-Image Translation	Code Available
Automated Structured Radiology Report Generation	May 30, 2025	Benchmarking	Unverified
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset	May 30, 2025	Benchmarking, Multiple Instance Learning	Code Available
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models	May 30, 2025	Benchmarking	Unverified
Progressive Class-level Distillation	May 30, 2025	Benchmarking, Knowledge Distillation	Unverified
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization	May 30, 2025	Benchmarking, Cryptanalysis	Unverified
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs	May 29, 2025	Benchmarking, Fairness	Code Available
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns	May 29, 2025	Benchmarking	Unverified
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services	May 29, 2025	Benchmarking, Information Retrieval	Code Available
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation	May 29, 2025	Benchmarking, Image Generation	Unverified


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified