SOTAVerified

Benchmarking

Papers

Showing 1651–1675 of 5548 papers

Title | Status | Hype
Benchmarking Neural Speech Codec Intelligibility with SITool | - | 0
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices | - | 0
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | - | 0
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Code | 0
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Code | 0
ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models | - | 0
The iNaturalist Sounds Dataset | - | 0
Benchmarking Foundation Models for Zero-Shot Biometric Tasks | - | 0
Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | - | 0
GenSpace: Benchmarking Spatially-Aware Image Generation | - | 0
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | - | 0
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | Code | 0
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | - | 0
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | Code | 0
SORCE: Small Object Retrieval in Complex Environments | Code | 0
Segmenting France Across Four Centuries | Code | 0
Automated Structured Radiology Report Generation | - | 0
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | Code | 0
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | - | 0
Progressive Class-level Distillation | - | 0
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | - | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | - | 0
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | - | 0
Page 67 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified