SOTAVerified

Benchmarking

Papers

Showing 176-200 of 5548 papers

Title | Status | Hype
AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering |  | 0
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes | Code | 0
Tactile MNIST: Benchmarking Active Tactile Perception |  | 0
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models |  | 0
SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation |  | 0
NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Code | 1
Rethinking Machine Unlearning in Image Generation Models | Code | 1
FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents |  | 0
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Code | 0
TIIF-Bench: How Does Your T2I Model Follow Your Instructions? |  | 0
Benchmarking Neural Speech Codec Intelligibility with SITool |  | 0
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code |  | 0
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists |  | 0
GSCodec Studio: A Modular Framework for Gaussian Splat Compression | Code | 2
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices |  | 0
ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models |  | 0
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Code | 0
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Code | 0
The iNaturalist Sounds Dataset |  | 0
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | Code | 1
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents |  | 0
PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | Code | 0
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | Code | 0
GenSpace: Benchmarking Spatially-Aware Image Generation |  | 0
Page 8 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified