SOTAVerified

Benchmarking

Papers

Showing 3051-3075 of 5548 papers

Title | Status | Hype
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | | 0
GANmut: Generating and Modifying Facial Expressions | | 0
Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | | 0
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models | Code | 0
Beyond Slow Signs in High-fidelity Model Extraction | Code | 0
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures | Code | 0
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading | Code | 0
On the Evaluation of Speech Foundation Models for Spoken Language Understanding | | 0
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | | 0
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework | | 0
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0
CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking | | 0
ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents | | 0
ECBD: Evidence-Centered Benchmark Design for NLP | Code | 0
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | | 0
Decoding the Diversity: A Review of the Indic AI Research Landscape | | 0
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition | | 0
A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions | | 0
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents | | 0
How well it works: Benchmarking performance of GPT models on medical natural language processing tasks | | 0
It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives | | 0
Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations | Code | 0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | | 0
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases | | 0
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection | | 0
Page 123 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified