SOTAVerified

Benchmarking

Papers

Showing 3051-3100 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | | 0 |
| GANmut: Generating and Modifying Facial Expressions | | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | | 0 |
| Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models | Code | 0 |
| Beyond Slow Signs in High-fidelity Model Extraction | Code | 0 |
| ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures | Code | 0 |
| SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading | Code | 0 |
| On the Evaluation of Speech Foundation Models for Spoken Language Understanding | | 0 |
| Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | | 0 |
| Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework | | 0 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Code | 0 |
| CubeSat-Enabled Free-Space Optics: Joint Data Communication and Fine Beam Tracking | | 0 |
| ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents | | 0 |
| ECBD: Evidence-Centered Benchmark Design for NLP | Code | 0 |
| LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | | 0 |
| Decoding the Diversity: A Review of the Indic AI Research Landscape | | 0 |
| Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition | | 0 |
| A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions | | 0 |
| MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents | | 0 |
| How well it works: Benchmarking performance of GPT models on medical natural language processing tasks | | 0 |
| It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives | | 0 |
| Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations | Code | 0 |
| ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | | 0 |
| MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases | | 0 |
| A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection | | 0 |
| Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing | | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Code | 0 |
| DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition | | 0 |
| Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images | | 0 |
| MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | | 0 |
| INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0 |
| Can Language Models Serve as Text-Based World Simulators? | | 0 |
| Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking | | 0 |
| Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Code | 0 |
| JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0 |
| Data-driven Power Flow Linearization: Simulation | | 0 |
| Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications | | 0 |
| 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | | 0 |
| GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models | | 0 |
| Deep Jansen-Rit Parameter Inference for Model-Driven Analysis of Brain Activity | Code | 0 |
| Scenarios and Approaches for Situated Natural Language Explanations | | 0 |
| Behavior Structformer: Learning Players Representations with Structured Tokenization | | 0 |
| VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric | Code | 0 |
| Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation | | 0 |
| Better Late Than Never: Formulating and Benchmarking Recommendation Editing | Code | 0 |
| Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | | 0 |
| Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As | | 0 |
| NATURAL PLAN: Benchmarking LLMs on Natural Language Planning | | 0 |
| BEADs: Bias Evaluation Across Domains | | 0 |
| Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |