SOTAVerified

Benchmarking

Papers

Showing 1951-1975 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation | Code | 0 |
| Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2 |
| Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language | | 0 |
| MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Code | 1 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems | Code | 0 |
| Towards Efficient and Scalable Training of Differentially Private Deep Learning | Code | 0 |
| NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods | | 0 |
| Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models | | 0 |
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | | 0 |
| CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization | | 0 |
| FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1 |
| From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Code | 2 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Code | 4 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2 |
| Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Code | 1 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Code | 1 |
| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | | 0 |
| GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets | | 0 |
| HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis | Code | 3 |
| Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking | Code | 2 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | | 0 |
| MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Code | 0 |
Page 79 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |