SOTAVerified

Benchmarking

Papers

Showing 1951–2000 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation | Code | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2 |
| Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language | | 0 |
| NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods | | 0 |
| Towards Efficient and Scalable Training of Differentially Private Deep Learning | Code | 0 |
| A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems | Code | 0 |
| Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models | | 0 |
| MatText: Do Language Models Need More than Text & Scale for Materials Modeling? | Code | 1 |
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | | 0 |
| CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization | | 0 |
| FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Code | 1 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Code | 4 |
| From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking | Code | 2 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Code | 1 |
| Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track | Code | 1 |
| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | | 0 |
| HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis | Code | 3 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | | 0 |
| GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets | | 0 |
| Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking | Code | 2 |
| MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Code | 0 |
| CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | | 0 |
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Code | 4 |
| Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Code | 2 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2 |
| Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | | 0 |
| Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Code | 1 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Code | 7 |
| Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Code | 0 |
| Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | | 0 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | | 0 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0 |
| Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | | 0 |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Code | 0 |
| Beyond Optimism: Exploration With Partially Observable Rewards | Code | 0 |
| Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Code | 0 |
| How far are today's time-series models from real-world weather forecasting applications? | Code | 2 |
| The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Code | 0 |
| Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | | 0 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2 |
| Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | | 0 |
| DASB -- Discrete Audio and Speech Benchmark | | 0 |
| A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Code | 1 |
| FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Code | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |