Benchmarking

Papers

Showing 1951–2000 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation	Jun 25, 2024	Action Detection, Benchmarking	Code Available	0
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation	Jun 25, 2024	ARC, Benchmarking	Code Available	0
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA	Jun 25, 2024	Benchmarking, Long-Context Understanding	Code Available	2
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language	Jun 25, 2024	Benchmarking	Unverified	0
NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods	Jun 25, 2024	3DGS, Benchmarking	Unverified	0
Towards Efficient and Scalable Training of Differentially Private Deep Learning	Jun 25, 2024	Benchmarking, Deep Learning	Code Available	0
A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems	Jun 25, 2024	Benchmarking, Collaborative Filtering	Code Available	0
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models	Jun 25, 2024	Benchmarking	Unverified	0
MatText: Do Language Models Need More than Text & Scale for Materials Modeling?	Jun 25, 2024	Benchmarking	Code Available	1
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models	Jun 24, 2024	Benchmarking	Unverified	0
CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization	Jun 24, 2024	Bayesian Optimization, Benchmarking	Unverified	0
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation	Jun 24, 2024	Benchmarking, Denoising	Code Available	2
A Closer Look at Mortality Risk Prediction from Electrocardiograms	Jun 24, 2024	Benchmarking, Prediction	Code Available	1
Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments	Jun 24, 2024	Benchmarking	Code Available	4
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking	Jun 24, 2024	Benchmarking, NeRF	Code Available	2
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models	Jun 24, 2024	Benchmarking, Data Augmentation	Code Available	1
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	Jun 24, 2024	Benchmarking, Image Generation	Code Available	2
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design	Jun 24, 2024	Benchmarking, Drug Design	Code Available	1
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track	Jun 24, 2024	Benchmarking, RAG	Code Available	1
PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs	Jun 24, 2024	Benchmarking, Machine Unlearning	Unverified	0
HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis	Jun 23, 2024	Benchmarking, Representation Learning	Code Available	3
Position: Benchmarking is Limited in Reinforcement Learning Research	Jun 23, 2024	Benchmarking, Position	Unverified	0
GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets	Jun 23, 2024	Benchmarking	Unverified	0
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking	Jun 23, 2024	Benchmarking	Code Available	2
MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication	Jun 22, 2024	Benchmarking, Meta-Learning	Code Available	0
CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans	Jun 22, 2024	Benchmarking, Decision Making	Unverified	0
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	Jun 22, 2024	Benchmarking, Code Generation	Code Available	4
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph	Jun 21, 2024	Benchmarking, Text Generation	Code Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI Agent, AutoML	Code Available	2
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors	Jun 21, 2024	Adversarial Defense, Adversarial Robustness	Unverified	0
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models	Jun 21, 2024	Benchmarking	Code Available	1
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking	Jun 21, 2024	Autonomous Driving, Benchmarking	Code Available	7
Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization	Jun 21, 2024	Benchmarking, Segmentation	Code Available	0
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video	Jun 21, 2024	Benchmarking, Few-Shot Learning	Unverified	0
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents	Jun 21, 2024	Benchmarking	Unverified	0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines	Jun 20, 2024	Benchmarking, Decision Making	Code Available	0
Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary	Jun 20, 2024	Benchmarking, In-Context Learning	Unverified	0
QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules	Jun 20, 2024	Benchmarking	Code Available	0
Beyond Optimism: Exploration With Partially Observable Rewards	Jun 20, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available	0
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer	Jun 20, 2024	All, Benchmarking	Code Available	0
How far are today's time-series models from real-world weather forecasting applications?	Jun 20, 2024	Benchmarking, Time Series	Code Available	2
The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging	Jun 20, 2024	Benchmarking	Code Available	0
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data	Jun 20, 2024	Animal Pose Estimation, Benchmarking	Unverified	0
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	Benchmarking, Classification	Code Available	1
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?	Jun 20, 2024	Benchmarking, Point Processes	Code Available	2
Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks	Jun 20, 2024	Benchmarking, Medical Image Analysis	Unverified	0
DASB -- Discrete Audio and Speech Benchmark	Jun 20, 2024	Benchmarking, Emotion Recognition	Unverified	0
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data	Jun 20, 2024	Benchmarking, Kolmogorov-Arnold Networks	Code Available	1
FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability	Jun 20, 2024	Benchmarking, Fairness	Code Available	0
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions	Jun 20, 2024	Animal Pose Estimation, Autonomous Driving	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified