Benchmarking

Papers


Showing 2901–2950 of 5548 papers

Title	Date	Tasks	Status	Hype
Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning	Oct 6, 2023	Benchmarking, Federated Learning	Unverified	0
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis	Oct 6, 2023	Benchmarking, Domain Generalization	Unverified	0
Bringing Quantum Algorithms to Automated Machine Learning: A Systematic Review of AutoML Frameworks Regarding Extensibility for QML Algorithms	Oct 6, 2023	AutoML, Benchmarking	Unverified	0
A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling	Oct 5, 2023	Benchmarking, Deep Reinforcement Learning	Unverified	0
PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling	Oct 5, 2023	Benchmarking, Language Modeling	Code Available	1
Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report	Oct 5, 2023	Benchmarking	Unverified	0
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation	Oct 5, 2023	Benchmarking, Decision Making	Code Available	2
Deep Reinforcement Learning Algorithms for Hybrid V2X Communication: A Benchmarking Study	Oct 4, 2023	Autonomous Vehicles, Benchmarking	Unverified	0
Can Language Models Employ the Socratic Method? Experiments with Code Debugging	Oct 4, 2023	Benchmarking	Code Available	1
Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma	Oct 4, 2023	Benchmarking, Segmentation	Code Available	0
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference	Oct 4, 2023	Benchmarking, GPU	Unverified	0
On the Performance of Multimodal Language Models	Oct 4, 2023	Benchmarking, Binary Classification	Unverified	0
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation	Oct 4, 2023	3D Generation, Benchmarking	Code Available	3
PGDQN: Preference-Guided Deep Q-Network	Oct 3, 2023	Atari Games, Benchmarking	Code Available	1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery	Oct 3, 2023	Benchmarking, Causal Discovery	Code Available	1
EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations	Oct 3, 2023	Atomic Forces, Benchmarking	Unverified	0
EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods	Oct 3, 2023	Benchmarking, text-guided-image-editing	Unverified	0
Benchmarking and Improving Generator-Validator Consistency of Language Models	Oct 3, 2023	Benchmarking, Instruction Following	Unverified	0
GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking	Oct 3, 2023	Benchmarking, counterfactual	Code Available	1
Learning Quantum Processes with Quantum Statistical Queries	Oct 3, 2023	Benchmarking, Cryptanalysis	Code Available	0
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation	Oct 2, 2023	Benchmarking, Continual Learning	Code Available	0
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench	Oct 2, 2023	Benchmarking, Safety Alignment	Code Available	1
A New Real-World Video Dataset for the Comparison of Defogging Algorithms	Oct 2, 2023	Benchmarking, Deblurring	Unverified	0
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation	Oct 2, 2023	Benchmarking, News Recommendation	Code Available	1
TRAM: Benchmarking Temporal Reasoning for Large Language Models	Oct 2, 2023	Benchmarking, Few-Shot Learning	Unverified	0
CoDBench: A Critical Evaluation of Data-driven Models for Continuous Dynamical Systems	Oct 2, 2023	Benchmarking, Computational Efficiency	Unverified	0
FELM: Benchmarking Factuality Evaluation of Large Language Models	Oct 1, 2023	Benchmarking, Math	Code Available	1
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models	Oct 1, 2023	Benchmarking	Code Available	2
Adaptive Control of an Inverted Pendulum by a Reinforcement Learning-based LQR Method	Sep 30, 2023	Benchmarking, Reinforcement Learning (RL)	Unverified	0
The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks	Sep 30, 2023	Benchmarking	Unverified	0
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data	Sep 29, 2023	Benchmarking, Contrastive Learning	Code Available	1
Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection	Sep 29, 2023	Benchmarking, Diversity	Unverified	0
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things	Sep 29, 2023	Benchmarking, Federated Learning	Code Available	1
Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym	Sep 29, 2023	Bayesian Optimization, Benchmarking	Unverified	0
Benchmarking Collaborative Learning Methods Cost-Effectiveness for Prostate Segmentation	Sep 29, 2023	Benchmarking, Federated Learning	Unverified	0
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?	Sep 29, 2023	Benchmarking, Knowledge Graph Completion	Code Available	1
Benchmarking Cognitive Biases in Large Language Models as Evaluators	Sep 29, 2023	Benchmarking, In-Context Learning	Code Available	1
Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors	Sep 29, 2023	Benchmarking, Computational Efficiency	Unverified	0
A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater	Sep 29, 2023	Benchmarking	Unverified	0
Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts	Sep 29, 2023	Benchmarking, Decision Making	Unverified	0
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation	Sep 29, 2023	3D Human Pose Estimation, 3D Human Reconstruction	Code Available	3
G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks	Sep 29, 2023	Benchmarking	Code Available	1
FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding	Sep 28, 2023	Benchmarking, Image Retrieval	Code Available	1
LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite	Sep 28, 2023	Benchmarking	Code Available	1
Revisiting Neural Program Smoothing for Fuzzing	Sep 28, 2023	Benchmarking, CPU	Code Available	1
Language Models as a Service: Overview of a New Paradigm and its Challenges	Sep 28, 2023	Benchmarking	Unverified	0
LawBench: Benchmarking Legal Knowledge of Large Language Models	Sep 28, 2023	Articles, Benchmarking	Code Available	2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond	Sep 28, 2023	Benchmarking	Code Available	2
The Trickle-down Impact of Reward (In-)consistency on RLHF	Sep 28, 2023	Benchmarking	Code Available	1
OceanBench: The Sea Surface Height Edition	Sep 27, 2023	Benchmarking, Sensor Fusion	Code Available	1



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified