Benchmarking

Papers

Showing 3001–3025 of 5548 papers

Title	Date	Tasks	Status
PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs	Jun 24, 2024	Benchmarking, Machine Unlearning	Unverified
CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization	Jun 24, 2024	Bayesian Optimization, Benchmarking	Unverified
GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets	Jun 23, 2024	Benchmarking	Unverified
Position: Benchmarking is Limited in Reinforcement Learning Research	Jun 23, 2024	Benchmarking, Position	Unverified
CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans	Jun 22, 2024	Benchmarking, Decision Making	Unverified
MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication	Jun 22, 2024	Benchmarking, Meta-Learning	Code Available
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video	Jun 21, 2024	Benchmarking, Few-Shot Learning	Unverified
Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization	Jun 21, 2024	Benchmarking, Segmentation	Code Available
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents	Jun 21, 2024	Benchmarking	Unverified
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors	Jun 21, 2024	Adversarial Defense, Adversarial Robustness	Unverified
Beyond Optimism: Exploration With Partially Observable Rewards	Jun 20, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available
FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability	Jun 20, 2024	Benchmarking, Fairness	Code Available
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines	Jun 20, 2024	Benchmarking, Decision Making	Code Available
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions	Jun 20, 2024	Animal Pose Estimation, Autonomous Driving	Unverified
DASB -- Discrete Audio and Speech Benchmark	Jun 20, 2024	Benchmarking, Emotion Recognition	Unverified
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer	Jun 20, 2024	All, Benchmarking	Code Available
Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary	Jun 20, 2024	Benchmarking, In-Context Learning	Unverified
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data	Jun 20, 2024	Animal Pose Estimation, Benchmarking	Unverified
Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks	Jun 20, 2024	Benchmarking, Medical Image Analysis	Unverified
QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules	Jun 20, 2024	Benchmarking	Code Available
The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging	Jun 20, 2024	Benchmarking	Code Available
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models	Jun 19, 2024	Benchmarking, Open-Domain Question Answering	Unverified
Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN	Jun 19, 2024	Benchmarking, Intrusion Detection	Code Available
Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective	Jun 19, 2024	Benchmarking, Continual Pretraining	Unverified
Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration	Jun 19, 2024	Benchmarking, Distractor Generation	Unverified

Page 121 of 222

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified