Benchmarking

Papers

Showing 5101–5150 of 5548 papers

Title	Date	Tasks	Status
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning	Jan 22, 2025	Benchmarking	Code Available
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks	Jan 5, 2025	Adversarial Robustness, Benchmarking	Code Available
Benchmarking LLM-based Relevance Judgment Methods	Apr 17, 2025	Benchmarking, Open-Domain Question Answering	Code Available
Toward 3D Object Reconstruction from Stereo Images	Oct 18, 2019	3D Object Reconstruction, Benchmarking	Code Available
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models	Jun 8, 2023	Benchmarking, Fairness	Code Available
Skelite: Compact Neural Networks for Efficient Iterative Skeletonization	Mar 10, 2025	Benchmarking, Computational Efficiency	Code Available
Divergent Creativity in Humans and Large Language Models	May 13, 2024	Benchmarking	Code Available
A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series	Jun 4, 2025	Benchmarking, Irregular Time Series	Code Available
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric	Jan 22, 2021	Benchmarking, Sentence	Code Available
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems	Oct 8, 2023	Benchmarking	Code Available
User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks	Aug 9, 2018	Benchmarking, Colorization	Code Available
Towards a Benchmark for Large Language Models for Business Process Management Tasks	Oct 4, 2024	Benchmarking, Management	Code Available
Weighting-Based Treatment Effect Estimation via Distribution Learning	Dec 26, 2020	Benchmarking	Code Available
Slot Filling for Extracting Reskilling and Upskilling Options from the Web	Jul 11, 2022	Benchmarking, Entity Linking	Code Available
On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective	Apr 26, 2023	Benchmarking, Feature Importance	Code Available
Distributional Depth-Based Estimation of Object Articulation Models	Aug 12, 2021	Benchmarking, Object	Code Available
Benchmarking Linguistic Diversity of Large Language Models	Dec 13, 2024	Benchmarking, Diversity	Code Available
On Recurrent Neural Networks for Sequence-based Processing in Communications	May 24, 2019	Benchmarking, Decoder	Code Available
Benchmarking Learning Efficiency in Deep Reservoir Computing	Sep 29, 2022	Benchmarking	Code Available
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation	Apr 21, 2025	Benchmarking	Code Available
Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections	Nov 16, 2024	Benchmarking, Diagnostic	Code Available
Benchmarking Large Language Model Uncertainty for Prompt Optimization	Sep 16, 2024	Benchmarking, Diversity	Code Available
Diversity Over Size: On the Effect of Sample and Topic Sizes for Topic-Dependent Argument Mining Datasets	May 23, 2022	Argument Mining, Benchmarking	Code Available
On the Evaluation Consistency of Attribution-based Explanations	Jul 28, 2024	Benchmarking	Code Available
On the Evaluation of Conditional GANs	Jul 11, 2019	Benchmarking, Diversity	Code Available
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice	Dec 20, 2024	Benchmarking, Diagnostic	Code Available
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments	Feb 20, 2023	Benchmarking, Robot Navigation	Code Available
On the Fragility of Active Learners for Text Classification	Mar 23, 2024	Active Learning, Benchmarking	Code Available
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation	Oct 29, 2021	Benchmarking, Brain Tumor Segmentation	Code Available
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset	Feb 8, 2024	Benchmarking	Code Available
Benchmarking Large Language Models for Math Reasoning Tasks	Aug 20, 2024	Benchmarking, In-Context Learning	Code Available
Benchmarking Large Language Models for Image Classification of Marine Mammals	Oct 22, 2024	Benchmarking, image-classification	Code Available
On the Loss of Context-awareness in General Instruction Fine-tuning	Nov 5, 2024	Benchmarking, Instruction Following	Code Available
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation	May 16, 2025	Benchmarking, Ethics	Code Available
SNaC: Coherence Error Detection for Narrative Summarization	May 19, 2022	Benchmarking, Coherence Evaluation	Code Available
SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services	May 29, 2025	Benchmarking, Information Retrieval	Code Available
Using Motif Transitions for Temporal Graph Generation	Jun 19, 2023	Benchmarking, Graph Generation	Code Available
Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning	Mar 23, 2025	Benchmarking	Code Available
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias	Jul 3, 2024	Benchmarking, Bias Detection	Code Available
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available
Word Embeddings for the Construction Domain	Oct 28, 2016	Benchmarking, General Classification	Code Available
What Actions are Needed for Understanding Human Actions in Videos?	Aug 9, 2017	Benchmarking	Code Available
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness	Jun 1, 2025	Benchmarking, Management	Code Available
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers	Mar 16, 2022	Benchmarking	Code Available
On the Use of ArXiv as a Dataset	Apr 30, 2019	Articles, Author Attribution	Code Available
On the use of automatically generated synthetic image datasets for benchmarking face recognition	Jun 8, 2021	Benchmarking, Face Recognition	Code Available
Benchmarking Large Language Models for Molecule Prediction Tasks	Mar 8, 2024	Benchmarking, Prediction	Code Available
Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS	Apr 9, 2024	Benchmarking, Neural Architecture Search	Code Available
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds	May 17, 2025	Benchmarking, Binary Classification	Code Available
On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition	Jun 6, 2021	Benchmarking, Memorization	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified