| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Jan 22, 2025 | Benchmarking | Code Available | 0 |
| Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks | Jan 5, 2025 | Adversarial Robustness, Benchmarking | Code Available | 0 |
| Benchmarking LLM-based Relevance Judgment Methods | Apr 17, 2025 | Benchmarking, Open-Domain Question Answering | Code Available | 0 |
| Toward 3D Object Reconstruction from Stereo Images | Oct 18, 2019 | 3D Object Reconstruction, Benchmarking | Code Available | 0 |
| DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Jun 8, 2023 | Benchmarking, Fairness | Code Available | 0 |
| Skelite: Compact Neural Networks for Efficient Iterative Skeletonization | Mar 10, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 |
| Divergent Creativity in Humans and Large Language Models | May 13, 2024 | Benchmarking | Code Available | 0 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Jun 4, 2025 | Benchmarking, Irregular Time Series | Code Available | 0 |
| A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Jan 22, 2021 | Benchmarking, Sentence | Code Available | 0 |
| Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Oct 8, 2023 | Benchmarking | Code Available | 0 |
| User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks | Aug 9, 2018 | Benchmarking, Colorization | Code Available | 0 |
| Towards a Benchmark for Large Language Models for Business Process Management Tasks | Oct 4, 2024 | Benchmarking, Management | Code Available | 0 |
| Weighting-Based Treatment Effect Estimation via Distribution Learning | Dec 26, 2020 | Benchmarking | Code Available | 0 |
| Slot Filling for Extracting Reskilling and Upskilling Options from the Web | Jul 11, 2022 | Benchmarking, Entity Linking | Code Available | 0 |
| On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective | Apr 26, 2023 | Benchmarking, Feature Importance | Code Available | 0 |
| Distributional Depth-Based Estimation of Object Articulation Models | Aug 12, 2021 | Benchmarking, Object | Code Available | 0 |
| Benchmarking Linguistic Diversity of Large Language Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 0 |
| On Recurrent Neural Networks for Sequence-based Processing in Communications | May 24, 2019 | Benchmarking, Decoder | Code Available | 0 |
| Benchmarking Learning Efficiency in Deep Reservoir Computing | Sep 29, 2022 | Benchmarking | Code Available | 0 |
| Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Apr 21, 2025 | Benchmarking | Code Available | 0 |
| Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections | Nov 16, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Benchmarking Large Language Model Uncertainty for Prompt Optimization | Sep 16, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Diversity Over Size: On the Effect of Sample and Topic Sizes for Topic-Dependent Argument Mining Datasets | May 23, 2022 | Argument Mining, Benchmarking | Code Available | 0 |
| On the Evaluation Consistency of Attribution-based Explanations | Jul 28, 2024 | Benchmarking | Code Available | 0 |
| On the Evaluation of Conditional GANs | Jul 11, 2019 | Benchmarking, Diversity | Code Available | 0 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Feb 20, 2023 | Benchmarking, Robot Navigation | Code Available | 0 |
| On the Fragility of Active Learners for Text Classification | Mar 23, 2024 | Active Learning, Benchmarking | Code Available | 0 |
| Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Oct 29, 2021 | Benchmarking, Brain Tumor Segmentation | Code Available | 0 |
| Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Feb 8, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Aug 20, 2024 | Benchmarking, In-Context Learning | Code Available | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Oct 22, 2024 | Benchmarking, image-classification | Code Available | 0 |
| On the Loss of Context-awareness in General Instruction Fine-tuning | Nov 5, 2024 | Benchmarking, Instruction Following | Code Available | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| SNaC: Coherence Error Detection for Narrative Summarization | May 19, 2022 | Benchmarking, Coherence Evaluation | Code Available | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| Using Motif Transitions for Temporal Graph Generation | Jun 19, 2023 | Benchmarking, Graph Generation | Code Available | 0 |
| Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Mar 23, 2025 | Benchmarking | Code Available | 0 |
| Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Jul 3, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Jun 17, 2024 | All, Benchmarking | Code Available | 0 |
| Word Embeddings for the Construction Domain | Oct 28, 2016 | Benchmarking, General Classification | Code Available | 0 |
| What Actions are Needed for Understanding Human Actions in Videos? | Aug 9, 2017 | Benchmarking | Code Available | 0 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 |
| On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers | Mar 16, 2022 | Benchmarking | Code Available | 0 |
| On the Use of ArXiv as a Dataset | Apr 30, 2019 | Articles, Author Attribution | Code Available | 0 |
| On the use of automatically generated synthetic image datasets for benchmarking face recognition | Jun 8, 2021 | Benchmarking, Face Recognition | Code Available | 0 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Mar 8, 2024 | Benchmarking, Prediction | Code Available | 0 |
| Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Apr 9, 2024 | Benchmarking, Neural Architecture Search | Code Available | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | May 17, 2025 | Benchmarking, Binary Classification | Code Available | 0 |
| On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition | Jun 6, 2021 | Benchmarking, Memorization | Code Available | 0 |