| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | Jun 24, 2024 | Benchmarking, Machine Unlearning | Unverified | 0 |
| CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization | Jun 24, 2024 | Bayesian Optimization, Benchmarking | Unverified | 0 |
| GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets | Jun 23, 2024 | Benchmarking | Unverified | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | Jun 23, 2024 | Benchmarking, Position | Unverified | 0 |
| CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | Jun 22, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication | Jun 22, 2024 | Benchmarking, Meta-Learning | Code Available | 0 |
| Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | Jun 21, 2024 | Benchmarking, Few-Shot Learning | Unverified | 0 |
| Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Jun 21, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | Jun 21, 2024 | Benchmarking | Unverified | 0 |
| Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | Jun 21, 2024 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Beyond Optimism: Exploration With Partially Observable Rewards | Jun 20, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Jun 20, 2024 | Benchmarking, Fairness | Code Available | 0 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Jun 20, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | Jun 20, 2024 | Animal Pose Estimation, Autonomous Driving | Unverified | 0 |
| DASB -- Discrete Audio and Speech Benchmark | Jun 20, 2024 | Benchmarking, Emotion Recognition | Unverified | 0 |
| Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Jun 20, 2024 | All, Benchmarking | Code Available | 0 |
| Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | Jun 20, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | Jun 20, 2024 | Animal Pose Estimation, Benchmarking | Unverified | 0 |
| Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | Jun 20, 2024 | Benchmarking, Medical Image Analysis | Unverified | 0 |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Jun 20, 2024 | Benchmarking | Code Available | 0 |
| The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Jun 20, 2024 | Benchmarking | Code Available | 0 |
| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | Jun 19, 2024 | Benchmarking, Open-Domain Question Answering | Unverified | 0 |
| Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN | Jun 19, 2024 | Benchmarking, Intrusion Detection | Code Available | 0 |
| Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | Jun 19, 2024 | Benchmarking, Continual Pretraining | Unverified | 0 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | Unverified | 0 |
| Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | Jun 19, 2024 | Benchmarking, Machine Reading Comprehension | Unverified | 0 |
| M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Jun 19, 2024 | Benchmarking, Spatio-Temporal Forecasting | Code Available | 0 |
| Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | Jun 18, 2024 | Benchmarking | Unverified | 0 |
| Exploring and Benchmarking the Planning Capabilities of Large Language Models | Jun 18, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | Jun 18, 2024 | Articles, Benchmarking | Unverified | 0 |
| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Jun 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Jun 18, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| The Liouville Generator for Producing Integrable Expressions | Jun 17, 2024 | Benchmarking | Unverified | 0 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Jun 17, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | Jun 17, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | Jun 17, 2024 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Benchmarking of LLM Detection: Comparing Two Competing Approaches | Jun 17, 2024 | Benchmarking | Unverified | 0 |
| Standardizing Structural Causal Models | Jun 17, 2024 | Benchmarking, Causal Inference | Code Available | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Jun 17, 2024 | All, Benchmarking | Code Available | 0 |
| A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | Jun 17, 2024 | Benchmarking, Survey | Unverified | 0 |
| RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Jun 17, 2024 | Benchmarking, General Knowledge | Code Available | 0 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |
| Evaluating the Performance of Large Language Models via Debates | Jun 16, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | Jun 16, 2024 | Benchmarking, Object Recognition | Unverified | 0 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Jun 16, 2024 | Benchmarking, Instance Segmentation | Code Available | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Jun 16, 2024 | Benchmarking | Code Available | 0 |
| VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | Jun 16, 2024 | Action Understanding, Benchmarking | Unverified | 0 |