| Benchmarking GNNs Using Lightning Network Data | Jul 5, 2024 | Benchmarking | Unverified | 0 |
| From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano | Jul 5, 2024 | Attribute, Benchmarking | Unverified | 0 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Jul 5, 2024 | Benchmarking, valid | Code Available | 1 |
| Towards Stable 3D Object Detection | Jul 5, 2024 | 3D Object Detection, Autonomous Driving | Unverified | 0 |
| SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry | Jul 5, 2024 | Benchmarking, object-detection | Code Available | 2 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | Jul 4, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Jul 4, 2024 | Benchmarking, Minecraft | Code Available | 2 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Jul 4, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Jul 4, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms | Jul 3, 2024 | Benchmarking, CPU | Unverified | 0 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | Benchmarking, Object | Code Available | 1 |
| Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Jul 3, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Jul 3, 2024 | Benchmarking, Code Search | Code Available | 2 |
| Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Jul 3, 2024 | Benchmarking, Diversity | Code Available | 1 |
| GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Jul 3, 2024 | Benchmarking | Code Available | 1 |
| TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations | Jul 2, 2024 | Benchmarking, text-to-speech | Unverified | 0 |
| Open foundation models for Azerbaijani language | Jul 2, 2024 | Benchmarking | Unverified | 0 |
| Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks | Jul 2, 2024 | Activity Prediction, Anomaly Detection | Code Available | 0 |
| Occlusion-Aware Seamless Segmentation | Jul 2, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Modified CMA-ES Algorithm for Multi-Modal Optimization: Incorporating Niching Strategies and Dynamic Adaptation Mechanism | Jul 1, 2024 | Benchmarking, Diversity | Unverified | 0 |
| MIRAI: Evaluating LLM Agents for Event Forecasting | Jul 1, 2024 | Articles, Benchmarking | Unverified | 0 |
| Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy | Jul 1, 2024 | Benchmarking | Unverified | 0 |
| BERGEN: A Benchmarking Library for Retrieval-Augmented Generation | Jul 1, 2024 | Benchmarking, RAG | Code Available | 3 |
| Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions | Jul 1, 2024 | Benchmarking, Question Generation | Unverified | 0 |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Jul 1, 2024 | Benchmarking, Fairness | Code Available | 2 |
| EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting | Jul 1, 2024 | 3D Reconstruction, Benchmarking | Unverified | 0 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Jul 1, 2024 | Benchmarking, document understanding | Code Available | 2 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Reinvestigating the R2 Indicator: Achieving Pareto Compliance by Integration | Jul 1, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Predictive Coding Networks -- Made Simple | Jul 1, 2024 | Benchmarking | Code Available | 2 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Jul 1, 2024 | Benchmarking, Classification | Code Available | 1 |
| Commute Graph Neural Networks | Jun 30, 2024 | Benchmarking | Unverified | 0 |
| GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing | Jun 30, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| PerSEval: Assessing Personalization in Text Summarizers | Jun 29, 2024 | Benchmarking, Human Judgment Correlation | Unverified | 0 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities | Jun 27, 2024 | Benchmarking | Code Available | 1 |
| Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges | Jun 27, 2024 | Benchmarking, Clinical Knowledge | Unverified | 0 |
| Benchmarking M6 Competitors: An Analysis of Financial Metrics and Discussion of Incentives | Jun 27, 2024 | Benchmarking | Unverified | 0 |
| UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models | Jun 27, 2024 | Attribute, Benchmarking | Code Available | 2 |
| Quantum-tunnelling deep neural network for optical illusion recognition | Jun 26, 2024 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI | Jun 26, 2024 | Benchmarking, Crop Type Mapping | Unverified | 0 |
| XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | Jun 26, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |
| GenRL: Multimodal-foundation world models for generalization in embodied agents | Jun 26, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 2 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 |
| RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems | Jun 25, 2024 | Benchmarking, RAG | Unverified | 0 |
| Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making | Jun 25, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection | Jun 25, 2024 | Benchmarking, Prompt Learning | Code Available | 1 |
| SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It) | Jun 25, 2024 | Benchmarking, Experimental Design | Code Available | 1 |