| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | May 2, 2025 | Benchmarking | Code Available | 0 |
| EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models | May 2, 2025 | Benchmarking | Code Available | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | May 1, 2025 | Benchmarking, Position | Unverified | 0 |
| EnronQA: Towards Personalized RAG over Private Documents | May 1, 2025 | Benchmarking, Memorization | Unverified | 0 |
| InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method | May 1, 2025 | Benchmarking, Motion Planning | Unverified | 0 |
| AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring | May 1, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework | Apr 30, 2025 | Benchmarking, Learning Theory | Unverified | 0 |
| From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising | Apr 30, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Galvatron: An Automatic Distributed System for Efficient Foundation Model Training | Apr 30, 2025 | Benchmarking | Unverified | 0 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | Apr 30, 2025 | Arabic Text Diacritization, Benchmarking | Unverified | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Apr 29, 2025 | Benchmarking, Dataset Generation | Code Available | 0 |
| SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | Apr 29, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | Benchmarking, Face Generation | Unverified | 0 |
| Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking | Apr 29, 2025 | Benchmarking, Intrusion Detection | Unverified | 0 |
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Apr 29, 2025 | Benchmarking, Fairness | Code Available | 0 |
| On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | Apr 29, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Hydra: Marker-Free RGB-D Hand-Eye Calibration | Apr 29, 2025 | Benchmarking | Unverified | 0 |
| The Leaderboard Illusion | Apr 29, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | Apr 28, 2025 | Articles, Benchmarking | Unverified | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | Apr 28, 2025 | Benchmarking | Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | Unverified | 0 |
| ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | Apr 28, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |
| Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | Apr 27, 2025 | Benchmarking, Event-based vision | Unverified | 0 |
| The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | Apr 27, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Apr 26, 2025 | Benchmarking, GPU | Code Available | 0 |
| Assessing the Utility of Audio Foundation Models for Heart and Respiratory Sound Analysis | Apr 25, 2025 | Benchmarking | Unverified | 0 |
| QuantBench: Benchmarking AI Methods for Quantitative Investment | Apr 24, 2025 | Benchmarking, Continual Learning | Unverified | 0 |
| Token Sequence Compression for Efficient Multimodal Computing | Apr 24, 2025 | Benchmarking | Unverified | 0 |
| Design and benchmarking of a two degree of freedom tendon driver unit for cable-driven wearable technologies | Apr 24, 2025 | Benchmarking | Unverified | 0 |
| From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets and Code Repositories | Apr 23, 2025 | Benchmarking | Code Available | 0 |
| MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark | Apr 23, 2025 | Benchmarking | Code Available | 0 |
| Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations | Apr 22, 2025 | Benchmarking, Few-Shot Learning | Unverified | 0 |
| Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room | Apr 22, 2025 | Benchmarking, Fairness | Unverified | 0 |
| CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents | Apr 22, 2025 | Benchmarking, Cross-Lingual Information Retrieval | Unverified | 0 |
| Fluorescence Reference Target Quantitative Analysis Library | Apr 22, 2025 | Benchmarking | Code Available | 0 |
| A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs | Apr 22, 2025 | Benchmarking, Class-level Code Generation | Unverified | 0 |
| Benchmarking machine learning models for predicting aerofoil performance | Apr 22, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | Apr 22, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Establishing Reliability Metrics for Reward Models in Large Language Models | Apr 21, 2025 | Benchmarking | Unverified | 0 |
| Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture | Apr 21, 2025 | Benchmarking, class-incremental learning | Unverified | 0 |
| Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | Apr 21, 2025 | Benchmarking, Speaker Identification | Unverified | 0 |
| Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Apr 21, 2025 | Benchmarking | Code Available | 0 |
| IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays | Apr 20, 2025 | 3D Reconstruction, Anatomy | Unverified | 0 |
| A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents | Apr 20, 2025 | Benchmarking, Task Planning | Unverified | 0 |
| Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation | Apr 19, 2025 | Benchmarking, Image Restoration | Unverified | 0 |
| CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations | Apr 19, 2025 | Benchmarking | Unverified | 0 |
| AI Idea Bench 2025: AI Research Idea Generation Benchmark | Apr 19, 2025 | Benchmarking, scientific discovery | Unverified | 0 |
| LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers | Apr 19, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering | Apr 19, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation | Apr 18, 2025 | Benchmarking | Unverified | 0 |