| EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models | May 2, 2025 | Benchmarking | Code Available | 0 |
| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | May 2, 2025 | Benchmarking | Code Available | 0 |
| EnronQA: Towards Personalized RAG over Private Documents | May 1, 2025 | Benchmarking, Memorization | Unverified | 0 |
| InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method | May 1, 2025 | Benchmarking, Motion Planning | Unverified | 0 |
| AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring | May 1, 2025 | Benchmarking, Deep Learning | Unverified | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | May 1, 2025 | Benchmarking, Position | Unverified | 0 |
| From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising | Apr 30, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Galvatron: An Automatic Distributed System for Efficient Foundation Model Training | Apr 30, 2025 | Benchmarking | Unverified | 0 |
| Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework | Apr 30, 2025 | Benchmarking, Learning Theory | Unverified | 0 |
| Sadeed: Advancing Arabic Diacritization Through Small Language Model | Apr 30, 2025 | Arabic Text Diacritization, Benchmarking | Unverified | 0 |
| The Leaderboard Illusion | Apr 29, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | Benchmarking, Face Generation | Unverified | 0 |
| SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | Apr 29, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Apr 29, 2025 | Benchmarking, Fairness | Code Available | 0 |
| On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | Apr 29, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Hydra: Marker-Free RGB-D Hand-Eye Calibration | Apr 29, 2025 | Benchmarking | Unverified | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Apr 29, 2025 | Benchmarking, Dataset Generation | Code Available | 0 |
| Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking | Apr 29, 2025 | Benchmarking, Intrusion Detection | Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | Unverified | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | Apr 28, 2025 | Articles, Benchmarking | Unverified | 0 |
| ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies | Apr 28, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | Apr 28, 2025 | Benchmarking | Unverified | 0 |
| Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception | Apr 27, 2025 | Benchmarking, Event-based vision | Unverified | 0 |
| The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach | Apr 27, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Apr 26, 2025 | Benchmarking, GPU | Code Available | 0 |