| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | Code Available | 1 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Feb 26, 2025 | Benchmarking, Hallucination | Code Available | 2 |
| Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | Feb 26, 2025 | Benchmarking, object-detection | Unverified | 0 |
| Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | Feb 26, 2025 | Benchmarking, Retrieval | Unverified | 0 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Feb 26, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | Feb 26, 2025 | Benchmarking, Text Detection | Unverified | 0 |
| Modelling Regional Solar Photovoltaic Capacity in Great Britain | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Feb 26, 2025 | Benchmarking, Blood pressure estimation | Code Available | 1 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |