| Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | May 18, 2025 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | May 18, 2025 | Articles, Benchmarking | Unverified | 0 |
| OSS-Bench: Benchmark Generator for Coding LLMs | May 18, 2025 | Benchmarking | Code Available | 0 |
| What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | May 18, 2025 | Benchmarking | Code Available | 1 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | May 18, 2025 | Benchmarking | Code Available | 2 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | May 18, 2025 | Benchmarking, Scene Understanding | Unverified | 0 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | Benchmarking, Medical Visual Question Answering | Code Available | 1 |
| CompBench: Benchmarking Complex Instruction-guided Image Editing | May 18, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | May 17, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | May 17, 2025 | Benchmarking | Code Available | 0 |