| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Mar 10, 2025 | Autonomous DrivingBenchmarking | CodeCode Available | 2 |
| MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Mar 10, 2025 | BenchmarkingMedical Question Answering | CodeCode Available | 2 |
| Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Feb 26, 2025 | BenchmarkingHallucination | CodeCode Available | 2 |
| Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | BenchmarkingFact Verification | CodeCode Available | 2 |
| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Feb 20, 2025 | BenchmarkingCode Generation | CodeCode Available | 2 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age EstimationBenchmarking | CodeCode Available | 2 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Feb 12, 2025 | BenchmarkingLong-Context Understanding | CodeCode Available | 2 |
| SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Feb 6, 2025 | BenchmarkingData Poisoning | CodeCode Available | 2 |
| Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Feb 5, 2025 | BenchmarkingLarge Language Model | CodeCode Available | 2 |
| SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Jan 28, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |