| FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Sep 28, 2023 | BenchmarkingImage Retrieval | CodeCode Available | 1 | 5 |
| FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation | Nov 4, 2023 | BenchmarkingDrug Discovery | CodeCode Available | 1 | 5 |
| GEMv2: Multilingual NLG Benchmarking in a Single Line of Code | Jun 22, 2022 | BenchmarkingText Generation | CodeCode Available | 1 | 5 |
| GNNs as Predictors of Agentic Workflow Performances | Mar 14, 2025 | BenchmarkingPosition | CodeCode Available | 1 | 5 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | CodeCode Available | 1 | 5 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | BenchmarkingDiversity | CodeCode Available | 1 | 5 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | BenchmarkingHallucination | CodeCode Available | 1 | 5 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Sep 27, 2024 | AutoMLBenchmarking | CodeCode Available | 1 | 5 |
| FiFAR: A Fraud Detection Dataset for Learning to Defer | Dec 20, 2023 | BenchmarkingDecision Making | CodeCode Available | 1 | 5 |
| Benchmarking: Past, Present and Future | Aug 1, 2021 | BenchmarkingReading Comprehension | CodeCode Available | 1 | 5 |