| Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Jul 4, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Jul 3, 2024 | Benchmarking | Code Available | 1 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | Benchmarking, Object | Code Available | 1 |
| Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset | Jul 3, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Occlusion-Aware Seamless Segmentation | Jul 2, 2024 | Benchmarking, Domain Adaptation | Code Available | 1 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Overcoming Common Flaws in the Evaluation of Selective Classification Systems | Jul 1, 2024 | Benchmarking, Classification | Code Available | 1 |
| Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | Benchmarking, Hallucination | Code Available | 1 |