| EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements | Jun 10, 2025 | Binary ClassificationFinancial Analysis | CodeCode Available | 1 | 5 |
| Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers | Apr 25, 2025 | Large Language Model | CodeCode Available | 1 | 5 |
| AuditWen:An Open-Source Large Language Model for Audit | Oct 9, 2024 | Answer GenerationLanguage Modeling | CodeCode Available | 1 | 5 |
| Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation | Oct 22, 2024 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 1 | 5 |
| DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments | May 31, 2025 | Large Language Model | CodeCode Available | 1 | 5 |
| Exploring Empty Spaces: Human-in-the-Loop Data Augmentation | Oct 1, 2024 | Data AugmentationDiversity | CodeCode Available | 1 | 5 |
| Matching Patients to Clinical Trials with Large Language Models | Jul 27, 2023 | Language ModellingLarge Language Model | CodeCode Available | 1 | 5 |
| A Study of Generative Large Language Model for Medical Research and Healthcare | May 22, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory | May 8, 2025 | Large Language ModelNavigate | CodeCode Available | 1 | 5 |
| MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems | May 23, 2023 | Language ModellingLarge Language Model | CodeCode Available | 1 | 5 |