| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | —Unverified | 0 |
| Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare | Oct 24, 2024 | Multiple-choice | —Unverified | 0 |
| Large Language Models Still Exhibit Bias in Long Text | Oct 23, 2024 | Fairness, Multiple-choice | —Unverified | 0 |
| GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks | Oct 22, 2024 | Code Generation, Code Summarization | —Unverified | 0 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactual, Decision Making | Code Available | 0 |
| Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S | Oct 21, 2024 | Multiple-choice | —Unverified | 0 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | Oct 18, 2024 | Fairness, Multiple-choice | —Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | —Unverified | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | Oct 17, 2024 | Multiple-choice, Response Generation | —Unverified | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | Oct 17, 2024 | Legal Reasoning, Multiple-choice | —Unverified | 0 |