| Collaboration among Multiple Large Language Models for Medical Question Answering | May 22, 2025 | Medical Question AnsweringMultiple-choice | —Unverified | 0 | 0 |
| Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing | Apr 18, 2024 | HallucinationMultiple-choice | —Unverified | 0 | 0 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | Nov 30, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Graph-Structured Representations for Visual Question Answering | Sep 19, 2016 | Multiple-choiceQuestion Answering | —Unverified | 0 | 0 |
| GraphITE: Estimating Individual Effects of Graph-structured Treatments | Sep 29, 2020 | counterfactualDecision Making | —Unverified | 0 | 0 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | May 17, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | Dec 5, 2024 | Information RetrievalMultiple-choice | —Unverified | 0 | 0 |
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | Mar 20, 2025 | Code GenerationMultiple-choice | —Unverified | 0 | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Jun 11, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| GPT-4 to GPT-3.5: 'Hold My Scalpel' -- A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam | Apr 4, 2023 | Multiple-choice | —Unverified | 0 | 0 |