| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | May 30, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactual, Decision Making | Code Available | 0 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| CSEPrompts: A Benchmark of Introductory Computer Science Prompts | Apr 3, 2024 | Multiple-choice | Code Available | 0 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choice, Question Answering | Code Available | 0 |