| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 |
| AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval | Oct 3, 2023 | Articles, Decision Making | Code Available | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Decision Making, Multiple-choice | Code Available | 0 |
| Uncertainty quantification in fine-tuned LLMs using LoRA ensembles | Feb 19, 2024 | Multiple-choice, Uncertainty Quantification | Code Available | 0 |
| Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings | Dec 9, 2024 | Multiple-choice | Code Available | 0 |
| Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam | Jun 14, 2024 | Fairness, Logical Reasoning | Code Available | 0 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach | Sep 9, 2024 | Computational Efficiency, Continual Pretraining | Code Available | 0 |
| Evaluating Large Language Model Biases in Persona-Steered Generation | May 30, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM | Mar 12, 2025 | Image Segmentation, Medical Image Segmentation | Code Available | 0 |