| Title | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Feb 24, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Feb 23, 2024 | Benchmarking, Multiple-choice | Code Available | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Feb 22, 2024 | MMLU, Multiple-choice | Code Available | 2 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction, Language Modeling | Code Available | 1 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | Feb 22, 2024 | Multiple-choice | Unverified | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Ranking Large Language Models without Ground Truth | Feb 21, 2024 | Multiple-choice, Triplet | Unverified | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | Feb 21, 2024 | Multiple-choice | Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | Unverified | 0 |