| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | MathMultiple-choice | CodeCode Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choiceVisual Reasoning | CodeCode Available | 1 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | DiagnosticEgoSchema | CodeCode Available | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | CodeCode Available | 1 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Feb 20, 2024 | Mixture-of-ExpertsMultiple-choice | CodeCode Available | 1 |
| Language Model Uncertainty Quantification with Attention Chain | Mar 24, 2025 | Computational EfficiencyLanguage Modeling | CodeCode Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical ReasoningMultiple-choice | CodeCode Available | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | CodeCode Available | 1 |