| The Claude 3 Model Family: Opus, Sonnet, Haiku | Mar 4, 2024 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Unverified | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA; MMLU | Unverified | 0 |
| OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models | Feb 29, 2024 | Medical Question Answering; MedQA | Unverified | 0 |
| Do Large Language Models Mirror Cognitive Language Processing? | Feb 28, 2024 | Chatbot; Logical Reasoning | Unverified | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Feb 27, 2024 | 8k; Language Modeling | Code Available | 0 |
| ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling | Feb 21, 2024 | MMLU; Retrieval | Code Available | 0 |
| Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models | Feb 19, 2024 | MMLU | Unverified | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer Selection; Language Modeling | Code Available | 0 |
| Towards Uncertainty-Aware Language Agent | Jan 25, 2024 | MMLU; StrategyQA | Unverified | 0 |
| LLaMA Beyond English: An Empirical Study on Language Capability Transfer | Jan 2, 2024 | GPU; Informativeness | Unverified | 0 |