| Marathon: A Race Through the Realm of Long Context with Large Language Models | Dec 15, 2023 | Long-Context UnderstandingMultiple-choice | CodeCode Available | 1 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | MathMultiple-choice | CodeCode Available | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | CodeCode Available | 1 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8KMemorization | CodeCode Available | 1 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Nov 2, 2023 | Density EstimationDiversity | CodeCode Available | 1 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwagLanguage Modeling | CodeCode Available | 1 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain AdaptationMedical Question Answering | CodeCode Available | 1 |
| OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Oct 11, 2023 | HallucinationIn-Context Learning | CodeCode Available | 1 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor GenerationLanguage Modelling | CodeCode Available | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 |