| Large Language Models as Realistic Microservice Trace Generators | Dec 16, 2024 | Graph GenerationLanguage Modeling | CodeCode Available | 1 |
| Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula | Aug 8, 2024 | GSM8KLanguage Modeling | CodeCode Available | 1 |
| Evaluating Retrieval Quality in Retrieval-Augmented Generation | Apr 21, 2024 | GPULanguage Modeling | CodeCode Available | 1 |
| Large Language Models Can Be Easily Distracted by Irrelevant Context | Jan 31, 2023 | Arithmetic ReasoningLanguage Modeling | CodeCode Available | 1 |
| Evaluating Human-Language Model Interaction | Dec 19, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | ChatbotLanguage Modeling | CodeCode Available | 1 |
| Protein Structure Tokenization: Benchmarking and New Recipe | Feb 28, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 1 |
| Large Language Model Unlearning | Oct 14, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision | Jun 28, 2020 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction | Jul 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |