| Title | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Oct 2, 2023 | In-Context Learning, Instruction Following | Code Available | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Large Language Models Are Not Robust Multiple Choice Selectors | Sep 7, 2023 | Computational Efficiency, Multiple-choice | Code Available | 1 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code Generation, Multiple-choice | Code Available | 1 |
| LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Aug 20, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive Learning, Multimodal Reasoning | Code Available | 1 |
| SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | Jul 20, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation | Jun 9, 2023 | Jurisprudence, Management | Code Available | 1 |