| LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Jul 22, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding | Jul 6, 2024 | Articles, Instruction Following | Code Available | 2 |
| ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Jun 19, 2024 | Diagnostic, Multiple-choice | Code Available | 2 |
| CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena | Jun 11, 2024 | Multiple-choice, Selection bias | Code Available | 2 |
| Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation | May 22, 2024 | Informativeness, Language Modeling | Code Available | 2 |
| Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | May 5, 2024 | Multiple-choice | Code Available | 2 |
| PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games | Apr 26, 2024 | Decision Making, Language Modeling | Code Available | 2 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Mar 27, 2024 | Language Modeling, Language Modelling | Code Available | 2 |