| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Aug 3, 2024 | HallucinationMultiple-choice | CodeCode Available | 12 | 5 |
| HealthBench: Evaluating Large Language Models Towards Improved Human Health | May 13, 2025 | Instruction FollowingMultiple-choice | CodeCode Available | 7 | 5 |
| LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | Dec 19, 2024 | 8kIn-Context Learning | CodeCode Available | 5 | 5 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 5 | 5 |
| MMBench: Is Your Multi-modal Model an All-around Player? | Jul 12, 2023 | AllInstruction Following | CodeCode Available | 5 | 5 |
| Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective | Oct 16, 2022 | Coreference ResolutionMultiple-choice | CodeCode Available | 4 | 5 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | May 20, 2025 | MMEMultiple-choice | CodeCode Available | 4 | 5 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Mar 5, 2024 | Multiple-choice | CodeCode Available | 4 | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question AnsweringImage Captioning | CodeCode Available | 4 | 5 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 4 | 5 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 | 5 |
| BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text | Mar 27, 2024 | ArticlesLanguage Modeling | CodeCode Available | 4 | 5 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot LearningGenerative Visual Question Answering | CodeCode Available | 4 | 5 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 4 | 5 |
| SEED-Bench: Benchmarking Multimodal Large Language Models | Jan 1, 2024 | BenchmarkingImage Generation | CodeCode Available | 3 | 5 |
| Fine-Tuning Language Models with Just Forward Passes | May 27, 2023 | GPUIn-Context Learning | CodeCode Available | 3 | 5 |
| C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | May 15, 2023 | Multiple-choice | CodeCode Available | 3 | 5 |
| PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models | Mar 26, 2024 | Code CompletionFew-Shot Learning | CodeCode Available | 3 | 5 |
| Improving Model Evaluation using SMART Filtering of Benchmark Datasets | Oct 26, 2024 | ChatbotDiversity | CodeCode Available | 3 | 5 |
| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | May 17, 2024 | 16kBenchmarking | CodeCode Available | 3 | 5 |
| MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Aug 2, 2024 | Multimodal ReasoningMultiple-choice | CodeCode Available | 3 | 5 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPUMultiple-choice | CodeCode Available | 3 | 5 |
| SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Apr 25, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 3 | 5 |
| FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Aug 19, 2023 | Multiple-choice | CodeCode Available | 2 | 5 |
| MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Aug 5, 2024 | Image ComprehensionMultiple-choice | CodeCode Available | 2 | 5 |