| SEED-Bench: Benchmarking Multimodal Large Language Models | Jan 1, 2024 | BenchmarkingImage Generation | CodeCode Available | 3 |
| Fine-Tuning Language Models with Just Forward Passes | May 27, 2023 | GPUIn-Context Learning | CodeCode Available | 3 |
| C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | May 15, 2023 | Multiple-choice | CodeCode Available | 3 |
| VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | May 29, 2025 | Anomaly DetectionDescriptive | CodeCode Available | 2 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal RetrievalDiagnostic | CodeCode Available | 2 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | May 7, 2025 | Multiple-choiceQuestion Answering | CodeCode Available | 2 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Mar 31, 2025 | Logical ReasoningMultiple-choice | CodeCode Available | 2 |
| Mellow: a small audio language model for reasoning | Mar 11, 2025 | Audio captioningLanguage Modeling | CodeCode Available | 2 |
| BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology | Feb 28, 2025 | Multiple-choicescientific discovery | CodeCode Available | 2 |
| Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Feb 24, 2025 | GSM8KMath | CodeCode Available | 2 |