| ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | May 19, 2025 | Graph GenerationKnowledge Distillation | —Unverified | 0 |
| MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | May 19, 2025 | DecoderImage Generation | CodeCode Available | 0 |
| Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering | May 17, 2025 | Document RankingLarge Language Model | —Unverified | 0 |
| Unifying Segment Anything in Microscopy with Multimodal Large Language Model | May 16, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning | May 10, 2025 | Image AugmentationLarge Language Model | CodeCode Available | 0 |
| Is your multimodal large language model a good science tutor? | May 9, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills | May 9, 2025 | Image RetouchingLarge Language Model | —Unverified | 0 |
| On Path to Multimodal Generalist: General-Level and General-Bench | May 7, 2025 | Large Language ModelMultimodal Large Language Model | —Unverified | 0 |
| Consistency-aware Fake Videos Detection on Short Video Platforms | Apr 30, 2025 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption GenerationDense Video Captioning | —Unverified | 0 |