| SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model | Apr 13, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models | Apr 11, 2025 | Clustering, Language Modeling | Code Available | 2 |
| GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation | Apr 10, 2025 | Contrastive Learning, Language Modeling | Code Available | 2 |
| TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling | Apr 9, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Apr 3, 2025 | Computational Efficiency, GPU | Code Available | 2 |
| Unicorn: Text-Only Data Synthesis for Vision Language Model Training | Mar 28, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchema, Language Modeling | Code Available | 2 |
| Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector | Mar 26, 2025 | Binary Classification, DeepFake Detection | Code Available | 2 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| MC-LLaVA: Multi-Concept Personalized Vision-Language Model | Mar 24, 2025 | Language Modeling, Language Modelling | Code Available | 2 |