| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Jan 1, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction | Jan 1, 2025 | Descriptive, Instruction Following | Unverified | 0 |
| HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving | Jan 1, 2025 | Autonomous Driving, CARLA longest6 | Unverified | 0 |
| Taxonomy-Aware Evaluation of Vision-Language Models | Jan 1, 2025 | Fine-Grained Image Classification, Language Modeling | Unverified | 0 |
| Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model | Jan 1, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output | Jan 1, 2025 | Instruction Following, Language Modeling | Code Available | 0 |
| Flexible Frame Selection for Efficient Video Reasoning | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification | Jan 1, 2025 | Classification, Language Modeling | Code Available | 0 |