| FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Dec 5, 2024 | Descriptive, Visual Question Answering | Code Available | 2 |
| SensorLLM: Human-Intuitive Alignment of Multivariate Sensor Data with LLMs for Activity Recognition | Oct 14, 2024 | Activity Recognition, Descriptive | Code Available | 2 |
| SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Sep 26, 2024 | Descriptive, Generalized Referring Expression Comprehension | Code Available | 2 |
| SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description | Aug 24, 2024 | Descriptive, Speech Synthesis | Code Available | 2 |
| Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Jul 8, 2024 | Action Quality Assessment, Descriptive | Code Available | 2 |
| DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification | Jul 4, 2024 | Descriptive, Diversity | Code Available | 2 |
| MedCalc-Bench: Evaluating Large Language Models for Medical Calculations | Jun 17, 2024 | Descriptive, Medical Diagnosis | Code Available | 2 |
| RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Jun 11, 2024 | AI Agent, Descriptive | Code Available | 2 |
| Composed Image Retrieval for Remote Sensing | May 24, 2024 | Composed Image Retrieval (CoIR), Descriptive | Code Available | 2 |
| TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Apr 14, 2024 | Dense Video Captioning, Descriptive | Code Available | 2 |