| GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model | Jan 1, 2025 | AttributeLanguage Modeling | —Unverified | 0 |
| Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene | Jan 1, 2025 | Graph GenerationLarge Language Model | —Unverified | 0 |
| SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis | Jan 1, 2025 | Large Language Model | CodeCode Available | 1 |
| Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding | Jan 1, 2025 | 3DGSLarge Language Model | —Unverified | 0 |
| HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models | Jan 1, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models | Jan 1, 2025 | Large Language ModelObject | —Unverified | 0 |
| DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving | Jan 1, 2025 | Autonomous DrivingCARLA longest6 | —Unverified | 0 |
| VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | Jan 1, 2025 | Large Language ModelVideo Segmentation | —Unverified | 0 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 1 |
| Video-Bench: Human-Aligned Video Generation Benchmark | Jan 1, 2025 | Large Language ModelVideo Generation | —Unverified | 0 |