ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
Code
- github.com/qizekun/ShapeLLM (Official, PyTorch) ★ 230
- github.com/qizekun/ReCon (PyTorch) ★ 154
- github.com/runpeidong/act (PyTorch) ★ 103
Abstract
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder obtained by extending ReCon to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. By using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
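As a rough illustration of the architecture described above, the sketch below shows how features from a pretrained 3D point-cloud encoder such as ReCon++ could be projected into an LLM's token-embedding space and prepended to the instruction tokens. This is a minimal PyTorch sketch under assumed dimensions; the class name, MLP projector design, and all shapes are assumptions for illustration, not ShapeLLM's actual implementation.

```python
# Illustrative sketch only: connecting a 3D point-cloud encoder to an LLM.
# All names, dimensions, and the projector design are assumptions.
import torch
import torch.nn as nn


class PointCloudToLLMAdapter(nn.Module):
    """Projects 3D encoder patch features into the LLM's token-embedding space."""

    def __init__(self, point_feat_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # A simple two-layer MLP projector (hypothetical; the real connector may differ).
        self.proj = nn.Sequential(
            nn.Linear(point_feat_dim, llm_embed_dim),
            nn.GELU(),
            nn.Linear(llm_embed_dim, llm_embed_dim),
        )

    def forward(self, point_tokens: torch.Tensor) -> torch.Tensor:
        # point_tokens: (batch, num_patches, point_feat_dim) from the 3D encoder
        return self.proj(point_tokens)  # (batch, num_patches, llm_embed_dim)


# Usage sketch: concatenate projected point tokens with embedded instruction tokens
# before feeding the sequence to the LLM backbone.
adapter = PointCloudToLLMAdapter()
point_tokens = torch.randn(1, 512, 1024)  # placeholder ReCon++-style patch features
text_embeds = torch.randn(1, 32, 4096)    # placeholder embedded instruction tokens
llm_inputs = torch.cat([adapter(point_tokens), text_embeds], dim=1)
```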
Benchmark Results
| Dataset | Model | Metric | Score |
|---|---|---|---|
| Objaverse | ShapeLLM-7B | GPT-4 score | 46.92 |
| Objaverse | ShapeLLM-13B | GPT-4 score | 48.94 |