
DiffusionVID: Denoising Object Boxes with Spatio-temporal Conditioning for Video Object Detection

2023-10-30 · IEEE Access 2023

Si-Dong Roh, Ki-Seok Chung


Abstract

Existing still-image object detectors suffer from image deterioration in videos, such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by the diffusion model, DiffusionVID refines random noise boxes to recover the original object boxes in a video sequence. To effectively refine the boxes from degraded images in videos, we use three novel approaches: cascade refinement, dynamic core-set conditioning, and local batch refinement. The cascade refinement architecture effectively collects information from object regions, while dynamic core-set conditioning further improves denoising quality through adaptive conditional guidance based on the spatio-temporal core-set. Local batch refinement significantly improves refinement speed by exploiting GPU parallelism. On the standard and widely used ImageNet-VID benchmark, DiffusionVID with ResNet-101 and Swin-Base backbones achieves 86.9 mAP @ 46.6 FPS and 92.4 mAP @ 27.0 FPS, respectively, which is state-of-the-art performance. To the best of the authors' knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID.
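To give a feel for the box-denoising idea the abstract describes, here is a minimal sketch of iteratively refining noisy boxes toward a detector's predictions. The function names (`refine_boxes`, `denoise_step`), the linear blending schedule, and the oracle predictor in the usage example are illustrative assumptions, not the paper's exact update rule or architecture.

```python
import numpy as np

def refine_boxes(noisy_boxes, denoise_step, num_steps=4):
    """Sketch of cascade refinement: repeatedly denoise boxes, blending
    each prediction back in at a decreasing noise level (assumed schedule).
    `denoise_step` stands in for the detector head with its conditioning."""
    boxes = noisy_boxes.copy()
    for t in np.linspace(1.0, 0.0, num_steps):   # noise level 1 -> 0
        pred = denoise_step(boxes, t)            # predicted clean boxes
        boxes = t * boxes + (1.0 - t) * pred     # move toward the prediction
    return boxes

# Toy usage: an "oracle" head that always predicts the true boxes,
# so the refined boxes converge exactly to them at t = 0.
rng = np.random.default_rng(0)
true_boxes = np.array([[10.0, 10.0, 50.0, 60.0]])   # (x1, y1, x2, y2)
noisy = true_boxes + rng.normal(0.0, 20.0, size=true_boxes.shape)
refined = refine_boxes(noisy, lambda boxes, t: true_boxes)
```

In the real model the denoising step would be a learned network conditioned on spatio-temporal features, and several such steps form the cascade; the toy oracle here only demonstrates the iterative update shape.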
