Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
Code
- github.com/facebookresearch/omnivore — Official, PyTorch, ★ 572 (loading sketch below)
- github.com/towhee-io/towhee — PyTorch, ★ 3,459
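If the official repository follows its documented usage, the pretrained models can be loaded through torch.hub roughly as sketched below. The entry-point name `omnivore_swinB`, the `input_type` keyword, and the input layout are taken from the repo's README-style examples and are assumptions here; check the current repository for the exact names.

```python
import torch

# Hypothetical usage based on the official repo's torch.hub entry points;
# verify the exact model names at github.com/facebookresearch/omnivore.
model = torch.hub.load("facebookresearch/omnivore:main", model="omnivore_swinB")
model.eval()

# Images are passed as single-frame clips: (batch, channels, time, height, width).
image = torch.randn(1, 3, 1, 224, 224)
with torch.no_grad():
    logits = model(image, input_type="image")  # ImageNet-1K class logits
print(logits.shape)
```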
Abstract
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at par with or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
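As a rough illustration of the idea described in the abstract (not the paper's exact architecture), the sketch below treats images and single-view 3D inputs as single-frame clips, embeds the RGB and depth channels with separate patch projections, and routes all tokens through one shared trunk with per-dataset classification heads. All module names, patch sizes, and class counts are illustrative assumptions; the paper uses Swin transformer trunks.

```python
import torch
import torch.nn as nn

class OmnivoreStyleModel(nn.Module):
    """Toy sketch of a single trunk shared across images, videos, and RGB-D.

    Images are treated as 1-frame videos; RGB-D inputs are 1-frame videos whose
    depth channel is handled by its own patch-embedding layer. Dimensions and
    class counts below are illustrative, not the paper's configuration.
    """

    def __init__(self, embed_dim=96, num_classes_per_head=(1000, 400, 19)):
        super().__init__()
        patch = (1, 4, 4)  # temporal x spatial patch size (illustrative)
        # RGB patch embedding shared by images and videos (3 input channels).
        self.rgb_embed = nn.Conv3d(3, embed_dim, kernel_size=patch, stride=patch)
        # Separate patch embedding for the depth channel of RGB-D inputs.
        self.depth_embed = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)
        # Stand-in for the shared transformer trunk (the paper uses Swin-T/S/B).
        self.trunk = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # One classification head per dataset (e.g. IN-1K, Kinetics, SUN RGB-D).
        self.heads = nn.ModuleList(nn.Linear(embed_dim, n) for n in num_classes_per_head)

    def forward(self, x, head_idx):
        # x: (B, C, T, H, W); images use T=1, videos use T>1, RGB-D uses C=4.
        if x.shape[1] == 4:
            rgb, depth = x[:, :3], x[:, 3:]
            tokens = self.rgb_embed(rgb) + self.depth_embed(depth)
        else:
            tokens = self.rgb_embed(x)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        feats = self.trunk(tokens).mean(dim=1)      # global average pool over patches
        return self.heads[head_idx](feats)

model = OmnivoreStyleModel()
image = torch.randn(2, 3, 1, 224, 224)  # image as a single-frame clip
video = torch.randn(2, 3, 8, 224, 224)  # short video clip
rgbd  = torch.randn(2, 4, 1, 224, 224)  # single-view 3D as RGB + depth
print(model(image, 0).shape, model(video, 1).shape, model(rgbd, 2).shape)
```

Joint training then amounts to sampling mini-batches from the image, video, and RGB-D datasets and routing each batch to its own head while all other parameters are shared.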
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| EPIC-KITCHENS-100 | OMNIVORE (Swin-B, finetuned) | Action@1 | 49.9 | — | Unverified |
| Something-Something V2 | OMNIVORE (Swin-B, IN-21K + Kinetics-400 pretrained) | Top-1 Accuracy | 71.4 | — | Unverified |