Joint learning of images and videos with a single Vision Transformer
2023-08-21
Shuki Shimizu, Toru Tamaki
Abstract
In this study, we propose a method for jointly learning images and videos using a single model. In general, images and videos are trained by separate models. We propose a Vision Transformer, IV-ViT, that takes a batch of images as input, and also takes a set of video frames, aggregating them temporally by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
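The late-fusion idea above can be sketched as follows: a single shared backbone maps every frame to a feature vector, and a video's features are aggregated over time by averaging, so an image is simply the one-frame case. This is a minimal illustrative sketch, not the paper's implementation; the backbone here is a stand-in random projection, and all names are hypothetical.

```python
import numpy as np

def backbone_features(frames):
    """Stand-in for a shared Vision Transformer backbone: maps each
    frame (H, W, C) to a D-dimensional feature via a fixed random
    projection (illustrative only, not the paper's model)."""
    rng = np.random.default_rng(0)  # fixed seed: same "weights" every call
    D = 8
    flat = frames.reshape(frames.shape[0], -1)          # (T, H*W*C)
    proj = rng.standard_normal((flat.shape[1], D))       # (H*W*C, D)
    return flat @ proj                                   # (T, D)

def late_fusion(frames):
    """Run the same backbone on every frame, then aggregate over the
    temporal axis by averaging (late fusion)."""
    per_frame = backbone_features(frames)  # (T, D)
    return per_frame.mean(axis=0)          # (D,)

# A single image is the T=1 case, so images and videos share one model.
video = np.ones((4, 2, 2, 3))  # 4 frames of 2x2 RGB
image = np.ones((1, 2, 2, 3))  # an image treated as a one-frame video
v_feat = late_fusion(video)
i_feat = late_fusion(image)
assert v_feat.shape == i_feat.shape == (8,)
```

Because the averaging happens after the per-frame features are computed, the same forward pass handles a batch of images and a batch of video clips, which is what allows joint training with one set of weights.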