SOTAVerified

Cascaded Interactional Targeting Network for Egocentric Video Analysis

2016-06-01CVPR 2016Unverified0· sign in to hype

Yang Zhou, Bingbing Ni, Richang Hong, Xiaokang Yang, Qi Tian

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

Knowing how hands move and what object is being manipulated are two key sub-tasks for analyzing first-person (egocentric) action. However, lack of fully annotated hand data as well as imprecise foreground segmentation make either sub-task challenging. This work aims to explicitly address these two issues via introducing a cascaded interactional targeting (i.e., infer both hand and active object regions) deep neural network. Firstly, a novel EM-like learning framework is proposed to train the pixel-level deep convolutional neural network (DCNN) by seamlessly integrating weakly supervised data (i.e., massive bounding box annotations) with a small set of strongly supervised data (i.e., fully annotated hand segmentation maps) to achieve state-of-the-art hand segmentation performance. Secondly, the resulting high-quality hand segmentation maps are further paired with the corresponding motion maps and object feature maps, in order to explore the contextual information among object, motion and hand to generate interactional foreground regions (operated objects). The resulting interactional target maps (hand + active object) from our cascaded DCNN are further utilized to form discriminative action representation. Experiments show that our framework has achieved the state-of-the-art egocentric action recognition performance on the benchmark dataset Activities of Daily Living (ADL).

Tasks

Reproductions