OBJECT-CENTRIC REPRESENTATION LEARNING FOR VIDEO PARSING

    Abstract: Multi-view object-centric representation learning is still immature and leaves several problems open: 1) most methods require expensive viewpoint annotations to learn object-centric representations; 2) current models cannot faithfully reconstruct the complete shapes of occluded objects. To address these issues, this paper focuses on unsupervised learning of object-centric representations from videos in which the objects are static and the viewpoint changes, without any viewpoint annotations, while simultaneously learning a viewpoint representation for each frame. The model uses a Transformer to capture correlations between frames, so that the viewpoint representations of different frames can be inferred jointly. In addition, to keep object representations consistent across viewpoints, the model uses a sequential extension module based on cross-attention to learn 3D object representations. This design strengthens the disentanglement between the two kinds of representations and enables better reconstruction of complete object shapes. Experiments on several purpose-built synthetic datasets show that the proposed model outperforms existing models in video decomposition and in handling object occlusion.
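The two pathways described above can be illustrated with a minimal NumPy sketch. This is not the authors' actual architecture (which the abstract does not specify in detail); all shapes, the mean-pooling steps, and the single-head attention are illustrative assumptions. It shows (1) self-attention across frame-level features, so each frame's viewpoint representation is inferred jointly from all frames, and (2) cross-attention from object slot queries to features pooled over all frames, so each slot binds to one object independently of viewpoint.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q (Lq, D), k/v (Lk, D) -> (Lq, D).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, N, D, K = 4, 16, 32, 3            # frames, tokens per frame, dim, object slots

feats = rng.standard_normal((T, N, D))   # stand-in for per-frame visual features

# (1) Viewpoint path: self-attention over pooled per-frame tokens, so the
#     viewpoint representation of each frame depends on every other frame.
frame_tokens = feats.mean(axis=1)                                  # (T, D)
viewpoints = attention(frame_tokens, frame_tokens, frame_tokens)   # (T, D)

# (2) Object path: K slot queries (assumed learnable in a real model)
#     cross-attend to features from ALL frames, yielding per-object
#     representations that are shared across viewpoints.
slots = rng.standard_normal((K, D))
all_feats = feats.reshape(T * N, D)
objects = attention(slots, all_feats, all_feats)                   # (K, D)
```

Because the object slots attend to the whole frame sequence rather than to one frame at a time, an occluded object can in principle borrow evidence from frames where it is visible, which is the intuition behind reconstructing complete shapes under occlusion.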

     
