Abstract:
Multi-view object-centric learning is still immature and faces several open problems: 1) most methods require expensive viewpoint annotations to learn object-centric representations; 2) existing models cannot faithfully reconstruct the complete shapes of occluded objects. To address these issues, this paper focuses on unsupervised learning of object-centric representations from object-static, viewpoint-changing videos without viewpoint annotations, while simultaneously learning the viewpoint representation of each frame. Built on a Transformer, the model captures correlations between frames and can therefore jointly infer the viewpoint representations of different frames. In addition, to maintain object constancy across viewpoints, we propose a sequential extension module that learns 3D object representations via a cross-attention mechanism. The proposed model improves the disentanglement between different representations and better reconstructs complete object shapes. Experiments on several purpose-built synthetic datasets show that the proposed model outperforms existing models in both video decomposition and occluded-object reconstruction.