Object co-segmentation

In computer vision, object co-segmentation is a special case of image segmentation and is defined as jointly segmenting semantically similar objects in multiple images or video frames[2][3]. The problem of separating out a foreground object from the background across all frames of a video is known as video object segmentation. The goal is to label each pixel in all video frames according to whether it belongs to the unknown target object or not. The resulting segmentation is a spatio-temporal object tube delineating the boundaries of the object throughout a video. Such a capability is useful for a variety of computer vision tasks, such as object-centric video summarization, action analysis, video surveillance, and content-based video retrieval.
Challenges
It is often challenging to extract segmentation masks of a target object from a noisy collection of images or video frames, a task that involves object discovery coupled with segmentation. A noisy collection implies that the target object is present only sporadically in the set of images, or that it appears and disappears intermittently throughout the video of interest. Most methods either perform only video object discovery, or perform video object segmentation under the assumption that the object exists in all video frames. Because this unrealistically optimistic assumption, that the target object is present in all (or most) video frames, rarely holds in practice, methods robust to a large number of noisy frames (i.e., irrelevant frames devoid of the target object) are needed.
Moreover, most existing methods emphasize leveraging low-level features (e.g., color and motion) or contextual information shared among individual or consecutive frames to find the common regions, and simply employ short-term motion cues (e.g., optical flow) between consecutive frames to smooth the spatio-temporal segmentation. They therefore often encounter difficulties when the objects exhibit large variations in appearance, motion, size, pose, and viewpoint.
Furthermore, several methods[4][5][6][7][8][9] employ a mid-level representation of objects (i.e., object proposals[10]) as an additional cue to facilitate segmentation of the object, with object discovery and object segmentation conveniently isolated as two independent tasks performed in a two-step manner[11]. Unfortunately, disregarding the dependencies between the two tasks often leads to sub-optimal performance, e.g., object segmentation failing to focus on the target, or object discovery providing highly inaccurate object proposals.
Dynamic Markov networks-based methods


A joint object discovery and co-segmentation method based on coupled dynamic Markov networks has been proposed[1], which claims significant improvements in robustness against irrelevant/noisy video frames. A principled probabilistic model is introduced with two coupled dynamic Markov networks, one for discovery and the other for segmentation. When Bayesian inference is conducted on this model using belief propagation, the bi-directional propagation of the beliefs of the object's posteriors on an object proposal graph and a superpixel graph reveals a clear collaboration between the two inference tasks. More specifically, object discovery is conducted through the object proposal graph, which represents the correlations of object proposals among multiple frames and is built with the help of the spatio-temporal object segmentation tube obtained by object segmentation on the superpixel graph. Object segmentation, in turn, is achieved on the superpixel graph, which represents the connections of superpixels and benefits from the spatio-temporal object proposal tube generated by object discovery through the object proposal graph.
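For illustration, the two graphs might be assembled as in the following minimal Python sketch; the appearance features, similarity measure, and threshold are assumptions made here for clarity, not the construction used in the cited paper.

```python
import numpy as np

def build_proposal_graph(proposal_feats, sim_threshold=0.8):
    """Link object proposals across frames whose appearance features are similar.

    proposal_feats: list over frames; each entry is an (N_t, D) array of
    L2-normalised appearance descriptors for the N_t proposals of frame t.
    Returns a list of edges ((t, i), (u, j)) between proposals of different frames.
    """
    edges = []
    T = len(proposal_feats)
    for t in range(T):
        for u in range(t + 1, T):
            sim = proposal_feats[t] @ proposal_feats[u].T  # cosine similarity
            for i, j in zip(*np.where(sim > sim_threshold)):
                edges.append(((t, int(i)), (u, int(j))))
    return edges

def build_superpixel_graph(spatial_adj, temporal_overlap):
    """Link superpixels that are spatial neighbours within a frame, or that
    overlap (e.g., under optical-flow warping) in consecutive frames.

    spatial_adj[t]: list of (j, k) superpixel neighbour pairs within frame t.
    temporal_overlap[t]: list of (j, k) pairs linking frame t to frame t+1.
    """
    edges = [((t, j), (t, k)) for t, pairs in enumerate(spatial_adj) for j, k in pairs]
    edges += [((t, j), (t + 1, k)) for t, pairs in enumerate(temporal_overlap) for j, k in pairs]
    return edges
```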
Given a video $\mathcal{V}=\{V_t\}_{t=1}^{T}$ with a significant number of noisy frames, the goal is to jointly find an object discovery labeling $\mathcal{O}=\{\mathbf{o}_t\}_{t=1}^{T}$ and an object segmentation labeling $\mathcal{S}=\{\mathbf{s}_t\}_{t=1}^{T}$ from $\mathcal{V}$. $\mathcal{O}$ is a spatio-temporal region (object) proposal tube of $\mathcal{V}$. $\mathbf{o}_t=\{o_t^{i}\}_{i=1}^{N_t}$ is the object discovery label of each frame $V_t$, where $o_t^{i}\in\{0,1\}$ and $\sum_{i=1}^{N_t} o_t^{i}\le 1$, i.e., no more than one region proposal among all the $N_t$ proposals in $V_t$ will be identified as the object. $\mathcal{S}$ is a spatio-temporal object segmentation tube of $\mathcal{V}$. $\mathbf{s}_t=\{s_t^{j}\}_{j=1}^{M_t}$ is the object segmentation label of $V_t$, where $s_t^{j}\in\{0,1\}$ denotes that each of the $M_t$ superpixels either belongs to the object ($s_t^{j}=1$) or the background ($s_t^{j}=0$). The image observations associated with $o_t^{i}$, $\mathbf{o}_t$, $s_t^{j}$, and $\mathbf{s}_t$ are denoted by $z_t^{o,i}$, $\mathbf{z}_t^{o}$, $z_t^{s,j}$, and $\mathbf{z}_t^{s}$, respectively. $z_t^{o,i}$ and $z_t^{s,j}$ are the representations of a region proposal and a superpixel, respectively.
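Concretely, the two labelings can be stored as one binary indicator vector per frame. The sketch below is an illustrative data layout only (the frame sizes and chosen indices are made up), encoding a discovery label $\mathbf{o}_t$ that selects at most one of the $N_t$ proposals and a segmentation label $\mathbf{s}_t$ over the $M_t$ superpixels.

```python
import numpy as np

# Hypothetical frame with N_t = 5 region proposals and M_t = 300 superpixels.
N_t, M_t = 5, 300

# Discovery label o_t: binary indicator over proposals, at most one entry set to 1.
o_t = np.zeros(N_t, dtype=int)
o_t[2] = 1                      # proposal 2 is identified as the object
assert o_t.sum() <= 1           # a noisy frame without the object keeps all zeros

# Segmentation label s_t: one binary label per superpixel (1 = object, 0 = background).
s_t = np.zeros(M_t, dtype=int)
s_t[[10, 11, 42]] = 1           # superpixels covering the object
```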
Specifically, beneficial information is encouraged to propagate between the joint inference of $\mathcal{O}$ and $\mathcal{S}$, and hence video object discovery and video object segmentation can naturally benefit each other. A Markov network is employed to characterize the joint object discovery and segmentation of frame $V_t$. The undirected link between $\mathbf{o}_t$ and $\mathbf{s}_t$ represents the mutual influence of object discovery and object segmentation, and is associated with a potential compatibility function $\psi(\mathbf{o}_t,\mathbf{s}_t)$. The directed links represent the image observation processes, and are associated with two image likelihood functions $p(\mathbf{z}_t^{o}\mid\mathbf{o}_t)$ and $p(\mathbf{z}_t^{s}\mid\mathbf{s}_t)$. According to the Bayesian rule, it is easy to obtain $p(\mathbf{o}_t,\mathbf{s}_t\mid\mathbf{z}_t^{o},\mathbf{z}_t^{s})=\frac{1}{\lambda}\,\psi(\mathbf{o}_t,\mathbf{s}_t)\,p(\mathbf{z}_t^{o}\mid\mathbf{o}_t)\,p(\mathbf{z}_t^{s}\mid\mathbf{s}_t)$, where $\lambda$ is a normalization constant. The above Markov network is a generative model at one time instant.
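For a single frame, the posterior above can be evaluated directly once the compatibility function and the two likelihoods are tabulated. The toy example below uses made-up numbers and tiny label spaces, purely to illustrate the factorization and the normalization constant.

```python
import numpy as np

# Toy label spaces: 3 candidate discovery labels and 2 candidate segmentation labels.
psi = np.array([[0.9, 0.1],     # compatibility psi(o_t, s_t), one row per value of o_t
                [0.2, 0.8],
                [0.5, 0.5]])
lik_o = np.array([0.6, 0.3, 0.1])   # image likelihood p(z_t^o | o_t)
lik_s = np.array([0.7, 0.3])        # image likelihood p(z_t^s | s_t)

# Unnormalized joint posterior, then division by the normalization constant lambda.
joint = psi * lik_o[:, None] * lik_s[None, :]
posterior = joint / joint.sum()      # p(o_t, s_t | z_t^o, z_t^s)
print(posterior)
```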
When putting the above Markov network into temporal context by accommodating dynamic models, two coupled dynamic Markov networks are constructed, where the subscript $t$ represents the time index. In addition, the collective image observations associated with the object discovery labels from the beginning to $t$ are denoted by $\mathbf{z}_{1:t}^{o}$, and reversely from the end to $t$ by $\mathbf{z}_{T:t}^{o}$. The collective image observations associated with the object segmentation labels are built in the same way, i.e., $\mathbf{z}_{1:t}^{s}$ and $\mathbf{z}_{T:t}^{s}$. In this formulation, the problem of joint video object discovery and segmentation from a single noisy video is to perform Bayesian inference on the dynamic Markov networks to obtain the marginal posterior probabilities $p(\mathbf{o}_t\mid\mathbf{z}_{1:T}^{o},\mathbf{z}_{1:T}^{s})$ and $p(\mathbf{s}_t\mid\mathbf{z}_{1:T}^{o},\mathbf{z}_{1:T}^{s})$.
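The forward/backward split of the observations ($\mathbf{z}_{1:t}$ versus $\mathbf{z}_{T:t}$) mirrors classic forward-backward smoothing on a temporal chain. The snippet below shows that generic recursion for a single discrete label chain; it is a simplified stand-in, since the actual model couples two chains through the compatibility function and uses belief propagation, and the transition and likelihood tables here are placeholders.

```python
import numpy as np

def forward_backward(trans, lik):
    """Marginal posteriors p(x_t | z_{1:T}) on a single discrete Markov chain.

    trans: (K, K) transition matrix between label values of consecutive frames.
    lik:   (T, K) per-frame observation likelihoods p(z_t | x_t).
    """
    T, K = lik.shape
    alpha = np.zeros((T, K))            # forward messages, carrying z_{1:t}
    beta = np.ones((T, K))              # backward messages, carrying z_{T:t+1}
    alpha[0] = lik[0] / lik[0].sum()
    for t in range(1, T):
        alpha[t] = lik[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta                 # combine forward and backward beliefs
    return post / post.sum(axis=1, keepdims=True)
```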


In action localization applications, object co-segmentation is also implemented as the Segment-tube spatio-temporal detector[12]. Inspired by recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), Wang et al. presented the Segment-tube spatio-temporal action localization detector, which consists of sequences of per-frame segmentation masks. The Segment-tube detector can temporally pinpoint the starting/ending frame of each action category in the presence of preceding/subsequent interference actions in untrimmed videos. Simultaneously, it produces per-frame segmentation masks instead of bounding boxes, offering superior spatial accuracy compared with tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation.
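The alternation can be summarised in a short pseudocode-style Python sketch. The three callables (initial_segment, localize_action, refine_segment) are hypothetical placeholders for the components named in the text, and the convergence test on the masks is likewise an assumption, not the authors' implementation.

```python
import numpy as np

def segment_tube(frames, initial_segment, localize_action, refine_segment, max_iters=10):
    """Alternate temporal action localization and per-frame spatial segmentation.

    Hypothetical placeholder callables:
      initial_segment(frame)         -> initial binary mask (numpy array)
      localize_action(frames, masks) -> (start, end) frame indices of the action
      refine_segment(frame, mask)    -> refined binary mask
    """
    masks = [initial_segment(f) for f in frames]       # per-frame initialization
    start, end = 0, len(frames) - 1
    for _ in range(max_iters):
        start, end = localize_action(frames, masks)    # temporal action localization
        new_masks = list(masks)
        for t in range(start, end + 1):                # spatial segmentation on relevant frames
            new_masks[t] = refine_segment(frames[t], masks[t])
        if all(np.array_equal(a, b) for a, b in zip(new_masks, masks)):
            break                                      # practical convergence
        masks = new_masks
    return start, end, masks
```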
The proposed Segment-tube detector is illustrated in the flowchart on the right. The sample input is an untrimmed video containing all frames of a pair figure skating video, with only a portion of these frames belonging to a relevant category (e.g., the DeathSpirals). Initialized with saliency-based image segmentation on individual frames, the method first performs a temporal action localization step with a cascaded 3D CNN and LSTM, and pinpoints the starting frame and the ending frame of a target action with a coarse-to-fine strategy. Subsequently, the Segment-tube detector refines the per-frame spatial segmentation with graph cut, focusing on the relevant frames identified by the temporal action localization step. The optimization alternates between temporal action localization and spatial action segmentation in an iterative manner. Upon practical convergence, the final spatio-temporal action localization results are obtained in the format of a sequence of per-frame segmentation masks (bottom row in the flowchart) with precise starting/ending frames.
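The temporal localization step pairs a 3D CNN with an LSTM over clip-level features. Below is a minimal PyTorch-style sketch of such a cascade; the layer widths, clip layout, and two-way (action vs. background) output are illustrative assumptions and do not reproduce the architecture of the cited paper.

```python
import torch
import torch.nn as nn

class TemporalActionLocalizer(nn.Module):
    """Cascaded 3D CNN + LSTM: clip-level features, then per-clip action scores."""

    def __init__(self, num_classes=2, hidden=256):
        super().__init__()
        # Small 3D CNN backbone over short clips (channels, frames, height, width).
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # LSTM aggregates clip features along the temporal axis.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, num_clips, 3, frames_per_clip, H, W)
        b, n = clips.shape[:2]
        feats = self.cnn3d(clips.flatten(0, 1)).flatten(1)   # (b*n, 64)
        feats = feats.view(b, n, -1)
        out, _ = self.lstm(feats)
        return self.classifier(out)   # per-clip scores, e.g. action vs. background
```

Thresholding the per-clip action scores yields candidate starting/ending frames, which the coarse-to-fine strategy described above then refines.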
See also
- Image segmentation
- Object detection
- Object-based image analysis
- Video content analysis
- Image analysis
- Activity recognition
- Computer vision
- Convolutional neural network
- Long short-term memory
References
- ^ a b c d Liu, Ziyi; Wang, Le; Hua, Gang; Zhang, Qilin; Niu, Zhenxing; Wu, Ying; Zheng, Nanning (2018). "Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks" (PDF). IEEE Transactions on Image Processing. 27 (12): 5840–5853. doi:10.1109/tip.2018.2859622. ISSN 1057-7149.
- ^ Vicente, Sara; Rother, Carsten; Kolmogorov, Vladimir (2011). Object cosegmentation. IEEE. doi:10.1109/cvpr.2011.5995530. ISBN 978-1-4577-0394-2.
- ^ Chen, Ding-Jie; Chen, Hwann-Tzong; Chang, Long-Wen (2012). Video object cosegmentation. New York, New York, USA: ACM Press. doi:10.1145/2393347.2396317. ISBN 978-1-4503-1089-5.
- ^ Lee, Yong Jae; Kim, Jaechul; Grauman, Kristen (2011). Key-segments for video object segmentation. IEEE. doi:10.1109/iccv.2011.6126471. ISBN 978-1-4577-1102-2.
- ^ Ma, Tianyang; Latecki, Longin Jan (2012). Maximum weight cliques with mutex constraints for video object segmentation. IEEE. doi:10.1109/CVPR.2012.6247735.
- ^ Zhang, Dong; Javed, Omar; Shah, Mubarak (2013). Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions. IEEE. doi:10.1109/cvpr.2013.87. ISBN 978-0-7695-4989-7.
- ^ Fragkiadaki, Katerina; Arbelaez, Pablo; Felsen, Panna; Malik, Jitendra (2015). Learning to segment moving objects in videos. IEEE. doi:10.1109/cvpr.2015.7299035. ISBN 978-1-4673-6964-0.
- ^ Perazzi, Federico; Wang, Oliver; Gross, Markus; Sorkine-Hornung, Alexander (2015). Fully Connected Object Proposals for Video Segmentation. IEEE. doi:10.1109/iccv.2015.369. ISBN 978-1-4673-8391-2.
- ^ Koh, Yeong Jun; Kim, Chang-Su (2017). Primary Object Segmentation in Videos Based on Region Augmentation and Reduction. IEEE. doi:10.1109/cvpr.2017.784. ISBN 978-1-5386-0457-1.
- ^ Krähenbühl, Philipp; Koltun, Vladlen (2014). "Geodesic Object Proposals". Computer Vision – ECCV 2014. Cham: Springer International Publishing. pp. 725–739. doi:10.1007/978-3-319-10602-1_47. ISBN 978-3-319-10601-4. ISSN 0302-9743.
- ^ Xue, Jianru; Wang, Le; Zheng, Nanning; Hua, Gang (2013). "Automatic salient object extraction with contextual cue and its applications to recognition and alpha matting". Pattern Recognition. 46 (11). Elsevier BV: 2874–2889. doi:10.1016/j.patcog.2013.03.028. ISSN 0031-3203.
- ^ a b c Wang, Le; Duan, Xuhuan; Zhang, Qilin; Niu, Zhenxing; Hua, Gang; Zheng, Nanning (2018-05-22). "Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation" (PDF). Sensors. 18 (5). MDPI AG: 1657. doi:10.3390/s18051657. ISSN 1424-8220.