
Supplementary Material: The Edge of Depth: Explicit Constraints between Segmentation and Depth

Shengjie Zhu, Garrick Brazil, Xiaoming Liu
Michigan State University, East Lansing MI

{zhusheng, brazilga, liuxm}@msu.edu

1. Proof of Local Optimality

We give a brief proof that, under the constructed transformation set $\{\phi(x \mid q, p)\}$, the proposed edge-edge consistency $l_c(\Gamma(T_s \mid T_d), T^*_d)$ achieves local optimality when the segmentation-augmented (or morphed) disparity edge points satisfy $T^*_d = \{p \mid \left\|\frac{\partial I^*_d(p)}{\partial x}\right\| > \frac{t}{1+t} \cdot k_1\}$.

To prove this, let's start by evaluating the gradient of the morphed disparity map $I^*_d$ at a semantic edge pixel $q$:

$$\forall q \in \Gamma(T_s \mid T_d),\quad \left.\frac{\partial I^*_d(x)}{\partial x}\right|_{x=q} = \left.\frac{\partial I_d(\phi(x))}{\partial \phi(x)} * \frac{\partial \phi(x)}{\partial x}\right|_{x=q} = \left.\frac{\partial I_d(y)}{\partial y}\right|_{y=p} * \left.\frac{\partial \phi(x)}{\partial x}\right|_{x=q} , \tag{1}$$

Note that if $x = q$, then $\phi(x \mid q, p) = p$. If $\left.\frac{\partial I^*_d(x)}{\partial x}\right|_{x=q}$ is sufficiently larger than a threshold, the semantic edge pixel $q$ is also an edge pixel in the morphed disparity map, leading to perfect edge-edge consistency for $q$. We now derive the two terms in Eq. 1 in order to find that threshold.

When $x$ is on the line segment $\overrightarrow{qp}$, its projection $x'$ overlaps with itself. We can thus compute $\left.\frac{\partial \phi(x)}{\partial x}\right|_{x=q}$ as:

$$\begin{aligned}
\left.\frac{\partial \phi(x)}{\partial x}\right|_{x=q}
&= \left.\frac{\partial \left(x + \overrightarrow{qp} - \tfrac{1}{1+t}\cdot \overrightarrow{qx'}\right)}{\partial x}\right|_{x=q}
 = \left.\frac{\partial \left(x + \overrightarrow{qp} - \tfrac{1}{1+t}\cdot \overrightarrow{qx}\right)}{\partial x}\right|_{x=q} \\
&= \left.\frac{\partial \left(x + (p - q) - \tfrac{1}{1+t}\cdot (x - q)\right)}{\partial x}\right|_{x=q}
 = \left.\frac{\partial \left(\tfrac{t}{1+t}\cdot x + p - \tfrac{t}{1+t}\cdot q\right)}{\partial x}\right|_{x=q}
 = \frac{t}{1+t}.
\end{aligned}\tag{2}$$
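As a sanity check on Eq. 2 (not part of the original derivation), the constant derivative of the morph can be verified numerically; the scalar 1-D parameterization below, which restricts $x$ to the segment $\overrightarrow{qp}$, is our simplification.

# Minimal 1-D sketch: for x on the segment qp the projection x' equals x, so
# phi(x | q, p) = x + (p - q) - (x - q) / (1 + t) and d(phi)/dx = t / (1 + t).
def phi(x, q, p, t):
    return x + (p - q) - (x - q) / (1.0 + t)

q, p, t, eps = 0.0, 10.0, 0.3, 1e-4
grad = (phi(q + eps, q, p, t) - phi(q, q, p, t)) / eps
print(grad, t / (1.0 + t))   # both are ≈ 0.2308; note also phi(q) == p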

Using $\left.\frac{\partial \phi(x)}{\partial x}\right|_{x=q} = \frac{t}{1+t}$ with Eq. 1, we have:

$$\forall q \in \Gamma(T_s \mid T_d),\quad \left.\frac{\partial I^*_d(x)}{\partial x}\right|_{x=q} = \left.\frac{\partial I_d(y)}{\partial y}\right|_{y=p} * \left.\frac{\partial \phi(x)}{\partial x}\right|_{x=q} = \frac{t}{1+t} * \left.\frac{\partial I_d(y)}{\partial y}\right|_{y=p} > \frac{t}{1+t}\cdot k_1 , \tag{3}$$
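The scaling in Eqs. 1 and 3 can likewise be illustrated on a synthetic 1-D disparity profile; the smooth ramp, the constants, and the helper names below are illustrative assumptions rather than the paper's data.

import numpy as np

# After morphing, the gradient of I_d^* observed at the semantic edge pixel q equals
# t/(1+t) times the original disparity gradient at the matched disparity edge pixel p,
# so it stays above t/(1+t) * k1 whenever the original gradient exceeds k1.
def disparity(y, p, a=2.0):
    return 1.0 / (1.0 + np.exp(-a * (y - p)))       # smooth edge centred at p

def phi(x, q, p, t):
    return x + (p - q) - (x - q) / (1.0 + t)        # morph restricted to segment qp

q, p, t, eps = 0.0, 5.0, 0.3, 1e-4
morphed = lambda x: disparity(phi(x, q, p, t), p)   # I_d^*(x) = I_d(phi(x))
grad_morphed_q = (morphed(q + eps) - morphed(q)) / eps
grad_orig_p = (disparity(p + eps, p) - disparity(p, p)) / eps
print(grad_morphed_q, t / (1.0 + t) * grad_orig_p)  # the two values agree (≈ 0.115)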

where the inequality is derived from Eq. 1 of the main paper, which defines the threshold $k_1$ for detecting edge pixels on the original disparity map. Here, in the morphed disparity map $I^*_d$, every semantic edge pixel $q \in \Gamma(T_s \mid T_d)$ counted in computing the consistency $l_c$ has a gradient magnitude larger than the threshold $\frac{t}{1+t}\cdot k_1$, so $q$ overlaps with its paired (matched) depth/disparity edge pixel $p$ as well, i.e., $T^*_d = \{p \mid \left\|\frac{\partial I^*_d(p)}{\partial x}\right\| > \frac{t}{1+t}\cdot k_1\}$. Thus, in the morphed disparity map $I^*_d$, the semantic borders overlap with the depth borders, making the proposed consistency measurement $l_c$ hit its local minimum of 0:

$$\begin{aligned}
\forall q \in \Gamma(T_s \mid T_d),\; \left.\frac{\partial I^*_d(x)}{\partial x}\right|_{x=q} > \frac{t}{1+t}\cdot k_1
&\iff \forall q \in \Gamma(T_s \mid T_d),\; \delta(q, T^*_d) = \min_{p \in T^*_d} \|p - q\| = \|q - q\| = 0 \\
&\iff l_c(\Gamma(T_s \mid T_d), I^*_d) = 0.
\end{aligned}\tag{4}$$

This shows that, under the defined transformation, we realign the depth edge set $\Omega$ to the segmentation edge set $\Gamma_d^s$, making the edge-edge consistency attain a local optimum.

Note that the threshold $\frac{t}{1+t}\cdot k_1$ is not actually applied to the morphed disparity map for edge detection. Rather, we derive it as the condition that is naturally satisfied in our work when both the morph function and the $k_1$ edge-detection threshold on the disparity map (Eq. 1 of the main paper) are employed.
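For concreteness, a minimal sketch of evaluating the edge-edge consistency is given below; it assumes $l_c$ aggregates the nearest-neighbour distance $\delta(q, T^*_d)$ over the counted semantic edge pixels, with the exact weighting deferred to the main paper.

import numpy as np
from scipy.spatial import cKDTree

# Sketch of the edge-edge consistency l_c used in the proof above, under the
# assumption that it averages delta(q, T_d^*) over semantic edge pixels q.
def edge_edge_consistency(seg_edge_pts, disp_edge_pts):
    # seg_edge_pts:  (N, 2) pixel coordinates of Gamma(T_s | T_d)
    # disp_edge_pts: (M, 2) pixel coordinates of the (morphed) disparity edges T_d^*
    tree = cKDTree(disp_edge_pts)
    delta, _ = tree.query(seg_edge_pts)   # delta(q, T_d^*) = min_p ||p - q||
    return float(delta.mean())            # 0 when every q also lies in T_d^*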


2. Network details

Across our experiments, we use ImageNet [2] pretrained ResNet18 and ResNet50 [6] as our encoder. Our decoder structure is the same as Godard et al. [5] and Watson et al. [10], as detailed in Table 1. We also incorporate other common practices such as color augmentation, random flip, edge-aware smoothness and exclusion of stationary pixels.

Depth Decoder
layer    k  s  c    res  input              activation
upconv5  3  1  256  32   econv5             ELU [1]
iconv5   3  1  256  16   ↑upconv5, econv4   ELU
upconv4  3  1  128  16   iconv5             ELU
iconv4   3  1  128  8    ↑upconv4, econv3   ELU
disp4    3  1  1    1    iconv4             Sigmoid
upconv3  3  1  64   8    iconv4             ELU
iconv3   3  1  64   4    ↑upconv3, econv2   ELU
disp3    3  1  1    1    iconv3             Sigmoid
upconv2  3  1  32   4    iconv3             ELU
iconv2   3  1  32   2    ↑upconv2, econv1   ELU
disp2    3  1  1    1    iconv2             Sigmoid
upconv1  3  1  16   2    iconv2             ELU
iconv1   3  1  16   1    ↑upconv1           ELU
disp1    3  1  1    1    iconv1             Sigmoid

Table 1: The network architecture of our decoder. k, s and c denote the kernel size, stride and number of output channels of the layer, respectively. res refers to the downsampling scale relative to the input image. The ↑ symbol denotes a 2× nearest-neighbour upsampling of the input.
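To make the decoder pattern in Table 1 concrete, below is a minimal PyTorch sketch of one decoder stage (upconv, 2× nearest-neighbour upsampling, skip concatenation, iconv, and an optional Sigmoid disparity head). The class names and the stand-alone stage abstraction are ours for illustration; this is not the released training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvELU(nn.Module):
    # 3x3 convolution, stride 1, followed by ELU, as in every upconv/iconv of Table 1
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return F.elu(self.conv(x))

class DecoderStage(nn.Module):
    # One stage of Table 1: upconv -> 2x nearest upsample -> concat skip -> iconv (-> disp)
    def __init__(self, in_ch, skip_ch, out_ch, predict_disp=False):
        super().__init__()
        self.upconv = ConvELU(in_ch, out_ch)
        self.iconv = ConvELU(out_ch + skip_ch, out_ch)
        self.dispconv = nn.Conv2d(out_ch, 1, 3, padding=1) if predict_disp else None

    def forward(self, x, skip=None):
        x = self.upconv(x)
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # the ↑ in Table 1
        if skip is not None:
            x = torch.cat([x, skip], dim=1)                    # e.g. ↑upconv5, econv4
        x = self.iconv(x)
        disp = torch.sigmoid(self.dispconv(x)) if self.dispconv is not None else None
        return x, disp

# Example: the coarsest stage (upconv5/iconv5) with a ResNet18 encoder, whose
# econv5/econv4 features have 512/256 channels respectively.
stage5 = DecoderStage(in_ch=512, skip_ch=256, out_ch=256)
econv5, econv4 = torch.randn(1, 512, 6, 20), torch.randn(1, 256, 12, 40)
feat5, _ = stage5(econv5, econv4)   # feat5: (1, 256, 12, 40)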

3. More Ablations

In this section, we perform additional ablations to further validate our proposed approach. We verify that (1) our proposed morph strategy achieves local optimality of the edge-edge consistency $l_c$, and (2) the stereo occlusion mask M yields clearer borders. All our ablations are conducted on the Eigen [3] test split of KITTI [4].

Reducing edge-edge consistency via morphing: We plot the edge-edge consistency loss $l_c$ under various edge detection thresholds $k_1$ in Fig. 1. Fig. 1 cross-validates the morph (detailed in Section 3.1 of the main paper) as a technique to reach the local optimum of $l_c$: the measured $l_c$ consistently decreases after applying the morph once and twice. The lower loss in Fig. 1 also shows that our models are more consistent with segmentation than [10]. Additionally, an increased threshold $k_1$ leads to thinner edges and neglects distant objects, which has two effects. First, thinner edges make the edge-edge consistency more challenging, hence higher loss values. Second, focusing on close-range objects best leverages the high-quality segmentation, which leads to a larger improvement margin over the baseline [10].

Figure 1: The edge-edge consistency $l_c$ of Watson19 [10] and of our model at different edge detection thresholds $k_1$. We additionally show the change in $l_c$ after applying the morph strategy once and twice at inference time, on top of using our learned network.

Figure 2: The effect of the proposed stereo occlusion mask M. We plot the trend of the average number of detected edge pixels $\frac{1}{n}\sum_{i=1}^{n}\left(\left\|\frac{\partial I^i_d(x)}{\partial x}\right\| > k_1\right)$ at different edge detection thresholds $k_1$, where $n$ is the total number of tested images.

Stereo Occlusion Mask: In Fig. 5, we observe that bleeding artifacts exist universally in stereo-based systems [8, 9, 10]. In [10], the use of stereo proxy labels partially suppresses them through an additional constraint on low-texture areas; [5] reduces the artifacts via supervision from videos. In comparison, without any additional supervision source, we eliminate them via the proposed stereo occlusion mask M. As an example, the top-right subfigure of Fig. 3 reveals a clearer and thinner border when comparing $l_r$ against $l_r + M$. This motivates us to treat "thinness" as a measurement and use the average number of detected edge pixels $\frac{1}{n}\sum_{i=1}^{n}\left(\left\|\frac{\partial I^i_d(x)}{\partial x}\right\| > k_1\right)$ as an approximate metric of border clearness, as shown in Fig. 2. As expected, after applying the mask M, edges become thinner and clearer, reflected in the decreased number of detected edges.
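As a reference for how the Fig. 2 metric can be computed, here is a minimal sketch; the horizontal finite-difference gradient and the function name are our assumptions beyond the formula above.

import numpy as np

# Average number of detected edge pixels per image, the border-clearness proxy of
# Fig. 2: count pixels whose disparity gradient magnitude exceeds k1 (Eq. 1 of the
# main paper), then average over the n test images.
def average_edge_count(disparity_maps, k1):
    counts = [(np.abs(np.gradient(d, axis=1)) > k1).sum() for d in disparity_maps]
    return float(np.mean(counts))

# Usage sketch: a clearer, thinner border yields a smaller count at the same k1.
# avg = average_edge_count(predicted_disparities, k1=0.1)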

More qualitative comparisons: We show additional qualitative examples when different losses are applied in Fig. 3. We further provide qualitative comparisons against the baseline method [10] in Fig. 4, and against other methods in Fig. 5.


Figure 3: In the left column, explicit utilization of segmentation information helps recover more details. On the right, we show blobbed border artifacts in low-texture areas, caused by noisy predicted segmentation labels and the weak constraint from the photometric loss $l_r$. We suppress the artifacts by incorporating the texture weight $w$ and utilizing proxy stereo labels [7, 10].


Figure 4: More comparisons between our model and the state-of-the-art baseline [10]. Content within the yellow box is zoomed in and attached to the right. We show significantly improved border quality compared to the method of [10].


Figure 5: Comparison against other state-of-the-art methods [5, 8, 9, 10]. Our method reconstructs more object details than previous works and possesses the clearest borders overall.

References

[1] Djork-Arne Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248-255, 2009.
[3] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NIPS), pages 2366-2374, 2014.
[4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354-3361, 2012.
[5] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3828-3838, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[7] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6647-6655, 2017.
[8] Sudeep Pillai, Rares Ambrus, and Adrien Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 9250-9256, 2019.
[9] Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proceedings of the IEEE International Conference on 3D Vision (3DV), pages 324-333, 2018.
[10] Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2162-2171, 2019.
