Digging into Self-Supervised Monocular Depth Estimation
- 同济智能汽车
- 2020-01-04 16:33:36
- 0
- 39
This article is a translation of the arXiv paper
"Digging into Self-Supervised Monocular Depth Estimation"
by Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel Brostow.
Editor's note: Learning-based monocular depth estimation has advanced rapidly in recent years. It can infer plausible depth from a single RGB image, and progress in this area can greatly benefit 3D reconstruction, visual SLAM, AR, and 3D object detection. Early learning-based monocular depth estimation relied mainly on supervised learning, but the difficulty of acquiring ground-truth depth at scale made practical deployment hard. This article presents a self-supervised monocular depth estimation algorithm that achieves strong performance without any ground-truth depth during training, making it well worth studying.
Abstract: Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. This paper proposes a set of improvements which together yield quantitatively and qualitatively better depth maps than competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have helped close the gap with fully supervised methods. This paper shows that a surprisingly simple model, combined with a small set of design choices, leads to superior predictions. In particular, it proposes: (1) a minimum reprojection loss, designed to robustly handle occlusions; (2) a full-resolution multi-scale sampling method that reduces visual artifacts; and (3) an auto-masking loss to ignore training pixels that violate the static-scene assumption. The effectiveness of each component is demonstrated in isolation, and the model achieves high-quality, state-of-the-art results on the KITTI dataset.
1 Introduction
This paper seeks a method that can automatically infer a dense depth map from a single input image. Without a second image to provide triangulation, estimating absolute, or even relative, depth appears to be an ill-posed problem. Yet humans, who learn by navigating and interacting with the real world, can hypothesize plausible depth for entirely new scenes [19].
High-quality depth-from-image is attractive because it could cheaply complement the LIDAR sensors used in autonomous driving, and it enables new single-image applications such as image editing and AR. Solving for depth is also a way to pretrain deep networks on large unlabeled image datasets for downstream recognition tasks [24]. However, collecting a large and diverse image dataset with accurate ground-truth depth for supervised learning [56, 10] is itself a formidable challenge. As an alternative, several recent self-supervised methods have shown that it is possible to train monocular depth estimation models using only synchronized stereo pairs [13, 16] or monocular video [76].
Of these two self-supervised approaches, monocular video is an attractive alternative to stereo pairs, but it brings its own set of challenges. In addition to estimating depth, the model also needs to estimate the camera's ego-motion between temporally adjacent frames. This typically involves training a pose network that takes a finite sequence of frames as input and outputs the corresponding camera motion. By contrast, training with stereo data turns camera-pose estimation into a one-time offline calibration, although occlusion and texture-copy artifacts can still strongly affect the model.
This paper proposes three architectural and loss innovations which, when combined, lead to large improvements in monocular depth estimation, whether the model is trained with monocular video, stereo pairs, or both: (1) a novel appearance-matching loss that addresses the occluded pixels that arise when using monocular supervision; (2) a novel and simple auto-masking approach that ignores pixels with no relative motion to the camera; and (3) a multi-scale appearance-matching loss that performs all image sampling at the input resolution, reducing depth artifacts. Together, these innovations yield state-of-the-art monocular and stereo self-supervised depth estimation on the KITTI dataset [14], and they simplify many components found in existing state-of-the-art models.
2 Related Work
2.1 Supervised Depth Estimation
2.2 Self-Supervised Depth Estimation
2.3 Appearance-Based Losses
3 Method
In this section, we describe our depth prediction network, which takes a single RGB image It as input and produces a depth map Dt. We first review the key ideas behind self-supervised training for monocular depth estimation, and then describe our depth prediction network and the corresponding training losses.
3.1 Self-Supervised Training
Self-supervised training casts depth estimation as a view-synthesis problem: the network learns to reconstruct the target image from the viewpoint of a source image, using the predicted depth to establish correspondences. With a target frame $I_t$ and a source frame $I_{t'}$, the photometric reprojection loss is

$$L_p = \sum_{t'} pe\big(I_t,\, I_{t' \to t}\big), \qquad I_{t' \to t} = I_{t'}\big\langle \mathrm{proj}(D_t,\, T_{t \to t'},\, K) \big\rangle$$

Here pe is a photometric reconstruction error (e.g., the L1 distance in pixel space), $T_{t \to t'}$ is the relative camera pose, proj() returns the 2D coordinates of the depths $D_t$ projected into $I_{t'}$, and ⟨ ⟩ is the sampling operator. For notational simplicity, we assume that the camera intrinsics K are identical for all frames, although they may differ. Following [22], we use bilinear sampling to sample the source images, which is locally sub-differentiable, and following [75, 16] we combine L1 and SSIM [65] to form the photometric error pe:

$$pe(I_a, I_b) = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_a, I_b)\big) + (1 - \alpha)\,\big\lVert I_a - I_b \big\rVert_1$$
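As a concrete illustration, below is a minimal PyTorch sketch of pe and the bilinear sampling operator (the paper's implementation is in PyTorch [47]). The α = 0.85 weighting and the 3×3 SSIM window follow common practice from [16]; all function names here are my own, not the paper's:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # Single-scale SSIM over 3x3 neighborhoods, returned as a
    # dissimilarity in [0, 1], i.e. (1 - SSIM) / 2.
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    # pe = alpha/2 * (1 - SSIM) + (1 - alpha) * L1; the 1/2 factor is
    # already folded into ssim() above. Returns a [B, 1, H, W] error map.
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * ssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def bilinear_sample(source, pix_coords):
    # The <> sampling operator: bilinearly sample the source image at the
    # 2D coordinates produced by proj(), normalized to [-1, 1].
    # pix_coords: [B, H, W, 2] as expected by grid_sample.
    return F.grid_sample(source, pix_coords, mode="bilinear",
                         padding_mode="border", align_corners=False)
```

In practice, pix_coords would come from proj(Dt, T, K) after normalizing pixel coordinates to the [-1, 1] range that grid_sample expects.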
3.2 Improved Self-Supervised Depth Estimation
Compared with supervised models, existing monocular methods produce depth maps of lower quality. To close this gap, we propose several improvements that significantly increase the quality of the predicted depth maps without adding any extra components that require training (see Figure 3).
Figure 3: Overview. (a) Depth network: our network uses a standard, fully convolutional U-Net to predict depth. (b) Pose network: the pose between a pair of frames is predicted by a separate pose network. (c) Per-pixel minimum reprojection: when correspondences are good, the reprojection error should be low. However, occlusions mean that a pixel in the current frame may not appear in both the previous and next frames. The baseline's average loss forces the network to match occluded pixels, whereas our minimum reprojection error matches each pixel only to the frame in which it is visible, leading to sharper results. (d) Full-resolution multi-scale: we upsample depth predictions from intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
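To make the interface of Figure 3(b) concrete, here is a hypothetical stand-in for the pose network: a channel-concatenated pair of frames in, a 6-DoF relative pose out. The paper's actual network is larger and ResNet-based; this tiny conv stack is only illustrative:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Illustrative pose network: two RGB frames, concatenated along the
    channel axis, regressed to 3 axis-angle rotation values + 3 translations."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 6, 1)

    def forward(self, frame_a, frame_b):
        feats = self.encoder(torch.cat([frame_a, frame_b], dim=1))
        pose = self.head(feats).mean(dim=[2, 3])  # global average -> [B, 6]
        return 0.01 * pose  # keep initial poses small, a common stabilization
```

The six outputs can then be converted into the 4x4 transform T used by proj() in Section 3.1.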
Per-pixel minimum reprojection loss
Figure 4: Effect of the per-pixel minimum reprojection loss during mixed monocular-plus-stereo (MS) training. The circled pixels are occluded in one of the source images, so no loss is applied between that image and the target frame. Instead, those pixels are matched only to the source image in which they are visible. The top-right image shows which source image provides each pixel's final match used in Equation 4.
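In code, the change Figure 4 illustrates is small: instead of averaging the per-source photometric errors, take a per-pixel minimum. A minimal sketch, with the tensor layout being my assumption:

```python
import torch

def min_reprojection(per_source_pe):
    # per_source_pe: [B, num_sources, H, W], one photometric error map per
    # warped source frame.
    # Averaging (the baseline) forces occluded pixels to match every source;
    # the per-pixel minimum lets each pixel use only the source frame in
    # which it is actually visible (Eq. 4).
    min_pe, source_idx = per_source_pe.min(dim=1)
    return min_pe, source_idx  # source_idx is what Fig. 4 (top right) shows
```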
Auto-masking stationary pixels
Figure 2: Moving objects. Monocular methods typically produce incorrect predictions for objects observed to be moving during training, such as moving cars; this holds even for methods that explicitly model motion [71, 39, 52]. Our method handles this problem well, while other methods and the baseline fail.
$$\mu = \Big[\, \min_{t'} pe\big(I_t,\, I_{t' \to t}\big) < \min_{t'} pe\big(I_t,\, I_{t'}\big) \,\Big] \qquad (5)$$
where [·] is the Iverson bracket. μ prevents pixels that remain stationary in the image from contaminating the loss when the camera and another object are moving at similar speeds. Likewise, when the camera is completely static, the mask can filter out every pixel in the image (Figure 5). We show experimentally that this simple and inexpensive modification brings significant improvements.
Figure 5: Auto-masking. The automatically computed mask after one epoch of training; black pixels are removed from the loss (i.e., μ = 0). The mask prevents objects moving at a speed similar to the camera's (top) and entire frames where the camera is static (bottom) from contaminating the loss. It is computed from the input images and the network predictions using Equation 5.
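Equation 5 amounts to a single comparison between two minima; a minimal sketch, again assuming [B, num_sources, H, W] error maps:

```python
import torch

def auto_mask(warped_pe, identity_pe):
    # warped_pe:   errors between I_t and the warped sources
    # identity_pe: errors between I_t and the *unwarped* sources
    # Eq. 5: mu = [ min pe(warped) < min pe(identity) ]
    mu = (warped_pe.min(dim=1).values < identity_pe.min(dim=1).values).float()
    return mu  # 0 where the scene already matches without warping, e.g.
               # objects moving with the camera, or frames with no camera motion
```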
Multi-scale estimation
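The full-resolution variant described in Figure 3(d) upsamples each intermediate disparity before computing the loss. A minimal sketch, where warp_and_error is my placeholder for the projection, sampling, and pe steps of Section 3.1:

```python
import torch.nn.functional as F

def full_resolution_loss(target, multi_scale_disps, warp_and_error):
    # multi_scale_disps: disparity maps from the decoder's intermediate
    # layers (e.g., four scales of a U-Net). Each one is upsampled to the
    # input resolution so that every loss is computed on full-resolution
    # images rather than downsampled ones, reducing texture-copy artifacts.
    h, w = target.shape[-2:]
    losses = [
        warp_and_error(
            F.interpolate(d, size=(h, w), mode="bilinear", align_corners=False),
            target,
        )
        for d in multi_scale_disps
    ]
    return sum(losses) / len(losses)
```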
Final training loss
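For reference, in the paper's notation the final objective combines the auto-masked, per-pixel minimum reprojection term with an edge-aware smoothness term, averaged over pixels, scales, and the batch. A sketch, where λ weights the smoothness term and $d_t^*$ is the mean-normalized inverse depth:

$$L = \mu\, L_p + \lambda\, L_s, \qquad L_s = \big\lvert \partial_x d_t^* \big\rvert\, e^{-\lvert \partial_x I_t \rvert} + \big\lvert \partial_y d_t^* \big\rvert\, e^{-\lvert \partial_y I_t \rvert}$$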
3.3 Additional Improvements
Table 2: Ablation study. Results for different variants of our model (Monodepth2) on KITTI 2015 [14] using the Eigen split. (a) The baseline model, without any of our contributions, performs poorly; adding the minimum reprojection loss, auto-masking, and full-resolution multi-scale components substantially improves performance. (b) Even without ImageNet pretraining, our contributions still deliver a large improvement; see also Table 1. (c) Training on the full Eigen dataset (rather than the subset introduced by [76]) also yields a clear gain over the baseline.
4 Experiments
4.1 KITTI Eigen Split
Figure 8: Failure cases. Top: our model fails to predict good depth in distorted, reflective, or color-saturated regions. Bottom: it fails to cleanly delineate objects with ambiguous boundaries (left) or intricate shapes (right).
Table 1: Quantitative results. Comparison of our method to existing methods on KITTI 2015 [14] using the Eigen split. The best result in each metric is shown in bold, and the second best is underlined. All results are reported without post-processing [16]. Although our method was designed for monocular training, it still achieves strong accuracy when trained with stereo data. We also show that a higher 1024×320 resolution yields even better results, consistent with [48]; these high-resolution results are bolded when they outperform all other methods, including our own lower-resolution variants.
4.1.1 KITTI Ablation Study
Effect of auto-masking
4.2 Evaluation on Additional Datasets
KITTI odometry dataset
Although our focus is better depth estimation, our pose estimation network still performs competitively with existing methods, which typically feed more frames into their pose networks, a design that can improve their generalization.
KITTI depth prediction benchmark
Make3D
Figure 6: Qualitative results on Make3D. All methods were trained on KITTI with monocular supervision.
Table 3: Make3D results. All monocular (M) results use median scaling, while the monocular-plus-stereo (MS) results use the raw, unscaled network predictions.
5 Conclusion
We have presented a refined self-supervised monocular depth estimation model that achieves state-of-the-art depth predictions. We introduced three contributions: (i) a per-pixel minimum reprojection loss to handle occlusions between frames in monocular video, (ii) an auto-masking loss to ignore pixels that are stationary relative to the camera, and (iii) a full-resolution multi-scale sampling method. We showed that, combined, they form a simple yet effective depth estimation model that can be trained with monocular data, stereo data, or both.
6 References
[1] F. Aleotti, F. Tosi, M. Poggi, and S. Mattoccia. Generative adversarial networks for unsupervised monocular depth prediction. In ECCV Workshops, 2018.
[2] A. Atapour-Abarghouei and T. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, 2018.
[3] V. M. Babu, K. Das, A. Majumdar, and S. Kumar. UnDEMoN: Unsupervised deep network for depth and ego-motion estimation. In IROS, 2018.
[4] A. Byravan and D. Fox. SE3-Nets: Learning rigid body motion using deep neural networks. In ICRA, 2017.
[5] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. 2019.
[6] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NeurIPS, 2016.
[7] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289, 2015.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
[11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
[12] Y. Furukawa and C. Hernandez. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2015.
[13] R. Garg, V. Kumar BG, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
[14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[16] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[17] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang. Learning monocular depth by distilling cross-domain stereo networks. In ECCV, 2018.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] C. B. Hochberg and J. E. Hochberg. Familiar size and the perception of depth. The Journal of Psychology, 1952.
[20] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. TOG, 2005.
[21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet2: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[22] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
[23] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, 2018.
[24] H. Jiang, E. Learned-Miller, G. Larsson, M. Maire, and G. Shakhnarovich. Self-supervised relative depth learning for urban scene understanding. In ECCV, 2018.
[25] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
[26] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[28] KITTI single depth evaluation server. http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction, 2017.
[29] M. Klodt and A. Vedaldi. Supervising the new with the old: learning SFM from SFM. In ECCV, 2018.
[30] S. Kong and C. Fowlkes. Pixel-wise attentional gating for parsimonious pixel labeling. arXiv:1805.01556, 2018.
[31] Y. Kuznietsov, J. Stuckler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In CVPR, 2017.
[32] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
[33] B. Li, Y. Dai, and M. He. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition, 2018.
[34] R. Li, S. Wang, Z. Long, and D. Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv:1709.06841, 2017.
[35] R. Li, K. Xian, C. Shen, Z. Cao, H. Lu, and L. Hang. Deep attention-based classification network for robust depth prediction. arXiv:1807.03959, 2018.
[36] Z. Li and N. Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
[37] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2015.
[38] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In CVPR, 2014.
[39] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille. Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding. arXiv:1810.06125, 2018.
[40] Y. Luo, J. Ren, M. Lin, J. Pang, W. Sun, H. Li, and L. Lin. Single view stereo matching. In CVPR, 2018.
[41] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, 2018.
[42] N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, and T. Brox. What makes good synthetic training data for learning disparity and optical flow estimation? IJCV, 2018.
[43] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[44] I. Mehta, P. Sakurikar, and P. Narayanan. Structured adversarial training for unsupervised monocular depth estimation. In 3DV, 2018.
[45] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. Transactions on Robotics, 2015.
[46] J. Nath Kundu, P. Krishna Uppala, A. Pahuja, and R. V. Babu. AdaDepth: Unsupervised content congruent adaptation for depth estimation. In CVPR, 2018.
[47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NeurIPS-W, 2017.
[48] S. Pillai, R. Ambrus, and A. Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In ICRA, 2019.
[49] A. Pilzer, D. Xu, M. M. Puscas, E. Ricci, and N. Sebe. Unsupervised adversarial depth estimation using cycled generative networks. In 3DV, 2018.
[50] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia. Towards real-time unsupervised monocular depth estimation on CPU. In IROS, 2018.
[51] M. Poggi, F. Tosi, and S. Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In 3DV, 2018.
[52] A. Ranjan, V. Jampani, K. Kim, D. Sun, J. Wulff, and M. J. Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. arXiv:1805.09806, 2018.
[53] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha. Unsupervised deep learning for optical flow estimation. In AAAI, 2017.
[54] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[56] A. Saxena, M. Sun, and A. Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 2009.
[57] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
[58] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[59] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[60] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In 3DV, 2017.
[61] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
[62] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv:1704.07804, 2017.
[63] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018.
[64] Y. Wang, Y. Yang, Z. Yang, L. Zhao, and W. Xu. Occlusion aware unsupervised learning of optical flow. In CVPR, 2018.
[65] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. TIP, 2004.
[66] Y. Wu, S. Ying, and L. Zheng. Size-to-depth: A new perspective for single image depth estimation. arXiv:1801.04461, 2018.
[67] J. Xie, R. Girshick, and A. Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.
[68] N. Yang, R. Wang, J. Stuckler, and D. Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, 2018.
[69] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia. LEGO: Learning edge with geometry all at once by watching videos. In CVPR, 2018.
[70] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. In AAAI, 2018.
[71] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
[72] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.
[73] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
[74] Z. Zhang, C. Xu, J. Yang, Y. Tai, and L. Chen. Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition, 2018.
[75] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for image restoration with neural networks. Transactions on Computational Imaging, 2017.
[76] T. Zhou, M. Brown, N. Snavely, and D. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
[77] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, 2016.
[78] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In ICCV, 2015.
[79] Y. Zou, Z. Luo, and J.-B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, 2018.