Stereo Depth Computation Based on Convolutional Neural Networks


Depth map computation using a convolutional neural network, achieving very good results.
Layers L4 through L7 are fully connected, with 300 neurons each. The final layer, L8, projects the output to two real numbers that are fed through a softmax function, producing a distribution over the two classes (good match and bad match). The weights in L1, L2, and L3 of the networks for the left and right image patch are tied. Rectified linear units follow each layer, except L8. We did not use pooling in our architecture. The network contains almost 600 thousand parameters. The architecture is appropriate for gray images, but can easily be extended to handle RGB images by learning $5 \times 5 \times 3$, instead of $5 \times 5 \times 1$, filters in L1. The best hyperparameters of the network (such as the number of layers, the number of neurons in each layer, and the size of the input patches) will differ from one dataset to another; we chose this architecture because it performed well on the KITTI stereo dataset.

Figure 2. The architecture of our convolutional neural network. (The left and right image patches pass through tied layers L1-L3, producing 200-dimensional descriptors that are concatenated into a 400-dimensional vector and passed through L4-L7, with 300 neurons each, to the two-way output layer L8.)

3.3. Matching cost

The matching cost $C_{\text{CNN}}(p, d)$ is computed directly from the output of the network:

$$C_{\text{CNN}}(p, d) = f_{\text{neg}}(\langle \mathcal{P}_9^L(p), \mathcal{P}_9^R(p - d) \rangle), \qquad (7)$$

where $\mathcal{P}_9^L(p)$ denotes the $9 \times 9$ patch from the left image centered at $p$, and $f_{\text{neg}}(\langle \mathcal{P}^L, \mathcal{P}^R \rangle)$ is the output of the network for the negative class when run on input patches $\mathcal{P}^L$ and $\mathcal{P}^R$.

Naively, we would have to perform the forward pass for each image location $p$ and each disparity $d$ under consideration. The following three implementation details kept the runtime manageable (a sketch follows the list):

1. The output of layers L1, L2, and L3 need only be computed once per location $p$ and need not be recomputed for every disparity $d$.

2. The output of L3 can be computed for all locations in a single forward pass by feeding the network full-resolution images instead of $9 \times 9$ image patches. To achieve this, we apply layers L2 and L3 convolutionally: layer L2 with filters of size $5 \times 5 \times 32$ and layer L3 with filters of size $1 \times 1 \times 200$, both outputting 200 feature maps.

3. Similarly, L4 through L8 can be replaced with convolutional filters of size $1 \times 1$ in order to compute the output at all locations in a single forward pass. Unfortunately, we still have to perform the forward pass for each disparity under consideration.
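To make the fully-convolutional trick concrete, here is a minimal PyTorch sketch of the architecture and the matching-cost computation. It is an illustration, not the authors' Torch/CUDA implementation: the layer sizes follow Figure 2, but the class ordering of the softmax output, the wrap-around handling of the disparity shift, and all names are our own assumptions.

```python
import torch
import torch.nn as nn

class FeatureTower(nn.Module):
    """Layers L1-L3; the weights are shared (tied) between the two images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(),     # L1: 5x5x1 filters, 32 maps
            nn.Conv2d(32, 200, kernel_size=5), nn.ReLU(),   # L2: 5x5x32 filters, 200 maps
            nn.Conv2d(200, 200, kernel_size=1), nn.ReLU(),  # L3: 1x1x200 filters, 200 maps
        )

    def forward(self, x):
        return self.net(x)

class MatchingHead(nn.Module):
    """Layers L4-L8 written as 1x1 convolutions (implementation detail 3)."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 400                   # two concatenated 200-d descriptors
        for _ in range(4):                        # L4-L7: 300 units each
            layers += [nn.Conv2d(in_ch, 300, kernel_size=1), nn.ReLU()]
            in_ch = 300
        layers.append(nn.Conv2d(300, 2, kernel_size=1))     # L8: two classes
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.softmax(self.net(x), dim=1)

def matching_cost(left, right, max_disp, tower, head):
    """Builds C_CNN(p, d) from full-resolution (N, 1, H, W) grayscale images."""
    feat_l, feat_r = tower(left), tower(right)    # computed once (details 1 and 2)
    costs = []
    for d in range(max_disp):                     # one head pass per disparity (detail 3)
        shifted = torch.roll(feat_r, shifts=d, dims=3)   # align p - d with p; the
        out = head(torch.cat([feat_l, shifted], dim=1))  # wrapped-in columns are invalid
        costs.append(out[:, 0])                   # assumption: channel 0 = negative class
    return torch.stack(costs, dim=1)              # shape (N, max_disp, H - 8, W - 8)
```

On a 9 x 9 patch the tower reduces the spatial extent to 1 x 1, so running it on full images yields one 200-dimensional descriptor per pixel, which is exactly what makes the single forward pass of detail 2 possible.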
4. Stereo method

In order to meaningfully evaluate the matching cost, we need to pair it with a stereo method. The stereo method we used was influenced by Mei et al. [11].

4.1. Cross-based cost aggregation

Information from neighboring pixels can be combined by averaging the matching cost over a fixed window. This approach fails near depth discontinuities, where the assumption of constant depth within a window is violated. We might prefer a method that adaptively selects the neighborhood for each pixel, so that support is collected only from pixels with similar disparities. In cross-based cost aggregation [21] we build a local neighborhood around each location comprising pixels with similar image intensity values.

Cross-based cost aggregation begins by constructing an upright cross at each position. The left arm $p_l$ at position $p$ extends left as long as the following two conditions hold:

- $|I(p) - I(p_l)| < \tau$: the absolute difference in image intensities at positions $p$ and $p_l$ is smaller than $\tau$.
- $\|p - p_l\| < \eta$: the horizontal distance (or vertical distance, in the case of the top and bottom arms) between $p$ and $p_l$ is less than $\eta$.

The right, bottom, and top arms are constructed analogously. Once the four arms are known, we can define the support region $U(p)$ as the union of the horizontal arms of all positions $q$ lying on $p$'s vertical arm (see Figure 3).

Figure 3. The support region for position $p$ is the union of horizontal arms of all positions $q$ on $p$'s vertical arm.

Zhang et al. [21] suggest that aggregation should consider the support regions of both images in a stereo pair. Let $U^L$ and $U^R$ denote the support regions in the left and right image. We define the combined support region $U_d$ as

$$U_d(p) = \{ q \mid q \in U^L(p),\ q - d \in U^R(p - d) \}. \qquad (8)$$

The matching cost is averaged over the combined support region:

$$C^0_{\text{CBCA}}(p, d) = C_{\text{CNN}}(p, d),$$
$$C^i_{\text{CBCA}}(p, d) = \frac{1}{|U_d(p)|} \sum_{q \in U_d(p)} C^{i-1}_{\text{CBCA}}(q, d), \qquad (9)$$

where $i$ is the iteration number. We repeat the averaging four times; the output of cross-based cost aggregation is $C^4_{\text{CBCA}}$.

4.2. Semiglobal matching

We refine the matching cost by enforcing smoothness constraints on the disparity image. Following Hirschmuller [4], we define an energy function $E(D)$ that depends on the disparity image $D$:

$$E(D) = \sum_p \Big( C_{\text{CBCA}}(p, D(p)) + \sum_{q \in \mathcal{N}_p} P_1 \cdot 1\{|D(p) - D(q)| = 1\} + \sum_{q \in \mathcal{N}_p} P_2 \cdot 1\{|D(p) - D(q)| > 1\} \Big), \qquad (10)$$

where $1\{\cdot\}$ denotes the indicator function. The first term penalizes disparities $D(p)$ with high matching costs. The second term adds a penalty $P_1$ when the disparities of neighboring pixels differ by one. The third term adds a larger penalty $P_2$ when the neighboring disparities differ by more than one. Rather than minimizing $E(D)$ in 2D, we perform the minimization in a single direction with dynamic programming. This solution introduces unwanted streaking effects, since there is no incentive to make the disparity image smooth in the directions we are not optimizing over. In semiglobal matching we minimize the energy $E(D)$ in many directions and average to obtain the final result. Although Hirschmuller [4] suggests choosing sixteen directions, we only optimized along the two horizontal and the two vertical directions; adding the diagonal directions did not improve the accuracy of our system.

To minimize $E(D)$ in direction $r$, we define a matching cost $C_r(p, d)$ with the following recurrence relation:

$$C_r(p, d) = C_{\text{CBCA}}(p, d) - \min_k C_r(p - r, k) + \min\Big\{ C_r(p - r, d),\ C_r(p - r, d - 1) + P_1,\ C_r(p - r, d + 1) + P_1,\ \min_k C_r(p - r, k) + P_2 \Big\}. \qquad (11)$$

The second term is included to prevent the values of $C_r(p, d)$ from growing too large and does not affect the optimal disparity map. The parameters $P_1$ and $P_2$ are set according to the image gradient, so that jumps in disparity coincide with edges in the image. Let $D_1 = |I^L(p) - I^L(p - r)|$ and $D_2 = |I^R(p - d) - I^R(p - d - r)|$. We set $P_1$ and $P_2$ according to the following rules:

$P_1 = \Pi_1,\ P_2 = \Pi_2$ if $D_1 < \tau_{SO},\ D_2 < \tau_{SO}$;
$P_1 = \Pi_1/4,\ P_2 = \Pi_2/4$ if $D_1 \ge \tau_{SO},\ D_2 < \tau_{SO}$;
$P_1 = \Pi_1/4,\ P_2 = \Pi_2/4$ if $D_1 < \tau_{SO},\ D_2 \ge \tau_{SO}$;
$P_1 = \Pi_1/10,\ P_2 = \Pi_2/10$ if $D_1 \ge \tau_{SO},\ D_2 \ge \tau_{SO}$;

where $\Pi_1$, $\Pi_2$, and $\tau_{SO}$ are hyperparameters. The value of $P_1$ is halved when minimizing in the vertical directions. The final cost $C_{\text{SGM}}(p, d)$ is computed by taking the average across all four directions:

$$C_{\text{SGM}}(p, d) = \frac{1}{4} \sum_r C_r(p, d). \qquad (12)$$

After semiglobal matching we repeat cross-based cost aggregation, as described in the previous section. A minimal sketch of the directional scan in Eq. (11) is given below.
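The recurrence in Eq. (11) is a dynamic-programming sweep along one direction. The following NumPy sketch illustrates a single left-to-right pass over a precomputed cost volume; for simplicity it holds P1 and P2 constant instead of adapting them to the image gradients, and all names are ours, not the authors'.

```python
import numpy as np

def sgm_scan(cost, P1, P2):
    """One pass of the recurrence in Eq. (11), scanning left to right.

    cost: float array of shape (H, W, D) holding C_CBCA(p, d).
    P1, P2: smoothness penalties, held constant here; the paper lowers them
    across image edges using the intensity differences D1 and D2.
    """
    H, W, D = cost.shape
    C = np.empty_like(cost)
    C[:, 0, :] = cost[:, 0, :]                      # the first column has no predecessor
    for x in range(1, W):
        prev = C[:, x - 1, :]                       # C_r(p - r, .) for a whole column
        prev_min = prev.min(axis=1, keepdims=True)  # min_k C_r(p - r, k)
        d_minus = np.full_like(prev, np.inf)
        d_minus[:, 1:] = prev[:, :-1]               # C_r(p - r, d - 1)
        d_plus = np.full_like(prev, np.inf)
        d_plus[:, :-1] = prev[:, 1:]                # C_r(p - r, d + 1)
        smooth = np.minimum(prev, d_minus + P1)
        smooth = np.minimum(smooth, d_plus + P1)
        smooth = np.minimum(smooth, prev_min + P2)  # broadcasts over disparities
        # Subtracting prev_min keeps C_r(p, d) from growing without bound.
        C[:, x, :] = cost[:, x, :] + smooth - prev_min
    return C
```

The other three directions can be obtained by flipping or transposing the cost volume before and after the scan; averaging the four results gives $C_{\text{SGM}}$ as in Eq. (12).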
4.3. Computing the disparity image

The disparity image $D$ is computed by the winner-take-all strategy, i.e. by finding the disparity $d$ that minimizes $C(p, d)$:

$$D(p) = \operatorname{argmin}_d\, C(p, d). \qquad (13)$$

4.3.1 Interpolation

Let $D^L$ denote the disparity map obtained by treating the left image as the reference image (this was the case so far, i.e. $D^L(p) = D(p)$) and let $D^R$ denote the disparity map obtained by treating the right image as the reference image. Both $D^L$ and $D^R$ contain errors in occluded regions. We attempt to detect these errors by performing a left-right consistency check. We label each position $p$ as either

- correct, if $|d - D^R(p - d)| \le 1$ for $d = D^L(p)$;
- mismatch, if $|d - D^R(p - d)| \le 1$ for any other $d$;
- occlusion, otherwise.

For positions marked as occlusion, we want the new disparity value to come from the background. We interpolate by moving left until we find a position labeled correct and use its value. For positions marked as mismatch, we find the nearest correct pixels in 16 different directions and use the median of their disparities for interpolation. We refer to the interpolated disparity map as $D_{\text{INT}}$.

4.3.2 Subpixel enhancement

Subpixel enhancement provides an easy way to increase the resolution of a stereo algorithm. We fit a quadratic curve through the neighboring costs to obtain a new disparity image:

$$D_{\text{SE}}(p) = d - \frac{C_+ - C_-}{2(C_+ - 2C + C_-)}, \qquad (14)$$

where $d = D_{\text{INT}}(p)$, $C_- = C_{\text{SGM}}(p, d - 1)$, $C = C_{\text{SGM}}(p, d)$, and $C_+ = C_{\text{SGM}}(p, d + 1)$.

4.3.3 Refinement

The size of the disparity image $D_{\text{SE}}$ is smaller than the size of the original image, due to the bordering effects of convolution. The disparity image is enlarged to match the size of the input by copying the disparities of the border pixels. We proceed by applying a $5 \times 5$ median filter and the following bilateral filter:

$$D_{\text{BF}}(p) = \frac{1}{W(p)} \sum_{q \in \mathcal{N}_p} D_{\text{SE}}(q) \cdot g(\|p - q\|) \cdot 1\{|I(p) - I(q)| < \tau_{BF}\}, \qquad (15)$$

where $g(x)$ is the probability density function of a zero-mean normal distribution with standard deviation $\sigma$ and $W(p)$ is the normalizing constant:

$$W(p) = \sum_{q \in \mathcal{N}_p} g(\|p - q\|) \cdot 1\{|I(p) - I(q)| < \tau_{BF}\}. \qquad (16)$$

$\tau_{BF}$ and $\sigma$ are hyperparameters. $D_{\text{BF}}$ is the final output of our stereo method. The main steps of this post-processing, up to the subpixel fit, are sketched below.
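As a concrete reference, here is a hedged NumPy sketch of the winner-take-all decision (Eq. 13), the left-right consistency labels of Section 4.3.1, and the quadratic subpixel fit (Eq. 14). The occlusion/mismatch interpolation, median filter, and bilateral filter are omitted, and the function names and the zero-denominator guard are our own additions.

```python
import numpy as np

def winner_take_all(cost):
    """Eq. (13): D(p) = argmin_d C(p, d) over an (H, W, D) cost volume."""
    return np.argmin(cost, axis=2)

def lr_consistency_labels(d_left, d_right):
    """Section 4.3.1: label each pixel 0 = correct, 1 = mismatch, 2 = occlusion."""
    H, W = d_left.shape
    labels = np.full((H, W), 2, dtype=np.int32)    # occlusion unless proven otherwise
    for y in range(H):
        for x in range(W):
            d = d_left[y, x]
            if x - d >= 0 and abs(d - d_right[y, x - d]) <= 1:
                labels[y, x] = 0                   # correct
                continue
            for other in range(x + 1):             # any other disparity d'
                if other != d and abs(other - d_right[y, x - other]) <= 1:
                    labels[y, x] = 1               # mismatch
                    break
    return labels

def subpixel(cost, d_int):
    """Eq. (14): refine integer disparities with a parabola through C-, C, C+."""
    H, W, D = cost.shape
    d = np.clip(d_int, 1, D - 2)                   # keep d - 1 and d + 1 in range
    ys, xs = np.mgrid[0:H, 0:W]
    c_m, c_0, c_p = cost[ys, xs, d - 1], cost[ys, xs, d], cost[ys, xs, d + 1]
    denom = 2.0 * (c_p - 2.0 * c_0 + c_m)
    denom = np.where(denom == 0, 1.0, denom)       # guard against flat costs
    return d - (c_p - c_m) / denom
```

The nested loops in the consistency check are written for clarity; a real implementation would vectorize them or, as in the paper, run them on the GPU.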
5. Experimental results

We evaluate our method on the KITTI stereo dataset because of its large training set size, which is required to learn the weights of the convolutional neural network.

5.1. KITTI stereo dataset

The KITTI stereo dataset [2] is a collection of gray image pairs taken from two video cameras mounted on the roof of a car, roughly 54 centimeters apart. The images were recorded while driving in and around the city of Karlsruhe, in sunny and cloudy weather, at daytime. The dataset comprises 194 training and 195 test image pairs at resolution 1240 x 376. Each image pair is rectified, i.e. transformed in such a way that an object appears on the same vertical position in both images. A rotating laser scanner, mounted behind the left camera, provides ground-truth depth. The true disparities for the test set are withheld, and an online leaderboard (http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo) is provided where researchers can evaluate their method on the test set; submissions are allowed only once every three days. The goal of the KITTI stereo dataset is to predict the disparity for each pixel of the left image. Error is measured by the percentage of pixels where the true disparity and the predicted disparity differ by more than three pixels. Translated into depth, this means that, for example, the error tolerance is ±3 centimeters for objects 2 meters from the camera and ±80 centimeters for objects 10 meters from the camera.

5.2. Details of learning

We train the network using stochastic gradient descent to minimize the cross-entropy loss. The batch size was set to 128. We trained for 16 epochs, with the learning rate initially set to 0.01 and decreased by a factor of 10 on the 12th and 15th iteration. We shuffle the training examples prior to learning. From the 194 training image pairs we extracted 45 million examples, half belonging to the positive class and half to the negative class. We preprocessed each image by subtracting the mean and dividing by the standard deviation of its pixel intensity values. The stereo method is implemented in CUDA, while the network training is done with the Torch environment [1]. The hyperparameters of the stereo method were $\eta = 4$, $\sigma = 5.656$, $\tau = 0.0442$, $\Pi_1 = 1$, $\Pi_2 = 32$, $\tau_{SO} = 0.0625$, and $\tau_{BF} = 5$. A minimal version of the training loop is sketched below.
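This sketch uses PyTorch rather than the authors' Torch pipeline; the dataset object, the patch-pair format, and the model interface are placeholders for the 45 million extracted examples described above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=16):
    """Sketch of the schedule in Section 5.2.

    `dataset` is assumed to yield (left_patch, right_patch, label) tensors and
    `model` to map a pair of 9x9 patches to two unnormalized class scores.
    """
    loader = DataLoader(dataset, batch_size=128, shuffle=True)  # batch 128, shuffled
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)    # initial rate 0.01
    scheduler = torch.optim.lr_scheduler.MultiStepLR(           # divide by 10 on the
        optimizer, milestones=[12, 15], gamma=0.1)              # 12th and 15th epoch
    loss_fn = nn.CrossEntropyLoss()                             # cross-entropy loss
    for _ in range(epochs):                                     # 16 epochs
        for left, right, label in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(left, right), label)
            loss.backward()
            optimizer.step()
        scheduler.step()
```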
5.3. Results

Our method achieves an error rate of 2.61% on the KITTI stereo test set and is currently ranked first on the online leaderboard. Table 1 compares the error rates of the best performing stereo algorithms on this dataset.

Rank  Method      Authors                  Error
1     MC-CNN      This paper               2.61%
2     SPS-StFl    Yamaguchi et al. [20]    2.83%
3     VC-SF       Vogel et al. [16]        3.05%
4     CoR         Anonymous submission     3.30%
5     SPS-St      Yamaguchi et al. [20]    3.39%
6     PCBP-SS     Yamaguchi et al. [18]    3.40%
7     DDS-SS      Anonymous submission     3.83%
8     StereoSLIC  Yamaguchi et al. [19]    3.92%
9     PR-Sf+E     Vogel et al. [17]        4.02%
10    PCBP        Yamaguchi et al. [18]    4.04%

Table 1. The KITTI stereo leaderboard as it stands in November 2014.

A selected set of examples, together with predictions from our method, is shown in Figure 5.

Figure 5. The left column displays the left input image, while the right column displays the output of our stereo method. Examples are sorted by difficulty, with easy examples appearing at the top. Some of the difficulties include reflective surfaces, occlusions, as well as regions with many jumps in disparity, e.g. fences and shrubbery. The examples towards the bottom were selected to highlight the flaws in our method and to demonstrate the inherent difficulties of stereo matching on real-world images.

5.4. Runtime

We measure the runtime of our implementation on a computer with an NVIDIA GeForce GTX Titan GPU. Training takes 5 hours. Predicting a single image pair takes 100 seconds. It is evident from Table 2 that the majority of the time during prediction is spent in the forward pass of the convolutional neural network.

Component                      Runtime
Convolutional neural network   95 s
Semiglobal matching            3 s
Cross-based cost aggregation   2 s
Everything else                0.03 s

Table 2. Time required for prediction by each component.

5.5. Training set size

We would like to know whether more training data would lead to a better stereo method. To answer this question, we train our convolutional neural network on many instances of the KITTI stereo dataset while varying the training set size. The results of the experiment are depicted in Figure 4. We observe an almost linear relationship between the training set size and the error on the test set. These results imply that our method will improve as larger datasets become available in the future.

Figure 4. The error on the test set as a function of the number of stereo pairs in the training set.

6. Conclusion

Our result on the KITTI stereo dataset seems to suggest that convolutional neural networks are a good fit for computing the stereo matching cost. Training on bigger datasets will reduce the error rate even further. Using supervised learning in the stereo method itself could also be beneficial. Our method is not yet suitable for real-time applications such as robot navigation. Future work will focus on improving the network's runtime performance.

References

[1] Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376.

[2] Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).

[3] Haeusler, R., Nair, R., and Kondermann, D. (2013). Ensemble learning for confidence measures in stereo vision. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 305-312. IEEE.

[4] Hirschmuller, H. (2008). Stereo processing by semiglobal matching and mutual information. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):328-341.

[5] Hirschmuller, H. and Scharstein, D. (2009). Evaluation of stereo matching costs on images with radiometric differences. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(9):1582-1599.

[6] Kong, D. and Tao, H. (2004). A method for learning matching errors for stereo computation. In BMVC, pages 1-10.

[7] Kong, D. and Tao, H. (2006). Stereo matching via learning multiple experts behaviors. In BMVC, pages 97-106.

[8] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114.

[9] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.

[10] Li, Y. and Huttenlocher, D. P. (2008). Learning for stereo vision using the structured support vector machine. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8. IEEE.

[11] Mei, X., Sun, X., Zhou, M., Wang, H., Zhang, X., et al. (2011). On building an accurate stereo matching system on graphics hardware. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 467-474. IEEE.

[12] Peris, M., Maki, A., Martull, S., Ohkawa, Y., and Fukui, K. (2012). Towards a simulation driven stereo vision system. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1038-1042. IEEE.

[13] Scharstein, D. and Pal, C. (2007). Learning conditional random fields for stereo. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1-8. IEEE.

[14] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42.

[15] Spyropoulos, A., Komodakis, N., and Mordohai, P. (2014). Learning to detect ground control points for improving the accuracy of stereo matching. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1621-1628. IEEE.

[16] Vogel, C., Roth, S., and Schindler, K. (2014). View-consistent 3D scene flow estimation over multiple frames. In Computer Vision - ECCV 2014, pages 263-278. Springer.

[17] Vogel, C., Schindler, K., and Roth, S. (2013). Piecewise rigid scene flow. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1377-1384. IEEE.
[18] Yamaguchi, K., Hazan, T., McAllester, D., and Urtasun, R. (2012). Continuous Markov random fields for robust stereo estimation. In Computer Vision - ECCV 2012, pages 45-58. Springer.

[19] Yamaguchi, K., McAllester, D., and Urtasun, R. (2013). Robust monocular epipolar flow estimation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1862-1869. IEEE.

[20] Yamaguchi, K., McAllester, D., and Urtasun, R. (2014). Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Computer Vision - ECCV 2014, pages 756-771. Springer.

[21] Zhang, K., Lu, J., and Lafruit, G. (2009). Cross-based local stereo matching using orthogonal integral images. Circuits and Systems for Video Technology, IEEE Transactions on, 19(7):1073-1079.

[22] Zhang, L. and Seitz, S. M. (2007). Estimating optimal parameters for MRF stereo from a single image pair. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(2):331-342.
