Fine-Grained Head Pose Estimation Without Keypoints


Abstract

Estimating the head pose of a person is a crucial problem that has a large number of applications, such as aiding in gaze estimation, modeling attention, fitting 3D models to video and performing face alignment. Traditionally, head pose is computed by estimating some keypoints from the target face and solving the 2D to 3D correspondence problem with a mean human head model.
For head pose, this approach has not been studied extensively and is not commonly used for head pose estimation tasks. Instead, if very accurate head pose is needed then depth cameras are installed, and if no depth footage exists, landmarks are detected and pose is retrieved.

Figure 1. Example difficult scenarios for pose estimation using our method, with the blue axis pointing out of the face, green pointing downward and red pointing to the side. Best viewed in color.

3.2. The Multi-Loss Approach

All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.

We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.

The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy loss, so the network learns to predict the neighbourhood of the pose in a robust fashion. Having three cross-entropy losses, one for each Euler angle, gives three signals which are backpropagated into the network. To obtain fine-grained predictions we compute the expectation of each output angle over the binned output. The detailed architecture is shown in Figure 2.

We then add a regression loss to the network, namely a mean squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:
L = H(y, ŷ) + α · MSE(y, ŷ)

where H and MSE respectively designate the cross-entropy and mean squared error loss functions.

In this work we show that a network trained on a large synthetic dataset, which by definition has accurate pose annotations, can predict pose accurately in real cases. We test the networks on real datasets which have accurate pose annotations and show state-of-the-art results on the AFLW, AFLW2000 [35] and BIWI [6] datasets. Additionally, we are starting to close the gap with very accurate methods which use depth information on the BIWI dataset.

We believe that deep networks have large advantages compared to landmark-to-pose methods. For example, they are not dependent on the head model chosen, the landmark detection method, the subset of points used for alignment of the head model, or the optimization method used for aligning the 2D to 3D points. Moreover, they always output a pose prediction, which is not the case for the latter method when the landmark detection method fails.

3.3. Datasets for Fine-Grained Pose Estimation

In order to truly make progress in the problem of predicting pose from image intensities, we have to find real datasets which contain precise pose annotations, numerous identities and different lighting conditions, all of this across large poses. We identify two very different datasets which fill these requirements.

First is the challenging AFLW2000 dataset. This dataset contains the first 2000 identities of the in-the-wild AFLW dataset, which have been re-annotated with 68 3D landmarks using a 3D model which is fit to each face. Consequently, this dataset contains accurate fine-grained pose annotations and is a prime candidate to be used as a test set for our task.

Second, the BIWI dataset is gathered in a laboratory setting by recording RGB-D video of different subjects across
different head poses using a Kinect v2 device. It contains roughly 15,000 frames and the rotations are ±75° for yaw, ±60° for pitch and ±50° for roll. A 3D model was fit to each individual's point cloud and the head rotations were tracked to produce the pose annotations. This dataset is commonly used as a benchmark for pose estimation using depth methods, which attests to the precision of its labels. In our case we will not use the depth information nor the temporal information, only individual color frames. In Section 4.1 we compare to a very accurate state-of-the-art depth method to ascertain the performance gap between approaches.

Figure 2. ResNet50 architecture with combined mean squared error and cross-entropy losses.

3.5. The Effects of Low Resolution

Currently there is a need for head pose estimation at a distance, and there exist multiple example applications in areas such as video surveillance, autonomous driving and advertisement. Future head pose estimation methods should look to improve estimation for low-resolution heads.

We present an in-depth study of the effect of low resolution on widely-used landmark detectors as well as state-of-the-art detectors. We contend that low resolution should worsen the performance of landmark detection, since estimating keypoints necessitates access to features which disappear at lower resolutions. We argue that although detailed features are important for pose estimation, they are not as critical. Moreover, this area is relatively untapped: there is scarce related work discussing head pose estimation at a distance. As far as we know, there is no work discussing low-resolution head pose estimation using deep learning.

3.4. Training on a Synthetically Expanded Dataset

We follow the path of [2], which used synthetically expanded data to train their landmark detection model.
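Stepping back to Section 3.2, the per-angle loss and the expectation decoding can be illustrated with a minimal pure-Python sketch. The 66-bin, 3°-wide binning over ±99° is an assumption consistent with the ±99° range mentioned in Section 4.1, and the function names are illustrative rather than the released implementation:

```python
import math

NUM_BINS = 66          # assumed: 3-degree-wide bins covering [-99, +99)
BIN_WIDTH = 3.0
BIN_CENTERS = [-99.0 + BIN_WIDTH * (i + 0.5) for i in range(NUM_BINS)]

def softmax(logits):
    # Numerically stable softmax over the bin logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def angle_to_bin(angle_deg):
    # Bin index of a continuous angle; angles outside +-99 deg are discarded upstream.
    return min(NUM_BINS - 1, max(0, int((angle_deg + 99.0) // BIN_WIDTH)))

def expected_angle(probs):
    # Fine-grained prediction: expectation of the angle over the binned softmax output.
    return sum(p * c for p, c in zip(probs, BIN_CENTERS))

def multi_loss(logits, target_deg, alpha=1.0):
    # L = H(y, y_hat) + alpha * MSE(y, y_hat): cross-entropy on the bin label
    # plus mean squared error on the expected continuous angle.
    probs = softmax(logits)
    ce = -math.log(probs[angle_to_bin(target_deg)])
    pred = expected_angle(probs)
    mse = (pred - target_deg) ** 2
    return ce + alpha * mse
```

One such loss is computed for each of yaw, pitch and roll from three fully-connected heads sharing the same convolutional trunk; at test time only the expectation step is needed to decode a continuous prediction.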
One of the datasets they train on is the 300W-LP dataset, which is a collection of popular in-the-wild 2D landmark datasets which have been grouped and re-annotated. A face model is fit to each image and the image is distorted to vary the yaw of the face, which gives us pose across several yaw angles. Pose is accurately labeled because we have the 3D model and the 6 degrees of freedom of the face for each image.

We show in Section 4.1 that by carefully training on large amounts of synthetic data we can begin closing the gap with existing depth methods and can achieve very good accuracies on datasets with fine-grained pose annotations.

Deep networks which predict pose directly from image intensities are a good candidate method for this application, because robustness can be built into them by modifying the network or augmenting its training data in smart ways. We propose a simple yet surprisingly effective way of developing robustness to low-resolution images: we augment our data by downsampling and upsampling randomly, which forces the network to learn effective representations for varied resolutions. We also augment the data by blurring the images. Experiments are shown in Section 4.5.

4. Experimental Results

We perform experiments showing the overall performance of our proposed method on different datasets for pose estimation as well as popular landmark detection datasets. We show ablation studies for the multi-loss. Additionally, we delve into landmark-to-pose methods and shed light on their robustness. We also test our method against other deep learning methods whose authors have graciously run them on some of the test datasets that we use in Section 4.1. Additionally, in the same section, we test landmark-to-pose methods and other types of pose estimation methods such as 3D model fitting.
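The random downsample-and-upsample augmentation proposed in Section 3.5 can be sketched in pure Python on an image held as a 2D list; in practice this would run on RGB arrays with an image library. The factor set {1, 6, 11, 16, 21} matches one of the variants trained in Section 4.5, while the function names are assumed for illustration:

```python
import random

def nn_resize(img, new_h, new_w):
    # Nearest-neighbour resize of a 2D list of pixel values.
    h, w = len(img), len(img[0])
    return [[img[min(h - 1, int(r * h / new_h))][min(w - 1, int(c * w / new_w))]
             for c in range(new_w)] for r in range(new_h)]

def lowres_augment(img, factors=(1, 6, 11, 16, 21), rng=random):
    # Randomly downsample by one of the given integer factors, then upsample
    # back to the original size, forcing the network to cope with lost detail.
    h, w = len(img), len(img[0])
    f = rng.choice(factors)
    small = nn_resize(img, max(1, h // f), max(1, w // f))
    return nn_resize(small, h, w)
```

A factor of 1 leaves the image untouched, so the network still sees full-resolution examples; the larger factors produce the blocky low-resolution inputs studied in Section 4.5.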
Finally, we present experiments suggesting that a holistic approach to pose estimation using deep networks outperforms landmark-to-pose methods when resolution is low, even if the landmark detector is state-of-the-art.

4.1. Fine-Grained Pose Estimation on the AFLW2000 and BIWI Datasets

We evaluate our method on the AFLW2000 and BIWI datasets for the task of fine-grained pose estimation, and compare to pose estimated from landmarks using two different landmark detectors, FAN [2] and Dlib [11], as well as ground truth landmarks (only available for AFLW2000).

Table 1. Mean average error of Euler angles across different methods on the AFLW2000 dataset [35].

Method                       Yaw      Pitch    Roll     MAE
Multi-Loss ResNet50 (α=2)    6.470    6.559    5.436    6.155
Multi-Loss ResNet50 (α=1)    6.920    6.637    5.674    6.410
3DDFA [35]                   5.400    8.530    8.250    7.393
FAN [2] (12 points)          6.358    12.277   8.714    9.116
Dlib [11] (68 points)        23.153   13.633   10.545   15.777
Ground truth landmarks       5.924    11.756   8.271    8.651

Table 2. Mean average error of Euler angles across different methods on the BIWI dataset [6]. *These methods use depth information. †Trained on AFLW.

Method                       Yaw      Pitch    Roll     MAE
Multi-Loss ResNet50 (α=2)    5.167    6.975    3.388    5.177
Multi-Loss ResNet50 (α=1)    4.810    6.606    3.269    4.895
KEPLER [14]†                 8.084    17.277   16.196   13.852
Multi-Loss ResNet50 (α=1)†   5.785    11.726   8.194    8.568
3DMM + Online [33]*          2.500    1.500    2.200    2.066
FAN [2] (12 points)          8.532    7.483    7.631    7.882
Dlib [11] (68 points)        16.756   13.802   6.190    12.249
3DDFA [35]                   36.175   12.252   8.776    19.068

FAN is a very impressive state-of-the-art landmark detector described in [2] by Bulat and Tzimiropoulos. It uses stacked hourglass networks [16], originally intended for human body pose estimation, and switches the normal ResNet bottleneck block for a hierarchical, parallel and multi-scale block proposed in another paper by the same authors [1]. We were inspired to train our pose-estimation network on 300W-LP by their work, which trains their network on this dataset for the task of landmark detection.

Yaw     Pitch   Roll    Sum of errors
Multi-Loss ResNet50 (α=1)   3.29    3.39    3.00    9.68

Dlib
Gu et al. [5]               3.91    4.03    3.03    10.97

Table 3. Comparison with Gu et al. [5]. Mean average error of Euler angles averaged over train-test splits of the BIWI dataset [6].

implements a landmark detector which uses an ensemble of regression trees and which is described in [11].

We run both of these landmark detectors on the AFLW2000 and BIWI datasets. AFLW2000 images are small and are cropped around the face. For BIWI we run a Faster R-CNN [22] face detector trained on the WIDER FACE dataset [32, 10] and deployed in a Docker container [24]. We loosely crop the faces around the bounding box in order to conserve the rest of the head. We also retrieve pose from the ground-truth landmarks of AFLW2000. Results can be seen in Tables 1 and 2.

Note that since our method bins angles in the ±99° range, we discard images with angles outside of this range. Only 31 of the 2000 images of AFLW2000 are not used.

Additionally, we run 3DDFA [35], which directly fits a 3D face model to the RGB image via convolutional neural networks. The primary task of 3DDFA is to align facial landmarks, even occluded ones, using a dense 3D model. As a result of the 3D fitting process a 3D head pose is produced, and we report this pose.

In order to compare to Gu et al. [5], we train on three different 70-30 splits of the videos in the BIWI dataset and average our mean average error over the splits. For this evaluation we use weight decay with a coefficient of 0.04 because of the smaller amount of data available. We compare our result to their single-frame result, which was trained in the same fashion, and we show the results in Table 3. Our method compares favorably to Gu et al. and lowers the sum of mean average errors by 1.29.

Finally, we compare our results to the state-of-the-art RGB-D method [33]. We can see that our proposed method considerably shrinks the gap between RGB-D methods and ResNet50 [8].

4.2. Landmark-To-Pose Study

In this set of experiments, we examine the approach of
using facial landmarks as a proxy to head pose, and investigate the limitations of its use for pose estimation. The commonly used pipeline for landmark-to-pose estimation involves a number of steps: 2D landmarks are detected, a 3D human mean face model is assumed, camera intrinsic parameters are approximated, and finally the 2D-3D correspondence problem is solved. We show how this pipeline is affected by different error sources. Specifically, using the AFLW2000 benchmark dataset, we conduct experiments starting from the best available conditions (ground truth 2D landmarks, ground truth 3D mean face model) and examine the final head pose estimation error by deviating

Pitch estimation is still lagging behind, in part due to the lack of large quantities of extreme pitch examples in the 300W-LP dataset. We expect that this gap will be closed when more data is available.

We present two multi-loss ResNet50 networks with different regression coefficients of 1 and 2, trained on the 300W-LP dataset. For BIWI we also present a multi-loss ResNet50 (α = 1) trained on AFLW. All three networks were trained for 25 epochs using Adam optimization [12] with a learning rate of 10⁻⁵, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸. We normalize the data before training by using the ImageNet mean and standard deviation for each color channel.

Pose was obtained by annotating landmarks and using a landmark-to-pose method. Results can be seen in Table 4.

AFW is a popular dataset, also commonly used to test landmark detection, which contains rough pose annotations for 468 in-the-wild faces with absolute yaw degrees up to ±90°. Methods only compare mean average error for the yaw angle. Methods usually output discrete predictions and round their output to the closest 15° multiple. As such, at the 15° error margin, which is one of the main metrics reported in the literature,
this dataset is saturated and methods achieve very high accuracies. Results are shown in Figure 7.

Using our joint classification and regression losses for AlexNet [13], we obtain similar mean average error after training for 25 epochs. We compare our results to the KEPLER [14] method, which uses a modified GoogLeNet for simultaneous landmark detection and pose estimation, and to [19], which uses a 4-layer convolutional network. Multi-Loss ResNet50 achieves lower mean average error than KEPLER across all angles in the AFLW test-set after 25 epochs of training using Adam and the same learning parameters as in Section 4.1. These results can be observed in Table 4.

We test the previously trained AlexNet and Multi-Loss ResNet50 networks on the AFW dataset and display the results in Figure 7. We evaluate the results uniquely on the yaw, as all related work does. We constrain our networks to output discrete yaw in 15-degree increments and display the accuracy at two different yaw thresholds. A face is correctly classified if the absolute error of the predicted yaw is lower than or equal to the threshold presented.

Figure 3. We show the effects of using different numbers of landmark points for 3D head pose estimation, using ground truth facial landmarks and the ground truth mean face model, on the AFLW2000 dataset.

from this condition. For all of these experiments, we assume zero lens distortion and run an iterative method based on Levenberg-Marquardt optimization to find the 2D-3D correspondence, which is implemented as the function solvePnP in OpenCV. We first run the pipeline only with ground truth landmarks, varying the number of points used in the optimization method. We observe that in this ideal condition, using all of the available 68 landmark points actually gives the biggest error, as shown in Figure 3. Then, we jitter the ground truth 2D landmarks by adding random noise independently in the x and y direction per landmark.
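The jittering experiment just described can be sketched as follows; the uniform noise model and the function name are assumptions, as the text only states that up to 10 pixels of noise is added independently in x and y per landmark:

```python
import random

def jitter_landmarks(landmarks, max_jitter_px, rng=random):
    # Perturb each (x, y) landmark independently with uniform noise in
    # [-max_jitter_px, +max_jitter_px] to simulate keypoint-detector error.
    return [(x + rng.uniform(-max_jitter_px, max_jitter_px),
             y + rng.uniform(-max_jitter_px, max_jitter_px))
            for (x, y) in landmarks]
```

Pose is then re-estimated from the jittered points with the same solvePnP pipeline, and the resulting error is averaged over repetitions.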
Figure 4 shows the results of this experiment with up to 10 pixels of jittering. We repeat the experiment with the same sets of keypoints selected for Figure 3. Finally, we change the mean face model by stretching the ground truth mean face in width and height by up to 40%; the results are shown in Figure 5. Additionally, we also report results based on estimated landmarks using FAN and Dlib in Figure 6.

The results suggest that with ground truth 2D landmarks, using fewer keypoints produces less error, since the estimate is less likely to be affected by pose-irrelevant deformation such as facial expression. However, the more points we use for the correspondence problem, the more robust it becomes to random jittering. In other words, there exists a tradeoff: if we know the keypoints are very accurate we want to use fewer points for pose, but if there is error we want to use more points. With estimated landmarks it is not clear how to weigh these two, and we find that using more points can both help and worsen pose estimation, as presented in Figure 6.

The same testing protocol is adopted for all compared methods, and numbers are reported directly from the associated papers. HyperFace [20] and All-In-One [21] both use a single network for numerous facial analysis tasks. HyperFace uses an AlexNet pre-trained on ImageNet as a backbone, and All-In-One uses a backbone 7-layer conv-net pre-trained on the face recognition task using triplet probability constraints [25].

We show that by pre-training on ImageNet and fine-tuning on the AFLW dataset we achieve accuracies that are very close to the best results of the related work. We do not use any other supervisory information which might improve the performance of the network, such as 2D landmark annotations. We do, however, use a more powerful backbone network in ResNet50. We show performance of the same network on both the AFLW test-set and AFW.

4.4. AFLW2000 Multi-Loss Ablation

4.3.
AFLW and AFW Benchmarking

The AFLW dataset, which is commonly used to train and test landmark detection methods, also includes pose annotations.

In this section we present an ablation study of the multi-loss. We train ResNet50 using only a mean squared error (MSE) loss and compare this to ResNet50 using a multi-loss with different coefficients for the MSE component. The weight of the cross-entropy loss is maintained constant at 1. We also compare this to AlexNet to discern the effects of having a more powerful architecture.

Figure 4. We show the effect of jittering landmark points around their ground truth position on the task of 3D head pose estimation on AFLW2000, to simulate the effects of noise in the facial keypoint detector. We repeat this experiment four times with different numbers of landmarks. For all experiments we use the ground truth mean face model for the landmark-to-pose alignment task.

Figure 5. We show the effects of changing the 3D mean face model on the task of 3D head pose estimation from 2D landmarks. We use 2D ground truth landmarks and modify the mean face model by stretching its width and height.

Table 4. Mean average errors of predicted Euler angles on the AFLW test set.

Method                         Yaw     Pitch   Roll    MAE
Multi-Loss ResNet50 (α=1)      6.26    5.89    3.82    5.324
AlexNet (α=1)                  7.79    7.41    6.05    7.084
KEPLER [14]                    6.45    5.85    8.75    7.017
Patacchiola, Cangelosi [19]    –       –       –       –

Model                   α       Yaw      Pitch   Roll    MAE
Multi-Loss ResNet50     (regression only)  13.110  6.726  5.799  8.545
Multi-Loss ResNet50     –       7.087    6.870   5.621   6.526
Multi-Loss ResNet50     2       6.470    6.559   5.436   6.155
Multi-Loss ResNet50     1       6.920    6.637   5.674   6.410
Multi-Loss ResNet50     0.1     10.270   6.867   5.420   7.519
Multi-Loss ResNet50     0.01    11.410   6.847   5.836   8.031
Multi-Loss ResNet50     0       11.628   7.119   5.966   8.238
Multi-Loss AlexNet      1       27.650   8.543   8.954   15.049
Multi-Loss AlexNet      0.1     30.110   9.548   9.273   16.310
Multi-Loss AlexNet      0.01    25.090   8.442   8.287   13.940
Multi-Loss AlexNet      0       24.469   8.350   8.353   13.724

Table 5.
Ablation analysis: MAE across different models and regression loss weights on the AFLW2000 dataset.

We observe the best results on the AFLW2000 dataset when the regression coefficient is equal to 2. We demonstrate increased accuracy when weighting each loss with roughly the same magnitude. This phenomenon can be observed in Table 5.

4.5. Low-Resolution AFLW2000 Study

We study the effects of downsampling all images from the AFLW2000 dataset and testing landmark-to-pose methods on these datasets. We compare these results to our method using different data augmentation strategies. We test the pose retrieved from the state-of-the-art landmark detection network FAN and also from Dlib. We test all methods on different scales of downsampling: x1, x5, x10 and x15. In general, images are around 20-30 pixels wide and high when downsampled x15. We then upsample these images and run them through the detectors and deep networks. We use nearest-neighbor interpolation for downsampling and upsampling.

For our method we present a multi-loss ResNet50 with a regression coefficient of 1 trained on normal-resolution images. We also train three identical networks: for the first one we augment the dataset by randomly downsampling and upsampling the input image by x10, for the next one we randomly downsample and upsample an image by an integer ranging from 1 to 10, and for the last one we randomly downsample

Figure 6. Using estimated 2D landmark points, this experiment shows the 3D pose estimation error depending on how many facial keypoints are used.

Figure 8 legend: Hopenet (α=1); Hopenet (α=1, random downsample x10); Hopenet (α=1, random downsample [1, 10]); Hopenet (α=1, random downsample {1, 6, 11, 16, 21}); FAN (12 points); Dlib (68 points).

Figure 7.
AFW pose benchmark results along with other methods [21, 20, 14, 36].

and upsample an image by one of the following integers: 1, 6, 11, 16, 21.

We observe that from the get-go our methods show better performance than pose from the Dlib landmarks, while pose from the FAN landmarks is acceptable. Pose from the FAN landmarks degrades as the resolution gets very low, which is natural since landmarks are very hard to estimate at these resolutions, especially for methods that rely heavily on appearance. Pose from the network without augmentation deteriorates strongly, yet the networks with augmentation show much more robustness and perform decently at very low resolutions. Results are presented in Figure 8. This is exciting news for long-distance and low-resolution head pose estimation.

Figure 8. Mean average error for different methods on the downsampled AFLW2000 dataset, in order to determine the robustness of methods to low-resolution images.

5. Conclusions and Future Work

In this work we show that a multi-loss deep network can directly, accurately and robustly predict head rotation from image intensities. We show that such a network outperforms landmark-to-pose methods that use state-of-the-art landmark detection methods. Landmark-to-pose methods are studied in this work to show their dependence on extraneous factors such as head model and landmark detection accuracy.

We also show that our proposed method generalizes across datasets and that it outperforms networks that regress head pose as a sub-goal in detecting landmarks. We show that landmark-to-pose is fragile in cases of very low resolution and that, if the training data is appropriately augmented, our method shows robustness to these situations.

Synthetic data generation for extreme poses seems to be a way to improve performance for the proposed method, as are studies into more intricate network architectures that might take into account full body pose, for example.

References

[1] A. Bulat and G. Tzimiropoulos.
Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In International Conference on Computer Vision, 2017.
[2] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, 2017.
[3] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. FacePoseNet: Making a case for landmark-free face alignment. In Computer Vision Workshops (ICCVW), 2017 IEEE International Conference on, pages 1599-1608. IEEE, 2017.
[4] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, June 2001.
[5] J. Gu, X. Yang, S. De Mello, and J. Kautz. Dynamic facial analysis: From Bayesian filtering to recurrent neural network. 2017.
[6] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool. Random forests for real time 3D face analysis. International Journal of Computer Vision, 101(3):437-458, February 2013.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] J. Huang, X. Shao, and H. Wechsler. Face pose discrimination using support vector machines (SVM). In Pattern Recognition, 1998. Proceedings. Fourteenth International Conference on, volume 1, pages 154-156. IEEE, 1998.
[10] H. Jiang and E. Learned-Miller. Face detection with the Faster R-CNN. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 650-657. IEEE, 2017.
[11] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1867-1874, 2014.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[14] A. Kumar, A. Alavi, and R. Chellappa. KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 258-265. IEEE, 2017.
[15] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135-164, 2004.
[16] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.
[17] J. Ng and S. Gong. Composite support vector machines for detection of faces across views and pose estimation. Image and Vision Computing, 20(5):359-368, 2002.
[18] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 130-136. IEEE, 1997.
[19] M. Patacchiola and A. Cangelosi. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognition, 2017.
[20] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
[21] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 17-24. IEEE, 2017.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[23] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[24] N. Ruiz and J. M. Rehg. Dockerface: an easy to install and use Faster R-CNN face detector in a Docker container. arXiv preprint arXiv:1708.04370, 2017.
[25] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on, pages 1-8. IEEE, 2016.
[26] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200-215, 2011.
[27] J. Sherrah, S. Gong, and E.-J. Ong. Understanding pose discrimination in similarity space. In BMVC, pages 1-10, 1999.
[28] J. Sherrah, S. Gong, and E.-J. Ong. Face distributions in similarity space under varying head pose. Image and Vision Computing, 19(12):807-819, 2001.
[29] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2879-2886, June 2012.
[30] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 532-539, 2013.
[31] H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson. Face alignment assisted by head pose estimation. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
[32] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Y. Yu, K. A. Funes Mora, and J.-M. Odobez. Robust and accurate 3D head pose estimation through 3DMM and online head model reconstruction. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 711-718. IEEE, 2017.
[34] Z. Zhang, Y. Hu, M. Liu, and T. Huang. Head pose estimation in seminar room using multi view face detectors. In International Evaluation Workshop on Classification of Events, Activities and Relationships, pages 299-304. Springer, 2006.
[35] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146-155, 2016.
[36] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879-2886. IEEE, 2012.
