Fine-Grained Head Pose Estimation Without Keypoints

Abstract

Estimating the head pose of a person is a crucial problem that has a large number of applications, such as aiding in gaze estimation, modeling attention, fitting 3D models to video and performing face alignment. Traditionally head pose is computed by estimating some keypoints from the target face.
this approach has not been studied extensively and is not commonly used for head pose estimation tasks. Instead, if very accurate head pose is needed, depth cameras are installed, and if no depth footage exists, landmarks are detected and pose is retrieved from them.

In this work we show that a network trained on a large synthetic dataset, which by definition has accurate pose annotations, can predict pose accurately in real cases. We test the networks on real datasets which have accurate pose annotations and show state-of-the-art results on the AFLW, AFLW2000 [35] and BIWI [6] datasets. Additionally, we are starting to close the gap with very accurate methods which use depth information on the BIWI dataset.

We believe that deep networks have large advantages compared to landmark-to-pose methods, for example:
- They are not dependent on the head model chosen, the landmark detection method, the subset of points used for alignment of the head model, or the optimization method used for aligning the 2D to 3D points.
- They always output a pose prediction, which is not the case for the latter methods when the landmark detection method fails.

Figure 1. Example difficult scenarios using our method: green pointing downward and red pointing to the side of the face. Best viewed in color.

3.2. The Multi-Loss Approach

All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.

We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.

The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy, thus the network learns to predict the neighbourhood of the pose in a robust fashion. By having three cross-entropy losses, one for each Euler angle, we have three signals which are backpropagated into the network. In order to obtain fine-grained predictions we compute the expectation of each output angle for the binned output. The detailed architecture is shown in Figure 2.

We then add a regression loss to the network, namely a mean squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:

L = H(y, ŷ) + α · MSE(y, ŷ)

where H and MSE respectively designate the cross-entropy and mean squared error loss functions.

3.3. Datasets for Fine-Grained Pose Estimation

In order to truly make progress in the problem of predicting pose from image intensities we have to find real datasets which contain precise pose annotations, numerous identities, different lighting conditions, all of this across large poses. We identify two very different datasets which fill these requirements.

First is the challenging AFLW2000 dataset. This dataset contains the first 2000 identities of the in-the-wild AFLW dataset, which have been re-annotated with 68 3D landmarks using a 3D model which is fit to each face. Consequently this dataset contains accurate fine-grained pose annotations and is a prime candidate to be used as a test set for our task.

Second, the BIWI dataset is gathered in a laboratory setting by recording RGB-D video of different subjects across
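As an illustration, the combined loss above can be sketched as follows. This is our own minimal NumPy rendition, not the authors' code; the 66-bin layout (3° bins covering −99° to +99°) is inferred from the binning range mentioned later in Section 4.1 and should be treated as an assumption.

```python
import numpy as np

NUM_BINS = 66     # assumed: 3-degree bins spanning [-99, +99) degrees
BIN_WIDTH = 3.0

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def angle_to_bin(angle):
    # map a continuous angle in degrees to its bin index
    return np.clip(((angle + 99.0) // BIN_WIDTH).astype(int), 0, NUM_BINS - 1)

def multi_loss(logits, angle, alpha=1.0):
    """Cross-entropy on the binned angle plus alpha times the MSE between
    the target and the expected angle (expectation over the softmax)."""
    probs = softmax(logits)
    bins = angle_to_bin(angle)
    ce = -np.mean(np.log(probs[np.arange(len(bins)), bins]))
    # expectation of the binned output gives a fine-grained prediction
    expected = probs @ np.arange(NUM_BINS) * BIN_WIDTH - 99.0
    mse = np.mean((expected - angle) ** 2)
    return ce + alpha * mse
```

At inference time only the expectation step is needed to decode a continuous angle from the softmax output; one such loss is computed per Euler angle.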
Figure 2. ResNet50 architecture with combined mean squared error and cross-entropy losses.

different head poses using a Kinect v2 device. It contains roughly 15,000 frames and the rotations are ±75° for yaw, ±60° for pitch and ±50° for roll. A 3D model was fit to each individual's point cloud and the head rotations were tracked to produce the pose annotations. This dataset is commonly used as a benchmark for pose estimation using depth methods, which attests to the precision of its labels. In our case we will not use the depth information nor the temporal information, only individual color frames. In Section 4.1 we compare to a very accurate state-of-the-art depth method to ascertain the performance gap between approaches.

3.4. Training on a Synthetically Expanded Dataset

We follow the path of [2], which used synthetically expanded data to train their landmark detection model. One of the datasets they train on is the 300W-LP dataset, which is a collection of popular in-the-wild 2D landmark datasets that have been grouped and re-annotated. A face model is fit on each image and the image is distorted to vary the yaw of the face, which gives us pose across several yaw angles. Pose is accurately labeled because we have the 3D model and the 6 degrees of freedom of the face for each image.

We show in Section 4.1 that by carefully training on large amounts of synthetic data we can begin closing the gap with existing depth methods and can achieve very good accuracies on datasets with fine-grained pose annotations.

3.5. The Effects of Low Resolution

Currently there is a need for head pose estimation at a distance, and there exist multiple example applications in areas such as video surveillance, autonomous driving and advertisement. Future head pose estimation methods should look to improve estimation for low-resolution heads.

We present an in-depth study of the effect of low resolution on widely-used landmark detectors as well as state-of-the-art detectors. We contend that low resolution should worsen the performance of landmark detection, since estimating keypoints necessitates access to features which disappear at lower resolutions. We argue that although detailed features are important for pose estimation they are not as critical. Moreover this area is relatively untapped: there is scarce related work discussing head pose estimation at a distance, and as far as we know there is no work discussing low-resolution head pose estimation using deep learning.

Deep networks which predict pose directly from image intensities are a good candidate method for this application because robustness can be built into them by modifying the network or augmenting its training data in smart ways. We propose a simple yet surprisingly effective way of developing robustness to low-resolution images: we augment our data by downsampling and upsampling randomly, which forces the network to learn effective representations for varied resolutions. We also augment the data by blurring the images. Experiments are shown in Section 4.5.

4. Experimental Results

We perform experiments showing the overall performance of our proposed method on different datasets for pose estimation as well as popular landmark detection datasets. We show ablation studies for the multi-loss. Additionally, we delve into landmark-to-pose methods and shed light on their robustness. We also test our method against other deep learning methods whose authors have graciously run on some of the test datasets that we use in Section 4.1. Additionally, in the same section, we test landmark-to-pose methods and other types of pose estimation methods such as 3D model fitting.
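A hedged sketch of the resolution augmentation described above: downsample by a random integer factor with nearest-neighbour interpolation, then upsample back to the original size. The factor list mirrors the variants later evaluated in Section 4.5; the helper name and the striding/repetition implementation are our own.

```python
import numpy as np

def random_resolution_augment(image, factors=(1, 6, 11, 16, 21), rng=None):
    """Randomly degrade effective resolution: nearest-neighbour downsample
    by an integer factor (array striding), then upsample back (pixel
    repetition), so the network sees varied resolutions during training."""
    rng = rng or np.random.default_rng()
    f = int(rng.choice(factors))
    if f == 1:
        return image  # factor x1 leaves the image untouched
    h, w = image.shape[:2]
    small = image[::f, ::f]                         # nearest-neighbour downsample
    up = small.repeat(f, axis=0).repeat(f, axis=1)  # nearest-neighbour upsample
    return up[:h, :w]                               # crop back to original size
```

The blurring augmentation mentioned in the text could be added in the same place in the input pipeline.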
Finally, we present experiments suggesting that a holistic approach to pose using deep networks outperforms landmark-to-pose methods when resolution is low, even if the landmark detector is state-of-the-art.

4.1. Fine-Grained Pose Estimation on the AFLW2000 and BIWI Datasets

We evaluate our method on the AFLW2000 and BIWI datasets for the task of fine-grained pose estimation and compare to pose estimated from landmarks using two different landmark detectors, FAN [2] and Dlib [11], and ground truth landmarks (only available for AFLW2000).

FAN is a very impressive state-of-the-art landmark detector described in [2] by Bulat and Tzimiropoulos. It uses Stacked Hourglass Networks [16], originally intended for human body pose estimation, and switches the normal ResNet bottleneck block for a hierarchical, parallel and multi-scale block proposed in another paper by the same authors [1]. We were inspired to train our pose-estimation network on 300W-LP by their work, which trains their network on this dataset for the task of landmark detection. Dlib implements a landmark detector which uses an ensemble of regression trees and which is described in [11].

We run both of these landmark detectors on the AFLW2000 and BIWI datasets. AFLW2000 images are small and are cropped around the face. For BIWI we run a Faster R-CNN [22] face detector trained on the WIDER FACE dataset [32, 10] and deployed in a Docker container [24]. We loosely crop the faces around the bounding box in order to conserve the rest of the head. We also retrieve pose from the ground-truth landmarks of AFLW2000. Results can be seen in Tables 1 and 2. Note that since our method bins angles in the ±99° range, we discard images with angles outside of this range; only 31 of the 2000 images of AFLW2000 are not used.

Method | Yaw | Pitch | Roll | MAE
Multi-Loss ResNet50 (α=2) | 6.470 | 6.559 | 5.436 | 6.155
Multi-Loss ResNet50 (α=1) | 6.920 | 6.637 | 5.674 | 6.410
3DDFA [35] | 5.400 | 8.530 | 8.250 | 7.393
FAN [2] (12 points) | 6.358 | 12.277 | 8.714 | 9.116
Dlib [11] (68 points) | 23.153 | 13.633 | 10.545 | 15.777
Ground truth landmarks | 5.924 | 11.756 | 8.271 | 8.651

Table 1. Mean average error of Euler angles across different methods on the AFLW2000 dataset [35].

Method | Yaw | Pitch | Roll | MAE
Multi-Loss ResNet50 (α=2) | 5.167 | 6.975 | 3.388 | 5.177
Multi-Loss ResNet50 (α=1) | 4.810 | 6.606 | 3.269 | 4.895
KEPLER [14]† | 8.084 | 17.277 | 16.196 | 13.852
Multi-Loss ResNet50 (α=1)† | 5.785 | 11.726 | 8.194 | 8.568
3DMM + Online [33]* | 2.500 | 1.500 | 2.200 | 2.066
FAN [2] (12 points) | 8.532 | 7.483 | 7.631 | 7.882
Dlib [11] (68 points) | 16.756 | 13.802 | 6.190 | 12.249
3DDFA [35] | 36.175 | 12.252 | 8.776 | 19.068

Table 2. Mean average error of Euler angles across different methods on the BIWI dataset [6]. *These methods use depth information. †Trained on AFLW.

In order to compare to Gu et al. [5] we train on three different 70-30 splits of the videos in the BIWI dataset and average our mean average error over the splits. For this evaluation we use weight decay with a coefficient of 0.04 because of the smaller amount of data available. We compare our result to their single-frame result, which was trained in the same fashion, and we show the results in Table 3. Our method compares favorably to Gu et al. and lowers the sum of mean average errors by 1.29%.

Method | Yaw | Pitch | Roll | Sum of errors
Multi-Loss ResNet50 (α=1) | 3.29 | 3.39 | 3.00 | 9.68
Gu et al. [5] | 3.91 | 4.03 | 3.03 | 10.97

Table 3. Comparison with Gu et al. [5]: mean average error of Euler angles averaged over train-test splits of the BIWI dataset [6].

Additionally, we run 3DDFA [35], which directly fits a 3D face model to the RGB image via convolutional neural networks. The primary task of 3DDFA is to align facial landmarks, even occluded ones, using a dense 3D model. As a result of their 3D fitting process a 3D head pose is produced, and we report this pose.

Finally, we compare our results to the state-of-the-art RGB-D method [33]. We can see that our proposed method considerably shrinks the gap between RGB-D methods and ResNet50 [8].
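The evaluation protocol behind Tables 1 and 2 (per-angle mean average error in degrees, discarding AFLW2000 images whose ground-truth angles fall outside the ±99° binning range) might be sketched as follows; the function and argument names are our own:

```python
import numpy as np

def euler_mae(pred, gt, limit=99.0):
    """Per-angle mean average error over (N, 3) arrays of (yaw, pitch, roll)
    in degrees, discarding samples whose ground truth exceeds +/-limit on
    any angle, as done for AFLW2000 above. Returns per-angle and overall MAE."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    keep = (np.abs(gt) <= limit).all(axis=1)  # drop out-of-range samples
    err = np.abs(pred[keep] - gt[keep]).mean(axis=0)
    return err, err.mean()
```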
Pitch estimation is still lagging behind, in part due to the lack of large quantities of extreme pitch examples in the 300W-LP dataset. We expect that this gap will be closed when more data is available.

We present two multi-loss ResNet50 networks with different regression coefficients of 1 and 2, trained on the 300W-LP dataset. For BIWI we also present a multi-loss ResNet50 (α = 1) trained on AFLW. All three networks were trained for 25 epochs using Adam optimization [12] with a learning rate of 10^-5, β1 = 0.9, β2 = 0.999 and ε = 10^-8. We normalize the data before training by using the ImageNet mean and standard deviation for each color channel.

4.2. Landmark-To-Pose Study

In this set of experiments we examine the approach of using facial landmarks as a proxy to head pose and investigate the limitations of its use for pose estimation. The commonly used pipeline for landmark-to-pose estimation involves a number of steps: 2D landmarks are detected, a 3D human mean face model is assumed, camera intrinsic parameters are approximated, and finally the 2D-3D correspondence problem is solved. We show how this pipeline is affected by different error sources. Specifically, using the AFLW2000 benchmark dataset, we conduct experiments starting from the best available condition (ground truth 2D landmarks, ground truth 3D mean face model) and examine the final head pose estimation error by deviating from this condition. For all of these experiments we assume zero lens distortion and run an iterative method based on Levenberg-Marquardt optimization to solve the 2D-3D correspondence, as implemented in the SolvePnP function in OpenCV.

We first run the pipeline only with ground truth landmarks, varying the number of points used in the optimization method. We observe that in this ideal condition, using all of the available 68 landmark points actually gives the biggest error, as shown in Figure 3. Then we jitter the ground truth 2D landmarks by adding random noise independently in the x, y direction per landmark. Figure 4 shows the results of this experiment with up to 10 pixels of jittering. We repeat the experiment with the same sets of keypoints selected for Figure 3. Finally, we change the mean face model by stretching the ground truth mean face in width and height by up to 40%, shown in Figure 5. Additionally, we also report results based on estimated landmarks using FAN and Dlib in Figure 6.

Figure 3. The effects of using different numbers of landmark points for 3D head pose estimation, using ground truth facial landmarks and the ground truth mean face model on the AFLW2000 dataset.

The results suggest that with ground truth 2D landmarks, using fewer keypoints produces less error, since fewer points are less likely to be affected by pose-irrelevant deformation such as facial expression. However, the more points we use for the correspondence problem, the more robust it becomes to random jittering. In other words, there exists a trade-off: if we know the keypoints are very accurate we want to use fewer points for pose, but if there is error we want to use more points. With estimated landmarks it is not clear how to weigh these two, and we find that using more points can both help and worsen pose estimation, as presented in Figure 6.

4.3. AFLW and AFW Benchmarking

The AFLW dataset, which is commonly used to train and test landmark detection methods, also includes pose annotations. Pose was obtained by annotating landmarks and using a landmark-to-pose method. Results can be seen in Table 4.

AFW is a popular dataset, also commonly used to test landmark detection, which contains rough pose annotations for 468 in-the-wild faces with absolute yaw degrees up to ±90°. Methods usually only compare mean average error for yaw, output discrete predictions, and round their output to the closest 15° multiple. As such, at the 15° error margin, which is one of the main metrics reported in the literature, this dataset is saturated and methods achieve high accuracies. Results are shown in Figure 7.

Using our joint classification and regression losses for AlexNet [13] we obtain similar mean average error after training for 25 epochs. We compare our results to the KEPLER [14] method, which uses a modified GoogLeNet for simultaneous landmark detection and pose estimation, and to [19], which uses a 4-layer convolutional network. Multi-loss ResNet50 achieves lower mean average error than KEPLER across all angles on the AFLW test set after 25 epochs of training using Adam and the same learning parameters as in Section 4.1. These results can be observed in Table 4.

We test the previously trained AlexNet and multi-loss ResNet50 networks on the AFW dataset and display the results in Figure 7. We evaluate the results uniquely on the yaw, as all related work does. We constrain our networks to output discrete yaw in 15-degree increments and display the accuracy at two different yaw thresholds. A face is correctly classified if the absolute error of the predicted yaw is lower than or equal to the threshold presented.

The same testing protocol is adopted for all compared methods and numbers are reported directly from the associated papers. HyperFace [20] and All-In-One [21] both use a single network for numerous facial analysis tasks. HyperFace uses an AlexNet pretrained on ImageNet as a backbone, and All-In-One uses a backbone 7-layer convnet pretrained on the face recognition task using triplet probability constraints [25].

We show that by pretraining on ImageNet and fine-tuning on the AFLW dataset we achieve accuracies that are very close to the best results of the related work. We do not use any other supervisory information which might improve the performance of the network, such as 2D landmark annotations. We do however use a more powerful backbone network in ResNet50. We show performance of the same network on both the AFLW test set and AFW.

4.4. AFLW2000 Multi-Loss Ablation

In this section we present an ablation study of the multi-loss.
We train ResNet50 using only a mean squared error (MSE) loss and compare it to ResNet50 using the multi-loss with different coefficients for the MSE component. The weight of the cross-entropy loss is held constant at 1. We also compare this to AlexNet to discern the effects of having a more powerful architecture.

We observe the best results on the AFLW2000 dataset when the regression coefficient is equal to 2; we demonstrate increased accuracy when weighing each loss with roughly the same magnitude. This phenomenon can be observed in Table 5.

Method | Yaw | Pitch | Roll | MAE
Multi-Loss ResNet50 (α=1) | 6.26 | 5.89 | 3.82 | 5.324
AlexNet (α=1) | 7.79 | 7.41 | 6.05 | 7.084
KEPLER [14] | 6.45 | 5.85 | 8.75 | 7.017
Patacchiola, Cangelosi [19] | 11.04 | 7.15 | 4.40 | 7.530

Table 4. Mean average errors of predicted Euler angles on the AFLW test set.

Method | Yaw | Pitch | Roll | MAE
ResNet50, regression only | 13.110 | 6.726 | 5.799 | 8.545
Multi-Loss ResNet50 | 7.087 | 6.870 | 5.621 | 6.526
Multi-Loss ResNet50 (α=2) | 6.470 | 6.559 | 5.436 | 6.155
Multi-Loss ResNet50 (α=1) | 6.920 | 6.637 | 5.674 | 6.410
Multi-Loss ResNet50 (α=0.1) | 10.270 | 6.867 | 5.420 | 7.519
Multi-Loss ResNet50 (α=0.01) | 11.410 | 6.847 | 5.836 | 8.031
Multi-Loss ResNet50 | 11.628 | 7.119 | 5.966 | 8.238
Multi-Loss AlexNet (α=1) | 27.650 | 8.543 | 8.954 | 15.049
Multi-Loss AlexNet (α=0.1) | 30.110 | 9.548 | 9.273 | 16.310
Multi-Loss AlexNet (α=0.01) | 25.090 | 8.442 | 8.287 | 13.940
AlexNet | 24.469 | 8.350 | 8.353 | 13.724

Table 5. Ablation analysis: MAE across different models and regression loss weights on the AFLW2000 dataset.

Figure 4. The effect of jittering landmark points around their ground truth position on the task of 3D head pose estimation on AFLW2000, simulating noise in the facial keypoint detector. The experiment is repeated four times with different numbers of landmarks, using the ground truth mean face model for the landmark-to-pose alignment task.

Figure 5. The effects of changing the 3D mean face model on the task of 3D head pose estimation from 2D landmarks: 2D ground truth landmarks are used and the mean face model is modified by stretching its width and height.

Figure 6. Using estimated 2D landmark points, this experiment shows the 3D pose estimation error depending on how many facial keypoints are used.

4.5. Low-Resolution AFLW2000 Study

We study the effects of downsampling all images from the AFLW2000 dataset and testing landmark-to-pose methods on these downsampled datasets. We compare these results to our method using different data augmentation strategies. We test the pose retrieved from the state-of-the-art landmark detection network FAN and also from Dlib. We test all methods on different scales of downsampling: x1, x5, x10 and x15. In general images are around 20-30 pixels wide and high when downsampled x15. We then upsample these images and run them through the detectors and deep networks. We use nearest-neighbor interpolation for downsampling and upsampling.

For our method we present a multi-loss ResNet50 with regression coefficient 1 trained on normal-resolution images. We also train three identical networks: for the first we augment the dataset by randomly downsampling and upsampling the input image by x10; for the next we randomly downsample and upsample an image by an integer ranging from 1 to 10; and for the last we randomly downsample and upsample an image by one of the integers 1, 6, 11, 16, 21.

We observe that from the get-go our methods show better performance than pose from the Dlib landmarks, yet pose from the FAN landmarks is acceptable. Pose from the FAN landmarks degrades as the resolution gets very low, which is natural since landmarks are very hard to estimate at these resolutions, especially for methods that rely heavily on appearance. Pose from the network without augmentation deteriorates strongly, yet the networks with augmentation show much more robustness and perform decently at very low resolutions. Results are presented in Figure 8. This is exciting news for long-distance and low-resolution head pose estimation.

Figure 7. AFW pose benchmark results along with other methods [21, 20, 14, 36].

Figure 8. Mean average error for different methods on the downsampled AFLW2000 dataset, in order to determine the robustness of methods to low-resolution images.

5. Conclusions and Future Work

In this work we show that a multi-loss deep network can directly, accurately and robustly predict head rotation from image intensities. We show that such a network outperforms landmark-to-pose methods that use state-of-the-art landmark detection methods. Landmark-to-pose methods are studied in this work to show their dependence on extraneous factors such as head model and landmark detection accuracy.

We also show that our proposed method generalizes across datasets and that it outperforms networks that regress head pose as a sub-goal in detecting landmarks. We show that landmark-to-pose is fragile in cases of very low resolution and that, if the training data is appropriately augmented, our method shows robustness to these situations.

Synthetic data generation for extreme poses seems to be a way to improve performance for the proposed method, as are studies into more intricate network architectures that might take into account full body pose, for example.

References
[1] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In International Conference on Computer Vision, 2017.
[2] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, 2017.
[3] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. FacePoseNet: Making a case for landmark-free face alignment. In Computer Vision Workshop (ICCVW), 2017 IEEE International Conference on, pages 1599-1608. IEEE, 2017.
[4] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, June 2001.
[5] J. Gu, X. Yang, S. De Mello, and J. Kautz. Dynamic facial analysis: From Bayesian filtering to recurrent neural network. 2017.
[6] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool. Random forests for real time 3D face analysis. Int. J. Comput. Vision, 101(3):437-458, February 2013.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] J. Huang, X. Shao, and H. Wechsler. Face pose discrimination using support vector machines (SVM). In Pattern Recognition, 1998. Proceedings. Fourteenth International Conference on, volume 1, pages 154-156. IEEE, 1998.
[10] H. Jiang and E. Learned-Miller. Face detection with the Faster R-CNN. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 650-657. IEEE, 2017.
[11] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1867-1874, 2014.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[14] A. Kumar, A. Alavi, and R. Chellappa. KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 258-265. IEEE, 2017.
[15] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135-164, 2004.
[16] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.
[17] J. Ng and S. Gong. Composite support vector machines for detection of faces across views and pose estimation. Image and Vision Computing, 20(5):359-368, 2002.
[18] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 130-136. IEEE, 1997.
[19] M. Patacchiola and A. Cangelosi. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognition, 2017.
[20] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
[21] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 17-24. IEEE, 2017.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[23] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[24] N. Ruiz and J. M. Rehg. Dockerface: an easy to install and use Faster R-CNN face detector in a Docker container. arXiv preprint arXiv:1708.04370, 2017.
[25] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on, pages 1-8. IEEE, 2016.
[26] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200-215, 2011.
[27] J. Sherrah, S. Gong, and E.-J. Ong. Understanding pose discrimination in similarity space. In BMVC, pages 1-10, 1999.
[28] J. Sherrah, S. Gong, and E.-J. Ong. Face distributions in similarity space under varying head pose. Image and Vision Computing, 19(12):807-819, 2001.
[29] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2879-2886, June 2012.
[30] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 532-539, 2013.
[31] H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson. Face alignment assisted by head pose estimation. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
[32] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Y. Yu, K. A. Funes Mora, and J.-M. Odobez. Robust and accurate 3D head pose estimation through 3DMM and online head model reconstruction. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 711-718. IEEE, 2017.
[34] Z. Zhang, Y. Hu, M. Liu, and T. Huang. Head pose estimation in seminar room using multi view face detectors. In International Evaluation Workshop on Classification of Events, Activities and Relationships, pages 299-304. Springer, 2006.
[35] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146-155, 2016.
[36] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879-2886. IEEE, 2012.