
Comparison and Combination of State-of-the-Art Techniques for Handwritten Character Recognition: Topping the MNIST Benchmark

Daniel Keysers
IUPR Research Group, German Research Center for Artificial Intelligence (DFKI),
and Technical University of Kaiserslautern
May 2006

Abstract

Although the recognition of isolated handwritten digits has been a research topic for many years, it continues to be of interest for the research community and for commercial applications. We show that despite the maturity of the field, different approaches still deliver results that vary enough to allow improvements by using their combination. We do so by choosing four well-motivated state-of-the-art recognition systems for which results on the standard MNIST benchmark are available. When comparing the errors made, we observe that the errors differ between all four systems, suggesting the use of classifier combination. We then determine the error rate of a hypothetical system that combines the output of the four systems. The result obtained in this manner is an error rate of 0.35% on the MNIST data, the best result published so far. We furthermore discuss the statistical significance of the combined result and of the results of the individual classifiers.

1 Introduction

The recognition of handwritten digits is a topic of practical importance because of applications like automated form reading and handwritten zip-code processing.
It is also a subject that has continued to produce much research effort over the last decades, for several reasons:

- The problem is prototypical for image processing and pattern recognition with a small number of classes.
- Standard benchmark data sets exist that make it easy to obtain valid results quickly.
- Many publications and techniques are available that can be cited and built on, respectively.
- The practical applications motivate the research performed.
- Improvements in classification accuracy over existing techniques continue to be obtained using new approaches.

This paper has the objective to analyze four of the state-of-the-art methods for the recognition of handwritten digits [3, 9, 15, 26] by comparing the errors made on the standard MNIST benchmark data. (A part of this work has been described in [13].) We perform a statistical analysis of the errors using a bootstrapping technique [5] that not only uses the error count but also takes into account which errors were made. Using this technique we can determine more accurate estimates of the statistical significance of improvements. When analyzing the errors made, we observe that, although the error rates obtained are all very similar, there are substantial differences in which patterns are classified erroneously. This can be interpreted as an indicator for using classifier combination. An experiment shows that indeed a combination of the classifiers performs better than the single best classifier. The statistical analysis shows that the probability that this result constitutes a real improvement and is not based on chance alone is 94%.

2 Related work

This paper is, of course, only possible because the results of the four chosen base methods [3, 9, 15, 26] were available. These approaches are presented in more detail in Section 4. We are aware that there exist other methods that also achieve very good classification error rates on the data used, e.g. [18].
However, we feel that the four methods chosen comprise a set of well-motivated and self-contained approaches. Furthermore, they represent the different classification methods most commonly used (in the research literature), that is, the nearest neighbor classifier, neural networks, and the support vector machine. All four methods use the appearance-based paradigm in the broad sense and can thus be considered sufficiently general to be applied to other object recognition tasks.

There is a large amount of work available on the topic of classifier combination as well (an introduction can be found e.g. in [16]), and much work exists on applying classifier combination to handwriting recognition (e.g. [4, 7, 8, 12]). Note that we do not propose new algorithms for the classification of handwritten digits or for the combination of classifiers. Instead, our contribution is to present a statistical analysis that compares different classifiers and to show that their combination improves the performance even though the individual classifiers all reach state-of-the-art error rates by themselves.

We would like to thank Patrice Simard for providing the recognition results to us and the authors of [3, 9] for listing the errors in the respective papers.

Figure 1: Example images from the MNIST data set.

3 The MNIST task

The modified NIST handwritten digit database (MNIST) [17] contains 60,000 images in the training set and 10,000 patterns in the test set, each of size 28x28 pixels with 256 gray levels. The data set is available online, and some examples from the MNIST corpus are shown in Figure 1.

The preprocessing of the images is described as follows in [17]: "The original black and white (bilevel) images were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio.
The resulting images contain gray levels as a result of the anti-aliasing (image interpolation) technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field." Note that some authors use a 'deslanted' version of the database.

The task is generally not considered to be a 'difficult' recognition task (in the sense that absolute error rates are high) for two reasons. First, the human error rate is estimated to be only about 0.2%, although it has not been determined for the whole test set [27]. Second, the large training set allows machine learning algorithms to generalize well. With respect to the connection between training set size and classification performance for OCR tasks, it is argued in [28] that increasing the training set size by a factor of ten cuts the error rate approximately in half.

Table 1 gives a comprehensive overview of the error rates reported for the MNIST data. One disadvantage of the MNIST corpus is that there exists no development test set, which leads to effects known as 'training on the testing data'. This is not necessarily true for each of the research groups performing experiments, but it cannot always be ruled out. Note that in some publications (e.g. [26]) the authors explicitly state that all parameters of the system were chosen by using a subset of the training set for validation, which then rules out the overadaptation to the test set. However, the tendency exists to evaluate one method with different parameters, or different methods, several times on the same data until the best performance seems to have been reached.
This procedure leads to an overly optimistic estimation of the error rate of the classifier, and the number of tuned parameters should be considered when judging such error rates. Ideally, a development test set would be used to determine the best parameters for the classifiers, and the results would be obtained from one run on the test set itself. Nevertheless, a comparison of 'best performing' algorithms may lead to valid conclusions, especially if these perform well on several different tasks.

Note that Dong gives lower error rates than in [11] of 0.38 to 0.44 percent on his web page (accessed February 2005), but it remains somewhat unclear how these error rates were obtained and if possibly these low error rates are due to the effect of 'training on the testing data'. Also, [31] try a variety of SVMs and networks, which yield error rates ranging from 0.59 percent to 0.81 percent. The IDM [15] as described in Section 4 was not optimized for the MNIST task. Instead, all parameter settings were determined using the smaller USPS data set, and the complete setup was then evaluated once on the MNIST data.

Table 1: Error rates for the MNIST task in %. The systems marked with '*' are those we use for the analysis.

reference             method                                      ER
[27]  AT&T            human performance                           0.2
[?]                   Euclidean nearest neighbor                  3.?
[19]  U Liege         decision trees, sub-windows                 2.63
[17]  AT&T            deslant, Euclidean 3-NN                     2.4
[20]  Kyushu          elastic matching                            2.10
[14]  RWTH            one-sided tangent distance                  1.9
[6]   AT&T            neural net LeNet1                           1.7
[21]  UC London       products of experts                         1.7
[2]   U Quebec        hyperplanes + support vector m.             1.5
[24]  TU Berlin       support vector machine                      1.4
[6]   AT&T            neural net LeNet4                           1.1
[27]  AT&T            tangent distance                            1.1
[14]  RWTH            two-sided tangent d., virtual data          1.0
[10]  CENPARMI        local learning                              0.99
[25]  MPI, AT&T       virtual SVM                                 0.8
[17]  AT&T            distortions, neural net LeNet5              0.82
[17]  AT&T            distortions, boosted LeNet4                 0.7
[30]  U Singapore     bio-inspired features + SVM                 0.72
[9]   Caltech, MPI    virtual SVM (jitter)                       *0.68
[3]   UC Berkeley     shape context matching                     *0.63
[11]  CENPARMI        support vector machine                      0.60
[31]  U Singapore     deslant, biology-inspired features          0.59
[1]   Boston U        cascaded shape context                      0.58
[9]   Caltech, MPI    deslant, virtual SVM (jitter, shift)       *0.56
[1]   Boston U        shape context matching                      0.54
[15]  RWTH            deformation model (IDM)                    *0.54
[8]   Hitachi         support vector m.                           0.42
[26]  Microsoft       neural net + virtual data                  *0.42
this work             hyp. combination of 4 systems (*)           0.35

Figure 2 shows the examples from the MNIST test set of which each is misclassified by at least one of the four state-of-the-art systems. (These systems are marked with '*' in Table 1.) Those samples that are misclassified by all four systems are marked by a surrounding frame. This presentation is possible because in [9] and in [3] the authors present the set of samples misclassified by their systems. Furthermore, Patrice Simard kindly provided the classification results of his system as described in [26] for all test data.
The availability of these results also makes it possible to determine the error rate of a hypothetical system that combines these four best systems, as described in the following Section 4.

Some of the images in Figure 2 are a good illustration of the inherent class overlap that exists for this problem: some instances of e.g. 3 vs. 5, 4 vs. 9, and 8 vs. 9 are not distinguishable by taking into account the observed image only. This suggests that we are dealing with a problem with non-zero Bayes error rate. Further improvements in the error rate on this data set might therefore be problematic. For example, consider a classifier that classifies the second framed image as a 9: despite the fact that this classifier would not make an error with this decision according to the class labels, we might prefer a classifier that classifies the image as a 4. Note that recently [29] has presented a more detailed discussion of the different types of errors made by state-of-the-art classifiers for handwritten characters.

4 The classifiers and their combination

We briefly describe the four systems for handwritten digit recognition that we compare and combine. Then, we discuss the statistical significance of their results and present a simple classifier combination of these four methods that achieves a (hypothetical) error rate of 0.35%.

Shape context matching. [3] presents the shape context matching approach. The method proceeds by first extracting contour points of the images. In the case of handwritten character images, the resulting contour points trace both sides of the pen strokes the character is composed of. Then, at each contour point, a local descriptor of the shape as represented by the contour points is extracted. This local descriptor is called a shape context and is a histogram of the contour points in the surrounding of the central point.
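As a rough illustration, such a log-polar histogram descriptor can be sketched as follows. The bin counts (5 radial times 12 angular), the radius range, and the toy contour are assumptions chosen for illustration, not the parameters used in [3]:

```python
import numpy as np

def shape_context(points, center, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the contour points around `center` (sketch)."""
    d = points - center
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    keep = (r > 0) & (r < r_max)          # drop the center point itself
    # Log-spaced radial edges: fine bins near the center, coarse far away.
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r[keep], side="right") - 1,
                    0, n_r - 1)           # radii below r_min fall into bin 0
    t_bin = (theta[keep] * n_theta / (2 * np.pi)).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta), dtype=int)
    np.add.at(hist, (r_bin, t_bin), 1)    # count points per log-polar bin
    return hist

# Toy contour: 40 points on a circle of radius 0.5; descriptor at one point.
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
pts = 0.5 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
h = shape_context(pts, pts[0])
print(h.shape, h.sum())   # (5, 12) 39
```

Two descriptors are then compared bin by bin (the original method uses a chi-square histogram distance), so that nearby points dominate the comparison while distant context is summarized coarsely.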
This histogram has a finer resolution at points close to the central point and a coarser one for regions farther away, which is achieved using a log-polar representation. The classification is then done using a nearest neighbor classifier (although the authors chose to use only one third of the training data for the MNIST task). The distance within the classifier is determined using an iterative matching based on the shape context descriptors and two-dimensional deformations.

Figure 2: Difficult examples from the MNIST test set along with their target labels. At least one of the four state-of-the-art systems (cf. Table 1) misclassifies these images. The framed examples are misclassified by all four systems.

The shape contexts of the training and test image are assigned to each other by using the Hungarian algorithm on a bipartite graph representation with edge weights according to the similarity of the shape context descriptors. This assignment is then used to estimate a two-dimensional spline transformation best matching the two images. The images are transformed accordingly, and the whole process (including the extraction of shape contexts) is iterated until a stopping criterion is reached. The resulting distance is used in the classifier.

Recently, [1] discuss a cascading technique to speed up the slow nearest neighbor matching by "two to three orders of magnitude". While the result that this discussion is based on only used the first 20,000 training samples for reasons of efficiency and resulted in an error rate of 0.63%, [1] report an error rate of 0.54% for the full training set, and 0.58% for the cascaded classifier that uses only about 300 distance calculations per test sample.

Invariant support vector machine. [9] presents a support vector machine (SVM) that is especially suited for handwritten digit recognition by incorporating prior knowledge about the task.
This is achieved by using virtual data or a special kernel function within the SVM. The special kernel function applies several transformations, which leave the class identity unchanged, to the compared images and returns the kernel function value of the appropriate pair of transformed images. This method is referred to as kernel jittering. The second approach uses so-called virtual support vectors and consists of first training a support vector machine. The resulting set of support vectors contains sufficient information about the recognition problem and can therefore be considered a condensed representation of the training data for discrimination purposes. The method proceeds to create transformed versions of the support vectors, which are the virtual support vectors. In the experiments leading to the error rate of 0.56%, the transformations used were image shifts within the eight-neighborhood plus horizontal and vertical shifts of two pixels, thus resulting in 9+4=13 virtual support vectors for each original support vector. (This experiment also used the deslanted version of the MNIST data [17].) On this new set of virtual support vectors, another support vector machine was trained and evaluated on the test set.

Pixel-to-pixel image matching with local contexts. [15] presents deformable models for handwritten character recognition. It is shown that a simple zero-order matching approach called the image distortion model (IDM) can lead to very competitive results if the local context of each pixel is considered in the distortion. The local context is represented by a 3x3 surrounding window of the horizontal and vertical image gradient, resulting in an 18-dimensional descriptor. The IDM allows choosing for each pixel of the test image the best fitting counterpart in the reference image within a suitable corresponding range. The distance as determined by the best match between two images is then used within a 3-nearest-neighbor classifier.
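A minimal sketch of such a zero-order IDM distance follows. The 3x3 gradient context (18-dimensional descriptor) matches the description above, but the warp range, the simple finite-difference gradients, the padding, the sum-of-squares comparison, and the toy images are illustrative assumptions, not the exact settings of [15]:

```python
import numpy as np

def idm_distance(test_img, ref_img, warp=2, ctx=1):
    """Image distortion model distance (zero-order matching, sketch).

    Each test pixel is matched to the best-fitting reference pixel within a
    (2*warp+1)^2 neighborhood; pixels are compared via the local context of
    horizontal and vertical gradients in a 3x3 window (18-dim descriptor).
    """
    gy_t, gx_t = np.gradient(test_img.astype(float))
    gy_r, gx_r = np.gradient(ref_img.astype(float))

    def contexts(gx, gy):
        # Stack the 3x3 window of both gradient maps -> shape (18, h, w).
        p = ctx
        gxp, gyp = np.pad(gx, p), np.pad(gy, p)
        h, w = gx.shape
        feats = [np.stack([gp[i:i + h, j:j + w] for gp in (gxp, gyp)])
                 for i in range(2 * p + 1) for j in range(2 * p + 1)]
        return np.concatenate(feats)

    ct, cr = contexts(gx_t, gy_t), contexts(gx_r, gy_r)
    h, w = test_img.shape
    # Pad the reference contexts with a huge value so out-of-image
    # candidate positions are never selected as a best match.
    crp = np.pad(cr, ((0, 0), (warp, warp), (warp, warp)),
                 constant_values=1e6)
    best = np.full((h, w), np.inf)
    for dy in range(2 * warp + 1):
        for dx in range(2 * warp + 1):
            d = ((ct - crp[:, dy:dy + h, dx:dx + w]) ** 2).sum(axis=0)
            best = np.minimum(best, d)    # keep the best match per pixel
    return best.sum()

a = np.zeros((28, 28)); a[10:18, 12:16] = 1.0    # a vertical bar
b = np.roll(a, 2, axis=1)                         # shifted by 2 pixels
c = np.roll(a, 6, axis=1)                         # shifted beyond the warp
assert idm_distance(a, b) < idm_distance(a, c)    # small shifts are absorbed
```

The final assertion illustrates the point of the model: a shift within the warp range is absorbed almost for free, while a larger displacement still produces a large distance.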
More elaborate models for image matching are also discussed, but only small improvements can be obtained, at the cost of much higher computational effort. The IDM can be seen as the best compromise between high classification speed and high recognition accuracy while being conceptually very simple and easy to implement.

Convolutional neural net and virtual data. [26] presents a large convolutional neural network of about 3,000 nodes in five layers that is especially designed for handwritten character classification. The new concept in the approach is to present a new set of virtual training images to the learning algorithm in each iteration of the training. The virtual training set is constructed from the given training data by applying to each of the images a separate two-dimensional random displacement field that is smoothed with a Gaussian filter. This makes it possible to generate a very large amount of virtual data (1,000 virtual samples per original element of the training data set). The data is generated on the fly in each training iteration and therefore does not have to be saved, which avoids the problems of data handling. Apart from the generation of virtual examples, there is another point where prior knowledge about the task comes into play, namely the use of a convolutional neural net. This architecture, which is described in greater detail in [17], contains prior knowledge in that it uses tying of weights within the neural net to extract low-level features from the input that are invariant with respect to the position within the image; only in later layers of the neural net is the position information used.

Discussion and combination. We can observe that all four methods take special measures to deal with the image variability present in the images, using virtual data and image matching methods.
At the same time, the concrete classification algorithm seems to play a somewhat smaller role in the performance, as nearest neighbor classifiers, support vector machines, and neural networks all perform very well. Only a slight advantage of the neural net can be seen in the possibility to use very large amounts of virtual data in training, because the training proceeds in several iterations, which need not use the same data but can use distorted samples of the images instead.

Figure 2 shows all the errors made by at least one of the four classifiers. It is remarkable that only eight samples are classified incorrectly by all four systems. This observation naturally suggests the use of classifier combination to further reduce the error rate. The availability of the results of the other classifiers makes it possible to determine this error rate of a simple hypothetical combined system. However, we are somewhat restricted in the choice of combination scheme, because for two of the classifiers we only know if the result was correct or not. We thus decided to use a simple majority vote combination based on the four classifiers, where the neural net classifier is used for tie-breaking (because it has the best single error rate). Note that the result is only an upper bound of the error rate that a real combined system would have, because we do not use the class labels the patterns were assigned to (but only the information if the decision was correct or not). This means that in case of a disagreement between the falsely assigned classes we could have a correct assignment when using the class labels. Furthermore, it seems likely that the use of the confidence values of the component classifiers in the combination scheme could also improve the joint decision.

Using the described hypothetical combination, the resulting error rate is 0.35%. In the following section we will show that this improvement has a probability of 94% of being real and not due to chance alone.
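The vote on correctness bits and the bootstrap comparison can be sketched as follows. The per-sample correctness vectors here are synthetic stand-ins, with errors placed independently at random so that only the error counts (63, 56, 54, 42) resemble the reported rates; real errors are correlated, so the numbers this sketch prints do not reproduce the paper's 0.35% and 94%:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the MNIST test set

# Synthetic stand-ins for the per-sample correctness of the four systems.
correct = np.ones((4, n), dtype=bool)
for row, n_err in zip(correct, (63, 56, 54, 42)):
    row[rng.choice(n, n_err, replace=False)] = False

# Majority vote on correctness bits: >= 3 correct wins, a 2-2 tie is
# broken by the best single classifier (index 3, the neural net), and
# <= 1 correct counts as an error (the pessimistic upper bound above).
votes = correct.sum(axis=0)
combined = np.where(votes == 2, correct[3], votes >= 3)
print("errors, best single:", n - int(correct[3].sum()))   # 42
print("errors, combined:  ", n - int(combined.sum()))

# Bootstrap over the test set: resample with replacement and estimate
# the probability that the combination beats the best single system.
B = 500
wins = 0
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    wins += int(combined[idx].sum() > correct[3][idx].sum())
print("estimated P(improvement):", wins / B)
```

Because the synthetic errors are independent, they overlap far less than the real ones, which makes the combination look even better than it is; with the actual, correlated per-sample results, the same bootstrap yields the 94% figure discussed in the text.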
