FCN (Science) paper


First release: 26 October 2017, www.sciencemag.org (page numbers not final at time of first release)

The ability to learn and generalize from a few examples is a hallmark of human intelligence (1). CAPTCHAs, images used by websites to block automated interactions, are examples of problems that are easy for humans but difficult for computers. CAPTCHAs are hard for algorithms because they add clutter and crowd letters together to create a chicken-and-egg problem for character classifiers: the classifiers work well for characters that have been segmented out, but segmenting the individual characters requires an understanding of the characters, each of which might be rendered in a combinatorial number of ways (2–5). A recent deep-learning approach for parsing one specific CAPTCHA style required millions of labeled examples from it (6), and earlier approaches mostly relied on hand-crafted style-specific heuristics to segment out the character (3, 7), whereas humans can solve new styles without explicit training (Fig. 1A). The wide variety of ways in which letterforms could be rendered and still be understood by people is illustrated in Fig. 1.

Building models that generalize well beyond their training distribution is an important step toward the flexibility Douglas Hofstadter envisioned when he said that "for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence" (8). Many researchers have conjectured that this could be achieved by incorporating the inductive biases of the visual cortex (9–12), utilizing the wealth of data generated by neuroscience and cognitive science research. In the mammalian brain, feedback connections in the visual cortex play roles in figure-ground segmentation, and in object-based top-down attention that isolates the contours of an object even when partially transparent objects occupy the same spatial locations (13–16). Lateral connections in the visual co
8.4.4 reCAPTCHA CNN control experiment
8.4.5 RCN on reCAPTCHA control dataset
8.4.6 BotDetect
8.4.7 BotDetect with the appearance model
8.4.8 Determining the transferability of the BotDetect parsing parameters
8.4.9 PayPal
8.4.10 Yahoo
8.4.11 Using the same font set for parsing different CAPTCHAs
8.5 ICDAR text recognition in uncontrolled environments
8.5.1 Methods
8.5.2 Results and comparison
8.6 One-shot classification and generation for Omniglot dataset
8.7 Classification of MNIST dataset and its noisy variants
8.7.1 CNN control experiments
8.7.2 Classification of noiseless MNIST with low training complexity
8.7.3 Classification of noisy variants of MNIST
8.8 Occlusion reasoning on MNIST
8.8.1 Dataset
8.8.2 Classification and occlusion reasoning
8.9 Reconstruction from noisy MNIST using RCN, VAE, and DRAW
8.10 Importance of lateral connections and backward pass
8.11 The running time scaling of two-level and three-level RCN models
8.12 RCN on 3D object renderings
8.13 Improving RCN

Organization

This supplementary material is organized in two parts. The first part (Sections 1-7) provides the theoretical foundations of the RCN model and establishes connections with the literature and its biological inspiration. The second part (Section 8) has a more applied focus and provides additional details about RCN's practical implementation, architecture, and performance on several benchmark datasets.

In the first, more theoretical part, Sections 1-3 describe the RCN generative model, factorized over shape and appearance. Section 4 describes inference given an input image, and Section 5 describes algorithms for learning the parameters and structure of the model. Section 6 provides context for the present model and establishes connections with the existing literature.
We elaborate on the guidance from neuroscience in Section 7.

In the second, more applied part, we describe the details of our preprocessing and post-processing steps in Sections 8.1 and 8.2, respectively. In Section 8.3, a summary of the RCN architectures used throughout the different experiments is given. Experiments on several CAPTCHA datasets, ICDAR, Omniglot, and MNIST (and its noisy variants) are reported in Sections 8.4–8.9. The importance of the lateral connections and backward pass is highlighted through a lesion study in Section 8.10. A computational complexity study is performed in Section 8.11, which demonstrates the better scaling behavior of deeper RCN hierarchies. Section 8.12 shows experiments on 3D object renderings. We conclude with some remarks about future work to improve RCN in Section 8.13.

1 Factorizing shape and appearance for object recognition

Human ability to recognize objects is invariant to drastic appearance changes. If we were to see, for example, an entirely blue tree for the first time ever, we would be able to correctly recognize it as a tree and identify its color as blue. Despite the initial surprise, we would not be confused or inclined to think it is a big blueberry. This strongly suggests that we are able to perceive the shape of objects independently of their appearance, and that our categorization of objects relies more strongly on shape cues than on appearance cues. Similarly, even if we never saw an entirely blue tree, we are still able to imagine it by composing our model of "tree" (which contributes a shape) with our idea of "blue" (which contributes an appearance). I.e., our internal representation of objects factorizes shape and appearance and compounds them to form objects.

Based on the above observations, it seems reasonable to expect that an image model with human-level recognition capabilities will have a factorized representation of shape and appearance.
Very few works have pursued shape and appearance factorization for image recognition [63, 64]. This factorization enables a model to generalize from fewer examples, since training data only needs to contain images with sufficient diversity of shapes and appearances (and not every combination of them) for the model to come up with a representation that is able to handle the entire cross product space. Despite their success, several mainstream image recognition models, such as the convolutional neural network (CNN) [45, 65], entangle shape and appearance. These models are unable to recognize at test time objects with an appearance that significantly departs from the ones seen for that particular object in the training set; they may fail the blue tree test. The obvious solution for these models is to augment their training sets to include images with more combinations of shapes and appearances. Though using trees of every possible color during training would indeed resolve the blue tree problem, using this approach in general results in an unbearable sample complexity.

Our model assumes shape and appearance to be factorized and generates images by combining these two elements using the "coloring book" approach. First the shape of the object, which defines its external and internal boundaries, is generated. Then it is "colored" by having its interior regions filled in with appearance (i.e., with some color or, in general, some texture).

2 A generative hierarchical model for shape

In this section we will describe a probabilistic model that generates an edge map $F^{(1)}$ with the shape of an object. The edge map $F^{(1)}$ will later be combined with the appearance of the object to form the final image $X$. We call this model the recursive cortical network (RCN). An RCN is a hierarchical latent variable model in which all the latent variables are discrete.
We proceed to describe it next.

2.1 Hierarchical model

In order to model the edge map $F^{(1)}$, the RCN uses a hierarchical arrangement alternating pooling layers and feature layers of latent discrete variables from top to bottom. A very similar arrangement was also used in previous models, such as the convolutional neural network (CNN) [45, 65]. Unlike the CNN, the RCN is a fully generative model, and its properties, even when used for discrimination, are in stark contrast with those of the CNN.

We will use $F^{(\ell)}$ and $H^{(\ell)}$ to collect the latent variables corresponding to the $\ell$-th feature layer and pooling layer, respectively. The latent variables in the feature layers are binary, whereas those in the pooling layers are multinomial. The variables of any feature layer $F^{(\ell)}$ can be arranged in a three-dimensional grid, with elements $F^{(\ell)}_{f'r'c'}$. The subscripts refer respectively to the feature (also called channel for clarity), row, and column of the given layer. Each of the multinomial variables of a pooling layer can also be arranged in a three-dimensional grid, with elements $H^{(\ell)}_{frc}$, with the same meaning for $f$, $r$, $c$.

There are a total of $C$ layers of each type, numbered from the bottom (closer to the resulting edge map $F^{(1)}$) to the top (closer to the classifier layer $H^{(C)}$, whose role will be detailed below). The variables of each layer depend only on those of the layer above. Therefore, the joint probability of the model can be written as

$$\log p(F^{(1)}, H^{(1)}, \ldots, F^{(C)}, H^{(C)}) = \log p(F^{(1)} \mid H^{(1)}) + \log p(H^{(1)} \mid F^{(2)}) + \cdots + \log p(F^{(C)} \mid H^{(C)}) + \log p(H^{(C)}) \tag{S1}$$

All that remains to completely specify the shape model is to describe the probability of a single feature layer (conditional on the pooling layer on top of it) and a single pooling layer (conditional on the feature layer on top of it, if it exists).

2.2 Feature layers

Each of the variables in a feature layer indicates the presence or absence of a given feature $f$ at a given location$^{1}$ $(r, c)$ in the image $X$.
For instance, we can have a binary variable at $F^{(\ell)}$ accounting for the feature "corner at the center of the image". When that variable is turned ON, a slightly distorted corner shape will be generated close to the center of $X$. The exact shape and location of the corner in $X$ is unknown given $F^{(\ell)}$; that information is encoded by the layers below it, and would change if we were to fix the values of $F^{(\ell)}$ and generate multiple samples of $X$.

[Footnote 1] For simplicity, we will assume in this manuscript that no subsampling is occurring in the layers of the RCN (i.e., in CNN terminology, that we are using a stride of 1). This produces a one-to-one correspondence between the $(r, c)$ locations in any feature layer $F^{(\ell)}$ or pooling layer $H^{(\ell)}$ and the image $X$, which is useful for description purposes. Subsampling is indeed possible (and computationally desirable) in this architecture, and is analogous to using stride values above 1 in CNNs.

Each individual feature variable is independent of the others given the layer above, so we can write

$$\log p(F^{(\ell)} \mid H^{(\ell)}) = \sum_{f'r'c'} \log p(F^{(\ell)}_{f'r'c'} \mid H^{(\ell)}) \tag{S2}$$

Usually, $F^{(\ell)}_{f'r'c'}$ will only depend on a small subset of the variables $H^{(\ell)}$. Each pool variable in $H^{(\ell)}$ is multinomial and has several states, each of them associated to one of the variables in $F^{(\ell)}$, plus a special state which is called the OFF state. The set of feature variables to which a pool is associated are called the pool members. Multiple pools can share the same pool member.

To make the above idea more concrete, we will consider a type of connectivity from pools to features called "translational pooling". In translational pooling, each pool $H^{(\ell)}_{frc}$ has as pool members the $\{F^{(\ell)}_{f'r'c'} : |r' - r| < \mathrm{vps}_\ell,\ |c' - c| < \mathrm{hps}_\ell,\ f' \text{ constant}\}$, where $\mathrm{vps}_\ell$ and $\mathrm{hps}_\ell$ are the vertical and horizontal pool shapes respectively, and $f'$ is the feature that $H^{(\ell)}_{frc}$ endows with local translation ability. In the common case in which square pools are used, we use a single parameter to define their height and width: pool size $= 2 \times \mathrm{hps}_\ell - 1 = 2 \times \mathrm{vps}_\ell - 1$.
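The translational-pooling rule above can be illustrated with a minimal Python sketch. The function names, the grid bounds `n_rows` and `n_cols`, the clipping of members at the grid border, and the choice of giving the members the same channel index as the pool are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import List, Tuple


def pool_members(f: int, r: int, c: int, vps: int, hps: int,
                 n_rows: int, n_cols: int) -> List[Tuple[int, int, int]]:
    """List the feature variables (f', r', c') that a pool at (f, r, c) can activate.

    Implements |r' - r| < vps and |c' - c| < hps with a constant member channel.
    Assumptions of this sketch: the member channel equals the pool channel f,
    and members outside the n_rows x n_cols grid are dropped.
    """
    members = []
    for dr in range(-(vps - 1), vps):
        for dc in range(-(hps - 1), hps):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                members.append((f, rr, cc))
    return members


def square_pool_size(ps: int) -> int:
    """Pool size for a square pool with hps = vps = ps: 2 * ps - 1."""
    return 2 * ps - 1


interior = pool_members(f=0, r=5, c=5, vps=2, hps=2, n_rows=10, n_cols=10)
corner = pool_members(f=0, r=0, c=0, vps=2, hps=2, n_rows=10, n_cols=10)
print(len(interior), len(corner), square_pool_size(2))
```

With `vps = hps = 2`, an interior pool has a 3 x 3 neighborhood of members, which matches the pool size formula; a pool at the grid corner keeps only the members that fall inside the grid under the clipping assumption.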
A pool can either be OFF or activating any of the feature variables with feature index $f'$ and in the vicinity of its position $(r, c)$. Note that a pool can only activate a single pool member at a time.

Once the connectivity between a pooling layer and the feature layer immediately below has been set (for instance, using translational pooling), the conditional probability of each feature variable is simply

$$p(F^{(\ell)}_{f'r'c'} = 1 \mid H^{(\ell)}) = \begin{cases} 1 & \text{if any pool in } H^{(\ell)} \text{ is in a state associated to } F^{(\ell)}_{f'r'c'} \\ 0 & \text{otherwise} \end{cases} \tag{S3}$$

In other words, $F^{(\ell)}$ is deterministic given $H^{(\ell)}$. There are multiple $H^{(\ell)}$ that can result in the same $F^{(\ell)}$, so the relation is not bijective.

The top feature layer $F^{(C)}$ is assumed to contain complete objects. At $F^{(C)}$ each channel accounts for a different object type, whereas each $(r, c)$ location within that channel accounts for a different location of the generated object.

2.3 Pooling layers

Unlike feature layers, pooling layers are not deterministic, so all the variability in image generation from an RCN comes from the pooling layers. This makes them quite a bit more complex than feature layers.

As explained before, each multinomial pool variable in a pooling layer $H^{(\ell)}$ can be in any of several states, one of which is the OFF state and the others are feature variables of layer $F^{(\ell)}$, which form the members of that pool. One can think of the members of a pool as different alternatives that express slight variations of the same feature. For instance, the pool members of a translational pool will be features that look identical but are located at different positions, within a radius of the pool position.

There is an implicit top-level pooling layer $H^{(C)}$ over classifier features. This layer has a single multinomial variable that chooses among the different top-level features.

[Footnote 2] Since we are taking the generative perspective, readers familiar with the CNN literature might prefer to think of them as unpooling layers.
Since this is a probabilistic model, the same layer can actually be performing pooling or unpooling operations depending on the task (sampling or inference).

The performance of RCN can be significantly improved by the use of lateral connections. We will first consider the simpler no-laterals case and then move on to the laterally connected model.

2.3.1 No lateral connections

In the absence of laterals, each individual pool state is also independent of all others given its parent on the layer above, so we can write

$$\log p(H^{(\ell)} \mid F^{(\ell+1)}) = \sum_{frc} \log p(H^{(\ell)}_{frc} \mid F^{(\ell+1)}) = \sum_{frc} \log p(H^{(\ell)}_{frc} \mid F^{(\ell+1)}_{f^p r^p c^p}) \tag{S4}$$

where $F^{(\ell+1)}_{f^p r^p c^p}$ is the feature that activates pool $H^{(\ell)}_{frc}$ (i.e., $F^{(\ell+1)}_{f^p r^p c^p}$ is the parent of $H^{(\ell)}_{frc}$). Note that each pool variable has a single parent (but each feature variable can have multiple children).

Each feature $F^{(\ell+1)}_{f'r'c'}$ activates some of the pools located close to it; i.e., some of the pools $\{H^{(\ell)}_{frc} : |r' - r| < \mathrm{vfs}_\ell,\ |c' - c| < \mathrm{hfs}_\ell\}$, where $\mathrm{vfs}_\ell$ and $\mathrm{hfs}_\ell$ are respectively the vertical and horizontal feature shape. Note that a feature variable at level $\ell + 1$ and channel $f'$ can activate a set of pools at layer $\ell$ at multiple different channels $\{f\}$, whereas in the previous section we saw that, given translational pooling, a pool at level $\ell$ and channel $f$ can only activate features at level $\ell$ with all of them residing in the same channel $f'$. Another important distinction with the previous section is that two different features cannot activate the same pool (unlike pools, which can have the same feature variable within their pool members, sharing it). In the cases in which it would be desirable for multiple features to activate the same pool, we can just make a copy (i.e., create an additional channel in that pooling layer with the same connectivity to the layers below) and use that pool instead.

A feature can be described in terms of the position (relative to its own) of the pools that it activates. Whenever any of the features that activate a given pool $H^{(\ell)}_{frc}$ is set to 1, then the pool cannot be in the OFF state.
This defines the probability density of the pooling layers as

$$p(H^{(\ell)}_{frc} = \text{pool member } m \mid F^{(\ell+1)}) = \begin{cases} 1/M & \text{if parent feature is 1} \\ 0 & \text{if parent feature is 0} \end{cases} \tag{S5}$$

where $M$ is the number of pool members for that pool. This of course implies that $p(H^{(\ell)}_{frc} = \text{OFF} \mid F^{(\ell+1)}) = 1$ when the parent feature is 0. This in turn means that the joint probability of the latent variables for a valid assignment (one that does not result in the joint probability being zero) is

$$\log p(F^{(1)}, H^{(1)}, \ldots, F^{(C)}, H^{(C)}) = -\sum_{\text{active pools}} \log(\text{number of pool members of active pool}) \tag{S6}$$

The top layer $H^{(C)}$ in Eq. S1 is a pool with no parents. The OFF probability for that pool is 0 (it always activates exactly one pool member), and it contains as pool members all the feature variables of the next layer. That means it can generate any object (we regard the top-level features as entire objects) at any location. This pooling layer is special in the sense that it only has a single pool, whereas all the other pooling layers, even if they consist of a single channel, have multiple pools following a topographic arrangement.

The hierarchical model as described so far is directly useful to model edge maps, but turns out to be too flexible in many cases. If we use translational pools, we can control the amount of distortion in the generation process by changing the pool shape. But we found that no setting of this parameter produced entirely reasonable results: small values resulted in too rigid images that could not adapt to the distortions found in actual images (e.g., a corner made of many pools with close-to-zero pool shape results in two almost rigid edges), while large values resulted in shapes with discontinuous edges (e.g.,
the previous corner would look like a cloud of points with no clear edges, since each pool choice is independent of the rest).

To solve the above problem, we would like a feature to spawn multiple pools that behave in a coordinated way: a corner shape could be distorted while still looking like a corner and exhibiting edge continuity.

2.3.2 Lateral connections

In the no-laterals case, the term $p(H^{(\ell)} \mid F^{(\ell+1)})$ is fully factorized, as shown in Eq. S4. This implies no coordination among the pool choices. We can coordinate these pool choices by entangling all the pool choices belonging to the same parent. The factorization is now

$$\log p(H^{(\ell)} \mid F^{(\ell+1)}) = \sum_{f'r'c'} \log p(\{H^{(\ell)}_{frc} : \text{whose parent is } F^{(\ell+1)}_{f'r'c'}\} \mid F^{(\ell+1)}_{f'r'c'}) \tag{S7}$$

We can then define the joint probability of a set of pools with a common parent by introducing pairwise constraints between some (or all) of the pool states. Consider a set of pools $H_1, \ldots, H_J$ with a common parent feature $F$, where we drop layer and position indicators for simplicity. According to the previous description, $F$ is defined by a set of triplets $(f_1, \Delta r_1, \Delta c_1), \ldots, (f_J, \Delta r_J, \Delta c_J)$ specifying which pools in the layer below it should activate, with their positions being encoded relative to the position of $F$ itself. We can augment that information by including a set of pairwise constraints $\{c_{Fij}(s_i, s_j)\}$ between the states of pools $i$ and $j$ (which we will denote $s_i$ and $s_j$), indicating which pairs of states are allowed (those for which the constraint has value 1) and which are not.

With each feature $F$ specifying its own set of constraints $\{c_{Fij}(s_i, s_j)\}$ for its children pools, we get the following joint density

$$p(H_1 = s_1, \ldots, H_J = s_J \mid F = 1) = \begin{cases} 1/S_F & \text{if } c_{Fij}(s_i, s_j) = 1 \ \ \forall i, j \\ 0 & \text{otherwise} \end{cases} \tag{S8}$$

where $S_F$ is the number of joint states allowed by the constraints of $F$.
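To make the role of $S_F$ in Eq. S8 concrete, the following toy Python sketch enumerates the joint states of a few sibling pools by brute force and counts those allowed by a pairwise constraint. Everything here is an illustrative assumption: the three pools, their three column offsets, and the `chain_constraint` rule are hypothetical, and exhaustive enumeration is only feasible for tiny examples; it is not how inference is described in the source.

```python
import itertools
import math
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int]  # (delta_r, delta_c) offset of the selected pool member


def allowed_joint_states(
    pool_states: Sequence[Sequence[State]],
    constraint: Callable[[int, int, State, State], bool],
) -> List[Tuple[State, ...]]:
    """Enumerate joint assignments in which every pair of pools satisfies the constraint."""
    allowed = []
    for joint in itertools.product(*pool_states):
        if all(constraint(i, j, joint[i], joint[j])
               for i in range(len(joint))
               for j in range(i + 1, len(joint))):
            allowed.append(joint)
    return allowed


def no_constraint(i: int, j: int, si: State, sj: State) -> bool:
    """No laterals: every pair of states is allowed."""
    return True


def chain_constraint(i: int, j: int, si: State, sj: State) -> bool:
    """Hypothetical lateral: adjacent pools may differ by at most one column."""
    return abs(i - j) != 1 or abs(si[1] - sj[1]) <= 1


# Three sibling pools, each with three members at column offsets -1, 0, +1.
offsets = [(0, dc) for dc in (-1, 0, 1)]
pools = [offsets, offsets, offsets]

free = allowed_joint_states(pools, no_constraint)
coordinated = allowed_joint_states(pools, chain_constraint)

S_F = len(coordinated)            # number of joint states allowed by the constraints
log_p_valid = -math.log(S_F)      # log-probability of any one valid joint state (Eq. S8)
sample = random.choice(coordinated)  # uniform sample over the allowed joint states
print(len(free), S_F, sample)
```

Without constraints the three pools have 3 x 3 x 3 = 27 equally likely joint states, which agrees with multiplying the per-pool 1/M factors of the no-laterals case. The hypothetical constraint removes the assignments in which neighboring pools jump by two columns, leaving a smaller set over which the probability is uniform.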
This of course implies

$$p(H_1 = \text{OFF}, H_2 = \text{OFF}, \ldots, H_J = \text{OFF} \mid F = 0) = 1 \tag{S9}$$

In this case, the joint probability of the latent variables for a valid assignment is

$$\log p(F^{(1)}, H^{(1)}, \ldots, F^{(C)}, H^{(C)}) = -\sum_{F :\, F = 1} \log S_F \tag{S10}$$

There are many sets of pairwise constraints that could produce satisfactory pool coordination effects. A typical pool coordination constraint in our work is called perturb laterals: we first expand each state $s_i$ of pool $H_i$ (which is located at $(r_i, c_i)$) to the tuple describing the member feature to which it is associated, $s_i = (\Delta r_i, \Delta c_i)$, with the delta positions being relative to the pool center and the channel $f$ of each member feature being the same as that of its pool. Then, the perturb lateral constraint can be written as follows

$$c(s_i, s_j) = c((\Delta r_i, \Delta c_i), (\Delta r_j, \Delta c_j)) = \left[\, |\Delta r_i - \Delta r_j| < |r_i - r_j| / \mathrm{pf}_\ell \ \wedge\ |\Delta c_i - \Delta c_j| < |c_i - c_j| / \mathrm{pf}_\ell \,\right], \tag{S11}$$

where $\mathrm{pf}_\ell$ is the perturbation factor for layer $\ell$. In this way, the maximum perturbation in the relative position of two pools is proportional to their distance, with $\mathrm{pf}_\ell$ being the (inverse) proportionality constant.

2.4 Translational invariance

Let us sum up the parameters of this architecture:

• Feature parameters, $(f_1, \Delta r_1, \Delta c_1), \ldots, (f_J, \Delta r_J, \Delta c_J)$ and $\{c_{Fij}(s_i, s_j)\}$, specifying connectivity to pools in the layer below and constraints among their selected pool members when the feature is active. These parameters are potentially different for each binary feature variable in the entire hierarchy.

• Pool parameters, $(\Delta r_1, \Delta c_1), \ldots, (\Delta r_M, \Delta c_M)$, specifying connectivity to pool members in the feature layer below. These parameters are potentially different for each pool in the hierarchy.

Even though this is not necessarily a property of the above hierarchy, in vision it is generally useful to have a model that shows equivariance under translations of the input. Equivariance in this case means that, given an image and a set of hidden variables, the model's likelihood will not change if we shift the input image and the position of the values of all the hidden variables by the same amount.
The hidden variable $H^{(C)}$ is unique and cannot be shifted, so instead its value will be shifted accordingly. This can easily be achieved in our hierarchy if we set the parameters above corresponding to the same channel and layer to be identical. This is the convolutional assumption from which CNNs are named. Additionally, this results in a reduction of the free parameters in the model.

Because child features are allowed to overlap, local translations of features can also produce size invariance of a higher-level feature or object, as shown in Fig. S1.

[Figure S1 panels: (a) Size variations; (b) Input with multiple rectangles]

Figure S1: (a) Representation of a rectangle as a conjunction of four pools, each representing a corner. Translations of the corners can represent size variations of the rectangle because RCN allows child features to overlap. The corner features are rendered in different thicknesses and shades to show the overlap between them. Each corner feature itself can be broken down further into line segment features to produce more local variations. Note that different aspect ratios and translations of a rectangle can be generated from the same higher-level feature node. The factor between pool 1 and pool 4 is omitted for clarity. (b) An input image with multiple rectangles. When the rectangles are well separated they will be represented by different top-level feature nodes at different translations. Since each of the overlapping rectangles could be represented by a single top-level feature node, explaining both the rectangles would require two copies of the same feature at the top level. In general, it can be assumed that the same feature node is copied a fixed number of times, allowing for multiple instances of the same object to appear roughly at the same location. In practice, this effect can also be approximately achieved without making copies of the features by using the same hierarchy multiple times until all evidence is explained.
See Section 4.7 for more details on scene parsing.

3 Combining shape and appearance

In this section, we show how to combine the shape model described in the previous section with an appearance model. The previous model generates the edge map $F^{(1)}$, which is an array of binary variables indexed by $(f, r, c)$. Each of the features $f$ corresponds to a patch descriptor, which is defined by a set of IN and OUT variables and a set of edges arranged inside a rectangular patch (see Fig. S2).

[Figure S2: 3 × 3 patches with cells labeled IN or OUT]

Figure S2: Patch descriptors: the three possible orientations of a 45° edge and 3 × 3 patch size.

When a variable at $(f, r, c)$ within the edge map $F^{(1)}$ is turned ON, a small edge of orientation $f$ will appear in the final image at the row and column $(r, c)$. The edges are oriented because, in the case of an outer edge, one of the sides is defined to be IN while the other is defined to be OUT. For instance, there are two separate features corresponding to an outer vertical edge. One of them defines IN to be on the left of the edge (and OUT on the right), whereas the other feature defines IN to be on the right of the edge (and OUT on the left). In the case of an inner edge in an object (i.e., an edge that separates two interior regions of the object), both sides of the edge are defined as IN. For inner edges there is additional information showing the border between the IN values of each side of the edge, as depicted in Fig. S2(c). Fig. S3(a) contains two types of vertical outer edges, two types of horizontal outer edges, and a diagonal inner edge (which is precisely the one shown in Fig. S2(c)). Therefore, for each rotation of an edge, there are three possible orientations: two corresponding to outer edges plus one corresponding to an inner edge.
A typical number of edge map features is 48, accounting for the three orientations of 16 different rotations.

Let us denote by $Y$ the canvas on which both the oriented edges of the edge map are going to be drawn, and the interiors of objects are going to be filled in with some texture or color. $Y$ is a 2D array of multinomial random variables, indexed by row and column $(r, c)$. The row and column indices correspond to the row and column of pixels in the final image; there is a one-to-one correspondence between the variables in $Y$ and the observed image pixels $X$. The state of each multinomial variable of $Y$ represents both a color (or texture) and the IN or OUT state. Therefore, each variable of $Y$ has twice as many states as required to represent all possible colors (or textures).

$Y$ is a conditional random field (CRF) with a fixed structure of 4-connected variables, so that each variable has an edge connecting it to the 4 other variables in its immediate neighborhood. These edges represent pairwise potentials, and there also exist per-variable unary potentials. The potentials depend on the values of $F^{(1)}$ (making the random field conditional). Each active variable in $F^{(1)}$, following its patch descriptor (see Figs. S2 and S3), forces some of the variables of $Y$ to be in an IN or an OUT state, and for inner edges modifies the pairwise potentials between variables in an IN state that belong to different regions, as depicted in Fig. S2(c). The set of pairs of adjacent locations in $Y$ with an inner edge active between them (according to $F^{(1)}$) is called IE. The remaining set of pairs of adjacent locations (i.e., all those that do not have an inner edge activation between them coming from $F^{(1)}$) are ca

The variables of $Y$ obey the following conditional density:

[Equation S12, garbled in extraction: $p(Y \mid F^{(1)})$, a product of potentials with factors over pairs of adjacent locations $((r, c), (r', c'))$ in IE and over the remaining pairs]
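The fixed structure of $Y$ described above can be sketched in a few lines of Python. This is only an illustration of the counting and of the 4-connected neighborhood: the function names, the grid size, the choice of which pair carries an inner edge, and the value `n_colors=8` are hypothetical, and the potentials themselves are not modeled here.

```python
from typing import List, Set, Tuple

Loc = Tuple[int, int]
Pair = Tuple[Loc, Loc]


def four_connected_pairs(n_rows: int, n_cols: int) -> List[Pair]:
    """All pairs of vertically or horizontally adjacent locations on the grid."""
    pairs = []
    for r in range(n_rows):
        for c in range(n_cols):
            if r + 1 < n_rows:
                pairs.append(((r, c), (r + 1, c)))
            if c + 1 < n_cols:
                pairs.append(((r, c), (r, c + 1)))
    return pairs


def split_pairs(pairs: List[Pair], inner_edge_pairs: Set[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split adjacent pairs into those with an active inner edge (IE) and the rest."""
    ie = [p for p in pairs if p in inner_edge_pairs]
    rest = [p for p in pairs if p not in inner_edge_pairs]
    return ie, rest


def n_edge_features(n_rotations: int = 16, n_orientations: int = 3) -> int:
    """Edge-map channels: orientations (two outer, one inner) per rotation."""
    return n_rotations * n_orientations


def n_canvas_states(n_colors: int) -> int:
    """Each canvas variable encodes a color (or texture) and an IN/OUT flag."""
    return 2 * n_colors


pairs = four_connected_pairs(3, 4)
neighbors_of_center = [p for p in pairs if (1, 1) in p]
ie, rest = split_pairs(pairs, {((1, 1), (1, 2))})  # hypothetical inner edge location
print(len(pairs), len(neighbors_of_center), len(ie), len(rest))
print(n_edge_features(), n_canvas_states(n_colors=8))
```

On a 3 x 4 grid, an interior variable such as the one at (1, 1) takes part in four pairwise terms, matching the 4-connected structure, and the default arguments of `n_edge_features` reproduce the count of 48 edge-map features quoted in the text.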
