Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding (ICCV 2017)

...as the mean of their word embeddings. After that, some sophisticated models such as the SDT-RNN [29] are proposed to learn sentence embedding representations. Recently, Deep Structure-Preserving embedding (DeepSP) [34] is proposed for image-text embedding and achieves state-of-the-art performance.

For dense embedding, the most related works are DeepVS [14] and DeFrag [15], which also align words and short phrases within sentences to bounding boxes. In DeepVS [14], a Markov Random Field (MRF) is additionally applied to connect neighboring words into phrases in order to build phrase representations. In contrast, our hierarchical model naturally generates syntactically correct phrases and naturally builds their representations. In DeFrag [15], although tree parsing is leveraged for phrase representation, the phrases are represented independently, so the tree structure is effectively discarded in favor of a simpler model. In contrast, the hierarchical relations among phrases are explicitly modeled by our method. Moreover, the phrases are modeled jointly rather than independently in our approach.

Image Caption Generation. Many methods have been proposed for image caption generation [22][17][33][5][30]. They aim to generate descriptions by sampling from conditional neural language models. In particular, an 'encoder-decoder' framework [17][3] is often adopted, where a CNN is used to represent an image and an RNN is used to generate descriptions conditioned on the image representation.

3. Our Approach

We attempt to map full sentences, phrases, whole images, and image regions into a common space. Therefore, our approach needs not only to learn the phrase-level correspondences (i.e., the correspondences between phrases and image regions) but also to learn a multimodal embedding space containing all the sentences, phrases, images, and image regions.

Specifically, each sentence is first represented as a constituency tree with the Stanford Parser [18], where each intermediate node indicates a phrase and the root node indicates the full sentence. Meanwhile, for each image, the Region Convolutional Neural Network (R-CNN) [7] is adopted to extract a feature representation for each image region generated by an object proposal method [32]. Next, if the phrase-level correspondences are known, our HM-LSTM model can utilize them to conduct the embedding learning. In particular, a loss layer is introduced to connect each noun phrase node to an image region, as shown in Fig. 4. Finally, all the losses (including the 'phrase-region' losses and the 'sentence-image' losses) are simultaneously minimized to learn the embedding space.

However, only the sentence-level (rather than the phrase-level) correspondences are known at the beginning. But if we have the representations of all phrases and image regions, it is easy to establish their correspondences, e.g., by measuring the similarities between their representations. Thus, our approach takes an alternating learning procedure: the multimodal embedding space and the phrase-level correspondences are learned alternately.

In particular, we have an initial learning stage, where only the 'sentence-image' losses are minimized to learn a simplified HM-LSTM model that produces the initial representations for all the phrases and image regions; these can further be used to construct the initial phrase-level correspondences. After that, the full HM-LSTM model (in which both sentence-level and phrase-level losses are minimized) is learned, and the embedding learning and the correspondence learning are conducted iteratively, as shown in Fig. 3.

Input: the 'sentence-image' pairs in the dataset {(S_d, I_d)}_{d=1}^{D}.
1. Initialization stage: coarse-grained embedding learning. Only the known sentence-level correspondences are utilized to learn a simplified HM-LSTM model. Then the initial representations for phrases and image regions are estimated.
2. Loop for t = 1, ..., T:
   (a) Phrase-level correspondence learning. Given the learned representations of phrases and regions, we establish 'phrase-region' correspondences {(S_{d,k}, I_{d,k})}_k for each image by measuring their similarity (see Section 3.3).
   (b) Fine-grained embedding learning. Given the previous phrase-level correspondences, the HM-LSTM model is learned to update the phrase and region representations (see Section 3.2.2).
Output: the representations of sentences, phrases, images, and image regions, i.e., {(h_{d,k}, v_{d,k})}_{d=1..D, k=0..K_d}.
Figure 3. The iterative learning procedure for the hierarchical multimodal embedding.
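The procedure of Figure 3 can be summarized in code. The following is a toy, self-contained sketch in PyTorch: random feature matrices stand in for the parsed phrases and the R-CNN region features, two linear layers stand in for the text and image branches, and the margin, dimensions, and schedule are all illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

D, K, DIM = 8, 4, 32                       # images, phrases/regions per image, embedding dim
phrase_feat = torch.randn(D, K + 1, 300)   # index 0 = full sentence, 1..K = noun phrases
region_feat = torch.randn(D, K + 1, 4096)  # index 0 = whole image,   1..K = region proposals

text_enc = torch.nn.Linear(300, DIM)       # stand-in for the H-LSTM text branch
img_enc = torch.nn.Linear(4096, DIM)       # stand-in for v = W_m[CNN(I_b)] + b_m
opt = torch.optim.Adam(list(text_enc.parameters()) + list(img_enc.parameters()), lr=1e-3)

def embed():
    h = F.normalize(text_enc(phrase_feat), dim=-1)   # sentence/phrase embeddings
    v = F.normalize(img_enc(region_feat), dim=-1)    # image/region embeddings
    return h, v

def margin_loss(v, h, m=0.2):
    # bidirectional ranking loss; matching pairs sit on the diagonal (cf. Eq. (9) below)
    s = v @ h.t()
    pos = s.diag().unsqueeze(1)
    cost = (m - pos + s).clamp(min=0) + (m - pos.t() + s).clamp(min=0)
    return (cost * (1.0 - torch.eye(s.size(0)))).sum()

# 1. Initialization stage: only the sentence-level ('sentence-image') loss is minimized.
for _ in range(200):
    h, v = embed()
    loss = margin_loss(v[:, 0], h[:, 0])
    opt.zero_grad(); loss.backward(); opt.step()

# 2. Alternate (a) correspondence learning and (b) fine-grained embedding learning.
for t in range(3):
    with torch.no_grad():                            # (a) best-matched region per phrase
        h, v = embed()
        match = [(v[d, 1:] @ h[d, 1:].t()).argmax(dim=0) for d in range(D)]
    for _ in range(200):                             # (b) sentence-level + phrase-level losses
        h, v = embed()
        loss = margin_loss(v[:, 0], h[:, 0])
        for d in range(D):
            loss = loss + margin_loss(v[d, 1:][match[d]], h[d, 1:])
        opt.zero_grad(); loss.backward(); opt.step()
```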
3.1. Image Embedding

We follow the work of [14] to represent images. In particular, object proposals are extracted with the selective search method [32] and represented with an R-CNN [7]. Following Karpathy et al. [14], we adopt the top 19 detected locations in addition to the whole image, and compute the representations based on the pixels I_b inside each bounding box as follows:

v_m = W_m [\mathrm{CNN}(I_b)] + b_m    (1)

where CNN(I_b) transforms the pixels inside the bounding box I_b into the 4096-dimensional activations of the fully connected layer immediately before the classifier.
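Eq. (1) is just an affine projection of the 4096-dimensional fc7 activations into the embedding space. A minimal sketch follows; the embedding dimension of 512 is taken from Section 5.1.1, and the random tensor stands in for real R-CNN activations.

```python
import torch

embed_dim = 512                            # dimension of the joint embedding space (Section 5.1.1)
proj = torch.nn.Linear(4096, embed_dim)    # v_m = W_m [CNN(I_b)] + b_m  (Eq. 1)

# One 4096-d fc7 activation per box: the top-19 proposals plus the whole image.
cnn_feats = torch.randn(20, 4096)          # placeholder for real R-CNN activations
region_embeddings = proj(cnn_feats)        # 20 x 512 region/image embeddings
```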
3.2. Hierarchical Multimodal Embedding

Given the phrase-level correspondences, our HM-LSTM model is able to learn a dense embedding space containing all the sentences, phrases, images, and image regions. In particular, we first review the Tree-LSTM model [31], which was recently proposed for sentence embedding. We then extend it to a syntax-aware model, namely the Hierarchical LSTM (H-LSTM), in which noun phrases and the other phrases are modeled distinctly. Finally, our HM-LSTM model is built on the H-LSTM; it is a multimodal model for the joint embedding of sentences, phrases, images, and image regions.

Figure 4. The structure of our HM-LSTM. Each sentence is parsed as a tree, where the intermediate nodes indicate the phrases within the sentence. Some noun phrases (NP) h_{d,k} are associated with the corresponding image regions v_{d,k} by specific loss layers loss_{d,k}, each built from fully connected, batch normalization, and L2-normalization layers on top of the R-CNN and H-LSTM outputs.

3.2.1 Hierarchical LSTM

Recently, the Tree-LSTM model [31] was proposed to explicitly model the hierarchical structure of sentences. In particular, a sentence is parsed as a tree, where the root indicates the full sentence and the intermediate nodes indicate the phrases within the sentence.

In Tree-LSTM, children nodes are treated equally when connected to their parent node, without considering their syntax type: noun-phrase children and the other phrase children (e.g., verb phrases) are treated alike. However, since our task mostly focuses on objects, noun phrases and the other phrases should be modeled with different emphasis, i.e., noun-phrase children should have larger contributions than the other phrase children.

To this end, we extend the Tree-LSTM to a syntax-aware model, namely the Hierarchical LSTM (H-LSTM). Specifically, each unit of the H-LSTM (indexed by j) contains an input gate i_j, an output gate o_j, a memory cell c_j, and a hidden state h_j. Suppose node j has N(j) noun-phrase children and \hat{N}(j) other phrase children; each H-LSTM unit then has N(j) forget gates f_{jk}, k \in N(j), and \hat{N}(j) forget gates \hat{f}_{jl}, l \in \hat{N}(j), as in Eq. (3) and Eq. (4).

For a parent node j, the hidden states of its noun-phrase children h_k, k \in N(j), and of the other phrase children h_l, l \in \hat{N}(j), are summed up separately (denoted \tilde{h}_j and \hat{h}_j) before impacting the parent node j, as in Eq. (2). Furthermore, \tilde{h}_j and \hat{h}_j affect the input gate i_j through distinct parameters U^{(i)} and \hat{U}^{(i)}, as shown in Eq. (5); the same holds for the output gate o_j and the memory cell c_j, as shown in Eq. (6) and Eq. (7). This allows the H-LSTM to sufficiently account for the syntax type of children nodes:

\tilde{h}_j = \sum_{k \in N(j)} h_k, \qquad \hat{h}_j = \sum_{l \in \hat{N}(j)} h_l    (2)

f_{jk} = \sigma( W^{(f)} x_j + U^{(f)} h_k + b^{(f)} ), \quad k \in N(j)    (3)

\hat{f}_{jl} = \sigma( \hat{W}^{(f)} x_j + \hat{U}^{(f)} h_l + \hat{b}^{(f)} ), \quad l \in \hat{N}(j)    (4)

i_j = \sigma( W^{(i)} x_j + U^{(i)} \tilde{h}_j + \hat{U}^{(i)} \hat{h}_j + b^{(i)} )    (5)

o_j = \sigma( W^{(o)} x_j + U^{(o)} \tilde{h}_j + \hat{U}^{(o)} \hat{h}_j + b^{(o)} )    (6)

u_j = \tanh( W^{(u)} x_j + U^{(u)} \tilde{h}_j + \hat{U}^{(u)} \hat{h}_j + b^{(u)} ), \qquad c_j = i_j \odot u_j + \sum_{k \in N(j)} f_{jk} \odot c_k + \sum_{l \in \hat{N}(j)} \hat{f}_{jl} \odot c_l    (7)

h_j = o_j \odot \tanh(c_j)    (8)

As in the standard LSTM, each H-LSTM leaf node takes an input vector x_t. In our application, each x_t is the vector representation of a word, determined as x_t = W_w 1_t, where 1_t is an indicator column vector with a single one at the index of the t-th word in the vocabulary. The weights W_w specify a word embedding matrix, which we initialize with 300-dimensional word2vec [24] weights and keep fixed due to overfitting concerns. In addition, as in the Tree-LSTM model, the hidden state h_j of a node is regarded as the representation of the corresponding phrase.
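The following is an illustrative re-implementation of a single H-LSTM unit following Eqs. (2)-(8), written in PyTorch. The class name, the packing of the i/o/u gates into one linear layer, and the dimensions are our own choices rather than the authors' code; the per-child forget gates follow the child-sum Tree-LSTM form of [31], with separate parameters for noun-phrase and other-phrase children.

```python
import torch
import torch.nn as nn

class HLSTMNode(nn.Module):
    """One H-LSTM unit: distinct parameters for noun-phrase (NP) children
    and other-phrase children, per Eqs. (2)-(8)."""
    def __init__(self, in_dim, mem_dim):
        super().__init__()
        # W x_j terms (shared input x_j) and the two child-sum terms U h~_j, U^ h^_j
        self.W = nn.Linear(in_dim, 3 * mem_dim)               # for the i, o, u gates
        self.U_np = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        self.U_other = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        # forget gates: one per child, with separate parameters per child type
        self.Wf_np = nn.Linear(in_dim, mem_dim)
        self.Uf_np = nn.Linear(mem_dim, mem_dim, bias=False)
        self.Wf_other = nn.Linear(in_dim, mem_dim)
        self.Uf_other = nn.Linear(mem_dim, mem_dim, bias=False)
        self.mem_dim = mem_dim

    def forward(self, x, np_children, other_children):
        # np_children / other_children: lists of (h_k, c_k) pairs (may be empty)
        zero = x.new_zeros(self.mem_dim)
        h_np = sum((h for h, _ in np_children), zero)         # h~_j  (Eq. 2)
        h_other = sum((h for h, _ in other_children), zero)   # h^_j  (Eq. 2)

        gates = self.W(x) + self.U_np(h_np) + self.U_other(h_other)
        i, o, u = gates.chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)   # Eqs. (5)-(7)

        c = i * u
        for h_k, c_k in np_children:                          # f_{jk}, k in N(j)   (Eq. 3)
            f = torch.sigmoid(self.Wf_np(x) + self.Uf_np(h_k))
            c = c + f * c_k
        for h_l, c_l in other_children:                       # f_{jl}, l in N^(j)  (Eq. 4)
            f = torch.sigmoid(self.Wf_other(x) + self.Uf_other(h_l))
            c = c + f * c_l

        h = o * torch.tanh(c)                                 # Eq. (8)
        return h, c

# Toy usage: a parent node with one noun-phrase child and one other-phrase child.
node = HLSTMNode(in_dim=300, mem_dim=512)
x = torch.randn(300)
child = (torch.randn(512), torch.randn(512))
h_j, c_j = node(x, np_children=[child], other_children=[child])
```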
3.2.2 Hierarchical Multimodal LSTM

Based on the H-LSTM, we propose the Hierarchical Multimodal LSTM (HM-LSTM) to jointly embed images, image regions, sentences, and phrases into a common space. Let I_{d,k} denote the k-th image region in the d-th image and S_{d,k} the corresponding phrase; let I_{d,0} denote the d-th full image and S_{d,0} the corresponding full sentence. If all the 'phrase-region' pairs {(S_{d,k}, I_{d,k})}_{d=1..D, k=0..K_d} are known, we learn the HM-LSTM as follows: an H-LSTM model is first constructed for each sentence, and for each 'phrase-region' pair (S_{d,k}, I_{d,k}) a loss layer loss_{d,k} is introduced. Inspired by DeepSP [34], we use a two-branch network instead of a simple loss layer for each 'phrase-region' pair. Specifically, each branch is composed of one fully connected layer (W_t for text and W_m for images), one Batch Normalization (BN) layer [12], and one L2-normalization layer, as shown in Fig. 4. Note that batch normalization accelerates the training and also makes the gradient updates more stable.

Let v_{d,k} denote the representation of I_{d,k} and h_{d,k} the representation of S_{d,k}, and let s(v, h) be a scoring function that measures the similarity of two embeddings. For each 'phrase-region' pair (S_{d,k}, I_{d,k}) we define a contrastive loss on their representations:

loss_{d,k} = \sum_i \max[ 0, m - s(v_{d,k}, h_{d,k}) + s(v_{d,k}, h_{d,i}) ] + \sum_i \max[ 0, m - s(h_{d,k}, v_{d,k}) + s(h_{d,k}, v_{d,i}) ]    (9)

where m is the margin, h_{d,i} is a contrastive (non-matching) phrase for the image region I_{d,k}, and vice versa for v_{d,i}.

Next, the total loss is defined as the weighted sum of all losses:

L = \sum_{d=1}^{D} \sum_{k=0}^{K_d} w_{d,k} \cdot loss_{d,k}    (10)

where w_{d,k} is the weight of the k-th 'phrase-image region' pair. The term loss_{d,0} is the loss at the root layer for the d-th image, and loss_{d,k}, k = 1, ..., K_d, are the losses at the intermediate layers, as shown in Fig. 4. The weight w_{d,k} is determined from the learning of the phrase-level correspondences, i.e., w_{d,k} is set according to the confidence of the correspondence for the k-th 'phrase-region' pair.

3.3. Phrase-level Correspondences

Before learning the HM-LSTM, we need to obtain the phrase-level correspondences. We address this problem by measuring the representation similarities among phrase candidates and image-region candidates.

Specifically, given the image-region candidates (i.e., the top-19 object proposals), their representations are easily obtained according to Eq. (1). Meanwhile, each sentence is parsed as a tree, where each intermediate node represents a phrase. Since we are mainly interested in objects in an image, only noun phrases are selected as the phrase candidates. This selection is trivial because the syntax type of each phrase (noun phrase, verb phrase, adjective phrase, etc.) is available after parsing.

With those image-region and phrase candidates, we can establish the 'phrase-region' correspondences according to their representations. In particular, we compute a matrix S measuring the similarities of the candidate representations, where each element s_{ij} = v_i^T h_j is the similarity score between the image region v_i and the phrase h_j. For each phrase we then select the best-matched image region, which establishes the 'phrase-region' pairs, as shown in Fig. 5. Besides, for each generated 'phrase-region' pair (v_i, h_j), the similarity score s_{ij} is regarded as the confidence of the correspondence and is used to set its weight in Eq. (10).

Figure 5. Correspondences between phrases and image regions, illustrated with the sentence 'Two people sitting on rocking chairs on the deck' and its noun phrases 'two people', 'rocking chairs', and 'the deck'.
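A compact sketch of the matching step of Section 3.3 together with the losses of Eqs. (9) and (10) is given below. It assumes embeddings that are already L2-normalized so that the dot product serves as the scoring function s(.,.); the margin value and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def match_phrases_to_regions(v, h):
    """Section 3.3: for each phrase pick the best-matched region; the similarity
    s_ij = v_i . h_j of the selected pair is kept as the confidence w_{d,k}
    that weights its loss term in Eq. (10)."""
    S = v @ h.t()                          # S with s_ij for regions v_i and phrases h_j
    best_region = S.argmax(dim=0)          # best region index per phrase
    confidence = S.max(dim=0).values       # used as the weight w_{d,k}
    return best_region, confidence

def contrastive_loss(v, h, margin=0.2):
    """Eq. (9): bidirectional margin ranking loss; row i of v matches row i of h,
    and every other row serves as a contrastive sample."""
    S = v @ h.t()
    pos = S.diag().unsqueeze(1)
    cost = (margin - pos + S).clamp(min=0) + (margin - pos.t() + S).clamp(min=0)
    mask = 1.0 - torch.eye(S.size(0))
    return (cost * mask).sum()

# Toy usage with random, L2-normalized embeddings (5 regions, 5 noun phrases).
v = F.normalize(torch.randn(5, 512), dim=1)
h = F.normalize(torch.randn(5, 512), dim=1)
best_region, w = match_phrases_to_regions(v, h)
# Eq. (10) would sum w[k] * loss_{d,k} over all pairs; here a single weighted term:
loss = contrastive_loss(v[best_region], h)
```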
3.4. Initialization and Optimization

Initialization. At the initial learning stage, the initial representations for all the sentences, phrases, images, and image regions are obtained by learning a simplified HM-LSTM model: only the losses at the root, {loss_{d,0}}_{d=1}^{D}, are minimized, and the other losses {loss_{d,k}} are neglected. In other words, only the sentence-level correspondences are used to learn the simplified HM-LSTM.

Note that our HM-LSTM model is learned with the Back-propagation Through Structure (BPTS) algorithm [8], where the errors of the different loss functions are injected into the corresponding loss layers and back-propagated from the root node to the leaf nodes along the tree structure.

Optimization. The CNN part of our model comes from Karpathy et al. [14]; it is pre-trained on ImageNet [4] and fine-tuned on the 200 classes of the ImageNet detection challenge [28]. We use Adam [16] to optimize the HM-LSTM with a learning rate of 8 x 10^-3, using mini-batches of 64 paired image-sentence examples for training.

4. Image Caption Ranking

With the learned hierarchical multimodal embedding model, we can describe a new image with a full sentence, i.e., image-sentence ranking. In particular, we first extract the image features with the CNN and retrieve the nearest sentence vector h_{d,0} in the embedding space, which is regarded as the caption for the image.

More importantly, our method can produce region-oriented, phrase-level descriptions for a new image. In particular, after detecting some salient image regions/object proposals, we can extract their visual features and retrieve specific and detailed phrases to describe them; we call this region-phrase ranking in this paper.

5. Experiments

We use the Flickr8K [11], Flickr30K [35][25], and MS-COCO [20][2] datasets in our experiments. These datasets contain 8,000, 31,000, and 123,000 images respectively, and each image is annotated with 5 sentences using AMT. For Flickr8K and Flickr30K, we use 1,000 images for validation, 1,000 for testing, and the rest for training, which is consistent with [11][14]. For MS-COCO we follow [14] and use 5,000 images each for validation and testing.

5.1. Image-Sentence Ranking

We first evaluate the proposed method on the task of image-sentence ranking. We adopt Recall@K as the evaluation metric, namely the fraction of images for which the correct caption is ranked within the top-K retrieved results (and vice versa for sentences), together with Med r, the median rank of the correct item.

We compare our method with several visual-semantic embedding methods (i.e., ranking-based methods), including DeViSE, SDT-RNN, and DeFrag. For DeViSE [6], sentences are represented as the mean of their word embeddings. SDT-RNN [29] uses a recursive neural network to learn sentence representations. For DeFrag [15], sentences are represented as a bag of dependency parses. In addition, some generation-based methods are included in the comparison. The m-RNN [22] and m-RNN-vgg [23] do not use a ranking loss and instead optimize the log-likelihood of predicting the next word in a sequence conditioned on an image. DeepVS [14] first learns an embedding space with a bidirectional RNN and then trains an RNN sentence generator on top of it. Similarly, NIC [33] provides the visual input directly to the RNN model. Recently, DeepSP [34] was proposed for image-text embedding and achieves state-of-the-art performance by encouraging the captions of the same image to be close to each other.

5.1.1 Results on Flickr8K and Flickr30K

We evaluate our approach on Flickr8K and Flickr30K. The dimension of the embedding space is set to 512, i.e., h_j and v_i are 512-dimensional vectors.

Table 1. Flickr8K experiments. R@K is Recall@K (higher is better); Med r is the median rank (lower is better). Methods compared on image annotation and image search: Random, SDT-RNN [29], DeViSE [6], DeFrag [15], SC-NLM [17], DeepVS [14], m-RNN [22], NIC [33], and HM-LSTM.

Table 2. Flickr30K experiments, with the same metrics and tasks as Table 1; the comparison additionally includes m-RNN-vgg [23] and DeepSP [34].

The R@K and Med r of the different methods are reported in Table 1 and Table 2. Our model outperforms the ranking-based methods by a large margin, and also compares favorably with the state-of-the-art methods.

The DeepSP [34] results in Table 2 are based on mean-vector representations, i.e., a sentence is represented as the mean of its word embeddings. This is a fair comparison, since both our model and this version of DeepSP are based on the same word embeddings (the word2vec representation [24]). Note that if more sophisticated sentence representations such as the Fisher vector (FV) are used, the performance of DeepSP can be further improved [34]; however, the memory cost is then huge, which makes it ill-suited to large-scale image-sentence ranking.
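The Recall@K and Med r numbers reported in the tables can be computed from an image-sentence similarity matrix. A small sketch follows; for simplicity it assumes one ground-truth caption per image, whereas the benchmarks provide five.

```python
import numpy as np

def ranking_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] is the similarity of image i and sentence j; the ground-truth
    caption of image i is assumed to be sentence i. Returns Recall@K (%) and Med r."""
    order = np.argsort(-sim, axis=1)             # per image: sentences sorted by similarity
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(sim.shape[0])])
    recalls = {k: 100.0 * np.mean(ranks < k) for k in ks}
    return recalls, float(np.median(ranks)) + 1  # ranks are 0-based, Med r is 1-based

# Toy usage: 1000 test images against 1000 candidate captions.
sim = np.random.rand(1000, 1000)
recalls, med_r = ranking_metrics(sim)
print(recalls, med_r)
```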
5.1.2 Results on MS-COCO

On MS-COCO, we follow the experimental setting of [14] and randomly sample 1,000 images for testing. The dimension of the embedding space is again set to 512, and Multiscale Combinatorial Grouping (MCG) [1] is adopted instead to generate the object proposals.

Table 3. MS-COCO experiments. R@K is Recall@K (higher is better); Med r is the median rank (lower is better). Methods compared on image annotation and image search: Random, DeepVS [14], m-RNN-vgg [23], DeepSP [34], and HM-LSTM.

The ranking results are shown in Table 3. Our method significantly outperforms the ranking-based methods, and it also compares favorably with state-of-the-art methods such as m-RNN-vgg [23] and DeepSP [34].

From the image-sentence ranking results on all three datasets, we conclude that the performance of general image captioning can be significantly improved by learning a dense embedding space. This is attributed to the joint embedding of full sentences and their phrases: since there are hierarchical relations among full sentences and their phrases, such relations benefit the embedding learning of both when they are jointly represented and mapped into the embedding space.

5.2. Region-Phrase Ranking

Our method can produce region-oriented, phrase-level descriptions for a new image. Generally, after detecting some salient image regions/object proposals, our model can retrieve subtle and detailed phrases to describe them. For easier evaluation, the image regions are manually annotated instead of being automatically detected in this experiment.
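Region-phrase ranking amounts to a nearest-neighbour search from a region embedding into a pool of phrase embeddings. A short sketch follows; the cosine scoring and the 512-dimensional vectors are consistent with Section 3, while the example phrases and random embeddings are placeholders.

```python
import torch
import torch.nn.functional as F

def top_k_phrases(region_embedding, phrase_embeddings, phrases, k=5):
    """Return the k phrases whose embeddings are most similar to a region embedding."""
    sims = F.normalize(phrase_embeddings, dim=1) @ F.normalize(region_embedding, dim=0)
    scores, idx = sims.topk(k)
    return [(phrases[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]

# Toy usage with random vectors standing in for learned embeddings.
phrases = ["a white and gray cat", "a cow staring into the camera", "rocking chairs",
           "a plant and a painting", "two giraffes and a zebra"]
phrase_emb = torch.randn(len(phrases), 512)
region_emb = torch.randn(512)
print(top_k_phrases(region_emb, phrase_emb, phrases, k=3))
```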
For quantitative evaluation, we publish a new dataset based on MS-COCO, namely the MS-COCO-region dataset. Specifically, 1,000 images and their corresponding sentences are randomly selected from the MS-COCO validation set, and AMT workers [27] are asked to annotate image regions in those images and associate them with the phrases within the sentences. Although some phrase-level captioning datasets such as Visual Genome [19] and Flickr30k-Entities [26] have been proposed, their phrases are either freely annotated by workers or have no relation to the sentences. In contrast, the phrases in the MS-COCO-region dataset are automatically extracted from the given sentences, so there are hierarchical relations between sentences and phrases. Specifically, for each sentence, 1-5 noun phrases are automatically extracted with the Stanford Parser; for each image, AMT workers are asked to annotate 1-8 regions and associate them with those extracted phrases. In total, 4,467 salient regions and 18,724 corresponding phrases are collected.

For comparison, DeepVS and m-RNN-vgg are adopted as baselines, where each region-phrase pair is independently fed to those models to obtain its embeddings.

Table 4. Region-phrase ranking. R@K is Recall@K (higher is better); Med r is the median rank (lower is better). Methods compared: Random, DeepVS [14], m-RNN-vgg [23], and HM-LSTM.

The region-phrase ranking results are shown in Table 4. Our method outperforms both DeepVS and m-RNN-vgg. This is mainly because (1) the relations among phrases are better utilized thanks to the hierarchical structure of our model, and (2) a chain-structured RNN is good at representing long sequences (i.e., full sentences) rather than short sequences (i.e., phrases). We therefore conclude that our model can jointly represent short phrases along with long sentences and better exploit their relations.

Qualitative results. Our method can describe image regions with detailed and subtle phrases. For example, for Fig. 6(a), previous methods tend to give a general, overview description such as 'A cat sitting under an umbrella'. In contrast, our method targets a salient image region (marked by the red box) and produces detailed and subtle descriptions such as 'a white and gray cat with a striped tail'. Compared with the coarse description 'a cat', our description is more informative and expressive.

In addition, our approach can produce diverse descriptions for a given image region. As shown in Fig. 6(b), for the image region containing a 'cow', the top-5 retrieved phrases describe the cow in diverse ways: e.g., 'a cow standing in the grass with a tag in its ear' focuses on the ear of the cow, while 'a cow staring into the camera' focuses on the action of the cow. In other words, our approach can describe different aspects of an object of interest.

Figure 6. Our approach can produce subtle and detailed descriptions for an image region, and many of the descriptions are diverse, describing different aspects of an object: (a) a region of 'cat'; (b) a region of 'cow'.

5.3. Discussion

5.3.1 Learned Embedding Space

To check the properties of the learned embedding space intuitively and qualitatively, we visualize the learned embedding vectors in a 2-D space using t-SNE [21]. Specifically, we randomly sample 60 images and their corresponding sentences from our MS-COCO test set and visualize their embedding vectors in 2-D, as shown in Fig. 8; each image embedding is connected to its 5 corresponding sentence embeddings by lines. We can see that the learned image embedding is very close to its sentence embeddings in most cases, which demonstrates the effectiveness of our approach.

Moreover, Fig. 8 shows that our model learns a semantic embedding space, where images/sentences with similar semantics are mapped close to each other. For example, the 38-th, 54-th, and 2-nd images are all related to 'Dog' (as shown by their descriptions), and their learned embedding vectors are exact neighbors in the embedding space (within the red circle).

Figure 8. Visualization of the learned embedding space. Each image is connected to its 5 corresponding sentences by lines. The image and the corresponding sentences are very close to each other in most cases; besides, images/sentences with similar semantics are also close to each other, e.g., the 38-th, 54-th, and 2-nd images are all related to 'Dog', and their embeddings are neighbors in the embedding space (within the red circle).
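A visualization in the spirit of Fig. 8 can be produced with scikit-learn's t-SNE. The sketch below uses random vectors in place of the learned 512-dimensional embeddings, and the plotting details (colors, marker sizes) are our own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for 60 image embeddings and their 5 corresponding sentence embeddings each.
img_emb = np.random.randn(60, 512)
sent_emb = np.random.randn(60 * 5, 512)

xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(
    np.vstack([img_emb, sent_emb]))
img_xy, sent_xy = xy[:60], xy[60:]

plt.scatter(sent_xy[:, 0], sent_xy[:, 1], s=8, c="gray", label="sentences")
plt.scatter(img_xy[:, 0], img_xy[:, 1], s=30, c="red", label="images")
for i in range(60):                       # connect each image to its 5 sentences
    for j in range(5):
        k = i * 5 + j
        plt.plot([img_xy[i, 0], sent_xy[k, 0]], [img_xy[i, 1], sent_xy[k, 1]],
                 lw=0.3, c="lightgray")
plt.legend()
plt.show()
```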
5.3.2 Learned Phrase-level Correspondences

When learning the dense embedding space, our approach automatically finds 'phrase-region' correspondences in the training data, and we evaluate their quality here. Since obtaining ground-truth phrase-level correspondences is expensive, we only evaluate on a subset of the training data. In practice, we randomly sample 2,000 'phrase-region' pairs from all the learned phrase-level correspondences and ask 10 users to judge whether each pair is correct. After a majority vote among those users, we find that 82% of the learned correspondences are correct.

Fig. 7 illustrates four examples of the learned correspondences between phrases and image regions. In most cases our approach is able to find correct correspondences. Moreover, there are consistent mappings between the hierarchical structures of the phrases and of the regions: e.g., the phrase 'two people sitting on rocking chairs' is on top of the two phrases 'two people' and 'rocking chairs', and correspondingly the red box for 'two people sitting on rocking chairs' exactly covers the orange box for 'two people' and the green box for 'rocking chairs'.

Figure 7. Four examples of the learned correspondences between phrases and image regions. For image (a), we obtain four phrases after sentence parsing: (1) 'two people', (2) 'rocking chairs', (3) 'the deck', and (4) 'two people sitting on rocking chairs'; meanwhile, some salient image regions are obtained. The learned correspondences between phrases and image regions are indicated by color, e.g., the phrase 'two people' corresponds to the orange box. Our approach learns correct correspondences in most cases. Note that (d) is a failure case, mainly because the salient regions do not cover the objects mentioned in its caption.

6. Conclusion

In this paper, a Hierarchical Multimodal LSTM model is proposed for dense visual-semantic embedding; it jointly learns the embeddings of all the sentences, their phrases, the images, and the salient image regions. Thanks to the hierarchical structure, we can naturally build representations for all phrases and image regions and exploit their hierarchical relations as well. The experimental results show that the performance of general image captioning can be significantly improved by learning a dense embedding space. Besides, our method can produce detailed and diverse phrases to describe salient image regions.

7. Acknowledgement

This work was supported by NSFC Grants 61432014, J1605252, 61402348, 61672402, 61602355, and 61503296, by Key Industrial Innovation Chain 2016KTZDGY-02, and by National High-Level Talents CS31117200001. Dr. Gang Hua was supported by NSFC Grant 61629301.

References
[1] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. CVPR, 2014.
[2] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[5] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv:1505.01809, 2015.
[6] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. NIPS, 2013.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
[8] C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. ICNN, 1996.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[13] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. CVPR, 2016.
[14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
[15] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image-sentence mapping. NIPS, 2014.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[17] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. NIPS, 2014.
[18] D. Klein and C. Manning. Accurate unlexicalized parsing. ACL, 2003.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv:1602.07332, 2016.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[21] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[22] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv, 2014.
[23] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks. ICLR, 2015.
[24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
[25] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. ICCV, 2015.
[26] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 2016.
[27] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. NAACL-HLT Workshop, 2010.
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[29] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. NIPS, 2014.
[31] K. Tai, R. Socher, and C. Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.
[32] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555, 2014.
[34] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. CVPR, 2016.
[35] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
