# Attention Mechanisms
## Table of Contents
1. [Introduction](#introduction)
2. [Attention Types](#attention-types)
* [Self Attention](#self-attention)
* [Global (Soft) Attention](#global-soft-attention)
* [Local (Hard) Attention](#local-hard-attention)
* [Hierarchical Attention](#hierarchical-attention)
3. [Alignment Functions](#alignment-functions)
4. [Implementation Details](#implementation-details)
5. [Examples](#examples)
* [Sentiment Classification](#sentiment-classification)
* [Text Generation](#text-generation)
* [Machine Translation](#machine-translation)
6. [Contributing](#contributing)
7. [Resources](#resources)
## Introduction
This repository includes custom layer implementations for a whole family of attention mechanisms, compatible with TensorFlow 2.0 and its Keras integration. Attention mechanisms have transformed the landscape of machine translation, and their use in other domains of natural language processing & understanding is increasing day by day. In a broader sense, they aim to eliminate the disadvantageous compression and loss of information that occur in RNNs, caused by the fixed-length encoding of hidden states derived from input sequences by recurrent layers in sequence-to-sequence models. The layers in this repository can be used for both **many-to-many** and **many-to-one** sequence tasks. Applications include *sentiment classification*, *text generation*, *machine translation*, and *question answering*. It is also worthwhile to mention that this project will soon be **deployed** as a Python package. Check the *Contributing* section for how to contribute to this project.
## Attention Types
<p align="center">
<img src="assets/attention_categories.png">
</p>
### Self Attention
First introduced in *Long Short-Term Memory-Networks for Machine Reading* by Jianpeng Cheng et al. The idea is to relate different positions of the same hidden state space derived from the input sequence, based on the argument that multiple components together form the overall semantics of a sequence. This approach combines the information from these different positions through **multiple hops** of attention. This particular implementation follows *A Structured Self-Attentive Sentence Embedding* by Zhouhan Lin et al., where the authors propose an additional regularization loss to counter the redundancy that arises in the embedding matrix when the attention mechanism always produces similar annotation weights.
<p align="center">
<img src="assets/self_attention.png">
</p>
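To make that regularization term concrete, below is a minimal sketch (an illustrative paraphrase of Lin et al.'s penalty, not the repository's exact code) that computes the squared Frobenius norm of ```A A^T - I```; it grows whenever different attention hops produce similar annotation weights:
```
import tensorflow as tf

def self_attention_penalty(A):
    """A: annotation (attention weight) matrix with shape (batch, hops, timesteps)."""
    AAT = tf.matmul(A, A, transpose_b=True)           # (batch, hops, hops)
    P = AAT - tf.eye(tf.shape(A)[1])                  # identity is broadcast over the batch
    return tf.reduce_sum(tf.square(P), axis=[1, 2])   # squared Frobenius norm per sample
```
Scaled by a small coefficient and added to the task loss (e.g. via ```model.add_loss()```), this encourages each hop to attend to a different part of the sequence.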
### Global (Soft) Attention
First introduced in *Neural Machine Translation by Jointly Learning to Align and Translate* by Dzmitry Bahdanau et al. The idea is to derive a context vector based on **all** hidden states of the encoder RNN. Hence, it is said that this type of attention **attends** to the entire input state space.
<p align="center">
<img src="assets/global_attention.png">
</p>
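As a minimal illustration of what "attending to the entire input state space" means computationally, the sketch below derives the context vector as a softmax-weighted sum over all encoder states (variable names, shapes, and the dot-product score are assumptions, not the repository's API):
```
import tensorflow as tf

def global_attention(h_t, h_s):
    """h_t: target state (batch, units); h_s: all encoder states (batch, timesteps, units)."""
    scores = tf.einsum('bu,btu->bt', h_t, h_s)      # alignment score for every source position
    alpha = tf.nn.softmax(scores, axis=-1)          # attention weights over the entire input state space
    context = tf.einsum('bt,btu->bu', alpha, h_s)   # context vector: weighted sum of all encoder states
    return context, alpha
```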
### Local (Hard) Attention
First introduced in *Show, Attend and Tell: Neural Image Caption Generation with Visual Attention* by Kelvin Xu et al. and adapted to NLP in *Effective Approaches to Attention-based Neural Machine Translation* by Minh-Thang Luong et al. The idea is to eliminate the attentive cost of global attention by focusing instead on a small window of the hidden states derived from the input sequence. This window is proposed as ```[p_t-D, p_t+D]``` where ```D``` is the window width, and positions that cross sequence boundaries are disregarded. The aligned position, ```p_t```, is decided either through **a) monotonic alignment:** set ```p_t=t```, or **b) predictive alignment:** set ```p_t = S*sigmoid(FC1(tanh(FC2(h_t))))```, where the fully-connected layers ```FC1``` and ```FC2``` are trainable weight matrices. Since yielding an integer index value is not differentiable (e.g. through ```tf.cast()``` and similar methods), this implementation instead keeps the aligned position as a float and uses a Gaussian distribution to adjust the attention weights of all source hidden states, rather than slicing out the actual window. We also propose an experimental alignment type, **c) completely predictive alignment:** set ```p_t``` as in b), but apply it to all source hidden states (```h_s```) instead of the target hidden state (```h_t```). Then, choose the top ```window_width``` positions to build the context vector and zero out the rest. Currently, this option is only available for many-to-one scenarios.
<p align="center">
<img src="assets/local_attention.png">
</p>
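Below is a minimal sketch of option b) together with the Gaussian re-weighting described above; the two dense layers stand in for ```FC1```/```FC2```, and the names, shapes, and the choice ```sigma = D/2``` (taken from Luong et al.) are illustrative assumptions rather than the repository's exact implementation:
```
import tensorflow as tf

def local_attention_weights(h_t, scores, window_width):
    """h_t: target state (batch, units); scores: raw alignment scores (batch, S)."""
    S = tf.cast(tf.shape(scores)[1], tf.float32)
    # b) predictive alignment: p_t = S * sigmoid(FC1(tanh(FC2(h_t)))), kept as a float
    fc2 = tf.keras.layers.Dense(units=h_t.shape[-1], activation='tanh')  # in a real layer, created once in build()
    fc1 = tf.keras.layers.Dense(units=1, activation='sigmoid')
    p_t = S * tf.squeeze(fc1(fc2(h_t)), axis=-1)                         # aligned position, (batch,)
    # Gaussian centred on p_t with sigma = D/2 down-weights positions outside [p_t-D, p_t+D]
    positions = tf.range(S, dtype=tf.float32)                            # source positions 0 .. S-1
    sigma = window_width / 2.0
    gaussian = tf.exp(-tf.square(positions[None, :] - p_t[:, None]) / (2.0 * sigma ** 2))
    return tf.nn.softmax(scores, axis=-1) * gaussian                     # adjusted attention weights
```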
### Hierarchical Attention
First introduced in *Hierarchical Attention Networks for Document Classification* by Zichao Yang et al. The idea is to reflect the hierarchical structure that exists within documents. The original paper proposes a **bottom-up** approach, applying attention mechanisms sequentially at the word and sentence levels, but a **top-down** approach (e.g. word and character levels) is also applicable. Hence, this type of mechanism is said to attend differentially to more and less important content when constructing the document representation.
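To make the bottom-up composition concrete, the sketch below encodes each sentence with word-level attention pooling and reuses that encoder across sentences via ```TimeDistributed```; ```attention_pool``` is a generic stand-in for the attention layers, and placeholders such as ```words_per_sentence``` or ```vocab_size``` are hyperparameters you would set yourself, so none of this is the repository's exact API:
```
import tensorflow as tf
from tensorflow.keras import layers, Model

def attention_pool(h):
    # simplified additive attention pooling: score each timestep, normalize, return the weighted sum
    scores = layers.Dense(1, activation='tanh')(h)
    weights = layers.Softmax(axis=1)(scores)
    return layers.Lambda(lambda args: tf.reduce_sum(args[0] * args[1], axis=1))([weights, h])

# Word level: one sentence of token ids -> one sentence vector
words_in = layers.Input(shape=(words_per_sentence,))
w = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(words_in)
w = layers.Bidirectional(layers.GRU(rnn_units, return_sequences=True))(w)
sentence_encoder = Model(words_in, attention_pool(w))

# Sentence level: a document of sentences -> one document vector -> class probabilities
doc_in = layers.Input(shape=(sentences_per_document, words_per_sentence))
s = layers.TimeDistributed(sentence_encoder)(doc_in)       # encode every sentence independently
s = layers.Bidirectional(layers.GRU(rnn_units, return_sequences=True))(s)
doc_out = layers.Dense(num_classes, activation='softmax')(attention_pool(s))
model = Model(doc_in, doc_out)
```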
## Alignment Functions
<p align="center">
<img src="assets/alignment_functions.png">
</p>
Each function computes an alignment score given a target hidden state (```h_t```) and source hidden states (```h_s```).
| Name | Formula for <img src="https://latex.codecogs.com/png.latex?\Large&space;score(h_t,&space;h_s)"> | Defined by |
| ------- | --- | --- |
| Dot Product | <img src="https://latex.codecogs.com/png.latex?\Large&space;h_t^\intercal&space;\cdot&space;h_s"> | Luong et al. (2015) |
| Scaled Dot Product | <img src="https://latex.codecogs.com/png.latex?\Large&space;\frac{h_t^\intercal&space;\cdot&space;h_s}{\sqrt{H}}"> | Vaswani et al. (2017) |
| General | <img src="https://latex.codecogs.com/png.latex?\Large&space;h_t^\intercal&space;\cdot&space;W_a&space;\cdot&space;h_s"> | Luong et al. (2015) |
| Concat | <img src="https://latex.codecogs.com/png.latex?\Large&space;v_a^\intercal&space;\cdot&space;\tanh(W_a[h_t:h_s])"> | Bahdanau et al. (2015) |
| Location | <img src="https://latex.codecogs.com/png.latex?\Large&space;W_a&space;\cdot&space;h_t"> | Luong et al. (2015) |
where ```H``` is the dimensionality of the hidden states given by the encoder RNN, ```W_a``` is a trainable weight matrix, and ```v_a``` is a trainable weight vector.
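In code, the table above translates roughly into the following sketch (tensor layouts, parameter shapes, and the function itself are illustrative assumptions, not the repository's implementation):
```
import tensorflow as tf

def alignment_score(h_t, h_s, kind, W_a=None, v_a=None):
    """h_t: (batch, units) target state; h_s: (batch, S, units) source states -> (batch, S) scores."""
    if kind == 'dot':
        return tf.einsum('bu,bsu->bs', h_t, h_s)
    if kind == 'scaled_dot':
        d = tf.cast(tf.shape(h_s)[-1], tf.float32)
        return tf.einsum('bu,bsu->bs', h_t, h_s) / tf.sqrt(d)
    if kind == 'general':
        return tf.einsum('bu,uv,bsv->bs', h_t, W_a, h_s)          # W_a: (units, units)
    if kind == 'concat':
        h_t_tiled = tf.tile(h_t[:, None, :], [1, tf.shape(h_s)[1], 1])
        concat = tf.concat([h_t_tiled, h_s], axis=-1)             # [h_t ; h_s], (batch, S, 2*units)
        hidden = tf.tanh(tf.einsum('bsu,uv->bsv', concat, W_a))   # W_a: (2*units, attention_units)
        return tf.einsum('bsv,v->bs', hidden, v_a)                # v_a: (attention_units,)
    if kind == 'location':
        return tf.einsum('bu,us->bs', h_t, W_a)                   # W_a: (units, S)
    raise ValueError('unknown alignment type: ' + kind)
```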
## Implementation Details
* As of now, all attention mechanisms in this repository are successfully tested with applications in both many-to-one and many-to-many sequence tasks. Check the *Examples* subsection for example applications.
* It should be noted that there is no claim that the attention mechanisms in this repository (or their accompanying hyperparameters shown in the *Examples* subsection) are optimized in any way; there is still a lot of room for improvement from both a software development and a research perspective.
* Every layer is a subclass of ```tf.keras.layers.Layer()```.
* The ```__init__()``` method of each custom class calls the initialization method of its parent and defines additional attributes specific to each layer.
* The ```get_config()``` method calls the configuration method of its parent and defines custom attributes introduced with the layer.
* If a custom layer includes a ```build()``` method, then it contains trainable parameters. Take the ```Attention()``` layer for example: backpropagation of the loss signals which inputs deserve more attention, and hence drives the updates of the layer's weights.
* The ```call()``` method is the actual operation that is performed on the input tensors.
* ```compute_output_shape()``` methods are omitted to save space; see the minimal layer skeleton sketched after this list for how these pieces fit together.
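Below is that skeleton; ```ToyAttention``` is a hypothetical example layer, not one of the classes in ```layers.py```:
```
import tensorflow as tf

class ToyAttention(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)          # call the parent's initialization method
        self.units = units                  # attribute specific to this layer

    def build(self, input_shape):
        # trainable parameters live here; their updates are driven by backpropagation
        self.W_a = self.add_weight(name='W_a', shape=(int(input_shape[-1]), self.units),
                                   initializer='glorot_uniform', trainable=True)
        self.v_a = self.add_weight(name='v_a', shape=(self.units, 1),
                                   initializer='glorot_uniform', trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        # the actual operation on the input tensors: score, normalize, weighted sum
        scores = tf.einsum('btu,uo->bto', tf.tanh(tf.einsum('btf,fu->btu', inputs, self.W_a)), self.v_a)
        weights = tf.nn.softmax(scores, axis=1)          # attention weights over timesteps
        return tf.reduce_sum(weights * inputs, axis=1)   # (batch, features)

    def get_config(self):
        config = super().get_config()                    # parent configuration
        config.update({'units': self.units})             # custom attribute introduced with the layer
        return config
```
It is used like any other Keras layer, e.g. ```ToyAttention(units=64)(lstm_outputs)``` on a recurrent layer that returns sequences.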
## Examples
These layers can be plugged into your projects (whether language models or other types of RNN-based models) within seconds, just like any other TensorFlow layer with Keras integration. See the general-purpose example below; the lines after the encoder are an illustrative completion, and the exact constructor arguments of ```Attention()``` are documented in ```layers.py```:
```
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from layers import Attention
X = Input(shape=(sequence_length,), batch_size=batch_size) # define input layer for summary
## Token Embedding (Pretrained or Not) ##
embedded = Embedding(input_dim=vocabulary_size, output_dim=embedding_dimensions)(X)
encoded = LSTM(units=recurrent_units, return_sequences=True)(embedded) # keep every timestep for attention
attended = Attention()(encoded) # illustrative; see layers.py for the available arguments
Y = Dense(units=num_classes, activation='softmax')(attended) # prediction layer
```