Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization
Shuming Ma¹, Xu Sun¹,², Junyang Lin³, Houfeng Wang¹
¹MOE Key Lab of Computational Linguistics, School of EECS, Peking University
²Deep Learning Lab, Beijing Institute of Big Data Research, Peking University
³School of Foreign Languages, Peking University
{shumingma, xusun, linjunyang, wanghf}@pku.edu.cn
Abstract
Most current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written, and it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves state-of-the-art performance on the benchmark dataset.¹
1 Introduction
Text summarization aims to produce a brief summary of the main ideas of a text. Unlike extractive text summarization (Radev et al., 2004; Woodsend and Lapata, 2010; Cheng and Lapata, 2016), which selects words or phrases from the source texts as the summary, abstractive text summarization learns a semantic representation to generate more human-like summaries. Recently, most models for abstractive text summarization have been based on the sequence-to-sequence model, which encodes the source texts into a semantic representation with an encoder, and generates the summaries from the representation with a decoder.
¹The code is available at https://github.com/lancopku/superAE
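As a concrete illustration of such an encoder-decoder, here is a minimal PyTorch sketch; the module names and hyperparameters are our own illustrative choices, not the paper's released implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN-based encoder-decoder for abstractive summarization."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # Encode the source text into a semantic representation
        # (the final LSTM hidden and cell states).
        _, state = self.encoder(self.embed(src))
        # Decode the summary conditioned on that representation.
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)  # per-token vocabulary logits
```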
The content on social media is long and contains many errors, including spelling mistakes, informal expressions, and grammatical mistakes (Baldwin et al., 2013). The large number of errors in the content causes great difficulty for text summarization. As for RNN-based Seq2Seq, it is difficult to compress a long sequence into an accurate representation (Li et al., 2015) because of the gradient vanishing and exploding problem. Compared with the source content, it is easier to encode the representations of the summaries, which are short and manually written. Since the source content and the summary share the same key points, it is possible to supervise the learning of the semantic representation of the source content with that of the summary.
In this paper, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. First, we train an autoencoder, which takes the summaries as input and reconstructs them, to obtain a better representation for generating the summaries. Then, we supervise the internal representation of Seq2Seq with that of the autoencoder by minimizing the distance between the two representations. Finally, we use adversarial learning to enhance the supervision. Following the previous work (Ma et al., 2017), we evaluate our proposed model on a Chinese social media dataset. Experimental results show that our model outperforms the state-of-the-art baseline models. More specifically, our model outperforms the Seq2Seq baseline by 7.1 ROUGE-1, 6.1 ROUGE-2, and 7.0 ROUGE-L points.
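A sketch of how such a combined objective might look in PyTorch follows; the specific distance (squared Euclidean), the discriminator interface, and the weight lam are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def supervised_loss(seq2seq_repr, ae_repr, decoder_nll, discriminator, lam=0.5):
    """Combine the Seq2Seq training loss with the assistant supervision.

    seq2seq_repr: source representation from the Seq2Seq encoder, (B, H)
    ae_repr:      summary representation from the pretrained autoencoder, (B, H)
    decoder_nll:  negative log-likelihood of the reference summary
    discriminator: maps a (B, H) representation to a (B, 1) probability
                   that it came from the autoencoder (assumed interface)
    """
    # Pull the Seq2Seq representation toward the autoencoder representation.
    consistency = F.mse_loss(seq2seq_repr, ae_repr)
    # Adversarial term: the encoder tries to fool the discriminator into
    # labeling its representation as autoencoder-like.
    adv = F.binary_cross_entropy(
        discriminator(seq2seq_repr),
        torch.ones(seq2seq_repr.size(0), 1),
    )
    return decoder_nll + lam * (consistency + adv)
```

In this sketch the autoencoder is trained first and then held as the supervision target, while the discriminator would be updated in alternation with the encoder, as is standard in adversarial training.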
2 Proposed Model
We introduce our proposed model in detail in this
section.
2.1 Notation
Given a summarization dataset that consists of $N$ data samples, the $i$-th data sample $(x_i, y_i)$ con-