REVIEW
Advances in natural language processing
Julia Hirschberg1* and Christopher D. Manning2,3
Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today’s researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.
Over the past 20 years, computational lin-
guistics has grown into both an exciting
area of scientific research and a practical
technology that is increasingly being in-
corporated into consumer products (for
example, in applications such as Apple’s Siri and
Skype Translator). Four key factors enabled these
developments: (i) a vast increase in computing
power, (ii) the availability of very large amounts
of linguistic data, (iii) the development of highly
successful machine learning (ML) methods, and
(iv) a much richer understanding of the structure
of human language and its deployment in social
contexts. In this Review, we describe some cur-
rent application areas of interest in language
research. These efforts illustrate computational
approaches to big data, based on current cutting-
edge methodologies that combine statistical anal-
ysis and ML with knowledge of language.
Computational linguistics, also known as nat-
ural language processing (NLP), is the subfield
of computer science concerned with using com-
putational techniques to learn, understand, and
produce human language content. Computation-
al linguistic systems can have multiple purposes:
The goal can be aiding human-human commu-
nication, such as in machine translation (MT);
aiding human-machine communication, such as
with conversational agents; or benefiting both
humans and machines by analyzing and learn-
ing from the enormous quantity of human lan-
guage content that is now available online.
During the first several decades of work in
computational linguistics, scientists attempted
to write down for computers the vocabularies
and rules of human languages. This proved a
difficult task, owing to the variability, ambiguity,
and context-dependent interpretation of human
languages. For instance, a star can be either an
astronomical object or a person, and “star” can
be a noun or a verb. In another example, two in-
terpretations are possible for the headline “Teacher
strikes idle kids,” depending on the noun, verb, and
adjective assignments of the words in the sentence,
as well as grammatical structure. Beginning in the
1980s, but more widely in the 1990s, NLP was
transformed by researchers starting to build mod-
els over large quantities of empirical language
data. Statistical or corpus (“body of words”)–based
NLP was one of the first notable successes of
the use of big data, long before the power of
ML was more generally recognized or the term
“big data” even introduced.
A central finding of this statistical approach to
NLP has been that simple methods using words,
part-of-speech (POS) sequences (such as whether
a word is a noun, verb, or preposition), or simple
templates can often achieve notable results when
trained on large quantities of data. Many text
and sentiment classifiers are still based solely on
the different sets of words (“bag of words”) that
documents contain, without regard to sentence
and discourse structure or meaning. Achieving
improvements over these simple baselines can be
quite difficult. Nevertheless, the best-performing
systems now use sophisticated ML approaches
and a rich understanding of linguistic structure.
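As a concrete illustration (not drawn from the article itself), such a bag-of-words classifier can be sketched in a few lines of Python, assuming the scikit-learn library; the labeled sentences below are invented toy data.

```python
# A minimal sketch of a bag-of-words sentiment classifier, assuming scikit-learn.
# The training sentences and labels are invented toy data for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "I loved this phone, the battery lasts all day",
    "Terrible service and a broken screen on arrival",
    "Great value, would happily buy again",
    "Worst purchase I have made this year",
]
train_labels = ["positive", "negative", "positive", "negative"]

# CountVectorizer reduces each document to raw word counts, discarding word
# order, sentence structure, and discourse context (the bag-of-words view).
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["Great battery, I loved it"]))  # likely ['positive'] on this toy data
```

Despite ignoring all structure, classifiers of this kind are strong baselines when trained on large quantities of labeled text.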
High-performance tools that identify syntactic
and semantic information as well as information
about discourse context are now available. One
example is Stanford CoreNLP (1), which provides
a standard NLP preprocessing pipeline that in-
cludes POS tagging (with tags such as noun, verb,
and preposition); identification of named entities,
such as people, places, and organizations; parsing
of sentences into their grammatical structures;
and identifying co-references between noun
phrase mentions (Fig. 1).
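As an illustrative sketch (using the Python NLTK toolkit rather than CoreNLP itself, which is a Java library), the first steps of such a pipeline look roughly as follows; the example sentence is invented, and the tokenizer, tagger, and named-entity models must be fetched once with nltk.download().

```python
# A minimal sketch of the first stages of an NLP preprocessing pipeline,
# in the spirit of CoreNLP but using NLTK purely for illustration.
# The required NLTK models must be downloaded once via nltk.download().
import nltk

sentence = "Julia Hirschberg teaches at Columbia University in New York."

tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # POS tags such as NNP (proper noun), VBZ (verb)
entities = nltk.ne_chunk(tagged)        # group tokens into PERSON, ORGANIZATION, GPE chunks

print(tagged)
print(entities)
```

Syntactic parsing and coreference resolution, the later stages of the CoreNLP pipeline, would require CoreNLP or a comparable system on top of these steps.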
Historically, two developments enabled the
initial transformation of NLP into a big data field.
The first was the early availability to researchers
of linguistic data in digital form, particularly
through the Linguistic Data Consortium (LDC)
(2), established in 1992. Today, large amounts
of digital text can easily be downloaded from
the Web. Available as linguistically annotated
data are large speech and text corpora anno-
tated with POS tags, syntactic parses, semantic
labels, annotations of named entities (persons,
places, organizations), dialogue acts (statement,
question, request), emotions and positive or neg-
ative sentiment, and discourse structure (topic
or rhetorical structure). Second, performance im-
provements in NLP were spurred on by shared
task competitions. Originally, these competitions
were largely funded and organized by the U.S.
Department of Defense, but they were later or-
ganized by the research community itself, such
as the CoNLL Shared Tasks (3). These tasks were
a precursor of modern ML predictive modeling
and analytics competitions, such as on Kaggle (4),
in which companies and researchers post their
data and statisticians and data miners from all over
the world compete to produce the best models.
A major limitation of NLP today is the fact that
most NLP resources and systems are available
only for high-resource languages (HRLs), such as
English, French, Spanish, German, and Chinese.
In contrast, many low-resource languages (LRLs)—
such as Bengali, Indonesian, Punjabi, Cebuano,
and Swahili—spoken and written by millions of
people have no such resources or systems avail-
able. A future challenge for the language commu-
nity is how to develop resources and tools for
hundreds or thousands of languages, not just a few.
Machine translation
Proficiency in languages was traditionally a hall-
mark of a learned person. Although the social
standing of this human skill has declined in the
modern age of science and machines, translation
between human languages remains crucially im-
portant, and MT is perhaps the most substantial
way in which computers could aid human-human
communication. Moreover, the ability of com-
puters to translate between human languages
remains a consummate test of machine intel-
ligence: Correct translation requires not only
the ability to analyze and generate sentences in
human languages but also a humanlike under-
standing of world knowledge and context, de-
spite the ambiguities of languages. For example,
the French word “bordel” straightforwardly means
“brothel”; but if someone says “My room is un
bordel,” then a translating machine has to know
enough to suspect that this person is probably not
running a brothel in his or her room but rather is
saying “My room is a complete mess.”
Machine translation was one of the first non-
numeric applications of computers and was studied
intensively starting in the late 1950s. However, the
hand-built grammar-based systems of early dec-
ades achieved very limited success. The field was
transformed in the early 1990s when researchers
at IBM acquired a large quantity of English and
French sentences that were translations of each
other (known as parallel text), produced as the
proceedings of the bilingual Canadian Parliament.
These data allowed them to collect statistics of
word translations and word sequences and to
build a probabilistic model of MT (5).
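The core of that probabilistic model can be illustrated with a minimal expectation-maximization estimator of word-translation probabilities in the spirit of IBM Model 1, the simplest model in this family; the English-French sentence pairs below are invented, and the NULL word and the alignment and distortion machinery of the full IBM models are omitted.

```python
# A minimal sketch of IBM Model 1-style word-translation estimation via
# expectation-maximization. The parallel sentence pairs are invented toy data.
from collections import defaultdict

parallel = [
    ("the house".split(), "la maison".split()),
    ("the blue house".split(), "la maison bleue".split()),
    ("the flower".split(), "la fleur".split()),
]

# Initialize t(f | e) uniformly over the French vocabulary.
french_vocab = {f for _, fr in parallel for f in fr}
t = defaultdict(lambda: 1.0 / len(french_vocab))

for _ in range(10):  # a handful of EM iterations suffices on toy data
    pair_count = defaultdict(float)   # expected count of (f, e) co-translations
    total = defaultdict(float)        # expected count of e as a translation source
    for en, fr in parallel:
        for f in fr:
            # E-step: spread each French word's count over its possible English sources.
            norm = sum(t[(f, e)] for e in en)
            for e in en:
                frac = t[(f, e)] / norm
                pair_count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in pair_count.items():
        t[(f, e)] = c / total[e]

# Print the confident word translations learned from three sentence pairs.
for (f, e), p in sorted(t.items(), key=lambda kv: -kv[1]):
    if p > 0.5:
        print(f"t({f} | {e}) = {p:.2f}")
```

Run on real parallel text, the same procedure concentrates probability mass on word pairs that consistently co-occur across translations, which is the statistical signal the IBM models exploit.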
Following a quiet period in the late 1990s,
the new millennium brought the potent combina-
tion of ample online text, including considerable
quantities of parallel text, much more abundant
and inexpensive computing, and a new idea
for building statistical phrase-based MT systems
1Department of Computer Science, Columbia University, New York, NY 10027, USA. 2Department of Linguistics, Stanford University, Stanford, CA 94305-2150, USA. 3Department of Computer Science, Stanford University, Stanford, CA 94305-9020, USA.
*Corresponding author. E-mail: julia@cs.columbia.edu