DPWord2Vec: Better Representation of Design Patterns in Semantics
Dong Liu, He Jiang, Member, IEEE, Xiaochen Li, Zhilei Ren, Lei Qiao, and Zuohua Ding, Member, IEEE
Abstract—With the plain text descriptions of design patterns, developers could better learn and understand the definitions and usage
scenarios of design patterns. To facilitate the automatic usage of these descriptions, e.g., recommending design patterns by free-text
queries, design patterns and natural languages should be adequately associated. Existing studies usually use texts in design pattern
books as the representations of design patterns to calculate similarities with the queries. However, this way is problematic. Lots of
information of design patterns may be absent from design pattern books and many words would be out of vocabulary due to the content
limitation of these books. To overcome these issues, a more comprehensive method should be constructed to estimate the relatedness
between design patterns and natural language words. Motivated by Word2Vec, in this study, we propose DPWord2Vec that embeds
design patterns and natural language words into vectors simultaneously. We first build a corpus containing more than 400 thousand
documents extracted from design pattern books, Wikipedia, and Stack Overflow. Next, we redefine the concept of context window to
associate design patterns with words. Then, the design pattern and word vector representations are learnt by leveraging an advanced
word embedding method. The learnt design pattern and word vectors can be universally used in textual description based design
pattern tasks. An evaluation shows that DPWord2Vec outperforms the baseline algorithms by 24.2-120.9 percent in measuring the
similarities between design patterns and words in terms of Spearman’s rank correlation coefficient. Moreover, we adopt DPWord2Vec
on two typical design pattern tasks. In the design pattern tag recommendation task, the DPWord2Vec-based method outperforms two
state-of-the-art algorithms by 6.6 and 32.7 percent respectively when considering Recall@10. In the design pattern selection task,
DPWord2Vec improves the existing methods by 6.5-70.7 percent in terms of MRR.
Index Terms—Design pattern, word embedding, Word2Vec, semantic similarity, tag recommendation, design pattern selection
1 INTRODUCTION
SOFTWARE design patterns derive from the notion of design pattern in the area of
architecture [1], aiming to document reusable experience for recurring software design
problems [2]. In recent years, many studies about design patterns have been conducted
[3], [4], [5]. According to the literature, there are roughly two ways to describe design
patterns: the formal way and the informal way.
The formal way specifies design patterns with formally
defined pattern languages. For example, the Gang-of-Four
(GoF) book respectively uses Unified Modeling Language
(UML) class diagram and sequence diagram to illustrate the
structure and collaborations of each design pattern [2]. A
number of studies are based on the formal descriptions of
design patterns [4], [6], as formal specifications enhance the
capabilities of machine processing [7]. However, there are
some weaknesses of the formal way. First, it is inconvenient
to precisely specify the intent and applicability of design
patterns. Second, building the meta-model of each design
pattern is usually costly [8]. Third, the formal way may lose
human readability, which is critically important to the util-
ity of design patterns [7].
Conversely, the informal way depicts design patterns with free text. Compared with the
formal way, it is more understandable and convenient to describe design pattern relevant
artifacts in words. Thus, the informal way is a valuable supplement to the formal way.
To provide tool support for design pattern relevant tasks based on informal descriptions,
the key point is to establish the semantic relationships between design patterns and
natural languages, so that the retrieval or identification of design patterns can be
practically realized. However, associating design patterns with natural languages is no
easy job. A design pattern name is usually a phrase, such as “factory method”. An
experienced developer may capture the semantics of the design pattern via the name as
he/she understands the relevant background. But for automatic tools, it is difficult to
comprehend the connotations from only these few words. More information about design
patterns should be provided for them to “learn” the background knowledge.
Dong Liu is with the School of Software, Dalian University of Technology,
Dalian 116620, China. E-mail: dongliu@mail.dlut.edu.cn.
He Jiang and Zhilei Ren are with the Key Laboratory for Ubiquitous Network
and Service Software of Liaoning Province, School of Software, Dalian
University of Technology, Dalian 116620, China.
E-mail: {jianghe, zren}@dlut.edu.cn.
Xiaochen Li is with the School of Software, Dalian University of Technology,
Dalian 116620, China, and also with the Software Verification and Valida-
tion Lab, University of Luxembourg, L-1855 Luxembourg, Luxembourg.
E-mail: xiaochen.li@uni.lu.
Lei Qiao is with the Beijing Institute of Control Engineering, Beijing 100190,
China. E-mail: fly2moon@aliyun.com.
Zuohua Ding is with the School of Information Sciences, Zhejiang Sci-Tech
University, Hangzhou 310018, China. E-mail: zouhuading@hotmail.com.
Manuscript received 28 Nov. 2019; revised 17 July 2020; accepted 29 July 2020.
Date of publication 18 Aug. 2020; date of current version 18 Apr. 2022.
(Corresponding author: He Jiang.)
Recommended for acceptance by N. Bencomo.
Digital Object Identifier no. 10.1109/TSE.2020.3017336
1228 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 48, NO. 4, APRIL 2022
0098-5589 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Nanchang Hangkong University. Downloaded on March 17,2024 at 02:19:07 UTC from IEEE Xplore. Restrictions apply.
To obtain exact semantic information of design patterns,
the existing studies usually take the descriptions in design
pattern books as standard definitions of design patterns [8],
[9], [10]. If a snippet of text is similar to the standard defini-
tion of a design pattern, then it is likely to be related to the
design pattern. Hence, the relatedness between design pat-
terns and natural languages can be estimated. However,
this kind of methods is still problematic. On one hand,
much information about design patterns is absent from
these books. Design pattern books usually depict the mecha-
nisms, scenarios, and specifications of design patterns [2].
As time goes by, many applications beyond the original
design pattern books have been developed. For example,
the Active Record design pattern is related to the Ruby on Rails framework, as Active
Record provides the data model of the framework.¹ The AngularJS framework implements
the Dependency Injection design pattern itself and is usually accompanied by this
design pattern.² These relationships
cannot be mined from design pattern books. On the other
hand, the vocabulary extracted from design pattern books is
usually too small. The lengths of descriptions in design pat-
tern books are limited and many natural language words
may be out of the scope. It is difficult to handle the texts
containing many out-of-vocabulary words. Therefore, the
wide usage of this kind of methods is restricted.
In this study, we aim to overcome these issues by con-
structing a general method to estimate the relatedness
between design patterns and natural language words, in
order that it can be universally used in the tasks based on
informal descriptions of design patterns. The “words” here
refer to both plain natural language words, such as
“factory” and “method”, and software specific terms, such
as “angularjs”. Inspired by the word embedding method [11],
we propose DPWord2Vec that maps both design patterns
and natural language words into one vector space. With the
design pattern and word vectors, the similarity between a
design pattern and a word or a document can be calculated.
In this way, the relationship between natural languages and
design patterns can be built. However, there are two chal-
lenges to be addressed. First, how to find a relatively large
corpus about design patterns? Second, how to associate a
design pattern with its relevant natural language words for
vector training?
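To make the idea of a shared vector space concrete, the following is a minimal sketch of how a similarity between a design pattern and a word or a document might be computed once both kinds of vectors exist. The section does not specify the similarity measure or how a document is represented, so cosine similarity and an averaged document vector are assumptions made here for illustration; the three-dimensional vectors are invented toy values, not learnt ones.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def document_vector(tokens, word_vectors):
    """Average the vectors of in-vocabulary tokens (an assumed document model)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # every token is out of vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy vectors, invented purely for illustration.
dp_vectors = {"factory method": [0.9, 0.1, 0.0]}
word_vectors = {
    "factory": [0.8, 0.2, 0.1],
    "creat": [0.7, 0.3, 0.0],
    "thread": [0.0, 0.1, 0.9],
}

dp = dp_vectors["factory method"]
sim_related = cosine(dp, word_vectors["factory"])
sim_unrelated = cosine(dp, word_vectors["thread"])
doc_vec = document_vector(["creat", "factory", "unknownword"], word_vectors)
sim_doc = cosine(dp, doc_vec)
```

With vectors of this kind, ranking design patterns for a free-text query reduces to sorting the patterns by their similarity to the query representation.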
To handle the first challenge, we build a general corpus
containing 491,555 documents. The general corpus consists
of two parts: the description corpus and the crowdsourced
corpus. The description corpus contains relatively formal
design pattern descriptions that are extracted from design
pattern books and Wikipedia. The crowdsourced corpus is
constructed based on a set of design pattern relevant Stack
Overflow posts obtained from our previous work [12]. Then
we extend the concept of context window in Word2Vec to
our general corpus and define the context windows for each
design pattern and each word respectively. In this way, the
linkages between design patterns and words are established,
that is, the design pattern context windows contain words
and design patterns appear in word context windows. Hence,
the second challenge can be properly addressed. Finally, the
design pattern and word vector representations are learnt by
leveraging an advanced word embedding method, namely
GloVe [13], based on these context windows.
To clarify the quality of the learnt design pattern and
word vectors, we deploy an evaluation with a dp-word
(design pattern - word) similarity task. Experimental results
on 2,000 manually labelled dp-word pairs show that the
learnt vectors by DPWord2Vec are more effective than
some widely used semantic relatedness estimation algo-
rithms, i.e., outperform these algorithms by 24.2-120.9 per-
cent in terms of Spearman’s rank correlation coefficient. To
show the practicability, we depict two applications of
DPWord2Vec to solve two typical design pattern tasks, i.e.,
design pattern tag recommendation and design pattern
selection. In the first application, when recommending the
top 10 design pattern tags for the posts in a software infor-
mation site, the DPWord2Vec-based method outperforms
two state-of-the-art tag recommendation methods by 6.6
and 32.7 percent respectively in terms of Recall@10. In the
second application, the method refined by DPWord2Vec
outperforms the two existing design pattern selection meth-
ods by 6.5 and 70.7 percent respectively when considering
the mean values of Mean Reciprocal Rank (MRR) over three
design pattern collections.
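The three measures named above can be sketched in a few lines. This is a hedged illustration using their standard textbook definitions, not the evaluation code of the study; the function names are hypothetical, and the Spearman variant below assumes there are no tied values.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of a ranking."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def spearman_no_ties(xs, ys):
    """Spearman's rank correlation, valid only when no values are tied."""
    n = len(xs)
    rank_x = {v: i for i, v in enumerate(sorted(xs), start=1)}
    rank_y = {v: i for i, v in enumerate(sorted(ys), start=1)}
    d2 = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy tag recommendation: two of the three true tags are outside the top 2.
ranking = ["singleton", "observer", "factory method"]
true_tags = {"singleton", "factory method", "decorator"}
r_at_2 = recall_at_k(ranking, true_tags, k=2)

# Toy pattern selection: the correct pattern is ranked first, then second.
mrr = mean_reciprocal_rank([
    (["observer", "mediator"], {"observer"}),
    (["observer", "mediator"], {"mediator"}),
])
```

Recall@k rewards placing any correct tag in the top k, whereas MRR only considers the position of the first correct answer, which suits selection tasks with a single expected pattern.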
In this paper, we make the following contributions:
1) We propose DPWord2Vec that maps both design
patterns and natural language words into vectors to
support design pattern relevant tasks. To the best of
our knowledge, this is the first work that establishes
the universal relationship between design patterns
and natural languages.
2) We evaluate DPWord2Vec on a manually labelled
dp-word pair dataset to show its effectiveness in
semantic relatedness estimation.
3) DPWord2Vec is applied to two design pattern rele-
vant tasks, namely design pattern tag recommenda-
tion and design pattern selection. DPWord2Vec
outperforms the state-of-the-art methods.
The rest of this paper is organized as follows. Section 2
shows the background of the study. Section 3 presents the
framework of DPWord2Vec. The settings and results for
evaluating DPWord2Vec are depicted in Sections 4 and 5,
respectively. Sections 6 and 7 introduce two applications of
DPWord2Vec. Section 8 discusses potential threats to valid-
ity. Some studies related to our work are outlined in
Section 9. We conclude the paper in Section 10.
2 PRELIMINARIES
Before the depiction of DPWord2Vec, we demonstrate the
concept of design pattern in this study and briefly introduce
the word embedding technique.
2.1 Concept of Design Pattern
Generally speaking, design patterns are proven solutions to
recurring software design problems [2]. However, to the
best of our knowledge, there are no formal definitions nor
standard lists of design patterns. There exist numbers of
design pattern collections that are published with multiple
channels, such as design pattern books, academic papers, or
1. https://guides.rubyonrails.org/active_record_basics.html
2. https://angular.io/guide/dependency-injection
LIU ET AL.: DPWORD2VEC: BETTER REPRESENTATION OF DESIGN PATTERNS IN SEMANTICS 1229
online libraries [7]. Design patterns in different collections
may be depicted in different ways, e.g., in flat text format or
using UML. In this paper, we focus on the design patterns
with rich textual descriptions and collect design patterns
from various sources.
Similar to “design pattern”, “architecture pattern” is also a means for software design.
Strictly speaking, they are not the same concept, but the boundary between them is not
drawn consistently across design pattern collections. For example, Model View Controller
is an example of an architectural pattern in Wikipedia³ but is marked as a design pattern
in MSDN.⁴ Therefore, instead of creating a standard subjectively, we choose not to
distinguish them in our study. Once an entity is identified as a design pattern in some
design pattern collection, we regard it as a design pattern.
2.2 Word Embedding
Word embedding is a set of techniques that maps words or
phrases in the vocabulary to vectors of real numbers. The
core part of DPWord2Vec is also word embedding, but it
handles both words and design patterns. Word embedding
methods focus on mapping words into a continuous vector
space with a much lower dimension than the size of vocabu-
lary and the vector representation of each word is deter-
mined by supervised learning based on the corpus [11].
To facilitate the demonstration, we explain how word embedding works with an example.
Assume there is a corpus that contains the sentence: “software design patterns
encapsulate proven solutions that address recurring problems”. To mine the relationships
between words, the sliding
context window strategy is usually used [11]. A context
window contains a central word and several surrounding
words which are at a distance of no more than c positions
from the central word. For example, the context window
with centre “patterns” and c = 2 contains the surrounding
words “software”, “design”, “encapsulate”, and “proven”.
Multiple local context windows are constructed as the cen-
tral word slides from the beginning (“software”) to the end
(“problems”) of the corpus.
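The sliding context window described above can be sketched as follows. This is a minimal illustration on the example sentence; the function name `context_windows` is hypothetical, and tokenisation by whitespace is an assumption.

```python
def context_windows(tokens, c):
    """Return (central word, surrounding words) for every position.

    Surrounding words are those at a distance of at most c positions
    from the central word, clipped at the ends of the token sequence.
    """
    windows = []
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - c):i]
        right = tokens[i + 1:i + 1 + c]
        windows.append((centre, left + right))
    return windows

sentence = ("software design patterns encapsulate proven solutions "
            "that address recurring problems").split()
windows = context_windows(sentence, c=2)

# The window centred on "patterns" matches the example in the text:
# ("patterns", ["software", "design", "encapsulate", "proven"])
patterns_window = windows[2]
```

Windows at the start and end of the sequence are simply shorter, since there are fewer than c words on one side.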
Then the word vectors are learnt based on these local
context windows. The intuition is that if two words appear
frequently in the same context window then their vector
representations are highly associated. For example, the
objective of the Skip-gram model is to learn word vector
representations that are good at predicting each surround-
ing word by the vector of the central word [11]. Conversely,
the Continuous Bag-of-Words (CBOW) model aims to pre-
dict the central word by the concatenation or average of the
vectors of the surrounding words [11]. Different from them,
the GloVe model counts the number of the total co-occur-
rences of each pair of words through all the local context
windows and predicts the co-occurrence number by the
vectors of the words in the pair [13].
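The difference between the three models lies in what is extracted from the context windows before training. The sketch below shows only that extraction step, under the plain counting described in the paragraph; it is not an implementation of any of the models, and all function names are hypothetical.

```python
from collections import Counter

def skipgram_pairs(tokens, c):
    """(central word, one surrounding word) pairs: Skip-gram predicts the latter."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, c):
    """(surrounding words, central word) examples: CBOW predicts the latter."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - c):i] + tokens[i + 1:i + 1 + c]
        examples.append((context, centre))
    return examples

def cooccurrence_counts(tokens, c):
    """Total co-occurrences of each ordered word pair over all windows,
    i.e. the quantity a GloVe-style model is fitted to."""
    return Counter(skipgram_pairs(tokens, c))

tokens = "factory method and factory method".split()
counts = cooccurrence_counts(tokens, c=1)
factory_method = counts[("factory", "method")]
```

Because every window is symmetric around its centre, the count for a pair (a, b) equals the count for (b, a) in this sketch.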
3 THE DPWORD2VEC FRAMEWORK
DPWord2Vec aims to embed natural language words and
design patterns into one vector space. This process can be
divided into four phases (as shown in Fig. 1). First, the corpus related to design
patterns is acquired from multiple sources. Next, the documents in the corpus are
preprocessed. Then, we propose a context window-based strategy to strengthen the tie
between words and design patterns. Finally, the word and design pattern vectors are
trained based on the corpus and the context windows.
3.1 Corpus Building
To train the vectors of words and design patterns, a corpus
relevant to design patterns should be built at first. Formally,
we construct a general corpus C, which contains multiple documents. For each document
doc in C, doc has two components: the token component doc.Tokens, a sequence of natural
language words that describes some design patterns, and the design pattern component
doc.DPs, a set of design patterns described by doc.Tokens. The general corpus C can be
further categorized into two groups according to their sources.
Description Corpus. Documents in this corpus are extracted from design pattern books and
Wikipedia. Some design pattern books catalog their own lists of design patterns. For
example, GoF presents 23 design patterns with the problem definitions and design
specifications [2]. A design pattern is usually described by a chapter or a section in a
design pattern book. Similarly, a number of design patterns are specified by Wikipedia
as entries, with one page for each design pattern.⁵ A chapter or section of a design
pattern book, or a Wikipedia page of a design pattern, forms a document doc. In this
corpus, doc.Tokens denotes the whole text in the chapter, section, or page, excluding
the code snippets. Meanwhile, doc.DPs contains only one element, i.e., the described
design pattern.
In total, the description corpus contains 431 documents, which are associated with 13
design pattern books and 125 Wikipedia pages. Amongst the design pattern components,
349 unique design patterns are involved.
Fig. 1. The framework of DPWord2Vec.
3. https://en.wikipedia.org/wiki/Architectural_pattern
4. https://msdn.microsoft.com/en-us/library/ms978748.aspx
5. https://en.wikipedia.org/wiki/Category:Software_design_patterns
Crowdsourced Corpus. Documents in this corpus are constructed by referring to a
programming forum, i.e., Stack Overflow.⁶ In the previous study [12], 187,493 design
pattern relevant question posts spanning from August 2008 to December 2017 were detected
in Stack Overflow.
A design pattern relevant post is a post in which the name of a design pattern appears
at least once. However, detecting design pattern occurrences in Stack Overflow posts is
not a trivial string matching task, as the discussions on Stack Overflow are usually
informal [14], [15] and the name of a design pattern may not be mentioned in a unique
form. This is also referred to as the morphological form issue [14]. The previous study
attempted to address this issue in two respects. On the one hand, the standard design
pattern names as well as other common names are collected simultaneously from the
existing design pattern collections, e.g., design pattern books, in which the other
well-known names of each contained design pattern are usually presented explicitly,
e.g., marked as “also known as”. These names include aliases, e.g., “open
implementation” is an alias for “reflection”, and acronyms, e.g., “mvc” is an acronym
for “model view controller”. On the other hand, regular expressions are leveraged to
allow some variants when searching for a design pattern name in the text of the Stack
Overflow posts. For example, the regular expression for “model view controller” is
“model[^a-z]?view[^a-z]?controller”, where “[^a-z]?” denotes a non-alphabetic character
that matches zero or one time, so that variants such as “model-view-controller”,
“model_view_controller”, and “modelviewcontroller” are covered. A manual validation on
the sampled posts shows that the detection is acceptably accurate, i.e., it achieves a
Precision of 97.3 percent and a Recall of 87.8 percent. More details can be obtained by
referring to [12].
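The name matching described above can be sketched as follows. This is a minimal illustration rather than the detector from [12]: the alias table contains only the two examples mentioned in the text, the function names are hypothetical, and lowercasing the post before matching is an assumption. Word-boundary handling (e.g., preventing “mvc” from matching inside a longer word) is deliberately left out.

```python
import re

# Canonical name -> all names under which the pattern may be mentioned.
# Only the aliases and acronyms quoted in the text are listed here.
ALIASES = {
    "model view controller": ["model view controller", "mvc"],
    "reflection": ["reflection", "open implementation"],
}

def build_regex(name):
    """Join the words of a name with an optional non-alphabetic character."""
    return re.compile("[^a-z]?".join(re.escape(word) for word in name.split()))

PATTERNS = {
    canonical: [build_regex(name) for name in names]
    for canonical, names in ALIASES.items()
}

def detect_design_patterns(text):
    """Return the canonical names of all design patterns mentioned in text."""
    lowered = text.lower()
    return {
        canonical
        for canonical, regexes in PATTERNS.items()
        if any(regex.search(lowered) for regex in regexes)
    }

mvc_regex = build_regex("model view controller").pattern
found = detect_design_patterns("Should I use Model-View-Controller with reflection?")
```

The optional separator matches at most one character, so a name whose words are separated by two or more characters is not detected by this sketch.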
We use these question posts to construct the crowdsourced corpus. Moreover, it is
enriched by all the answer posts to these design pattern relevant question posts. A
question post and each of its answer posts are assigned to different documents. The
design pattern(s) relevant to an answer post are the same as those of its question post.
For a document doc in this corpus, doc.Tokens denotes the content obtained by merging
the title and body of a question or answer post with code snippets discarded, and
doc.DPs is the set of design pattern(s) relevant to the post.
Finally, there are 491,124 documents in this corpus and
210 unique design patterns are involved.
By merging the two corpora, we obtain a general corpus C, which contains 491,555
documents.⁷ The involved design patterns are indexed and form a design pattern
vocabulary, namely $V_{DP}$, with 372 design patterns. Although the documents in the
description corpus are far fewer than those in the crowdsourced corpus, the description
corpus is indispensable for building the design pattern vectors. On one hand, the
description corpus makes it possible to build vectors for the design patterns that are
rarely discussed in Stack Overflow. On the other hand, this corpus tends to provide more
formal and precise depictions of design patterns than the crowdsourced corpus. We will
show its significance in Section 5.1.
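The document structure and the merging step can be sketched as a small data model. This is an illustrative sketch only: the class name `Document`, the field names, and the three toy documents are hypothetical, while the counts checked at the end are the ones reported in this section.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Document:
    tokens: List[str]  # doc.Tokens: words describing the design pattern(s)
    dps: Set[str]      # doc.DPs: design patterns described by the tokens

# Description corpus: one design pattern per document.
description_corpus = [
    Document("define an interface for creating an object".split(),
             {"factory method"}),
]

# Crowdsourced corpus: an answer post inherits the patterns of its question.
question_dps = {"model view controller", "observer"}
crowdsourced_corpus = [
    Document("how do i notify views in mvc".split(), set(question_dps)),
    Document("use the observer pattern".split(), set(question_dps)),
]

general_corpus = description_corpus + crowdsourced_corpus
dp_vocabulary = sorted(set().union(*(doc.dps for doc in general_corpus)))

# Consistency of the reported figures.
total_documents = 431 + 491_124       # description + crowdsourced
shared_patterns = 349 + 210 - 372     # patterns present in both corpora
```

By inclusion-exclusion, the reported vocabulary sizes imply that 187 design patterns occur in both corpora, while 162 are covered only by the description corpus.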
3.2 Corpus Preprocessing
Compared with general natural language documents, the number of design pattern relevant
documents tends to be quite small. Therefore, our corpus is considerably smaller than
those used for training common word vectors [11], [13]. For this reason, we preprocess
the token component of each document to filter out insignificant and redundant
information and to build a compact vocabulary.
First, code-like tokens (e.g., function names) in a natural language sentence are split
according to their camel case to preserve the semantic integrity of the sentence. With
this step, on the one hand, these code-like tokens are converted into more
understandable identifiers [16] that better reflect the semantic meanings. On the other
hand, the size of the vocabulary is reduced. Next, we tokenize and lowercase the token
component of each document. Then, the less informative tokens, including
English-language stop words, special tags (HTML tags in Stack Overflow posts, and
reference markers in design pattern books and Wikipedia pages), and non-alphabetic
characters (e.g., numbers), are removed from the text, as they are not very useful for
reflecting the semantic relationship between natural language and design patterns.
Moreover, each token is stemmed to its root form, e.g., “developer”, “developed”, and
“developing” to “develop”. As words with the same root usually have similar
meanings [17] and their vector representations are also similar in some word embedding
methods [18], [19], we can simply regard them as the same word without losing much
semantic information. Finally, we discard the words that occur no more than five times
in the corpus when constructing the vocabulary but retain them in the corpus. These
words are likely to be noisy terms [20], and training vectors for them is of little
value.
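The preprocessing steps above can be sketched end to end. This is a hedged illustration, not the pipeline used in the study: the stop word list is a tiny placeholder, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer (the section does not name the stemming algorithm). All function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "by", "while"}

def split_camel_case(text):
    """Split camel-case identifiers, e.g. 'getInstance' -> 'get Instance'."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", text)

def toy_stem(token):
    """Crude stand-in for a stemmer: strip one common suffix if enough remains."""
    for suffix in ("ing", "ed", "er", "s"):
        if (token.endswith(suffix) and not token.endswith("ss")
                and len(token) - len(suffix) >= 3):
            return token[:-len(suffix)]
    return token

def preprocess(text):
    """Camel-case split, tag removal, tokenise, lowercase, filter, stem."""
    text = split_camel_case(text)
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps alphabetic tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]

def build_vocabulary(documents, min_count=6):
    """Keep words occurring more than five times across all documents."""
    counts = Counter(token for doc in documents for token in doc)
    return {token for token, n in counts.items() if n >= min_count}

tokens = preprocess(
    "<p>The developer developed getInstance while developing 2 tests</p>"
)
vocabulary = build_vocabulary([["develop"] * 6 + ["rare"] * 5])
```

Note that `build_vocabulary` only decides which words receive a vector; the rare words remain in the documents, as stated in the text.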
Some of the above steps, such as camel case splitting,
stop words removing, and word stemming, may be not
common in word embedding methods. With abundant
training corpora, vector representation of each distinct iden-
tifier in the text can be learnt. However, due to the scale of
the design pattern corpus, it is reasonable to conduct these
preprocessing steps to reduce the vocabulary size, i.e., the
number of vectors to be learnt, to adapt to the corpus. Furthermore, since the focus of
this study is to build the semantic relationship between natural languages and design
patterns, representing all the identifiers precisely is not a core concern. As a common
concept in the word embedding
methods, the word context will not be significantly affected
by the preprocessing, since the eliminated tokens contain
little semantic information and the meanings of the changed
tokens are mainly retained. It is adequate to apply the word
embedding methods to the preprocessed corpus.
After the preprocessing, we obtain a word vocabulary $V_{Word}$ that contains 27,770 words.
3.3 Context Window Construction
As to the corpus we build, each document contains two
parts: the natural language words and the design patterns.
To train the vectors of words and design patterns together,
6. https://stackoverflow.com/
7. The detailed description corpus and crowdsourced corpus, as
well as the number of relevant documents to each design pattern are
available via https://github.com/WoodenHeadoo/dpword2vec.