DPWord2Vec: Better Representation of Design Patterns in Semantics

Dong Liu, He Jiang, Member, IEEE, Xiaochen Li, Zhilei Ren, Lei Qiao, and Zuohua Ding, Member, IEEE

Abstract—With the plain text descriptions of design patterns, developers could better learn and understand the deﬁnitions and usage

scenarios of design patterns. To facilitate the automatic usage of these descriptions, e.g., recommending design patterns by free-text

queries, design patterns and natural languages should be adequately associated. Existing studies usually use texts in design pattern

books as the representations of design patterns to calculate similarities with the queries. However, this way is problematic. Lots of

information of design patterns may be absent from design pattern books and many words would be out of vocabulary due to the content

limitation of these books. To overcome these issues, a more comprehensive method should be constructed to estimate the relatedness

between design patterns and natural language words. Motivated by Word2Vec, in this study, we propose DPWord2Vec that embeds

design patterns and natural language words into vectors simultaneously. We ﬁrst build a corpus containing more than 400 thousand

documents extracted from design pattern books, Wikipedia, and Stack Overﬂow. Next, we redeﬁne the concept of context window to

associate design patterns with words. Then, the design pattern and word vector representations are learnt by leveraging an advanced

word embedding method. The learnt design pattern and word vectors can be universally used in textual description based design

pattern tasks. An evaluation shows that DPWord2Vec outperforms the baseline algorithms by 24.2-120.9 percent in measuring the

similarities between design patterns and words in terms of Spearman’s rank correlation coefﬁcient. Moreover, we adopt DPWord2Vec

on two typical design pattern tasks. In the design pattern tag recommendation task, the DPWord2Vec-based method outperforms two

state-of-the-art algorithms by 6.6 and 32.7 percent respectively when considering Recall@10. In the design pattern selection task,

DPWord2Vec improves the existing methods by 6.5-70.7 percent in terms of MRR.

Index Terms—Design pattern, word embedding, Word2Vec, semantic similarity, tag recommendation, design pattern selection

1 INTRODUCTION

SOFTWARE design patterns derive from the notion of design

pattern in the area of architecture [1], aiming to docu-

ment reusable experience for recurring software design

problems [2]. In recent years, many studies about design

patterns have been conducted [3], [4], [5]. As to the litera-

ture, there are roughly two ways to describe design pat-

terns: the formal way and the informal way.

The formal way speciﬁes design patterns with formally

deﬁned pattern languages. For example, the Gang-of-Four

(GoF) book respectively uses Uniﬁed Modeling Language

(UML) class diagram and sequence diagram to illustrate the

structure and collaborations of each design pattern [2]. A

number of studies are based on the formal descriptions of

design patterns [4], [6], as formal speciﬁcations enhance the

capabilities of machine processing [7]. However, there are

some weaknesses of the formal way. First, it is inconvenient

to precisely specify the intent and applicability of design

patterns. Second, building the meta-model of each design

pattern is usually costly [8]. Third, the formal way may lose

human readability, which is critically important to the util-

ity of design patterns [7].

Conversely, the informal way depicts design patterns with free text. Compared with the formal way, it is more understandable and convenient to describe design pattern relevant artifacts in words. Thus, the informal way is a profitable supplement to the formal way. To provide tool support for design pattern relevant tasks based on informal descriptions, the key point is to establish the semantic relationships between design patterns and natural languages, so that the retrieval or identification of design patterns can be practically realized. However, associating design patterns with natural languages is no easy job. A design pattern name is usually a phrase, such as “factory method”. An experienced developer may capture the semantics of the design pattern via the name as he/she understands the relevant background. But for automatic tools, it is difficult to comprehend the connotations from only these several words. More information about design patterns should be provided for them to “learn” the background knowledge.

• Dong Liu is with the School of Software, Dalian University of Technology, Dalian 116620, China. E-mail: dongliu@mail.dlut.edu.cn.

• He Jiang and Zhilei Ren are with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, School of Software, Dalian University of Technology, Dalian 116620, China. E-mail: {jianghe, zren}@dlut.edu.cn.

• Xiaochen Li is with the School of Software, Dalian University of Technology, Dalian 116620, China, and also with the Software Verification and Validation Lab, University of Luxembourg, L-1855 Luxembourg, Luxembourg. E-mail: xiaochen.li@uni.lu.

• Lei Qiao is with the Beijing Institute of Control Engineering, Beijing 100190, China. E-mail: fly2moon@aliyun.com.

• Zuohua Ding is with the School of Information Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China. E-mail: zouhuading@hotmail.com.

Manuscript received 28 Nov. 2019; revised 17 July 2020; accepted 29 July 2020.

Date of publication 18 Aug. 2020; date of current version 18 Apr. 2022.

(Corresponding author: He Jiang.)

Recommended for acceptance by N. Bencomo.

Digital Object Identiﬁer no. 10.1109/TSE.2020.3017336

1228 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 48, NO. 4, APRIL 2022

See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Nanchang Hangkong University. Downloaded on March 17,2024 at 02:19:07 UTC from IEEE Xplore. Restrictions apply.

To obtain exact semantic information of design patterns,

the existing studies usually take the descriptions in design

pattern books as standard deﬁnitions of design patterns [8],

[9], [10]. If a snippet of text is similar to the standard deﬁni-

tion of a design pattern, then it is likely to be related to the

design pattern. Hence, the relatedness between design pat-

terns and natural languages can be estimated. However,

this kind of methods is still problematic. On one hand,

much information about design patterns is absent from

these books. Design pattern books usually depict the mecha-

nisms, scenarios, and speciﬁcations of design patterns [2].

As time goes by, many applications beyond the original

design pattern books have been developed. For example,

the Active Record design pattern is related to the Ruby on

Rails framework as Active Record provides the data model

of the framework.

The AngularJS framework implements the Dependency Injection design pattern itself and is usually accompanied by this design pattern. These relationships cannot be mined from design pattern books. On the other

hand, the vocabulary extracted from design pattern books is

usually too small. The lengths of descriptions in design pat-

tern books are limited and many natural language words

may be out of the scope. It is difﬁcult to handle the texts

containing many out-of-vocabulary words. Therefore, the

wide usage of this kind of methods is restricted.

In this study, we aim to overcome these issues by con-

structing a general method to estimate the relatedness

between design patterns and natural language words, in

order that it can be universally used in the tasks based on

informal descriptions of design patterns. The “words” here

refer to both plain natural language words, such as

“factory” and “method”, and software speciﬁc terms, such

as “angularjs”. Inspired by the word embedding method [11],

we propose DPWord2Vec that maps both design patterns

and natural language words into one vector space. With the

design pattern and word vectors, the similarity between a

design pattern and a word or a document can be calculated.

In this way, the relationship between natural languages and

design patterns can be built. However, there are two chal-

lenges to be addressed. First, how to ﬁnd a relatively large

corpus about design patterns? Second, how to associate a

design pattern with its relevant natural language words for

vectors training?

To handle the ﬁrst challenge, we build a general corpus

containing 491,555 documents. The general corpus consists

of two parts: the description corpus and the crowdsourced

corpus. The description corpus contains relatively formal

design pattern descriptions that are extracted from design

pattern books and Wikipedia. The crowdsourced corpus is

constructed based on a set of design pattern relevant Stack

Overﬂow posts obtained from our previous work [12]. Then

we extend the concept of context window in Word2Vec to

our general corpus and deﬁne the context windows for each

design pattern and each word respectively. In this way, the

linkages between design patterns and words are established,

that is, the design pattern context windows contain words

and design patterns appear in word context windows. Hence,

the second challenge can be properly addressed. Finally, the

design pattern and word vector representations are learnt by

leveraging an advanced word embedding method, namely

GloVe [13], based on these context windows.

To clarify the quality of the learnt design pattern and

word vectors, we deploy an evaluation with a dp-word

(design pattern - word) similarity task. Experimental results

on 2,000 manually labelled dp-word pairs show that the

learnt vectors by DPWord2Vec are more effective than

some widely used semantic relatedness estimation algo-

rithms, i.e., outperform these algorithms by 24.2-120.9 per-

cent in terms of Spearman’s rank correlation coefﬁcient. To

show the practicability, we depict two applications of

DPWord2Vec to solve two typical design pattern tasks, i.e.,

design pattern tag recommendation and design pattern

selection. In the ﬁrst application, when recommending the

top 10 design pattern tags for the posts in a software infor-

mation site, the DPWord2Vec-based method outperforms

two state-of-the-art tag recommendation methods by 6.6

and 32.7 percent respectively in terms of Recall@10. In the

second application, the method reﬁned by DPWord2Vec

outperforms the two existing design pattern selection meth-

ods by 6.5 and 70.7 percent respectively when considering

the mean values of Mean Reciprocal Rank (MRR) over three

design pattern collections.
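
The two ranking metrics named above can be sketched as follows. This is a minimal illustration of the standard definitions of Mean Reciprocal Rank and Recall@k, not the evaluation code of the paper; the function names and the toy rankings are assumptions made for the example.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical rankings of design patterns for two queries.
queries = [
    (["observer", "singleton", "adapter"], {"singleton"}),
    (["factory method", "builder"], {"factory method"}),
]
mrr = mean_reciprocal_rank(queries)
recall = recall_at_k(["observer", "singleton", "adapter"], {"observer", "proxy"}, k=2)
```

With these toy inputs, the first query contributes a reciprocal rank of 1/2 and the second contributes 1, so the mean is 0.75.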

In this paper, we make the following contributions:

1) We propose DPWord2Vec that maps both design

patterns and natural language words into vectors to

support design pattern relevant tasks. To the best of

our knowledge, this is the ﬁrst work that establishes

the universal relationship between design patterns

and natural languages.

2) We evaluate DPWord2Vec on a manually labelled

dp-word pair dataset to show its effectiveness in

semantic relatedness estimation.

3) DPWord2Vec is applied to two design pattern rele-

vant tasks, namely design pattern tag recommenda-

tion and design pattern selection. DPWord2Vec

outperforms the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2

shows the background of the study. Section 3 presents the

framework of DPWord2Vec. The settings and results for

evaluating DPWord2Vec are depicted in Sections 4 and 5,

respectively. Sections 6 and 7 introduce two applications of

DPWord2Vec. Section 8 discusses potential threats to valid-

ity. Some studies related to our work are outlined in

Section 9. We conclude the paper in Section 10.

2 PRELIMINARIES

Before the depiction of DPWord2Vec, we demonstrate the

concept of design pattern in this study and brieﬂy introduce

the word embedding technique.

2.1 Concept of Design Pattern

Generally speaking, design patterns are proven solutions to

recurring software design problems [2]. However, to the

best of our knowledge, there are no formal deﬁnitions nor

standard lists of design patterns. There exist a number of

design pattern collections that are published with multiple

channels, such as design pattern books, academic papers, or

1. https://guides.rubyonrails.org/active_record_basics.html

2. https://angular.io/guide/dependency-injection

LIU ET AL.: DPWORD2VEC: BETTER REPRESENTATION OF DESIGN PATTERNS IN SEMANTICS 1229


online libraries [7]. Design patterns in different collections

may be depicted in different ways, e.g., in ﬂat text format or

using UML. In this paper, we focus on the design patterns

with rich textual descriptions and collect design patterns

from various sources.

Similar to “design pattern”, “architecture pattern” is also a means for software design. Strictly speaking, they are not the same concept, but the boundary between them is not consistent across design pattern collections. For example, Model View Controller is an example of an architectural pattern in Wikipedia but is marked as a design pattern in MSDN. Therefore, instead of creating a standard subjectively, we choose not to distinguish them in our study. Once an entity is identified as a design pattern in some design pattern collection, we regard it as a design pattern.

2.2 Word Embedding

Word embedding is a set of techniques that maps words or

phrases in the vocabulary to vectors of real numbers. The

core part of DPWord2Vec is also word embedding, but it

handles both words and design patterns. Word embedding

methods focus on mapping words into a continuous vector

space with a much lower dimension than the size of vocabu-

lary and the vector representation of each word is deter-

mined by supervised learning based on the corpus [11].

To facilitate the demonstration, we explain how word

embedding works with an example. Assume there is a

corpus that contains a sentence: “software design patterns

encapsulate proven solutions that address recurring prob-

lems”. To mine the relationships between words, the sliding

context window strategy is usually used [11]. A context

window contains a central word and several surrounding

words which are at a distance of no more than c positions

from the central word. For example, the context window

with centre “patterns” and c = 2 contains the surrounding

words “software”, “design”, “encapsulate”, and “proven”.

Multiple local context windows are constructed as the cen-

tral word slides from the beginning (“software”) to the end

(“problems”) of the corpus.
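
The sliding context window described above can be sketched in a few lines. This is an illustrative sketch only; the function name `context_windows` is an assumption, and the sentence and the value c = 2 are taken from the example in the text.

```python
def context_windows(tokens, c=2):
    """Yield (central_word, surrounding_words) for every position in tokens.

    The surrounding words are those at most c positions away from the centre.
    """
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - c):i]
        right = tokens[i + 1:i + 1 + c]
        yield center, left + right


sentence = ("software design patterns encapsulate proven solutions "
            "that address recurring problems").split()
windows = list(context_windows(sentence, c=2))

# The window centred on "patterns" holds the four neighbours named in the text.
center, surrounding = windows[2]
```

Windows at the start and end of the sentence are simply shorter, since fewer than c words exist on one side.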

Then the word vectors are learnt based on these local

context windows. The intuition is that if two words appear

frequently in the same context window then their vector

representations are highly associated. For example, the

objective of the Skip-gram model is to learn word vector

representations that are good at predicting each surround-

ing word by the vector of the central word [11]. Conversely,

the Continuous Bag-of-Words (CBOW) model aims to pre-

dict the central word by the concatenation or average of the

vectors of the surrounding words [11]. Different from them,

the GloVe model counts the number of the total co-occur-

rences of each pair of words through all the local context

windows and predicts the co-occurrence number by the

vectors of the words in the pair [13].
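
To make the GloVe-style statistic concrete, the sketch below accumulates co-occurrence counts over all local context windows of a token sequence. It is a simplified illustration under stated assumptions: every co-occurrence counts as 1, whereas the original GloVe formulation down-weights distant pairs, and the function name and toy sentence are invented for the example.

```python
from collections import Counter


def cooccurrence_counts(tokens, c=2):
    """Count how often each ordered (central, surrounding) pair co-occurs.

    Each position acts once as the central word; every other word within
    c positions adds 1 to the pair's count (no distance weighting).
    """
    counts = Counter()
    for i, center in enumerate(tokens):
        start, end = max(0, i - c), min(len(tokens), i + c + 1)
        for j in range(start, end):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts


tokens = "design patterns solve design problems".split()
counts = cooccurrence_counts(tokens, c=2)
```

A GloVe-like model would then fit word vectors so that a function of each pair of vectors approximates the (log) count of that pair.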

3 THE DPWORD2VEC FRAMEWORK

DPWord2Vec aims to embed natural language words and

design patterns into one vector space. This process can be

divided into four phases (as shown in Fig. 1). First, a corpus related to design patterns is acquired from multiple sources. Next, the documents in the corpus are preprocessed. Then, we propose a context window-based strategy to strengthen the tie between words and design patterns. Finally, the word and design pattern vectors are trained based on the corpus and the context windows.

3.1 Corpus Building

To train the vectors of words and design patterns, a corpus

relevant to design patterns should be built at ﬁrst. Formally,

we construct a general corpus C, which contains multiple documents. Each document doc in C has two components: the token component doc.Tokens, a sequence of natural language words that describes some design patterns, and the design pattern component doc.DPs, a set of design patterns described by doc.Tokens. The documents in the general corpus C can be further categorized into two groups according to their sources.

Description Corpus. Documents in this corpus are extracted

from design pattern books and Wikipedia. Some design pat-

tern books catalog their own lists of design patterns. For

example, GoF presents 23 design patterns with the problem

deﬁnitions and design speciﬁcations [2]. A design pattern is

usually described by a chapter or a section in a design pattern

book. Similarly, a number of design patterns are speciﬁed by

Wikipedia as entries with one page for each design pattern.

A chapter or section of a design pattern book, or a Wikipedia page of a design pattern forms a document doc. In this corpus, doc.Tokens denotes the whole text in the chapter, section, or page, excluding the code snippets. Meanwhile, doc.DPs contains only one element, i.e., the described design pattern.

In total, the description corpus contains 431 documents,

which are associated with 13 design pattern books and 125

Wikipedia pages. Amongst the design pattern components,

349 unique design patterns are involved.
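
The two-component document structure defined above can be modelled with a small record type. This is a hypothetical sketch: the class name `Document`, the field names, and the three toy documents are assumptions for illustration and are not taken from the released corpus.

```python
from dataclasses import dataclass


@dataclass
class Document:
    tokens: list  # doc.Tokens: sequence of natural language words
    dps: set      # doc.DPs: design patterns described by the tokens


corpus = [
    # Description-style documents: exactly one design pattern each.
    Document("define an interface for creating an object".split(),
             {"factory method"}),
    Document("ensure a class has only one instance".split(),
             {"singleton"}),
    # A document may also relate to several design patterns at once.
    Document("how do i combine a singleton with a factory method".split(),
             {"singleton", "factory method"}),
]

# Unique design patterns involved in the corpus.
dp_vocabulary = sorted(set().union(*(doc.dps for doc in corpus)))
```

Collecting the union of all `dps` sets mirrors how the number of unique design patterns in a corpus is obtained.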

Fig. 1. The framework of DPWord2Vec.

3. https://en.wikipedia.org/wiki/Architectural_pattern

4. https://msdn.microsoft.com/en-us/library/ms978748.aspx

5. https://en.wikipedia.org/wiki/Category:Software_design_patterns


Crowdsourced Corpus. Documents in this corpus are constructed by referring to a programming forum, i.e., Stack Overflow. In the previous study [12], 187,493 design pattern relevant question posts spanning from August 2008 to December 2017 were detected in Stack Overflow.

A design pattern relevant post is one in which the design pattern name(s) appears at least once. However, it is

not a trivial string matching task to detect the design pattern

occurrences in Stack Overﬂow posts, as the discussions on

Stack Overﬂow are usually informal [14], [15] and the name of

a design pattern may not be mentioned in a unique form. It is

also referred to as the morphological form issue [14]. The pre-

vious study has attempted to address this issue in two

aspects. On the one hand, the standard design pattern names

as well as other common names are collected simultaneously

from the existing design pattern collections, e.g., design pat-

tern books, in which the other well-known names of each con-

taining design pattern are usually presented explicitly, e.g.,

marked as “also known as”. These names include aliases, e.g.,

“open implementation” is an alias for “reﬂection”, and acro-

nyms, e.g., “mvc” is an acronym for “model view controller”.

On the other hand, regular expressions are leveraged to allow

some variants when searching a design pattern name in the

text of the Stack Overflow posts. For example, the regular expression for “model view controller” is “model[^a-z]?view[^a-z]?controller”, where “[^a-z]?” denotes a non-alphabetic character that matches zero or one time, so that the variants

such as “model-view-controller”, “model_view_controller”,

and “modelviewcontroller” can be involved. A manual vali-

dation on the sampled posts shows that the detection

is acceptably accurate, i.e., achieves Precision value of

97.3 percent and Recall value of 87.8 percent. More details can

be obtained by referring to [12].
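
The regular-expression matching of name variants described above can be sketched as follows. This is a minimal illustration, assuming the text is lowercased before matching; the helper name `name_pattern` is hypothetical, and the detection procedure in [12] may include further steps (such as alias handling) that are not shown here.

```python
import re


def name_pattern(name):
    """Build a regex that tolerates one optional non-alphabetic separator
    between the words of a design pattern name."""
    words = [re.escape(word) for word in name.lower().split()]
    return re.compile("[^a-z]?".join(words))


mvc = name_pattern("model view controller")

variants = [
    "model-view-controller",
    "model_view_controller",
    "modelviewcontroller",
    "Model View Controller",
]
# Lowercase the text first, since the pattern only contains lowercase letters.
matches = [bool(mvc.search(text.lower())) for text in variants]
no_match = bool(mvc.search("model and view controller"))
```

All four variants match, while a phrase with an extra word between the name parts does not.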

We use these question posts to construct the crowdsourced corpus. Moreover, it is enriched by all the answer posts to these design pattern relevant question posts. A question post and each of its answer posts are assigned to different documents. The design pattern(s) relevant to an answer post are the same as those of its question post. For a document doc in this corpus, doc.Tokens denotes the content obtained by merging the title and body of a question or answer post with code snippets discarded, and doc.DPs is the set of design pattern(s) relevant to the post.

Finally, there are 491,124 documents in this corpus and

210 unique design patterns are involved.

By merging the two corpora, we obtain a general corpus C, which contains 491,555 documents. The involved design patterns are indexed and form a design pattern vocabulary, namely V, with 372 design patterns. Although the documents in the description corpus are far fewer than those in the

crowdsourced corpus, the description corpus is indispens-

able for building the design pattern vectors. On one hand,

the description corpus makes it possible to build vectors for

the design patterns that are rarely discussed in Stack Over-

ﬂow. On the other hand, this corpus tends to provide more

formal and precise depictions of design patterns than the

crowdsourced corpus. We will show its signiﬁcance in

Section 5.1.

3.2 Corpus Preprocessing

Compared with general natural language documents, the number of design pattern relevant documents tends to be quite small. Therefore, our corpus is relatively smaller than those used for training common word vectors [11], [13]. For this reason, we preprocess the token component of each document to filter out insignificant and redundant information and build a compact vocabulary.

First, code-like tokens (e.g., function names) in a natural language sentence are split according to camel case to ensure the semantic integrity of the sentence. With this step,

on the one hand, these code-like tokens can be converted into

more understandable identiﬁers [16] to better reﬂect the

semantic meanings. On the other hand, the volume of the

vocabulary can be reduced. Next, we tokenize and lowercase

the token component of each document. Then, the less infor-

mative tokens, including English-language stop words, spe-

cial tags (HTML tags in Stack Overﬂow posts, and reference

markers in design pattern books and Wikipedia pages), and

non-alphabetic characters (e.g., numbers) are removed from

the text, as they are not very useful to reﬂect the semantic rela-

tionship between the natural language and design patterns.

Moreover, each token is stemmed to its root form, e.g.,

“developer”, “developed”, and “developing” to “develop”.

As words with the same root usually have similar meanings [17] and their vector representations are also similar in some word embedding methods [18], [19], we can simply regard them as the same word without losing much semantic information. Finally, we discard the words that occur no more than five times in the corpus when constructing the vocabulary but retain them in the corpus. These words are likely to be noisy terms [20], and training vectors for them is of little value.
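
The preprocessing steps above can be sketched as a small pipeline. This is a simplified illustration under several assumptions: the stop word list is a tiny placeholder, `toy_stem` is a crude suffix stripper standing in for a real stemmer, HTML tag and reference marker removal is omitted, and all function names are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}  # placeholder list


def split_camel_case(text):
    """Insert a space at each lower-to-upper boundary: getInstance -> get Instance."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


def toy_stem(token):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text, stem=toy_stem):
    """Split camel case, lowercase, keep alphabetic tokens, drop stop words, stem."""
    text = split_camel_case(text)
    tokens = re.findall(r"[a-z]+", text.lower())  # digits and symbols are dropped
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_vocabulary(documents, min_count=6):
    """Keep words occurring more than five times across all documents."""
    counts = Counter(tok for doc in documents for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}


tokens = preprocess("The getInstance method is developed in 2 steps")
vocab = build_vocabulary([["pattern"] * 6, ["rare"] * 5])
```

Rare words are excluded only from the vocabulary; the token lists themselves are left unchanged, matching the statement that such words are retained in the corpus.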

Some of the above steps, such as camel case splitting, stop word removal, and word stemming, may not be common in word embedding methods. With abundant training corpora, the vector representation of each distinct identifier in the text can be learnt. However, given the scale of the design pattern corpus, it is reasonable to conduct these preprocessing steps to reduce the vocabulary size, i.e., the number of vectors to be learnt, to fit the corpus. Furthermore, since the focus of this study is to build the semantic relationship between natural languages and design patterns, representing all the identifiers precisely is not a core concern. The word context, a common concept in word embedding methods, will not be significantly affected by the preprocessing, since the eliminated tokens contain little semantic information and the meanings of the changed tokens are mainly retained. It is therefore adequate to apply word embedding methods to the preprocessed corpus.

After the preprocessing, we obtain a word vocabulary V_Word that contains 27,770 words.

3.3 Context Window Construction

As to the corpus we build, each document contains two

parts: the natural language words and the design patterns.

To train the vectors of words and design patterns together,

6. https://stackoverflow.com/

7. The detailed description corpus and crowdsourced corpus, as

well as the number of relevant documents to each design pattern are

available via https://github.com/WoodenHeadoo/dpword2vec.

