1. Tools built on open source annotated corpora work poorly on domain-specific CWS.
2. Annotated domain data is scarce due to the high cost of annotation.
3. How to leverage open source annotated data despite its generality remains an open question.
Recently, efforts have been made to exploit open source (high-resource) data to improve the performance of domain-specific (low-resource) tasks and to reduce the amount of domain annotated data required (Yang et al., 2017; Peng and Dredze, 2016; Mou et al., 2016). In this paper, we further this line of work by developing a multi-task learning (Caruana, 1997; Peng and Dredze, 2016) framework, named Adaptive Multi-Task Transfer Learning. Inspired by the success of Domain Adaptation (Saenko et al., 2010; Tzeng et al., 2014; Long and Wang, 2015b), we propose to minimize the distribution distance between the hidden representations of the source and target domains, thereby making the hidden representations adapt to each other and yielding domain-invariant features. Finally, we annotate 3 medical datasets from different medical departments and a medical forum, together with 3 open source datasets??. The contributions of this paper can be summarized as follows:
• We propose a novel framework for Chinese word segmentation in the medical domain.
• To the best of our knowledge, we are the first to analyze the performance of transfer learning methods against the degree of disparity between the target and source domains.
• Our framework outperforms strong baselines especially when there is substantial disparity.
• We open source 3 medical CWS datasets from different sources, which can be used for further study.
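To make the distribution-distance idea concrete, the sketch below computes a linear (mean-embedding) form of Maximum Mean Discrepancy, one common choice of adaptation loss in this line of work (e.g., Long and Wang, 2015b). This is an illustrative stand-in, not the paper's actual loss; the function name and toy vectors are our own.

```python
def linear_mmd(source_batch, target_batch):
    """Squared distance between the mean feature vectors of two batches.

    A simple (linear-kernel) instance of Maximum Mean Discrepancy:
    MMD^2 = || mean(source) - mean(target) ||^2.
    Minimizing such a term pushes source- and target-domain hidden
    representations toward a shared, domain-invariant distribution.
    """
    dim = len(source_batch[0])
    mean_s = [sum(v[d] for v in source_batch) / len(source_batch) for d in range(dim)]
    mean_t = [sum(v[d] for v in target_batch) / len(target_batch) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))

# Toy hidden states: matching batch means give zero penalty,
# a shifted target batch gives a positive penalty.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt_same = [[0.5, 0.5], [0.5, 0.5]]
tgt_shifted = [[1.5, 0.5], [1.5, 0.5]]
print(linear_mmd(src, tgt_same))     # 0.0 — identical batch means
print(linear_mmd(src, tgt_shifted))  # 1.0 — means differ by (1.0, 0.0)
```

In practice the batches would be the Bi-LSTM hidden states of source- and target-domain sentences, and this term would be added to the tagging loss during multi-task training.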
2 Related Work
2.1 Chinese word segmentation
Statistical Chinese word segmentation has been studied for decades. Xue (2003) was the first to treat it as a sequence tagging problem, using a maximum entropy model. Peng et al. (2004) achieved better results with a conditional random field model (Lafferty et al., 2001). This approach has been followed by many subsequent works (Zhao et al., 2006; Sun et al., 2012).
Recently, neural network models have been applied to CWS. These methods use features automatically derived by neural networks instead of hand-crafted discrete features. Zheng et al. (2013) first adopted a neural network architecture for CWS. Chen et al. (2015b) used Long Short-Term Memory (LSTM) networks to capture long-term dependencies. Chen et al. (2015a) proposed a gated recursive neural network (GRNN) to incorporate context information. In this paper, we adopt the Bidirectional LSTM-CRF model (Huang et al., 2015) as our base model.
2.2 Transfer Learning
Transfer learning distills knowledge from the source domain to help the target domain achieve higher performance (Pan and Yang, 2010). In feature-based models, many transfer approaches have been studied, including instance transfer (Jiang and Zhai, 2007; Liao et al., 2005), feature representation transfer (Argyriou et al., 2006; Argyriou et al., 2007), parameter transfer (Lawrence and Platt, 2004; Bonilla et al., 2007) and relational knowledge transfer (Mihalkova et al., 2007; Mihalkova et al., 2009).
Recently, the transferability of neural networks has also been studied. For example, Mou et al. (2016) studied two methods (INIT, MULT) on NLP applications. Peng and Dredze (2016) proposed to use a domain mask and linear projection on top of multi-task learning (MTL) (Long and Wang, 2015a). In this paper, we follow MTL and extend the framework with a novel loss function.
3 Single-Task Chinese word segmentation
In this section, we briefly formulate the Chinese word segmentation task and introduce our base model,
Bi-LSTM-CRF (Huang et al., 2015).
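Following the sequence tagging formulation (Xue, 2003), CWS is typically cast as assigning each character one of the BMES labels: B(egin), M(iddle), or E(nd) of a multi-character word, or S(ingle) for a one-character word. The tagger (here, Bi-LSTM-CRF) then predicts these labels. The sketch below shows the conversion between segmentations and BMES tags; the function names and example sentence are our own illustration, not the paper's code.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover the word segmentation from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends after an E or S tag
            words.append(current)
            current = ""
    if current:                # tolerate a truncated tag sequence
        words.append(current)
    return words

# Toy segmentation: 北京大学 / 生 / 前来 / 报到
words = ["北京大学", "生", "前来", "报到"]
tags = words_to_bmes(words)
print(tags)  # ['B', 'M', 'M', 'E', 'S', 'B', 'E', 'B', 'E']
print(bmes_to_words("".join(words), tags))
```

Under this encoding, the Bi-LSTM produces per-character label scores and the CRF layer models label-transition constraints (e.g., B must be followed by M or E), so decoding yields a valid segmentation.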