# connlp
A collection of Python modules for analyzing text data in the construction industry.
It mainly repackages existing Python libraries for Natural Language Processing (NLP).
## _Project Information_
- Supported by C!LAB (@Seoul Nat'l Univ.)
## _Contributors_
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Sehwan Chung (hwani751@snu.ac.kr)
- Jungyeon Kim (janykjy@snu.ac.kr)
# Initialize
## _Setup_
Install _**connlp**_ with _pip_.
```shell
pip install connlp
```
Install the dependencies listed in _requirements.txt_.
```shell
cd WORKSPACE
wget -O requirements_connlp.txt https://raw.githubusercontent.com/blank54/connlp/master/requirements.txt
pip install -r requirements_connlp.txt
```
## _Test_
If the code below runs without errors, _**connlp**_ has been installed successfully.
```python
from connlp.test import hello
hello()
# 'Helloworld'
```
# Preprocess
The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).
## _Normalizer_
_**Normalizer**_ normalizes the input text by eliminating trash characters (e.g., punctuation marks) and lowercasing the remaining numbers and alphabet letters.
```python
from connlp.preprocess import Normalizer
normalizer = Normalizer()
normalizer.normalize(text='I am a boy!')
# 'i am a boy'
```
## _EnglishTokenizer_
_**EnglishTokenizer**_ tokenizes English input text based on word spacing.
Ngram-based tokenization is under development.
```python
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
tokenizer.tokenize(text='I am a boy!')
# ['I', 'am', 'a', 'boy!']
```
## _KoreanTokenizer_
_**KoreanTokenizer**_ tokenizes Korean input text using either a pre-trained or an unsupervised approach.
We recommend the pre-trained method unless you have a large corpus; this is the default setting.
To use a pre-trained tokenizer, you have to select an analyzer. The available analyzers are based on KoNLPy (https://konlpy.org/ko/latest/api/konlpy.tag/), a Python package for Korean language processing. The default analyzer is _**Hannanum**_.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
```
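If you prefer a different KoNLPy analyzer, pass its name instead. The sketch below is only illustrative: 'Komoran' is a KoNLPy tagger name, but whether _**connlp**_ accepts it depends on which taggers the package actually wraps.
```python
from connlp.preprocess import KoreanTokenizer

# 'Komoran' is an illustrative KoNLPy analyzer name, not a documented connlp option;
# check the package if this name is not accepted.
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Komoran')
```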
If your corpus is large, you may use the unsupervised method, which is based on _**soynlp**_ (https://github.com/lovit/soynlp), an unsupervised text analyzer for Korean.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
```
### _train_
If your _**KoreanTokenizer**_ is pre-trained, you can skip this step.
Otherwise (i.e., if you are using the unsupervised approach), the _**KoreanTokenizer**_ object first needs to be trained on an (unlabeled) corpus. A 'word score' is then calculated for every subword in the corpus.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)
print(tokenizer.word_score)
# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}
```
### _tokenize_
If you are using a pre-trained _**KoreanTokenizer**_, the selected KoNLPy analyzer will tokenize the input sentence based on morphological analysis.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
doc = '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스', '의', '첫', '번째', '문서', '입니다', '.']
```
If you are using an unsupervised _**KoreanTokenizer**_, tokenization is based on the 'word scores' calculated by the _**KoreanTokenizer.train**_ method.
For each blank-separated token, the subword with the maximum 'word score' is selected as an individual 'word' and separated from the remaining part.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)  # the unsupervised tokenizer needs word scores before tokenizing
doc = docs[0]  # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스의', '첫', '번째', '문서', '입니다.']
```
## _StopwordRemover_
_**StopwordRemover**_ removes stopwords from a given sentence based on a user-customized stopword list.
Before using _**StopwordRemover**_, the user should normalize and tokenize the documents.
```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover
normalizer = Normalizer()
eng_tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()
docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']
tokenized_docs = []
for doc in docs:
    normalized_doc = normalizer.normalize(text=doc)
    tokenized_doc = eng_tokenizer.tokenize(text=normalized_doc)
    tokenized_docs.append(tokenized_doc)
print(docs)
print(tokenized_docs)
# ['I am a boy!', 'He is a boy..', 'She is a girl?']
# [['i', 'am', 'a', 'boy'], ['he', 'is', 'a', 'boy'], ['she', 'is', 'a', 'girl']]
```
The user should prepare a customized stopword list (i.e., a _stoplist_).
The _stoplist_ should contain the user-customized stopwords separated by '\n', and the file should be in ".txt" format.
```text
a
is
am
```
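For instance, a minimal sketch of creating such a stoplist file with the Python standard library (the filepath is simply the example path used below):
```python
# Write a user-customized stoplist: one stopword per line, in a plain .txt file.
stopwords = ['a', 'is', 'am']
with open('test/thesaurus/stoplist.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(stopwords))
```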
Initiate the _**StopwordRemover**_ with the filepath of the user-customized stopword list.
If no stoplist exists at the filepath, the stoplist remains an empty list.
```python
fpath_stoplist = 'test/thesaurus/stoplist.txt'
stopword_remover.initiate(fpath_stoplist=fpath_stoplist)
print(stopword_remover)
# <connlp.preprocess.StopwordRemover object at 0x7f163e70c050>
```
The user can count the word frequencies and figure out additional stopwords based on the results.
```python
stopword_remover.count_freq_words(docs=tokenized_docs)
# ========================================
# Word counts
# | [1] a: 3
# | [2] boy: 2
# | [3] is: 2
# | [4] i: 1
# | [5] am: 1
# | [6] he: 1
# | [7] she: 1
# | [8] girl: 1
```
After updating the _stoplist_, use the _**remove**_ method to remove the stopwords from the text.
```python
stopword_removed_docs = []
for doc in tokenized_docs:
    stopword_removed_docs.append(stopword_remover.remove(sent=doc))
print(stopword_removed_docs)
# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```
The user can check which stopwords were removed with the _**check_removed_words**_ method.
```python
stopword_remover.check_removed_words(docs=tokenized_docs, stopword_removed_docs=stopword_removed_docs)
# ========================================
# Check stopwords removed
# | [1] BEFORE: a(3) ->
# | [2] BEFORE: boy -> AFTER: boy(2)
# | [3] BEFORE: is(2) ->
# | [4] BEFORE: i -> AFTER: i(1)
# | [5] BEFORE: am(1) ->
# | [6] BEFORE: he -> AFTER: he(1)
# | [7] BEFORE: she -> AFTER: she(1)
# | [8] BEFORE: girl -> AFTER: girl(1)
```
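Putting the pieces together, the whole preprocessing pipeline (normalize, tokenize, and remove stopwords) can be written as below. This sketch simply recombines the calls shown above and assumes the same example stoplist file.
```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover

normalizer = Normalizer()
tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()
stopword_remover.initiate(fpath_stoplist='test/thesaurus/stoplist.txt')

docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']

# Normalize and tokenize each document, then remove stopwords.
preprocessed_docs = []
for doc in docs:
    tokens = tokenizer.tokenize(text=normalizer.normalize(text=doc))
    preprocessed_docs.append(stopword_remover.remove(sent=tokens))

print(preprocessed_docs)
# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```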
# Embedding
## _Vectorizer_
_**Vectorizer**_ includes several text embedding methods that have been commonly used for decades.
### _tfidf_
TF-IDF is one of the most commonly used techniques for word embedding.
The TF-IDF model counts the term frequency (TF) and inverse document frequency (IDF) in the given documents.
The results include the following:
- TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
- TF-IDF Matrix
- TF-IDF Vocabulary
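For reference, since the vectorizer is scikit-learn's TfidfVectorizer, the weights follow its standard smoothed formulation (assuming _**connlp**_ keeps the default settings):
```math
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right)
```
where n is the number of documents, tf(t, d) is the count of term t in document d, and df(t) is the number of documents containing t; each document vector is then L2-normalized.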
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()