# Romanian Diacritic Restoration With Neural Nets
## Live Demo: [pagini.ro](http://pagini.ro/)
Author [Horia Cristescu](mailto:horia.cristescu@gmail.com). I can work remote and am available for hiring in Bucharest.
## Why?
Writing in Romanian with diacritics on an English keyboard can be hard. Every day millions of people write comments, articles and emails without diacritics. A corpus study based on OpenCrawl revealed that only 81% of the online text in Romanian has diacritics.
That is what motivated me to pick this problem. It is also a "nice problem" because there is a lot of training data available: I used the Romanian Wikipedia and OpenCrawl.
## How?
I took a large text corpus and removed the diacritics, then trained a neural network to predict the diacritics back. I used recurrent and convolutional layers (LSTMs and CNNs) to build the model. After the neural net makes a prediction, I run a check for obvious mistakes with a large dictionary.
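A minimal sketch of the flattening step, assuming a simple character translation table (a real preprocessor would also want to normalise the legacy cedilla forms ş/ţ to the comma-below forms ș/ț before flattening):
```python
# Minimal sketch: flatten Romanian text by mapping each diacritic letter
# to its base letter, producing the "no diacritics" training input.
FLATTEN = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")

def strip_diacritics(text):
    return text.translate(FLATTEN)

strip_diacritics("Țara își așteaptă răspunsul")  # -> 'Tara isi asteapta raspunsul'
```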
## Features:
To train the network I used both character-level and word-level features. The obvious problem is how to align them inside a neural net. I chose to replicate the word embeddings for each letter, thus obtaining a richer character embedding that takes the whole word into account.
I lowercased the text and removed all characters except letters, digits and a few punctuation marks. Later, when the model makes a prediction, I lowercase the input text and then recover the casing and the out-of-set characters on the output.
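A minimal sketch of that case-recovery step; the helper below is only an illustration, not the repo's code, and relies on the prediction being position-aligned with the input:
```python
# Hypothetical helper: copy casing and out-of-set characters from the
# user's original input onto the model's lowercased prediction.
def recover_case(original, predicted):
    out = []
    for orig_ch, pred_ch in zip(original, predicted):
        if not orig_ch.isalpha():       # punctuation, digits, emoji, ...
            out.append(orig_ch)         # copy through unchanged
        elif orig_ch.isupper():
            out.append(pred_ch.upper())
        else:
            out.append(pred_ch)
    return "".join(out)

recover_case("Tara mea!", "țara mea!")  # -> 'Țara mea!'
```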
To compute word embeddings I chose to hash words into the range 0..1,000,000 and then run the word ids through a similarly sized Embedding layer of width 50. The char embeddings are based on an Embedding table as well. The char and word embeddings are learned jointly (end-to-end).
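A minimal sketch of that hashing trick, assuming a simple deterministic hash into the 1,000,000-bucket range (the exact hash function is an assumption; what matters is that it is stable between training and inference):
```python
import zlib

MAX_WORD_HASH = 1_000_000

def word_id(word):
    # Hashing trick: every word maps into a fixed-size id space, so no
    # vocabulary file is needed and unseen words still get an embedding row.
    return zlib.crc32(word.encode("utf-8")) % MAX_WORD_HASH

word_id("mamaliga")  # some id in 0..999,999, stable across runs
```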
The output could be the correctly 'diacritised' character, but instead the model predicts only the diacritic sign itself. I mapped:
- a..z letters -> 0
- "ț", "ș", "î" -> 1
- "ă" -> 2
- out of set chars -> 3
This way I limited the size of the softmax layer and sped up training.
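A minimal sketch of this 4-class labelling scheme; it assumes that the circumflex letter "â" joins class 1 alongside ț, ș and î, since it also flattens to a plain base letter:
```python
# Assumption: "â" is grouped with the other modified letters in class 1.
CLASS_1 = set("țșîâ")
CLASS_2 = set("ă")
PLAIN = set("abcdefghijklmnopqrstuvwxyz")

def char_label(ch):
    ch = ch.lower()
    if ch in CLASS_1:
        return 1
    if ch in CLASS_2:
        return 2
    if ch in PLAIN:
        return 0
    return 3  # out-of-set characters

[char_label(c) for c in "mămăligă"]  # -> [0, 2, 0, 2, 0, 0, 0, 2]
```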
## Architecture
<img src="app/word_char_map.png?raw=true" width="508">
<img src="app/model.png?raw=true" width="508">
The model is based on CNNs and LSTMs. We have two paths - character level and word level. My intuition for using separate word and char level paths is to learn both long range structure and morphology. For the character path, we use embeddings and three layers of CNN. The word path goes through embedding and biLSTM. We merge the two paths by projecting words to characters, based on a projection matrix which is received as an additional input. Then we have three more CNN layers and output predictions.
```python
import keras
from keras import backend as K
from keras.layers import (Input, Embedding, Conv1D, LSTM, Bidirectional,
                          Lambda, Dense, TimeDistributed)
from keras.models import Model
from keras.optimizers import Adam

# char_vocab_size and max_word_hash (1,000,000) come from the preprocessing step

def model_def():
    # version 17
    # char level input (char ids)
    input_char = Input(shape=(None,))
    # word level input (word ids)
    input_word = Input(shape=(None,))
    # word x char translation map, one row per word, one column per char
    input_map = Input(shape=(None, None))
    # embed chars
    char_embed = Embedding(char_vocab_size, 50)(input_char)
    # run through 3 layers of CNN
    char_pipe = Conv1D(128, 31, name="conv_size_31", activation='relu', padding='same')(char_embed)
    char_pipe = Conv1D(128, 21, name="conv_size_21", activation='relu', padding='same')(char_pipe)
    char_pipe = Conv1D(128, 15, name="conv_size_15", activation='relu', padding='same')(char_pipe)
    # pass words through LSTM
    word_pipe = Embedding(max_word_hash, 50, name='embed_word')(input_word)
    word_pipe = Bidirectional(LSTM(50, return_sequences=True))(word_pipe)  # (batch, n_words, 100)
    # map word space to char space
    input_map_p = Lambda(lambda x: K.permute_dimensions(x, (0, 2, 1)), name='transpose_map')(input_map)
    word_pipe = Lambda(lambda x: K.batch_dot(x[0], x[1], axes=[2, 1]), name='project_words_chars')([input_map_p, word_pipe])
    # concatenate word features, char features and the raw char embeddings
    pipe = keras.layers.concatenate([word_pipe, char_pipe, char_embed], axis=-1)
    # three more layers of CNN
    pipe = Conv1D(128, 11, name="conv_size_11", activation='relu', padding='same')(pipe)
    pipe = Conv1D(128, 7, name="conv_size_7", activation='relu', padding='same')(pipe)
    pipe = Conv1D(128, 3, name="conv_size_3", activation='relu', padding='same')(pipe)
    # reduce output to 4 channels (diacritic classes) per char
    output = TimeDistributed(Dense(4, activation='softmax'))(pipe)
    model = Model([input_char, input_word, input_map], output)
    optimizer = Adam(lr=0.001)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy'])
    return model
```
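The `input_map` tensor is the word-to-character alignment: row `w`, column `c` is 1 when character `c` belongs to word `w`, so the `batch_dot` above copies each word's biLSTM output onto all of its characters. A minimal sketch of building such a map for a single example (the repo's actual construction may treat spaces and padding differently):
```python
import numpy as np

def build_word_char_map(words):
    # map[w, c] = 1.0 when character position c of the joined text
    # (words separated by single spaces) belongs to word w.
    n_chars = len(" ".join(words))
    wc_map = np.zeros((len(words), n_chars), dtype="float32")
    pos = 0
    for w, word in enumerate(words):
        wc_map[w, pos:pos + len(word)] = 1.0
        pos += len(word) + 1  # skip the separating space
    return wc_map

build_word_char_map(["ana", "are", "mere"]).shape  # (3, 12)
```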
## Training
The final accuracy before dictionary word check is **99.86%**.
I used batches of 64, examples of 150 characters and 100 training epochs with Adam (initial lr = 0.001). The model reaches 99.30% accuracy in the first epoch, but then it takes a long time to reach 99.86%, after which it can't improve any more. No matter how I changed the architecture, this limit held - it was actually pretty interesting to see that in action. It only moves if I train on different data. At this point the model makes about 1 error in 500 characters. Some of those errors would have been hard to predict even for humans given only the flattened text.
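For reference, a minimal sketch of a training call under those settings; the generator name and step counts are hypothetical placeholders, not the repo's actual training script:
```python
# Hypothetical training loop: batches of 64 examples, 150 characters each.
model = model_def()
model.fit_generator(
    batch_generator(batch_size=64, example_len=150),   # yields ([chars, words, map], labels)
    steps_per_epoch=10000,                              # placeholder
    epochs=100,
    validation_data=batch_generator(batch_size=64, example_len=150, split='val'),
    validation_steps=100)                               # placeholder
```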
For validation I set apart 500k of text. The training and validation scores converged remarkably well at the end of training.
The trained model is available upon request, being too large to host on GitHub.
## What didn't work so well:
I tried a char-based LSTM without word-level information, but got 0.5% lower accuracy. I believe the word embedding path is what lets the model climb past 99.70%, because most of the capacity sits in the word embeddings; there are few char embeddings and the Conv1D layers don't have that many weights, so there's less capacity there. On the other hand, using only word-level features is very slow to train because of the word sparsity problem.
I tried predicting only the diacritic of the center character in each example, but this gives similar accuracy to predicting the whole example at once.
## Other methods:
Other approaches are usually based on n-gram models. I tried counting word n-grams up to size 3 over a corpus of 1 GB of cleaned-up text, using the count-min-sketch library [madoka](https://github.com/ikegami-yukino/madoka-python). The n-gram model handles a large portion of the diacritics well, but nowhere near the neural model, and it was too brittle. Counting larger n-grams would have been hard and the tables very sparse; in practice you cannot find n-grams in the wild for all possible word combinations.
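For illustration, a minimal sketch of the counting side of that baseline, using a plain Counter instead of madoka's count-min sketch (the sketch is what makes the counts fit in memory over 1 GB of text; restoration then picks the most frequent diacritised variant for each flattened n-gram):
```python
from collections import Counter

def count_word_ngrams(tokens, max_n=3):
    # Count diacritised word n-grams up to trigrams; at lookup time the
    # flattened query n-gram selects its most frequent diacritised variant.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = count_word_ngrams("țara își așteaptă răspunsul".split())
```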
## Website
I used Klein for the backend and jQuery with plain HTML/CSS for the front end. The theme is based on Bootstrap.
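A minimal sketch of what the Klein side can look like; the route name and the `restore_diacritics` helper are hypothetical stand-ins for the model wrapper:
```python
from klein import Klein

app = Klein()

@app.route('/restore', methods=['POST'])
def restore(request):
    # Hypothetical endpoint: read the flattened text, run the model,
    # return the diacritised text.
    text = request.args.get(b'text', [b''])[0].decode('utf-8')
    return restore_diacritics(text).encode('utf-8')  # hypothetical model wrapper

app.run("localhost", 8080)
```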
## Other Romanian diacritic restoration services:
- http://diacritice.ai/
- http://plagiarisma.net/ro/spellcheck.php
- http://diacritice.opa.ro/
- http://www.diacritice.com/