# Serbian Language Tools - PHP library for Transliteration & Diacritic Restoration
Serbian Language Tools is a PHP library for working with text written in the Serbian language. It features:
- Tokenizer
- **Diacritic restoration tool**
- Transliterator between Serbian Cyrillic and Latin alphabets
- Alphabet detection
## Requirements
This library requires PHP 7.4 or greater with the [sqlite3](https://www.php.net/manual/en/book.sqlite3.php), [intl](https://www.php.net/manual/en/book.intl.php) and [mbstring](https://www.php.net/manual/en/book.mbstring.php) extensions.
## Installation
You can install the package via Composer:
```bash
composer require turanjanin/serbian-language-tools
```
## Usage
To use the library, you first need to tokenize the string. Tokenization is the process of splitting a string into a sequence of tokens, each a run of related characters. This library recognizes the following tokens: Word, Whitespace, URI (which includes URLs, hashtags and at-mentions), Interpunction, HTML and Emoticon.
Tokenization is done by creating a new instance of the `Text` class using the named constructor:
```php
use Turanjanin\SerbianLanguageTools\Text;
$text = Text::fromString('Zdravo svete, ovo je primer teksta!');
```
The `Text` object now contains an array of tokens that can be processed. You can access it like a regular PHP array, since it implements the `ArrayAccess` interface.
```php
echo count($text) . "\n"; // 13
echo get_class($text[1]). "\n"; // Turanjanin\SerbianLanguageTools\Tokens\Whitespace
echo $text[9] . "\n"; // primer
```
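To make the idea of tokenization concrete, here is a minimal, self-contained sketch in plain PHP. It is not the library's `Tokenizer` (which also recognizes URIs, HTML and emoticons); the function name `sketchTokenize` and the array-based token shape are assumptions made purely for illustration.

```php
<?php
// Illustration only: a simplified tokenizer, NOT the library's Tokenizer.
// It knows three token types and ignores URIs, HTML and emoticons.
function sketchTokenize(string $input): array
{
    // Runs of letters/digits, runs of whitespace, or any other single character.
    preg_match_all('/[\p{L}\p{N}]+|\s+|./us', $input, $matches);

    $tokens = [];
    foreach ($matches[0] as $chunk) {
        if (preg_match('/^[\p{L}\p{N}]/u', $chunk)) {
            $type = 'Word';
        } elseif (preg_match('/^\s/u', $chunk)) {
            $type = 'Whitespace';
        } else {
            $type = 'Interpunction';
        }
        $tokens[] = ['type' => $type, 'value' => $chunk];
    }

    return $tokens;
}

$tokens = sketchTokenize('Zdravo svete, ovo je primer teksta!');
echo count($tokens), "\n";       // 13
echo $tokens[1]['type'], "\n";   // Whitespace
echo $tokens[9]['value'], "\n";  // primer
```

For this sentence the sketch happens to produce the same token count and positions as the library example above, but the two will diverge on input containing URLs, HTML or emoticons.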
### Diacritic Restoration / Diacritization
The Serbian Latin alphabet includes several characters that are not part of ASCII. These characters carry diacritics (č, ć, š, ž, đ, and the digraph dž) and are often omitted in everyday communication (social media, emails and SMS), mainly because of the widespread use of English keyboard layouts.
This degraded Latin text is easily understood by human readers, but it poses a significant challenge for search engines and natural language processing. Therefore, this library features an algorithm that automatically restores diacritics in ASCII text by using a [dictionary of Serbian words](dictionary/README.md) and phrases for context disambiguation.
The algorithm inspects all `Word` tokens and looks for restoration candidates: words containing s, c, z or dj. The following two steps are then applied:
1. The most common phrases are searched for in the text and, if found, their words are replaced with the diacritical equivalents. This step takes word context into account, which lets a less frequent variation win where the context calls for it. For example, `sto hiljada` won't be replaced with `što hiljada`, even though the form `što` *(why)* is much more frequent than the word `sto` *(hundred)*.
2. Every restoration candidate is looked up in the dictionary and, if there are known variations, the token is replaced with `RestoredWord` (if there is exactly one possible variation) or `MultipleRestoredWord` (if there are several). When there is more than one variation, the one with the highest frequency is marked as preferred.
Diacritic restoration can be performed by calling the invokable class:
```php
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\DiacriticRestorer;
$text = Text::fromString('Cetiri cavke cuceci dzangrizavo cijucu u zeleznickoj skoli.');
echo (new DiacriticRestorer)($text); // Četiri čavke čučeći džangrizavo cijuču u železničkoj školi.
```
The dictionary needed for this algorithm is stored in a custom-made SQLite database that is included with the library. You can extend this database or use a different storage solution by providing a custom implementation of the `Turanjanin\SerbianLanguageTools\Dictionary\Dictionary` interface.
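The two-step approach described in this section (phrases first, then per-word frequency lookup) can be sketched in plain PHP. This is a rough illustration under stated assumptions, not the library's `DiacriticRestorer`: the function name `sketchRestore`, the in-memory arrays standing in for the SQLite dictionary, and every entry and frequency in them are made up for the example. It also assumes lowercase input with words separated by single spaces and no punctuation.

```php
<?php
// Illustration only: tiny in-memory stand-ins for the bundled dictionary.
// All entries and frequencies are hypothetical.
$phrases = [
    'sto hiljada' => 'sto hiljada', // context keeps "sto" (hundred)
];
$dictionary = [
    // ASCII form => [restored variant => frequency]
    'sto'   => ['što' => 900, 'sto' => 100],
    'skoli' => ['školi' => 50],
];

function sketchRestore(string $input, array $phrases, array $dictionary): string
{
    $words = explode(' ', $input);
    $decided = []; // indexes already settled by the phrase step

    // Step 1: known two-word phrases win over single-word frequencies.
    for ($i = 0; $i + 1 < count($words); $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        if (isset($phrases[$pair])) {
            [$words[$i], $words[$i + 1]] = explode(' ', $phrases[$pair]);
            $decided[$i] = $decided[$i + 1] = true;
        }
    }

    // Step 2: other candidates (words containing s, c, z or dj)
    // take the dictionary variant with the highest frequency.
    foreach ($words as $i => $word) {
        if (isset($decided[$i]) || !preg_match('/[scz]|dj/', $word)) {
            continue;
        }
        if (isset($dictionary[$word])) {
            $variants = $dictionary[$word];
            arsort($variants); // highest frequency first
            $words[$i] = array_key_first($variants);
        }
    }

    return implode(' ', $words);
}

echo sketchRestore('sto hiljada', $phrases, $dictionary), "\n";    // sto hiljada
echo sketchRestore('sto je u skoli', $phrases, $dictionary), "\n"; // što je u školi
```

Unlike the library, which keeps all variations in a `MultipleRestoredWord` token and only marks one as preferred, this sketch simply picks the most frequent variant and discards the rest.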
### Transliteration
The library supports transliteration of text between the Cyrillic, Latin and ASCII alphabets. Transliteration is performed by calling the appropriate invokable class:
```php
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\ToAsciiLatin;
use Turanjanin\SerbianLanguageTools\Transformers\ToCyrillic;
use Turanjanin\SerbianLanguageTools\Transformers\ToLatin;
$cyrillic = Text::fromString('Ово је ћирилични текст');
$latin = Text::fromString('Primer latiničnog teksta');
echo (new ToLatin)($cyrillic); // Ovo je ćirilični tekst
echo (new ToCyrillic)($latin); // Пример латиничног текста
echo (new ToAsciiLatin)($cyrillic); // Ovo je cirilicni tekst
```
If you only need transliteration between the Latin and Cyrillic alphabets, take a look at the simpler library [turanjanin/serbian-transliterator](https://github.com/turanjanin/serbian-transliterator).
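For readers curious about what Cyrillic-to-Latin transliteration involves, the following is a minimal lookup-table sketch in plain PHP. It is not the code behind `ToLatin` or `ToAsciiLatin`; the names `sketchToLatin` and `sketchToAscii`, the title-case handling of digraphs (`Љ` to `Lj`), and the `đ` to `dj` mapping are assumptions for illustration. The reverse direction (Latin to Cyrillic) is harder because of digraphs such as `nj` and `dž`, and is not covered here.

```php
<?php
// Illustration only: plain lookup tables, not the library's transformers.
const CYRILLIC_TO_LATIN = [
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'ђ' => 'đ',
    'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i', 'ј' => 'j', 'к' => 'k',
    'л' => 'l', 'љ' => 'lj', 'м' => 'm', 'н' => 'n', 'њ' => 'nj', 'о' => 'o',
    'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u',
    'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',
    'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D', 'Ђ' => 'Đ',
    'Е' => 'E', 'Ж' => 'Ž', 'З' => 'Z', 'И' => 'I', 'Ј' => 'J', 'К' => 'K',
    'Л' => 'L', 'Љ' => 'Lj', 'М' => 'M', 'Н' => 'N', 'Њ' => 'Nj', 'О' => 'O',
    'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'Ћ' => 'Ć', 'У' => 'U',
    'Ф' => 'F', 'Х' => 'H', 'Ц' => 'C', 'Ч' => 'Č', 'Џ' => 'Dž', 'Ш' => 'Š',
];

// Assumed ASCII fallbacks for the Latin letters with diacritics.
const LATIN_TO_ASCII = [
    'č' => 'c', 'ć' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'dj',
    'Č' => 'C', 'Ć' => 'C', 'Š' => 'S', 'Ž' => 'Z', 'Đ' => 'Dj',
];

function sketchToLatin(string $text): string
{
    // strtr() with an array replaces each key in a single pass.
    return strtr($text, CYRILLIC_TO_LATIN);
}

function sketchToAscii(string $text): string
{
    return strtr(sketchToLatin($text), LATIN_TO_ASCII);
}

echo sketchToLatin('Ово је ћирилични текст'), "\n"; // Ovo je ćirilični tekst
echo sketchToAscii('Ово је ћирилични текст'), "\n"; // Ovo je cirilicni tekst
```

Using `strtr()` with an array keeps the sketch short because all replacements happen in one pass, so an already replaced letter is never transliterated a second time.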
### Alphabet Detection
The library can detect whether text is written in the Serbian Cyrillic or Latin alphabet:
```php
use Turanjanin\SerbianLanguageTools\Text;
Text::fromString('Ovo je latinica')->isLatin(); // true
Text::fromString('Ovo je latinica')->isCyrillic(); // false
```
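One simple way to approximate alphabet detection is to count letters per Unicode script. The sketch below does that in plain PHP; it is a heuristic for illustration only and makes no claim about how `isLatin()` and `isCyrillic()` are implemented. The function name `sketchDetectAlphabet`, the `'unknown'` result and the tie-breaking rule (ties count as Latin) are all assumptions.

```php
<?php
// Heuristic sketch, not the library's implementation:
// compare the number of Cyrillic and Latin letters in the input.
function sketchDetectAlphabet(string $input): string
{
    // preg_match_all() returns the number of full matches.
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $input);
    $latin = preg_match_all('/\p{Latin}/u', $input);

    if ($cyrillic === 0 && $latin === 0) {
        return 'unknown'; // no letters from either script
    }

    // Ties are arbitrarily resolved as Latin in this sketch.
    return $cyrillic > $latin ? 'cyrillic' : 'latin';
}

echo sketchDetectAlphabet('Ovo je latinica'), "\n";  // latin
echo sketchDetectAlphabet('Ово је ћирилица'), "\n";  // cyrillic
```

A majority count like this tolerates a few foreign-script words in the text, at the cost of giving no signal about how mixed the input actually is.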
## Author
- [Jovan Turanjanin](https://github.com/turanjanin)
## License
The MIT License (MIT). Please see [License File](LICENSE.md) for more information.