# TextDistance
![TextDistance logo](logo.png)
[![Build Status](https://travis-ci.org/orsinium/textdistance.svg?branch=master)](https://travis-ci.org/orsinium/textdistance) [![PyPI version](https://img.shields.io/pypi/v/textdistance.svg)](https://pypi.python.org/pypi/textdistance) [![Status](https://img.shields.io/pypi/status/textdistance.svg)](https://pypi.python.org/pypi/textdistance) [![Code size](https://img.shields.io/github/languages/code-size/orsinium/textdistance.svg)](https://github.com/orsinium/textdistance) [![License](https://img.shields.io/pypi/l/textdistance.svg)](LICENSE)
**TextDistance** -- python library for comparing distance between two or more sequences by many algorithms.
Features:
* 30+ algorithms
* Pure python implementation
* Simple usage
* More than two sequences comparing
* Some algorithms have more than one implementation in one class.
* Optional numpy usage for maximum speed.
## Algorithms
### Edit based
| Algorithm | Class | Functions |
|-------------------------------------------------------------------------------------------|----------------------|------------------------|
| [Hamming](https://en.wikipedia.org/wiki/Hamming_distance) | `Hamming` | `hamming` |
| [MLIPNS](http://www.sial.iias.spb.su/files/386-386-1-PB.pdf) | `Mlipns` | `mlipns` |
| [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) | `Levenshtein` | `levenshtein` |
| [Damerau-Levenshtein](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) | `DamerauLevenshtein` | `damerau_levenshtein` |
| [Jaro-Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) | `JaroWinkler` | `jaro_winkler`, `jaro` |
| [Strcmp95](http://cpansearch.perl.org/src/SCW/Text-JaroWinkler-0.1/strcmp95.c) | `StrCmp95` | `strcmp95` |
| [Needleman-Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) | `NeedlemanWunsch` | `needleman_wunsch` |
| [Gotoh](https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf) | `Gotoh` | `gotoh` |
| [Smith-Waterman](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) | `SmithWaterman` | `smith_waterman` |
### Token based
| Algorithm | Class | Functions |
|-------------------------------------------------------------------------------------------|----------------------|---------------|
| [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) | `Jaccard` | `jaccard` |
| [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) | `Sorensen` | `sorensen`, `sorensen_dice`, `dice` |
| [Tversky index](https://en.wikipedia.org/wiki/Tversky_index) | `Tversky` | `tversky` |
| [Overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) | `Overlap` | `overlap` |
| [Tanimoto distance](https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance) | `Tanimoto` | `tanimoto` |
| [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | `Cosine` | `cosine` |
| [Monge-Elkan](https://www.academia.edu/200314/Generalized_Monge-Elkan_Method_for_Approximate_Text_String_Comparison) | `MongeElkan` | `monge_elkan` |
| [Bag distance](https://github.com/Yomguithereal/talisman/blob/master/src/metrics/distance/bag.js) | `Bag` | `bag` |
### Sequence based
| Algorithm | Class | Functions |
|-----------|-------|-----------|
| [longest common subsequence similarity](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) | `LCSSeq` | `lcsseq` |
| [longest common substring similarity](https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher) | `LCSStr` | `lcsstr` |
| [Ratcliff-Obershelp similarity](http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm) | `RatcliffObershelp` | `ratcliff_obershelp` |
### Compression based
[Normalized compression distance](https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance) with different compression algorithms.
Classic compression algorithms:
| Algorithm | Class | Function |
|----------------------------------------------------------------------------|-------------|--------------|
| [Arithmetic coding](https://en.wikipedia.org/wiki/Arithmetic_coding) | `ArithNCD` | `arith_ncd` |
| [RLE](https://en.wikipedia.org/wiki/Run-length_encoding) | `RLENCD` | `rle_ncd` |
| [BWT RLE](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) | `BWTRLENCD` | `bwtrle_ncd` |
Normal compression algorithms:
| Algorithm | Class | Function |
|----------------------------------------------------------------------------|--------------|---------------|
| Square Root | `SqrtNCD` | `sqrt_ncd` |
| [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) | `EntropyNCD` | `entropy_ncd` |
Work in progress algorithms that compare two strings as array of bits:
| Algorithm | Class | Function |
|--------------------------------------------|-----------|------------|
| [BZ2](https://en.wikipedia.org/wiki/Bzip2) | `BZ2NCD` | `bz2_ncd` |
| [LZMA](https://en.wikipedia.org/wiki/LZMA) | `LZMANCD` | `lzma_ncd` |
| [ZLib](https://en.wikipedia.org/wiki/Zlib) | `ZLIBNCD` | `zlib_ncd` |
See [blog post](https://articles.life4web.ru/eng/ncd/) for more details about NCD.
### Phonetic
| Algorithm | Class | Functions |
|------------------------------------------------------------------------------|----------|-----------|
| [MRA](https://en.wikipedia.org/wiki/Match_rating_approach) | `MRA` | `mra` |
| [Editex](https://anhaidgroup.github.io/py_stringmatching/v0.3.x/Editex.html) | `Editex` | `editex` |
### Simple
| Algorithm | Class | Functions |
|---------------------|------------|------------|
| Prefix similarity | `Prefix` | `prefix` |
| Postfix similarity | `Postfix` | `postfix` |
| Length distance | `Length` | `length` |
| Identity similarity | `Identity` | `identity` |
| Matrix similarity | `Matrix` | `matrix` |
## Installation
### Stable
Only pure python implementation:
```bash
pip install textdistance
```
With extra libraries for maximum speed:
```bash
pip install textdistance[extras]
```
With all libraries (required for [benchmarking](#benchmarks) and [testing](#test)):
```bash
pip install textdistance[benchmark]
```
With algorithm specific extras:
```bash
pip install textdistance[Hamming]
```
Algorithms with available extras: `DamerauLevenshtein`, `Hamming`, `Jaro`, `JaroWinkler`, `Levenshtein`.
### Dev
Via pip:
```bash
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
```
Or clone repo and install with some extras:
```bash
git clone https://github.com/orsinium/textdistance.git
pip install -e .[benchmark]
```
## Usage
All algorithms have 2 interfaces:
1. Class with algorithm-specific params for customizing.
2. Class instance with default params for quick and simple usage.
All algorithms have some common methods:
1. `.distance(*s