SimHashPHP
==========
> This is the second version of SimHashPHP. If you are using version 1 and don't want to
> update your code, please refer to the `1.0-security` branch (https://github.com/tgalopin/SimHashPhp/tree/1.0-security).
> The 1.0 branch will be maintained until the release of v3, but only v2 will receive the latest features.
What is SimHashPHP?
-------------------
SimHashPHP is a PHP implementation of the SimHash algorithm.
This algorithm, created by Moses Charikar, provides an efficient way to compute a similarity index between two texts.
Google uses it internally to detect duplicate content.
See ["SimHash or the way to compare quickly two datasets"](http://titouangalopin.com/blog/articles/2014/05/simhash-or-the-way-to-compare-quickly-two-datasets)
for more information.
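To make the idea concrete, here is a minimal sketch of how a SimHash fingerprint can be computed. It is not this library's implementation: the function name `sketchSimHash`, the whitespace tokenization, the use of `crc32()` as the per-token hash and the 32-bit fingerprint size are all assumptions chosen for illustration.

``` php
<?php
// Illustrative sketch only; NOT the code used by SimHashPHP.
// Assumptions: whitespace tokens, crc32() as token hash, 32-bit fingerprint.

/**
 * Compute a 32-bit SimHash-style fingerprint, returned as a string of '0'/'1'.
 */
function sketchSimHash(string $text): string
{
    $bits = 32;
    $weights = array_fill(0, $bits, 0);

    // Lowercase the text, split on whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        $hash = crc32($token);

        // Each token votes +1 on the bits set in its hash and -1 on the others.
        for ($i = 0; $i < $bits; $i++) {
            $weights[$i] += (($hash >> $i) & 1) ? 1 : -1;
        }
    }

    // A bit of the fingerprint is 1 when the votes for it are positive.
    $fingerprint = '';
    for ($i = $bits - 1; $i >= 0; $i--) {
        $fingerprint .= $weights[$i] > 0 ? '1' : '0';
    }

    return $fingerprint;
}

// Texts that share most of their tokens tend to produce fingerprints
// that differ in only a few bits.
var_dump(sketchSimHash('the quick brown fox jumps over the lazy dog'));
var_dump(sketchSimHash('the quick brown fox jumped over the lazy dog'));
```

Because every token contributes to every bit, changing a few tokens usually flips only a few bits of the fingerprint, which is what makes the fingerprints comparable.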
[![Build Status](https://secure.travis-ci.org/tgalopin/SimHashPhp.png?branch=master)](http://travis-ci.org/tgalopin/SimHashPhp)
How to use it?
--------------
Install it with [Composer](https://getcomposer.org):
``` sh
composer require tga/simhash-php
```
Once installed, include `vendor/autoload.php` to load the library.
The concept of SimHash is described in [this article](http://titouangalopin.com/blog/articles/2014/05/simhash-or-the-way-to-compare-quickly-two-datasets).
Here is an example:
``` php
<?php
require 'vendor/autoload.php';
$text1 = <<<EOT
George Headley (1909–1983) was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also
represented Jamaica and played professionally in England. Headley was born in Panama but raised in Jamaica where he
quickly established a cricketing reputation as a batsman. West Indies had a weak cricket team through most of Headley's
career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting. He batted
at number three, scoring 2,190 runs in Tests at an average of 60.83, and 9,921 runs in all first-class matches at an
average of 69.86. He was chosen as one of the Wisden Cricketers of the Year in 1934.
EOT;
$text2 = <<<EOT
George Headley was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also
represented Jamaica and played professionally in England. Headley was born in Panama but raised in Jamaica where he
quickly established a cricketing reputation as a batsman. West Indies had a weak cricket team through most of Headley's
career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting. He batted
at number three, scoring 2,190 runs in tests at an average of 60.83, and 9,921 runs in all first-class matches at an
average of 69.86. He was chosen as one of the Wisden Cricketers of the Year.
EOT;

// Create the hasher, the text extractor and the comparator.
$simhash = new \Tga\SimHash\SimHash();
$extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();
$comparator = new \Tga\SimHash\Comparator\GaussianComparator(3);

// Extract the tokens of each text and compute a 64-bit fingerprint.
$fp1 = $simhash->hash($extractor->extract($text1), \Tga\SimHash\SimHash::SIMHASH_64);
$fp2 = $simhash->hash($extractor->extract($text2), \Tga\SimHash\SimHash::SIMHASH_64);

// Display the fingerprints in binary form.
var_dump($fp1->getBinary());
var_dump($fp2->getBinary());

// Similarity index between 0 and 1: 0.80073740291681
var_dump($comparator->compare($fp1, $fp2));
```
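The binary fingerprints printed above can also be compared by hand. The sketch below counts the differing bits (the Hamming distance) between two binary strings and turns that count into a simple share of matching bits. The helper names `sketchHammingDistance` and `sketchMatchingRatio` are hypothetical, and this ratio is an assumption made for illustration: it is not the formula used by `GaussianComparator`, so it will not reproduce the index shown in the example.

``` php
<?php
// Illustrative sketch only; these helpers are NOT part of SimHashPHP.

/**
 * Count the positions at which two equal-length binary strings differ.
 */
function sketchHammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Fingerprints must have the same length.');
    }

    $distance = 0;
    $length = strlen($a);
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }

    return $distance;
}

/**
 * Share of matching bits, between 0 (all bits differ) and 1 (identical).
 */
function sketchMatchingRatio(string $a, string $b): float
{
    $length = strlen($a);
    if ($length === 0) {
        return 1.0; // Two empty fingerprints are treated as identical.
    }

    return 1.0 - sketchHammingDistance($a, $b) / $length;
}

// Two 8-bit fingerprints that differ in two positions.
var_dump(sketchHammingDistance('10110010', '10010011'));
var_dump(sketchMatchingRatio('10110010', '10010011'));
```

With the library, strings such as the ones returned by `$fp1->getBinary()` and `$fp2->getBinary()` could be passed to these helpers, assuming both fingerprints were computed with the same size.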
License
-------
This library is released under the MIT license (see LICENSE.md).
About
-----
SimHashPHP is mainly developed by Titouan Galopin.
Reporting an issue or a feature request
---------------------------------------
Issues and feature requests are tracked in the [GitHub issue tracker](https://github.com/tgalopin/SimHashPhp/issues).