Robots.txt PHP parser class
=====================
[![Build Status](https://travis-ci.org/t1gor/Robots.txt-Parser-Class.svg?branch=master)](https://travis-ci.org/t1gor/Robots.txt-Parser-Class) [![Code Climate](https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class/badges/gpa.svg)](https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class) [![Test Coverage](https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class/badges/coverage.svg)](https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class) [![License](https://poser.pugx.org/t1gor/robots-txt-parser/license.svg)](https://packagist.org/packages/t1gor/robots-txt-parser) [![Total Downloads](https://poser.pugx.org/t1gor/robots-txt-parser/downloads.svg)](https://packagist.org/packages/t1gor/robots-txt-parser)
PHP class to parse robots.txt rules according to Google, Yandex, W3C and The Web Robots Pages specifications.
The full list of supported specifications (and what's not supported yet) is available in our [Wiki](https://github.com/t1gor/Robots.txt-Parser-Class/wiki/Specifications).
### Supported directives:
- User-agent
- Allow
- Disallow
- Sitemap
- Host
- Cache-delay
- Clean-param
- Crawl-delay
- Request-rate (in progress)
- Visit-time (in progress)
### Installation
The library is available as a Composer package. To install it, add the requirement to your `composer.json` file:
```sh
composer require t1gor/robots-txt-parser
```
You can find out more about Composer here: https://getcomposer.org/
### Usage example
###### Creating parser instance
```php
use t1gor\RobotsTxtParser\RobotsTxtParser;
# from string
$parser = new RobotsTxtParser("User-agent: * \nDisallow: /");
# from local file
$parser = new RobotsTxtParser(fopen('some/robots.txt', 'r'));
# or a remote one (make sure it's allowed in your php.ini)
# even FTP should work (but this is not confirmed)
$parser = new RobotsTxtParser(fopen('http://example.com/robots.txt', 'r'));
```
###### Logging parsing process
The parser implements the PSR-3 `LoggerAwareInterface`, so it works out of the box with any logger supporting that standard. Below is a Monolog example using a Telegram bot handler:
```php
use Monolog\Handler\TelegramBotHandler;
use Monolog\Logger;
use Psr\Log\LogLevel;
use t1gor\RobotsTxtParser\RobotsTxtParser;

$monologLogger = new Logger('robots.txt-parser');
$monologLogger->pushHandler(new TelegramBotHandler('api-key', 'channel'));

$parser = new RobotsTxtParser(fopen('some/robots.txt', 'r'));
$parser->setLogger($monologLogger);
```
Most log entries are of `LogLevel::DEBUG`, but there are also some `LogLevel::WARNING` entries where appropriate.
###### Parsing non-UTF-8 encoded files
```php
use t1gor\RobotsTxtParser\RobotsTxtParser;
/** @see EncodingTest for more details */
$parser = new RobotsTxtParser(fopen('market-yandex-Windows-1251.txt', 'r'), 'Windows-1251');
```
### Public API
| Method | Params | Returns | Description |
| ------ | ------ | ------ | ----------- |
| `setLogger` | `Psr\Log\LoggerInterface $logger` | `void` | |
| `getLogger` | `-` | `Psr\Log\LoggerInterface` | |
| `setHttpStatusCode` | `int $code` | `void` | Set HTTP response code for allowance checks |
| `isAllowed` | `string $url, ?string $userAgent` | `bool` | If no `$userAgent` is passed, returns the result for `*` |
| `isDisallowed` | `string $url, ?string $userAgent` | `bool` | If no `$userAgent` is passed, returns the result for `*` |
| `getDelay` | `string $userAgent, string $type = 'crawl-delay'` | `float` | Get any of the delays, e.g. `Crawl-delay`, `Cache-delay`, etc. |
| `getCleanParam` | `-` | `[ string => string[] ]` | Where key is the path, and values are params |
| `getRules` | `?string $userAgent` | `array` | Get the rules the parser read, in a tree-like structure |
| `getHost` | `?string $userAgent` | `string[]` or `string` or `null` | If no `$userAgent` is passed, will return all |
| `getSitemaps` | `?string $userAgent` | `string[]` | If no `$userAgent` is passed, will return all |
| `getContent` | `-` | `string` | The content that was parsed. |
| `getLog` | `-` | `[]` | **Deprecated.** Please use PSR logger as described above. |
| `render` | `-` | `string` | **Deprecated.** Please use `getContent` |
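As a quick illustration of the allowance checks in the table above, here is a minimal sketch; the robots.txt content is made up for the example, and the exact return values depend on the library version:

```php
<?php
require_once 'vendor/autoload.php';

use t1gor\RobotsTxtParser\RobotsTxtParser;

// Hypothetical rules, for illustration only
$parser = new RobotsTxtParser(
    "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\nSitemap: http://example.com/sitemap.xml"
);

// Not matched by any Disallow rule, so should be allowed
var_dump($parser->isAllowed('/public/page.html'));

// Matches "Disallow: /private/", so should be disallowed
var_dump($parser->isDisallowed('/private/data'));

// The Crawl-delay value for the "*" group
var_dump($parser->getDelay('*'));

// The declared sitemap URL(s)
print_r($parser->getSitemaps());
```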
More code samples can be found in the [tests folder](https://github.com/t1gor/Robots.txt-Parser-Class/tree/master/test).
**Some useful links and materials:**
* [Google: Robots.txt Specifications](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)
* [Yandex: Using robots.txt](http://help.yandex.com/webmaster/?id=1113851)
* [The Web Robots Pages](http://www.robotstxt.org/)
* [W3C Recommendation](https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.2)
* [Some inspirational code](http://socoder.net/index.php?snippet=23824), and [some more](http://www.the-art-of-web.com/php/parse-robots/)
* [Google Webmaster tools Robots.txt testing tool](https://www.google.com/webmasters/tools/robots-testing-tool)
### Contributing
First of all, thank you for your interest and desire to help! If you found an issue and know how to fix it, please submit a pull request to the dev branch. Please do not forget the following:
- Your fix should be covered with tests (we are using PHPUnit)
- Please mind the [Code Climate](https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class) recommendations; they help keep things simple
- Following the coding standard would also be much appreciated (4 spaces per indent, camelCase, etc.)

I would really appreciate it if you could share a link to the project in which you use the library.
License
-------
The MIT License
Copyright (c) 2013 Igor Timoshenkov
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.