A Node.js robots.txt parser with support for wildcard (*) matching (JavaScript)

14 files in total (md: 4, js: 3, yml: 2)

Uploaded 2023-04-25 11:51:56, 56KB ZIP

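In web development, the `robots.txt` file tells search engine crawlers which pages they may fetch and which they may not. This archive, "A Node.js robots.txt parser with support for wildcard (*) matching", is a JavaScript library for parsing `robots.txt` files whose rules contain wildcards. This description covers `robots.txt` itself, wildcard matching, and how to use the parser in a Node.js environment.

`robots.txt` is a plain-text file placed in the root directory of a website. It gives crawlers guidance about which URLs they may request and which they should skip. Its rules are simple directives such as `User-agent` (names the crawler that a group of rules applies to) and `Disallow` (gives a URL path that should not be crawled).

The original `robots.txt` convention did not define wildcards. Search engines such as Google and Bing added support for the asterisk (`*`) and the end-of-URL anchor (`$`), and the draft specification that this parser targets includes both. `*` matches zero or more characters of any kind, and `$` marks the end of the URL. A question mark is not a wildcard; it is matched literally. For example, `Disallow: /path/*` blocks every URL whose path begins with `/path/`, and `Disallow: /*.html$` blocks every URL that ends in `.html`.

What sets this parser apart is that it handles this kind of wildcard matching. Node.js is a server-side JavaScript runtime, so the parser can be used directly in server-side code. The parser presumably uses regular expressions or other string processing to test URLs against wildcard patterns and so decide which rule applies to a given URL.

In a Node.js environment, the parser can be used as follows:

1. Unpack `robots-parser-master` and install the dependencies. The archive contains a `package.json`, so running `npm install` in that directory installs everything needed.
2. Import the parser module. The archive's entry file is `index.js`, and the README loads the published package with `require('robots-parser')`.
3. Parse the `robots.txt` file. Read the file's contents and pass them to the parser function together with the URL of the `robots.txt` file.
4. Check whether a specific URL is allowed or disallowed. According to the README, `isAllowed(url, [ua])` and `isDisallowed(url, [ua])` take a URL and an optional user-agent and return a boolean, or `undefined` if the URL is not valid for this `robots.txt`.

Understanding and using such a parser correctly matters to developers who write crawlers or maintain `robots.txt` files. Keep in mind that `robots.txt` is advisory: it asks well-behaved crawlers to stay away from certain URLs but does not enforce access control. For search engine optimization (SEO), a correctly configured `robots.txt` keeps crawlers from repeatedly fetching unnecessary resources, which reduces server load.

In summary, this Node.js parser supports wildcard matching on top of the basic `robots.txt` directives, so developers can control more precisely how search engines crawl a site.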
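To make the wildcard rules concrete, here is a minimal sketch of one way to turn a `robots.txt` path pattern into a regular expression. The helper names `patternToRegExp` and `matches` are hypothetical and exist only for this illustration. This is not the library's implementation; the changelog notes that the library's matching algorithm follows Google's implementation, which may work differently.

```javascript
// Hypothetical helper (not part of robots-parser): convert a robots.txt
// path pattern into a RegExp.
//   `*` matches any run of characters (including none)
//   a trailing `$` anchors the match at the end of the URL
//   every other character, including `?`, is matched literally
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    // Escape regex metacharacters in the literal parts of the pattern.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Rules always match from the start of the path.
  return new RegExp('^' + source + (anchored ? '$' : ''));
}

function matches(pattern, path) {
  return patternToRegExp(pattern).test(path);
}

console.log(matches('/path/*', '/path/page.html')); // true
console.log(matches('/*.php$', '/index.php'));      // true
console.log(matches('/*.php$', '/index.php?x=1'));  // false
console.log(matches('/file?.html', '/fileA.html')); // false: `?` is literal
```

A regular expression keeps the sketch short, but a pattern with many `*` characters can make matching slow on some inputs, so a production matcher may prefer a linear scan.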

Archive contents (ZIP download, 14 files):

- robots-parser-master
  - SECURITY.md (327B)
  - LICENSE.md (1KB)
  - .github
    - workflows
      - test.yml (950B)
      - codeql-analysis.yml (1KB)
  - CHANGELOG.md (3KB)
  - index.d.ts (421B)
  - .prettierignore (4B)
  - package.json (1KB)
  - package-lock.json (154KB)
  - Robots.js (10KB)
  - test
    - Robots.js (22KB)
  - index.js (116B)
  - .gitignore (41B)
  - README.md (7KB)

# Robots Parser [![NPM downloads](https://img.shields.io/npm/dm/robots-parser)](https://www.npmjs.com/package/robots-parser) [![DeepScan grade](https://deepscan.io/api/teams/457/projects/16277/branches/344939/badge/grade.svg)](https://deepscan.io/dashboard#view=project&tid=457&pid=16277&bid=344939) [![GitHub license](https://img.shields.io/github/license/samclarke/robots-parser.svg)](https://github.com/samclarke/robots-parser/blob/master/license.md) [![Coverage Status](https://coveralls.io/repos/github/samclarke/robots-parser/badge.svg?branch=master)](https://coveralls.io/github/samclarke/robots-parser?branch=master)

A robots.txt parser which aims to be compliant with the [draft specification](https://datatracker.ietf.org/doc/html/draft-koster-rep).

The parser currently supports:

- User-agent:
- Allow:
- Disallow:
- Sitemap:
- Crawl-delay:
- Host:
- Paths with wildcards (\*) and EOL matching ($)

## Installation

Via NPM:

    npm install robots-parser

or via Yarn:

    yarn add robots-parser

## Usage

```js
var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
    'User-agent: *',
    'Disallow: /dir/',
    'Disallow: /test.html',
    'Allow: /dir/test.html',
    'Allow: /test.html',
    'Crawl-delay: 1',
    'Sitemap: http://example.com/sitemap.xml',
    'Host: example.com'
].join('\n'));

robots.isAllowed('http://www.example.com/test.html', 'Sams-Bot/1.0'); // true
robots.isAllowed('http://www.example.com/dir/test.html', 'Sams-Bot/1.0'); // true
robots.isDisallowed('http://www.example.com/dir/test2.html', 'Sams-Bot/1.0'); // true
robots.getCrawlDelay('Sams-Bot/1.0'); // 1
robots.getSitemaps(); // ['http://example.com/sitemap.xml']
robots.getPreferredHost(); // example.com
```

### isAllowed(url, [ua])

**boolean or undefined**

Returns true if crawling the specified URL is allowed for the specified user-agent.

This will return `undefined` if the URL isn't valid for this robots.txt.
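One plausible reading of "isn't valid for this robots.txt" is that a robots.txt file only governs URLs on its own scheme, host, and port. Under that assumption, a caller could pre-check a URL with the standard global `URL` class before asking the parser about it. The helper `sameOrigin` below is hypothetical and is not part of the robots-parser API.

```javascript
// Hypothetical pre-check (not part of robots-parser): does `targetUrl`
// share scheme, host, and port with the robots.txt URL?
// `URL.prototype.origin` normalizes default ports, so an explicit :443 on
// an https URL compares equal to the same URL without a port.
function sameOrigin(robotsUrl, targetUrl) {
  try {
    return new URL(robotsUrl).origin === new URL(targetUrl).origin;
  } catch (err) {
    // The URL constructor throws a TypeError on strings it cannot parse.
    return false;
  }
}

var robotsUrl = 'http://www.example.com/robots.txt';

console.log(sameOrigin(robotsUrl, 'http://www.example.com/test.html'));  // true
console.log(sameOrigin(robotsUrl, 'https://www.example.com/test.html')); // false
console.log(sameOrigin('https://example.com/robots.txt', 'https://example.com:443/a')); // true
console.log(sameOrigin(robotsUrl, 'not a url')); // false
```

This sketch handles absolute URLs only; relative URLs such as `/test.html` make the `URL` constructor throw when no base is given, so the helper reports `false` for them.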
### isDisallowed(url, [ua])

**boolean or undefined**

Returns true if crawling the specified URL is not allowed for the specified user-agent.

This will return `undefined` if the URL isn't valid for this robots.txt.

### getMatchingLineNumber(url, [ua])

**number or undefined**

Returns the line number of the matching directive for the specified URL and user-agent if any.

Line numbers start at 1 and go up (1-based indexing).

Returns -1 if there is no matching directive. If a rule is manually added without a lineNumber then this will return undefined for that rule.

### getCrawlDelay([ua])

**number or undefined**

Returns the number of seconds the specified user-agent should wait between requests.

Returns undefined if no crawl delay has been specified for this user-agent.

### getSitemaps()

**array**

Returns an array of sitemap URLs specified by the `sitemap:` directive.

### getPreferredHost()

**string or null**

Returns the preferred host name specified by the `host:` directive or null if there isn't one.

# Changes

### Version 3.0.1

- Fixed bug with `https:` URLs defaulting to port `80` instead of `443` if no port is specified. Thanks to @dskvr for reporting

  This affects comparing URLs with the default HTTPS port to URLs without it. For example, comparing `https://example.com/` to `https://example.com:443/` or vice versa.

  They should be treated as equivalent but weren't due to the incorrect port being used for `https:`.

### Version 3.0.0

- Changed to using global URL object instead of importing. – Thanks to @brendankenny

### Version 2.4.0:

- Added TypeScript definitions – Thanks to @danhab99 for creating
- Added SECURITY.md policy and CodeQL scanning

### Version 2.3.0:

- Fixed bug where if the user-agent passed to `isAllowed()` / `isDisallowed()` is called "constructor" it would throw an error.
- Added support for relative URLs. This does not affect the default behavior so can safely be upgraded.
Relative matching is only allowed if both the robots.txt URL and the URLs being checked are relative.

For example:

```js
var robots = robotsParser('/robots.txt', [
    'User-agent: *',
    'Disallow: /dir/',
    'Disallow: /test.html',
    'Allow: /dir/test.html',
    'Allow: /test.html'
].join('\n'));

robots.isAllowed('/test.html', 'Sams-Bot/1.0'); // true
robots.isAllowed('/dir/test.html', 'Sams-Bot/1.0'); // true
robots.isDisallowed('/dir/test2.html', 'Sams-Bot/1.0'); // true
```

### Version 2.2.0:

- Fixed bug with matching wildcard patterns with some URLs – Thanks to @ckylape for reporting and fixing
- Changed matching algorithm to match Google's implementation in google/robotstxt
- Changed order of precedence to match current spec

### Version 2.1.1:

- Fixed bug that could be used to cause rule checking to take a long time – Thanks to @andeanfog

### Version 2.1.0:

- Removed use of punycode module APIs as new URL API handles it
- Improved test coverage
- Added tests for percent encoded paths and improved support
- Added `getMatchingLineNumber()` method
- Fixed bug with comments on same line as directive

### Version 2.0.0:

This release is not 100% backwards compatible as it now uses the new URL APIs which are not supported in Node < 7.

- Updated code to not use deprecated URL module APIs. – Thanks to @kdzwinel

### Version 1.0.2:

- Fixed error caused by invalid URLs missing the protocol.

### Version 1.0.1:

- Fixed bug with the "user-agent" rule being treated as case sensitive. – Thanks to @brendonboshell
- Improved test coverage. – Thanks to @schornio

### Version 1.0.0:

- Initial release.
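The 2.2.0 entry above mentions the order of precedence. In the Robots Exclusion Protocol specification, the longest matching rule wins, and when an Allow and a Disallow rule are equally specific, Allow is preferred. The sketch below illustrates that rule with plain prefix matching only (no wildcards). The function `isPathAllowed` and the `{ pattern, allow }` rule shape are assumptions made for this example and are not the library's internals.

```javascript
// Hypothetical illustration of rule precedence (prefix matching only).
// rules: array of { pattern: string, allow: boolean }
function isPathAllowed(path, rules) {
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.pattern)) continue;
    const longer = best !== null && rule.pattern.length > best.pattern.length;
    const tieGoesToAllow =
      best !== null &&
      rule.pattern.length === best.pattern.length &&
      rule.allow &&
      !best.allow;
    if (best === null || longer || tieGoesToAllow) {
      best = rule;
    }
  }
  // With no matching rule, crawling is allowed by default.
  return best === null ? true : best.allow;
}

// The same rules as in the usage examples above.
const rules = [
  { pattern: '/dir/', allow: false },
  { pattern: '/test.html', allow: false },
  { pattern: '/dir/test.html', allow: true },
  { pattern: '/test.html', allow: true },
];

console.log(isPathAllowed('/test.html', rules));      // true: tie, Allow wins
console.log(isPathAllowed('/dir/test.html', rules));  // true: longest rule is Allow
console.log(isPathAllowed('/dir/test2.html', rules)); // false: only /dir/ matches
console.log(isPathAllowed('/other.html', rules));     // true: no rule matches
```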
# License

The MIT License (MIT)

Copyright (c) 2014 Sam Clarke

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
