Roboto
=======
Roboto is a node.js crawler framework that you can use to do things like:
- Crawl documents in an intranet for search indexing.
- Crawl an app to check for broken links [(example)](https://github.com/jculvey/roboto/blob/master/examples/deadLinkCrawler.js).
- General purpose crawling of the web [(news crawler example)](https://github.com/jculvey/roboto/blob/master/examples/top20news.js).
- Scrape a website for data aggregation [(hackernews example)](https://github.com/jculvey/roboto/blob/master/examples/hackerNews.js).
- Much more!
## Installation
```bash
$ npm install roboto
```
## Usage
Here's an example of roboto being used to crawl a fictitious news site:
```js
var roboto = require('roboto');
var html_strip = require('htmlstrip-native').html_strip;

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  allowedDomains: [ // optional
    "foonews.com",
  ]
});

// Add parsers to the crawler.
// Each will parse a data item from the response.
fooCrawler.parseField('title', function(response, $){
  return $('head title').text();
});

// $ is a cheerio selector loaded with the response body.
// Use it like you would jquery.
// See https://github.com/cheeriojs/cheerio for more info.
fooCrawler.parseField('body', function(response, $){
  var html = $('body').html();
  return html_strip(html);
});

// response has a few attributes from
// http://nodejs.org/api/http.html#http_http_incomingmessage
fooCrawler.parseField('url', function(response, $){
  return response.url;
});

// Do something with the items you parse
fooCrawler.on('item', function(item) {
  // item = {
  //   title: 'Foo happened today!',
  //   body: 'It was amazing',
  //   url: 'http://www.foonews.com/latest'
  // }
  database.save(item, function(err) {
    if (err) fooCrawler.log(err);
  });
});

fooCrawler.crawl();
```
For more options, see the [Options Reference](#options-reference).
## Basic Options
The only required option is `startUrls`.
```js
var crawler = new roboto.Crawler({
  startUrls: [
    "http://www.example.com",
  ],
  allowedDomains: [ "example.com" ],
  blacklist: [
    /accounts/,
  ],
  whitelist: [
    /stories/,
    /sports/
  ]
});
```
`allowedDomains` can be used to limit page crawls to certain domains.
Any urls matching a pattern specified in `blacklist` won't be crawled.
If a `whitelist` has been specified, only urls matching a pattern in the `whitelist`
will be crawled.
In the example crawler above, the following urls would get crawled:
- `http://www.example.com/stories/foo-bar.html`
- `http://www.example.com/stories/1900-01-01/old-stories.html`
- `http://www.example.com/sports/football/people-kicking-balls.html`
And the following urls would not:
- `http://www.example.com/accounts/passwords.html` (match in `blacklist`)
- `http://www.example.com/foo/bar/page.html` (no match in `whitelist`)
- `http://www.badnews.com/foo/bar/page.html` (not in `allowedDomains`)
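Putting these options together, the crawl filter behaves roughly like the sketch below. This is a simplified illustration with a hypothetical `shouldCrawl` helper, not roboto's actual implementation:

```js
var urlLib = require('url');

// Simplified sketch of the filtering described above; hypothetical helper,
// not roboto's actual implementation.
function shouldCrawl(href, options) {
  var hostname = urlLib.parse(href).hostname;

  // allowedDomains: the url's host must match (or be a subdomain of) an entry.
  if (options.allowedDomains && !options.allowedDomains.some(function(domain) {
    return hostname === domain || hostname.slice(-(domain.length + 1)) === '.' + domain;
  })) return false;

  // blacklist: any matching pattern disqualifies the url.
  if (options.blacklist && options.blacklist.some(function(re) { return re.test(href); })) {
    return false;
  }

  // whitelist: if present, at least one pattern must match.
  if (options.whitelist && !options.whitelist.some(function(re) { return re.test(href); })) {
    return false;
  }

  return true;
}
```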
## Items
For each document roboto crawls, it creates an item. The item is populated
with fields parsed from the document by the parser functions added via `crawler.parseField`.
After a document has been parsed, the crawler will emit an `item` event. You can subscribe to this event like
so:
```js
crawler.on('item', function(item) {
  database.save(item, function(err) {
    if (err) crawler.log(err);
  });
});

// Probably not a wise idea, but for the purpose of illustration:
crawler.on('item', function(item) {
  fs.writeFile(item.filename, item.body, function (err) {
    if (err) crawler.log(err);
  });
});
```
## Pipelines
Pipelines are item processing plugins which you can add to a crawler like so:
```js
crawler.pipeline(somePipeline);
```
### roboto-solr
This pipeline can be used to write extracted items to a solr index.
A `fieldMap` can be specified in the options of the constructor to
change the key of an item as it is stored in solr.
In the following example, the crawler parses a `'url'` field
which will be stored in the solr index as `'id'`:
```js
var robotoSolr = roboto.pipelines.robotoSolr({
  host: '127.0.0.1',
  port: '8983',
  core: '/collection1', // if defined, should begin with a slash
  path: '/solr', // should also begin with a slash
  fieldMap: {
    'url': 'id',
    'body': 'content_t'
  }
});

myCrawler.pipeline(robotoSolr);
```
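Illustratively, given the `fieldMap` above, an item's keys are renamed before the document is written to solr (the values here are made up):

```js
// Item produced by the crawler's parsers:
var item = { url: 'http://example.com/foo', body: 'Some page text...' };

// Document stored in the solr index after applying the fieldMap:
var solrDoc = { id: 'http://example.com/foo', content_t: 'Some page text...' };
```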
### Create your own
Creating your own pipeline plugin makes it easier to share the same item processing
logic across projects. It also lets you share it with others!
The signature of a pipeline function is `function(item, callback)`. The `callback`
function takes a single argument `err` which should be supplied if an error was encountered. Otherwise,
it should be invoked with no arguments `callback()`.
```js
var myPipeline = function(item, done) {
  log.info(JSON.stringify(item, null, ' '));
  done();
};

myCrawler.pipeline(myPipeline);
```
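Pipelines that do asynchronous work should pass any error to the callback so the crawler can report it. Here's a minimal sketch of a pipeline that appends items to a newline-delimited JSON file (the filename `items.jsonl` is just an example):

```js
var fs = require('fs');

// Hypothetical pipeline: append each item to a newline-delimited JSON file.
var jsonlPipeline = function(item, done) {
  fs.appendFile('items.jsonl', JSON.stringify(item) + '\n', function(err) {
    if (err) return done(err); // report failures back to the crawler
    done();
  });
};

myCrawler.pipeline(jsonlPipeline);
```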
By default, roboto adds the [`itemLogger` pipeline](https://github.com/jculvey/roboto/blob/master/lib/pipelines/item-logger.js) to
each crawler. This pipeline simply logs the contents of an item using roboto's built-in logger. This can
serve as a good reference point when developing your own pipeline.
## Downloaders
The default downloader can be overridden with a downloader plugin.
### HTTP Authentication
Roboto comes with a downloader plugin that uses HTTP authentication in
your crawl requests.
```js
var roboto = require('roboto');
var robotoHttpAuth = roboto.downloaders.httpAuth;

// The options should be the auth hash mentioned here:
// https://github.com/mikeal/request#http-authentication
var httpAuthOptions = {
  rejectUnauthorized: true, // defaults to false
  auth: {
    user: 'bob',
    pass: 'secret'
  }
};

myCrawler.downloader(robotoHttpAuth(httpAuthOptions));
```
### Create your own
You can create a custom downloader and add it to your crawler with the `downloader` function.
```js
var _ = require('underscore');
var request = require('request');

var myDownloader = function(href, requestHandler) {
  var requestOptions = _.extend(this.defaultRequestOptions, {
    url: href,
    headers: {
      'X-foo': 'bar'
    }
  });
  request(requestOptions, requestHandler);
};

myCrawler.downloader(myDownloader);
```
The default request options are available in the `defaultRequestOptions` property of the crawler.
More information about the structure of `requestOptions` and the signature of `requestHandler` can
be [found here](https://github.com/mikeal/request#requestoptions-callback).
## Url Normalization
Also known as [URL canonicalization](http://en.wikipedia.org/wiki/URL_normalization),
this is the process of reducing syntactically different urls to a common simplified
form. This is useful while crawling to ensure that multiple urls that point to the
same page don't get crawled more than once.
By default roboto normalizes urls with the following procedure:
- Unescaping url encoding `/foo/%7Eexample => /foo/~example`
- Converting relative urls to absolute `/foo.html => http://example.com/foo.html`
- Fully resolving paths `/foo/../bar/baz.html => /bar/baz.html`
- Discarding fragments `/foo.html#bar => /foo.html`
- Discarding query params `/foo.html?page=2 => /foo.html`
- Discarding directory indexes `/foo/index.html => /foo`
  - index.html, index.php, default.asp, default.aspx are all discarded.
- Removing multiple occurrences of '/' `/foo//bar///baz => /foo/bar/baz`
- Removing trailing '/' `/foo/bar/ => /foo/bar`
Discarding query params altogether isn't optimal. A planned enhancement is to
sort query params, and possibly detect safe params to remove (sort, rows, etc.).
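A rough sketch of this procedure, using node's built-in `url` module (for illustration only; roboto's actual implementation may differ):

```js
var url = require('url');

// Rough sketch of the normalization steps above; not roboto's actual code.
function normalizeUrl(href, baseUrl) {
  // Resolve relative urls and '..' segments against the source page.
  var resolved = url.resolve(baseUrl, href);
  var parsed = url.parse(resolved);

  // Unescape url encoding.
  var pathname = decodeURIComponent(parsed.pathname || '/');

  // Discard directory indexes.
  pathname = pathname.replace(/\/(index\.(html|php)|default\.(asp|aspx))$/i, '');

  // Collapse repeated '/' and strip any trailing '/'.
  pathname = pathname.replace(/\/{2,}/g, '/').replace(/\/$/, '') || '/';

  // Rebuilding from protocol, host, and path discards fragments and query params.
  return parsed.protocol + '//' + parsed.host + pathname;
}
```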
## Link Extraction
By default, roboto will extract all links from a page and add them
onto the queue of pages to be crawled unless they:
- Don't contain an `href` attribute.
- Have `rel="nofollow"` or `rel="noindex"`.
- Don't belong to a domain listed in the crawler's `allowedDomains` list.
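A simplified sketch of that extraction step, using cheerio (illustrative only; roboto's real extractor also applies the url filters and normalization described above):

```js
var cheerio = require('cheerio');

// Illustrative sketch of link extraction; not roboto's actual implementation.
function extractLinks(body) {
  var $ = cheerio.load(body);
  var hrefs = [];

  $('a').each(function() {
    var href = $(this).attr('href');
    var rel = ($(this).attr('rel') || '').toLowerCase();

    // Skip anchors without an href or marked nofollow/noindex.
    if (!href) return;
    if (/\b(nofollow|noindex)\b/.test(rel)) return;

    hrefs.push(href);
  });

  return hrefs;
}
```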