Roboto
=======
Roboto is a node.js crawler framework that you can use to do things like:
- Crawl documents in an intranet for search indexing.
- Crawl an app to check for broken links [(example)](https://github.com/jculvey/roboto/blob/master/examples/deadLinkCrawler.js).
- General purpose crawling of the web [(news crawler example)](https://github.com/jculvey/roboto/blob/master/examples/top20news.js).
- Scrape a website for data aggregation [(hackernews example)](https://github.com/jculvey/roboto/blob/master/examples/hackerNews.js).
- Much more!
## Installation
```bash
$ npm install roboto
```
## Usage
Here's an example of roboto being used to crawl a fictitious news site:
```js
var roboto = require('roboto');
var html_strip = require('htmlstrip-native').html_strip;

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  allowedDomains: [ // optional
    "foonews.com",
  ]
});

// Add parsers to the crawler.
// Each will parse a data item from the response.
fooCrawler.parseField('title', function(response, $){
  return $('head title').text();
});

// $ is a cheerio selector loaded with the response body.
// Use it like you would jquery.
// See https://github.com/cheeriojs/cheerio for more info.
fooCrawler.parseField('body', function(response, $){
  var html = $('body').html();
  return html_strip(html);
});

// response has a few attributes from
// http://nodejs.org/api/http.html#http_http_incomingmessage
fooCrawler.parseField('url', function(response, $){
  return response.url;
});

// Do something with the items you parse
fooCrawler.on('item', function(item) {
  // item = {
  //   title: 'Foo happened today!',
  //   body: 'It was amazing',
  //   url: 'http://www.foonews.com/latest'
  // }
  database.save(item, function(err) {
    if (err) fooCrawler.log(err);
  });
});

fooCrawler.crawl();
```
For more options, see the [Options Reference](#options-reference).
## Basic Options
The only required option is `startUrls`.
```js
var crawler = new roboto.Crawler({
  startUrls: [
    "http://www.example.com",
  ],
  allowedDomains: [ "example.com" ],
  blacklist: [
    /accounts/,
  ],
  whitelist: [
    /stories/,
    /sports/
  ]
});
```
`allowedDomains` can be used to limit page crawls to certain domains.
Any urls matching a pattern specified in `blacklist` won't be crawled.
If a `whitelist` has been specified, only urls matching a pattern in the `whitelist`
will be crawled.
In the example crawler above, the following urls would get crawled:
- `http://www.example.com/stories/foo-bar.html`
- `http://www.example.com/stories/1900-01-01/old-stories.html`
- `http://www.example.com/sports/football/people-kicking-balls.html`
And the following urls would not:
- `http://www.example.com/accounts/passwords.html` (match in `blacklist`)
- `http://www.example.com/foo/bar/page.html` (no match in `whitelist`)
- `http://www.badnews.com/foo/bar/page.html` (not in `allowedDomains`)
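Putting these options together, the crawl filter behaves roughly like the sketch below. This is a simplified illustration with a hypothetical `shouldCrawl` helper, not roboto's actual implementation:

```js
var urlLib = require('url');

// Simplified sketch of the filtering described above; hypothetical helper,
// not roboto's actual implementation.
function shouldCrawl(href, options) {
  var hostname = urlLib.parse(href).hostname;

  // allowedDomains: the url's host must match (or be a subdomain of) an entry.
  if (options.allowedDomains && !options.allowedDomains.some(function(domain) {
    return hostname === domain || hostname.slice(-(domain.length + 1)) === '.' + domain;
  })) return false;

  // blacklist: any matching pattern disqualifies the url.
  if (options.blacklist && options.blacklist.some(function(re) { return re.test(href); })) {
    return false;
  }

  // whitelist: if present, at least one pattern must match.
  if (options.whitelist && !options.whitelist.some(function(re) { return re.test(href); })) {
    return false;
  }

  return true;
}
```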
## Items
For each document roboto crawls, it creates an item. The item is populated
with fields parsed from the document by the parser functions added via `crawler.parseField`.
After a document has been parsed, the crawler will emit an `item` event. You can subscribe to this event like
so:
```js
crawler.on('item', function(item) {
  database.save(item, function(err) {
    if (err) crawler.log(err);
  });
});

// Probably not a wise idea, but for the purpose of illustration:
crawler.on('item', function(item) {
  fs.writeFile(item.filename, item.body, function (err) {
    if (err) crawler.log(err);
  });
});
```
## Pipelines
Pipelines are item processing plugins which you can add to a crawler like so:
```js
crawler.pipeline(somePipeline);
```
### roboto-solr
This pipeline can be used to write extracted items to a solr index.
A `fieldMap` can be specified in the options of the constructor to
change the key of an item as it is stored in solr.
In the following example, the crawler parses a `'url'` field
which will be stored in the solr index as `'id'`:
```js
var robotoSolr = roboto.pipelines.robotoSolr({
  host: '127.0.0.1',
  port: '8983',
  core: '/collection1', // if defined, should begin with a slash
  path: '/solr', // should also begin with a slash
  fieldMap: {
    'url': 'id',
    'body': 'content_t'
  }
});

myCrawler.pipeline(robotoSolr);
```
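Illustratively, given the `fieldMap` above, an item's keys are renamed before the document is written to solr (the values here are made up):

```js
// Item produced by the crawler's parsers:
var item = { url: 'http://example.com/foo', body: 'Some page text...' };

// Document stored in the solr index after applying the fieldMap:
var solrDoc = { id: 'http://example.com/foo', content_t: 'Some page text...' };
```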
### Create your own
Creating your own pipeline plugin makes it easier to share the same item processing
logic across projects. It also lets you share it with others!
The signature of a pipeline function is `function(item, callback)`. The `callback`
function takes a single argument `err` which should be supplied if an error was encountered. Otherwise,
it should be invoked with no arguments `callback()`.
```js
var myPipeline = function(item, done) {
  log.info(JSON.stringify(item, null, ' '));
  done();
};

myCrawler.pipeline(myPipeline);
```
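Pipelines that do asynchronous work should pass any error to the callback so the crawler can report it. Here's a minimal sketch of a pipeline that appends items to a newline-delimited JSON file (the filename `items.jsonl` is just an example):

```js
var fs = require('fs');

// Hypothetical pipeline: append each item to a newline-delimited JSON file.
var jsonlPipeline = function(item, done) {
  fs.appendFile('items.jsonl', JSON.stringify(item) + '\n', function(err) {
    if (err) return done(err); // report failures back to the crawler
    done();
  });
};

myCrawler.pipeline(jsonlPipeline);
```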
By default, roboto adds the [`itemLogger` pipeline](https://github.com/jculvey/roboto/blob/master/lib/pipelines/item-logger.js) to
each crawler. This pipeline simply logs the contents of an item using roboto's built-in logger. This can
serve as a good reference point when developing your own pipeline.
## Downloaders
The default downloader can be overridden with a downloader plugin.
### HTTP Authentication
Roboto comes with a downloader plugin that uses HTTP authentication in
your crawl requests.
```js
var roboto = require('roboto');
var robotoHttpAuth = roboto.downloaders.httpAuth;

// The options should be the auth hash mentioned here:
// https://github.com/mikeal/request#http-authentication
var httpAuthOptions = {
  rejectUnauthorized: true, // defaults to false
  auth: {
    user: 'bob',
    pass: 'secret'
  }
};

myCrawler.downloader(robotoHttpAuth(httpAuthOptions));
```
### Create your own
You can create a custom downloader and add it to your crawler with the `downloader` function.
```js
var _ = require('underscore');
var request = require('request');

var myDownloader = function(href, requestHandler) {
  var requestOptions = _.extend(this.defaultRequestOptions, {
    url: href,
    headers: {
      'X-foo': 'bar'
    }
  });
  request(requestOptions, requestHandler);
};

myCrawler.downloader(myDownloader);
```
The default request options are available in the `defaultRequestOptions` property of the crawler.
More information about the structure of `requestOptions` and the signature of `requestHandler` can
be [found here](https://github.com/mikeal/request#requestoptions-callback).
## Url Normalization
Also known as [URL canonicalization](http://en.wikipedia.org/wiki/URL_normalization),
this is the process of reducing syntactically different urls to a common simplified
form. This is useful while crawling to ensure that multiple urls that point to the
same page don't get crawled more than once.
By default roboto normalizes urls with the following procedure:
- Unescaping url encoding `/foo/%7Eexample => /foo/~example`
- Converting relative urls to absolute `/foo.html => http://example.com/foo.html`
- Fully resolving paths `/foo/../bar/baz.html => /bar/baz.html`
- Discarding fragments `/foo.html#bar => /foo.html`
- Discarding query params `/foo.html?page=2 => /foo.html`
- Discarding directory indexes `/foo/index.html => /foo`
  - index.html, index.php, default.asp, default.aspx are all discarded.
- Removing multiple occurrences of '/' `/foo//bar///baz => /foo/bar/baz`
- Removing trailing '/' `/foo/bar/ => /foo/bar`
Discarding query params altogether isn't optimal. A planned enhancement is to
sort query params, and possibly detect safe params to remove (sort, rows, etc.).
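A rough sketch of this procedure, using node's built-in `url` module (for illustration only; roboto's actual implementation may differ):

```js
var url = require('url');

// Rough sketch of the normalization steps above; not roboto's actual code.
function normalizeUrl(href, baseUrl) {
  // Resolve relative urls and '..' segments against the source page.
  var resolved = url.resolve(baseUrl, href);
  var parsed = url.parse(resolved);

  // Unescape url encoding.
  var pathname = decodeURIComponent(parsed.pathname || '/');

  // Discard directory indexes.
  pathname = pathname.replace(/\/(index\.(html|php)|default\.(asp|aspx))$/i, '');

  // Collapse repeated '/' and strip any trailing '/'.
  pathname = pathname.replace(/\/{2,}/g, '/').replace(/\/$/, '') || '/';

  // Rebuilding from protocol, host, and path discards fragments and query params.
  return parsed.protocol + '//' + parsed.host + pathname;
}
```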
## Link Extraction
By default, roboto will extract all links from a page and add them
onto the queue of pages to be crawled unless they:
- Don't contain an `href` attribute.
- Have `rel="nofollow"` or `rel="noindex"`.
- Don't belong to a domain listed in the crawler's `allowedDomains` list.
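A simplified sketch of that extraction step, using cheerio (illustrative only; roboto's real extractor also applies the url filters and normalization described above):

```js
var cheerio = require('cheerio');

// Illustrative sketch of link extraction; not roboto's actual implementation.
function extractLinks(body) {
  var $ = cheerio.load(body);
  var hrefs = [];

  $('a').each(function() {
    var href = $(this).attr('href');
    var rel = ($(this).attr('rel') || '').toLowerCase();

    // Skip anchors without an href or marked nofollow/noindex.
    if (!href) return;
    if (/\b(nofollow|noindex)\b/.test(rel)) return;

    hrefs.push(href);
  });

  return hrefs;
}
```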