# WebCollector
WebCollector is an open-source web crawler framework based on Java. It provides
simple interfaces for crawling the Web, so you can set up a
multi-threaded web crawler in less than 5 minutes.
## HomePage
[https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)
<!--
## Document
[WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)
-->
## Installation
### Using Maven
```xml
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.71</version>
</dependency>
```
### Without Maven
WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).
+ __webcollector-version-bin.zip__ contains core jars.
## Quickstart
Let's crawl some news from the hfut news site. This demo prints the titles and contents extracted from its news pages.
### Automatically Detecting URLs
[AutoNewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/AutoNewsCrawler.java):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class AutoNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically
     *                  extract links which match the regex rules from pages
     */
    public AutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start page*/
        this.addSeed("http://news.hfut.edu.cn/list-1-1.html");

        /*fetch urls like http://news.hfut.edu.cn/show-xxxxxx.html*/
        this.addRegex("http://news.hfut.edu.cn/show-.*html");
        /*do not fetch jpg|png|gif (a regex prefixed with "-" is a negative rule)*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch urls containing "#"*/
        this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        // setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if the page is a news page*/
        if (page.matchUrl("http://news.hfut.edu.cn/show-.*html")) {
            /*extract title and content of the news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl, add them to next*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and a link added to next does not match the
              regex rules, the link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
        /*start crawling with a depth of 4*/
        crawler.start(4);
    }
}
```
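The commented-out `setResumable(true)` above enables breakpoint resumption. As a minimal sketch (assuming `setResumable` is callable on the crawler instance, as the commented line suggests), a restarted crawler would then continue from the state stored under the `crawl` directory instead of re-fetching from the seed:
```java
public static void main(String[] args) throws Exception {
    AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
    // resume from the crawl history stored in the "crawl" directory
    // instead of starting over from the seed page
    crawler.setResumable(true);
    crawler.start(4);
}
```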
### Manually Detecting URLs
[ManualNewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/ManualNewsCrawler.java):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class ManualNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically
     *                  extract links which match the regex rules from pages
     */
    public ManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*add 10 start pages and set their type to "list";
          "list" is not a reserved word, you may use another string instead*/
        for (int i = 1; i <= 10; i++) {
            this.addSeed("http://news.hfut.edu.cn/list-1-" + i + ".html", "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        // setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        if (page.matchType("list")) {
            /*if type is "list", detect content pages by css selector
              and mark their type as "content"*/
            next.add(page.links("div[class=' col-lg-8 '] li>a")).type("content");
        } else if (page.matchType("content")) {
            /*if type is "content", extract title and content of the news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody", 0);

            //read title_prefix and content_length_limit from the configuration
            title = getConf().getString("title_prefix") + title;
            //cap at content_length_limit, guarding against content shorter than the limit
            content = content.substring(0, Math.min(content.length(), getConf().getInteger("content_length_limit")));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        ManualNewsCrawler crawler = new ManualNewsCrawler("crawl", false);

        //set the interval between requests to 5000 ms
        crawler.getConf().setExecuteInterval(5000);
        crawler.getConf().set("title_prefix", "PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start crawling with a depth of 4*/
        crawler.start(4);
    }
}
```
## CrawlDatum
CrawlDatum is an important data structure in WebCollector: it corresponds to the URL of a web page. Both crawled URLs and newly detected URLs are maintained as CrawlDatums.
There are some differences between a CrawlDatum and a plain URL:
+ A CrawlDatum contains a key in addition to a URL. By default the key is the URL itself, but you can set it manually with `CrawlDatum.key("xxxxx")`, so CrawlDatums with the same URL may have different keys. This is very useful in tasks such as crawling data through an API, which often requests different data from the same URL with different POST parameters (see the sketch after this list).
+ A CrawlDatum may also contain metadata, which maintains extra information besides the URL.
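
For illustration, here is a minimal sketch of the key mechanism, assuming `CrawlDatum` exposes the chainable `key(...)` and `meta(...)` setters used elsewhere in this document, together with the usual `key()`/`url()` getters (the endpoint URL is made up):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;

public class CrawlDatumKeyDemo {
    public static void main(String[] args) {
        // hypothetical API endpoint: the same URL returns different data
        // depending on POST parameters, so the key tells the requests apart
        CrawlDatum first = new CrawlDatum("http://api.example.com/data")
                .key("data_page_1")
                .meta("page_num", 1);
        CrawlDatum second = new CrawlDatum("http://api.example.com/data")
                .key("data_page_2")
                .meta("page_num", 2);

        // same URL, different keys: deduplication works on the key,
        // so both datums can be fetched
        System.out.println(first.key() + " -> " + first.url());
        System.out.println(second.key() + " -> " + second.url());
    }
}
```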
## Manually Detecting URLs
In both `void visit(Page page, CrawlDatums next)` and `void execute(Page page, CrawlDatums next)`, the second parameter `CrawlDatums next` is a container into which you should put the detected URLs:
```java
//add one detected URL
next.add("detected URL");
//add one detected URL and set its type
next.add("detected URL", "type");
//add one detected URL
next.add(new CrawlDatum("detected URL"));
//add detected URLs
next.add("detected URL list");
//add detected URLs
next.add(("detected URL list","type");
//add detected URLs
next.add(new CrawlDatums("detected URL list"));
//add one detected URL and return the added URL(CrawlDatum)
//and set its key and type
next.addAndReturn("detected URL").key("key").type("type");
//add detected URLs and return the added URLs(CrawlDatums)
//and set their type and meta info
next.addAndReturn("detected URL list").type("type").meta("page_num",10);
//add detected URL and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URL
next.add("det