# WebCollector
WebCollector is an open source web crawler framework based on Java. It provides
simple interfaces for crawling the Web, and you can set up a
multi-threaded web crawler in less than 5 minutes.
## HomePage
[https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)
## Document
[WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)
## Installation
### Using Maven
To use the latest release of WebCollector, add the following snippet to your `pom.xml`:
```xml
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.09</version>
</dependency>
```
### Without Maven
WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).
+ __webcollector-version-bin.zip__ contains core jars.
## Quickstart
Let's crawl some news from HFUT News. This demo prints the title and content extracted from each HFUT news page.
[NewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/NewsCrawler.java):

    import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
    import org.jsoup.nodes.Document;

    /**
     * Crawling news from hfut news
     *
     * @author hu
     */
    public class NewsCrawler extends BreadthCrawler {

        /**
         * @param crawlPath crawlPath is the path of the directory which maintains
         * information of this crawler
         * @param autoParse if autoParse is true, BreadthCrawler will automatically
         * extract links which match the regex rules from each page
         */
        public NewsCrawler(String crawlPath, boolean autoParse) {
            super(crawlPath, autoParse);
            /*start page*/
            this.addSeed("http://news.hfut.edu.cn/list-1-1.html");

            /*fetch url like http://news.hfut.edu.cn/show-xxxxxxhtml*/
            this.addRegex("http://news.hfut.edu.cn/show-.*html");
            /*do not fetch jpg|png|gif*/
            this.addRegex("-.*\\.(jpg|png|gif).*");
            /*do not fetch urls that contain #*/
            this.addRegex("-.*#.*");
        }

        @Override
        public void visit(Page page, CrawlDatums next) {
            String url = page.getUrl();
            /*if page is a news page*/
            if (page.matchUrl("http://news.hfut.edu.cn/show-.*html")) {
                /*we use jsoup to parse the page*/
                Document doc = page.getDoc();

                /*extract title and content of the news by css selector*/
                String title = page.select("div[id=Article]>h2").first().text();
                String content = page.select("div#artibody", 0).text();

                System.out.println("URL:\n" + url);
                System.out.println("title:\n" + title);
                System.out.println("content:\n" + content);

                /*If you want to add urls to crawl, add them to next*/
                /*WebCollector automatically filters links that have been fetched before*/
                /*If autoParse is true and a link you add to next does not match the regex rules, the link will also be filtered.*/
                //next.add("http://xxxxxx.com");
            }
        }

        public static void main(String[] args) throws Exception {
            NewsCrawler crawler = new NewsCrawler("crawl", true);
            crawler.setThreads(50);
            crawler.setTopN(100);
            //crawler.setResumable(true);
            /*start crawl with depth of 4*/
            crawler.start(4);
        }

    }
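
The demo registers three URL rules: one pattern for pages to fetch and two patterns prefixed with `-` for pages to skip. The following is a minimal standalone sketch of how such rules could be evaluated, using only `java.util.regex`. The class `UrlRuleSketch` and its methods are hypothetical and are not part of WebCollector. The semantics are assumptions inferred from the comments in the demo: a leading `-` marks an exclusion rule, any other rule is an inclusion rule, and a pattern must match the whole URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not WebCollector's own rule class. */
public class UrlRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** Assumed convention: "-" prefix excludes, "+" prefix or no prefix includes. */
    public void addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else if (rule.startsWith("+")) {
            positive.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    /** Accepts a URL if no exclusion rule matches and at least one inclusion rule matches. */
    public boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleSketch rules = new UrlRuleSketch();
        rules.addRegex("http://news.hfut.edu.cn/show-.*html");
        rules.addRegex("-.*\\.(jpg|png|gif).*");
        rules.addRegex("-.*#.*");

        // News page: matches the inclusion rule, no exclusion applies -> true
        System.out.println(rules.accepts("http://news.hfut.edu.cn/show-1-2.html"));
        // Contains '#': excluded -> false
        System.out.println(rules.accepts("http://news.hfut.edu.cn/show-1-2.html#top"));
        // List page: matches no inclusion rule -> false
        System.out.println(rules.accepts("http://news.hfut.edu.cn/list-1-1.html"));
    }
}
```

Checking exclusion rules first means a URL that matches both kinds of rule is rejected, which is the behavior the `jpg|png|gif` and `#` rules in the demo appear to rely on.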
## Content Extraction
WebCollector can automatically extract content from news web pages. Each extractor is available in three variants: from HTML plus the page URL, from HTML alone, or directly from a URL.

    // Extract a News object
    News news = ContentExtractor.getNewsByHtml(html, url);
    News news = ContentExtractor.getNewsByHtml(html);
    News news = ContentExtractor.getNewsByUrl(url);

    // Extract the main content as plain text
    String content = ContentExtractor.getContentByHtml(html, url);
    String content = ContentExtractor.getContentByHtml(html);
    String content = ContentExtractor.getContentByUrl(url);

    // Extract the element that holds the main content
    Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
    Element contentElement = ContentExtractor.getContentElementByHtml(html);
    Element contentElement = ContentExtractor.getContentElementByUrl(url);

### Other Documentation
+ [Chinese documentation (中文文档)](https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md)
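
WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to extend. It provides a compact API, so a powerful crawler can be implemented with only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling. This package is version 2.26, obtained from http://git.oschina.net/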