# WebCollector
WebCollector is an open-source web crawler framework based on Java. It provides
simple interfaces for crawling the Web, so you can set up a
multi-threaded web crawler in less than 5 minutes.
## HomePage
[https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)
<!--
## Document
[WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)
-->
## Installation
### Using Maven
```xml
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.71</version>
</dependency>
```
### Without Maven
WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).
+ __webcollector-version-bin.zip__ contains core jars.
## Quickstart
Let's crawl some news from the hfut news site. This demo prints the titles and contents extracted from its news pages.
### Automatically Detecting URLs
[AutoNewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/AutoNewsCrawler.java):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class AutoNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically
     *                  extract links which match the regex rules from pages
     */
    public AutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start page*/
        this.addSeed("http://news.hfut.edu.cn/list-1-1.html");

        /*fetch urls like http://news.hfut.edu.cn/show-xxxxxx.html*/
        this.addRegex("http://news.hfut.edu.cn/show-.*html");
        /*do not fetch jpg|png|gif (a regex prefixed with "-" is a negative rule)*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch urls containing "#"*/
        this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        // setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if the page is a news page*/
        if (page.matchUrl("http://news.hfut.edu.cn/show-.*html")) {
            /*extract title and content of the news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl, add them to next*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and a link added to next does not match the
              regex rules, the link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
        /*start crawling with a depth of 4*/
        crawler.start(4);
    }
}
```
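The commented-out `setResumable(true)` above enables breakpoint resumption. As a minimal sketch (assuming `setResumable` is callable on the crawler instance, as the commented line suggests), a restarted crawler would then continue from the state stored under the `crawl` directory instead of re-fetching from the seed:
```java
public static void main(String[] args) throws Exception {
    AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
    // resume from the crawl history stored in the "crawl" directory
    // instead of starting over from the seed page
    crawler.setResumable(true);
    crawler.start(4);
}
```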
### Manually Detecting URLs
[ManualNewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/ManualNewsCrawler.java):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class ManualNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically
     *                  extract links which match the regex rules from pages
     */
    public ManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*add 10 start pages and set their type to "list";
          "list" is not a reserved word, you may use another string instead*/
        for (int i = 1; i <= 10; i++) {
            this.addSeed("http://news.hfut.edu.cn/list-1-" + i + ".html", "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        // setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        if (page.matchType("list")) {
            /*if type is "list", detect content pages by css selector
              and mark their type as "content"*/
            next.add(page.links("div[class=' col-lg-8 '] li>a")).type("content");
        } else if (page.matchType("content")) {
            /*if type is "content", extract title and content of the news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody", 0);

            //read title_prefix and content_length_limit from the configuration
            title = getConf().getString("title_prefix") + title;
            //cap at content_length_limit, guarding against content shorter than the limit
            content = content.substring(0, Math.min(content.length(), getConf().getInteger("content_length_limit")));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        ManualNewsCrawler crawler = new ManualNewsCrawler("crawl", false);

        //set the interval between requests to 5000 ms
        crawler.getConf().setExecuteInterval(5000);
        crawler.getConf().set("title_prefix", "PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start crawling with a depth of 4*/
        crawler.start(4);
    }
}
```
## CrawlDatum
CrawlDatum is an important data structure in WebCollector: it corresponds to the URL of a web page. Both crawled URLs and newly detected URLs are maintained as CrawlDatums.
There are some differences between a CrawlDatum and a plain URL:
+ A CrawlDatum contains a key in addition to a URL. By default the key is the URL itself, but you can set it manually with `CrawlDatum.key("xxxxx")`, so CrawlDatums with the same URL may have different keys. This is very useful in tasks such as crawling data through an API, which often requests different data from the same URL with different POST parameters (see the sketch after this list).
+ A CrawlDatum may also contain metadata, which maintains extra information besides the URL.
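
For illustration, here is a minimal sketch of the key mechanism, assuming `CrawlDatum` exposes the chainable `key(...)` and `meta(...)` setters used elsewhere in this document, together with the usual `key()`/`url()` getters (the endpoint URL is made up):
```java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;

public class CrawlDatumKeyDemo {
    public static void main(String[] args) {
        // hypothetical API endpoint: the same URL returns different data
        // depending on POST parameters, so the key tells the requests apart
        CrawlDatum first = new CrawlDatum("http://api.example.com/data")
                .key("data_page_1")
                .meta("page_num", 1);
        CrawlDatum second = new CrawlDatum("http://api.example.com/data")
                .key("data_page_2")
                .meta("page_num", 2);

        // same URL, different keys: deduplication works on the key,
        // so both datums can be fetched
        System.out.println(first.key() + " -> " + first.url());
        System.out.println(second.key() + " -> " + second.url());
    }
}
```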
## Manually Detecting URLs
In both `void visit(Page page, CrawlDatums next)` and `void execute(Page page, CrawlDatums next)`, the second parameter `CrawlDatums next` is a container into which you should put the detected URLs:
```java
//add one detected URL
next.add("detected URL");
//add one detected URL and set its type
next.add("detected URL", "type");
//add one detected URL
next.add(new CrawlDatum("detected URL"));
//add detected URLs
next.add("detected URL list");
//add detected URLs
next.add(("detected URL list","type");
//add detected URLs
next.add(new CrawlDatums("detected URL list"));
//add one detected URL and return the added URL(CrawlDatum)
//and set its key and type
next.addAndReturn("detected URL").key("key").type("type");
//add detected URLs and return the added URLs(CrawlDatums)
//and set their type and meta info
next.addAndReturn("detected URL list").type("type").meta("page_num",10);
//add detected URL and return next
//and modify the type and meta info of all the CrawlDatums in next,
//including the added URL
next.add("det