WebCollector source code 2.26

106 files in total

Java: 83

XML: 7

Markdown: 5


Uploaded 2016-02-05 23:20:18 · 9.13 MB ZIP


WebCollector source code 2.26 (106 files)

.gitignore 148B

Lazy.iml 2KB

WebCollector.iml 2KB

Fetcher.java 18KB

ContentExtractor.java 17KB

FetcherReducer.java 15KB

HttpRequest.java 12KB

LazyConfig.java 9KB

Page.java 8KB

BerkeleyDBManager.java 8KB

Crawler.java 7KB

DemoBingCrawler.java 6KB

FileUtils.java 6KB

CrawlDatum.java 6KB

KMeans.java 5KB

Links.java 5KB

HttpResponse.java 5KB

Merge.java 5KB

Crawler.java 5KB

CharsetDetector.java 4KB

Links.java 4KB

BerkeleyGenerator.java 4KB

CrawlDatum.java 4KB

TutorialCrawler.java 4KB

LazyCrawler.java 4KB

CrawlDatumFormater.java 4KB

FetcherOutputFormat.java 4KB

Tutorial2Crawler.java 4KB

DemoPostCrawler.java 4KB

RamDBManager.java 4KB

BasicCrawler.java 4KB

CrawlDatumFormater.java 3KB

CrawlDatums.java 3KB

Generator.java 3KB

DemoDepthCrawler.java 3KB

RegexRule.java 3KB

CrawlDatums.java 3KB

WebpageKmeans.java 3KB

DemoSeleniumHttpRequest.java 3KB

RamCrawler.java 3KB

FileSystemOutput.java 3KB

Proxys.java 2KB

RamGenerator.java 2KB

BerkeleyDBUtils.java 2KB

DBReader.java 2KB

NewsCrawler.java 2KB

DemoSeleniumCrawler.java 2KB

CharsetDetectorTest.java 2KB

DBManager.java 2KB

MongoHelper.java 2KB

News.java 2KB

BreadthCrawler.java 2KB

JsoupUtils.java 2KB

Injector.java 2KB

SegmentUtil.java 2KB

RegexVisitor.java 2KB

Fetcher.java 1KB

WordsBag.java 1KB

Main.java 1KB

SegmentWriter.java 1KB

Config.java 1KB

RamDB.java 1KB

Visitor.java 1KB

Generator.java 1KB

Content.java 1KB

ParseData.java 1022B

DBLock.java 986B

Requester.java 985B

Requester.java 984B

Redirect.java 974B

Injector.java 969B

DBUpdater.java 924B

CrawlerConfiguration.java 817B

StopWords.java 781B

CommonRequester.java 734B

Plugin.java 445B

demo_task.json 811B

README.md 4KB

README.zh-cn.md 2KB

README.md 755B

CODE_COVERAGE.md 300B

README.md 97B

log4j.properties 234B

regex 85B

submit.sh 441B

build.sh 168B

stopwords.txt 3KB

pom.xml 11KB

pom.xml 8KB

pom.xml 5KB


# WebCollector

WebCollector is an open source web crawler framework written in Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.

## HomePage

[https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)

## Document

[WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)

## Installation

### Using Maven

To use the latest release of WebCollector, add the following snippet to your pom.xml:

```xml
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.09</version>
</dependency>
```

### Without Maven

WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).

+ __webcollector-version-bin.zip__ contains core jars.

## Quickstart

Let's crawl some news from hfut news. This demo prints out the titles and contents extracted from the news pages of hfut news.

[NewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/NewsCrawler.java):

    import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
    import org.jsoup.nodes.Document;

    /**
     * Crawling news from hfut news
     *
     * @author hu
     */
    public class NewsCrawler extends BreadthCrawler {

        /**
         * @param crawlPath crawlPath is the path of the directory which maintains
         * information of this crawler
         * @param autoParse if autoParse is true, BreadthCrawler will automatically
         * extract links which match the regex rules from each page
         */
        public NewsCrawler(String crawlPath, boolean autoParse) {
            super(crawlPath, autoParse);
            /* start page */
            this.addSeed("http://news.hfut.edu.cn/list-1-1.html");

            /* fetch urls like http://news.hfut.edu.cn/show-xxxxxxhtml */
            this.addRegex("http://news.hfut.edu.cn/show-.*html");
            /* do not fetch jpg|png|gif */
            this.addRegex("-.*\\.(jpg|png|gif).*");
            /* do not fetch urls that contain # */
            this.addRegex("-.*#.*");
        }

        @Override
        public void visit(Page page, CrawlDatums next) {
            String url = page.getUrl();
            /* if the page is a news page */
            if (page.matchUrl("http://news.hfut.edu.cn/show-.*html")) {
                /* we use jsoup to parse the page */
                Document doc = page.getDoc();

                /* extract the title and content of the news by css selector */
                String title = page.select("div[id=Article]>h2").first().text();
                String content = page.select("div#artibody", 0).text();

                System.out.println("URL:\n" + url);
                System.out.println("title:\n" + title);
                System.out.println("content:\n" + content);

                /* If you want to add urls to crawl, add them to next */
                /* WebCollector automatically filters links that have been fetched before */
                /* If autoParse is true and a link you add to next does not match
                   the regex rules, the link will also be filtered. */
                //next.add("http://xxxxxx.com");
            }
        }

        public static void main(String[] args) throws Exception {
            NewsCrawler crawler = new NewsCrawler("crawl", true);
            crawler.setThreads(50);
            crawler.setTopN(100);
            //crawler.setResumable(true);
            /* start crawl with depth of 4 */
            crawler.start(4);
        }

    }

## Content Extraction

WebCollector can automatically extract content from news web pages:

    News news = ContentExtractor.getNewsByHtml(html, url);
    News news = ContentExtractor.getNewsByHtml(html);
    News news = ContentExtractor.getNewsByUrl(url);

    String content = ContentExtractor.getContentByHtml(html, url);
    String content = ContentExtractor.getContentByHtml(html);
    String content = ContentExtractor.getContentByUrl(url);

    Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
    Element contentElement = ContentExtractor.getContentElementByHtml(html);
    Element contentElement = ContentExtractor.getContentElementByUrl(url);

### Other Documentation

+ [中文文档](https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md)
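The `addRegex` calls in the Quickstart mix an ordinary rule with rules prefixed by `-`, which the comments describe as "do not fetch". The sketch below uses only the JDK to illustrate one plausible way such rules could be evaluated: a URL is accepted when no negative rule matches it and at least one positive rule does. `UrlRuleFilter`, `addRule`, and `accepts` are hypothetical names introduced for this illustration and are not part of the WebCollector API; the exact matching semantics inside WebCollector are an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not a WebCollector class) illustrating "-"-prefixed URL rules. */
class UrlRuleFilter {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** Assumption: a rule starting with '-' is a negative rule, anything else is positive. */
    UrlRuleFilter addRule(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
        return this;
    }

    /** Accept a URL only if no negative rule matches and at least one positive rule matches. */
    boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same three rule strings as in the Quickstart constructor.
        UrlRuleFilter filter = new UrlRuleFilter()
                .addRule("http://news.hfut.edu.cn/show-.*html")
                .addRule("-.*\\.(jpg|png|gif).*")
                .addRule("-.*#.*");

        System.out.println(filter.accepts("http://news.hfut.edu.cn/show-1-123.html"));     // true
        System.out.println(filter.accepts("http://news.hfut.edu.cn/show-1-123.html#top")); // false
        System.out.println(filter.accepts("http://news.hfut.edu.cn/logo.png"));            // false
        System.out.println(filter.accepts("http://news.hfut.edu.cn/list-1-2.html"));       // false
    }
}
```

Checking negative rules first means an excluded URL is rejected even when it also matches a positive rule, as the `#top` example shows.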
