![logo](http://webmagic.io/images/logo.jpeg)
[Readme in Chinese](https://github.com/code4craft/webmagic/tree/master/README-zh.md)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/us.codecraft/webmagic-parent/badge.svg?subject=Maven%20Central)](https://maven-badges.herokuapp.com/maven-central/us.codecraft/webmagic-parent/)
[![License](https://img.shields.io/badge/License-Apache%20License%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0.html)
[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)
> A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.
## Features:
* Simple core with high flexibility.
* Simple API for HTML extraction.
* Customize a crawler with annotated POJOs, no configuration needed.
* Multi-threading and distributed crawling support.
* Easy to integrate.
## Install:
Add dependencies to your pom.xml:
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.5</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.5</version>
</dependency>
```
WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use a different slf4j implementation, exclude slf4j-log4j12 by adding the following inside the WebMagic `<dependency>` element:
```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```
## Get Started:
### First crawler:
Write a class that implements `PageProcessor`. For example, I wrote a crawler for GitHub repository information.
```java
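import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Crawl settings: retry failed downloads up to 3 times, sleep 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like a repository URL.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract fields from the page URL and the HTML.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // Not a repository page: skip it so no result is emitted.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Start from the code4craft profile page and crawl with 5 threads.
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}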
```
* `page.addTargetRequests(links)`
    Adds URLs to be crawled.

You can also use the annotation approach:
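
To see which links the pattern passed to `regex(...)` would select, the filter can be approximated in plain Java. This is only a sketch using `java.util.regex`, not WebMagic's own selector code; the class name `RepoLinkFilter` and the sample URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: approximates links().regex(...).all() with the JDK regex API.
class RepoLinkFilter {
    // Same pattern as in the processor: an owner segment followed by a repository segment.
    private static final Pattern REPO_LINK =
            Pattern.compile("(https://github\\.com/\\w+/\\w+)");

    // Returns capture group 1 of the first match in each link; non-matching links are dropped.
    static List<String> filter(List<String> links) {
        List<String> matched = new ArrayList<>();
        for (String link : links) {
            Matcher m = REPO_LINK.matcher(link);
            if (m.find()) {
                matched.add(m.group(1));
            }
        }
        return matched;
    }
}
```

Under these assumptions, a profile URL such as `https://github.com/code4craft` is dropped because it has no repository segment, while a deeper URL such as `https://github.com/code4craft/webmagic/issues` is reduced to its `owner/repo` prefix.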
```java
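import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// TargetUrl: pages to extract into this model. HelpUrl: pages used only to discover links.
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    // notNull = true: pages where this field cannot be extracted are skipped.
    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        // Print each extracted model to the console.
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}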
```
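
The `@ExtractByUrl` pattern above carries one capture group, which is what ends up in `author`. The following plain-Java sketch shows what that group captures for a given URL. It assumes the first capture group is the extracted value, and the class name `AuthorFromUrl` is hypothetical; it does not call WebMagic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies the @ExtractByUrl pattern with the JDK regex API.
class AuthorFromUrl {
    // Same pattern as in the annotation: the owner segment is capture group 1.
    private static final Pattern AUTHOR =
            Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Returns the owner segment of the URL, or null when the pattern does not match.
    static String extract(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }
}
```

Note that the pattern requires a `/` after the owner segment, so a bare profile URL without a trailing path yields no author in this sketch.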
### Docs and samples:
Documentation: [http://webmagic.io/docs/](http://webmagic.io/docs/)
The architecture of WebMagic (with reference to [Scrapy](http://scrapy.org/)):
![image](http://code4craft.github.io/images/posts/webmagic.png)
There are more examples in the `webmagic-samples` package.
### License:
Licensed under the [Apache 2.0 license](http://opensource.org/licenses/Apache-2.0)
### Thanks:
To write WebMagic, I referred to the projects below:
* **Scrapy**
A crawler framework in Python.
[http://scrapy.org/](http://scrapy.org/)
* **Spiderman**
Another crawler framework in Java.
[http://git.oschina.net/l-weiwei/spiderman](http://git.oschina.net/l-weiwei/spiderman)
### Mailing list:
[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java)
[http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988](http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988)
QQ Group: 373225642 542327088
### Related Project
* <a href="https://github.com/gsh199449/spider" target="_blank">Gather Platform</a>
A web console based on WebMagic for Spider configuration and management.