# crawler
A simple and flexible web crawler framework for Java.
## Features:
1. Easy-to-understand, highly customizable code
2. A simple, easy-to-use API
3. File download and part (chunked) fetching
4. Rich request and response options, with each request individually customizable
5. Hooks for running your own logic before and after each network request in the downloader
6. Selenium + PhantomJS support
7. Redis support
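Roughly speaking, a spider polls a scheduler for the next URL and the scheduler drops URLs it has already seen (the project ships a `DefaultScheduler` that keeps this state in memory and a `RedisScheduler` that shares it through Redis). The following stdlib-only sketch illustrates that push/poll-with-deduplication idea; the class and method names here are hypothetical, not part of the framework's API:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a scheduler with URL de-duplication,
// mirroring the Scheduler + DuplicateRemover split in this project.
public class SchedulerSketch {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Enqueue the URL only if it has not been seen before.
    public synchronized boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate, dropped
        }
        return queue.offer(url);
    }

    // Next URL to crawl, or null when the queue is drained.
    public synchronized String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch();
        s.push("https://github.com/xbynet");
        s.push("https://github.com/xbynet"); // duplicate, ignored
        s.push("https://github.com/xbynet/crawler");
        System.out.println(s.poll()); // https://github.com/xbynet
        System.out.println(s.poll()); // https://github.com/xbynet/crawler
        System.out.println(s.poll()); // null
    }
}
```

A Redis-backed variant would replace the in-memory `seen` set with a Redis set so multiple crawler processes can share the same de-duplication state.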
## Future:
1. Complete the code comments and tests
## Install:
This is a plain Maven Java SE project, so you can clone the code and package the jar yourself (e.g. with `mvn package`).
## Demo:
```java
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;

import net.xby1993.crawler.Const;
import net.xby1993.crawler.Processor;
import net.xby1993.crawler.Request;
import net.xby1993.crawler.RequestAction;
import net.xby1993.crawler.Response;
import net.xby1993.crawler.Site;
import net.xby1993.crawler.Spider;
import net.xby1993.crawler.http.DefaultDownloader;
import net.xby1993.crawler.http.FileDownloader;
import net.xby1993.crawler.http.HttpClientFactory;
import net.xby1993.crawler.parser.JsoupParser;
import net.xby1993.crawler.scheduler.DefaultScheduler;

public class GithubCrawler extends Processor {
    @Override
    public void process(Response resp) {
        String currentUrl = resp.getRequest().getUrl();
        System.out.println("CurrentUrl:" + currentUrl);
        int respCode = resp.getCode();
        System.out.println("ResponseCode:" + respCode);
        System.out.println("type:" + resp.getRespType().name());
        String contentType = resp.getContentType();
        System.out.println("ContentType:" + contentType);
        Map<String, List<String>> headers = resp.getHeaders();
        System.out.println("ResponseHeaders:");
        for (String key : headers.keySet()) {
            for (String value : headers.get(key)) {
                System.out.println(key + ":" + value);
            }
        }
        JsoupParser parser = resp.html();
        // Part (chunked) fetching is supported: a parent response links all part responses.
        // System.out.println("isParted:" + resp.isPartResponse());
        // Response parent = resp.getParentResponse();
        // resp.addPartRequest(null);
        // Map<String, Object> extras = resp.getRequest().getExtras();
        if (currentUrl.equals("https://github.com/xbynet")) {
            String avatar = parser.single("img.avatar", "src");
            String dir = System.getProperty("java.io.tmpdir");
            String savePath = Paths.get(dir, UUID.randomUUID().toString()).toString();
            boolean avatarDownloaded = download(avatar, savePath);
            System.out.println("avatar:" + avatar + ", saved:" + savePath
                    + ", status:" + avatarDownloaded);
            String name = parser.single(".vcard-names > .vcard-fullname", "text");
            System.out.println("name:" + name);
            List<String> repoNames = parser.list(".pinned-repos-list .repo.js-repo", "text");
            List<String> repoUrls = parser.list(".pinned-repo-item .d-block > a", "href");
            System.out.println("repoName:url");
            if (repoNames != null) {
                for (int i = 0; i < repoNames.size(); i++) {
                    String repoUrl = "https://github.com" + repoUrls.get(i);
                    System.out.println(repoNames.get(i) + ":" + repoUrl);
                    Request req = new Request(repoUrl).putExtra("name", repoNames.get(i));
                    resp.addRequest(req);
                }
            }
        } else {
            Map<String, Object> extras = resp.getRequest().getExtras();
            String name = extras.get("name").toString();
            System.out.println("repoName:" + name);
            String shortDesc = parser.single(".repository-meta-content", "allText");
            System.out.println("shortDesc:" + shortDesc);
        }
    }

    public void start() {
        Site site = new Site();
        Spider spider = Spider.builder(this).threadNum(5).site(site)
                .urls("https://github.com/xbynet").build();
        spider.run();
    }

    public static void main(String[] args) {
        new GithubCrawler().start();
    }

    // A fuller configuration example showing most site, request, and spider options.
    public void startCompleteConfig() {
        String pcUA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
        String androidUA = "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36";

        Site site = new Site();
        site.setEncoding("UTF-8").setHeader("Referer", "https://github.com/")
                .setRetry(3).setRetrySleep(3000).setSleep(50).setTimeout(30000)
                .setUa(pcUA);

        Request request = new Request("https://github.com/xbynet");
        HttpClientContext ctx = new HttpClientContext();
        BasicCookieStore cookieStore = new BasicCookieStore();
        ctx.setCookieStore(cookieStore);
        request.setAction(new RequestAction() {
            @Override
            public void before(CloseableHttpClient client, HttpUriRequest req) {
                System.out.println("before request");
            }

            @Override
            public void after(CloseableHttpClient client, CloseableHttpResponse resp) {
                System.out.println("after request");
            }
        }).setCtx(ctx).setEncoding("UTF-8")
                .putExtra("somekey", "a value you can read back from the response")
                .setHeader("User-Agent", pcUA).setMethod(Const.HttpMethod.GET)
                .setPartRequest(null).setEntity(null)
                .setParams("appkeyqqqqqq", "1213131232141").setRetryCount(5)
                .setRetrySleepTime(10000);

        Spider spider = Spider.builder(this).threadNum(5)
                .name("Spider-github-xbynet")
                .defaultDownloader(new DefaultDownloader())
                .fileDownloader(new FileDownloader())
                .httpClientFactory(new HttpClientFactory()).ipProvider(null)
                .listener(null).pool(null).scheduler(new DefaultScheduler())
                .shutdownOnComplete(true).site(site).build();
        spider.run();
    }
}
```
## Examples:
- Github (GitHub profile and repository info)
- OSChinaTweets (OSChina tweets)
- Qiushibaike (Qiushibaike jokes)
- Neihanshequ (Neihan Duanzi jokes)
- ZhihuRecommend (Zhihu recommendations)
**More Examples:** Please see [here](https://github.com/xbynet/crawler/tree/master/crawler-core/src/test/java/net/xby1993/crawler)
## Thanks:
- [webmagic](https://github.com/code4craft/webmagic): this project borrows code from webmagic in many places and draws heavily on its design. Many thanks.
- [xsoup](https://github.com/code4craft/xsoup): used as the underlying XPath processor
- [JsonPath](https://github.com/json-path/JsonPath): used as the underlying JsonPath processor
- [Jsoup](https://jsoup.org/): used as the underlying HTML/XML processor
- [HttpClient](http://hc.apache.org/): used as the underlying HTTP client