# crawler4j
[![Build Status](https://travis-ci.org/yasserg/crawler4j.svg?branch=master)](https://travis-ci.org/yasserg/crawler4j)
[![Maven Central](https://img.shields.io/maven-central/v/edu.uci.ics/crawler4j.svg?style=flat-square)](https://search.maven.org/search?q=g:edu.uci.ics%20a:crawler4j)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/crawler4j/Lobby)
crawler4j is an open source web crawler for Java which provides a simple interface for
crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
## Table of Contents
- [Installation](#installation)
- [Quickstart](#quickstart)
- [More Examples](#more-examples)
- [Configuration Details](#configuration-details)
- [License](#license)
## Installation
### Using Maven
Add the following dependency to your pom.xml:
```xml
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>
```
### Using Gradle
Add the following dependency to your build.gradle file:
```groovy
compile group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'
```
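Note that newer Gradle versions deprecated the `compile` configuration and removed it in Gradle 7; there, declare the dependency with `implementation` instead:
```groovy
implementation group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'
```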
## Quickstart
You need to create a crawler class that extends WebCrawler. This class decides which URLs
should be crawled and handles the downloaded page. The following is a sample
implementation:
```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new URL and the second parameter is
     * the new URL. You should implement this method to specify whether
     * the given URL should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "https://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("https://www.ics.uci.edu/");
    }

    /**
     * This method is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
```
As can be seen in the above code, there are two main methods that should be overridden:
- shouldVisit: This method decides whether the given URL should be crawled or not. In
the above example, the filter rejects .css, .js, and media files, and only allows
pages within the 'www.ics.uci.edu' domain.
- visit: This method is called after the content of a URL is downloaded successfully.
You can easily get the URL, text, links, HTML, and unique id of the downloaded page.
You should also implement a controller class which specifies the seeds of the crawl,
the folder in which intermediate crawl data should be stored and the number of concurrent
threads:
```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched, and then the crawler starts following links
        // found in these pages.
        controller.addSeed("https://www.ics.uci.edu/~lopes/");
        controller.addSeed("https://www.ics.uci.edu/~welling/");
        controller.addSeed("https://www.ics.uci.edu/");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<MyCrawler> factory = MyCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }
}
```
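If you don't want `start` to block, `CrawlController` also offers a non-blocking variant. A minimal sketch, reusing `factory` and `numberOfCrawlers` from the example above:
```java
// Start the crawl and return immediately instead of blocking.
controller.startNonBlocking(factory, numberOfCrawlers);

// ... do other work here while the crawl runs ...

// Block until the crawl is finished.
controller.waitUntilFinish();
```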
## More Examples
- [Basic crawler](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/basic/): the full source code of the above example with more details.
- [Image crawler](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler/): a simple image crawler that downloads image content from the crawled domains and stores it in a folder. This example demonstrates how binary content can be fetched using crawler4j.
- [Collecting data from threads](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/localdata/): this example demonstrates how the controller can collect data/statistics from crawling threads.
- [Multiple crawlers](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/multiple/): this is a sample that shows how two distinct crawlers can run concurrently. For example, you might want to split your crawling into different domains and then take different crawling policies for each group. Each crawling controller can have its own configurations.
- [Shutdown crawling](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/shutdown/): this example shows how crawling can be terminated gracefully by sending the 'shutdown' command to the controller (see the sketch after this list).
- [Postgres/JDBC integration](crawler4j-examples/crawler4j-examples-postgres/): this example shows how to save crawled content into a Postgres database (or any other JDBC repository), thanks to [rzo1](https://github.com/rzo1/).
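For orientation, the graceful shutdown in that example boils down to two calls on a running controller. A minimal sketch, assuming `controller` was started with `startNonBlocking` as shown earlier:
```java
// Request a graceful stop: crawler threads finish their current page
// and then exit instead of picking up new URLs.
controller.shutdown();

// Block until all crawler threads have terminated.
controller.waitUntilFinish();
```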
## Configuration Details
The controller class has a mandatory parameter of type [CrawlConfig](crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java).
Instances of this class are used to configure crawler4j. The following sections
describe some of the configuration details.
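As a quick orientation before the individual sections, the sketch below combines several common settings on one `CrawlConfig` instance (the values are illustrative, not recommendations):
```java
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root"); // where intermediate crawl data is stored
config.setMaxDepthOfCrawling(2);                  // see "Crawl depth" below
config.setIncludeHttpsPages(true);                // see "Enable SSL" below
config.setMaxPagesToFetch(1000);                  // see "Maximum number of pages to crawl" below
config.setPolitenessDelay(1000);                  // ms to wait between requests to the same host
```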
### Crawl depth
By default there is no limit on the depth of crawling, but you can set one. For example, assume you have a seed page "A" that links to "B", which links to "C", which links to "D". This gives the following link structure:

A -> B -> C -> D

Since "A" is a seed page, it has a depth of 0; "B" has a depth of 1, and so on. If you set the depth limit to 2, crawler4j will not crawl page "D". To set the maximum depth, use:
```java
crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);
```
### Enable SSL
To enable SSL, simply:
```java
CrawlConfig config = new CrawlConfig();
config.setIncludeHttpsPages(true);
```
### Maximum number of pages to crawl
By default there is no limit on the number of pages to crawl, but you can set one via the corresponding `CrawlConfig` setter:
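```java
crawlConfig.setMaxPagesToFetch(maxPagesToFetch);
```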