# crawler4j
[![Build Status](https://travis-ci.org/yasserg/crawler4j.svg?branch=master)](https://travis-ci.org/yasserg/crawler4j)
[![Maven Central](https://img.shields.io/maven-central/v/edu.uci.ics/crawler4j.svg?style=flat-square)](https://search.maven.org/search?q=g:edu.uci.ics%20a:crawler4j)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/crawler4j/Lobby)
crawler4j is an open source web crawler for Java which provides a simple interface for
crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
## Table of Contents
- [Installation](#installation)
- [Quickstart](#quickstart)
- [More Examples](#more-examples)
- [Configuration Details](#configuration-details)
- [License](#license)
## Installation
### Using Maven
Add the following dependency to your pom.xml:
```xml
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>
```
### Using Gradle
Add the following dependency to your build.gradle file:
```groovy
compile group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'
```
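Note that newer Gradle versions deprecated the `compile` configuration and removed it in Gradle 7; there, declare the dependency with `implementation` instead:
```groovy
implementation group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'
```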
## Quickstart
You need to create a crawler class that extends WebCrawler. This class decides which URLs
should be crawled and handles the downloaded page. The following is a sample
implementation:
```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new URL and the second parameter is
     * the new URL. You should implement this method to specify whether
     * the given URL should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "https://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("https://www.ics.uci.edu/");
    }

    /**
     * This method is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
```
As can be seen in the above code, there are two main methods that should be overridden:
- shouldVisit: This method decides whether the given URL should be crawled or not. In
the above example, the filter rejects .css, .js, and media files, and only allows
pages within the 'www.ics.uci.edu' domain.
- visit: This method is called after the content of a URL is downloaded successfully.
You can easily get the URL, text, links, HTML, and unique id of the downloaded page.
You should also implement a controller class which specifies the seeds of the crawl,
the folder in which intermediate crawl data should be stored and the number of concurrent
threads:
```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched, and then the crawler starts following links
        // found in these pages.
        controller.addSeed("https://www.ics.uci.edu/~lopes/");
        controller.addSeed("https://www.ics.uci.edu/~welling/");
        controller.addSeed("https://www.ics.uci.edu/");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<MyCrawler> factory = MyCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }
}
```
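If you don't want `start` to block, `CrawlController` also offers a non-blocking variant. A minimal sketch, reusing `factory` and `numberOfCrawlers` from the example above:
```java
// Start the crawl and return immediately instead of blocking.
controller.startNonBlocking(factory, numberOfCrawlers);

// ... do other work here while the crawl runs ...

// Block until the crawl is finished.
controller.waitUntilFinish();
```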
## More Examples
- [Basic crawler](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/basic/): the full source code of the above example with more details.
- [Image crawler](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler/): a simple image crawler that downloads image content from the crawled domains and stores it in a folder. This example demonstrates how binary content can be fetched using crawler4j.
- [Collecting data from threads](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/localdata/): this example demonstrates how the controller can collect data/statistics from crawling threads.
- [Multiple crawlers](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/multiple/): this is a sample that shows how two distinct crawlers can run concurrently. For example, you might want to split your crawling into different domains and then take different crawling policies for each group. Each crawling controller can have its own configurations.
- [Shutdown crawling](crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/shutdown/): this example shows how crawling can be terminated gracefully by sending the 'shutdown' command to the controller (see the sketch after this list).
- [Postgres/JDBC integration](crawler4j-examples/crawler4j-examples-postgres/): this example shows how to save crawled content into a Postgres database (or any other JDBC repository), thanks to [rzo1](https://github.com/rzo1/).
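For orientation, the graceful shutdown in that example boils down to two calls on a running controller. A minimal sketch, assuming `controller` was started with `startNonBlocking` as shown earlier:
```java
// Request a graceful stop: crawler threads finish their current page
// and then exit instead of picking up new URLs.
controller.shutdown();

// Block until all crawler threads have terminated.
controller.waitUntilFinish();
```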
## Configuration Details
The controller class has a mandatory parameter of type [CrawlConfig](crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java).
Instances of this class are used to configure crawler4j. The following sections
describe some of the configuration details.
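As a quick orientation before the individual sections, the sketch below combines several common settings on one `CrawlConfig` instance (the values are illustrative, not recommendations):
```java
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root"); // where intermediate crawl data is stored
config.setMaxDepthOfCrawling(2);                  // see "Crawl depth" below
config.setIncludeHttpsPages(true);                // see "Enable SSL" below
config.setMaxPagesToFetch(1000);                  // see "Maximum number of pages to crawl" below
config.setPolitenessDelay(1000);                  // ms to wait between requests to the same host
```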
### Crawl depth
By default there is no limit on the depth of crawling, but you can set one. For example, assume you have a seed page "A" that links to "B", which links to "C", which links to "D". This gives the following link structure:

A -> B -> C -> D

Since "A" is a seed page, it has a depth of 0; "B" has a depth of 1, and so on. If you set the depth limit to 2, crawler4j will not crawl page "D". To set the maximum depth, use:
```java
crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);
```
### Enable SSL
To enable SSL, simply:
```java
CrawlConfig config = new CrawlConfig();
config.setIncludeHttpsPages(true);
```
### Maximum number of pages to crawl
By default there is no limit on the number of pages to crawl, but you can set one via the corresponding `CrawlConfig` setter:
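```java
crawlConfig.setMaxPagesToFetch(maxPagesToFetch);
```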