# Abot [![Build Status](https://ci.appveyor.com/api/projects/status/b1ukruawvu6uujn0?svg=true)](https://ci.appveyor.com/project/sjdirect/abot)
*Please star this project!!* Contact me with exciting opportunities!!
###### C# web crawler built for speed and flexibility.
Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot targets .NET version 4.0.
###### What's So Great About It?
* Open Source (Free for commercial and personal use)
* It's fast!!
* Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
* Heavily unit tested (High code coverage)
* Very lightweight (not over-engineered)
* No out-of-process dependencies (database, installed services, etc.)
* Runs on Mono
###### Links of Interest
* [Ask a question](http://groups.google.com/group/abot-web-crawler)
* [Report a bug or suggest a feature](https://github.com/sjdirect/abot/issues)
* [Learn how you can contribute](https://github.com/sjdirect/abot/wiki/Contribute)
* [Need expert Abot customization?](https://github.com/sjdirect/abot/wiki/Custom-Development)
* [Take the usage survey](https://www.surveymonkey.com/s/JS5826F) to help prioritize features/improvements
* [Consider making a donation](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=G6ZY6BZNBFVQJ)
* [Unofficial Chinese Documentation](https://github.com/zixiliuyue/abot)
###### Use [AbotX](http://abotx.org) for powerful extensions/wrappers
* [Crawl multiple sites concurrently](http://abotx.org/Learn/ParallelCrawlerEngine)
* [Execute/Render Javascript](http://abotx.org/Learn/JavascriptRendering)
* [Avoid getting blocked by sites](http://abotx.org/Learn/AutoThrottling)
* [Auto Tuning](http://abotx.org/Learn/AutoTuning)
* [Auto Throttling](http://abotx.org/Learn/AutoThrottling)
* [Pause/Resume live crawls](http://abotx.org/Learn/CrawlerX#crawlerx-pause-resume)
* [Simplified pluggability/extensibility](https://abotx.org/Learn/CrawlerX#easy-override)
<br /><br />
<hr />
## Quick Start
###### Installing Abot
* Install Abot using [NuGet](https://www.nuget.org/packages/Abot/) (see the Package Manager Console command below)
* If you prefer to build from source yourself, see the [Working With The Source Code section](#working-with-the-source-code) below
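For example, from the Visual Studio Package Manager Console:
```
PM> Install-Package Abot
```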
###### Using Abot
1: Add the following using statements to the host class...
```c#
using System;
using System.Net; //needed for HttpStatusCode in the event handlers below
using Abot.Crawler;
using Abot.Poco;
```
2: Configure Abot using any of the options below. You can see what effect each config value has on the crawl by looking at the [code comments](https://github.com/sjdirect/abot/blob/master/Abot/Poco/CrawlConfiguration.cs).

**Option 1:** Add the following to the app.config or web.config file of the assembly using the library. NuGet will NOT add this for you. *NOTE: The gcServer or gcConcurrent entry may help memory usage in your specific use of Abot.*
```xml
<configuration>
  <configSections>
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot"/>
  </configSections>
  <runtime>
    <!-- Experiment with these to see if it helps your memory usage, USE ONLY ONE OF THE FOLLOWING -->
    <!--<gcServer enabled="true"/>-->
    <!--<gcConcurrent enabled="true"/>-->
  </runtime>
  <abot>
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      maxPagesToCrawlPerDomain="0"
      maxPageSizeInBytes="0"
      userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
      crawlTimeoutSeconds="0"
      downloadableContentTypes="text/html, text/plain"
      isUriRecrawlingEnabled="false"
      isExternalPageCrawlingEnabled="false"
      isExternalPageLinksCrawlingEnabled="false"
      httpServicePointConnectionLimit="200"
      httpRequestTimeoutInSeconds="15"
      httpRequestMaxAutoRedirects="7"
      isHttpRequestAutoRedirectsEnabled="true"
      isHttpRequestAutomaticDecompressionEnabled="false"
      isSendingCookiesEnabled="false"
      isSslCertificateValidationEnabled="false"
      isRespectUrlNamedAnchorOrHashbangEnabled="false"
      minAvailableMemoryRequiredInMb="0"
      maxMemoryUsageInMb="0"
      maxMemoryUsageCacheTimeInSeconds="0"
      maxCrawlDepth="1000"
      maxLinksPerPage="1000"
      isForcedLinkParsingEnabled="false"
      maxRetryCount="0"
      minRetryDelayInMilliseconds="0"
      />
    <authorization
      isAlwaysLogin="false"
      loginUser=""
      loginPassword="" />
    <politeness
      isRespectRobotsDotTextEnabled="false"
      isRespectMetaRobotsNoFollowEnabled="false"
      isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
      isRespectAnchorRelNoFollowEnabled="false"
      isIgnoreRobotsDotTextIfRootDisallowedEnabled="false"
      robotsDotTextUserAgentString="abot"
      maxRobotsDotTextCrawlDelayInSeconds="5"
      minCrawlDelayPerDomainMilliSeconds="0"/>
    <extensionValues>
      <add key="key1" value="value1"/>
      <add key="key2" value="value2"/>
    </extensionValues>
  </abot>
</configuration>
```
**Option 2:** Create an instance of the Abot.Poco.CrawlConfiguration class manually. This approach ignores the app.config values completely.
```c#
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
//etc...
```
**Option 3:** Both!! Load from the app.config, then tweak
```c#
CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();
crawlConfig.MaxConcurrentThreads = 5; //this overrides the app.config value
//etc...
```
3: Create an instance of Abot.Crawler.PoliteWebCrawler
```c#
//Will use app.config for configuration
PoliteWebCrawler crawler = new PoliteWebCrawler();
```
```c#
//Will use the manually created crawlConfig object created above
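//Each null parameter tells Abot to fall back to its default implementation of that dependency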
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null, null);
```
4: Register for events and create processing methods (both synchronous and asynchronous versions are available)
```c#
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed;
```
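The `Async` events above hand each handler off to a background thread so the crawl itself is not blocked. Synchronous versions are also available; the names below assume the same naming pattern minus the `Async` suffix:
```c#
//Synchronous versions of the same events (handlers run on, and block, the crawl thread)
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
```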
```c#
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which was found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);

    var htmlAgilityPackDocument = crawledPage.HtmlDocument; //Html Agility Pack parser
    var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}
```