# Abot [![Build Status](https://ci.appveyor.com/api/projects/status/b1ukruawvu6uujn0?svg=true)](https://ci.appveyor.com/project/sjdirect/abot)
*Please star this project!!* Contact me with exciting opportunities!!
###### C# web crawler built for speed and flexibility.
Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot targets .NET version 4.0.
###### What's So Great About It?
* Open Source (Free for commercial and personal use)
* It's fast!!
* Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
* Heavily unit tested (High code coverage)
* Very lightweight (not over-engineered)
* No out-of-process dependencies (database, installed services, etc.)
* Runs on Mono
###### Links of Interest
* [Ask a question](http://groups.google.com/group/abot-web-crawler)
* [Report a bug or suggest a feature](https://github.com/sjdirect/abot/issues)
* [Learn how you can contribute](https://github.com/sjdirect/abot/wiki/Contribute)
* [Need expert Abot customization?](https://github.com/sjdirect/abot/wiki/Custom-Development)
* [Take the usage survey](https://www.surveymonkey.com/s/JS5826F) to help prioritize features/improvements
* [Consider making a donation](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=G6ZY6BZNBFVQJ)
* [Unofficial Chinese Documentation](https://github.com/zixiliuyue/abot)
###### Use [AbotX](http://abotx.org) for powerful extensions/wrappers
* [Crawl multiple sites concurrently](http://abotx.org/Learn/ParallelCrawlerEngine)
* [Execute/Render Javascript](http://abotx.org/Learn/JavascriptRendering)
* [Avoid getting blocked by sites](http://abotx.org/Learn/AutoThrottling)
* [Auto Tuning](http://abotx.org/Learn/AutoTuning)
* [Auto Throttling](http://abotx.org/Learn/AutoThrottling)
* [Pause/Resume live crawls](http://abotx.org/Learn/CrawlerX#crawlerx-pause-resume)
* [Simplified pluggability/extensibility](https://abotx.org/Learn/CrawlerX#easy-override)
<br /><br />
<hr />
## Quick Start
###### Installing Abot
* Install Abot using [NuGet](https://www.nuget.org/packages/Abot/) (see the Package Manager Console command below)
* If you prefer to build from source yourself, see the [Working With The Source Code section](#working-with-the-source-code) below
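For example, from the Visual Studio Package Manager Console:
```
PM> Install-Package Abot
```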
###### Using Abot
1: Add the following using statements to the host class...
```c#
using System;
using System.Net; //needed for HttpStatusCode in the event handlers below
using Abot.Crawler;
using Abot.Poco;
```
2: Configure Abot using any of the options below. You can see what effect each config value has on the crawl by looking at the [code comments](https://github.com/sjdirect/abot/blob/master/Abot/Poco/CrawlConfiguration.cs).

**Option 1:** Add the following to the app.config or web.config file of the assembly using the library. NuGet will NOT add this for you. *NOTE: The gcServer or gcConcurrent entry may help memory usage in your specific use of Abot.*
```xml
<configuration>
  <configSections>
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot"/>
  </configSections>
  <runtime>
    <!-- Experiment with these to see if it helps your memory usage, USE ONLY ONE OF THE FOLLOWING -->
    <!--<gcServer enabled="true"/>-->
    <!--<gcConcurrent enabled="true"/>-->
  </runtime>
  <abot>
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      maxPagesToCrawlPerDomain="0"
      maxPageSizeInBytes="0"
      userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
      crawlTimeoutSeconds="0"
      downloadableContentTypes="text/html, text/plain"
      isUriRecrawlingEnabled="false"
      isExternalPageCrawlingEnabled="false"
      isExternalPageLinksCrawlingEnabled="false"
      httpServicePointConnectionLimit="200"
      httpRequestTimeoutInSeconds="15"
      httpRequestMaxAutoRedirects="7"
      isHttpRequestAutoRedirectsEnabled="true"
      isHttpRequestAutomaticDecompressionEnabled="false"
      isSendingCookiesEnabled="false"
      isSslCertificateValidationEnabled="false"
      isRespectUrlNamedAnchorOrHashbangEnabled="false"
      minAvailableMemoryRequiredInMb="0"
      maxMemoryUsageInMb="0"
      maxMemoryUsageCacheTimeInSeconds="0"
      maxCrawlDepth="1000"
      maxLinksPerPage="1000"
      isForcedLinkParsingEnabled="false"
      maxRetryCount="0"
      minRetryDelayInMilliseconds="0"
      />
    <authorization
      isAlwaysLogin="false"
      loginUser=""
      loginPassword="" />
    <politeness
      isRespectRobotsDotTextEnabled="false"
      isRespectMetaRobotsNoFollowEnabled="false"
      isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
      isRespectAnchorRelNoFollowEnabled="false"
      isIgnoreRobotsDotTextIfRootDisallowedEnabled="false"
      robotsDotTextUserAgentString="abot"
      maxRobotsDotTextCrawlDelayInSeconds="5"
      minCrawlDelayPerDomainMilliSeconds="0"/>
    <extensionValues>
      <add key="key1" value="value1"/>
      <add key="key2" value="value2"/>
    </extensionValues>
  </abot>
</configuration>
```
**Option 2:** Create an instance of the Abot.Poco.CrawlConfiguration class manually. This approach ignores the app.config values completely.
```c#
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
//etc...
```
**Option 3:** Both!! Load from the app.config, then tweak
```c#
CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();
crawlConfig.MaxConcurrentThreads = 5; //this overrides the app.config value
//etc...
```
3: Create an instance of Abot.Crawler.PoliteWebCrawler
```c#
//Will use app.config for configuration
PoliteWebCrawler crawler = new PoliteWebCrawler();
```
```c#
//Will use the manually created crawlConfig object created above
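//Each null parameter tells Abot to fall back to its default implementation of that dependency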
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null, null);
```
4: Register for events and create processing methods (both synchronous and asynchronous versions are available)
```c#
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed;
```
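The `Async` events above hand each handler off to a background thread so the crawl itself is not blocked. Synchronous versions are also available; the names below assume the same naming pattern minus the `Async` suffix:
```c#
//Synchronous versions of the same events (handlers run on, and block, the crawl thread)
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
```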
```c#
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which was found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);

    var htmlAgilityPackDocument = crawledPage.HtmlDocument; //Html Agility Pack parser
    var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}
```