PyPI official download | AdvancedHTMLParser-7.3.2.tar.gz resource - CSDN Library

127 views · Uploaded 2022-01-08 21:24:03 · 221KB · GZ

56 files in total:

py: 31

html: 11

txt: 4

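**PyPI official download | AdvancedHTMLParser-7.3.2.tar.gz**

AdvancedHTMLParser is a Python library for parsing, querying, modifying, and formatting HTML. Python offers several HTML parsing libraries, such as BeautifulSoup and lxml; AdvancedHTMLParser distinguishes itself by exposing an interface modeled on the browser DOM, which can be an advantage when you already know the JavaScript API.

**1. Overview**

AdvancedHTMLParser is developed by Tim Savannah; the project is hosted on GitHub under the kata198 account (see the API link in the README below). It reads an HTML string or file into a modifiable DOM tree and exposes browser-style methods such as getElementById, getElementsByTagName, and appendChild, plus extras that browsers do not provide.

**2. Core features**

1. **DOM-style interface**: the parser object plays the role of "document" in a browser. Query results are returned as TagCollection objects, and changes made to them are reflected in the parent DOM.
2. **Custom matching**: getElementsCustomFilter accepts a function or lambda that receives a tag and returns True to match it, so arbitrary search rules can be expressed in plain Python.
3. **Performance**: the README describes automated test suites built on the library as quicker and more reliable than in-browser testing. This page provides no benchmarks against other parsers, so treat any speed comparison as unverified.
4. **Handling of invalid HTML**: the archive ships a Validator module and tests dedicated to invalid HTML (test_InvalidHtml.py, test_ValidateInvalidHtml.py).
5. **Small footprint**: the source archive is 221KB and the package is pure Python.

**3. Installation and usage**

Install AdvancedHTMLParser from PyPI with pip:

```bash
pip install AdvancedHTMLParser
```

Then import the library and parse a document:

```python
import AdvancedHTMLParser

html_string = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<p>Hello, World!</p>
</body>
</html>
"""

# Parse the string into a modifiable DOM tree
parser = AdvancedHTMLParser.AdvancedHTMLParser()
parser.parseStr(html_string)

# Query the tree with browser-style getters
for tag in parser.getElementsByTagName('p'):
    print(tag.getHTML())
```

**4. Advanced usage**

Beyond simple lookups, AdvancedHTMLParser supports attribute-based searches (getElementsByAttr, getElementsWithAttrValues), peer lookups on a tag (getPeersByAttr, getPeersByClassName), in-place modification (appendChild, insertBefore, removeChild), and pretty-printing through getFormattedHTML.

**5. Summary**

AdvancedHTMLParser suits projects that want to read and modify HTML through a familiar, JavaScript-like DOM interface: scraping, testing and validation, HTML generation, and formatting. Whether it fits better than BeautifulSoup or lxml depends on the project; its main appeal is the browser-style API rather than any measured performance advantage.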

AdvancedHTMLParser-7.3.2.tar.gz (56 files)

AdvancedHTMLParser-7.3.2

setup.cfg 59B

README.md 20KB

README.rst 21KB

indexed_example.py 2KB

tests

AdvancedHTMLParserTests

test_Compare.py 7KB

test_Pickle.py 12KB

test_ParseMethods.py 2KB

test_CustomFilter.py 4KB

test_Insertions.py 13KB

test_InvalidHtml.py 2KB

test_Style.py 15KB

test_ValidateInvalidHtml.py 3KB

test_Blocks.py 6KB

test_RefTag.py 2KB

test_untaggedText.py 5KB

test_TagClass.py 16KB

test_Building.py 11KB

__init__.py 0B

test_General.py 9KB

test_ParserGetters.py 2KB

test_Attributes.py 19KB

test_Children.py 3KB

runTests.py 12KB

example.py 2KB

LICENSE 7KB

PKG-INFO 26KB

ChangeLog 21KB

AdvancedHTMLParser.egg-info

dependency_links.txt 1B

PKG-INFO 26KB

SOURCES.txt 2KB

top_level.txt 19B

requires.txt 14B

MANIFEST.in 340B

setup.py 5KB

doc

AdvancedHTMLParser.constants.html 3KB

AdvancedHTMLParser.Parser.html 76KB

AdvancedHTMLParser.Validator.html 46KB

AdvancedHTMLParser.utils.html 5KB

AdvancedHTMLParser.Tags.html 116KB

AdvancedHTMLParser.html 253KB

AdvancedHTMLParser.SpecialAttributes.html 29KB

index.html 253KB

exceptions.html 191KB

AdvancedHTMLParser.exceptions.html 20KB

AdvancedHTMLParser.Formatter.html 16KB

AdvancedHTMLParser

Tags.py 87KB

Formatter.py 10KB

.vimrc 32B

exceptions.py 2KB

constants.py 4KB

Validator.py 2KB

utils.py 3KB

SpecialAttributes.py 24KB

__init__.py 1KB

Parser.py 47KB

formatHTML 3KB

AdvancedHTMLParser
==================

AdvancedHTMLParser is an Advanced HTML Parser, with support for adding, removing, modifying, and formatting HTML.

It aims to provide the same interface as you would find in a compliant browser through JavaScript (i.e. all the getElement methods, appendChild, etc.), as well as many more complex and sophisticated features not available through a browser. And most importantly, it's in Python!

There are many potential applications, not limited to:

* Webpage Scraping / Data Extraction
* Testing and Validation
* HTML Modification/Insertion
* Outputting your website
* Debugging
* HTML Document generation
* Web Crawling
* Formatting HTML documents or web pages

It is especially good for servlets/webpages. It is quick to take an expertly crafted page in raw HTML / CSS, have your servlets ingest it with AdvancedHTMLParser, and create/insert data elements into the existing view using a simple and well-known interface (JavaScript-like + HTML DOM).

Another useful scenario is creating automated testing suites, which can operate much more quickly and reliably (and at a deeper function level) than in-browser testing suites.

Full API
--------

Can be found at http://htmlpreview.github.io/?https://github.com/kata198/AdvancedHTMLParser/blob/master/doc/AdvancedHTMLParser.html .

Examples
--------

Various examples can be found in the "tests" directory. A very old, simple example can also be found as "example.py" in the root directory.

Short Doc
---------

**AdvancedHTMLParser**

Think of this like "document" in a browser.

The AdvancedHTMLParser can read in a file (or string) of HTML, and will create a modifiable DOM tree from it. It can also be constructed manually from AdvancedHTMLParser.AdvancedTag objects.
To populate an AdvancedHTMLParser from existing HTML:

    parser = AdvancedHTMLParser.AdvancedHTMLParser()

    # Parse an HTML string into the document
    parser.parseStr(htmlStr)

    # Parse an HTML file into the document
    parser.parseFile(filename)

The parser then exposes many "standard" functions as you'd find on the web for accessing the data, and some others:

* getElementsByTagName - Returns a list of all elements matching a tag name
* getElementsByName - Returns a list of all elements with a given name attribute
* getElementById - Returns a single AdvancedTag matching the provided ID, or None if no element is found
* getElementsByClassName - Returns a list of all elements containing a class name
* getElementsByAttr - Returns a list of all elements matching a particular attribute/value pair
* getElementsWithAttrValues - Returns a list of all elements with a specific attribute name containing one of a list of values
* getElementsCustomFilter - Provide a function/lambda that takes a tag argument and returns True to "match" it. Returns all matched objects
* getHTML - Returns a string of HTML representing this DOM
* getRootNodes - Get a list of nodes at root level (0)
* getAllNodes - Get all the nodes contained within this document
* getFormattedHTML - Returns a formatted string (using AdvancedHTMLFormatter; see below) of the HTML. Takes as argument an indent (defaults to two spaces)

The results of all of these getElements\* functions are TagCollection objects (getElementById returns a single AdvancedTag or None). These objects can be modified, and the changes will be reflected in the parent DOM.

The parser also contains some expected properties, like:

* head - The "head" tag associated with this document, or None
* body - The "body" tag associated with this document, or None
* forms - All "forms" on this document as a TagCollection

**General Attributes**

In general, attributes can be accessed with dot-syntax, i.e.

    tagEm.id = "Hello"

will set the "id" attribute. If it works in HTML JavaScript on a tag element, it should work on an AdvancedTag element with Python.
setAttribute, getAttribute, and removeAttribute are more explicit and recommended ways of getting/setting/deleting attributes on elements.

The same names are used in Python as in the JavaScript DOM, such as 'className' corresponding to a space-separated string of the 'class' attribute, 'classList' corresponding to a list of classes, etc.

**Style Attribute**

Style attributes can be manipulated just like in JavaScript, so element.style.position = 'relative' for setting, or element.style.position for access.

You can also assign the tag.style as a string, like:

    myTag.style = "display: block; float: right; font-weight: bold"

in addition to individual properties:

    myTag.style.display = 'block'
    myTag.style.float = 'right'
    myTag.style.fontWeight = 'bold'

You can remove a style property by setting its value to an empty string. For example, to clear the "display" property:

    myTag.style.display = ''

A standard method *setProperty* can also be used to set or remove individual properties. For example:

    myTag.style.setProperty("display", "block") # Set display: block
    myTag.style.setProperty("display", '') # Clear display: property

The naming conventions are the same as in JavaScript, like "element.style.paddingTop" for the "padding-top" property.

**TagCollection**

A TagCollection can be used like a list. It also exposes the various getElement\* functions, which operate on the elements within the list (and their children).

To operate just on the items in the list, you can use filterCollection, which takes a lambda/function that returns True for each tag to retain in the result.

**AdvancedTag**

The AdvancedTag represents a single tag and its inner text. It exposes many of the functions and properties you would expect to be present if using JavaScript.

Each AdvancedTag also supports the same getElementsBy\* functions as the parser. It adds several additional ones that are not found in JavaScript, such as peers and arbitrary attribute searching.
Some of these include:

* appendText - Append text to this element
* appendChild - Append a child to this element
* removeChild - Removes a child
* removeText - Removes the first occurrence of some text from any text nodes
* removeTextAll - Removes ALL occurrences of some text from any text nodes
* insertBefore - Inserts a child before an existing child
* insertAfter - Inserts a child after an existing child
* getChildren - Returns the children as a list
* getStartTag - Start tag, with attributes
* getEndTag - End tag
* getPeersByName - Gets "peers" (elements with the same parent, at the same level in the tree) with a given name
* getPeersByAttr - Gets peers by an arbitrary attribute/value combination
* getPeersWithAttrValues - Gets peers by an arbitrary attribute/values combination
* getPeersByClassName - Gets peers that contain a given class name
* getElement\* - Same as above, but act on the children of this element
* getHTML / toHTML / asHTML - Get the HTML representation using this node as a root (so start tag and attributes, innerText (text and child nodes), and end tag)
* firstChild - Get the first child of this node, be it text or an element (AdvancedTag)
* firstElementChild - Get the first child of this node that is an element
* lastChild - Get the last child of this node, be it text or an element (AdvancedTag)
* lastElementChild - Get the last child of this node that is an element
* nextSibling - Get the next sibling, be it text or an element
* nextElementSibling - Get the next sibling that is an element
* previousSibling - Get the previous sibling, be it text or an element
* previousElementSibling - Get the previous sibling that is an element
* {get,set,has,remove}Attribute - get/set/test/remove an attribute
* {add,remove}Class - Add/remove a class from the list of classes
* setStyle - Set a specific style property [like: s
