AdvancedHTMLParser
==================
AdvancedHTMLParser is an Advanced HTML Parser, with support for adding, removing, modifying, and formatting HTML.
It aims to provide the same interface you would find in a compliant browser through javascript (i.e. all the getElement methods, appendChild, etc.), as well as many more complex and sophisticated features not available through a browser. And most importantly, it's in python!
There are many potential applications, including but not limited to:
* Webpage Scraping / Data Extraction
* Testing and Validation
* HTML Modification/Insertion
* Outputting your website
* Debugging
* HTML Document generation
* Web Crawling
* Formatting HTML documents or web pages
It is especially well suited to servlets/webpages. You can quickly take an expertly crafted page in raw HTML/CSS, have your servlet ingest it with AdvancedHTMLParser, and create/insert data elements into the existing view using a simple and well-known interface (javascript-like + HTML DOM).
Another useful scenario is creating automated testing suites, which can operate much more quickly and reliably (and at a deeper function level) than in-browser testing suites.
Full API
--------
The full API documentation can be found at http://htmlpreview.github.io/?https://github.com/kata198/AdvancedHTMLParser/blob/master/doc/AdvancedHTMLParser.html .
Examples
--------
Various examples can be found in the "tests" directory. A very old, simple example can also be found as "example.py" in the root directory.
Short Doc
---------
**AdvancedHTMLParser**
Think of this like "document" in a browser.
The AdvancedHTMLParser can read in a file (or string) of HTML, and will create a modifiable DOM tree from it. It can also be constructed manually from AdvancedHTMLParser.AdvancedTag objects.
To populate an AdvancedHTMLParser from existing HTML:
    import AdvancedHTMLParser
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    # Parse an HTML string into the document
    parser.parseStr(htmlStr)
    # Or parse an HTML file into the document
    parser.parseFile(filename)
The parser then exposes many "standard" functions as you'd find on the web for accessing the data, and some others:
getElementsByTagName - Returns a list of all elements matching a tag name
getElementsByName - Returns a list of all elements with a given name attribute
getElementById - Returns the single AdvancedTag matching the provided ID, or None if no element matches
getElementsByClassName - Returns a list of all elements containing a class name
getElementsByAttr - Returns a list of all elements matching a particular attribute/value pair
getElementsWithAttrValues - Returns a list of all elements with a specific attribute name containing one of a list of values
getElementsCustomFilter - Provide a function/lambda that takes a tag argument, and returns True to "match" it. Returns all matched objects
getHTML - Returns string of HTML representing this DOM
getRootNodes - Get a list of nodes at root level (0)
getAllNodes - Get all the nodes contained within this document
getFormattedHTML - Returns a formatted string (using AdvancedHTMLFormatter; see below) of the HTML. Takes as argument an indent (defaults to two spaces)
The results of the getElements\* functions are TagCollection objects (getElementById instead returns a single AdvancedTag or None). These objects can be modified, and the changes will be reflected in the parent DOM.
The parser also contains some expected properties, such as:
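As a rough illustration of the lookup functions listed above, the sketch below parses a small, made-up document and runs a few queries. The sample markup and the helper names (`SAMPLE_HTML`, `has_link`, `run_demo`) are assumptions for this example only; it also assumes the AdvancedHTMLParser package is installed, and skips the demo if it is not.

```python
# Assumes the package is installed (e.g. pip install AdvancedHTMLParser).
# If it is missing, the demo below is skipped rather than failing.
try:
    import AdvancedHTMLParser
except ImportError:
    AdvancedHTMLParser = None

# Hypothetical sample markup, made up for this illustration.
SAMPLE_HTML = """<html><head><title>Demo</title></head>
<body>
  <div id="main" class="box">
    <span class="item" name="first">One</span>
    <span class="item" name="second">Two</span>
    <a href="/next">Next</a>
  </div>
</body></html>"""


def has_link(tag):
    """Custom filter: match <a> tags that carry an href attribute."""
    return tag.tagName == 'a' and tag.hasAttribute('href')


def run_demo():
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr(SAMPLE_HTML)

    main = parser.getElementById('main')             # single tag or None
    items = parser.getElementsByClassName('item')    # TagCollection
    links = parser.getElementsCustomFilter(has_link) # TagCollection

    return {
        'main_tag': main.tagName,
        'item_count': len(items),
        'link_count': len(links),
    }


result = run_demo() if AdvancedHTMLParser is not None else None
print(result)
```

Because a TagCollection can be used like a list, `len()` and iteration work on the results directly.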
head - The "head" tag associated with this document, or None
body - The "body" tag associated with this document, or None
forms - All "forms" on this document as a TagCollection
**General Attributes**
In general, attributes can be accessed with dot-syntax, i.e.
    tagEm.id = "Hello"
will set the "id" attribute. If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.
setAttribute, getAttribute, and removeAttribute are more explicit and recommended ways of getting/setting/deleting attributes on elements.
The same names are used in python as in the javascript/DOM, such as 'className' corresponding to a space-separated string of the 'class' attribute, 'classList' corresponding to a list of classes, etc.
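To make the relationship between the 'class' attribute, 'className', and 'classList' concrete, here is a minimal standalone sketch. `TagSketch` is a hypothetical stand-in written for this example; it is not the library's AdvancedTag and only models the naming described above.

```python
class TagSketch:
    """Hypothetical stand-in for a tag; not the library's AdvancedTag."""

    def __init__(self):
        self.attributes = {}

    @property
    def className(self):
        # The raw, space-separated value of the "class" attribute.
        return self.attributes.get('class', '')

    @className.setter
    def className(self, value):
        self.attributes['class'] = value

    @property
    def classList(self):
        # The same data, split into individual class names.
        return self.className.split()


tag = TagSketch()
tag.className = 'card highlighted'
class_names = tag.classList
print(class_names)
```

Both names are views of the same underlying attribute, so setting one changes what the other reports.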
**Style Attribute**
Style attributes can be manipulated just like in javascript, so element.style.position = 'relative' for setting, or element.style.position for access.
You can also assign the tag.style as a string, like:
    myTag.style = "display: block; float: right; font-weight: bold"
in addition to individual properties:
    myTag.style.display = 'block'
    myTag.style.float = 'right'
    myTag.style.fontWeight = 'bold'
You can remove a style property by setting its value to an empty string.
For example, to clear the "display" property:
    myTag.style.display = ''
The standard method *setProperty* can also be used to set or remove individual properties.
For example:
    myTag.style.setProperty("display", "block") # Set display: block
    myTag.style.setProperty("display", '') # Clear display: property
The naming conventions are the same as in javascript, like "element.style.paddingTop" for the "padding-top" property.
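The naming convention and the "empty string removes the property" rule can be sketched in plain python. The helpers below (`camel_to_dash`, `dash_to_camel`, `parse_style`, `set_property`, `to_style_string`) are hypothetical names written for this illustration; they show the idea only and are not how the library implements it.

```python
import re


def camel_to_dash(name):
    """Convert a javascript-style name ('paddingTop') to CSS ('padding-top')."""
    return re.sub(r'([A-Z])', lambda m: '-' + m.group(1).lower(), name)


def dash_to_camel(name):
    """Convert a CSS name ('padding-top') to javascript style ('paddingTop')."""
    head, *rest = name.split('-')
    return head + ''.join(part.capitalize() for part in rest)


def parse_style(style_str):
    """Split 'a: b; c: d' into a dict keyed by dashed property names."""
    props = {}
    for decl in style_str.split(';'):
        if ':' not in decl:
            continue  # skip empty or malformed declarations
        key, value = decl.split(':', 1)
        props[key.strip().lower()] = value.strip()
    return props


def set_property(props, name, value):
    """Set a property; an empty value removes it instead."""
    key = camel_to_dash(name)
    if value == '':
        props.pop(key, None)
    else:
        props[key] = value


def to_style_string(props):
    """Serialize the dict back into a style attribute string."""
    return '; '.join('%s: %s' % (k, v) for k, v in props.items())


props = parse_style("display: block; float: right; font-weight: bold")
set_property(props, 'display', '')         # clear display
set_property(props, 'paddingTop', '4px')   # stored as padding-top
style_text = to_style_string(props)
print(style_text)
```

Storing properties under their dashed names keeps the serialized string valid CSS, while the camel-case form is only a convenience for attribute-style access.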
**TagCollection**
A TagCollection can be used like a list.
It also exposes the various getElement\* functions which operate on the elements within the list (and their children).
To operate just on items in the list, you can use filterCollection, which takes a lambda/function that returns True for each tag to retain in the result.
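The difference between filtering only the listed items and searching the items together with their children can be shown with a small standalone sketch. `Node`, `filter_items`, and `search_with_children` are hypothetical names for this example and do not come from the library; they only model the two behaviours described above.

```python
class Node:
    """Hypothetical stand-in for a tag: a name, class names, and children."""

    def __init__(self, tag_name, classes=(), children=()):
        self.tag_name = tag_name
        self.classes = list(classes)
        self.children = list(children)


def filter_items(nodes, predicate):
    """Keep only the listed nodes that match; children are not visited."""
    return [node for node in nodes if predicate(node)]


def search_with_children(nodes, predicate):
    """Match the listed nodes and all of their descendants, depth-first."""
    matches = []
    for node in nodes:
        if predicate(node):
            matches.append(node)
        matches.extend(search_with_children(node.children, predicate))
    return matches


def is_hot(node):
    return 'hot' in node.classes


inner = Node('span', classes=['hot'])
outer = Node('div', classes=['hot'], children=[inner])
plain = Node('div', children=[Node('span', classes=['hot'])])
collection = [outer, plain]

shallow = filter_items(collection, is_hot)        # only items in the list
deep = search_with_children(collection, is_hot)   # items plus descendants
print(len(shallow), len(deep))
```

Here the shallow filter keeps only `outer`, while the deep search also finds the two nested spans.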
**AdvancedTag**
The AdvancedTag represents a single tag and its inner text. It exposes many of the functions and properties you would expect to be present if using javascript.
Each AdvancedTag also supports the same getElementsBy\* functions as the parser.
It adds several additional functions that are not found in javascript, such as peers and arbitrary attribute searching.
Some of these include:
appendText - Append text to this element
appendChild - Append a child to this element
removeChild - Removes a child
removeText - Removes the first occurrence of some text from any text nodes
removeTextAll - Removes ALL occurrences of some text from any text nodes
insertBefore - Inserts a child before an existing child
insertAfter - Inserts a child after an existing child
getChildren - Returns the children as a list
getStartTag - Start Tag, with attributes
getEndTag - End Tag
getPeersByName - Gets "peers" (elements with same parent, at same level in tree) with a given name
getPeersByAttr - Gets peers by an arbitrary attribute/value combination
getPeersWithAttrValues - Gets peers by an arbitrary attribute/values combination.
getPeersByClassName - Gets peers that contain a given class name
getElement\* - Same as above, but act on the children of this element.
getHTML / toHTML / asHTML - Get the HTML representation using this node as a root (so start tag and attributes, innerText (text and child nodes), and end tag)
firstChild - Get the first child of this node, be it text or an element (AdvancedTag)
firstElementChild - Get the first child of this node that is an element
lastChild - Get the last child of this node, be it text or an element (AdvancedTag)
lastElementChild - Get the last child of this node that is an element
nextSibling - Get next sibling, be it text or an element
nextElementSibling - Get next sibling, that is an element
previousSibling - Get previous sibling, be it text or an element
previousElementSibling - Get previous sibling, that is an element
{get,set,has,remove}Attribute - get/set/test/remove an attribute
{add,remove}Class - Add/remove a class from the list of classes
setStyle - Set a specific style property [like: s