XML
Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form.
It is defined in the XML 1.0 Specification[4] produced by the W3C and in several other related
specifications, all of them open standards available free of charge.[5]
XML's design goals emphasize simplicity, generality, and usability over the Internet.[6] It is a textual
data format with strong support via Unicode for the languages of the world. Although the design of
XML focuses on documents, it is widely used for the representation of arbitrary data structures, for
example in web services.
Many application programming interfaces (APIs) have been developed that software developers use to
process XML data, and several schema systems exist to aid in the definition of XML-based languages.
As of 2009, hundreds of XML-based languages have been developed,[7] including RSS, Atom,
SOAP, and XHTML. XML-based formats have become the default for most office-productivity tools,
including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's
iWork.[8]
Key terminology
The material in this section is based on the XML Specification. It is not an exhaustive list of the
constructs that appear in XML; rather, it introduces the key constructs most often
encountered in day-to-day use.
(Unicode) Character
By definition, an XML document is a string of characters. Almost every legal Unicode character
may appear in an XML document.
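
The phrase "almost every" can be made concrete with a short check. The sketch below is in Python and assumes the character ranges of the Char production in XML 1.0; the function names is_xml_char and is_xml_text are hypothetical and chosen only for this illustration.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch falls in the ranges assumed here for XML 1.0 Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the Basic Multilingual Plane
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def is_xml_text(text: str) -> bool:
    """Return True if every character of text passes is_xml_char."""
    return all(is_xml_char(ch) for ch in text)
```

Under these assumptions, ordinary text in any script passes the check, while most control characters, such as U+0000, do not.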
Processor and Application
The processor analyzes the markup and passes structured information to an application. The
specification places requirements on what an XML processor must and must not do, but the
application is outside its scope. The processor (as the specification calls it) is often referred to
colloquially as an XML parser.
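
The division of labor between processor and application can be sketched in Python, using the standard library's xml.etree.ElementTree module as the processor. The sample document and the application function are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
DOCUMENT = "<greeting lang='en'>Hello, world</greeting>"


def application(element: ET.Element) -> str:
    """Hypothetical application: it sees structured data, never raw markup."""
    return f"{element.tag} ({element.attrib.get('lang')}): {element.text}"


# The processor (parser) analyzes the markup and builds a structure.
root = ET.fromstring(DOCUMENT)

# The application consumes the structured information.
summary = application(root)
print(summary)  # prints: greeting (en): Hello, world
```

If the input is not well-formed, this particular processor raises ET.ParseError and passes nothing to the application.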
Markup and Content
The characters that make up an XML document are divided into markup and content. Markup
and content can be distinguished by applying simple syntactic rules. All strings that
constitute markup either begin with the character "<" and end with a ">", or begin with the
character "&" and end with a ";". Strings of characters that are not markup are content.
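
The rule stated above can be approximated with a regular expression. The following Python sketch is deliberately simplified: it assumes that no ">" occurs inside a tag (for example in an attribute value) and ignores constructs such as comments and CDATA sections. The name split_markup_and_content is hypothetical.

```python
import re

# Simplified rule: markup is either "<...>" or "&...;".
MARKUP = re.compile(r"<[^>]*>|&[^;]*;")


def split_markup_and_content(document: str):
    """Return (markup, content) as two lists of strings."""
    markup = MARKUP.findall(document)
    # Whatever lies between markup strings is content; drop empty pieces.
    content = [piece for piece in MARKUP.split(document) if piece]
    return markup, content


markup, content = split_markup_and_content("<p>Fish &amp; chips</p>")
print(markup)   # prints: ['<p>', '&amp;', '</p>']
print(content)  # prints: ['Fish ', ' chips']
```

A real XML processor applies the full grammar of the specification rather than a single pattern of this kind.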
Tag
A markup construct that begins with "<" and ends with ">". Tags come in three flavors:
start-tags, such as <section>; end-tags, such as </section>; and empty-element tags, such as
<line-break/>.
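
The three flavors can be told apart by their first and last characters. The Python sketch below assumes that its input is already one of the three kinds of tag; other constructs that begin with "<", such as comments, are not handled. The name classify_tag is hypothetical.

```python
def classify_tag(tag: str) -> str:
    """Classify a tag string as a start-tag, end-tag or empty-element tag."""
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError(f"not a tag: {tag!r}")
    if tag.startswith("</"):
        return "end-tag"
    if tag.endswith("/>"):
        return "empty-element tag"
    return "start-tag"


for example in ("<section>", "</section>", "<line-break/>"):
    print(example, "is a", classify_tag(example))
```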
Element
A logical component of a document which either begins with a start-tag and ends with a matching
end-tag, or consists only of an empty-element tag. The characters between the start- and end-tags,