@aravindkumarsvg
Last active August 19, 2025 08:14
XML Cheatsheet

XML Ecosystem: Cheatsheet & Quirks

This document provides a high-level overview of major XML technologies. It's designed for developers who need a quick refresher on core concepts, common quirks, and practical examples.

1. XML (eXtensible Markup Language)

The foundation. A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

Core Concepts

  • Well-Formed: Must follow all XML syntax rules (e.g., one root element, all tags closed, attributes quoted).
  • Valid: (Optional) Must conform to the rules of a DTD or XSD.
  • Elements: Building blocks of XML, defined by start and end tags (<tag>...</tag>).
  • Attributes: Key-value pairs inside an element's start tag (<tag attribute="value">).
  • Prolog: Optional start of the document, containing the XML declaration (<?xml version="1.0"?>) and, if present, the document type declaration (<!DOCTYPE ...>).
  • Entities: Named references for special characters (e.g., &lt; for <, &gt; for >, &amp; for &) or for custom reusable text declared in a DTD.
  • CDATA: "Character Data" section (<![CDATA[...]]>). The parser ignores markup within it, treating it as raw text. Useful for including code snippets.

Example

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <price>39.95</price>
    <!-- CDATA for code that includes reserved characters -->
    <example><![CDATA[
      if (x < 10 && y > 5) { console.log("ok"); }
    ]]></example>
  </book>
</bookstore>

Quirks & Gotchas

  • Case-Sensitive: <Tag> is different from <tag>.
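  • Whitespace is Preserved: Unlike HTML, an XML parser passes whitespace in element content to the application without collapsing it. Whether that whitespace is significant is up to the application.
  • Empty Elements: Every element must be closed. An element without content can use the self-closing form (<br/>) or an explicit end tag (<br></br>); a bare <br> is not well-formed.
  • Attribute Quoting: All attribute values must be enclosed in single or double quotes.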
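
The rules above can be checked with any conforming parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Names are case-sensitive, so </tag> does not close <Tag>.
case_mismatch = is_well_formed("<Tag>text</tag>")
# Attribute values must be quoted.
unquoted = is_well_formed("<book category=WEB/>")
# An empty element is accepted in both spellings.
empty_short = is_well_formed("<br/>")
empty_long = is_well_formed("<br></br>")

# Whitespace in element content reaches the application unchanged.
padded = ET.fromstring("<title>  Learning   XML  </title>").text

# CDATA content arrives as ordinary text, with < and & intact.
cdata = ET.fromstring("<example><![CDATA[x < 10 && y > 5]]></example>").text
```

Here case_mismatch and unquoted come out False, while both empty-element forms are accepted.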

2. Core Concepts & APIs

XML Namespaces

  • Concept: A mechanism to avoid element name conflicts when mixing XML vocabularies. Declared with an xmlns attribute, which associates a prefix with a URI.
  • Example:
<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="https://www.w3schools.com/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
  </f:table>
</root>
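  • Quirk: The URI is only an identifier. Parsers compare it as a string and never fetch it, so it does not need to resolve to a web page. The "default namespace" (xmlns="URI") is one of the most common causes of failed selections in XPath and XSLT, because unprefixed names in a path expression never pick up the default namespace.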
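
To see how the prefix-to-URI binding behaves in practice, here is a small sketch with Python's standard xml.etree.ElementTree. The prefixes html and furn in the lookup map are deliberately different from the ones in the document, to show that only the URIs have to match.

```python
import xml.etree.ElementTree as ET

doc = """
<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="https://www.w3schools.com/furniture">
  <h:table><h:tr><h:td>Apples</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name></f:table>
</root>
"""
root = ET.fromstring(doc)

# Prefixes are local to this query; the URIs are what identify the vocabulary.
ns = {
    "html": "http://www.w3.org/TR/html4/",
    "furn": "https://www.w3schools.com/furniture",
}

# An unqualified name matches nothing, because both tables are in a namespace.
unqualified = root.findall("table")
html_tables = root.findall("html:table", ns)
furniture_name = root.find("furn:table/furn:name", ns).text

# ElementTree stores every name in expanded {uri}local form.
first_tag = root[0].tag
```

The two <table> elements never collide, because internally each name is the pair of namespace URI and local name.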

Parsing: DOM vs. SAX

  • DOM (Document Object Model): Parses the entire XML into a tree in memory. Convenient for random access but memory-intensive.
  • SAX (Simple API for XML): An event-based parser that reads the XML sequentially. Fast and memory-efficient but you can't "go back".
  • Note: Code examples are language-specific (e.g., Java, Python, JS), but the concept is universal.
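
As one concrete illustration of the two models, the sketch below extracts the same titles with Python's standard xml.dom.minidom (DOM) and xml.sax (SAX). The class name TitleHandler and the sample input are illustrative assumptions.

```python
import xml.sax
from xml.dom import minidom

SOURCE = (
    "<bookstore>"
    "<book><title>Learning XML</title></book>"
    "<book><title>Everyday Italian</title></book>"
    "</bookstore>"
)

# DOM: the whole tree is built in memory and can be queried in any order.
dom = minidom.parseString(SOURCE)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]

# SAX: callbacks fire while the parser streams through the input once.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleHandler()
xml.sax.parseString(SOURCE.encode("utf-8"), handler)
sax_titles = handler.titles
```

The SAX version has to track its own state (the _in_title flag), which is the price paid for never holding the full document in memory.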

Design Choice: Attributes vs. Elements

  • Concept: A common design dilemma is whether to store data as an attribute or a child element.
  • Guideline/Quirk: Use attributes for metadata (data about the element) and elements for content data.
  • Example (Attribute): <car id="123" color="red" /> (color is metadata).
  • Example (Element): <car><id>123</id><color>red</color></car> (color is content).

3. DTD (Document Type Definition)

The original way to define the legal building blocks of an XML document.

Example (Internal DTD)

<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to      (#PCDATA)>
  <!ELEMENT from    (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body    (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

Quirks & Gotchas

  • No Data Types: Can't specify "this must be a number."
  • Not Namespace-Aware: A major limitation.
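  • Largely Superseded for Validation: XML Schema (XSD) has replaced DTDs for most validation work, but DTDs remain the only way to declare custom entities.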

4. XML Schema (XSD)

The modern, more powerful successor to DTDs for XML validation.

Example

XML File (order.xml):

<order xmlns="http://example.com/order" orderid="123">
  <item>
    <name>Laptop</name>
    <quantity>1</quantity>
  </item>
</order>

XSD File (order.xsd):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/order"
           xmlns:ord="http://example.com/order"
           elementFormDefault="qualified">

  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

Quirks & Gotchas

  • Verbosity & Complexity: XSDs can be extremely long and difficult to read.
  • Target Namespace: Managing the targetNamespace and default namespaces is a common source of validation errors.

5. XPath (XML Path Language)

A query language for selecting nodes from an XML document.

Example

XML Source:

<bookstore>
  <book category="COOKING">
    <title>Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

XPath Expressions:

  • /bookstore/book: Selects all <book> elements.
  • //title: Selects all <title> elements anywhere in the document.
  • /bookstore/book[1]: Selects the first <book> child of <bookstore>.
  • //book[@category='WEB']/title: Selects the title of the book in the "WEB" category (Learning XML).

Quirks & Gotchas

  • 1-Based Indexing: [1] selects the first item.
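  • Default Namespaces: XPath 1.0 is namespace-aware, but an unprefixed name in an expression always means "no namespace". If the document declares a default namespace (xmlns="URI"), //book matches nothing until you bind a prefix to that URI and write //p:book. This is the most common problem.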
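
The expressions above can be tried out with Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath. This sketch assumes that subset is enough; paths are relative to the element they are called on, and the namespace URI http://example.com/books is a made-up value for the default-namespace case.

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring("""
<bookstore>
  <book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="WEB"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
""")

# Called on <bookstore>, so "book" corresponds to /bookstore/book.
all_books = bookstore.findall("book")
# Equivalent of //title.
all_titles = [t.text for t in bookstore.iter("title")]
# Positions are 1-based, as in XPath.
first_book = bookstore.find("book[1]")
web_title = bookstore.find(".//book[@category='WEB']/title").text

# Default-namespace gotcha: the unprefixed path silently matches nothing.
namespaced = ET.fromstring(
    '<bookstore xmlns="http://example.com/books"><book/></bookstore>'
)
missed = namespaced.findall("book")
found = namespaced.findall("b:book", {"b": "http://example.com/books"})
```

Note that the failed lookup raises no error. It returns an empty result, which is why this problem is easy to overlook.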

6. XSLT (eXtensible Stylesheet Language Transformations)

A language for transforming XML documents into other formats.

Example

XML Source (catalog.xml):

<catalog>
  <cd>
    <title>Empire Burlesque</title>
    <artist>Bob Dylan</artist>
  </cd>
</catalog>

XSLT Stylesheet (catalog.xsl):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html><body>
      <h2>My CD Collection</h2>
      <xsl:apply-templates/>
    </body></html>
  </xsl:template>

  <xsl:template match="cd">
    <p>
      <xsl:value-of select="title"/> by <xsl:value-of select="artist"/>
    </p>
  </xsl:template>
</xsl:stylesheet>

Result (HTML):

<html><body>
  <h2>My CD Collection</h2>
  <p>Empire Burlesque by Bob Dylan</p>
</body></html>

What "Transformation" Really Means

Transformation is the core purpose of XSLT, and it's much more than just "styling". It means converting an XML document's structure, content, and format into something new. This output can be another XML file, an HTML page, a plain text file (like a CSV), or other structured formats.

Common Use Cases Beyond Styling

  • Data Integration (ETL): The most common enterprise use. XSLT acts as a bridge between systems that use different XML formats.
  • Filtering and Sorting: Creating a new, refined XML document by filtering out unwanted data or sorting it.
  • Format Conversion: Generating non-XML formats, like converting XML to CSV.
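  • Signature Transforms (XML-DSig): XML Signature lists XSLT as an optional transform that can be applied to content before it is digested. This is separate from canonicalization (C14N), which is its own algorithm and not an XSLT stylesheet. SAML in particular discourages any signature transforms other than the enveloped-signature and Exclusive C14N transforms.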

Quirks & Gotchas

  • Thinking Declaratively: The biggest hurdle is shifting from a procedural mindset.
  • Namespace Hell: You must declare any namespaces from the source XML in your XSLT stylesheet to match elements.
  • No Variables (in the traditional sense): An <xsl:variable> is bound once and cannot be reassigned, so it behaves like a constant.

7. XQuery (XML Query Language)

A language for querying data from XML documents. Think "SQL for XML".

Example

XML Source: Same as XPath example.
XQuery (FLWOR Expression):

for $book in doc("bookstore.xml")/bookstore/book
where $book/price > 35
order by $book/title
return $book/title

Result:

<title>Learning XML</title>

Quirks & Gotchas

  • Syntax: A mix of ideas from SQL, XSLT, and functional programming.
  • Implementations Vary: Different processors might have slightly different function libraries.
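
For readers more familiar with general-purpose languages, the FLWOR clauses map onto ordinary filter, sort, and project steps. The sketch below is not XQuery; it is a rough Python equivalent of the query above, using the standard xml.etree.ElementTree module and an inline copy of the bookstore data instead of doc("bookstore.xml").

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring(
    "<bookstore>"
    '<book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>'
    '<book category="WEB"><title>Learning XML</title><price>39.95</price></book>'
    "</bookstore>"
)

# for $book in ... where $book/price > 35
expensive = [
    book for book in bookstore.findall("book")
    if float(book.findtext("price")) > 35
]
# order by $book/title
expensive.sort(key=lambda book: book.findtext("title"))
# return $book/title
result = [
    ET.tostring(book.find("title"), encoding="unicode").strip()
    for book in expensive
]
```

One difference worth noting: XQuery compares $book/price > 35 with its own type-conversion rules, while the Python version has to convert the text to a number explicitly.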

8. Related Specifications

XLink

  • Concept: An attribute-based syntax for advanced, multi-directional links.
  • Example:
<myLink xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:href="http://example.com">Link</myLink>
  • Quirk: Never achieved widespread browser support.

XPointer

  • Concept: Extends XPath to link to specific points within an XML document.
  • Example:
http://example.com/doc.xml#xpointer(//book[1]/title)
  • Quirk: Very little native support in browsers.

XHTML

  • Concept: A reformulation of HTML 4 as a well-formed XML application.
  • Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>My Page</title></head>
  <body><p>Hello<br/>world!</p></body>
</html>
  • Quirk: The strictness lost out to the pragmatic flexibility of HTML5.

XForms

  • Concept: An XML-based specification for web forms that separates data, logic, and presentation.
  • Quirk: Never gained native browser support and requires JavaScript libraries.

9. Web Services & Syndication

XML-RPC

  • Concept: A simple Remote Procedure Call (RPC) protocol using XML.
  • Example Request:
<methodCall>
  <methodName>examples.getStateName</methodName>
  <params><param><value><i4>41</i4></value></param></params>
</methodCall>
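
Python's standard xmlrpc.client module can produce and read this wire format without a network connection, which makes the encoding easy to inspect. A minimal sketch follows; note that the library writes <int> where the example above uses <i4>, and XML-RPC treats the two as the same type.

```python
import xmlrpc.client

# Serialize a call to examples.getStateName(41) into request XML.
request_xml = xmlrpc.client.dumps((41,), methodname="examples.getStateName")

# Parse it back into (params, method name).
params, method = xmlrpc.client.loads(request_xml)

# The hand-written request from the example decodes the same way.
doc_request = (
    "<methodCall>"
    "<methodName>examples.getStateName</methodName>"
    "<params><param><value><i4>41</i4></value></param></params>"
    "</methodCall>"
)
doc_params, doc_method = xmlrpc.client.loads(doc_request)
```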

SOAP

  • Concept: A protocol for exchanging structured information in web services.
  • Example Envelope:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Header></soap:Header>
  <soap:Body>
    <m:GetStockPrice xmlns:m="http://www.example.org/stock">
      <m:StockName>IBM</m:StockName>
    </m:GetStockPrice>
  </soap:Body>
</soap:Envelope>
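  • Quirk: SOAP originally stood for "Simple Object Access Protocol". The "Simple" is a famous misnomer given how verbose the format is, and SOAP 1.2 dropped the expansion altogether.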
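
Because every element in the envelope is namespace-qualified, reading a SOAP message by hand means carrying a prefix map around. The sketch below pulls the stock name out of the envelope above with Python's standard xml.etree.ElementTree. Real services would normally use a SOAP library; this only illustrates the structure.

```python
import xml.etree.ElementTree as ET

envelope_xml = """
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Header></soap:Header>
  <soap:Body>
    <m:GetStockPrice xmlns:m="http://www.example.org/stock">
      <m:StockName>IBM</m:StockName>
    </m:GetStockPrice>
  </soap:Body>
</soap:Envelope>
"""

ns = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "m": "http://www.example.org/stock",
}

envelope = ET.fromstring(envelope_xml)
# The payload is the first (and here only) child of the Body.
body = envelope.find("soap:Body", ns)
operation = body[0].tag
stock_name = envelope.findtext(
    "soap:Body/m:GetStockPrice/m:StockName", namespaces=ns
)
```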

WSDL

  • Concept: An XML-based language used to describe the capabilities of a SOAP web service.
  • Quirk: Extremely verbose and meant to be machine-readable for generating client code.

SAML (Security Assertion Markup Language)

  • Concept: An XML-based standard for exchanging authentication and authorization data, primarily for Single Sign-On (SSO).
  • Example (Simplified Assertion):
<saml:Assertion ...>
  <saml:Subject>
    <saml:NameID>[email protected]</saml:NameID>
  </saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="Role">
      <saml:AttributeValue>Admin</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
  • Transforms in SAML: Two kinds of processing are often confused here.
    1. Digital Signatures (Canonicalization): Before a SAML assertion is digitally signed, it is converted into a canonical form so that logically identical but textually different XML (e.g., different attribute order or quoting) produces the same hash for signature verification. This is done by a canonicalization algorithm (SAML uses Exclusive C14N), not by XSLT. The SAML specification discourages any signature transforms other than the enveloped-signature and Exclusive C14N transforms.
    2. Attribute Mapping (Claim Transformation): An Identity Provider might send attributes in one format (e.g., <Attribute Name="mail">), but a Service Provider application might expect another (<Attribute Name="emailAddress">). XSLT can be used on the server side as a "translator" to rename, filter, and restructure these incoming attributes to match what the application needs. Any such rewriting has to happen after signature verification, since it changes the signed content.
  • Quirk: Incredibly powerful but complex to implement. Debugging SAML flows is difficult.
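
The effect of canonicalization on a signature digest can be demonstrated with Python's standard library (3.8 or later). This is a sketch of the idea only: xml.etree.ElementTree.canonicalize implements C14N 2.0, whereas SAML signatures use Exclusive C14N 1.0, and the Format attribute value and the digest helper are made up for illustration. Real signature verification should go through a vetted XML security library.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two spellings of the same logical element: attribute order, quote style,
# and whitespace inside the start tag all differ.
a = '<Attribute Name="Role" Format="basic"><AttributeValue>Admin</AttributeValue></Attribute>'
b = "<Attribute Format='basic'   Name='Role'><AttributeValue>Admin</AttributeValue></Attribute>"

def digest(xml_text: str) -> str:
    """Hash the canonical form of the XML rather than the raw bytes."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

raw_equal = a == b
raw_digests_equal = (
    hashlib.sha256(a.encode("utf-8")).digest()
    == hashlib.sha256(b.encode("utf-8")).digest()
)
canonical_digests_equal = digest(a) == digest(b)
```

Hashing the raw text gives two different digests, while hashing the canonical form gives one. That is the property a signature verifier relies on.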

UDDI

  • Concept: An XML-based registry for businesses to publish and discover web services.
  • Quirk: A classic case of over-engineering that never took off. Now obsolete.

RSS

  • Concept: A family of XML formats for web syndication (news feeds, blogs).
  • Example:
<rss version="2.0">
  <channel>
    <title>My Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Post 1</title>
      <link>http://example.com/post1</link>
    </item>
  </channel>
</rss>

Atom

  • Concept: A more modern, well-specified XML format for web feeds, designed as a successor to RSS.
  • Example:
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Blog</title>
  <link href="http://example.com/"/>
  <entry>
    <title>Post 1</title>
    <link href="http://example.com/post1"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  </entry>
</feed>
  • Quirk: Although Atom is the more rigorously specified format (RFC 4287), RSS remains more widely known due to its historical prevalence.
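
A practical difference between the two formats shows up as soon as you parse them: RSS 2.0 elements are in no namespace and carry the link as text, while Atom elements live in the Atom namespace and carry the link in an href attribute. A minimal sketch with Python's standard xml.etree.ElementTree, using the two example feeds above:

```python
import xml.etree.ElementTree as ET

rss = ET.fromstring("""
<rss version="2.0">
  <channel>
    <title>My Blog</title>
    <link>http://example.com/</link>
    <item><title>Post 1</title><link>http://example.com/post1</link></item>
  </channel>
</rss>
""")

atom = ET.fromstring("""
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Blog</title>
  <link href="http://example.com/"/>
  <entry>
    <title>Post 1</title>
    <link href="http://example.com/post1"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  </entry>
</feed>
""")

# RSS: plain element names, link is element text.
rss_links = [item.findtext("link") for item in rss.iter("item")]

# Atom: names must be qualified, link is an attribute.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}
atom_links = [
    entry.find("atom:link", ATOM).get("href")
    for entry in atom.findall("atom:entry", ATOM)
]
```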