1. Introduction

Web technology advances quickly, for better and for worse. Almost any technology can produce devastating results if it is not used properly. Many programmers have never been introduced to the vulnerabilities that can arise when working with and parsing XML files, which is why I wrote this article. I hope you like it.

2. What is XML?

XML stands for Extensible Markup Language and is mostly used for representing structured information. It is widely employed in today’s web technology: web services (SOAP, REST, WSDL), RSS and Atom feeds, and configuration and document files (Microsoft Office and many other desktop applications). XML has been standardized by the World Wide Web Consortium (W3C) and is derived from SGML (ISO 8879). XML was created in 1996 by Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cowan. The first XML specification was published as a standard on 10 Feb 1998.

W3schools.com has a nice short description of what XML represents (http://www.w3schools.com/xml/xml_whatis.asp):

  • XML stands for Extensible Markup Language
  • XML is a markup language much like HTML
  • XML was designed to carry data, not to display data
  • XML tags are not predefined. You must define your own tags
  • XML is designed to be self-descriptive
  • XML is a W3C Recommendation

3. Designing an XML structure

  • XML Header (Document Type Definition – DTD)

Designing an XML structure is pretty straightforward. Each XML document begins with a header, the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>

Code 1: Sample of header declaration

In this example, the header defines the XML version and the encoding. The header can also be followed by additional declarations such as <!DOCTYPE ...>. This is known as the DTD (Document Type Definition), a set of declarations added to the XML file (for the tags used in a DTD visit http://www.w3schools.com/dtd/).

  • XML Elements

Each XML file contains elements, and you can name them almost any way you want as long as you avoid special characters. A start tag looks like “<title>”, for example, and the matching end tag is “</title>”.

<title>Looking for a job!</title>
<description>Recent graduate student looking for..</description>

Code 2: Sample of element declaration

The declaration of tags is quite easy; you just need to stick to the following rules (http://www.w3schools.com/xml/xml_elements.asp):

  • Names can contain letters, numbers, and other characters
  • Names cannot start with a number or punctuation character
  • Names cannot start with the letters xml (or XML, or Xml, etc)
  • Names cannot contain spaces

  • XML Attributes

Instead of nesting an element inside another element, you can turn the child element into an attribute of its parent element. It sounds confusing when explained, but in practice it’s very easy.

<item title="Looking for a job!">
<description>Recent graduate student looking for..</description>
</item>

Code 3: Sample of attribute declaration

Compare this with the previous sample code (Code 2): there, “title” is an element, and in this sample it is an attribute. There is not much of a difference, since there are few rules about how you structure your own XML file.
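To make the difference concrete, here is a minimal Python sketch that reads the same title from both layouts with the standard library's xml.etree.ElementTree. The surrounding <item> wrapper and the variable names are my own assumptions for illustration, not part of the samples above.

```python
import xml.etree.ElementTree as ET

# The title stored as a child element (wrapped in <item> for this example)
as_element = ET.fromstring(
    "<item><title>Looking for a job!</title></item>"
)
# The same title stored as an attribute of <item>
as_attribute = ET.fromstring('<item title="Looking for a job!"></item>')

# A child element is read with findtext(), an attribute with get()
title_from_element = as_element.findtext("title")
title_from_attribute = as_attribute.get("title")

print(title_from_element)
print(title_from_attribute)
```

Both calls return the same string; only the way you reach it changes.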

  • XML Validation

There are also web sites where you can validate an XML file to see whether it is properly designed: http://www.xmlvalidation.com/, http://www.w3schools.com/xml/xml_validator.asp, http://www.validome.org/xml/ and many more.
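If you prefer a quick local check, the sketch below uses Python's standard xml.etree.ElementTree to test whether a string is well-formed XML. Note the limits: it only checks well-formedness (matching tags, legal names), not validity against a DTD or schema, and the function name is_well_formed is a hypothetical helper of mine.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised for unclosed tags, illegal names, stray characters, etc.
        return False
    return True


good = "<item><title>Looking for a job!</title></item>"
bad = "<item><title>Looking for a job!</item>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

Only run a check like this on input you trust, since the parser itself is what the vulnerabilities in section 6 target.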

4. Making an XML file (RSS feed)

Today many web and desktop applications use XML as part of their structure, and the RSS feed is one example. RSS stands for Rich Site Summary, or more colloquially Really Simple Syndication, and its main function is to deliver summaries of recently published blog posts, news and so on. Many news aggregators, including Google News, work with RSS feeds. Here is a sample PHP script for making an RSS feed. It is just a sample for you to see how it works; I definitely wouldn’t recommend using it in a real-life project!

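<?php
// Sample only: advert::find_by_sql() is the author's own helper and is not shown here.
header("Content-Type: application/rss+xml; charset=UTF-8");
$sql = "SELECT advert_id, advert_title, advert_text, advert_date FROM adverts";
$adverts = advert::find_by_sql($sql);
$rss  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
$rss .= "<rss version=\"2.0\">";
$rss .= "<channel>";
$rss .= "<title>Feeds RSS Localhost</title>";
$rss .= "<description>Sample RSS feed</description>";
$rss .= "<copyright>Copyright (C) localhost.com</copyright>";
foreach ($adverts as $advert) {
    // Assumption: each row is an object exposing the selected columns as properties.
    $rss .= "<item>";
    $rss .= "<title>" . htmlspecialchars($advert->advert_title) . "</title>";
    $rss .= "<description>" . htmlspecialchars($advert->advert_text) . "</description>";
    $rss .= "</item>";
}
$rss .= "</channel></rss>";
echo $rss;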

Code 4. Sample XML generator in PHP

As you can see from the code, this script generates an XML file with the tags that I have defined. It produces XML content like the following (this is just a sample script from an old web page that I worked on, so it will not run on your server as it is).

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Feeds RSS Localhost</title>
<description>Sample RSS feed</description>
<copyright>Copyright (C) localhost.com</copyright>
    <item>
        <title>Selling car for 30k$</title>
        <description>I am interested in selling my car...</description>
    </item>
    <item>
        <title>Looking for a job!</title>
        <description>Recent graduate student looking for..</description>
    </item>
    <item>
        <title>Selling house for 430k$</title>
        <description>Want to live in great house…</description>
    </item>
</channel>
</rss>

Code 5: Sample of generated XML

If you open it in Chrome, it will look like:

Figure 1: XML file in browser
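For comparison, here is a rough Python sketch of the same idea that builds the feed with xml.etree.ElementTree instead of string concatenation, so special characters such as & are escaped automatically. The build_feed function, the dictionary keys "title" and "text", and the sample adverts are assumptions made for this illustration; they do not come from the PHP script.

```python
import xml.etree.ElementTree as ET


def build_feed(adverts):
    """Build an RSS 2.0 document from a list of advert dictionaries."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Feeds RSS Localhost"
    ET.SubElement(channel, "description").text = "Sample RSS feed"
    ET.SubElement(channel, "copyright").text = "Copyright (C) localhost.com"

    for advert in adverts:
        # One <item> per advert, with the text escaped by ElementTree
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = advert["title"]
        ET.SubElement(item, "description").text = advert["text"]

    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(rss, encoding="unicode")


sample_adverts = [
    {"title": "Selling car for 30k$", "text": "I am interested in selling my car..."},
    {"title": "Cats & dogs", "text": "Looking for a pet sitter"},
]
feed = build_feed(sample_adverts)
print(feed)
```

Building the tree first and serializing it at the end means a title containing & or < cannot break the document, which is easy to get wrong when gluing strings together.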

5. Parsing XML files

In Python, you can easily parse XML files. There are many modules that can be used for this purpose; this sample uses BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

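from urllib.request import urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml


def parse_score(link):
    # Step 1: download the feed and read the raw XML
    with urlopen(link) as response:
        xml_content = response.read()
    # Step 2: parse it (the "xml" parser requires lxml to be installed)
    soup = BeautifulSoup(xml_content, "xml")
    # Step 3: collect every <item> element
    results = soup.find_all("item")

    for result in results:
        print(result.contents)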

Code 6: Sample of XML parsing

The code is straightforward and involves three steps: loading the XML link (or file), parsing the content with BeautifulSoup, and extracting the XML content.
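If you want to try the same three steps without a network connection or extra packages, the sketch below does it with the standard library's xml.etree.ElementTree on an in-memory string. The SAMPLE_FEED content and the item_titles function are assumptions of mine, modeled on the feed from the previous section.

```python
import xml.etree.ElementTree as ET

# Step 1: "load" the XML (an in-memory string stands in for the download)
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    "<title>Feeds RSS Localhost</title>"
    "<item><title>Selling car for 30k$</title></item>"
    "<item><title>Looking for a job!</title></item>"
    "</channel></rss>"
)


def item_titles(xml_text):
    # Step 2: parse the content into an element tree
    root = ET.fromstring(xml_text)
    # Step 3: extract the title of every <item>, wherever it is nested
    return [item.findtext("title") for item in root.iter("item")]


titles = item_titles(SAMPLE_FEED)
print(titles)
```

Like the BeautifulSoup version, this should only be fed XML you trust.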

6. Common XML vulnerabilities (sample of vulnerable code https://gist.github.com/hakre/2416846)

Every application has vulnerabilities, and XML parsers are no exception. This is a list of well-known XML vulnerabilities that might occur in your application:

  • Billion laughs

This vulnerability is a DoS (Denial of Service) attack aimed at XML parsers. It is also known as an XML bomb or an entity expansion XML bomb. A document carrying this attack may still pass XML schema validation. Consider the following tag:

<!ENTITY entityName "Some Value">

Code 7: DTD tag

Now consider the following vulnerable code (the code is taken from http://cytinus.wordpress.com/2011/07/26/37/):

Figure 2: Billion laughs vulnerable code

As you can see, we have 10 “lol” entities. So what is happening here? At the end, there is a reference to “lol9”. When &lol9; is parsed, the entity lol9 is expanded, and it contains 10 “lol8” references. Each lol8 contains 10 “lol7” references, and so on. In the end there are a lot of “lol” strings (10^9 = 1,000,000,000 instances, a billion). Expanding a billion “lol” strings can exhaust the parser’s memory and cause a DoS (Denial of Service). That’s why it is called the Billion Laughs Vulnerability. For more information about the vulnerability, check the link http://cytinus.wordpress.com/2011/07/26/37/.
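The growth is easier to believe with numbers. The following Python sketch simulates the nested expansion on a deliberately small scale (3 levels instead of 9) and then computes the full-size figure with plain arithmetic. It does not touch a real XML parser. The helper names build_entities and expand are hypothetical, and I am assuming the base entity holds the three-character string "lol" and that every level references the previous one 10 times.

```python
import re


def build_entities(levels, fanout, base="lol"):
    """Return a dict of entity name -> replacement text, nested `levels` deep."""
    entities = {"lol": base}
    for i in range(1, levels + 1):
        previous = "lol" if i == 1 else f"lol{i - 1}"
        entities[f"lol{i}"] = f"&{previous};" * fanout
    return entities


def expand(text, entities):
    """Replace entity references repeatedly until none are left."""
    reference = re.compile(r"&(\w+);")
    while reference.search(text):
        text = reference.sub(lambda match: entities[match.group(1)], text)
    return text


# Small, safe simulation: 3 levels of 10 references each
small = expand("&lol3;", build_entities(levels=3, fanout=10))
small_count = small.count("lol")

# Full attack size, computed rather than expanded
full_count = 10 ** 9
full_bytes = full_count * len("lol")

print(small_count, full_count, full_bytes)
```

Each extra level multiplies the output by 10, so a document of a few hundred bytes asks the parser to produce about 3 GB of text.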

  • Quadratic blowup

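Another entity expansion XML bomb is the quadratic blowup vulnerability discovered by Amit Klein of Trusteer. Instead of nesting entities, it defines a single entity “a” as a very long string of “a” characters and then references it as “&a;” 50,000 times inside the “kaboom” element. With an entity that is 50,000 characters long, the document grows from about 200 KB to about 2.5 GB when parsed, causing a DoS. Still, billion laughs produces a much bigger expansion than quadratic blowup for a document of the same size.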

<?xml version="1.0"?>
<!DOCTYPE kaboom [
  <!ENTITY a "aaaaaaaaaaaaaa.....">
]>
<kaboom>&a;&a;&a;&a;&a;&a;.....</kaboom>

Code 7. Quadratic blowup
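The 200 KB and 2.5 GB figures can be checked with a few lines of Python. This is only arithmetic, not an attack: I am assuming an entity of 50,000 characters that is referenced 50,000 times as "&a;", and I ignore the few bytes of surrounding markup. The function names are hypothetical.

```python
def payload_size(entity_length, reference_count, reference="&a;"):
    """Approximate size of the document sent to the parser, in bytes."""
    return entity_length + reference_count * len(reference)


def expanded_size(entity_length, reference_count):
    """Size of the text after every reference has been expanded, in bytes."""
    return entity_length * reference_count


sent = payload_size(50_000, 50_000)
parsed = expanded_size(50_000, 50_000)

print(sent)
print(parsed)
```

The name "quadratic" comes from the product: doubling both the entity length and the number of references roughly doubles the payload but quadruples the expanded size.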

  • DTD retrieval

An entity declaration can also point to a URL instead of holding the replacement text itself (for the definition of replacement, see the previous vulnerability). By using system identifiers, you can download content from an external location and embed it in your XML file.

<!DOCTYPE external [
<!ENTITY ee SYSTEM "http://www.python.org/some.xml">
]>
<root>&ee;</root>

Code 8. Remote entity expansion retrieval example

The same vulnerability can also be used to read a local file:

<!DOCTYPE external [
<!ENTITY ee SYSTEM "file:///PATH/TO/simple.xml">
]>
<root>&ee;</root>

Code 9. Local entity expansion retrieval example

According to an article from Python’s blog about XML vulnerabilities, here are the possible “bad things” that might happen because of this vulnerability (http://blog.python.org/2013/02/announcing-defusedxml-fixes-for-xml.html):

  • An attacker can circumvent firewalls and gain access to restricted resources as all the requests are made from an internal and trustworthy IP address, not from the outside.
  • An attacker can abuse a service to attack, spy on or DoS your servers but also third party services. The attack is disguised with the IP address of the server and the attacker is able to utilize the high bandwidth of a big machine.
  • An attacker can exhaust additional resources on the machine, e.g. with requests to a service that doesn’t respond or responds with very large files.
  • An attacker may gain knowledge, when, how often and from which IP address a XML document is accessed.
  • An attacker could send mail from inside your network if the URL handler supports smtp:// URIs.

7. How to defend

Figure 3: Modules that lack protection from XML exploits (http://blog.python.org/2013/02/announcing-defusedxml-fixes-for-xml.html)
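One defensive idea is to refuse any document that declares entities at all, since all three attacks above depend on entity declarations in the DTD. The sketch below shows that idea using only the standard library's xml.parsers.expat. The function reject_entities and the exception class EntitiesForbidden are hypothetical names defined here for illustration. For real projects I would rather rely on a maintained package such as defusedxml, announced in the blog post linked above, than on this sketch.

```python
from xml.parsers import expat


class EntitiesForbidden(ValueError):
    """Raised when a document contains an entity declaration."""


def reject_entities(xml_bytes):
    """Parse xml_bytes and raise EntitiesForbidden on any <!ENTITY ...>."""
    parser = expat.ParserCreate()

    def on_entity_declaration(name, is_parameter, value, base,
                              system_id, public_id, notation):
        # Called for internal entities (value is set) and for external
        # ones (system_id is set); both are refused before any expansion
        raise EntitiesForbidden(f"entity declaration not allowed: {name}")

    parser.EntityDeclHandler = on_entity_declaration
    parser.Parse(xml_bytes, True)


harmless = b"<root>ok</root>"
internal = b'<!DOCTYPE root [<!ENTITY a "aaa">]><root>&a;</root>'
external = (
    b'<!DOCTYPE root [<!ENTITY ee SYSTEM "file:///PATH/TO/simple.xml">]>'
    b"<root>&ee;</root>"
)

reject_entities(harmless)  # passes silently
for document in (internal, external):
    try:
        reject_entities(document)
    except EntitiesForbidden as error:
        print(error)
```

The trade-off is that legitimate documents with their own entity declarations are rejected too, which is usually acceptable for feeds and web service payloads from untrusted sources.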


8. Conclusion

I think this topic is interesting because it is something that many programmers are not aware of. We should care more about the security of web applications, because XML is an increasingly large part of them, and that increases the risk of being exploited. We saw that exploiting these vulnerabilities can have devastating results, and that is why we should take more care to use safe modules and functions.

9. References