RationalizeNamespacePrefixes

Today I’m going to talk about parsing documents that use XML Namespaces with XML::Easy. While XML::Easy doesn’t (by design) ship with its own XML Namespace aware parser, one of my modules, XML::Easy::Transform::RationalizeNamespacePrefixes, makes parsing documents that use namespaces a doddle with just one simple function call.

The problems namespaces solve and cause

XML Namespaces is an extension of the XML 1.0 specification that allows multiple standards to cooperate so they don’t use the same names for their nodes, meaning it’s possible to use more than one specification in the same document at the same time without conflict. For example here’s an XML document that uses two different made up specs at the same time to describe a pub lunch that uses the tag “chips” to mean two different things:

<order
  xmlns:prepackaged="http://twoshortplanks.com/ns/example/behindthebar"
  xmlns:grub="http://twoshortplanks.com/ns/example/food">
  <grub:meal >
    <grub:beefburger/>
    <grub:chips/>
  </grub:meal>
  <prepackaged:chips type="Pringles" />
</order>

The XML Namespace specification works by using a convention of naming nodes with an extra prefix. This allows you to use what would otherwise be the same name in the same document with a different semantic meaning. For example the “chips” nodes are written as “prepackaged:chips” when they’re referring to crisps, and “grub:chips” when they’re referring to fries. The clever bit of XML Namespaces is that it doesn’t matter what prefix you use to differentiate the two from each other, but what namespace URLs they map to. For example, this document here is considered to be essentially identical to the previous example as far as a namespace aware XML parser is concerned:

<order>
  <meal xmlns="http://twoshortplanks.com/ns/example/food">
    <beefburger/>
    <chips/>
  </meal>
  <barsnack:chips xmlns:barsnack="http://twoshortplanks.com/ns/example/behindthebar" type="Pringles" />
</order>

The meaning of the prefix is entirely derived from the presence of the xmlns prefixed attributes on the node or on a parent node mapping the prefix to a URL¹. This is both great and a complete nightmare. Great, since you're mapping an arbitrary prefix to a unique namespace URL, so you're not going to get conflicts with other specifications (the way you would if each specification defined its own prefix). And a complete nightmare because you don't know what the thing you're looking for is actually called: either your code, or the parser, has to keep track of what namespaces are declared in the current scope and what prefixes map to what namespaces.
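To make that bookkeeping concrete, here is a minimal sketch in plain Perl of what "keeping track of the namespaces in scope" involves. The resolve_name helper and the way scopes are represented (one hash per element, outermost first) are assumptions made up for this illustration; they are not part of XML::Easy or any other module.

```perl
use strict;
use warnings;

# Resolve a (possibly prefixed) element name against a stack of xmlns
# declarations, outermost scope first. The "" key stands for the default
# namespace, which applies to element names only, not to attributes.
sub resolve_name {
    my ($qname, @scopes) = @_;
    my ($prefix, $local) = $qname =~ /\A(?:([^:]+):)?([^:]+)\z/
        or die "Bad name: $qname";
    $prefix = "" unless defined $prefix;

    # the innermost declaration of a prefix wins
    for my $scope (reverse @scopes) {
        return ($scope->{$prefix}, $local) if exists $scope->{$prefix};
    }
    return (undef, $local);    # no namespace in scope
}

my $food = "http://twoshortplanks.com/ns/example/food";
my $bar  = "http://twoshortplanks.com/ns/example/behindthebar";

# first document: both prefixes are declared on <order>
my @first = resolve_name("grub:chips", { grub => $food, prepackaged => $bar });

# second document: nothing on <order>, default namespace declared on <meal>
my @second = resolve_name("chips", {}, { "" => $food });

print "same element\n" if "@first" eq "@second";
```

Both calls resolve to the same (namespace, local name) pair, which is why a namespace aware parser treats the two documents as equivalent even though the names are spelled differently.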

Using XML::Easy::Transform::RationalizeNamespacePrefixes

What would be great is if there was some way you could force everyone who gives you a document to use the prefixes you'd like. Then you'd know what the nodes would be called, instead of having to worry about all these xmlns:whatever attributes in the document (and where nodes sit in the tree in relation to them). Then you could just look for all the "beverage:lager" nodes.

Well, we can't force other people to do what we want, but what we can do is make use of the fact that the prefixes are arbitrary and the same document with any prefix means the same thing. We can therefore just rewrite whatever document we're given into a form we'd like to deal with before we process it. This is the task XML::Easy::Transform::RationalizeNamespacePrefixes was designed for - it rationalises the prefixes of the namespaces to whatever you want. For example, forcing the use of the "kitchen" and "barstaff" prefixes for the namespaces:

my $old_doc = xml10_read_document($string_of_xml);
my $new_doc = rationalize_namespace_prefixes($old_doc, {
  namespaces => {
    "http://twoshortplanks.com/ns/example/food" => "kitchen",
    "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
  },
  force_attribute_prefix => 1,
});

Now if you feed either of the above documents to the code, you'll have an in memory representation of the following document:

<order
  xmlns:barstaff="http://twoshortplanks.com/ns/example/behindthebar"
  xmlns:kitchen="http://twoshortplanks.com/ns/example/food">
  <kitchen:meal >
    <kitchen:beefburger/>
    <kitchen:chips/>
  </kitchen:meal>
  <barstaff:chips barstaff:type="Pringles" />
</order>

Several important transformations have happened:

  • It used the namespace/prefix mapping that we passed into it with namespaces to rename all the corresponding nodes in the document to have whatever prefixes we want. This means we now know, without looking at the xmlns attributes, what our nodes will be called.

  • All the namespaces have been moved to the top element of the document. In this example the module didn't need to introduce any further prefixes to do this (which can happen if the same prefix is used to refer to different URLs in different parts of the tree) nor condense prefixes to a single prefix per namespace (which happens if multiple prefixes refer to the same URL) but if it had to do that, it would have. This means it's really easy to find other namespaces that are defined in our document - you just look for xmlns attributes at the top element.

  • The force_attribute_prefix option forces prefixes to be attached to attribute names too.
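The renaming step in the first bullet boils down to a lookup from namespace URL to preferred prefix. The following plain-Perl sketch shows just that idea; the @resolved input (names already split into namespace and local part) is a made-up stand-in for illustration and says nothing about how the module works internally.

```perl
use strict;
use warnings;

# namespace URL => preferred prefix, as in the namespaces option above
my %wanted = (
    "http://twoshortplanks.com/ns/example/food"         => "kitchen",
    "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
);

# Hypothetical, pre-resolved input: [ namespace URL, local name ] pairs.
# Whatever prefixes the source document used have already been discarded.
my @resolved = (
    [ "http://twoshortplanks.com/ns/example/food",         "meal"  ],
    [ "http://twoshortplanks.com/ns/example/food",         "chips" ],
    [ "http://twoshortplanks.com/ns/example/behindthebar", "chips" ],
);

# rebuild each name using the prefix we asked for
my @renamed = map { $wanted{ $_->[0] } . ":" . $_->[1] } @resolved;

print "$_\n" for @renamed;
```

Because the original prefixes play no part in the lookup, either of the example documents ends up with the same kitchen: and barstaff: names.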

Now we can parse the document without worrying about the namespaces at all. If we want to look for all the packets of preprepared food in the document:

use 5.010;    # enables say
use XML::Easy::Text qw(xml10_read_document);
use XML::Easy::Classify qw(is_xml_element);
use XML::Easy::NodeBasics qw(xe_twine);
use XML::Easy::Transform::RationalizeNamespacePrefixes qw(rationalize_namespace_prefixes);

sub packets {
  my $element = shift;
  return unless is_xml_element($element);
  my @return;
  push @return, $element->attribute("barstaff:type") if $element->type_name eq "barstaff:chips";
  push @return, map { packets($_) } @{ xe_twine($element) };
  return @return;
}

say "We need the following packets:";
say " * $_" for packets(
  rationalize_namespace_prefixes(
    xml10_read_document($string_of_xml), {
      namespaces => {
        "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
      },
      force_attribute_prefix => 1,
    }
  )
);

There's more information on XML::Easy::Transform::RationalizeNamespacePrefixes' search.cpan.org page².

And that concludes my mini-series looking into XML::Easy. I'm sure to write more about it in the future as more interesting uses and extensions are written for it, but in my next entry I'll be taking a break from the pointy brackets!


[1] I've used the term URL multiple times in this document when I should really have used URI. We're using the http:// thingy wosit to identify a unique resource, so it should be a URI (Uniform Resource Identifier) rather than a URL (Uniform Resource Locator), because there's no resource to locate at that address. It's just a unique name.

[2] Please note that this blog post was originally published close to when the new version of XML::Easy::Transform::RationalizeNamespacePrefixes was uploaded to the CPAN, so not all features described in this post may have reached your local CPAN mirror if you're reading it "hot off the presses".


XML::Easy::ProceduralWriter

In this post I want to take a further look at writing code that outputs XML with XML::Easy, and how the example from my previous blog entry can be improved upon by using XML::Easy::ProceduralWriter.

What’s wrong with the previous example?

When we left things in my previous post we were looking at this snippet of code that outputs a simple XML webpage:

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

print {$fh} xml10_write_document($root_element);

The above code produces exactly the output we want, but it doesn’t necessarily go about the best way of producing it.

The first problem is that it’s using Tie::IxHash to ensure that the keys of the hash (and thus the nodes in the resulting XML) come out in the right order rather than in a random order like traditional hashes. Tied data structures are much slower than normal data structures and using this structure in this way is a big performance hit. However in this case we have to tie because it’s hard to write, in a readable way, the logic inline in the map statement to process a normal array two elements at a time.

Which brings us onto the second problem, also related to the map statement – it’s somewhat unwieldy to write and hard to read (you have to scan to the end of the line to work out that it’s using the %hash for its keys.) This only gets worse as you have to produce more complex XML and you try and use further (possibly nested) map statements and ternary expressions to build up even more complex data structures – which is every bit as messy to do as it is to explain.

Both issues stem from trying to build the XML::Easy::Element tree all in one go, essentially in one statement as a single assignment. If we choose not to restrict ourselves in this way we can easily re-order the code to use a temporary variable and do away with both the tie and the map:

my @data = (
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
);

my @links;
while (@data) {
  my $url = shift @data;
  my $text = shift @data;
  push @links, xe("li", xe("a", { href => $url }, $text) );
}

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul", @links),
  ),
);

print {$fh} xml10_write_document($root_element);

The problem with this solution is that now we’ve ended up with code that’s backwards. We’re creating the list elements and then creating the node that encloses them. Now we have to read the bottom of the code to work out that we’re creating an HTML document at all!

Introducing XML::Easy::ProceduralWriter

To solve this problem I wrote XML::Easy::ProceduralWriter, a module that allows you to write your code in a procedural fashion but without having to “code in reverse”.

Here’s the above example re-written again, this time using XML::Easy::ProceduralWriter:

use XML::Easy::ProceduralWriter;

print {$fh} xml_bytes {

  element "html", contains {

    element "head", contains {
      element "title", contains {
        text "My Links";
      };
    };

    element "body", contains {
      element "h1", contains {
        text "Links";
      };

      element "ul", contains {
         my @data = (
           "http://search.cpan.org/" => "Search CPAN",
           "http://blog.twoshortplanks.com" => "Blog",
           "http://www.schlockmercenary.com" => "Schlock",
         );

         while (@data) {
           element "li", contains {
             element "a", href => shift @data, contains {
               text shift @data;
             };
           };
         }
      };
    };
  };

};

Using the module is straightforward. You start by calling either xml_element (which returns an XML::Easy::Element) or xml_bytes (which returns a set of bytes you can print out) and inside these calls you pass some code that generates XML elements and text. Each element can ‘contain’ further code that produces the sub-elements and text that element contains, and so on.

The key thing to notice is that unlike the previous examples, where you were passing data structures into the functions, here you’re passing code to be executed. This means you can place arbitrary logic in what you pass in and you’re not limited to single statements. For example, in the above code we declare variables in the middle of generating the XML. The conceptual jump here is realising that neither what the blocks of code return nor what element and text return is important, but the side effects of calling these two functions are. The simplest way to think about it is to imagine the string being built up as the element and text statements are encountered, in much the same way output is straight away printed to the filehandle when you use print (even though technically this isn’t the case here – a full XML::Easy::Element object tree is always actually built in the background.)
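If the "side effects, not return values" idea feels slippery, the following toy may help. It is a plain-Perl sketch of one way a block-based builder can work, using a stack of child lists. The element_node and text_node helpers are hypothetical names invented for this illustration; this is not how XML::Easy::ProceduralWriter is implemented, and it builds a string without any escaping rather than an XML::Easy::Element tree.

```perl
use strict;
use warnings;

# Each stack entry collects the children of a currently open element.
# The bottom entry collects the top-level result.
my @stack = ([]);

# Append text to whichever element is currently open. Returns nothing:
# only the side effect on @stack matters.
sub text_node {
    push @{ $stack[-1] }, $_[0];
    return;
}

# Open an element, run the block for its side effects, then close it.
sub element_node {
    my ($name, $body) = @_;
    push @stack, [];              # start collecting this element's children
    $body->();                    # the block's return value is ignored
    my $children = pop @stack;
    push @{ $stack[-1] }, "<$name>" . join("", @$children) . "</$name>";
    return;
}

element_node("ul", sub {
    # arbitrary logic is fine in here, because this is just code
    for my $label ("Search CPAN", "Blog") {
        element_node("li", sub { text_node($label) });
    }
});

print $stack[0][0], "\n";
```

This prints <ul><li>Search CPAN</li><li>Blog</li></ul>. Nothing any block returns is ever used; the output exists only because each call pushed something onto the list belonging to the enclosing element.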

The documentation for XML::Easy::ProceduralWriter contains a reasonable tutorial that explains its usage in more detail, but it should be pretty straightforward from just reading the above code to jump straight in.

And that’s pretty much all I have to say about outputting XML with XML::Easy. In my next post we’ll look instead at advanced parsing and how to cope with documents with XML Namespace declarations.

XML::Easy by Example

Last week I posted about why you should be interested in the new XML parsing library on the block, XML::Easy. In this post I’m going to actually dive into some code so you can see what it’s like to work with.

Parsing XML by Example

The basics of parsing are pretty straightforward:

use XML::Easy::Text qw(xml10_read_document);

# read in the document
open my $fh, "<:utf8", "somexml.xml"
  or die "Can't read filehandle: $!";
my $string = do { local $/; <$fh> };

# parse it
my $root_element = xml10_read_document($string);

Now $root_element contains an XML::Easy::Element. Getting basic facts out of this element such as its name or attribute values is easy too:

say "the name of the root element is ".$root_element->type_name;
say "the content of the href attribute is ".$root_element->attribute("href")
  if defined $root_element->attribute("href");

Getting at the child elements involves dealing with a twine. What’s a twine you say? Why it’s nothing more than an alternating list of strings and elements. Let’s look at an example to help explain this:

my $input = '<p>Hello my <i>friend</i>, here is my picture: <img src="http://farm1.static.flickr.com/116/262065452_6017d39626_t.jpg" /></p>';

We can then call this:

my $p = xml10_read_document($input);
my $twine = $p->content_twine;

The $twine variable now contains an array reference holding alternating strings and XML::Easy::Elements:

  • $twine->[0] contains the string “Hello my ”
  • $twine->[1] contains an XML::Easy::Element representing the <i> tag (which in turn will contain the text “friend”)
  • $twine->[2] contains the string “, here is my picture: ”
  • $twine->[3] contains an XML::Easy::Element representing the <img> tag
  • $twine->[4] contains the empty string “” between the <img> tag and the closing </p> tag

The important thing to remember about twines is that they always alternate string-element-string-element-string. When two elements are next to each other in the source document then they’re separated by the empty string. You’ll note that the first and last items of a twine are always strings, even if they have to be empty, and an “empty” tag has a twine that contains just one item – the empty string.
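The alternation rule means that the position in the array tells you what kind of thing you're looking at. Here is a small plain-Perl sketch of that, with ordinary hashrefs standing in for XML::Easy::Element objects (an assumption made so the snippet runs without the module installed):

```perl
use strict;
use warnings;

# Plain hashrefs stand in for XML::Easy::Element objects here.
my $twine = [
    "Hello my ",
    { name => "i" },
    ", here is my picture: ",
    { name => "img" },
    "",
];

# strings at both ends means a twine always has an odd number of items
die "not a twine" unless @$twine % 2;

# even indices hold strings, odd indices hold elements
my @strings  = @{$twine}[ grep { $_ % 2 == 0 } 0 .. $#$twine ];
my @elements = @{$twine}[ grep { $_ % 2 == 1 } 0 .. $#$twine ];

print scalar(@strings), " strings, ", scalar(@elements), " elements\n";
```

For the five-item twine above this reports 3 strings and 2 elements. A twine with n elements always holds n + 1 strings.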

Now we know the basics, let’s look at a practical example. Imagine we want to get all the possible anchors (internal links) in an XHTML document. This simply involves looking for all the <a> tags that have a name attribute:

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  my $swizzle = 0;
  foreach (@{ $element->content_twine() }) {

    # skip every other array element because it's a string
    next if $swizzle = !$swizzle;

    # recurse into the child nodes
    push @results, get_links($_);
  }

  return @results;
}

If we want to make this even easier on ourselves there's a bunch of helper functions in the XML::Easy::Classify module that can be used to help process parts of XML documents. For example, we could have written the above code in a more terse (but less efficient) way by using is_xml_element:

use XML::Easy::Classify qw(is_xml_element);

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  push @results, get_links($_)
    foreach grep { is_xml_element $_ } @{ $element->content_twine() };

  return @results;
}

Generating XML by Example

If you've got an XML::Easy::Element instance, writing it out as an XML document is just the opposite of reading it in:

use XML::Easy::Text qw(xml10_write_document);

# turn it into a string
my $string = xml10_write_document($root_element);

# write out the document
open my $fh, ">:utf8", "somexml.xml"
  or die "Can't write to filehandle: $!";
print {$fh} $string;

One of the first things you have to know about XML::Easy::Elements and their contents is that they are immutable, or put another way you can't change them once they're created. This means they have no methods for setting the name of an element, altering the attributes, or setting the children. All of these must be passed in to the constructor.

Let's just jump in with an example. We're going to create a little code that outputs the following XML document:

<html>
   <head><title>My Links</title></head>
   <body>
     <h1>Links</h1>
     <ul>
       <li><a href="http://search.cpan.org/">Search CPAN</a></li>
       <li><a href="http://blog.twoshortplanks.com/">Blog</a></li>
       <li><a href="http://www.schlockmercenary.com/">Schlock</a></li>
     </ul>
   </body>
</html>

(I've added extra whitespace in the above example for clarity - the code examples that follow won't reproduce this whitespace)

I'm going to start off by showing you the most verbose and explicit object-orientated way to create XML::Easy::Elements, and then I'm going to show you the much quicker functional interface once you know what you're doing. The verbose way of creating an object is to explicitly pass in each of the things to the constructor:

XML::Easy::Element->new($name, $attributes_hashref, $xml_easy_content_instance)

The trouble with using such code is that it often requires pages and pages of code that puts Java to shame in its repetition of the obvious (you don't really need to read the following code, just gawk at its length):

my $root_element = XML::Easy::Element->new("html",
  {},
  XML::Easy::Content->new([
    "",
    XML::Easy::Element->new("head",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("title",
          {},
          XML::Easy::Content->new([
            "My Links",
          ])
        ),
        "",
      ]),
    ),
    "",
    XML::Easy::Element->new("body",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("h1",
          {},
          XML::Easy::Content->new([
            "Links",
          ])
        ),
        "",
        XML::Easy::Element->new("ul",
          {},
          XML::Easy::Content->new([
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://search.cpan.org/" },
                  XML::Easy::Content->new([
                    "Search CPAN",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://blog.twoshortplanks.com/" },
                  XML::Easy::Content->new([
                    "Blog",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://www.schlockmercenary.com/" },
                  XML::Easy::Content->new([
                    "Schlock",
                  ]),
                ),
                "",
              ]),
            ),
            "",
          ]),
        ),
        "",
      ]),
    ),
    "",
  ]),
);

So, we never ever write code like that! For starters we could use twines instead of content objects, but that's too verbose too. We use the functional interface presented by XML::Easy::NodeBasics instead:

use XML::Easy::NodeBasics qw(xe);

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      xe("li",
        xe("a", { href => "http://search.cpan.org/" }, "Search CPAN"),
      ),
      xe("li",
        xe("a", { href => "http://blog.twoshortplanks.com/" }, "Blog"),
      ),
      xe("li",
        xe("a", { href => "http://www.schlockmercenary.com/" }, "Schlock"),
      ),
    ),
  ),
);

The xe function simply takes a tag name followed by a list of things that are either hashrefs (containing attributes), strings (containing text), or XML::Easy::Elements (containing nodes). It can also take content objects and twines, which is handy when you're re-using fragments of XML that you've extracted from other documents you may have parsed. In short, it Does The Right Thing with whatever you throw at it.

Of course, we can optimise further by knowing that this code is Perl:

use Tie::IxHash;

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

And that's about it for basic XML parsing and generation with XML::Easy. There are a lot more handy functions and explanations of the theory behind XML::Easy in the documentation. In my next post I'm going to look at another way of creating XML using XML::Easy, when I talk about one of my own modules: XML::Easy::ProceduralWriter.

Introducing XML::Easy

Some days, you just want to parse an XML document.

However, the standard distribution of Perl doesn’t ship with a bundled XML parser, traditionally instead requiring the user to install a module from CPAN. This means there’s no standard way to do this. Instead there are several choices of parser, each with their advantages and disadvantages: There is, as we often say in Perl, more than one way to do it. This is the first post in a series where I’m going to talk about XML::Easy, a relatively new XML parsing module that deserves a little more publicising.

But why another XML parsing library? What’s wrong with the others? Well, a few things…

One of the biggest problems with the most popular XML parsing modules like XML::LibXML and XML::Parser is that they rely on external C dependencies being installed on your system (libxml2 and expat respectively) so it can be hard to rely on them being installable on any old system. Suppose you write some software that relies on these modules. What exactly are you asking of the users of your software who have to install that module as a dependency? You’re asking them firstly to have a C compiler installed – something people using ActiveState Perl, basic web-host providers, or even Mac OS X users without dev tools do not have. Even more than this you’re often asking them to download and install (either by hand or via their package management system) the external C libraries that these modules rely on, and then know how to configure the C compiler to link to these. Complicated!

To solve this, XML::Easy ships with a pure Perl XML parser requiring neither external libraries nor a C compiler to install: in a pinch you can simply copy the Perl modules into your library path and you’re up and ready to go. This means that this library can be relied on pretty much anywhere.

The observant will point out that there are many existing pure Perl XML parsing libraries on CPAN. They suffer from another problem: They’re slow. Perl runs not as native instructions but as interpreted bytecode executing on a virtual machine, which is a technical way of saying “in a way that makes lots of very simple operations slow.” This is why the boring details of XML parsing are normally handled in C space.

Luckily, XML::Easy doesn’t use its pure Perl parser unless it really has to. It prefers to compile and install on those systems that do have a working C compiler its own C code for parsing XML. Note that this C code, bound into the perl interpreter with fast XS, is wholly self contained and doesn’t rely on external libraries. All the user on a modern Linux system has to do to install the module is type cpan XML::Easy at the command prompt. In this mode XML::Easy is fast: In our tests it’s at least as fast as XML::LibXML (which is to say, very fast indeed.) This week I’ve been re-writing some code that used to use MkDoc::XML to use XML::Easy and the new code is 600 (yes, six hundred) times faster.

This is great news for module authors who just want to do something simple with fast performance if they can get it, but don’t want to have to worry about putting too much of a burden on their users.

Of course, this would all be for naught if XML::Easy didn’t do a good job of parsing XML – but it does. The other big failing of the many so-called XML parsers for Perl is that they screw up the little but important things. They miss part of the specification (sometimes even deliberately!) or they don’t let you do things properly like handle Unicode. XML::Easy isn’t like this: It follows the specification quite carefully (with the devotion I’ve come to expect from its author, my co-worker Zefram) and doesn’t screw up Unicode because it doesn’t attempt to handle character encodings itself but embraces and works with Perl’s own Unicode handling.

So by now, I’ll have either sold you on the idea of XML::Easy or not, but I haven’t really shown you how to use it. In the next post in this series I’m going to start talking about how you can use XML::Easy to parse XML and extract which bits you want.
