XML::Easy by Example

Nov 9, 2009

Last week I posted about why you should be interested in the new XML parsing library on the block, XML::Easy. In this post I'm going to actually dive into some code so you can see what it's like to work with.

Parsing XML by Example

The basics of parsing is pretty straight forward:

use XML::Easy::Text qw(xml10_read_document);

# read in the document
open my $fh, "<:utf8", "somexml.xml";
  or die "Can't read filehandle: $!";
my $string = do { local $/; <> };

# parse it
my $root_element = xml10_read_document($string);

Now $root_element contains an XML::Easy::Element. Getting basic facts out of this element such as its name or attribute values is easy too:

say "the name of the root element is ".$root_element->type_name;
say "the content of the href attribute is ".$root_element->attribute("href")
  if defeined $root_element->attribute("href");

Getting at the child elements involves dealing with a twine. What's a twine you say? Why it's nothing more than an alternating list of strings and elements. Let's look at an example to help explain this: my $input = 'Hello my friend, here is my picture: <img src="http://farm1.static.flickr.com/116/262065452_6017d39626_t.jpg" />' We can then call this: my $p = xml10_read_document($string); my $twine = $p->content_twine; The $twine variable now contains a an array reference holding alternating strings and XML::Easy::Elements

$twine->[0] contains the string "Hello my"
$twine->[1] contains an XML::Easy::Element representing the tag (which in turn will contain the text "friend")
$twine->[2] contains the string ", here is my picture "
$twine->[3] contains an XML::Easy::Element representing the <img> tag
$twine->[4] contains the empty string "" between the <img> tag and the closing tag

The important thing to remember about twines is that they always alternate element-string-element-string. When two elements are next to each other in the source document then they're separated by the empty string. You'll note that the twine first and last elements are always strings, even if they have to be empty, and an "empty" tag has a twine that contains just one element - the empty string. Now we know the basics, let's look at a practical example. Imagine we want to get all the possible anchors (internal links) in an XHTML document. This simply involves looking for all the <a> tags that have a name attribute:

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  my $swizzle = 0;
  foreach (@{ $element->content_twine() }) {

    # skip every other array element because it's a string
    next if $swizzle = !$swizzle;

    # recurse into the child nodes
    push @results, get_links($_);
  }

  return @results;
}

If we want to make this even easier on ourselves there's a bunch of helper functions in the XML::Easy::Classify module that can be used to help process parts of XML documents. For example, we could have written the above code in a more terse (but less efficient) way by using is_xml_element:

use XML::Easy::Classify qw(is_xml_element);

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  push @results, get_links($_)
    foreach grep { is_xml_element $_ } @{ $element->content_twine() };

  return @results;
}

Generating XML by Example

If you've got an XML::Easy::Element instance, writing it out as an XML document is just the opposite of reading it in:

use XML::Easy::Text qw(xml10_write_document);

# turn it into a string
my $string = xml10_write_document($root_element);

# write out the document
open my $fh, ">:utf8", "somexml.xml";
  or die "Can't write to filehandle: $!";
print {$fh} $string;

So One of the first things you have to know about XML::Easy::Elements and their contents is that they are immutable, or put another way you can't change them once they're created. This means they have no methods for setting the name of an element, altering the attributes, or setting the children. All of these must be passed in in the constructor. Let's just jump in with an example. We're going to create a little code that outputs the following XML document:

<html>
   <head><title>My Links</title></head>
   <body>
     <h1>Links</h1>
     <ul>
       <li><a href="http://search.cpan.org/">Search CPAN</a></li>
       <li><a href="http://blog.twoshortplanks.com/">Blog</a></li>
       <li><a href="http://www.schlockmercenary.com/">Schlock</a></li>
     </ul>
   </body>
</html>

(I've added extra whitespace in the above example for clarity - the code examples that follow won't reproduce this whitespace) I'm going to start of showing you the most verbose and explicit objected-orientated way to create XML::Easy::Elements, and then I'm going to show you the much quicker functional interface once you know what you're doing. The verbose way of creating an object is to explicitly pass in each of the things to the constructor: XML::Easy::Element->new($name, $attributes_hashref, $xml_easy_content_instance) The trouble with using such code is that it often requires requires pages and pages of code that puts Java to shame in it's repetition of the obvious (you don't really need to read the following code, just gawk at its length:)

my $root_element = XML::Easy::Element->new("html",
  {},
  XML::Easy::Content->new([
    "",
    XML::Easy::Element->new("head",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("title",
          {},
          XML::Easy::Content->new([
            "My Links",
          ])
        ),
        "",
      ]),
    ),
    "",
    XML::Easy::Element->new("body",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("h1",
          {},
          XML::Easy::Content->new([
            "Links",
          ])
        ),
        "",
        XML::Easy::Element->new("ul",
          {},
          XML::Easy::Content->new([
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://search.cpan.org/" },
                  XML::Easy::Content->new([
                    "Search CPAN",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://blog.twoshortplanks.com/" },
                  XML::Easy::Content->new([
                    "Blog",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://schlockmercenrary.com/" },
                  XML::Easy::Content->new([
                    "Schlock",
                  ]),
                ),
                "",
              ]),
            ),
            "",
          ]),
        ),
        "",
      ]),
    ),
    "",
  ]),
);

So, we never ever write code like that! For starters we could use twines instead of content objects, but that's too verbose too. We use the functional interface presented by XML::Easy::NodeBasics instead:

use XML::Easy::NodeBasics qw(xe);

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      xe("li",
        xe("a", { href => "http://search.cpan.org/" }, "Search CPAN"),
      ),
      xe("li",
        xe("a", { href => "http://blog.twoshortplanks.com/" }, "Blog"),
      ),
      xe("li",
        xe("a", { href => "http://www.schlockmercenary.com/" }, "Schlock"),
      ),
    ),
  ),
);

The xe function simply takes a tag name followed by a list of things that are either hashrefs (containing attributes), strings (containing text,) or XML::Easy::Elements (containing nodes.) It can also take content objects and twines, which is handy when you're re-using fragments of XML that you've extracted from other documents you may have parsed. In short, it Does The Right Thing with whatever you throw at it. Of course, we can optomise further by knowing that this code is Perl:

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

And that's about it for basic XML parsing and generation with XML::Easy. There's a lot more handy functions and explantions of the theory behind XML::Easy in the documentation. In my next post I'm going to look at another way of creating XML using XML::Easy, when I talk about one of my own modules: XML::Easy::ProceduralWriter.

As Thick As Two Short Planks

Mark Fowler's Perl Blog

XML::Easy by Example

Parsing XML by Example

Generating XML by Example

- to blog -