Tell me more, tell me more

A new version of Test::DatabaseRow has just escaped onto the CPAN, with a minor new feature: verbose data, to make working out what went wrong in a database test even easier when the worst happens.

In a nutshell, this feature lets you output in the diagnostics all the results the database returned, rather than just listing the first thing that didn't match (which is what row_ok does by default, much as Test::More's is_deeply does.)

In other words instead of writing this:

    all_row_ok(
       table => "contacts",
       where => [ cid => 123 ],
       tests => [ name => "trelane" ],
       store_rows => \@rows,
       description => "contact 123's name is trelane"
     ) or diag explain \@rows;

You can just write this:

    all_row_ok(
       table => "contacts",
       where => [ cid => 123 ],
       tests => [ name => "trelane" ],
       verbose_data => 1,
       description => "contact 123's name is trelane"
     );

Or even just write this:

    all_row_ok(
       table => "contacts",
       where => [ cid => 123 ],
       tests => [ name => "trelane" ],
       description => "contact 123's name is trelane"
     );

And turn on verbosity from the command line when you run the tests:

TEST_DBROW_VERBOSE_DATA=1 perl mytest.pl

Mavericks, XCode 5 and WWW::Curl

Sometimes you go through a lot of pain trying to get something to work, and you just need to write it down and put it on the internet. You tell yourself this is useful; you're doing it so that others can find your words and solve the problem too. In reality, it's just cathartic. And who doesn't like a good rant?

So What Broke?

I recently updated my Mac to Mavericks (OS X 10.9). Overall, I like it…but it broke something I care about. It broke libcurl's handling of certificates, and in so doing also broke my install of WWW::Curl (since that's a wrapper around the system libcurl.) This is remarkably hard to diagnose, because it just seems like the servers have suddenly stopped trusting each other (when in reality libcurl has just started effectively ignoring the options you're passing to it.)

Now, WWW::Curl (and its working certificate handling) is a dependency of some of the code I'm using. Worse, it's not enough for me to disable certificate security: I actually need to identify myself to the server with my own cert, so just turning the checks off and running insecurely won't work.

Installing libcurl

I'm not a great fan of installing software from source on my Mac, after having been bitten by fink and its kin numerous times in the past. That said, homebrew is actually pretty darn nifty, and can be used to install libcurl:

brew install curl   

Wrrr, clunk, clunk and zap, I have a non-broken curl in /usr/local/Cellar/curl/7.33.0/bin. Hooray! This one actually deigns to listen to the command line options you pass it!

Installing WWW::Curl

Now all we have to do is install WWW::Curl and link it against this libcurl, right? Well, I did this:

wget http://cpan.metacpan.org/authors/id/S/SZ/SZBALINT/WWW-Curl-4.15.tar.gz
gunzip -c WWW-Curl-4.15.tar.gz | tar -xvf -
cd WWW-Curl-4.15
export CURL_CONFIG=/usr/local/Cellar/curl/7.33.0/bin/curl-config
perl Makefile.PL

And things started to go wrong:

The version is libcurl 7.33.0
Found curl.h in /usr/local/Cellar/curl/7.33.0/include/curl/curl.h
In file included from /usr/local/Cellar/curl/7.33.0/include/curl/curl.h:34:
In file included from /usr/local/Cellar/curl/7.33.0/include/curl/curlbuild.h:152:
In file included from /usr/include/sys/socket.h:80:
In file included from /usr/include/Availability.h:148:
/usr/include/AvailabilityInternal.h:4098:10: error: #else without #if
        #else
         ^
/usr/include/AvailabilityInternal.h:4158:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_6
         ^
/usr/include/AvailabilityInternal.h:4131:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_5
         ^
/usr/include/AvailabilityInternal.h:4108:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_4
         ^
4 errors generated.
Building curlopt-constants.c for your libcurl version
Building Easy.pm constants for your libcurl version
Building Share.pm constants for your libcurl version
Checking if your kit is complete...
Looks good
Writing Makefile for WWW::Curl
Writing MYMETA.yml and MYMETA.json

That, Ladies and Gentlemen, is the sound of Apple's compiler sucking. You see, not so long ago I upgraded to XCode 5, and apparently this no longer ships with gcc. Uh oh.

(I don’t show the errors you get when you make and make test, but be assured it’s all downhill from here)

Installing gcc

I suspect I could have just re-downloaded XCode 4 and then used xcode-select. But I didn’t. Let me know if that works, okay? This is what I actually did:

brew install gcc47

Then I waited a long time (while doing something else productive instead.) Finally, when it was done, I had to manually edit the Makefile.PL to use the right preprocessor:

perl -pi -e 's!cpp"!/usr/local/Cellar/gcc47/4.7.3/bin/cpp"!' Makefile.PL
perl Makefile.PL

(Yeah, it'd be nice if there'd been some sort of option for that.) Then I altered the resulting Makefile for good measure too:

perl -pi -e 's{CC = cc}{CC = /usr/local/Cellar/gcc47/4.7.3/bin/gcc-4.7}' Makefile

And then I could build it all.

make
make test

Of course, it still got errors. But they’re not new errors. So I pretend I didn’t see them and install anyway.

make install

I hate computers sometimes.

Under the Hood

Perl provides a high level of abstraction between you and the computer, allowing you to quickly write very expressive high level code that does a lot. Sometimes, however, when things don't go to plan or you want performance improvements, it's important to find out what's really going on at the lower levels and see what perl's doing "under the hood."

What Did perl Think I Said?

Sometimes when code doesn't do what you expect it's nice to see how the Perl interpreter understands your code, in case your understanding of Perl's syntax and perl's understanding of that same syntax differ. One way to do this is to use the B::Deparse module from the command line to regenerate Perl code from the internal representation perl built from your source code when it parsed it.

This is as simple as:

bash$ perl -MO=Deparse myscript.pl

One of my favourite options for B::Deparse is -p which tells it to put in an obsessive amount of brackets so you can see what precedence perl is applying:

bash$ perl -MO=Deparse,-p -le 'print $ARGV[0]+$ARGV[1]*$ARGV[2]'
BEGIN { $/ = "\n"; $\ = "\n"; }
print(($ARGV[0] + ($ARGV[1] * $ARGV[2])));
-e syntax OK

You'll even note there are two sets of brackets immediately after the print statement – one surrounding the addition and one enclosing the argument list to print. This means that B::Deparse can also be used to work out why the following script prints out 25 rather than 5:

bash$ perl -le 'print ($ARGV[0]**2+$ARGV[1]**2)**0.5' 3 4

The brackets we thought we were using to force precedence were actually parsed by perl as constraining what we were passing to print, meaning that the **0.5 was effectively ignored:

bash$ perl -MO=Deparse,-p -le 'print ($ARGV[0]**2+$ARGV[1]**2)**0.5' 3 4
BEGIN { $/ = "\n"; $\ = "\n"; }
(print((($ARGV[0] ** 2) + ($ARGV[1] ** 2))) ** 0.5);
-e syntax OK

What Does That Scalar Actually Contain?

A scalar is many things at once – it can actually hold a string, an integer, a floating point value and convert between them at will. We can see the internal structure with the Devel::Peek module:

use Devel::Peek;
my $foo = 2;
Dump($foo);

This prints

SV = IV(0x100813f78) at 0x100813f80
  REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)
  IV = 2

This tells you a lot about the object. It tells you it’s an int (an IV) and the value of that int is 2. You can see that it’s got one reference pointing to it (the $foo alias.) You can also see it’s got several flags set on it telling us which of the values stored in the object are still current (in this case, the IV, since it’s an IV)

$foo .= "";
Dump($foo);

This now prints:

SV = PVIV(0x100803c10) at 0x100813f80
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 2
  PV = 0x100208900 "2"
  CUR = 1
  LEN = 8

We gain PV flags (it’s a “pointer value” aka a string) and we also gain CUR (current string length) and LEN (total string length allocated before we need to re-alloc and copy the string.) The flags have changed to indicate that the PV value is now current too.

So we can tell a lot about the internal state of a scalar. Why would we care (assuming we're not going to be writing XS that has to deal with this kind of stuff)? Mainly I find myself reaching for Devel::Peek to print out the contents of strings whenever I have encoding issues.

Consider this:

my $acme = "L\x{e9}on";
Dump $acme;

On my system this shows that Léon was actually stored internally as a latin-1 byte sequence:

SV = PV(0x100801c78) at 0x100813f98
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100202550 "L\351on"
  CUR = 4
  LEN = 8

But it doesn’t have to be

utf8::upgrade($acme);
Dump($acme);

Now the internal bytes of the string are stored in utf8 (and the UTF8 flag is turned on)

SV = PV(0x100801c78) at 0x100813f98
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002010f0 "L\303\251on" [UTF8 "L\x{e9}on"]
  CUR = 5
  LEN = 6

As far as perl is concerned these are the same string:

my $acme  = "L\x{e9}on";
my $acme2 = $acme;
utf8::upgrade($acme);
say "Yep, this will be printed"
  if $acme eq $acme2;

In fact, perl may decide to switch between these two internal representations as you concatenate and manipulate your strings. This is not something you normally have to worry about until something goes wrong and you see something horrid being output:

LÃ©on

This is usually a sign that you've read in some bytes that were encoded as latin-1 and forgotten to use Encode (or you've done that twice!), or you've passed a UTF-8 string through a C library, or you had duff data to begin with (garbage in, garbage out.) Of course, you can't really start to work out which of these cases is true unless you look in the variable, and that's hard: you can't just print it out, because that will re-encode it with the binmode of that filehandle, and your terminal may do all kinds of weirdness with it. The solution, of course, is to Dump it out as above and see an ASCII representation of what's actually stored in memory.

How Much Memory Is That Using?

In general you don’t have to worry about memory in Perl – perl handles allocating and deallocating memory for you automatically. On the other hand, perl can’t magically give your computer an infinite amount of memory so you still have to worry that you’re using too much (especially in a webserver environment where you might be caching data between requests but running multiple Perl processes at the same time.) The Devel::Size module from the CPAN can be a great help here:

bash$ perl -E 'use Devel::Size qw(size); say size("a"x1024)'
1080

So in this case a string of 1024 “a” characters takes up the 1024 bytes for all the “a” characters plus 56 bytes for the internal scalar data structure (the exact size will vary slightly between versions of perl and across architectures.)

Devel::Size can also tell you how much memory nested data structures (and objects) are taking up:

perl -E 'use Devel::Size qw(total_size); say total_size({ z => [("a"x1024)x10] })'
11251

Be aware that Devel::Size will only report how much memory perl has allocated for you – not how much memory XS modules you’ve loaded into perl are taking up.

How Does perl Execute That?

Perl's interpreter (like those that run Python, Java, JavaScript, Ruby and many other languages) neither compiles your code to native machine instructions nor interprets the source code directly to execute it. Instead it compiles the code to a bytecode representation and then 'executes' those bytes on a virtual machine capable of understanding much higher level instructions than the processor in your computer.

When you're optimising your code, one of the most important things to do is reduce the number of "ops" (bytecode operations) that perl has to execute. This is because there's significant overhead in running the virtual machine itself, so the more you can get each Perl op to do the better, even if that op itself is more expensive to run.

For example, here's a one-liner that counts the number of "a" characters in its input by using the index function to repeatedly search for the next "a", incrementing a counter each time (index returns -1 when there are no more matches, so adding one produces zero – false – and ends the loop):

perl -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
3

Let’s look at what ops that program actually creates. This can be done with the B::Concise module that ships with perl:

bash$ perl -MO=Concise -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
l  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 42 -e:1) v:%,{ ->3
g     <@> leave vK* ->h
3        <0> enter v ->4
-        <1> null vKP/1 ->g
c           <|> and(other->d) vK/1 ->g
b              <2> sassign sKS/2 ->c
9                 <2> add[t7] sK/2 ->a
7                    <@> index[t6] sK/3 ->8
-                       <0> ex-pushmark s ->4
-                       <1> ex-aelem sK/2 ->5
-                          <1> ex-rv2av sKR/1 ->-
4                             <#> aelemfast[*ARGV] s ->5
-                          <0> ex-const s ->-
5                       <$> const[GV "a"] s ->6
-                       <1> ex-rv2sv sK/1 ->7
6                          <#> gvsv[*pos] s ->7
8                    <$> const[IV 1] s ->9
-                 <1> ex-rv2sv sKRM*/1 ->b
a                    <#> gvsv[*pos] s ->b
-              <@> lineseq vK ->-
e                 <1> preinc[t2] vK/1 ->f
-                    <1> ex-rv2sv sKRM/1 ->e
d                       <#> gvsv[*c] s ->e
f                 <0> unstack v ->4
h     <;> nextstate(main 42 -e:1) v:%,{ ->i
k     <@> say vK ->l
i        <0> pushmark s ->j
-        <1> ex-rv2sv sK/1 ->k
j           <#> gvsv[*c] s ->k

It's not important to understand this in any great detail; all we need worry about is firstly that it's very big for what we're trying to do, and secondly that it's looping, so those ops we can see are going to be executed multiple times.

Let's try an alternative approach, using the transliteration operator to translate all the "a" characters to "a" characters (so, do nothing) and return how many characters it 'changed':

bash$ perl -MO=Concise -E '$c = $ARGV[0] =~ tr/a/a/; say $c' aardvark
b  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 42 -e:1) v:%,{ ->3
6     <2> sassign vKS/2 ->7
-        <1> null sKS/2 ->5
-           <1> ex-aelem sK/2 ->4
-              <1> ex-rv2av sKR/1 ->-
3                 <#> aelemfast[*ARGV] s ->4
-              <0> ex-const s ->-
4           <"> trans sS/IDENT ->5
-        <1> ex-rv2sv sKRM*/1 ->6
5           <#> gvsv[*c] s ->6
7     <;> nextstate(main 42 -e:1) v:%,{ ->8
a     <@> say vK ->b
8        <0> pushmark s ->9
-        <1> ex-rv2sv sK/1 ->a
9           <#> gvsv[*c] s ->a

Ah! Far fewer ops! And no loops! This is because the call to tr is a single op, meaning this whole thing is much faster. Of course, don't take my word for it – run a benchmark:

#!/usr/bin/perl

use Benchmark qw(cmpthese);

cmpthese(10_000_000, {
 'index' => sub { my $c; my $pos; $c++ while $pos = index($ARGV[0], "a", $pos) + 1 },
 'tr'    => sub { my $c; $c = $ARGV[0] =~ tr/a/a/ },
});

bash$ ./benchmark.pl aardvark
           Rate index    tr
index 2439024/s    --  -39%
tr    4016064/s   65%    --

And finally

This is just a smattering of the modules that can help poke around inside the internals of Perl - practically the national sport of the Programming Republic of Perl. The CPAN contains a very large number of modules that can do all kinds of clever things - try looking on the CPAN for "B::" and "Devel::" modules.

RationalizeNamespacePrefixes

Today I’m going to talk about parsing documents that use XML Namespaces with XML::Easy. While XML::Easy doesn’t (by design) ship with its own XML Namespace aware parser, one of my modules XML::Easy::Transform::RationalizeNamespacePrefixes makes parsing documents that use namespaces a doddle with just one simple function call.

The problems namespaces solve and cause

XML Namespaces is an extension of the XML 1.0 specification that allows multiple standards to cooperate so they don't use the same names for their nodes, meaning it's possible to use more than one specification in the same document at the same time without conflict. For example, here's an XML document that uses two different made-up specs at the same time to describe a pub lunch, each using the tag "chips" to mean a different thing:

<order
  xmlns:prepackaged="http://twoshortplanks.com/ns/example/behindthebar" 
  xmlns:grub="http://twoshortplanks.com/ns/example/food">
  <grub:meal >
    <grub:beefburger/>
    <grub:chips/>
  </grub:meal>
  <prepackaged:chips type="Pringles" />
</order>

So the way the XML Namespace specification works is by using a convention of naming nodes with an extra prefix. This allows what would otherwise be the same name in the same document to have a different semantic meaning. For example the "chips" nodes are written as "prepackaged:chips" when they're referring to crisps, and "grub:chips" when they're referring to fries. The clever bit of XML Namespaces is that it doesn't matter what prefixes you use to differentiate the two from each other, but what namespace URLs they map to. For example, this document is considered to be essentially identical to the previous example as far as a namespace-aware XML parser is concerned:

<order>
  <meal xmlns="http://twoshortplanks.com/ns/example/food">
    <beefburger/>
    <chips/>
  </meal>
  <barsnack:chips xmlns:barsnack="http://twoshortplanks.com/ns/example/behindthebar" type="Pringles" />
</order>

The meaning of the prefix is entirely derived from the presence of the xmlns-prefixed attributes on the node (or on a parent node) mapping the prefix to a URL¹. This is both great and a complete nightmare: great, since by mapping an arbitrary prefix to the unique namespace URL you're not going to get conflicts with other specifications (the way you would if each specification defined its own prefix.) And a complete nightmare, because you don't know what the thing you're looking for is actually called - either your code or the parser has to keep track of what namespaces are declared in the current scope and what prefixes map to what namespaces.

Using XML::Easy::Transform::RationalizeNamespacePrefixes

What would be great is if there were some way you could force everyone who gives you a document to use the prefixes you'd like. Then you'd know what everything was called, and instead of having to worry about all these xmlns:whatever attributes in the document (and which nodes were where in the tree in relation to them) you could just look for all the "beverage:lager" nodes.

Well, we can't force other people to do what we want, but what we can do is make use of the fact that the prefixes are arbitrary and the same document with any prefixes means the same thing. We can therefore just rewrite whatever document we're given into a form we'd like to deal with before we process it. This is the task XML::Easy::Transform::RationalizeNamespacePrefixes was designed for - it rationalises the prefixes of the namespaces to whatever you want. For example, forcing the "kitchen" and "barstaff" prefixes for the two namespaces:

my $old_doc = xml10_read_document($string_of_xml);
my $new_doc = rationalize_namespace_prefixes($old_doc, {
  namespaces => {
    "http://twoshortplanks.com/ns/example/food" => "kitchen",
    "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
  },
  force_attribute_prefix => 1,
});

Now if you feed either of the above documents to the code, you'll have an in memory representation of the following document:

<order
  xmlns:barstaff="http://twoshortplanks.com/ns/example/behindthebar" 
  xmlns:kitchen="http://twoshortplanks.com/ns/example/food">
  <kitchen:meal >
    <kitchen:beefburger/>
    <kitchen:chips/>
  </kitchen:meal>
  <barstaff:chips barstaff:type="Pringles" />
</order>

Several important transformations have happened:

  • It used the namespace/prefix mapping that we passed in via namespaces to rename all the corresponding nodes in the document to have whatever prefixes we want. This means we now know, without looking at the xmlns attributes, what our nodes will be called.

  • All the namespaces have been moved to the top element of the document. In this example the module didn't need to introduce any further prefixes to do this (which can happen if the same prefix is used to refer to different URLs in different parts of the tree) nor condense prefixes to a single prefix per namespace (which happens if multiple prefixes refer to the same URL) but if it had to do that, it would have. This means it's really easy to find other namespaces that are defined in our document - you just look for xmlns attributes at the top element.
  • The force_attribute_prefix option forces prefixes to be attached to attribute names too.

Now we can parse the document without worrying about the namespaces at all. If we want to find all the packets of pre-prepared food in the document:

use feature qw(say);
use XML::Easy::Text qw(xml10_read_document);
use XML::Easy::Classify qw(is_xml_element);
use XML::Easy::NodeBasics qw(xe_twine);
use XML::Easy::Transform::RationalizeNamespacePrefixes qw(rationalize_namespace_prefixes);

sub packets {
  my $element = shift;
  return unless is_xml_element($element);
  my @return;
  push @return, $element->attribute("barstaff:type") if $element->type_name eq "barstaff:chips";
  push @return, map { packets($_) } @{ xe_twine($element) };
  return @return;
}

say "We need the following packets:";
say " * $_" for packets(
  rationalize_namespace_prefixes(
    xml10_read_document($string_of_xml), {
      namespaces => {
        "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
      },
      force_attribute_prefix => 1,
    }
  )
);

There's more information on XML::Easy::Transform::RationalizeNamespacePrefixes's search.cpan.org page².

And that concludes my mini-series into looking into XML::Easy. I'm sure to write more about it in the future as more interesting uses and extensions are written for it, but in my next entry I'll be taking a break from the pointy brackets!


[1] I've used the term URL multiple times in this document when I should have really used URI. We're using the http:// thingy wosit to Identify a Unique Reference, so it should be a URI, rather than a Universal Resource Location, because there's no resource to locate at that address. It's just a unique name.

[2] Please note that this blog was originally posted close in time to when the new version of XML::Easy::Transform::RationalizeNamespacePrefixes was uploaded to the CPAN, so not all features described in this post may have reached your local CPAN mirror if you're reading it "hot off the presses".


XML::Easy::ProceduralWriter

In this post I want to take a further look at writing code that outputs XML with XML::Easy, and how the example from my previous blog entry can be improved upon by using XML::Easy::ProceduralWriter.

What’s wrong with the previous example?

When we left things in my previous post we were looking at this snippet of code that outputs a simple XML webpage:

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

print {$fh} xml10_write_document($root_element);

The above code produces exactly the output we want, but it doesn’t necessarily go about the best way of producing it.

The first problem is that it's using Tie::IxHash to ensure that the keys of the hash (and thus the nodes in the resulting XML) come out in the right order rather than in a random order as with ordinary hashes. Tied data structures are much slower than normal data structures, and using one here is a big performance hit. However, in this case we have to tie, because it's hard to write the logic to process a normal array two elements at a time inline in the map statement in a readable way.

Which brings us to the second problem, also related to the map statement - it's somewhat unwieldy to write and hard to read (you have to scan to the end of the line to work out that it's using %hash for its keys.) This only gets worse as you have to produce more complex XML and you try to use further (possibly nested) map statements and ternary logic expressions to build up even more complex data structures - which is every bit as messy to do as it is to explain.

Both issues stem from trying to build the XML::Easy::Element tree all in one go, essentially in one statement as a single assignment. If we choose not to restrict ourselves in this way we can easily re-order the code to use a temporary variable and do away with both the tie and the map:

my @data = (
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
);

my @links;
while (@data) {
  my $url = shift @data;
  my $text = shift @data;
  push @links, xe("li", xe("a", { href => $url }, $text) );
}

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul", @links),
  ),
);

print {$fh} xml10_write_document($root_element);

The problem with this solution is that now we've ended up with code that's backwards. We're creating the list elements and then creating the node that encloses them. Now we have to read to the bottom of the code to work out that we're creating an HTML document at all!

Introducing XML::Easy::ProceduralWriter

To solve this problem I wrote XML::Easy::ProceduralWriter, a module that allows you to write your code in a procedural fashion but without having to “code in reverse”.

Here’s the above example re-written again, this time using XML::Easy::ProceduralWriter:

use XML::Easy::ProceduralWriter;

print {$fh} xml_bytes {

  element "html", contains {
  
    element "head", contains {
      element "title", contains {
        text "My Links";
      };
    };
    
    element "body", contains {
      element "ul", contains {
         my @data = (
           "http://search.cpan.org/" => "Search CPAN",
           "http://blog.twoshortplanks.com" => "Blog",
           "http://www.schlockmercenary.com" => "Schlock",
         );
         
         while (@data) {
           element "li", contains {
             element "a", href => shift @data, contains {
               text shift @data;
             };
           };
         }
      };
    };
  };

};

Using the module is straightforward. You start by calling either xml_element (which returns an XML::Easy::Element) or xml_bytes (which returns a string of bytes you can print out), and to these calls you pass some code that generates XML elements and text. Each element can 'contain' further code that produces the sub-elements and text that element contains, and so on.

The key thing to notice is that unlike the previous examples, where you were passing data structures into the functions, here you're passing code to be executed. This means you can place arbitrary logic in what you pass in, and you're not limited to single statements. For example, in the above code we declare variables in the middle of generating the XML. The conceptual jump is realising that what the blocks of code and the element and text functions return isn't important - the side effects of calling those two functions are. The simplest way to think about it is to imagine the string being built up as the element and text statements are encountered, in much the same way output is immediately printed to the filehandle when you use print (even though technically this isn't the case here - a full XML::Easy::Element object tree is always actually built in the background.)

The documentation for XML::Easy::ProceduralWriter contains a reasonable tutorial that explains its usage in more detail, but from just reading the above code it should be pretty straightforward to jump straight in.

And that’s pretty much all I have to say about outputting XML with XML::Easy. In my next post we’ll look instead at advanced parsing and how to cope with documents with XML Namespace declarations.

XML::Easy by Example

Last week I posted about why you should be interested in the new XML parsing library on the block, XML::Easy. In this post I’m going to actually dive into some code so you can see what it’s like to work with.

Parsing XML by Example

The basics of parsing are pretty straightforward:

use XML::Easy::Text qw(xml10_read_document);

# read in the document
open my $fh, "<:utf8", "somexml.xml"
  or die "Can't read filehandle: $!";
my $string = do { local $/; <$fh> };

# parse it
my $root_element = xml10_read_document($string);

Now $root_element contains an XML::Easy::Element. Getting basic facts out of this element such as its name or attribute values is easy too:

say "the name of the root element is ".$root_element->type_name;
say "the content of the href attribute is ".$root_element->attribute("href")
  if defeined $root_element->attribute("href");

Getting at the child elements involves dealing with a twine. What's a twine, you say? Why, it's nothing more than an alternating list of strings and elements. Let's look at an example to help explain this:

my $input = '<p>Hello my <i>friend</i>, here is my picture: <img src="http://farm1.static.flickr.com/116/262065452_6017d39626_t.jpg" /></p>';

We can then call this:

my $p = xml10_read_document($input);
my $twine = $p->content_twine;

The $twine variable now contains an array reference holding alternating strings and XML::Easy::Elements:

  • $twine->[0] contains the string “Hello my”
  • $twine->[1] contains an XML::Easy::Element representing the <i> tag (which in turn will contain the text “friend”)
  • $twine->[2] contains the string “, here is my picture: ”
  • $twine->[3] contains an XML::Easy::Element representing the <img> tag
  • $twine->[4] contains the empty string “” between the <img> tag and the closing </p> tag

The important thing to remember about twines is that they always alternate string-element-string-element. When two elements are next to each other in the source document they're separated by the empty string. You'll note that the first and last items of a twine are always strings, even if they have to be empty, and an "empty" tag has a twine that contains just one item – the empty string.

Now we know the basics, let’s look at a practical example. Imagine we want to get all the possible anchors (internal links) in an XHTML document. This simply involves looking for all the <a> tags that have a name attribute:

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  my $swizzle = 0;
  foreach (@{ $element->content_twine() }) {

    # skip every other array element because it's a string
    next if $swizzle = !$swizzle;

    # recurse into the child nodes
    push @results, get_links($_);
  }

  return @results;
}

If we want to make this even easier on ourselves there's a bunch of helper functions in the XML::Easy::Classify module that can be used to help process parts of XML documents. For example, we could have written the above code in a more terse (but less efficient) way by using is_xml_element:

use XML::Easy::Classify qw(is_xml_element);

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  push @results, get_links($_)
    foreach grep { is_xml_element $_ } @{ $element->content_twine() };

  return @results;
}
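Either version can be called with the root element of a parsed document. Here's a quick sketch, assuming the get_links routine above is in scope (the XHTML fragment is invented for illustration):

```perl
use XML::Easy::Text qw(xml10_read_document);

# parse a small XHTML fragment and collect its anchor names
my $root = xml10_read_document(
  '<body><a name="top">Top</a><div><a name="middle">Middle</a></div></body>'
);

# recurses through the whole tree, returning every name attribute
# found on an <a> element
my @anchors = get_links($root);
# @anchors now holds ("top", "middle")
```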

Generating XML by Example

If you've got an XML::Easy::Element instance, writing it out as an XML document is just the opposite of reading it in:

use XML::Easy::Text qw(xml10_write_document);

# turn it into a string
my $string = xml10_write_document($root_element);

# write out the document
open my $fh, ">:utf8", "somexml.xml"
  or die "Can't write to filehandle: $!";
print {$fh} $string;

One of the first things you have to know about XML::Easy::Elements and their contents is that they are immutable – put another way, you can't change them once they're created. This means they have no methods for setting the name of an element, altering the attributes, or setting the children. All of these must be passed in via the constructor.

Let's just jump in with an example. We're going to create a little code that outputs the following XML document:

<html>
   <head><title>My Links</title></head>
   <body>
     <h1>Links</h1>
     <ul>
       <li><a href="http://search.cpan.org/">Search CPAN</a></li>
       <li><a href="http://blog.twoshortplanks.com/">Blog</a></li>
       <li><a href="http://www.schlockmercenary.com/">Schlock</a></li>
     </ul>
   </body>
</html>

(I've added extra whitespace in the above example for clarity - the code examples that follow won't reproduce this whitespace)

I'm going to start off by showing you the most verbose and explicit object-oriented way to create XML::Easy::Elements, and then I'm going to show you the much quicker functional interface once you know what you're doing. The verbose way of creating an element is to explicitly pass each of its parts to the constructor:

XML::Easy::Element->new($name, $attributes_hashref, $xml_easy_content_instance)

The trouble with using such code is that it often requires pages and pages of code that puts Java to shame in its repetition of the obvious (you don't really need to read the following code – just gawk at its length):

my $root_element = XML::Easy::Element->new("html",
  {},
  XML::Easy::Content->new([
    "",
    XML::Easy::Element->new("head",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("title",
          {},
          XML::Easy::Content->new([
            "My Links",
          ])
        ),
        "",
      ]),
    ),
    "",
    XML::Easy::Element->new("body",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("h1",
          {},
          XML::Easy::Content->new([
            "Links",
          ])
        ),
        "",
        XML::Easy::Element->new("ul",
          {},
          XML::Easy::Content->new([
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://search.cpan.org/" },
                  XML::Easy::Content->new([
                    "Search CPAN",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://blog.twoshortplanks.com/" },
                  XML::Easy::Content->new([
                    "Blog",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://www.schlockmercenary.com/" },
                  XML::Easy::Content->new([
                    "Schlock",
                  ]),
                ),
                "",
              ]),
            ),
            "",
          ]),
        ),
        "",
      ]),
    ),
    "",
  ]),
);

So, we never ever write code like that! For starters we could use twines instead of content objects, but that's too verbose too. We use the functional interface presented by XML::Easy::NodeBasics instead:

use XML::Easy::NodeBasics qw(xe);

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      xe("li",
        xe("a", { href => "http://search.cpan.org/" }, "Search CPAN"),
      ),
      xe("li",
        xe("a", { href => "http://blog.twoshortplanks.com/" }, "Blog"),
      ),
      xe("li",
        xe("a", { href => "http://www.schlockmercenary.com/" }, "Schlock"),
      ),
    ),
  ),
);

The xe function simply takes a tag name followed by a list of things that are either hashrefs (containing attributes), strings (containing text), or XML::Easy::Elements (containing nodes). It can also take content objects and twines, which is handy when you're re-using fragments of XML that you've extracted from other documents you may have parsed. In short, it Does The Right Thing with whatever you throw at it.
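For instance, here's a small sketch of that re-use case: wrapping a fragment parsed from elsewhere inside freshly built markup (the input string is invented for illustration):

```perl
use XML::Easy::NodeBasics qw(xe);
use XML::Easy::Text qw(xml10_read_document);

# a fragment we might have parsed from some other document
my $parsed = xml10_read_document('<span>existing <b>markup</b></span>');

my $wrapped = xe("div",
  "A copy of the whole element: ",
  $parsed,                           # an XML::Easy::Element
  xe("p", $parsed->content_twine),   # or just its contents, as a twine
);
```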

Of course, we can optimise further by remembering that this code is Perl:

use Tie::IxHash;

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/"          => "Search CPAN",
  "http://blog.twoshortplanks.com/"  => "Blog",
  "http://www.schlockmercenary.com/" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

And that's about it for basic XML parsing and generation with XML::Easy. There are many more handy functions and explanations of the theory behind XML::Easy in the documentation. In my next post I'm going to look at another way of creating XML using XML::Easy, when I talk about one of my own modules: XML::Easy::ProceduralWriter.

Introducing XML::Easy

Some days, you just want to parse an XML document.

However, the standard distribution of Perl doesn’t ship with a bundled XML parser, traditionally instead requiring the user to install a module from CPAN. This means there’s no standard way to do this. Instead there are several choices of parser, each with their advantages and disadvantages: There is, as we often say in Perl, more than one way to do it. This is the first post in a series where I’m going to talk about XML::Easy, a relatively new XML parsing module that deserves a little more publicising.

But why another XML parsing library? What’s wrong with the others? Well, a few things…

One of the biggest problems with the most popular XML parsing modules like XML::LibXML and XML::Parser is that they rely on external C dependencies being installed on your system (libxml2 and expat respectively), so it can be hard to rely on them being installable on any old system. Suppose you write some software that relies on these modules. What exactly are you asking of the users of your software who have to install that module as a dependency? You’re asking them firstly to have a C compiler installed – something people using ActiveState Perl, basic web-host providers, or even Mac OS X users without the dev tools do not have. Even more than this, you’re often asking them to download and install (either by hand or via their package management system) the external C libraries that these modules rely on, and then know how to configure the C compiler to link against them. Complicated!

To solve this, XML::Easy ships with a pure Perl XML parser requiring neither external libraries nor a C compiler to install: in a pinch you can simply copy the Perl modules into your library path and you’re up and ready to go. This means that this library can be relied on pretty much anywhere.

The observant will point out that there are many existing pure Perl XML parsing libraries on CPAN. They suffer from another problem: They’re slow. Perl runs not as native instructions but as interpreted bytecode executing on a virtual machine, which is a technical way of saying “in a way that makes lots of very simple operations slow.” This is why the boring details of XML parsing are normally handled in C space.

Luckily, XML::Easy doesn’t use its pure Perl parser unless it really has to. On systems that do have a working C compiler, it prefers to compile and install its own C code for parsing XML. Note that this C code, bound into the perl interpreter with fast XS, is wholly self-contained and doesn’t rely on external libraries. All the user on a modern Linux system has to do to install the module is type cpan XML::Easy at the command prompt. In this mode XML::Easy is fast: In our tests it’s at least as fast as XML::LibXML (which is to say, very fast indeed.) This week I’ve been re-writing some code that used to use MkDoc::XML to use XML::Easy and the new code is 600 (yes, six hundred) times faster.

This is great news for module authors who just want to do something simple with fast performance if they can get it, but don’t want to have to worry about putting too much of a burden on their users.

Of course, this would all be for naught if XML::Easy didn’t do a good job of parsing XML – but it does. The other big failing of the many so-called XML parsers for Perl is that they screw up the little but important things. They miss part of the specification (sometimes even deliberately!) or they don’t let you do things properly like handle unicode. XML::Easy isn’t like this: It follows the specification quite carefully (with the devotion I’ve come to expect from its author, my co-worker Zefram) and doesn’t screw up unicode because it doesn’t attempt to handle character encodings itself but embraces and works with Perl’s own unicode handling.

So by now, I’ll have either sold you on the idea of XML::Easy or not, but I haven’t really shown you how to use it. In the next post in this series I’m going to start talking about how you can use XML::Easy to parse XML and extract which bits you want.
