Mavericks, XCode 5 and WWW::Curl

Sometimes you go through a lot of pain trying to get something to work, and you just need to write it down and put it on the internet. You tell yourself this is useful; you’re doing it so that others can find your words and solve the problem too. In reality, it’s just cathartic. And who doesn’t like a good rant?

So What Broke?

I recently updated my Mac to Mavericks (OS X 10.9). Overall, I like it…but it broke something I care about. It broke libcurl’s handling of certificates, and in so doing it broke my install of WWW::Curl too (since that’s a wrapper around the system libcurl). This is remarkably hard to diagnose, because it just seems like the servers have suddenly stopped trusting each other (when in reality libcurl has just started effectively ignoring the options you’re passing to it).

Now, WWW::Curl (and its working certificate handling) is a dependency of some of the code I’m using. Worse, I can’t just turn certificate checking off and run insecurely: I actually need to identify myself to the server with my own cert.
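
For context, the kind of thing I need working looks roughly like this (just a sketch: the paths and URL are placeholders, but the options are standard libcurl ones that WWW::Curl::Easy exposes):

use WWW::Curl::Easy;

my $curl = WWW::Curl::Easy->new;
$curl->setopt(CURLOPT_URL,     "https://example.com/secure");   # placeholder URL
$curl->setopt(CURLOPT_SSLCERT, "/path/to/my-client-cert.pem");  # identify myself
$curl->setopt(CURLOPT_CAINFO,  "/path/to/ca-bundle.crt");       # and trust the server

my $retcode = $curl->perform;
die "curl failed: " . $curl->strerror($retcode) . "\n" if $retcode;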

Installing libcurl

I’m not a great fan of installing software from source on my Mac, having been bitten by fink and its kin numerous times in the past. That said, homebrew is actually pretty darn nifty, and can be used to install libcurl:

brew install curl   

Wrrr, clunk, clunk and zap, I have a non-broken curl in /usr/local/Cellar/curl/7.33.0/bin. Hooray! This one actually deigns to listen to the command line options you pass it!

Installing WWW::Curl

Now all we have to do is install WWW::Curl and link it against this libcurl, right? Well, I did this:

wget http://cpan.metacpan.org/authors/id/S/SZ/SZBALINT/WWW-Curl-4.15.tar.gz
gunzip -c WWW-Curl-4.15.tar.gz | tar -xvf -
cd WWW-Curl-4.15
export CURL_CONFIG=/usr/local/Cellar/curl/7.33.0/bin/curl-config
perl Makefile.PL

And things started to go wrong:

The version is libcurl 7.33.0
Found curl.h in /usr/local/Cellar/curl/7.33.0/include/curl/curl.h
In file included from /usr/local/Cellar/curl/7.33.0/include/curl/curl.h:34:
In file included from /usr/local/Cellar/curl/7.33.0/include/curl/curlbuild.h:152:
In file included from /usr/include/sys/socket.h:80:
In file included from /usr/include/Availability.h:148:
/usr/include/AvailabilityInternal.h:4098:10: error: #else without #if
        #else
         ^
/usr/include/AvailabilityInternal.h:4158:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_6
         ^
/usr/include/AvailabilityInternal.h:4131:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_5
         ^
/usr/include/AvailabilityInternal.h:4108:10: error: unterminated conditional directive
        #if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_4
         ^
4 errors generated.
Building curlopt-constants.c for your libcurl version
Building Easy.pm constants for your libcurl version
Building Share.pm constants for your libcurl version
Checking if your kit is complete...
Looks good
Writing Makefile for WWW::Curl
Writing MYMETA.yml and MYMETA.json

That, ladies and gentlemen, is the sound of Apple’s compiler sucking. You see, not so long ago I upgraded to XCode 5, and apparently it no longer ships with gcc. Uh oh.

(I don’t show the errors you get when you make and make test, but rest assured it’s all downhill from here.)

Installing gcc

I suspect I could have just re-downloaded XCode 4 and then used xcode-select. But I didn’t. Let me know if that works, okay? This is what I actually did:

brew install gcc47

Then I waited a long time (while doing something else productive instead). Finally, when it was done, I had to manually edit the Makefile.PL to use the right preprocessor:

perl -pi -e 's!cpp"!/usr/local/Cellar/gcc47/4.7.3/bin/cpp"!' Makefile.PL
perl Makefile.PL

(Yeah, it’d be nice if there’d been some sort of option for that.) Then I altered the resulting Makefile for good measure too:

perl -pi -e 's{CC = cc}{CC = /usr/local/Cellar/gcc47/4.7.3/bin/gcc-4.7}' Makefile

And then I could build it all.

make
make test

Of course, it still got errors. But they’re not new errors. So I pretend I didn’t see them and install anyway.

make install

I hate computers sometimes.

Under the Hood

Perl provides a high level of abstraction between you and the computer, allowing you to quickly write very expressive, high-level code that does a lot. Sometimes, however, when things don’t go to plan or you want performance improvements, it’s important to find out what’s really going on at the lower levels and discover what perl’s doing “under the hood.”

What Did perl Think I Said?

Sometimes when code doesn’t do what you expect, it’s nice to see how the Perl interpreter understands your code, in case your understanding of Perl’s syntax and perl’s understanding of that same syntax differ. One way to do this is to use the B::Deparse module from the command line to regenerate Perl code from the internal representation perl built from your source code when it parsed it.

This is as simple as:

bash$ perl -MO=Deparse myscript.pl

One of my favourite options for B::Deparse is -p which tells it to put in an obsessive amount of brackets so you can see what precedence perl is applying:

bash$ perl -MO=Deparse,-p -le 'print $ARGV[0]+$ARGV[1]*$ARGV[2]'
BEGIN { $/ = "\n"; $\ = "\n"; }
print(($ARGV[0] + ($ARGV[1] * $ARGV[2])));
-e syntax OK

You’ll even note there are two sets of brackets immediately after the print statement – one surrounding the addition and one enclosing the argument list to print. This means B::Deparse can also be used to work out why the following script prints out 25 rather than 5:

bash$ perl -le 'print ($ARGV[0]**2+$ARGV[1]**2)**0.5' 3 4

The brackets we thought we were using to force precedence were actually parsed by perl as constraining what we were passing to print, meaning that the **0.5 was silently ignored:

bash$ perl -MO=Deparse,-p -le 'print ($ARGV[0]**2+$ARGV[1]**2)**0.5' 3 4
BEGIN { $/ = "\n"; $\ = "\n"; }
(print((($ARGV[0] ** 2) + ($ARGV[1] ** 2))) ** 0.5);
-e syntax OK

What Does That Scalar Actually Contain?

A scalar is many things at once – it can actually hold a string, an integer, a floating point value and convert between them at will. We can see the internal structure with the Devel::Peek module:

use Devel::Peek;
my $foo = 2;
Dump($foo);

This prints

SV = IV(0x100813f78) at 0x100813f80
  REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)
  IV = 2

This tells you a lot about the scalar. It tells you it’s an int (an IV) and that the value of that int is 2. You can see that it’s got one reference pointing to it (the $foo alias). You can also see that it’s got several flags set, telling us which of the values stored in the structure are current (in this case the IV, since it’s an IV).

$foo .= "";
Dump($foo);

This now prints:

SV = PVIV(0x100803c10) at 0x100813f80
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 2
  PV = 0x100208900 "2"
  CUR = 1
  LEN = 8

We gain a PV (it’s a “pointer value”, aka a string) along with CUR (current string length) and LEN (total string length allocated before we need to re-alloc and copy the string). The flags have changed to indicate that the PV value is now the current one.
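
You can watch CUR and LEN drift apart as the string grows. Continuing the example (exact LEN values vary between perl builds):

$foo .= "345";
Dump($foo);   # CUR is now 4; LEN stays at 8, so no re-alloc was needed yet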

So we can tell a lot about the internal state of a scalar. Why would we care (assuming we’re not going to be writing XS that has to deal with this kind of stuff)? Mainly, I find myself reaching for Devel::Peek to print out the contents of strings whenever I have encoding issues.

Consider this:

my $acme = "L\x{e9}on";
Dump $acme;

On my system this shows that Léon was actually stored internally as a latin-1 byte sequence:

SV = PV(0x100801c78) at 0x100813f98
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100202550 "L\351on"
  CUR = 4
  LEN = 8

But it doesn’t have to be:

utf8::upgrade($acme);
Dump($acme);

Now the internal bytes of the string are stored in utf8 (and the UTF8 flag is turned on)

SV = PV(0x100801c78) at 0x100813f98
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002010f0 "L\303\251on" [UTF8 "L\x{e9}on"]
  CUR = 5
  LEN = 6

As far as perl is concerned these are the same string:

my $acme  = "L\x{e9}on";
my $acme2 = $acme;
utf8::upgrade($acme);
say "Yep, this will be printed"
  if $acme eq $acme2;

In fact, perl may decide to switch between these two internal representations as you concatenate and manipulate your strings. This is not something you normally have to worry about until something goes wrong and you see something horrid being output:

LÃ©on

This is usually a sign that you’ve read in some bytes that were encoded as latin-1 and forgotten to use Encode (or that you’ve decoded them twice!), or you’ve passed a UTF-8 string through a C library, or you had duff data to begin with (garbage in, garbage out). Of course, you can’t really start to work out which of these cases is true unless you look in the variable, and that’s hard: you can’t just print it out, because that will re-encode it with the binmode of that filehandle, and your terminal may do all kinds of weirdness with it. The solution, of course, is to Dump it out as above and see an ASCII representation of what’s actually stored in memory.
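
For example, here’s a minimal sketch (with the broken string constructed by hand for illustration) of what the classic double-encoding bug looks like under Dump:

use Devel::Peek;

my $good = "L\x{e9}on";         # what we meant to store
my $bad  = "L\x{c3}\x{a9}on";   # its UTF-8 bytes wrongly read back as latin-1

Dump($bad);   # PV shows "L\303\251on" with no UTF8 flag: the tell-tale signature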

How Much Memory Is That Using?

In general you don’t have to worry about memory in Perl – perl handles allocating and deallocating memory for you automatically. On the other hand, perl can’t magically give your computer an infinite amount of memory, so you still have to worry that you’re using too much (especially in a webserver environment where you might be caching data between requests but running multiple Perl processes at the same time). The Devel::Size module from the CPAN can be a great help here:

bash$ perl -E 'use Devel::Size qw(size); say size("a"x1024)'
1080

So in this case a string of 1024 “a” characters takes up the 1024 bytes for all the “a” characters plus 56 bytes for the internal scalar data structure (the exact size will vary slightly between versions of perl and across architectures.)

Devel::Size can also tell you how much memory nested data structures (and objects) are taking up:

bash$ perl -E 'use Devel::Size qw(total_size); say total_size({ z => [("a"x1024)x10] })'
11251

Be aware that Devel::Size will only report how much memory perl has allocated for you – not how much memory XS modules you’ve loaded into perl are taking up.

How Does perl Execute That?

Perl’s interpreter (like those that run Python, Java, JavaScript, Ruby and many other languages) neither compiles your code to native machine instructions nor interprets the source code directly to execute it. Instead it compiles the code to a bytecode representation and then ‘executes’ those bytes on a virtual machine capable of understanding much higher-level instructions than the processor in your computer.

When you’re optimising your code, one of the most important things to do is reduce the number of “ops” (bytecode operations) that perl has to execute. This is because there’s significant overhead in actually running the virtual machine itself, so the more you can get each Perl op to do the better, even if that op itself is more expensive to run.

For example, here’s a one-liner that counts the number of “a” characters in its argument by using index to repeatedly search for the next “a”, increasing a counter each time one is found:

bash$ perl -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
3

Let’s look at what ops that program actually creates. This can be done with the B::Concise module that ships with perl:

bash$ perl -MO=Concise -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
l  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 42 -e:1) v:%,{ ->3
g     <@> leave vK* ->h
3        <0> enter v ->4
-        <1> null vKP/1 ->g
c           <|> and(other->d) vK/1 ->g
b              <2> sassign sKS/2 ->c
9                 <2> add[t7] sK/2 ->a
7                    <@> index[t6] sK/3 ->8
-                       <0> ex-pushmark s ->4
-                       <1> ex-aelem sK/2 ->5
-                          <1> ex-rv2av sKR/1 ->-
4                             <#> aelemfast[*ARGV] s ->5
-                          <0> ex-const s ->-
5                       <$> const[GV "a"] s ->6
-                       <1> ex-rv2sv sK/1 ->7
6                          <#> gvsv[*pos] s ->7
8                    <$> const[IV 1] s ->9
-                 <1> ex-rv2sv sKRM*/1 ->b
a                    <#> gvsv[*pos] s ->b
-              <@> lineseq vK ->-
e                 <1> preinc[t2] vK/1 ->f
-                    <1> ex-rv2sv sKRM/1 ->e
d                       <#> gvsv[*c] s ->e
f                 <0> unstack v ->4
h     <;> nextstate(main 42 -e:1) v:%,{ ->i
k     <@> say vK ->l
i        <0> pushmark s ->j
-        <1> ex-rv2sv sK/1 ->k
j           <#> gvsv[*c] s ->k

It’s not important to understand this in any great detail; all we need to worry about is that, firstly, it’s very big for what we’re trying to do and, secondly, it’s looping, so the ops we can see are going to be executed multiple times.

Let’s try an alternative approach, using the transliteration operator to translate all the “a” characters to “a” characters (so, do nothing) and return how many characters it ‘changed’:

bash$ perl -MO=Concise -E '$c = $ARGV[0] =~ tr/a/a/; say $c' aardvark
b  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 42 -e:1) v:%,{ ->3
6     <2> sassign vKS/2 ->7
-        <1> null sKS/2 ->5
-           <1> ex-aelem sK/2 ->4
-              <1> ex-rv2av sKR/1 ->-
3                 <#> aelemfast[*ARGV] s ->4
-              <0> ex-const s ->-
4           <"> trans sS/IDENT ->5
-        <1> ex-rv2sv sKRM*/1 ->6
5           <#> gvsv[*c] s ->6
7     <;> nextstate(main 42 -e:1) v:%,{ ->8
a     <@> say vK ->b
8        <0> pushmark s ->9
-        <1> ex-rv2sv sK/1 ->a
9           <#> gvsv[*c] s ->a

Ah! Far fewer ops! And no loops! This is because the call to tr is a single op, meaning this whole thing is much faster. Of course, don’t take my word for it – run a benchmark:

#!/usr/bin/perl

use Benchmark qw(cmpthese);

cmpthese(10_000_000, {
 'index' => sub { my $c; my $pos; $c++ while $pos = index($ARGV[0], "a", $pos) + 1 },
 'tr'    => sub { my $c; $c = $ARGV[0] =~ tr/a/a/ },
});

bash$ ./benchmark.pl aardvark
           Rate index    tr
index 2439024/s    --  -39%
tr    4016064/s   65%    --

And finally

This is just a smattering of the modules that can help poke around inside the internals of Perl - practically the national sport of the Programming Republic of Perl. The CPAN contains a very large number of modules that can do all kinds of clever things - try searching the CPAN for "B::" and "Devel::" modules.

RationalizeNamespacePrefixes

Today I’m going to talk about parsing documents that use XML Namespaces with XML::Easy. While XML::Easy doesn’t (by design) ship with its own XML-Namespace-aware parser, one of my modules, XML::Easy::Transform::RationalizeNamespacePrefixes, makes parsing documents that use namespaces a doddle, with just one simple function call.

The problems namespaces solve and cause

XML Namespaces is an extension of the XML 1.0 specification that allows multiple standards to cooperate so that they don’t use the same names for their nodes, meaning it’s possible to use more than one specification in the same document at the same time without conflict. For example, here’s an XML document that uses two different made-up specs at the same time to describe a pub lunch, using the tag “chips” to mean two different things:

<order
  xmlns:prepackaged="http://twoshortplanks.com/ns/example/behindthebar" 
  xmlns:grub="http://twoshortplanks.com/ns/example/food">
  <grub:meal >
    <grub:beefburger/>
    <grub:chips/>
  </grub:meal>
  <prepackaged:chips type="Pringles" />
</order>

So the way the XML Namespace specification works is by using a convention of naming nodes with an extra prefix. This allows what would otherwise be the same name in the same document to have a different semantic meaning. For example, the “chips” nodes are written as “prepackaged:chips” when they’re referring to crisps, and “grub:chips” when they’re referring to fries. The clever bit of XML Namespaces is that it doesn’t matter what prefixes you use to differentiate the two from each other, only what namespace URLs they map to. For example, this document here is considered essentially identical to the previous example as far as a namespace-aware XML parser is concerned:

<order>
  <meal xmlns="http://twoshortplanks.com/ns/example/food">
    <beefburger/>
    <chips/>
  </meal>
  <barsnack:chips xmlns:barsnack="http://twoshortplanks.com/ns/example/behindthebar" type="Pringles" />
</order>

The meaning of the prefix is entirely derived from the presence of the xmlns-prefixed attributes on the node (or on a parent node) mapping the prefix to a URL¹. This is both great and a complete nightmare: great, since by mapping an arbitrary prefix to the unique namespace URL you're not going to get conflicts with other specifications (the way you would if each specification defined its own prefix), and a complete nightmare, because you don't know what the thing you're looking for is actually called - either your code or the parser has to keep track of what namespaces are declared in the current scope and what prefixes map to what namespaces.

Using XML::Easy::Transform::RationalizeNamespacePrefixes

What would be great is if there were some way you could force everyone who gives you a document to use the prefixes you'd like. Then you'd know what everything was called, without having to worry about all these xmlns:whatever attributes in the document (and which nodes were where in the tree in relation to them). You could just look for all the "beverage:lager" nodes.

Well, we can't force other people to do what we want, but what we can do is make use of the fact that the prefixes are arbitrary and that the same document with any prefixes means the same thing. We can therefore just rewrite whatever document we're given into a form we'd like to deal with before we process it. This is the task XML::Easy::Transform::RationalizeNamespacePrefixes was designed for - it rationalises the prefixes of the namespaces to whatever you want. For example, forcing the use of the "kitchen" and "barstaff" prefixes for our two namespaces:

my $old_doc = xml10_read_document($string_of_xml);
my $new_doc = rationalize_namespace_prefixes($old_doc, {
  namespaces => {
    "http://twoshortplanks.com/ns/example/food" => "kitchen",
    "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
  },
  force_attribute_prefix => 1,
});

Now if you feed either of the above documents to the code, you'll have an in-memory representation of the following document:

<order
  xmlns:barstaff="http://twoshortplanks.com/ns/example/behindthebar" 
  xmlns:kitchen="http://twoshortplanks.com/ns/example/food">
  <kitchen:meal >
    <kitchen:beefburger/>
    <kitchen:chips/>
  </kitchen:meal>
  <barstaff:chips barstaff:type="Pringles" />
</order>

Several important transformations have happened:

  • It used the namespace/prefix mapping that we passed in via the namespaces option to rename all the corresponding nodes in the document to have whatever prefixes we want. This means we now know, without looking at the xmlns attributes, what our nodes will be called.

  • All the namespaces have been moved to the top element of the document. In this example the module didn't need to introduce any further prefixes to do this (which can happen if the same prefix is used to refer to different URLs in different parts of the tree) nor condense prefixes to a single prefix per namespace (which happens if multiple prefixes refer to the same URL) but if it had to do that, it would have. This means it's really easy to find other namespaces that are defined in our document - you just look for xmlns attributes at the top element.
  • The force_attribute_prefix option forces prefixes to be attached to attribute names too.

Now we can parse the document without worrying about the namespaces at all. If we want to look for all the packets of pre-prepared food in the document:

use 5.010;
use XML::Easy::Text qw(xml10_read_document);
use XML::Easy::Classify qw(is_xml_element);
use XML::Easy::NodeBasics qw(xe_twine);
use XML::Easy::Transform::RationalizeNamespacePrefixes qw(rationalize_namespace_prefixes);

sub packets {
  my $element = shift;
  return unless is_xml_element($element);
  my @return;
  push @return, $element->attribute("barstaff:type") if $element->type_name eq "barstaff:chips";
  push @return, map { packets($_) } @{ xe_twine($element) };
  return @return;
}

say "We need the following packets:";
say " * $_" for packets(
  rationalize_namespace_prefixes(
    xml10_read_document($string_of_xml), {
      namespaces => {
        "http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
      },
      force_attribute_prefix => 1,
    }
  )
);

There's more information on XML::Easy::Transform::RationalizeNamespacePrefixes's search.cpan.org page².

And that concludes my mini-series into looking into XML::Easy. I'm sure to write more about it in the future as more interesting uses and extensions are written for it, but in my next entry I'll be taking a break from the pointy brackets!


[1] I've used the term URL multiple times in this document when I should really have used URI. We're using the http:// thingy-wosit to identify a unique reference, so it should be a URI rather than a URL (Uniform Resource Locator), because there's no resource to locate at that address. It's just a unique name.

[2] Please note that this blog was originally posted close in time to when the new version of XML::Easy::Transform::RationalizeNamespacePrefixes was uploaded to the CPAN, so not all features described in this post may have reached your local CPAN mirror if you're reading it "hot off the presses".


XML::Easy::ProceduralWriter

In this post I want to take a further look at writing code that outputs XML with XML::Easy, and how the example from my previous blog entry can be improved upon by using XML::Easy::ProceduralWriter.

What’s wrong with the previous example?

When we left things in my previous post we were looking at this snippet of code that outputs a simple XML webpage:

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

print {$fh} xml10_write_document($root_element);

The above code produces exactly the output we want, but it doesn’t necessarily go about producing it in the best way.

The first problem is that it’s using Tie::IxHash to ensure that the keys of the hash (and thus the nodes in the resulting XML) come out in the right order rather than in a random order like a traditional hash's would. Tied data structures are much slower than normal data structures, and using a tie in this way is a big performance hit. However, in this case we have to tie, because it’s hard to write, in a readable way, the logic inline in the map statement to process a normal array two elements at a time.
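
If you want to put a number on that slowdown, a quick illustrative Benchmark sketch of the lookup cost will do it (exact figures will vary by machine and perl version, but expect the tied hash to lose by a wide margin):

use Benchmark qw(cmpthese);
use Tie::IxHash;

tie my %tied, "Tie::IxHash", map { $_ => $_ } 1 .. 100;
my %plain = map { $_ => $_ } 1 .. 100;

cmpthese(1_000_000, {
  tied  => sub { my $x = $tied{42} },    # lookup via the tie interface
  plain => sub { my $x = $plain{42} },   # ordinary hash lookup
});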

Which brings us on to the second problem, also related to the map statement – it’s somewhat unwieldy to write and hard to read (you have to scan to the end of the line to work out that it’s using %hash for its keys). This only gets worse as you have to produce more complex XML and you try to use further (possibly nested) map statements and ternary logic expressions to build up even more complex data structures – which is every bit as messy to do as it is to explain.

Both issues stem from trying to build the XML::Easy::Element tree all in one go, essentially in one statement as a single assignment. If we choose not to restrict ourselves in this way we can easily re-order the code to use a temporary variable and do away with both the tie and the map:

my @data = (
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
);

my @links;
while (@data) {
  my $url = shift @data;
  my $text = shift @data;
  push @links, xe("li", xe("a", { href => $url }, $text) );
}

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul", @links),
  ),
);

print {$fh} xml10_write_document($root_element);

The problem with this solution is that now we’ve ended up with code that’s backwards. We’re creating the list elements and then creating the node that encloses them. Now we have to read to the bottom of the code to work out that we’re creating an HTML document at all!

Introducing XML::Easy::ProceduralWriter

To solve this problem I wrote XML::Easy::ProceduralWriter, a module that allows you to write your code in a procedural fashion but without having to “code in reverse”.

Here’s the above example re-written again, this time using XML::Easy::ProceduralWriter:

use XML::Easy::ProceduralWriter;

print {$fh} xml_bytes {

  element "html", contains {
  
    element "head", contains {
      element "title", contains {
        text "My Links";
      };
    };
    
    element "body", contains {
      element "ul", contains {
         my @data = (
           "http://search.cpan.org/" => "Search CPAN",
           "http://blog.twoshortplanks.com" => "Blog",
           "http://www.schlockmercenary.com" => "Schlock",
         );
         
         while (@data) {
           element "li", contains {
             element "a", href => shift @data, contains {
               text shift @data;
             };
           };
         }
      };
    };
  };

};

Using the module is straightforward. You start by calling either xml_element (which returns an XML::Easy::Element) or xml_bytes (which returns a string of bytes you can print out), and into these calls you pass some code that generates XML elements and text. Each element can ‘contain’ further code that produces the sub-elements and text that element contains, and so on.

The key thing to notice is that unlike the previous examples, where you were passing data structures into the functions, here you’re passing code to be executed. This means you can place arbitrary logic in what you pass in, and you’re not limited to single statements. For example, in the above code we declare variables in the middle of generating the XML. The conceptual jump is realising that it isn’t what the blocks of code return, nor what element and text return, that matters, but the side effects of calling those two functions. The simplest way to think about it is to imagine the string being built up as the element and text statements are encountered, in much the same way output is printed straight to the filehandle when you use print (even though technically this isn’t the case here – a full XML::Easy::Element object tree is always built in the background).
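
And since xml_element hands you back the XML::Easy::Element itself rather than bytes, the same style works when you want to keep the tree around for further processing; a minimal sketch (assuming xml_element takes the same block form shown above):

use XML::Easy::ProceduralWriter;
use XML::Easy::Text qw(xml10_write_document);

my $element = xml_element {
  element "p", contains {
    text "Hello there";
  };
};

# $element is an ordinary XML::Easy::Element, so all the usual tools apply
print xml10_write_document($element);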

The documentation for XML::Easy::ProceduralWriter contains a reasonable tutorial that explains its usage in more detail, but it should be pretty straightforward to jump straight in just from reading the above code.

And that’s pretty much all I have to say about outputting XML with XML::Easy. In my next post we’ll look instead at advanced parsing and how to cope with documents with XML Namespace declarations.

XML::Easy by Example

Last week I posted about why you should be interested in the new XML parsing library on the block, XML::Easy. In this post I’m going to actually dive into some code so you can see what it’s like to work with.

Parsing XML by Example

The basics of parsing are pretty straightforward:

use XML::Easy::Text qw(xml10_read_document);

# read in the document
open my $fh, "<:utf8", "somexml.xml";
  or die "Can't read filehandle: $!";
my $string = do { local $/; <> };

# parse it
my $root_element = xml10_read_document($string);

Now $root_element contains an XML::Easy::Element. Getting basic facts out of this element such as its name or attribute values is easy too:

say "the name of the root element is ".$root_element->type_name;
say "the content of the href attribute is ".$root_element->attribute("href")
  if defined $root_element->attribute("href");

Getting at the child elements involves dealing with a twine. What’s a twine you say? Why it’s nothing more than an alternating list of strings and elements. Let’s look at an example to help explain this:

my $input = '<p>Hello my <i>friend</i>, here is my picture: <img src="http://farm1.static.flickr.com/116/262065452_6017d39626_t.jpg" /></p>';

We can then call this:

my $p = xml10_read_document($input);
my $twine = $p->content_twine;

The $twine variable now contains an array reference holding alternating strings and XML::Easy::Elements:

  • $twine->[0] contains the string “Hello my”
  • $twine->[1] contains an XML::Easy::Element representing the <i> tag (which in turn will contain the text “friend”)
  • $twine->[2] contains the string “, here is my picture: ”
  • $twine->[3] contains an XML::Easy::Element representing the <img> tag
  • $twine->[4] contains the empty string “” between the <img> tag and the closing </p> tag

The important thing to remember about twines is that they always alternate string-element-string-element. When two elements are next to each other in the source document they’re separated by the empty string. You’ll note that a twine's first and last elements are always strings, even if they have to be empty, and an “empty” tag has a twine that contains just one thing – the empty string.
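
Because that alternation is guaranteed, you can walk a twine by index parity; here's a minimal sketch (all_text is my own illustrative function, not part of XML::Easy) that collects all the text under an element:

sub all_text {
  my $element = shift;
  my $twine   = $element->content_twine;
  my $text    = "";
  for my $i (0 .. $#{ $twine }) {
    if ($i % 2) {
      # odd indexes are always XML::Easy::Elements: recurse into them
      $text .= all_text($twine->[$i]);
    } else {
      # even indexes are always strings
      $text .= $twine->[$i];
    }
  }
  return $text;
}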

Now we know the basics, let’s look at a practical example. Imagine we want to get all the possible anchors (internal links) in an XHTML document. This simply involves looking for all the <a> tags that have a name attribute:

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  my $swizzle = 0;
  foreach (@{ $element->content_twine() }) {

    # skip every other array element because it's a string
    next if $swizzle = !$swizzle;

    # recurse into the child nodes
    push @results, get_links($_);
  }

  return @results;
}

If we want to make this even easier on ourselves there's a bunch of helper functions in the XML::Easy::Classify module that can be used to help process parts of XML documents. For example, we could have written the above code in a more terse (but less efficient) way by using is_xml_element:

use XML::Easy::Classify qw(is_xml_element);

sub get_links {
  my $element = shift;
  my @results;

  # check this element
  push @results, $element->attribute("name")
    if $element->type_name eq "a" && defined $element->attribute("name");

  # check any child elements
  push @results, get_links($_)
    foreach grep { is_xml_element $_ } @{ $element->content_twine() };

  return @results;
}

Generating XML by Example

If you've got an XML::Easy::Element instance, writing it out as an XML document is just the opposite of reading it in:

use XML::Easy::Text qw(xml10_write_document);

# turn it into a string
my $string = xml10_write_document($root_element);

# write out the document
open my $fh, ">:utf8", "somexml.xml";
  or die "Can't write to filehandle: $!";
print {$fh} $string;

One of the first things you have to know about XML::Easy::Elements and their contents is that they are immutable – or, put another way, you can't change them once they're created. This means they have no methods for setting the name of an element, altering the attributes, or setting the children. All of these must be passed in to the constructor.

Let's just jump in with an example. We're going to create a little code that outputs the following XML document:

<html>
   <head><title>My Links</title></head>
   <body>
     <h1>Links</h1>
     <ul>
       <li><a href="http://search.cpan.org/">Search CPAN</a></li>
       <li><a href="http://blog.twoshortplanks.com/">Blog</a></li>
       <li><a href="http://www.schlockmercenary.com/">Schlock</a></li>
     </ul>
   </body>
</html>

(I've added extra whitespace in the above example for clarity - the code examples that follow won't reproduce this whitespace)

I'm going to start off by showing you the most verbose and explicit object-orientated way to create XML::Easy::Elements, and then I'm going to show you the much quicker functional interface once you know what you're doing. The verbose way of creating an element is to explicitly pass each of the things to the constructor:

XML::Easy::Element->new($name, $attributes_hashref, $xml_easy_content_instance)

The trouble with using such code is that it often requires pages and pages of code that puts Java to shame in its repetition of the obvious (you don't really need to read the following code, just gawk at its length):

my $root_element = XML::Easy::Element->new("html",
  {},
  XML::Easy::Content->new([
    "",
    XML::Easy::Element->new("head",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("title",
          {},
          XML::Easy::Content->new([
            "My Links",
          ])
        ),
        "",
      ]),
    ),
    "",
    XML::Easy::Element->new("body",
      {},
      XML::Easy::Content->new([
        "",
        XML::Easy::Element->new("h1",
          {},
          XML::Easy::Content->new([
            "Links",
          ])
        ),
        "",
        XML::Easy::Element->new("ul",
          {},
          XML::Easy::Content->new([
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://search.cpan.org/" },
                  XML::Easy::Content->new([
                    "Search CPAN",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://blog.twoshortplanks.com/" },
                  XML::Easy::Content->new([
                    "Blog",
                  ]),
                ),
                "",
              ]),
            ),
            "",
            XML::Easy::Element->new("li",
              {},
              XML::Easy::Content->new([
                "",
                XML::Easy::Element->new("a",
                  { href => "http://schlockmercenrary.com/" },
                  XML::Easy::Content->new([
                    "Schlock",
                  ]),
                ),
                "",
              ]),
            ),
            "",
          ]),
        ),
        "",
      ]),
    ),
    "",
  ]),
);

So, we never ever write code like that! For starters we could use twines instead of content objects, but that's too verbose too. We use the functional interface presented by XML::Easy::NodeBasics instead:

use XML::Easy::NodeBasics qw(xe);

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      xe("li",
        xe("a", { href => "http://search.cpan.org/" }, "Search CPAN"),
      ),
      xe("li",
        xe("a", { href => "http://blog.twoshortplanks.com/" }, "Blog"),
      ),
      xe("li",
        xe("a", { href => "http://www.schlockmercenary.com/" }, "Schlock"),
      ),
    ),
  ),
);

The xe function simply takes a tag name followed by a list of things that are either hashrefs (containing attributes), strings (containing text), or XML::Easy::Elements (containing child nodes). It can also take content objects and twines, which is handy when you're re-using fragments of XML that you've extracted from other documents you may have parsed. In short, it Does The Right Thing with whatever you throw at it.

Of course, we can optimise further by remembering that this code is Perl:

tie my %hash, "Tie::IxHash",
  "http://search.cpan.org/" => "Search CPAN",
  "http://blog.twoshortplanks.com" => "Blog",
  "http://www.schlockmercenary.com" => "Schlock",
;

my $root_element = xe("html",
  xe("head",
    xe("title", "My Links"),
  ),
  xe("body",
    xe("h1", "Links"),
    xe("ul",
      map { xe("li", xe("a", { href => $_ }, $hash{$_}) ) } keys %hash
    ),
  ),
);

And that's about it for basic XML parsing and generation with XML::Easy. There are a lot more handy functions and explanations of the theory behind XML::Easy in the documentation. In my next post I'm going to look at another way of creating XML using XML::Easy, when I talk about one of my own modules: XML::Easy::ProceduralWriter.

Introducing XML::Easy

Some days, you just want to parse an XML document.

However, the standard distribution of Perl doesn’t ship with a bundled XML parser, traditionally requiring the user to install a module from CPAN instead. This means there’s no standard way to do this; instead there are several choices of parser, each with their own advantages and disadvantages: there is, as we often say in Perl, more than one way to do it. This is the first post in a series where I’m going to talk about XML::Easy, a relatively new XML parsing module that deserves a little more publicity.

But why another XML parsing library? What’s wrong with the others? Well, a few things…

One of the biggest problems with the most popular XML parsing modules like XML::LibXML and XML::Parser is that they rely on external C dependencies being installed on your system (libxml2 and expat respectively), so it can be hard to rely on them being installable on any old system. Suppose you write some software that relies on these modules. What exactly are you asking of the users of your software who have to install that module as a dependency? You’re asking them firstly to have a C compiler installed – something people using ActiveState Perl, basic web-host providers, or even Mac OS X users without the dev tools do not have. Even more than this, you’re often asking them to download and install (either by hand or via their package management system) the external C libraries that these modules rely on, and then know how to configure the C compiler to link to them. Complicated!

To solve this, XML::Easy ships with a pure Perl XML parser requiring neither external libraries nor a C compiler to install: in a pinch you can simply copy the Perl modules into your library path and you’re up and ready to go. This means that this library can be relied on pretty much anywhere.

The observant will point out that there are many existing pure Perl XML parsing libraries on CPAN. They suffer from another problem: they’re slow. Perl runs not as native machine instructions but as interpreted bytecode executing on a virtual machine, which is a technical way of saying “in a way that makes lots of very simple operations slow.” This is why the boring details of XML parsing are normally handled in C space.

Luckily, XML::Easy doesn’t use its pure Perl parser unless it really has to. On systems that do have a working C compiler, it prefers to compile and install its own C code for parsing XML. Note that this C code, bound into the perl interpreter with fast XS, is wholly self-contained and doesn’t rely on external libraries. All a user on a modern Linux system has to do to install the module is type cpan XML::Easy at the command prompt. In this mode XML::Easy is fast: in our tests it’s at least as fast as XML::LibXML (which is to say, very fast indeed). This week I’ve been re-writing some code that used to use MkDoc::XML to use XML::Easy instead, and the new code is 600 (yes, six hundred) times faster.

This is great news for module authors who just want to do something simple with fast performance if they can get it, but don’t want to have to worry about putting too much of a burden on their users.

Of course, this would all be for naught if XML::Easy didn’t do a good job of parsing XML – but it does. The other big failing of the many so-called XML parsers for Perl is that they screw up the little but important things. They miss parts of the specification (sometimes even deliberately!) or they don’t let you do things properly, like handle unicode. XML::Easy isn’t like this: it follows the specification quite carefully (with the devotion I’ve come to expect from its author, my co-worker Zefram) and doesn’t screw up unicode, because it doesn’t attempt to handle character encodings itself but embraces and works with Perl’s own unicode handling.

So by now, I’ll have either sold you on the idea of XML::Easy or not, but I haven’t really shown you how to use it. In the next post in this series I’m going to start talking about how you can use XML::Easy to parse XML and extract which bits you want.


DropBox as a No Paste server

In this blog post I’m going to talk about my own custom “no paste” solution that I’ve developed over the years. How I started out using a web page as a service, moved to scripting this from the command line, and how I finally ended up subverting DropBox to my ends. Skip to the end to get to the code.

So, what’s “no paste”, I hear you ask? “No paste” servers allow you to go to a webpage, submit a bit of text, and get back a unique URL where that text can be viewed. This means that you don’t have to paste a whole bunch of text into IRC or an IM conversation; you can just upload your text and paste only the link the “no paste” server sent back. If you haven’t seen one before, check out textmate’s nopaste server.

This has several advantages. First and foremost, it doesn’t “spam” the conversation you’re having. Pasting a whole bunch of code into an IRC channel where people are having a conversation, causing that conversation to scroll off the screen before they can read it, isn’t polite. Secondly, it makes the text easier to read and easier to copy and paste into an editor (for example, most IRC and IM clients will prepend each line someone says with a datestamp when you copy and paste from them).

Excellent. A useful idea. Now how can we make it better?

As a Perl programmer I tend to automate a heck of a lot of what I do with my computer. Filling in a form on a webpage is easy to do, but it’s more hassle than hitting a key combination that pastes whatever text is highlighted in your editor. If we do it a lot, we should make the computer do all the work!

For a long time I used the App::Nopaste module from CPAN, which installs a command line utility called nopaste that integrates with a vast range of “no paste” servers. This utility can take input on the command line and automatically fill in the forms on those websites for you. This means that it’s trivial to execute from your editor – in textmate it’s just a simple “Bundle” command.

In the end I stopped using nopaste not because I had a problem with the script, but because I had a problem with the nopaste servers, in particular the lack of privacy. Now, I’m a great believer in simply not putting anything on the internet that is truly private (face it, it’s going to get out!) but there exists a bunch of “semi-private” stuff (closed source code, contact information, private correspondence) that shouldn’t be put on a totally public paste service. Often it’s just a case of editing the URL that the “no paste” server returns by increasing or decreasing the number at the end to see the thing the next or previous person pasted!

So in the end I decided it might be a good idea to run my own custom “no paste” solution with semi-secure (non-guessable) URLs. One problem with that: I couldn’t justify the infrastructure – I’m currently trying to reduce the amount of stuff I have to maintain, not increase it. So I started looking at the infrastructure I’m already using, to see how I could make better use of that.

Enter DropBox. DropBox is a service that syncs a directory on your various computers with each other and with the DropBox server. One of the things it does is publish the files in a certain directory so they're accessible from the web. This simplifies my problem a lot: all I need for my own “no paste” solution is an easy way of getting text into a file on my hard drive, and the DropBox program will automatically handle the “uploading” to a hosted service.

So, below is the script I wrote to do that. Features include:

  • Using a web-safe version of the base64-encoded MD5 one-way hash of the file contents as the filename. This means that it’s both unguessable unless you know what the text contains, and reasonably guaranteed to be unique
  • Taking input from STDIN or the system clipboard
  • Printing out the URL that the text will be available at, and/or copying it to the clipboard, and/or displaying it in a Growl message

#!/usr/bin/perl

use strict;
use warnings;

use 5.010;
use autodie;
use Path::Class qw(file dir);
use Digest::MD5 qw(md5_base64);
use Net::Growl qw(register notify);
use Getopt::Std qw(getopt);

########################################################################
# config

my $DROPBOX_ID = "301667";
my $GROWL_PASSWORD = "shoutout";

########################################################################

# get the config options
my %opt;
getopt("",\%opt);

# read the entire of STDIN / the files passed on the command line
my $data = $opt{c}
  ? read_clipboard()
  : do { local $/; scalar <> };

# work out the digest for it.  Convert the non-url-safe characters
# to url-safe characters
my $uuid = md5_base64($data);
$uuid =~ s{/}{-}g;
$uuid =~ s{\+}{_}g;

# copy the data to the new file
open my $fh, ">:bytes",
  file($ENV{HOME},"Dropbox","Public","nopaste","$uuid.txt");
print {$fh} $data;
close $fh;

# output the url that dropbox will make that file available at
my $url = "http://dl.getdropbox.com/u/$DROPBOX_ID/nopaste/$uuid.txt";
say $url unless $opt{q};
write_clipboard($url) if $opt{p};
if ($opt{g}) {
  my $message = "shortly at $url";
  $message .= " (copied to clipboard)" if $opt{p};
  growl("Text Dropped", $message);
}

########################################################################

# this is Mac OS X dependent.  I'd use the Clipboard module from CPAN
# to make this system independent, but it fails tests.

sub read_clipboard {
  open my $pfh, "-|", "pbpaste";
  local $/;
  return scalar <$pfh>;
}

sub write_clipboard {
  my $data = shift;
  
  open my $pfh, "|-", "pbcopy"; 
  print {$pfh} $data;
  close $pfh;
}

sub growl {
  my $title = shift;
  my $description = shift;
  
  register(
    application => "droptxt",
    password => $GROWL_PASSWORD,
  );

  notify(
    application => "droptxt",
    password => $GROWL_PASSWORD,
    title => $title,
    description => $description,
  );
  
}

########################################################################

__END__

=head1 NAME

droptxt - easily write text to a file in your public dropbox

=head1 SYNOPSIS

  # read from stdin
  bash$ droptxt
  this is some text
  ^D

  http://dl.getdropbox.com/u/301667/nopaste/4ZwSg8klsyBmhf9SKs-j5g.txt

  # read from a file
  bash$ droptxt some_text.txt

  http://dl.getdropbox.com/u/301667/nopaste/asdSDsq_asdQsasdw12s3d.txt

  # read from the clipboard
  bash$ droptxt -c

  http://dl.getdropbox.com/u/301667/nopaste/cssj12-22WWdsqQfxjpDDe.txt

  # also paste the url to the clipboard
  bash$ droptxt -p some_text.txt

  http://dl.getdropbox.com/u/301667/nopaste/asdSDsq_asdQsasdw12s3d.txt

=head1 DESCRIPTION

This is a command line utility that is designed to be used as an
alternative to "no paste" utilities.  Instead of sending the input to a
webserver it simply writes it to a location on your hard drive where the
DropBox utility will synchronize it with the DropBox webservers.

=head2 Options

=over

=item -c

Copy the input from the system clipboard rather than from the usual
location.

=item -p

Paste the url to the system clipboard.

=item -g

Announce the url via Growl

=item -q

Do not print the url to STDOUT

=back

=head1 AUTHOR

Copyright Mark Fowler E<lt>mark@twoshortplanks.comE<gt> 2009. All rights reserved.

This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=head1 BUGS

Doesn't wait for DropBox to sync the file.  The URL this creates may not be
usable straight away!

=head1 SEE ALSO

L<http://www.getdropbox.com> for details on the service.

L<App::Nopaste> for a utility that uses public "no paste" servers instead.

=cut

Let’s have our cake and eat it too

One of the most difficult tradeoffs in language design is brevity versus explicitness. Having long names for methods, functions and variables, and verbose patterns, makes your code much clearer and less ambiguous. To my mind there’s nothing ambiguous about:

System.out.println("Hello World");

It’s obviously printing a line to standard out and, at least as long as you associate “System” with “standard” in your head, there are no surprises lurking here. This kind of explicitness is great. It means anyone can read the code and bit by bit pick it apart to work out exactly what’s happening (Oh…so there’s a ‘System’ class….and look, an ‘out’ property….and a ‘println’ method…)

Of course, the problem with this level of explicitness is that the verbosity it requires takes too long to write, unless you’ve got a super-charged macro-based IDE. And even then it faces a worse drawback: as the patterns get larger and the verbosity multiplies, it gets harder to comprehend the overall picture. Your eyes tend to glaze over after a day’s protracted coding and the details start to become obscured. For example, in microcosm, the “System.out” is the least interesting bit of the above code, but it’s the part of the statement my eyes are drawn to first. Worse still, programmers tend to write the same number of lines of code a day no matter how verbose their programming language is. Those languages with more brevity therefore tend, on the whole, to get more done in those lines!

Compare and contrast the above Java statement to the following Perl statement:

say "Hello World"

Much shorter, and much easier to read; the eyes are drawn to the “Hello World”, which is indeed the interesting part of the statement. With much shorter statements and less wrapper code, Perl users should be producing many more lines of code a day and beating the pants off the Java programmers.

Well, this is sometimes true. And, to be fair, often not.

What was the tradeoff with brevity again? Oh, yes…more ambiguity. When does this strike? In maintenance, and in what I like to call “pre-maintenance”: the time when you’re still developing the code yourself and it reaches the point where it’s too big to fit in your head.

Consider the two examples above. While the Java version is clearly printing to standard out, “say” is printing to the ‘current file handle’, which is almost always standard out. Of course, Perl people might consider this potential action at a distance to be a worthwhile abstraction layer. Which really emphasises an important difference between the two languages.
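
That ‘current file handle’ business is genuine action at a distance; a tiny sketch:

use 5.010;

open my $fh, ">", "log.txt" or die $!;
select $fh;           # quietly swap the 'current file handle'
say "Hello World";    # now lands in log.txt, not on standard out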

Perl and Java are essentially operating at about the same level of abstraction. They’re both a level above C, running on a virtual machine layer with nice safety nets built in, meaning things like memory allocation, array boundary checking, sorting algorithms, etc. are all taken care of by the core language and API. There’s really little to separate the two languages – they share more in common with each other than, say, C and Prolog do – so it’s more interesting to look at the small details that make the two languages different from one another. The Perl and Java programmer have a lot to learn from each other.

This difference in syntax philosophy is really interesting to me. Perl’s basic syntax typically allows you to express more in a shorter space by exploiting what is known as context. There are a lot of implicit things: there’s list or scalar context, for example, or the current file handle, or the topic variables ($_, @_ et al) that are often used as default arguments for calls. This either allows you to hold more in your head (because you can ignore the need to restate the context all the time) or a lot less (because the code isn’t clear and you have to worry about what’s in the hidden context all the time).
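
A few tiny fragments of that implicit context in action (assume $fh is an open filehandle):

my @lines = <$fh>;   # list context: read every remaining line
my $line  = <$fh>;   # scalar context: read just one line

for (1 .. 3) {
  print;             # no argument: prints the topic variable $_
}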

So in theory Perl allows you to express more in a line, but you can also get yourself into a mess a heck of a lot quicker. It can be really easy for beginners to pick up Perl compared to Java, because they aren’t forced to deal with all the implicit stuff directly; but at the same time, because they’re not aware of all the implicit things going on, it can be harder to deal with too. A double-edged sword.

No wonder sometimes Perl is better than Java, and sometimes Java is better than Perl.

The obvious thing for Perl and Java to do, in the grand tradition of language design, is learn from each other’s mistakes and steal the good bits from each other without (hopefully) picking up the bad bits. Java stealing regular expressions and Perl stealing layered IO are good examples of worthwhile theft.

So what would I change about Perl to take advantage of what Java teaches us? Probably more than some people would like, and a lot less than others.

By way of example of what I would change, here’s a particularly confusing chunk of code I wrote earlier in the week:

sub log {
  no warnings;
  warn @_;
}

no warnings immediately followed by warn? Gah! Of course, what I’m actually doing is suppressing all the warnings that perl itself will generate (undefined values, printing of wide characters, etc) while it prints out my warning message. Very brief. Different semantic domains entirely.

So what do I think we should do to fix this? Nothing. This is just the pain of having a brief, ambiguous language: sometimes you’re just going to end up with what I dub “talking at semantic cross purposes” in your code. I could suggest that we force people to be more explicit so it’s clear which warnings we’re talking about in each case, but then I’d be changing the feel of the language. I’m in no rush to recreate Java; Java’s a fine language and I know where it is when I want to use it.

So what would I change? Ambiguity where there’s genuine confusion due to overloading meaning of things.

If you’re paying a lot of attention to my example you’ll notice that I’m not declaring something that’s going to be used as a subroutine there, despite the sub keyword (because, obviously, without syntactic gymnastics you can’t call a subroutine called log without calling the log built-in function instead). In fact, the occasions where you want a method to also be callable as a subroutine (or vice versa) are very thin on the ground. So this is where I’d be more explicit, lose the unnecessary ambiguity, and be able to express the difference between a function and a method. Like so:

method log {
  no warnings;
  warn @_
}

Of course, that’s exactly what some of the more radical extensions like MooseX::Declare allow you to do, and I’ll talk more about that in a future blog entry.

CPAN Unplugged

CPAN is often described as Perl’s Killer App; modern Perl relies on it, with the perl distribution almost being considered in parts nothing more than a bootstrap for the rest of the language that’s out there in the cloud. Which makes it all the more annoying when you’re stuck somewhere without an internet connection, missing the vital bit of the language you need. I’ve just had first-hand experience of being offline for a two-week holiday, but I didn’t have this problem when hacking on personal projects: I took CPAN with me.

So, want CPAN at your fingertips even when you’re offline? Yep, you’ve guessed it: There’s a CPAN module for that!

It’s called CPAN::Mini, and it lets you create a mini-mirror of CPAN. A mini-mirror? What’s that? It’s a mirror of just the latest non-development versions of the modules on the CPAN – or in other words, a mirror of anything you can install by typing “install” and just the module name into the cpan shell. As I type this, such a mirror weighs in at about 1.1GB, which is a fair bit smaller than the full archive.

So how do we create a mini-mirror? Well, first (when you’re actually online) you need to install the module.

bash$ sudo cpan CPAN::Mini

Once you’ve done that the minicpan command will be installed on your computer.

While you can pass arguments on the command line to tell it how to run, it’s easier to create a .minicpanrc file in your home directory so you don’t have to remember what commands to type each time you want to sync your mirror. This is what mine looks like:

local: /cpan/
remote: http://www.mirrorservice.org/sites/ftp.funet.fi/pub/languages/perl/CPAN/

So I’ve got minicpan set up to download from mirrorservice.org (my nearest CPAN mirror on the internet when I’m in the UK) and create files in /cpan on my hard drive.

So all that’s left is to run the minicpan command and watch it download.

bash$ minicpan

This prints out each file as it downloads it. The first time you run this it might take a while (depending on the speed of your internet connection), so you might want to trigger it when your laptop is going to be in the same place for a while with a fast internet connection (i.e. just before you go to bed, or just after you get into the office for the day).

The second time you run this command it’ll update the existing mirror. This means that it won’t have to download the whole 1.1GB again, just the index files and the new modules that have been released.

bash$ minicpan
authors/01mailrc.txt.gz ... updated
authors/id/A/AD/ADAMK/Test-POE-Stopping-1.05.tar.gz ... updated
authors/id/A/AD/ADAMK/CHECKSUMS ... updated
authors/id/A/AN/ANDK/CPAN-Testers-ParseReport-0.1.4.tar.bz2 ... updated
authors/id/A/AN/ANDK/CHECKSUMS ... updated
authors/id/A/AT/ATHOMASON/Ganglia-Gmetric-PP-1.01.tar.gz ... updated
authors/id/A/AT/ATHOMASON/CHECKSUMS ... updated
authors/id/A/AT/ATHOMASON/Gearman-WorkerSpawner-1.03.tar.gz ... updated
...
cleaning /cpan/authors/id/A/AA/AAYARS/Fractal-Noisemaker-0.011.tar.gz ...done
cleaning /cpan/authors/id/A/AD/ADAMK/Test-POE-Stopping-1.04.tar.gz ...done
cleaning /cpan/authors/id/A/AL/ALEXLOMAS/CHECKSUMS ...done
cleaning /cpan/authors/id/A/AL/ALEXLOMAS/WWW-Nike-NikePlus-0.02.tar.gz ...done
...

The module will also delete any old versions of modules that are no longer in the index; in the above example you can see Adam released a new version of Test::POE::Stopping, so CPAN::Mini downloaded the new distribution and deleted the old one (as no modules in the index still relied on it). This keeps the size of the local mirror as small as possible on disk.

There are several ways you can configure the CPAN module to use this new local mirror, including typing commands in the CPAN shell. However, my preferred way is to directly edit the CPAN::Config module on the system.

First work out where the module containing your config is installed:

bash$ perl -E 'use CPAN::Config; say $INC{"CPAN/Config.pm"}'
/System/Library/Perl/5.10.0/CPAN/Config.pm

Then edit it, changing the urllist parameter to contain your local CPAN mirror in addition to your normal remote mirror:

'urllist' => [
  q[file:///cpan/],
  q[http://www.mirrorservice.org/sites/ftp.funet.fi/pub/languages/perl/CPAN/]
],

This means your CPAN shell will try to install files from disk first, and if for any reason that fails (for example, you tell it to install a development release) it'll go to the second mirror.

Which way round you order the mirrors really depends on how often you update your local mirror, and on personal preference. Putting your local mirror first, as I do, has the disadvantage that CPAN will seem "frozen" at the last time you ran minicpan, with any new releases hidden from you until you next update. However, it means that installs are very quick compared to normal internet installs (be you offline or not), and it avoids having to wait for the internet connection to time out every time CPAN tries to fetch a file and falls back to the local mirror when you're offline.

With all this done, I can now install modules in the usual way with the CPAN shell whether I have an internet connection or not. Of course, I haven't yet explained how I work out which modules I should be using when I'm offline and haven't got access to search.cpan.org. I'll get to that in a future blog post...


say What?

Now that I’ve got Snow Leopard (finally) installed on my Mac, the default perl binary is now 5.10.0. This means many things: The given keyword and smart matching, the defined-or operator, the wonderful additions to the regex engine, and other things I’m bound to blog about later when I get round to enthusing about them.

What I wanted to talk about today is the simplest change, the one that’ll be making the most difference to me on a day-to-day basis: the “say” keyword. say is more or less exactly the same as print, but two characters shorter, and it automatically adds a newline at the end. This is most useful when you’re writing one-liners. This quick calculation:

bash$ perl -e 'print 235*1.15, "\n"'

Becomes just:

bash$ perl -E 'say 235*1.15'

(Note the use of -E instead of -e to automatically turn on the 5.10 keywords like say without having to add use 5.010 or use feature 'say'.)

This saves us a grand total of nine keypresses (including having to hit shift one less time.) More importantly it saves us having to use double quotes at all. This is really useful when you’re already using the quotes for something else. For example, running a Perl one-liner remotely with ssh:

bash$ ssh me@remote "perl -MSomeModule -e 'print SomeModule->VERSION, qq{\n}'"

With 5.10 on the remote machine this becomes just:

bash$ ssh me@remote "perl -MSomeModule -E 'say SomeModule->VERSION'"

This has not only the advantage of saving me a bunch of keystrokes, but also doesn’t make me think as much. And the less I have to think, the less chance there is that I’ll do something stupid and make a mistake.
