Since time immemorial (or, at least, since the release of Perl 5 over ten years ago) the standard way of declaring a method in Perl has been to declare a subroutine and manually extract the invocant and arguments from the special @_ variable. In the last year or so, however, this has gradually been changing, with new modules from the CPAN providing alternative syntax.
The underpinnings of this work is the Devel::Declare module, a new module that allows module authors to hook Perl's lexer and do different things with the code as it's being parsed, and thus turn entirely new syntax into bytecode.
Previous attempts to alter method declaration had involved what are known as source filters - that is to say, a hook to intercept the source code when it's read from disk and modify it before it's parsed. The problem with this technique is that writing a program to alter Perl code is a very hard thing to do, and (thanks to things like prototypes) is dependent on what code you've already compiled in this session. This meant such alterations had real potential for quite nasty bugs, as incorrect and unsafe modifications could be made.
Devel::Declare, working in conjunction with the perl parser as it does, doesn't suffer from the same issues, and allows new code to be dynamically inserted that, once compiled, runs every bit as fast as if that code had been there in the first place.
A Starting Point
By way of an example we're going to be looking at a very basic method call that takes a couple of arguments and prints out some calculation on them.
Here's the example in traditional "sub" based Perl 5 syntax:
package Reticulator;
sub reticulate_splines {
my $self = shift;
my ($number, $how_much) = @_;
print "Reticulating $number splines by " . $how_much * $self->multiplier, "\n";
return;
}
This version of the code really doesn't have enough error checking, but I'll expand on this later in the examples.
Method::Signatures::Simple
The first, and most basic module I'm going to write about today is Method::Signatures::Simple. In a nutshell it just allows us to re-write the above code as:
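In rough form it looks like this (a sketch, using the method keyword and the default $self invocant that Method::Signatures::Simple provides):

package Reticulator;
use Method::Signatures::Simple;

method reticulate_splines($number, $how_much) {
    print "Reticulating $number splines by " . $how_much * $self->multiplier, "\n";
    return;
}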
As you can see $self is automatically extracted from @_ along with any other variables we've placed in the brackets after the method name.
You can ask the B::Deparse module to regenerate the equivalent code that perl thought it saw after Method::Signatures::Simple and Devel::Declare have had their wicked way with it:
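The output comes back looking roughly like this (the exact formatting varies between perl versions):

{
    package Reticulator;
    my $self = shift @_;
    my($number, $how_much) = @_;
    print "Reticulating $number splines by " . $how_much * $self->multiplier, "\n";
    return;
}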
Ignoring the package cruft that B::Deparse always inserts, the code is identical to the hand-written version and runs just as fast.
Method::Signatures
One of the problems with Method::Signatures::Simple is that while the code it effectively creates is no slower than the original traditional Perl, it's no better either! It's still just as buggy as the code we quickly wrote by hand, doing no input validation at all. For example, what if we forget to pass any arguments:
$reticulator->reticulate_splines;
Use of uninitialized value $number in concatenation (.) or string at example.pl line 28.
Use of uninitialized value $how_much in multiplication (*) at example.pl line 28.
Reticulating splines by 0
Ugh! What we need is some form of error checking. While we could add this with code in the body of our method, if we switch to using the less-simple module Method::Signatures we get this for 'free':
use Method::Signatures;
method reticulate_splines($number, $how_much) {
print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
return;
}
Now the $number and $how_much are required arguments, and not passing them gives me a proper exception:
$reticulator->reticulate_splines;
Reticulator::reticulate_splines() missing required argument $number at example2.pl line 51.
And if we run the subroutine through B::Deparse we can see the equivalent code Method::Signatures is creating for us:
{
package Reticulator;
my $self = shift @_;
Method::Signatures::required_arg('$number') unless @_ > 0;
my $number = $_[0];
Method::Signatures::required_arg('$how_much') unless @_ > 1;
my $how_much = $_[1];
print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
return;
}
This code is, of course, doing slightly more than the original. If we benchmark it we see it's slightly slower, but not much - those extra unless statements are really cheap ops.
We can mark those arguments as optional again by suffixing them with question marks.
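A sketch of what that looks like (this assumes Method::Signatures' trailing question mark syntax for optional parameters):

use Method::Signatures;

method reticulate_splines($number?, $how_much?) {
    print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
    return;
}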
This causes those unless statements to be removed, making the equivalent code the same as what Method::Signatures::Simple generates:
{
package Reticulator;
my $self = shift @_;
my $number = $_[0];
my $how_much = $_[1];
print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
return;
}
This is about the same speed (slightly faster or slightly slower, depending on the
exact version of perl you're running, since it uses scalar rather than list assignment) as the original example.
Named Parameters
One of the other things that Method::Signatures gives us over the simplistic
Method::Signatures::Simple is the use of named parameters. That is to say
we can write this:
$reticulator->reticulate_splines( number => 10, how_much => 5 );
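For this calling style the method itself is declared with colon-prefixed parameters - something like this sketch, assuming Method::Signatures' Perl 6 style named-parameter syntax:

use Method::Signatures;

method reticulate_splines(:$number, :$how_much) {
    print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
    return;
}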
B::Deparse tells us this compiles into the same thing as if we'd written:
{
package Reticulator;
my $self = shift @_;
my(%args) = @_[0 .. $#_];
my $number = delete $args{'number'};
my $how_much = delete $args{'how_much'};
Method::Signatures::named_param_check(\%args);
print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
return;
}
The extra readability and error checking come at a cost - we've suddenly got another subroutine call being made, which slows things down significantly.
MooseX::Declare
There are still things we can do to produce incorrect output with the level of validation that Method::Signatures is giving us. For example, we can call the method with strings rather than the numbers we're meant to be dealing with and get garbage out:
$reticulator->reticulate_splines( number => "all", how_much => "lots" );
Argument "lots" isn't numeric in multiplication (*) at example3.pl line 29.
Reticulating all splines by 0
To add type checking to our arguments, we need to move on to using another module, MooseX::Declare, that, amongst other syntactic sugar improvements, brings the power of the Moose framework and its type system to bear on the argument list.
use MooseX::Declare;
class Reticulator {
method reticulate_splines(Int :$number!, Int :$how_much!) {
print "Reticulating $number splines by " . $how_much * $self->multiplier . "\n";
return;
}
}
There are several changes in the above example compared to those above it. The most obvious is the class keyword - this replaces the package keyword and other boilerplate like "use Moose", "use strict", "use warnings", and "use namespace::clean".
The next thing we've done is fix the error that we allowed in the previous example: we've added "Int" type constraints (using Moose's type system) to ensure that the numbers we pass in really are integers. Finally we've added a trailing "!" to the variable names, which means "and make this mandatory".
This now explodes with a plethora of debug info if we pass it garbage:
$reticulator->reticulate_splines(number => "all", how_much => "lots");
Validation failed for 'Tuple[Tuple[Object],Dict[number,Int,how_much,Int]]' with value
[ [ Reticulator=HASH(0x100cfc6b0) ], { how_much: "lots", number: "all" } ],
Internal Validation Error is:
[+] Validation failed for 'Dict[number,Int,how_much,Int]' with value
{ how_much: "lots", number: "all" }
[+] Validation failed for 'Int' with value all at
/Library/Perl/5.10.0/MooseX/Method/Signatures/Meta/Method.pm line 429
These belts and braces, however, come at a significant cost over the raw subroutine call we started with. The reasons for this can only be hinted at by the equivalent code that B::Deparse shows us for MooseX::Declare - a lot of the validation is handled by completely separate method calls not even listed here:
{
package MooseX::Method::Signatures::Meta::Method;
use warnings;
use strict 'refs';
@_ = ${$self;}->validate(\@_);
$actual_body ||= ${$self;}->actual_body;
goto \&$actual_body;
}
MooseX::Declare gives us a lot more functionality than is shown in the above example though - you can define extra inline validation routines, default values, and even coercion routines for turning one data type into the data type you need (e.g. stringifying a DateTime object.) If you use this functionality, the slowdowns are often worth it.
Conclusion
You can see from the above examples that there's a range of options that allow you to write much more expressive code than using the old and cumbersome "sub" keyword. It's possible to use these techniques so that there's no run time cost, and it's possible to use these techniques to do a whole host of complex input validation. It's not possible, however - just as if you'd written the code by hand - to get both at the same time.
TANSTAAFL, but at least there's a choice on the menu now.
The Acme namespace on CPAN is a collection of supposedly useless junk that often has no other purpose than to amuse or otherwise entertain the author and Perl aficionados. That said, not all of it is useless, with one particular module, Acme::MetaSyntactic, being much more useful than you might first think.
Acme::MetaSyntactic is a module responsible for spewing out, on demand, random interesting names for things. While this might not sound like something you'd want to do, it's great for extending your editor.
Six years ago at YAPC::NA::Florida Damian Conway gave a talk about how he'd wired up vi to greatly speed up his Perl development. While I haven't invested as much time as Damian has, with the help of Acme::MetaSyntactic I've been able to use some rudimentary code generation techniques that save me a lot of time without much outlay.
I'm currently using the TextMate editor (Your Mileage May Vary, and I've no bones to pick with anyone who uses another editor.) One of the simplest accelerators TextMate offers is "Snippets" - little bits of text inserted when you hit a key combination, or when you type a given string and hit tab. Not limited to simple text, these snippets can also insert the result of running shell scripts, and thus allow a quick and rudimentary way to extend the editor's functionality.
The Poor Man's Debugger
One of the most basic snippets I have is to type "acme[TAB]". This triggers the following snippet:
print STDERR "`meta batman` `meta batman`\n";
In TextMate the backticks are one of the ways that snippets can call shell scripts; in this case they execute the "meta" command that ships with Acme::MetaSyntactic to give me a sound effect from the Batman theme. Thus typing:
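acme[TAB]

inserts something like this into my code (the two words come from the Batman theme and are picked at random each time, so this is just an illustration):

print STDERR "kapow zlonk\n";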
This is a very quick way to get a unique, human readable print statement, and is a lot less tedious than having to keep writing things like:
print STDERR "Got to here\n";
And
print STDERR "Also got to here\n";
And so on.
'One Off' script variables
One of the great things about Acme::MetaSyntactic names is that they're always valid Perl variable names, so they're great for creating one-off scripts. I have multiple snippets that insert code that executes something and assigns the result to a uniquely named variable.
Let me give you a real example. I have a remote service I have to access quite often, and rather than using the interactive console I prefer to write "one off" scripts (which I can then check into version control just in case I need them again several months hence.)
Of course, with a handy set of snippets, the time it takes to write the boilerplate code can be eliminated. For example I type:
call[TAB]
And it spits out this:
my $sergeant_colon = $remote->call(
method => "ReticulateSplines",
id => $id,
param => {
},
);
And then I type:
call[TAB]
And this time it spits out:
my $windle_poons = $remote->call(
method => "ReticulateSplines",
id => $id,
param => {
},
);
So each result that I get back from the remote server is assigned to its own variable name. Again, this was just as simple as creating a snippet that looks like:
my \$`meta discworld` = \$bc->call(
method => "${1:ReticulateSplines}",
id => \$id,
param => {
$2
},
);
(The ${1:ReticulateSplines} and $2 placeholders let TextMate know where to put the cursor when I execute this snippet and where to jump when I subsequently hit tab.)
And More...
Of course Acme::MetaSyntactic isn't just limited to Perl. I have snippets that do the same for JavaScript and HTML:
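For example, a JavaScript logging snippet can use exactly the same backtick trick (this is an illustrative sketch rather than my exact snippet):

console.log("`meta batman` `meta batman`");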
You get the idea. I'm sure you can think of a million places where you can save yourself thirty seconds each time by some clever basic code generation.
So, how long does it take your webserver to generate pages?
This seemingly simple question is actually a lot more complicated than it might sound, and there's several ways to answer it, depending, I guess, on what question you're actually asking, and the reason you're asking it.
The simplest way to answer this is to fire up your web browser¹ and look at the pretty graphs showing how long it took to actually load some pages that you have determined are 'typical'. Using a real web browser to do the timings is the most accurate representation of what a user actually sees, because it properly accounts for loading all the other bits of the page, any of which might be very slow, block the rendering of the page, and turn out to be the most significant delay the user sees².
The problem with this approach is that it's not a very consistent method. Each time you do it you're going to get slightly different results back as various caches are charged and emptied and depending on what else the server is doing at the time.
One way to answer this is to use a tool like Selenium to make the same request with your browser several times and average the result. Another, more basic, approach is to run a tool like ab (Apache Bench) or Perl's own LWP to generate a lot of requests for the main HTML file. Either of these techniques is a lot better than a one-off measurement, but still gives you different results at different times of the day, days of the week, or times of the year (for example, the servers might be more loaded during weekday nights than at three am on a Sunday morning.)
Worse still, the servers might be influenced by what you're doing. You're not following typical user behaviour (you're creating a bunch of extra requests for a subset of pages) and maybe you're now loading the servers in a different way to the normal distribution of traffic and skewing your results.
One of the things I've started doing is looking at the performance of websites as a whole rather than the performance of the bits of the site I'm manually measuring. This is a totally different approach: instead of creating requests and monitoring how long it takes for them to return in the client, I'm logging on the server details about the requests that actual user activity creates, and then summarising that.
The advantage in this approach is that it's directly measuring the thing that you actually care about: The time it takes to get a response to the user in the actual conditions that they're making these requests. When you do this you'll often be surprised about what's actually slowing down the site - sometimes what seems like a very quick page when called in isolation can turn out to be a big resource hog if it's called a large number of times by your users.
One of the things you'll notice when recording details of your site is that it's really easy to get mixed up between two things:
1. Working out how long the user has to wait for given pages so that you can gauge if user experience is acceptable for those page requests
2. Working out what pages are taking up the most resources, so that you can work out what's slowing down your site.
These are really easy to mix up because as soon as your site starts bottlenecking on something (memory, CPU, database activity...) everything will start to seem slow, and it can be hard to work out which thing is doing the delaying and which thing is being delayed.
In order to work this out you need to go beyond the wallclock seconds that you're recording to gauge how slow your user experience is, and move on to recording how much resource each request is taking up.
There are various techniques for this, each more complicated and precise than the one before, but starting at a gross level will allow you to work out which bit you should be concentrating on. Measuring memory used with mod_perl can be as simple as using Apache2::SizeLimit:
my ($start_size, $start_shared) = $Apache2::SizeLimit::HOW_BIG_IS_IT->();
...
my ($end_size, $end_shared) = $Apache2::SizeLimit::HOW_BIG_IS_IT->();
print {$log_fh} "That request grew us by @{[ $end_size - $size_size ]}",
"(@{[ $end_shared - $start_shared ]}) shared\n";
Measuring CPU can be as simple as recording the number of ticks used:
use POSIX ();
my ($start_r, $start_u, $start_s) = POSIX::times();
...
my ($end_r, $end_u, $end_s) = POSIX::times();
print {$log_fh} "Took @{[ $end_u $start_u ]} user ticks and ",
"@{[ end_s - $start_u ]} system ticks\n"
Measuring your database is a little more tricky. I'd suggest simply monitoring the wallclock seconds each database request takes (accurately, with Time::HiRes) and using this as a starting point for working out which requests are slow, but this alone won't tell you where the slowdowns are (you'll need to do something like periodically capturing the output of "SHOW FULL PROCESSLIST" or its ilk in order to detect what's locking on what.)
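As a minimal sketch of the sort of thing I mean (assuming a DBI handle in $dbh, a query in $sql and a log filehandle in $log_fh, none of which are shown here):

use Time::HiRes qw(gettimeofday tv_interval);

# time a single database call with sub-second accuracy
# ($dbh, $sql and $log_fh are assumed to have been set up elsewhere)
my $start = [gettimeofday];
my $rows  = $dbh->selectall_arrayref($sql);
printf {$log_fh} "query took %.6f seconds: %s\n", tv_interval($start), $sql;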
Once you've got these gross levels recorded you can dive in in more detail in your staging environment, where you can add vastly more instrumentation to the pages that are slowing down your site in order to work out exactly why those pages are using so many resources.
Happy hunting!
¹ And because this is your web browser, you're using something modern like a WebKit based browser or Firefox with Firebug, or even Opera or the I.E. 9 beta, right?
² If you haven't read High Performance Websites and the explanation of why this is, you should stop reading this blog post now and go overnight a copy.
So this week I got back from a week in Pisa. "Hang on", you no doubt say, "this week? Aren't you meant to be going to Pisa for YAPC shortly?"
Well yes, and sadly no. I won't be (shock, horror) attending this year's YAPC::Europe.
Up until this point I've been to every YAPC::Europe ever. For ten years in a row¹. When I first went I was merely a pimply faced youth not yet out of university. Now? I'm a pimply faced 100000 year old² that's done everything from the webmonkey to the CTO role, with several sojourns into things like R&D and System Architecture on the way. It's been a _long_ time, and I've enjoyed it every year.
So why aren't I going this year? It's not because I've stopped doing Perl (though right now I'm more on the pointy hair side of the equation than the actual typing funny symbols into the editor side of the game) nor because YAPC::Europe isn't worth attending anymore (it's great) nor is it because my company isn't willing to send me³.
It's because I've got other things going on. My second daughter, all things going to plan, will be arriving on the 25th of August, which is too close for comfort for trips to places that require a combination of EasyJet and non-exploding Iceland to get me back again.
Now, there's an argument to be made that I could safely go to a conference ending on the 6th of August, which is nearly three weeks before the 25th. I counter that with the story of my colleague at the first OSCON Europe. He and I were over in Brussels to man a booth in the exhibition hall there. At some point during the evening he received a call from his wife, who was several weeks before her own due date, saying The Time Was Now. He did make it back in time in the end, but having looked into his eyes and seen the panic there, I'm *not* going to put myself in that situation.
So what am I going to do about this? Well, the first step has already been done - the family took, a few months early, the trip back to Pisa that we'd been looking forward to since the venue was announced in Portugal last year.
What am I going to do about the Perl? The only thing I can do: I'm going to put myself back into Ironman (this post being the first of a new series.)
Oh, and if *someone* doesn't at least bring me a t-shirt from Pisa, I'll cry myself to sleep for a week ;-)
¹ Fencepost problem. I've attended the ten YAPC::Europe conferences that occur annually, which have occurred over a period of nine years starting nearly ten years ago.
² In base 2.
³ YAPC::Europe is cheap and wonderful training and I recommend it to all employers. Last year the only reason I didn't let someone go is we couldn't have them out of the office at the same time as other people already agreed - so we sent them to YAPC::NA instead.
Perl provides a high level of abstraction between you and the computer, allowing you to quickly write very expressive high level code that does a lot. Sometimes however, when things don't go to plan or you want performance improvements, it's important to find out what's really going on at the lower levels and see what perl's doing "under the hood."
What Did perl Think I Said?
Sometimes when code doesn't do what you expect it's nice to see how the Perl interpreter understands your code, in case your understanding of Perl's syntax and perl's understanding of that same syntax differ. One way to do this is to use the B::Deparse module from the command line to regenerate Perl code from the internal representation perl built from your source code when it parsed it.
This is as simple as:
bash$ perl -MO=Deparse myscript.pl
One of my favourite options for B::Deparse is -p, which tells it to put in an obsessive number of brackets so you can see what precedence perl is applying:
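For example (the exact output may differ slightly between perl versions):

bash$ perl -MO=Deparse,-p -e 'print 1 + 2 * 3'
print((1 + (2 * 3)));
-e syntax OK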
You'll even note there are two sets of brackets immediately after the print statement - one enclosing the argument list to print and one surrounding the addition. This means that B::Deparse can also be used to work out why the following script prints out 25 rather than 5:
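The script in question was something along these lines:

# we want the square root of twenty five...
print (5*5) ** 0.5;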
The brackets we thought we were using to force precedence were actually parsed by perl as constraining what we were passing to print, meaning that the ** 0.5 was ignored:
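Running it through B::Deparse with -p makes the problem obvious (output approximate):

(print((5 * 5)) ** 0.5);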
A scalar is many things at once - it can actually hold a string, an integer, a floating point value and convert between them at will. We can see the internal structure with the Devel::Peek module:
use Devel::Peek;
my $foo = 2;
Dump($foo);
This prints
SV = IV(0x100813f78) at 0x100813f80
REFCNT = 1
FLAGS = (PADMY,IOK,pIOK)
IV = 2
This tells you a lot about the variable. It tells you it's an int (an IV) and that the value of that int is 2. You can see that it's got one reference pointing to it (the $foo alias.) You can also see it's got several flags set on it telling us which of the values stored in the structure are still current (in this case the IV, since it's an IV.) Now watch what happens when we append an empty string to it:
$foo .= "";
Dump($foo);
This now prints:
SV = PVIV(0x100803c10) at 0x100813f80
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
IV = 2
PV = 0x100208900 "2"
CUR = 1
LEN = 8
We gain PV flags (it's a "pointer value" aka a string) and we also gain CUR (current string length) and LEN (total string length allocated before we need to re-alloc and copy the string.) The flags have changed to indicate that the PV value is now current too.
So we can tell a lot about the internal state of a scalar. Why would we care (assuming we're not going to be using XS that has to deal with this kind of stuff)? Mainly I find myself reaching for Devel::Peek to print out the contents of strings whenever I have encoding issues.
Consider this:
my $acme = "L\x{e9}on";
Dump $acme;
On my system this shows that Léon was actually stored internally as a latin-1 byte sequence:
SV = PV(0x100801c78) at 0x100813f98
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x100202550 "L\351on"
CUR = 4
LEN = 8
But it doesn't have to be:
utf8::upgrade($acme);
Dump($acme);
Now the internal bytes of the string are stored in utf8 (and the UTF8 flag is turned on)
SV = PV(0x100801c78) at 0x100813f98
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1002010f0 "L\303\251on" [UTF8 "L\x{e9}on"]
CUR = 5
LEN = 6
As far as perl is concerned these are the same string:
my $acme = "L\x{e9}on";
my $acme2 = $acme;
utf8::upgrade($acme);
say "Yep, this will be printed"
if $acme eq $acme2;
In fact, perl may decide to switch between these two internal representations as you concatenate and manipulate your strings. This is not something you normally have to worry about, until something goes wrong and you see something horrid being output:
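For example, a string that should print as "Léon" coming out looking something like this:

LÃ©on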
This is usually a sign that you've read in some bytes that were encoded as latin-1 and forgotten to use Encode (or you've used it twice!), or you've passed a UTF-8 string through a C library, or you had duff data to begin with (garbage in, garbage out.) Of course, you can't really start to work out which of these cases is true unless you look in the variable, and that's hard: you can't just print it out, because that will re-encode it with the binmode of that filehandle, and your terminal may do all kinds of weirdness with it. The solution, of course, is to Dump it out as above and see an ASCII representation of what's actually stored in memory.
How Much Memory Is That Using?
In general you don't have to worry about memory in Perl - perl handles allocating and deallocating memory for you automatically. On the other hand, perl can't magically give your computer an infinite amount of memory so you still have to worry that you're using too much (especially in a webserver environment where you might be caching data between requests but running multiple Perl processes at the same time.) The Devel::Size module from the CPAN can be a great help here:
bash$ perl -E 'use Devel::Size qw(size); say size("a"x1024)'
1080
So in this case a string of 1024 "a" characters takes up the 1024 bytes for all the "a" characters plus 56 bytes for the internal scalar data structure (the exact size will vary slightly between versions of perl and across architectures.)
Devel::Size can also tell you how much memory nested data structures (and objects) are taking up:
bash$ perl -E 'use Devel::Size qw(total_size); say total_size({ z => [("a"x1024)x10] })'
11251
Be aware that Devel::Size will only report how much memory perl has allocated for you - not how much memory XS modules you've loaded into perl are taking up.
How Does perl Execute That?
Perl's interpreter (like those that run Python, Java, JavaScript, Ruby and many other languages) neither compiles your code to native machine instructions nor interprets the source code directly to execute it. It instead compiles the code to a bytecode representation and then 'executes' those bytes on a virtual machine capable of understanding much higher level instructions than the processor in your computer.
When you're optimising your code one of the most important things to do is reduce the number of "ops" (bytecode operations) that perl has to execute. This is because there's significant overhead in actually running the virtual machine itself, so the more you can get each Perl op to do the better, even if that op itself is more expensive to run.
For example, here's a script that counts the number of "a" characters in its argument by using the index function to repeatedly search for the next "a", increasing a counter each time we find one:
perl -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
3
Let's look at what ops that program actually creates. This can be done with the B::Concise module that ships with perl:
bash$ perl -MO=Concise -E '$c++ while $pos = index($ARGV[0], "a", $pos) + 1; say $c' aardvark
l <@> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 42 -e:1) v:%,{ ->3
g <@> leave vK* ->h
3 <0> enter v ->4
- <1> null vKP/1 ->g
c <|> and(other->d) vK/1 ->g
b <2> sassign sKS/2 ->c
9 <2> add[t7] sK/2 ->a
7 <@> index[t6] sK/3 ->8
- <0> ex-pushmark s ->4
- <1> ex-aelem sK/2 ->5
- <1> ex-rv2av sKR/1 ->-
4 <#> aelemfast[*ARGV] s ->5
- <0> ex-const s ->-
5 <$> const[GV "a"] s ->6
- <1> ex-rv2sv sK/1 ->7
6 <#> gvsv[*pos] s ->7
8 <$> const[IV 1] s ->9
- <1> ex-rv2sv sKRM*/1 ->b
a <#> gvsv[*pos] s ->b
- <@> lineseq vK ->-
e <1> preinc[t2] vK/1 ->f
- <1> ex-rv2sv sKRM/1 ->e
d <#> gvsv[*c] s ->e
f <0> unstack v ->4
h <;> nextstate(main 42 -e:1) v:%,{ ->i
k <@> say vK ->l
i <0> pushmark s ->j
- <1> ex-rv2sv sK/1 ->k
j <#> gvsv[*c] s ->k
It's not important to really understand this in any great detail; all we need to worry about is that, firstly, it's very big for what we're trying to do, and secondly, it's looping, so the ops we can see are going to be executed multiple times.
Let's try an alternative approach, using the transliteration operator to translate all the "a" characters to "a" characters (so, do nothing) and return how many characters it 'changed':
bash$ perl -MO=Concise -E '$c = $ARGV[0] =~ tr/a/a/; say $c' aardvark
b <@> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 42 -e:1) v:%,{ ->3
6 <2> sassign vKS/2 ->7
- <1> null sKS/2 ->5
- <1> ex-aelem sK/2 ->4
- <1> ex-rv2av sKR/1 ->-
3 <#> aelemfast[*ARGV] s ->4
- <0> ex-const s ->-
4 <"> trans sS/IDENT ->5
- <1> ex-rv2sv sKRM*/1 ->6
5 <#> gvsv[*c] s ->6
7 <;> nextstate(main 42 -e:1) v:%,{ ->8
a <@> say vK ->b
8 <0> pushmark s ->9
- <1> ex-rv2sv sK/1 ->a
9 <#> gvsv[*c] s ->a
Ah! Far fewer ops! And no loops! This is because the call to tr is a single op, meaning this whole thing is much faster. Of course, don't take my word for it - run a benchmark:
#!/usr/bin/perl
use Benchmark qw(cmpthese);
cmpthese(10_000_000, {
'index' => sub { my $c; my $pos; $c++ while $pos = index($ARGV[0], "a", $pos) + 1 },
'tr' => sub { my $c; $c = $ARGV[0] =~ tr/a/a/ },
});
bash$ ./benchmark.pl aardvark
Rate index tr
index 2439024/s -- -39%
tr 4016064/s 65% --
And finally
This is just a smattering of the modules that can help you poke around inside the internals of Perl - practically the national sport of the Programming Republic of Perl. The CPAN contains a very large number of modules that can do all kinds of clever things - try looking on the CPAN for "B::" and "Devel::" modules.
Today I'm going to talk about parsing documents that use XML Namespaces with XML::Easy. While XML::Easy doesn't (by design) ship with its own XML Namespace aware parser, one of my modules XML::Easy::Transform::RationalizeNamespacePrefixes makes parsing documents that use namespaces a doddle with just one simple function call.
The problems namespaces solve and cause
XML Namespaces is an extension of the XML 1.0 specification that allows multiple standards to cooperate so they don't use the same names for their nodes, meaning it's possible to use more than one specification in the same document at the same time without conflict. For example, here's an XML document that uses two different made up specs at the same time to describe a pub lunch, where both specs use the tag "chips" to mean different things:
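A sketch of such a document (the namespace URLs are made up for this example, and the second one is entirely hypothetical):

<lunch xmlns:prepackaged="http://twoshortplanks.com/ns/example/behindthebar"
       xmlns:grub="http://twoshortplanks.com/ns/example/kitchen">
  <prepackaged:chips type="cheese-and-onion"/>
  <grub:chips type="thick-cut"/>
</lunch>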
The way the XML Namespace specification works is by using a convention of naming nodes with an extra prefix. This allows what would otherwise be the same name in the same document to have a different semantic meaning. For example, the "chips" nodes are written as "prepackaged:chips" when they're referring to crisps, and "grub:chips" when they're referring to fries. The clever bit of XML Namespaces is that it doesn't matter what prefixes you use to differentiate the two from each other, but what namespace URLs they map to. For example, this document here is considered to be essentially identical to the previous example as far as a namespace aware XML parser is concerned:
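Continuing the made-up document above, with the prefixes renamed and one namespace now declared on the very node that uses it:

<lunch xmlns:a="http://twoshortplanks.com/ns/example/behindthebar">
  <a:chips type="cheese-and-onion"/>
  <b:chips type="thick-cut" xmlns:b="http://twoshortplanks.com/ns/example/kitchen"/>
</lunch>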
The meaning of the prefix is entirely derived from the presence of the xmlns-prefixed attributes on the node, or on a parent node, mapping the prefix to a URL¹. This is both great and a complete nightmare: great, because by mapping an arbitrary prefix to the unique namespace URL you're not going to get conflicts with other specifications (the way you would if each specification defined its own prefix.) And a complete nightmare, because you don't know what the thing you're looking for is actually called - either your code, or the parser, has to keep track of what namespaces are declared in the current scope and what prefixes map to what namespaces.
Using XML::Easy::Transform::RationalizeNamespacePrefixes
What would be great is if there was some way you could force everyone who gives you a document to use the prefixes you'd like; then you'd know what the nodes would be called, and you wouldn't have to worry about all these xmlns:whatever attributes in the document (and which nodes were where in the tree in relation to them.) You could just look for all the "beverage:lager" nodes.
Well, we can't force other people to do what we want, but what we can do is make use of the fact that the prefixes are arbitrary and the same document with any prefixes means the same thing. We can therefore just rewrite whatever document we're given into a form we'd like to deal with before we process it. This is the task XML::Easy::Transform::RationalizeNamespacePrefixes was designed for - it rationalises the prefixes of the namespaces to whatever you want. For example, forcing the use of "drink" and "modifier" prefixes for the namespaces:
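In code that's a single function call, along these lines (the URLs here are every bit as made up as the prefixes):

use XML::Easy::Transform::RationalizeNamespacePrefixes
    qw(rationalize_namespace_prefixes);

# $document is an XML::Easy::Element we parsed earlier
my $new_document = rationalize_namespace_prefixes($document, {
    namespaces => {
        "http://twoshortplanks.com/ns/example/behindthebar" => "drink",
        "http://twoshortplanks.com/ns/example/modifiers"    => "modifier",
    },
});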
It uses the namespace-to-prefix mapping that we passed in with the namespaces option to rename all the corresponding nodes in the document to have whatever prefixes we want. This means we now know, without looking at the xmlns attributes, what our nodes will be called.
All the namespaces have been moved to the top element of the document. In this example the module didn't need to introduce any further prefixes to do this (which can happen if the same prefix is used to refer to different URLs in different parts of the tree) nor condense prefixes to a single prefix per namespace (which happens if multiple prefixes refer to the same URL) but if it had to do that, it would have. This means it's really easy to find other namespaces that are defined in our document - you just look for xmlns attributes at the top element.
The force_attribute_prefix option forces prefixes to be attached to attribute names too.
Now we can parse the document without worrying about the namespaces at all. If we want to look for all the packets of preprepared food in the document:
use XML::Easy::Text qw(xml10_read_document);
use XML::Easy::Classify qw(is_xml_element);
use XML::Easy::NodeBasics qw(xe_twine);
use XML::Easy::Transform::RationalizeNamespacePrefixes qw(rationalize_namespace_prefixes);
sub packets {
my $element = shift;
return unless is_xml_element($element);
my @return;
push @return, $element->attribute("barstaff:type") if $element->type_name eq "barstaff:chips";
push @return, map { packets($_) } @{ xe_twine($element) };
return @return;
}
say "We need the following packets:";
say " * $_" for packets(
rationalize_namespace_prefixes(
xml10_read_document($string_of_xml), {
namespaces => {
"http://twoshortplanks.com/ns/example/behindthebar" => "barstaff",
},
force_attribute_prefix => 1,
}
)
);
There's more information on XML::Easy::Transform::RationalizeNamespacePrefixes' search.cpan.org page².
And that concludes my mini-series looking into XML::Easy. I'm sure to write more about it in the future as more interesting uses and extensions are written for it, but in my next entry I'll be taking a break from the pointy brackets!
[1] I've used the term URL multiple times in this document when I should really have used URI. We're using the http:// thingy wosit to identify a unique reference, so it should be a URI rather than a Uniform Resource Locator, because there's no resource to locate at that address. It's just a unique name.
[2] Please note that this blog post was originally published close in time to when the new version of XML::Easy::Transform::RationalizeNamespacePrefixes was uploaded to the CPAN, so not all features described in this post may have reached your local CPAN mirror if you're reading it "hot off the presses".
In this post I want to take a further look at writing code that outputs XML with XML::Easy, and how the example from my previous blog entry can be improved upon by using XML::Easy::ProceduralWriter.
What's wrong with the previous example?
When we left things in my previous post we were looking at this snippet of code that outputs a simple XML webpage:
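It went something like this ($fh being a filehandle we'd opened for writing earlier; treat this as a sketch along the lines of that snippet rather than an exact copy):

use Tie::IxHash;
use XML::Easy::NodeBasics qw(xe);
use XML::Easy::Text qw(xml10_write_document);

# a tied hash, so the keys keep their insertion order
tie my %links, 'Tie::IxHash',
    "http://search.cpan.org/"         => "Search CPAN",
    "http://blog.twoshortplanks.com"  => "Blog",
    "http://www.schlockmercenary.com" => "Schlock";

my $root_element = xe("html",
    xe("head",
        xe("title", "My Links"),
    ),
    xe("body",
        xe("h1", "Links"),
        xe("ul", map { xe("li", xe("a", { href => $_ }, $links{$_})) } keys %links),
    ),
);

print {$fh} xml10_write_document($root_element);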
The above code produces exactly the output we want, but it doesn't necessarily go about producing it in the best way.
The first problem is that it's using Tie::IxHash to ensure that the keys of the hash (and thus the nodes in the resulting XML) come out in the right order rather than in a random order like traditional hashes. Tied data structures are much slower than normal data structures and using this structure in this way is a big performance hit. However in this case we have to tie because it's hard to write, in a readable way, the logic inline in the map statement to process a normal array two elements at a time.
Which brings us on to the second problem, also related to the map statement - it's somewhat unwieldy to write and hard to read (you have to scan to the end of the line to work out that it's using the hash for its keys.) This only gets worse as you have to produce more complex XML and you try to use further (possibly nested) map statements and ternary logic expressions to build up even more complex data structures - which is every bit as messy to do as it is to explain.
Both issues stem from trying to build the XML::Easy::Element tree all in one go, essentially in one statement as a single assignment. If we choose not to restrict ourselves in this way we can easily re-order the code to use a temporary variable and do away with both the tie and the map:
my @data = (
"http://search.cpan.org/" => "Search CPAN",
"http://blog.twoshortplanks.com" => "Blog",
"http://www.schlockmercenary.com" => "Schlock",
);
my @links;
while (@data) {
my $url = shift @data;
my $text = shift @data;
push @links, xe("li", xe("a", { href => $url }, $text) );
}
my $root_element = xe("html",
xe("head",
xe("title", "My Links"),
),
xe("body",
xe("h1", "Links"),
xe("ul", @links),
),
);
print {$fh} xml10_write_document($root_element);
The problem with this solution is that now we've ended up with code that's backwards. We're creating the list elements and then creating the node that encloses them. Now we have to read to the bottom of the code to work out that we're creating an HTML document at all!
Introducing XML::Easy::ProceduralWriter
To solve this problem I wrote XML::Easy::ProceduralWriter, a module that allows you to write your code in a procedural fashion but without having to "code in reverse".
Here's the above example re-written again, this time using XML::Easy::ProceduralWriter:
use XML::Easy::ProceduralWriter;
print {$fh} xml_bytes {
element "html", contains {
element "head", contains {
element "title", contains {
text "My Links";
};
};
element "body", contains {
element "ul", contains {
my @data = (
"http://search.cpan.org/" => "Search CPAN",
"http://blog.twoshortplanks.com" => "Blog",
"http://www.schlockmercenary.com" => "Schlock",
);
while (@data) {
element "li", contains {
element "a", href => shift @data, contains {
text shift @data;
};
};
}
};
};
};
};
Using the module is straightforward. You start by calling either xml_element (which returns an XML::Easy::Element) or xml_bytes (which returns a set of bytes you can print out) and inside these calls you pass some code that generates XML elements and text. Each element can 'contain' further code that produces the sub-elements and text that element contains, and so on.
The key thing to notice is that unlike the previous examples, where you were passing data structures into the functions, here you're passing code to be executed. This means you can place arbitrary logic in what you pass in, and you're not limited to single statements. For example, in the above code we declare variables in the middle of generating the XML. The conceptual jump is realising that what the blocks of code, element and text return isn't important - what matters is the side effects of calling those two functions. The simplest way to think about it is to imagine the string being built up as the element and text statements are encountered, in much the same way output is printed straight away to the filehandle when you use print (even though technically this isn't the case here - a full XML::Easy::Element object tree is always built in the background.)
The documentation for XML::Easy::ProceduralWriter contains a reasonable tutorial that explains its usage in more detail, but it should be pretty easy to jump straight in just from reading the above code.
And that's pretty much all I have to say about outputting XML with XML::Easy. In my next post we'll look instead at advanced parsing and how to cope with documents with XML Namespace declarations.
Last week I posted about why you should be interested in the new XML parsing library on the block, XML::Easy. In this post I'm going to actually dive into some code so you can see what it's like to work with.
Parsing XML by Example
The basics of parsing are pretty straightforward:
use XML::Easy::Text qw(xml10_read_document);
# read in the document
open my $fh, "<:utf8", "somexml.xml";
or die "Can't read filehandle: $!";
my $string = do { local $/; <> };
# parse it
my $root_element = xml10_read_document($string);
Now $root_element contains an XML::Easy::Element. Getting basic facts out of this element such as its name or attribute values is easy too:
say "the name of the root element is ".$root_element->type_name;
say "the content of the href attribute is ".$root_element->attribute("href")
if defeined $root_element->attribute("href");
Getting at the child elements involves dealing with a twine. What's a twine you say? Why it's nothing more than an alternating list of strings and elements. Let's look at an example to help explain this:
my $input = '<p>Hello my <i>friend</i>, here is my picture: <img src="http://farm1.static.flickr.com/116/262065452_6017d39626_t.jpg" /></p>';
We can then call this:
my $p = xml10_read_document($input);
my $twine = $p->content_twine;
The $twine variable now contains an array reference holding alternating strings and XML::Easy::Elements:
$twine->[0] contains the string "Hello my"
$twine->[1] contains an XML::Easy::Element representing the <i> tag (which in turn will contain the text "friend")
$twine->[2] contains the string ", here is my picture: "
$twine->[3] contains an XML::Easy::Element representing the <img> tag
$twine->[4] contains the empty string "" between the <img> tag and the closing </p> tag
The important thing to remember about twines is that they always alternate string-element-string-element. When two elements are next to each other in the source document they're separated by the empty string. You'll note that a twine's first and last elements are always strings, even if they have to be empty, and an "empty" tag has a twine that contains just one element - the empty string.
Now we know the basics, let's look at a practical example. Imagine we want to get all the possible anchors (internal links) in an XHTML document. This simply involves looking for all the <a> tags that have a name attribute:
sub get_links {
my $element = shift;
my @results;
# check this element
push @results, $element->attribute("name")
if $element->type_name eq "a" && defined $element->attribute("name");
# check any child elements
my $swizzle = 0;
foreach (@{ $element->content_twine() }) {
# skip every other array element because it's a string
next if $swizzle = !$swizzle;
# recurse into the child nodes
push @results, get_links($_);
}
return @results;
}
If we want to make this even easier on ourselves there's a bunch of helper functions in the XML::Easy::Classify module that can be used to help process parts of XML documents. For example, we could have written the above code in a more terse (but less efficient) way by using is_xml_element:
use XML::Easy::Classify qw(is_xml_element);
sub get_links {
my $element = shift;
my @results;
# check this element
push @results, $element->attribute("name")
if $element->type_name eq "a" && defined $element->attribute("name");
# check any child elements
push @results, get_links($_)
foreach grep { is_xml_element $_ } @{ $element->content_twine() };
return @results;
}
Generating XML by Example
If you've got an XML::Easy::Element instance, writing it out as an XML document is just the opposite of reading it in:
use XML::Easy::Text qw(xml10_write_document);
# turn it into a string
my $string = xml10_write_document($root_element);
# write out the document
open my $fh, ">:utf8", "somexml.xml";
or die "Can't write to filehandle: $!";
print {$fh} $string;
One of the first things you have to know about XML::Easy::Elements and their contents is that they are immutable - or, put another way, you can't change them once they're created. This means they have no methods for setting the name of an element, altering the attributes, or setting the children. All of these must be passed in to the constructor.
Let's just jump in with an example. We're going to create a little code that outputs the following XML document:
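It's a simple page of links, along these lines:

<html>
  <head>
    <title>My Links</title>
  </head>
  <body>
    <h1>Links</h1>
    <ul>
      <li><a href="http://search.cpan.org/">Search CPAN</a></li>
      <li><a href="http://blog.twoshortplanks.com">Blog</a></li>
      <li><a href="http://www.schlockmercenary.com">Schlock</a></li>
    </ul>
  </body>
</html>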
(I've added extra whitespace in the above example for clarity - the code examples that follow won't reproduce this whitespace)
I'm going to start off by showing you the most verbose and explicit object-orientated way to create XML::Easy::Elements, and then I'm going to show you the much quicker functional interface once you know what you're doing. The verbose way of creating an element is to explicitly pass each of the things to the constructor:
XML::Easy::Element->new($name, $attributes_hashref, $xml_easy_content_instance)
The trouble with writing code like that is that it often requires pages and pages of code that puts Java to shame in its repetition of the obvious (you don't really need to read the following code, just gawk at its length:)
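Here's an abridged sketch of that style - just one list item and the list that holds it, assuming XML::Easy::Content->new takes a reference to a twine array:

use XML::Easy::Element;
use XML::Easy::Content;

my $anchor = XML::Easy::Element->new(
    "a",
    { href => "http://search.cpan.org/" },
    XML::Easy::Content->new([ "Search CPAN" ]),
);

my $item = XML::Easy::Element->new(
    "li",
    {},
    XML::Easy::Content->new([ "", $anchor, "" ]),
);

my $list = XML::Easy::Element->new(
    "ul",
    {},
    XML::Easy::Content->new([ "", $item, "" ]),
);

# ...and so on for the other list items, the <h1>, <body>, <head>,
# <title> and <html> elements, all the way up to the root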
So, we never ever write code like that! For starters we could use twines instead of content objects, but that's too verbose too. Instead we use the functional interface presented by XML::Easy::NodeBasics:
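Using xe, the whole document comes out as something like:

use XML::Easy::NodeBasics qw(xe);

my $root_element = xe("html",
    xe("head",
        xe("title", "My Links"),
    ),
    xe("body",
        xe("h1", "Links"),
        xe("ul",
            xe("li", xe("a", { href => "http://search.cpan.org/" }, "Search CPAN")),
            xe("li", xe("a", { href => "http://blog.twoshortplanks.com" }, "Blog")),
            xe("li", xe("a", { href => "http://www.schlockmercenary.com" }, "Schlock")),
        ),
    ),
);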
The xe function simply takes a tag name followed by a list of things that are either hashrefs (containing attributes), strings (containing text), or XML::Easy::Elements (child elements.) It can also take content objects and twines, which is handy when you're re-using fragments of XML that you've extracted from other documents you may have parsed. In short, it Does The Right Thing with whatever you throw at it.
Of course, we can optimise further by remembering that this code is Perl:
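For instance, the repetitive list items can be generated with a map (a sketch; if the order of the links matters you'd want a tied, order-preserving hash such as Tie::IxHash rather than a plain one):

my %links = (
    "http://search.cpan.org/"         => "Search CPAN",
    "http://blog.twoshortplanks.com"  => "Blog",
    "http://www.schlockmercenary.com" => "Schlock",
);

my $root_element = xe("html",
    xe("head", xe("title", "My Links")),
    xe("body",
        xe("h1", "Links"),
        xe("ul", map { xe("li", xe("a", { href => $_ }, $links{$_})) } keys %links),
    ),
);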
And that's about it for basic XML parsing and generation with XML::Easy. There are a lot more handy functions, and explanations of the theory behind XML::Easy, in the documentation. In my next post I'm going to look at another way of creating XML using XML::Easy, when I talk about one of my own modules: XML::Easy::ProceduralWriter.
Some days, you just want to parse an XML document.
However, the standard distribution of Perl doesn't ship with a bundled XML parser, traditionally instead requiring the user to install a module from CPAN. This means there's no standard way to do this. Instead there are several choices of parser, each with their advantages and disadvantages: There is, as we often say in Perl, more than one way to do it. This is the first post in a series where I'm going to talk about XML::Easy, a relatively new XML parsing module that deserves a little more publicising.
But why another XML parsing library? What's wrong with the others? Well, a few things...
One of the biggest problems with the most popular XML parsing modules like XML::LibXML and XML::Parser is that they rely on external C dependencies being installed on your system (libxml2 and expat respectively), so it can be hard to rely on them being installable on any old system. Suppose you write some software that relies on these modules. What exactly are you asking of the users of your software who have to install that module as a dependency? You're asking them firstly to have a C compiler installed - something people using ActiveState Perl, basic web-host providers, or even Mac OS X users without the dev tools do not have. Even more than this, you're often asking them to download and install (either by hand or via their package management system) the external C libraries that these modules rely on, and then know how to configure the C compiler to link to them. Complicated!
To solve this XML::Easy ships with a pure Perl XML parser requiring neither external libraries nor a C compiler to install: in a pinch you can simply copy the Perl modules into your library path and you're up and ready to go. This means that this library can be relied on pretty much anywhere.
The observant will point out that there are many existing pure Perl XML parsing libraries on CPAN. They suffer from another problem: They're slow. Perl runs not as native instructions but as interpreted bytecode executing on a virtual machine, which is a technical way of saying "in a way that makes lots of very simple operations slow." This is why the boring details of XML parsing are normally handled in C space.
Luckily, XML::Easy doesn't use its pure Perl parser unless it really has to. On systems that do have a working C compiler it prefers to compile and install its own C code for parsing XML. Note that this C code, bound into the perl interpreter with fast XS, is wholly self contained and doesn't rely on external libraries. All the user on a modern Linux system has to do to install the module is type cpan XML::Easy at the command prompt. In this mode XML::Easy is fast: in our tests it's at least as fast as XML::LibXML (which is to say, very fast indeed.) This week I've been re-writing some code that used to use MkDoc::XML to use XML::Easy instead, and the new code is 600 (yes, six hundred) times faster.
This is great news for module authors who just want to do something simple with fast performance if they can get it, but don't want to have to worry about putting too much of a burden on their users.
Of course, this would all be for naught if XML::Easy didn't do a good job of parsing XML - but it does. The other big failing of the many so-called XML parsers for Perl is that they screw up the little but important things. They miss part of the specification (sometimes even deliberately!) or they don't let you do things properly like handle unicode. XML::Easy isn't like this: It follows the specification quite carefully (with the devotion I've come to expect from its author, my co-worker Zefram) and doesn't screw up unicode because it doesn't attempt to handle character encodings itself but embraces and works with Perl's own unicode handling.
So by now, I'll have either sold you on the idea of XML::Easy or not, but I haven't really shown you how to use it. In the next post in this series I'm going to start talking about how you can use XML::Easy to parse XML and extract which bits you want.
In this blog post I'm going to talk about my own custom "no paste" solution that I've developed over the years. How I started out using a web page as a service, moved to scripting this from the command line, and how I finally ended up subverting DropBox to my ends. Skip to the end to get to the code.
So, what's "no paste" I hear you ask? "No paste" servers allow you to go to a webpage and submit a bit of text and get a unique url back where that text can be viewed. This means that you don't have to paste a whole bunch of text into IRC or an IM conversation, you can just upload your text and and copy and paste only the link the "no paste" server sent back into your IM. If you haven't seen one before check out textmate's nopaste server.
This has several advantages. First and foremost, it doesn't "spam" the conversation you're having. Pasting a whole bunch of code into an IRC channel where people are having a conversation, causing that conversation to scroll off the screen before they can read it, isn't polite. Secondly it makes the text easier to read and easier to copy and paste into an editor (for example, most IRC and IM clients will prepend each line someone says with a datestamp when you copy and paste from them.)
Excellent. A useful idea. Now how can we make it better?
As a Perl programmer I tend to automate a heck of a lot of what I do with my computer. Filling in a form on a webpage is easy to do, but it's more hassle than hitting a key combination that pastes whatever text is highlighted in your editor. If we do it a lot we should make the computer do all the work!
For a long time I used the App::Nopaste module on CPAN which installs a command line utility called nopaste which integrates with a vast range of "no paste" servers. This utility can take input on the command line and automatically fill in the forms on those websites for you. This means that it's trivial to execute from your editor - in textmate it's just a simple "Bundle" command.
In the end I stopped using nopaste not because I had a problem with the script, but because I had a problem with the nopaste servers, in particular the lack of privacy. Now, I'm a great believer in simply not putting anything on the internet that is truly private (face it, it's going to get out!) but there exists a bunch of "semi-private" stuff (closed source code, contact information, private correspondence) that shouldn't be put on a totally public paste service. Often it's just a case of editing the URL that the "no paste" server returns by increasing or decreasing the number at the end to see the thing the next or previous person pasted!
So in the end I decided it might be a good idea to run my own custom "No Paste" solution with semi-secure (non-guessable) URLs. One problem with that: I couldn't justify the infrastructure - I'm currently trying to reduce the amount of stuff I have to maintain, not increase it. So I started looking at what infrastructure I'm already using and seeing how I can better use that.
Enter DropBox. DropBox is a service that syncs a directory on your various computers with each other and the DropBox server. And one of the things it does is publish files in a certain directory as accessible from the web. This simplifies my problem a lot: All I need to do to have my own "No Paste" solution is simply have an easy way of getting text into a file on my hard drive and let the DropBox program automatically handle the "uploading" to a hosted service.
So, below is the script I wrote to do that. Features include:
Using a web-safe version of the "MD5 hex" one way hash of the file contents as the filename. This means that it's both unguessable unless you know what the text contains and reasonably guaranteed to be unique
Taking input from STDIN or the system clipboard
Printing out the URL that the text will be available at, and/or copying it to the clipboard, and/or displaying it in a Growl message
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;
use Path::Class qw(file dir);
use Digest::MD5 qw(md5_base64);
use Net::Growl qw(register notify);
use Getopt::Std qw(getopt);
########################################################################
# config
my $DROPBOX_ID = "301667";
my $GROWL_PASSWORD = "shoutout";
########################################################################
# get the config options
my %opt;
getopt("",\%opt);
# read the entire of STDIN / the files passed on the command line
my $data = $opt{c}
? read_clipboard()
: do { local $/; scalar <> };
# work out the digest for it. Convert the non url safe characters
# to url safe characters
my $uuid = md5_base64($data);
$uuid =~ s{/}{-}g;
$uuid =~ s{\+}{_}g;
# copy the data to the new file
open my $fh, ">:bytes",
file($ENV{HOME},"Dropbox","Public","nopaste","$uuid.txt");
print {$fh} $data;
close $fh;
# output the url that dropbox will make that file available at
my $url = "http://dl.getdropbox.com/u/$DROPBOX_ID/nopaste/$uuid.txt";
say $url unless $opt{q};
write_clipboard($url) if $opt{p};
if ($opt{g}) {
my $message = "shortly at $url";
$message .= " (copied to clipboard)" if $opt{p};
growl("Text Dropped", $message);
}
########################################################################
# this is Mac OS X dependent. I'd use the Clipboard module from CPAN
# to make this system independent, but it fails tests.
sub read_clipboard {
open my $pfh, "-|", "pbpaste";
local $/;
return scalar <$pfh>;
}
sub write_clipboard {
my $data = shift;
open my $pfh, "|-", "pbcopy";
print {$pfh} $data;
close $pfh;
}
sub growl {
my $title = shift;
my $description = shift;
register(
application => "droptxt",
password => $GROWL_PASSWORD,
);
notify(
application => "droptxt",
password => $GROWL_PASSWORD,
title => $title,
description => $description,
);
}
########################################################################
__END__
=head1 NAME
droptxt - easily write text to a file in your public dropbox
=head1 SYNOPSIS
# read from stdin
bash$ droptxt
this is some text
^D
http://dl.getdropbox.com/u/301667/nopaste/4ZwSg8klsyBmhf9SKs-j5g.txt
# read from a file
bash$ droptxt some_text.txt
http://dl.getdropbox.com/u/301667/nopaste/asdSDsq_asdQsasdw12s3d.txt
# read from the clipboard
bash$ droptxt -c
http://dl.getdropbox.com/u/301667/nopaste/cssj12-22WWdsqQfxjpDDe.txt
# also paste the url to the clipboard
bash$ droptxt -p some_text.txt
http://dl.getdropbox.com/u/301667/nopaste/asdSDsq_asdQsasdw12s3d.txt
=head1 DESCRIPTION
This is a command line utility that is designed to be used as an
alternative to "no paste" utilities. Instead of sending the input to a
webserver it simply writes it to a location on your hard drive where the
DropBox utility will synchronize it with the DropBox webservers.
=head2 Options
=over
=item -c
Copy the input from the system clipboard rather than from the usual
location.
=item -p
Paste the url to the system clipboard.
=item -g
Announce the url via Growl
=item -q
Do not print the url to STDOUT
=back
=head1 AUTHOR
Copyright Mark Fowler E<lt>mark@twoshortplanks.comE<gt> 2009. All rights reserved.
This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
=head1 BUGS
Doesn't wait for DropBox to sync the file. The URL this creates may not be
usable straight away!
=head1 SEE ALSO
L<http://www.getdropbox.com> for details on the service.
L<App::Nopaste> for a utility that uses public "no paste" servers instead.
=cut