5.18 Hash Key Ordering Changes In A Nutshell

$ perlbrew switch perl-5.16.3
$ perl -E '%h=map {$_=>1} (1..10); say join ",",keys %h for 1..3'
6,3,7,9,2,8,1,4,10,5
6,3,7,9,2,8,1,4,10,5
6,3,7,9,2,8,1,4,10,5
$ perl -E '%h=map {$_=>1} (1..10); say join ",",keys %h for 1..3'
6,3,7,9,2,8,1,4,10,5
6,3,7,9,2,8,1,4,10,5
6,3,7,9,2,8,1,4,10,5
$ perlbrew switch perl-5.18.0
$ perl -E '%h=map {$_=>1} (1..10); say join ",",keys %h for 1..3'
10,2,8,3,6,9,5,4,1,7
10,2,8,3,6,9,5,4,1,7
10,2,8,3,6,9,5,4,1,7
$ perl -E '%h=map {$_=>1} (1..10); say join ",",keys %h for 1..3'
4,6,9,1,10,3,2,7,8,5
4,6,9,1,10,3,2,7,8,5
4,6,9,1,10,3,2,7,8,5

Now each execution has its own hash key ordering (but hash key ordering is the same for the duration of the execution until you insert a new key.)

perl-5.18 increases the chance that keys will be reordered when inserts happen:


$ perlbrew switch perl-5.16.3
$ perl -E 'for (1..20) { $h{$_}=1; say join ",",keys %h }'
1
1,2
1,3,2
4,1,3,2
4,1,3,2,5
6,4,1,3,2,5
6,4,1,3,7,2,5
6,3,7,2,8,1,4,5
6,3,7,9,2,8,1,4,5
6,3,7,9,2,8,1,4,10,5
6,11,3,7,9,2,8,1,4,10,5
6,11,3,7,9,12,2,8,1,4,10,5
6,11,3,7,9,12,2,8,1,4,13,10,5
6,11,3,7,9,12,2,14,8,1,4,13,10,5
6,11,3,7,9,12,2,15,14,8,1,4,13,10,5
11,7,2,1,16,13,6,3,9,12,14,15,8,4,10,5
11,7,17,2,1,16,13,6,3,9,12,14,15,8,4,10,5
11,7,17,2,1,18,16,13,6,3,9,12,14,15,8,4,10,5
11,7,17,2,1,18,16,13,6,3,9,12,14,15,8,4,19,10,5
11,7,17,2,1,18,16,13,6,3,9,12,20,14,15,8,4,19,10,5
$ perlbrew switch perl-5.18.0
$ perl -E 'for (1..20) { $h{$_}=1; say join ",",keys %h }'
1
2,1
1,2,3
2,4,1,3
4,1,2,5,3
2,4,1,5,3,6
4,1,2,6,5,7,3
7,8,1,3,5,6,2,4
4,2,6,3,5,8,1,9,7
9,7,8,1,6,3,5,10,4,2
9,7,11,8,1,6,5,10,3,4,2
11,8,1,7,9,2,12,4,5,10,3,6
4,13,12,2,6,3,5,10,8,1,11,9,7
11,14,8,1,7,9,2,12,4,13,5,10,3,6
4,13,12,15,2,6,3,5,10,14,8,1,11,9,7
9,16,7,8,14,3,13,12,15,2,1,11,6,5,10,4
5,10,17,6,4,11,1,3,2,12,15,13,16,7,9,8,14
4,17,6,5,10,1,11,13,12,15,2,18,3,8,14,9,16,7
17,6,5,10,4,1,11,19,3,13,12,15,2,18,9,16,7,8,14
11,1,4,5,10,17,6,8,14,16,7,9,2,18,20,12,15,13,3,19

This can be controlled with the PERL_PERTURB_KEYS environment variable


$ perlbrew switch perl-5.18.0
$ PERL_PERTURB_KEYS=0 perl -E'for (1..20) { $h{$_}=1; say join ",",keys %h }'
1
2,1
3,2,1
3,2,1,4
5,3,2,1,4
5,6,3,2,1,4
5,6,3,2,1,4,7
5,2,3,6,8,1,4,7
5,2,3,6,8,9,1,4,7
5,2,3,6,10,8,9,1,4,7
5,2,3,6,10,11,8,9,1,4,7
5,2,3,6,10,11,12,8,9,1,4,7
5,2,3,6,10,13,11,12,8,9,1,4,7
5,2,3,6,10,13,11,14,12,8,9,1,4,7
5,2,3,6,10,13,11,14,12,8,15,9,1,4,7
3,14,12,16,8,15,9,4,7,5,6,2,10,11,13,1
17,3,14,12,16,8,15,9,4,7,5,6,2,10,11,13,1
18,17,3,14,12,16,8,15,9,4,7,5,6,2,10,11,13,1
18,17,3,19,14,12,16,8,15,9,4,7,5,6,2,10,11,13,1
18,17,3,19,14,12,16,8,15,9,4,7,5,6,2,10,11,13,20,1

Posted in Uncategorized

Permalink Leave a comment

Another Feature Perl 5 Needs in 2012

In Features Perl 5 Needs in 2012 chromatic asks the question What’s on your list?  Well, my list is long.  But the thing at the top is structured core exceptions.

While I don’t think Perl 5′s exception throwing syntax is completely solved by Try::Tiny (just as it’s time for Perl 5 to have it’s own MOP it’s time for Perl 5 to have its own exception handling syntax to solve that problem in a well-defined way too) that’s not the biggest stumbling block to having awesome exception handling in Perl 5.  The main problem is that Perl 5 still throwing plain old strings as exceptions.

Often I find myself wanting to catch all IO errors, but not runtime stupid-coding errors (My error code is going to handle  a dumb filename, but it’s totally not going to know what to do if I’ve typoed a method name.)  This is hard because my code has to parse the exception string – essentially throw a bunch of regular expressions at it – until it can hopefully figure out what’s going on.  And, of couse, I say hopefully because the strings can (and should, as improvements are made) change between versions of perl.

The correct solution is for Perl to throw structured exceptions.  Where all IO errors are a subclass of the main IO error and all bad method calls are a subclass of the DispatchError error or somesuch.  Then I only have to check that $@ isa IO Error and the whole problem becomes simple.

Now in theory someone could build a version of Try::Tiny that had all this logic built in – did all this parsing for me – so it would seem like I had structured exceptions for all the inbuilt errors.  And as long as this module was updated for each version of perl this would probably work and maybe just be good enough.  But it’s not the correct solution!  It’s going to be slow (oh so slow) and brittle and…bad and right.  I want (no, I demand!) a good and right solution.  Enough with the band-aids!

So structured core exceptions goes to the top of my list.  Which of course means very little since I’m not a core Perl hacker (nor do I play one on TV) and I’m not going to have time or skills to do it myself.  But if the TPF wants to take some of that money I’m paying for core development every month and pay someone to do it, I’d be a very happy Perl programmer. 

Posted in Uncategorized

Permalink 9 Comments

On Diversity in Tech Communities

One of the advantages of working from home is that I can do more interesting things with my lunch break that simply eat lunch. So, every week for three years I took my daughters to a local sing-a-long group.

Now as you can imagine, this group was primarily made up of mothers and their children. The people who run the sessions are women. On the odd occasion another father would turn up, but it was mostly just me and thirty women and their children.

No-one ever implied that I shouldn’t be there. No-one ever made jokes about men being useless. People didn’t try to have “Cosmo” type conversations with me that would make me blush. No-one even made any comment implying it was un-manly for me to sing along with nursery rhymes like all the other parents did.

All in all you could say it was great. I’d never accuse anyone at any of these events of being sexist.

But then again, every so often the group would sing a song about Bobby Shaftoe. For those of you not familar the lyrics go:

Bobby Shaftoe went to sea,
Silver buckles on his knee.
He’ll come back and marry me,
Pretty Bobby Shaftoe.

I never – in three years – spoke up about how uncomfortable these lyrics are for a straight man to sing. In the end I just stopped singing them.

Did this really bother me that much? Not really, otherwise I would have said something. But it let me experience in the tiniest possible little way what it’s like to be suddenly reminded that you’re different to everyone else in the group and to find out that you can’t join in what everyone else is doing because it’s not designed for you.

So, with this in mind I wish that people would understand that when I’m suggesting a code of conduct for a tech community my primary objective is not to suggest a list of things you can and can’t do. Nor am I suggesting that people are deliberately being nasty. I’m just trying to encourage everyone to think a little wider about the other people in their community that aren’t just like them – because even the best of us sometimes can have a blind spot. You know, it’s not always about the big things.

Sometimes I just don’t want community member to have to sing songs about their desire to marry a sailorman. And if they do find themselves in a situation where someone is asking them to declare their desire for silver buckled knee wearers that they feel like they can politely point out that they shouldn’t have to.

That is all.

Oh Function Call, Oh Function Call, Wherefore Art Thou Function Call

Ever wondered where all places a function is called in Perl? Turns out that during my jetlagged ridden first day back at work, I wrote a module that can do just that called Devel::CompiledCalls.

Use of the module is easy. You just load it from the command line:

shell$ perl -c -MDevel::CompiledCalls=Data::Dumper::Dumper myscript.pl

And BAM, it prints out just what you wanted to know:

Data::Dumper::Dumper call at line 100 of MyModule.pm
Data::Dumper::Dumper call at line 4 of myscript.pl
myscript.pl syntax OK

While traditionally it’s been easy to write code using modules like Hook::LexWrap that prints out whenever a function is executed and at that point where that function is called from, what we really want is to print out at the time the call to the function is compiled by the Perl compiler. This is important because you might have a call to the function in some code that is only executed very infrequently (e.g. leap year handling) which would not be simply identified by hooking function execution at run time.

In the past developers have relied too much on tools like search and replace in our editors to locate the function calls. Given that Perl is hard to parse, and given that many of the calls might be squirreled away in installed modules that your editor doesn’t typically have open, this isn’t the best approach.

What Devel::CompiledCalls does is hook the compilation of the code (techically, we hook the CHECK phase of the code, but the effect is the same) with Zefram’s B::CallChecker. This allows the callback to fire as soon as the code is read in by Perl.

All in all, I’m pretty happy with the module and it’s a new tool in my bag of tricks for getting to grips with big, unwieldy codebases.

Posted in Uncategorized

Permalink 1 Comment

Coming To America

Big changes are afoot

Not many people know this yet, but Erena, myself and the girls are in the process of emigrating. Yesterday I completed purchase of our new house in New Lebanon, New York, in the United States of America. Yes, this means that I’m going to be attending London.pm meetings even less than I did when I moved to Chippenham.

One of the consequences of my move is a change of jobs. At the end of April I’ll be leaving the wonderful Photobox and starting to do some work for the equally wonderful OmniTI.

Photobox is a great place to work, with a world class Perl team. In the last four years I’ve worked for them I haven’t complemented them enough. What other place can I work next to regular conference speakers? Core committers? People with more CPAN modules than I can shake a stick at? Perl Pumpkins themselves! To be blunt, if I was to stay in the UK I’d can’t imagine I’d want to work anywhere else.

It’s been an interesting four years, seeing the company grow and grow and dealing with the scalaing problems – both in terms of the number of customers and the challenges of growing the team – and I’ve really enjoyed it. But new adventure beckons, and I needn’t sing the praises of OmniTI to tell everyone here what great company they are and I’m going to be honoured to work with them.

So, once I’ve completed the gauntlet of my Green Card application (which is scheduled to take many months yet) it’ll be off with a hop, skip and a jump over the pond for a new life. Can’t wait.

Posted in Uncategorized

Permalink 3 Comments

Frac’ing your HTML

In my previous blog entry I talked about encoding weird characters into HTML entities. In this entry I’m going to talk about converting some patterns of ASCII – dumb ways of writing fractions – and turning them into HTML entities or Unicode characters.

Hey Good Looking, What’s Cooking?

Imagine a simple recipe:

<ul>
   <li>1/2 cup of sugar</li>
   <li>1/2 cup of spice</li>
   <li>1/4 cup of all things nice</li>
</ul>

While this is nice, we can do better. There’s nice Unicode
characters for ¼, ½ and corresponding HTML entities that we can use to have the browser render them for us. What we need is some way to change all our mucky ASCII into these entities. Faced with this problem on his recipes site European Perl Hacker Léon Brocard wrote a module called HTML::Fraction that could tweak strings of HTML.

use HTML::Fraction;
my $frac = HTML::Fraction->new();
my $output = $frac->tweak($string_of_html);

This module creates output like:

<ul>
   <li>&frac12; cup of sugar</li>
   <li>&frac12; cup of spice</li>
   <li>&frac14; cup of all things nice</li>
</ul>

Which renders nicely as:

  • ½ cup of sugar
  • ½ cup of spice
  • ¼ cup of all things nice

HTML::Fraction can even cope with decimal representation in your string. For example:

  • 0.5 slugs
  • 0.67 snails
  • 0.14 puppy dogs tails

Processed with HTML::Fraction renders like so:

  • ¼ slugs
  • ⅔ snails
  • ⅐ puppy dogs tails

Unicode Characters Instead

Of course, we don’t always want to render out HTML. Sometimes we just want a plain old string back. Faced with this issue myself, I wrote a quick subclass called String::Fraction:

use String::Fraction;
my $frac = String::Fraction->new();
my $output = $frac->tweak($string);

The entire source code of this module is short enough that I can show you it here.

package String::Fraction;
use base qw(HTML::Fraction);

use strict;
use warnings;

our $VERSION = "0.30";

# Our superclass sometimes uses named entities
my %name2char = (
  '1/4'  => "\x{00BC}",
  '1/2'  => "\x{00BD}",
  '3/4'  => "\x{00BE}",
);

sub _name2char {
  my $self = shift;
  my $str = shift;

  # see if we can work from the Unicode character
  # from the entity returned by our superclass
  my $entity = $self->SUPER::_name2char($str);
  if ($entity =~ /\A &\#(\d+); \z/x) {
    return chr($1);
  }

  # superclass doesn't return a decimal entity?
  # use our own lookup table
  return $name2char{ $str }
}

We simply override one method _name2char so that instead of returning a HTML entity we
return corresponding Unicode character.

Posted in Uncategorized

Permalink 2 Comments

Once is Enough

In this blog post I discuss how HTML entities work, how to encode them with Perl, and how to detect when you’ve accidentally double encoded your entities with my module Test::DoubleEncodedEntities.

How HTML Entities work

In HTML you can represent any character in simple ASCII by using entities. These come in two forms, either using the decimal codepoint of the character or, for some frequently used characters more readable human named entities

Character Unicode codepoint Decimal Entity Named Enitity
é 233 é &eacute;
© 169 © &copy;
9731 none
< 60 < &lt;
& 38 & &amp;

So instead of writing

<!DOCTYPE html>
<html><body>© 2012 Mark Fowler</body></html>

You can write

<!DOCTYPE html>
<html><body>&copy; 2012 Mark Fowler</body></html>

By delivering a document in ASCII and using entities for any codepoints above 127 you can ensure that even the most broken of browsers will render the right characters.

Importantly, when an entity is converted back into a character by the browser the character no longer has any of its special meaning, so you can use encoding to escape sequences that would otherwise be considered markup. For example:

<!DOCTYPE html>
<html><body>say "yep"
  if $ready &amp;&amp; $bad &lt; $good;
</body></html>

Correctly renders as

say "yep" if $ready && $bad < $good;

Encoding Entities with Perl

The go-to module for encoding and decoding entities is HTML::Entities. Its use is simple: You pass the string you want to encode into the encode_entities function and it returns the same string with the entities encoded:

use HTML::Entites qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
my $encoded = encode_entities($string);
say "<!DOCTYPE html>"
say "<html><body>$encoded</body></html>";

If you no longer need the non-encoded string you can have HTML::Entities modify the string you pass to it by not assigning the output to anything (HTML::Entities is smart enough to notice it’s being called in void context where its return value is not being used.)

use HTML::Entites qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);
say "<!DOCTYPE html>"
say "<html><body>$string</body></html>";

The Double Encoding Problem

The trouble with encoding HTML entities is that if you do it a second time then you end up with nonsensical looking text. For example

use HTML::Entites qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);
encode_entities($string);
say "<!DOCTYPE html>"
say "<html><body>$string</body></html>";

Outputs

<!DOCTYPE html>
<hmtl><body>&amp;copy; Mark Fowler 2012</body></html>

Which when rendered by the browser displays

&copy; Mark Fowler 2012

As the &amp; has turned into & but isn’t then combind with the copy; to turn it into the copyright symbol ©.

Each subsequent encoding turns the & at the start of the entity into &amp;, including those at the start of any previously created &amp;. Do this ten or so times and you end up with:

&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;copy; Mark Fowler 2012

The obvious solution is to make sure you encode the entities only once! But that’s not as easy as it might seem. If you’re building your output up from multiple processes it’s quite easy to mistakenly encode twice; Worse, if you’re using data that you don’t control (for example, extracted from a web page, downloaded from a feed, imported from a user) you might find that some or more of it had unexpectedly already been encoded.

Testing for the Problem

I recently re-released my module Test::DoubleEncodedEntities that can be used to write automated tests for double encoding.

use Test::More tests => 1;
use Test::DoubleEncodedEntities;
ok_dee($string, "check for double encoded entities");

It works heuristically by looking for strings that could possibly be double encoded entities. Obviously there’s lots of HTML documents out there where it’s perfectly legitimate to have double encoded entities: any of them talking about entity encoding, such as this blog post itself, will naturally do do. However, the vast majority – where you control the input – will not have these format of strings and we can test for them.

For example:

use Test::More tests => 6;
use Test::DoubleEncodedEntities;

ok_dee("&copy; Mark Fowler 2012",     "should pass");
ok_dee("&amp;copy; Mark Fowler 2012", "should fail");
ok_dee("&copy; Mark Fowler 2012", "should fail");
ok_dee("© Mark Fowler 2012",     "should pass");
ok_dee("&amp;#169; Mark Fowler 2012", "should fail");
ok_dee("&#169; Mark Fowler 2012", "should fail");

Produces the output:

1..6
ok 1 - should pass
not ok 2 - should fail
#   Failed test 'should fail'
#   at test.pl line 5.
# Found 1 "&amp;copy;"
not ok 3 - should fail
#   Failed test 'should fail'
#   at test.pl line 6.
# Found 1 "&copy;"
ok 4 - should pass
not ok 5 - should fail
#   Failed test 'should fail'
#   at test.pl line 8.
# Found 1 "&amp;#169;"
not ok 6 - should fail
#   Failed test 'should fail'
#   at test.pl line 9.
# Found 1 "&#169;"
# Looks like you failed 4 tests of 6.

Correctly detecting the double encoded entities in the should fail tests

Posted in Uncategorized

Permalink 7 Comments

Follow

Get every new post delivered to your Inbox.