Once is Enough

In this blog post I discuss how HTML entities work, how to encode them with Perl, and how to detect when you’ve accidentally double encoded your entities with my module Test::DoubleEncodedEntities.

How HTML Entities work

In HTML you can represent any character in plain ASCII by using entities. These come in two forms: numeric entities, which use the decimal codepoint of the character, and, for some frequently used characters, more readable named entities.

Character | Unicode codepoint | Decimal Entity | Named Entity
© | U+00A9 | &#169; | &copy;

So instead of writing

<!DOCTYPE html>
<html><body>© 2012 Mark Fowler</body></html>

You can write

<!DOCTYPE html>
<html><body>&copy; 2012 Mark Fowler</body></html>

By delivering a document in ASCII and using entities for any codepoints above 127 you can ensure that even the most broken of browsers will render the right characters.
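To make the relationship between a codepoint and its decimal entity concrete, here is a minimal sketch in plain Perl. The helper name to_ascii_entities is my own invention for illustration, not part of any module; it simply replaces every character above codepoint 127 with its decimal entity.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): replace every character whose
# codepoint is above 127 with the matching decimal entity.
sub to_ascii_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf("&#%d;", ord($1))/ge;
    return $text;
}

my $ascii = to_ascii_entities("\x{a9} 2012 Mark Fowler");
print "$ascii\n";    # &#169; 2012 Mark Fowler
```

Note that this sketch leaves &, < and > alone, so it is not a substitute for proper escaping of markup characters.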

Importantly, when the browser converts an entity back into a character, that character no longer has any special meaning, so you can use entities to escape characters that would otherwise be treated as markup. For example:

<!DOCTYPE html>
<html><body>say "yep"
  if $ready &amp;&amp; $bad &lt; $good;
</body></html>

Correctly renders as

say "yep" if $ready && $bad < $good;

Encoding Entities with Perl

The go-to module for encoding and decoding entities is HTML::Entities. Its use is simple: You pass the string you want to encode into the encode_entities function and it returns the same string with the entities encoded:

use feature 'say';
use HTML::Entities qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
my $encoded = encode_entities($string);
say "<!DOCTYPE html>";
say "<html><body>$encoded</body></html>";

If you no longer need the unencoded string, you can have HTML::Entities modify the string you pass to it in place by not assigning the result to anything. HTML::Entities is smart enough to notice when it’s being called in void context, where its return value is not used.

use feature 'say';
use HTML::Entities qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);  # void context: $string is modified in place
say "<!DOCTYPE html>";
say "<html><body>$string</body></html>";
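The same function also handles the markup escaping described earlier. As a rough sketch, assuming HTML::Entities is installed and using its default set of unsafe characters, the following encodes the snippet of Perl from the earlier example and then reverses the process with decode_entities. The variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Single quotes so Perl does not try to interpolate $ready, $bad or $good.
my $code = 'if $ready && $bad < $good;';

# Assigning the result leaves $code untouched and returns the encoded copy.
my $escaped = encode_entities($code);
print "$escaped\n";       # if $ready &amp;&amp; $bad &lt; $good;

# decode_entities reverses the process, giving back the original text.
my $round_trip = decode_entities($escaped);
print "$round_trip\n";    # if $ready && $bad < $good;
```

Decoding exactly once gets you back to where you started, which is the property that double encoding breaks.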

The Double Encoding Problem

The trouble with encoding HTML entities is that if you do it a second time then you end up with nonsensical looking text. For example

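use feature 'say';
use HTML::Entities qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);  # first pass:  the copyright sign becomes &copy;
encode_entities($string);  # second pass: the & becomes &amp;
say "<!DOCTYPE html>";
say "<html><body>$string</body></html>";

Produces the output:

<!DOCTYPE html>
<html><body>&amp;copy; Mark Fowler 2012</body></html>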

Which when rendered by the browser displays

&copy; Mark Fowler 2012

This happens because the browser turns the &amp; into a literal &, but does not then combine it with the following copy; to produce the copyright symbol ©.

Each subsequent encoding turns the & at the start of the entity into &amp;, including those at the start of any previously created &amp;. Do this ten or so times and you end up with:

&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;copy; Mark Fowler 2012
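You can watch the text grow with a short loop. This is only a sketch, assuming HTML::Entities is installed; the @passes array is just there to keep each intermediate result for inspection.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

my $string = "\x{a9} Mark Fowler 2012";
my @passes;

# Encode the output of the previous pass three times over.
for my $pass (1 .. 3) {
    $string = encode_entities($string);
    push @passes, $string;
    print "$pass: $string\n";
}
# 1: &copy; Mark Fowler 2012
# 2: &amp;copy; Mark Fowler 2012
# 3: &amp;amp;copy; Mark Fowler 2012
```

Only the first pass does anything useful; every pass after that just adds another amp; to the front of the entity.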

The obvious solution is to make sure you encode the entities only once! But that’s not as easy as it might seem. If you’re building your output up from multiple processes, it’s quite easy to mistakenly encode twice. Worse, if you’re using data that you don’t control (for example, extracted from a web page, downloaded from a feed, or imported from a user) you might find that some or all of it had unexpectedly already been encoded.

Testing for the Problem

I recently re-released my module Test::DoubleEncodedEntities that can be used to write automated tests for double encoding.

use Test::More tests => 1;
use Test::DoubleEncodedEntities;
ok_dee($string, "check for double encoded entities");

It works heuristically by looking for strings that could possibly be double encoded entities. Obviously there are lots of HTML documents out there where it’s perfectly legitimate to have double encoded entities: any of them talking about entity encoding, such as this blog post itself, will naturally do so. However, the vast majority of documents, particularly those where you control the input, will not contain strings in this format, and so we can test for them.
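To give a feel for what such a heuristic might look like, here is a rough sketch using a single regular expression. This is not the module’s actual implementation: the function name find_double_encoded and the pattern are my own assumptions, and it treats any &amp; followed by something shaped like a numeric or named entity as suspicious.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's real code: count every substring
# that looks like an entity whose leading & has itself been encoded.
sub find_double_encoded {
    my ($html) = @_;
    my %found;
    while ($html =~ /(&amp;(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)/g) {
        $found{$1}++;
    }
    return \%found;
}

my $found = find_double_encoded("&amp;copy; Mark Fowler 2012 &copy;");
print qq{Found $found->{$_} "$_"\n} for sort keys %$found;
# Found 1 "&amp;copy;"
```

A pattern like this cannot tell a mistake from a document that is deliberately writing about entities, which is why a check of this kind belongs in tests for content you control.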

For example:

use Test::More tests => 6;
use Test::DoubleEncodedEntities;

ok_dee("&copy; Mark Fowler 2012",     "should pass");
ok_dee("&amp;copy; Mark Fowler 2012", "should fail");
ok_dee("&amp;copy; Mark Fowler 2012", "should fail");
ok_dee("© Mark Fowler 2012",     "should pass");
ok_dee("&amp;#169; Mark Fowler 2012", "should fail");
ok_dee("&amp;#169; Mark Fowler 2012", "should fail");

Produces the output:

ok 1 - should pass
not ok 2 - should fail
#   Failed test 'should fail'
#   at test.pl line 5.
# Found 1 "&amp;copy;"
not ok 3 - should fail
#   Failed test 'should fail'
#   at test.pl line 6.
# Found 1 "&amp;copy;"
ok 4 - should pass
not ok 5 - should fail
#   Failed test 'should fail'
#   at test.pl line 8.
# Found 1 "&amp;#169;"
not ok 6 - should fail
#   Failed test 'should fail'
#   at test.pl line 9.
# Found 1 "&amp;#169;"
# Looks like you failed 4 tests of 6.

The module correctly detects the double encoded entities in each of the “should fail” tests.
