In this blog post I discuss how HTML entities work, how to encode them with Perl, and how to detect when you’ve accidentally double encoded your entities with my module Test::DoubleEncodedEntities.
How HTML Entities work
In HTML you can represent any character in plain ASCII by using entities. These come in two forms: one uses the decimal codepoint of the character, and the other, available for some frequently used characters, uses a more readable named entity.
| Character | Unicode codepoint | Decimal Entity | Named Entity |
|---|---|---|---|
| é | 233 | &#233; | &eacute; |
| © | 169 | &#169; | &copy; |
| ☃ | 9731 | &#9731; | none |
| < | 60 | &#60; | &lt; |
| & | 38 | &#38; | &amp; |
So instead of writing
<!DOCTYPE html>
<html><body>© 2012 Mark Fowler</body></html>
You can write
<!DOCTYPE html>
<html><body>&copy; 2012 Mark Fowler</body></html>
By delivering a document in ASCII and using entities for any codepoints above 127 you can ensure that even the most broken of browsers will render the right characters.
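The ASCII-only approach described above can be sketched in a few lines of core Perl. This is a minimal illustration rather than how any particular module does it: the helper name to_ascii_html is made up, and it only emits decimal entities for codepoints above 127, leaving markup characters such as < and & untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every codepoint above 127 with a
# decimal entity so the result is pure ASCII.
sub to_ascii_html {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

# \x{a9} is the copyright sign, \x{e9} is e-acute, \x{2603} is the snowman
my $ascii = to_ascii_html("\x{a9} 2012 caf\x{e9} \x{2603}");
print "$ascii\n";    # &#169; 2012 caf&#233; &#9731;
```

Because the output contains nothing outside the ASCII range, it survives being served or stored under almost any character encoding.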
Importantly, when an entity is converted back into a character by the browser the character no longer has any of its special meaning, so you can use encoding to escape sequences that would otherwise be considered markup. For example:
<!DOCTYPE html>
<html><body>say "yep"
if $ready &amp;&amp; $bad &lt; $good;
</body></html>
Correctly renders as
say "yep" if $ready && $bad < $good;
Encoding Entities with Perl
The go-to module for encoding and decoding entities is HTML::Entities. Its use is simple: You pass the string you want to encode into the encode_entities function and it returns the same string with the entities encoded:
use 5.010;    # enables say
use HTML::Entities qw(encode_entities);
my $string = "\x{a9} Mark Fowler 2012";
my $encoded = encode_entities($string);
say "<!DOCTYPE html>";
say "<html><body>$encoded</body></html>";
If you no longer need the non-encoded string you can have HTML::Entities modify the string you pass to it by not assigning the output to anything (HTML::Entities is smart enough to notice it’s being called in void context where its return value is not being used.)
use 5.010;    # enables say
use HTML::Entities qw(encode_entities);
my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);    # void context: $string is modified in place
say "<!DOCTYPE html>";
say "<html><body>$string</body></html>";
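Perl functions can detect how they were called with the built-in wantarray, which returns undef in void context. The sketch below shows that general mechanism on a made-up function, shout. It is an illustration of the idea only and is not copied from the HTML::Entities source.

```perl
use strict;
use warnings;

# Hypothetical function: upper-cases a string, in place when called in
# void context, otherwise on a copy that is returned.
sub shout {
    if (defined wantarray) {
        my $copy = $_[0];    # caller wants a value: leave the argument alone
        return uc $copy;
    }
    $_[0] = uc $_[0];        # void context: @_ aliases the caller's variable
    return;
}

my $original = 'mark';
my $returned = shout($original);    # scalar context, $original untouched

my $in_place = 'fowler';
shout($in_place);                   # void context, $in_place modified

print "$original $returned $in_place\n";    # mark MARK FOWLER
```

The in-place branch works because the elements of @_ are aliases for the caller's arguments, so assigning to $_[0] changes the caller's variable.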
The Double Encoding Problem
The trouble with encoding HTML entities is that if you do it a second time you end up with nonsensical-looking text. For example:
use 5.010;    # enables say
use HTML::Entities qw(encode_entities);
my $string = "\x{a9} Mark Fowler 2012";
encode_entities($string);
encode_entities($string);    # oops: encoded a second time
say "<!DOCTYPE html>";
say "<html><body>$string</body></html>";
Outputs
<!DOCTYPE html>
<html><body>&amp;copy; Mark Fowler 2012</body></html>
Which when rendered by the browser displays
&copy; Mark Fowler 2012
This is because the &amp; has turned into & but isn’t then combined with the copy; to turn it into the copyright symbol ©.
Each subsequent encoding turns the & at the start of the entity into &amp;, including those at the start of any previously created &amp;. Do this ten or so times and you end up with:
&amp;amp;amp;amp;amp;amp;amp;amp;amp;copy; Mark Fowler 2012
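The growth is easy to reproduce. The sketch below uses a tiny stand-in encoder, escape_markup (a hypothetical helper that only escapes &, < and >), so that it runs on core Perl alone. For this ASCII-only input, encode_entities from HTML::Entities should produce the same strings.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an entity encoder: escapes only &, < and >.
my %ESCAPE = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

sub escape_markup {
    my ($text) = @_;
    $text =~ s/([&<>])/$ESCAPE{$1}/g;
    return $text;
}

my $string = '&copy; Mark Fowler 2012';    # already encoded once
my @rounds;
for my $round (2 .. 4) {
    $string = escape_markup($string);      # each pass re-encodes the leading &
    push @rounds, $string;
    print "after encoding $round: $string\n";
}
# after encoding 2: &amp;copy; Mark Fowler 2012
# after encoding 3: &amp;amp;copy; Mark Fowler 2012
# after encoding 4: &amp;amp;amp;copy; Mark Fowler 2012
```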
The obvious solution is to make sure you encode the entities only once! But that’s not as easy as it might seem. If you’re building your output up from multiple processes it’s quite easy to mistakenly encode twice. Worse, if you’re using data that you don’t control (for example, extracted from a web page, downloaded from a feed, imported from a user) you might find that some or all of it had unexpectedly already been encoded.
Testing for the Problem
I recently re-released my module Test::DoubleEncodedEntities that can be used to write automated tests for double encoding.
use Test::More tests => 1;
use Test::DoubleEncodedEntities;
ok_dee($string, "check for double encoded entities");
It works heuristically by looking for strings that could possibly be double encoded entities. Obviously there are lots of HTML documents out there where it’s perfectly legitimate to have double encoded entities: any of them talking about entity encoding, such as this blog post itself, will naturally do so. However, the vast majority, where you control the input, will not contain strings of this form, and we can test for them.
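A heuristic of this kind might look roughly like the following. This is a sketch only, not the actual implementation of Test::DoubleEncodedEntities: the function name find_double_encoded and the regular expression are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring that looks like an
# entity whose leading & was itself encoded (as &amp; or &#38;).
sub find_double_encoded {
    my ($html) = @_;
    my @found = $html =~ /
        ( &(?:amp|\#38);                 # an encoded ampersand ...
          (?: [A-Za-z][A-Za-z0-9]*       # ... then a named entity body
            | \#[0-9]+                   # ... or a decimal one
            | \#[xX][0-9A-Fa-f]+         # ... or a hexadecimal one
          ) ; )
    /gx;
    return @found;
}

my @clean  = find_double_encoded('&copy; Mark Fowler 2012');
my @named  = find_double_encoded('&amp;copy; Mark Fowler 2012');
my @number = find_double_encoded('&#38;#169; Mark Fowler 2012');

printf "%d %d %d\n", scalar @clean, scalar @named, scalar @number;    # 0 1 1
```

Being a heuristic, a pattern like this will also flag legitimate text that happens to contain an encoded ampersand followed by a word and a semicolon, which is why it suits tests over input you control.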
For example:
use Test::More tests => 6;
use Test::DoubleEncodedEntities;
ok_dee("&copy; Mark Fowler 2012", "should pass");
ok_dee("&amp;copy; Mark Fowler 2012", "should fail");
ok_dee("&#38;copy; Mark Fowler 2012", "should fail");
ok_dee("&#169; Mark Fowler 2012", "should pass");
ok_dee("&amp;#169; Mark Fowler 2012", "should fail");
ok_dee("&#38;#169; Mark Fowler 2012", "should fail");
Produces the output:
1..6
ok 1 - should pass
not ok 2 - should fail
# Failed test 'should fail'
# at test.pl line 5.
# Found 1 "&amp;copy;"
not ok 3 - should fail
# Failed test 'should fail'
# at test.pl line 6.
# Found 1 "&#38;copy;"
ok 4 - should pass
not ok 5 - should fail
# Failed test 'should fail'
# at test.pl line 8.
# Found 1 "&amp;#169;"
not ok 6 - should fail
# Failed test 'should fail'
# at test.pl line 9.
# Found 1 "&#38;#169;"
# Looks like you failed 4 tests of 6.
This correctly detects the double encoded entities in the "should fail" tests.
Some of the text in the example at the end of this post is garbled. This is because WordPress seems to not be allowing me to put in the double encoded entities I need to complete the example.
And, of course, sometimes you specifically want double-encoded entities, so that you can show the entities in HTML output, as in the table near the top of this post. But it looks like either you or WordPress are being too clever here, as I’m seeing the third column as the same as the first column.
WordPress.com is being way too clever here. I’m tempted to move away to a self hosted solution here!
Bizarre how it’s being too clever for the numeric entities but exactly the right level of clever for the named entities.
You teach a tool, but neglect to say when it should and should not be used. W3C [tutorial-char-enc] recommends: “Save your pages as UTF-8, whenever you can. … Avoid using character escapes, except for invisible or ambiguous characters.” The details are in [qa-escapes].
http://www.w3.org/International/tutorials/tutorial-char-enc/Overview.en#quicksummary
http://www.w3.org/International/questions/qa-escapes.en
Hello Anonymous whoever you are!
I’m a UTF-8 junkie; I’ve done enough i18n programming to know that you should probably always be using UTF-8 unless you’ve got a very good reason not to (e.g. you need a fixed width binary representation or you’re using a lot of very high bit characters, in which case UTF-32 or its ilk might be better.)
This said, I do prefer to output characters in the ASCII range (even if I am generating UTF-8) and use HTML entities. Until you’ve suddenly had part of your CMS re-written to embed text in emails for inclusion into webmail clients you’ve not known the pain of using character encodings. Another point worth noting is that the output character encoding often influences the character encoding the browser will submit form data back in and care needs to be taken there. In short…there’s no short answer to this.