=head1 NAME

perlunitut - Perl Unicode Tutorial

=head1 DESCRIPTION

The days of just flinging strings around are over. It's well established that
modern programs need to be capable of communicating funny accented letters, and
things like euro symbols. This means that programmers need new habits. It's
easy to program Unicode-capable software, but it does require discipline to do
it right.

There's a lot to know about character sets and text encodings. It's probably
best to spend a full day learning all this, but the basics can be learned in
minutes.

These are not the very basics, though. It is assumed that you already
know the difference between bytes and characters, and realise (and accept!)
that there are many different character sets and encodings, and that your
program has to be explicit about them. Recommended reading is "The Absolute
Minimum Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)" by Joel Spolsky, at
L<http://joelonsoftware.com/articles/Unicode.html>.

This tutorial speaks in rather absolute terms, and provides only a limited view
of the wealth of character string related features that Perl has to offer. For
most projects, this information will probably suffice.

=head2 Definitions

It's important to set a few things straight first. This is the most important
part of this tutorial. This view may conflict with other information that you
may have found on the web, but that's mostly because many sources are wrong.

You may have to re-read this entire section a few times...

=head3 Unicode

B<Unicode> is a character set with room for lots of characters. The ordinal
value of a character is called a B<code point>. (But in practice, the
distinction between code point and character is blurred, so the terms often
are used interchangeably.)

There are many, many code points, but computers work with bytes, and a byte has
room for only 256 values. Unicode has many more characters than that,
so you need a method to make these accessible.

Unicode is encoded using several competing encodings, of which UTF-8 is the
most used. In a Unicode encoding, multiple subsequent bytes can be used to
store a single code point, or simply: character.

=head3 UTF-8

B<UTF-8> is a Unicode encoding. Many people think that Unicode and UTF-8 are
the same thing, but they're not. There are more Unicode encodings, but much of
the world has standardized on UTF-8.

UTF-8 treats the first 128 code points, 0..127, the same as ASCII. They take
only one byte per character. All other characters are encoded as two to
four bytes using a complex scheme. Fortunately, Perl handles this for
us, so we don't have to worry about this.

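To see the variable width for yourself, here is a minimal sketch that encodes
four sample code points and shows the resulting bytes. The sample code points
and the helper name C<utf8_bytes> are picked for illustration only; they are
not part of Perl or of L<Encode>.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of one character, shown as hex.
sub utf8_bytes {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# One sample code point from each UTF-8 length class (1 to 4 bytes).
for my $code_point (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X encodes as: %s\n", $code_point, utf8_bytes(chr $code_point);
}
```

Only the first sample, which is ASCII, fits in a single byte. You never have to
build these byte sequences by hand; C<encode> does it.
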
=head3 Text strings (character strings)

B<Text strings>, or B<character strings>, are made of characters. Bytes are
irrelevant here, and so are encodings. Each character is just that: the
character.

On a text string, you would do things like:

    $text =~ s/foo/bar/;
    if ($text =~ /^\d+$/) { ... }
    $text = ucfirst $text;
    my $character_count = length $text;

The value of a character (C<ord>, C<chr>) is the corresponding Unicode code
point.

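A small sketch of that last point, using a sample string chosen here purely for
illustration: C<ord> gives the code point of a character, C<chr> goes the other
way, and C<length> counts characters no matter how many bytes they would need
once encoded.

```perl
use strict;
use warnings;

# Sample text string: EURO SIGN followed by "uro".
my $text = "\x{20AC}uro";

my $character_count  = length $text;    # counts characters, not bytes
my $first_code_point = ord $text;       # ord looks at the first character
my $rebuilt          = chr(0x20AC) . substr($text, 1);

printf "%d characters, the first is U+%04X\n",
    $character_count, $first_code_point;
print "chr(0x20AC) rebuilds the same string\n" if $rebuilt eq $text;
```

The euro sign counts as one character here, even though it would take three
bytes in UTF-8.
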
=head3 Binary strings (byte strings)

B<Binary strings>, or B<byte strings>, are made of bytes. Here, you don't have
characters, just bytes. All communication with the outside world (anything
outside of your current Perl process) is done in binary.

On a binary string, you would do things like:

    my (@length_content) = unpack "(V/a)*", $binary;
    $binary =~ s/\x00\x0F/\xFF\xF0/;  # for the brave :)
    print {$fh} $binary;
    my $byte_count = length $binary;

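The C<unpack> template above reads a series of length-prefixed records. As a
rough sketch with made-up sample data, C<pack> with the same template builds
such a binary string, so you can watch the round trip:

```perl
use strict;
use warnings;

# "V/a" is a 32-bit little-endian length followed by that many bytes;
# the surrounding "(...)*" repeats it for every record.
my @records = ("abc", "\xFF\x00");            # sample payloads, any bytes will do
my $binary  = pack "(V/a)*", @records;

my @contents   = unpack "(V/a)*", $binary;    # the payloads, without the lengths
my $byte_count = length $binary;              # (4 + 3) + (4 + 2)

printf "%d bytes, %d records\n", $byte_count, scalar @contents;
```

Note that C<unpack> hands back only the payloads; the length prefixes are used
up while reading.
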
=head3 Encoding

B<Encoding> (as a verb) is the conversion from I<text> to I<binary>. To encode,
you have to supply the target encoding, for example C<iso-8859-1> or C<UTF-8>.
Some encodings, like the C<iso-8859> ("latin") range, do not support the full
Unicode standard; characters that can't be represented are lost in the
conversion.

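Here is a sketch of what "lost" means in practice, using a sample string made
up for this example. With its default settings, C<encode> quietly puts a
substitution character where the unrepresentable character was. The
C<Encode::FB_CROAK> and C<Encode::LEAVE_SRC> constants are real L<Encode>
check flags; using them here to detect the loss is just one possible approach.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text: "cafe" with an e-acute, and a price with a euro sign.
my $text = "caf\x{E9} costs \x{20AC}5";

my $utf8   = encode('UTF-8', $text);         # both characters survive
my $latin1 = encode('iso-8859-1', $text);    # the euro sign does not fit

# By default a substitution character ("?" for iso-8859-1) replaces the
# euro sign. Asking Encode to croak instead makes the loss visible.
# LEAVE_SRC keeps $text itself untouched.
my $strict = eval {
    encode('iso-8859-1', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};

print "latin1 result has ", length($latin1), " bytes\n";
print "strict encoding refused the euro sign\n" unless defined $strict;
```

The e-acute exists in iso-8859-1 and comes through as a single byte; only the
euro sign is replaced.
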
=head3 Decoding

B<Decoding> is the conversion from I<binary> to I<text>. To decode, you have to
know what encoding was used during the encoding phase. And most of all, it must
be something decodable. It doesn't make much sense to decode a PNG image into a
text string.

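If you are not sure whether some input is decodable, you can ask C<decode> to
complain instead of guessing. The following is only a sketch: the helper name
C<try_decode_utf8> is invented for this example, and the sample bytes are the
UTF-8 encoding of a euro sign and the first bytes of a PNG file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the text string, or undef when the bytes
# are not valid UTF-8.
sub try_decode_utf8 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = try_decode_utf8("\xE2\x82\xAC");      # a UTF-8 encoded euro sign
my $bad  = try_decode_utf8("\x89PNG\x0D\x0A");   # start of a PNG, not text

print "good input decoded to ", length($good), " character\n" if defined $good;
print "bad input was rejected\n" unless defined $bad;
```

Without the check argument, C<decode> would not fail on the PNG bytes; it would
put replacement characters in the result instead.
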
=head3 Internal format

Perl has an B<internal format>, an encoding that it uses to encode text strings
so it can store them in memory. All text strings are in this internal format.
In fact, text strings are never in any other format!

You shouldn't worry about what this format is, because conversion is
automatically done when you decode or encode.

=head2 Your new toolkit

Add to your standard heading the following line:

    use Encode qw(encode decode);

Or, if you're lazy, just:

    use Encode;

=head2 I/O flow (the actual 5 minute tutorial)

The typical input/output flow of a program is:

    1. Receive and decode
    2. Process
    3. Encode and output

If your input is binary, and is supposed to remain binary, you shouldn't decode
it to a text string, of course. But in all other cases, you should decode it.

Decoding can't happen reliably if you don't know how the data was encoded. If
you get to choose, it's a good idea to standardize on UTF-8.

    my $foo   = decode('UTF-8', get 'http://example.com/');
    my $bar   = decode('ISO-8859-1', readline STDIN);
    my $xyzzy = decode('Windows-1251', $cgi->param('foo'));

Processing happens as you knew before. The only difference is that you're now
using characters instead of bytes. That's very useful if you use things like
C<substr>, or C<length>.

It's important to realize that there are no bytes in a text string. Of course,
Perl has its internal encoding to store the string in memory, but ignore that.
If you have to do anything with the number of bytes, it's probably best to move
that part to step 3, just after you've encoded the string. Then you know
exactly how many bytes it will be in the destination string.

The syntax for encoding text strings to binary strings is as simple as decoding:

    $body = encode('UTF-8', $body);

If you needed to know the length of the string in bytes, now's the perfect time
for that. Because C<$body> is now a byte string, C<length> will report the
number of bytes, instead of the number of characters. The number of
characters is no longer known, because characters only exist in text strings.

    my $byte_count = length $body;

And if the protocol you're using supports a way of letting the recipient know
which character encoding you used, please help the receiving end by using that
feature! For example, E-mail and HTTP support MIME headers, so you can use the
C<Content-Type> header. They can also have C<Content-Length> to indicate the
number of I<bytes>, which is always a good idea to supply if the number is
known.

    "Content-Type: text/plain; charset=UTF-8",
    "Content-Length: $byte_count"

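Putting the three steps together, a minimal end-to-end sketch could look like
this. The input bytes are a stand-in for whatever your program really receives,
and the choice of ISO-8859-1 for the input is an assumption made for the
example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 1. Receive and decode. These sample bytes are the ISO-8859-1 encoding
#    of "cafe" with an e-acute.
my $input_bytes = "caf\xE9";
my $text = decode('ISO-8859-1', $input_bytes);

# 2. Process, working with characters.
$text = ucfirst $text;
my $character_count = length $text;

# 3. Encode and output. Only now does a byte count make sense.
my $body       = encode('UTF-8', $text);
my $byte_count = length $body;

binmode STDOUT;    # $body is already encoded, so write the bytes as they are
print "Content-Type: text/plain; charset=UTF-8\r\n",
      "Content-Length: $byte_count\r\n",
      "\r\n",
      $body;
```

The text has four characters, but the encoded body is five bytes long, and five
is the number that belongs in C<Content-Length>.
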
=head1 SUMMARY

Decode everything you receive, encode everything you send out. (If it's text
data.)

=head1 Q and A (or FAQ)

After reading this document, you ought to read L<perlunifaq> too, then
L<perluniintro>.

=head1 ACKNOWLEDGEMENTS

Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the
Amsterdam Perl Mongers meetings got me interested and determined to find out
how to use character encodings in Perl in ways that don't break easily.

Thanks to Gerard Goossen from TTY. His presentation "UTF-8 in the wild" (Dutch
Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial.

Thanks to the people who asked about this kind of stuff in several Perl IRC
channels, and have constantly reminded me that a simpler explanation was
needed.

Thanks to the people who reviewed this document for me, before it went public.
They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan
Gray.

=head1 AUTHOR

Juerd Waalboer <#####@juerd.nl>

=head1 SEE ALSO

L<perlunifaq>, L<perlunicode>, L<perluniintro>, L<Encode>