Tuesday, February 3, 2009

Hot PHP UTF-8 tips

by Harry Fuecks

As a result of all the noise about UTF-8, got an email from Marek Gayer with some very smart tips on handling UTF-8. What follows is a discussion illustrating what happens when you get obsessed with performance and optimizations (be warned - may be boring, depending on your perspective).

Outrunning mbstring case functions with native PHP implementations
The native PHP strtolower / strtoupper functions don’t understand UTF-8 - they can only handle characters in the ASCII range, plus they may examine your server’s locale setting for further character information. The latter behaviour actually makes them “dangerous” to use on a UTF-8 string, because there’s a chance that strtolower could mistake bytes in a UTF-8 multi-byte sequence for something it should convert to lowercase, breaking the encoding. That shouldn’t be a problem if you’re writing code for a server you control, but it is if you’re writing software for other people to use.

Restricting locale behaviour
Turns out you can disable this locale behaviour by restricting your locale to the POSIX locale, which means only characters in the ASCII range will be considered (overriding whatever your server’s locale settings are), by executing the following:


<?php
setlocale(LC_CTYPE, 'C');

That should work on any platform (certainly *Nix-based and Windows) and affects more than just strtolower() / strtoupper() - other PHP functionality picks up information from the locale, such as the PCRE \w meta character, strcasecmp() and ucfirst(), all of which might have adverse effects on UTF-8.

The only issue, as I see it, is if you’re writing distributable software: should you be messing with setlocale in the first place? See the warning in the documentation here - it can be a problem on Windows, where you have only a single server process - you may be affecting other apps running on the server.
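One way to limit the side effects, as a minimal sketch: remember the current LC_CTYPE setting, switch to 'C' only around the call that needs it, and restore it afterwards. The helper name with_c_ctype() is made up for illustration. It does not remove the concern raised in the documentation warning, because the locale is still changed for the whole process while the callback runs.

```php
<?php
// Hypothetical helper (not from the original post): run $fn with LC_CTYPE
// temporarily set to the POSIX "C" locale, then restore the previous setting.
function with_c_ctype(callable $fn)
{
    // Passing "0" asks setlocale() for the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');
    try {
        return $fn();
    } finally {
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

// "\xC3\x80" is the UTF-8 encoding of U+00C0 (A with grave).
$lower = with_c_ctype(function () {
    return strtolower("\xC3\x80BC");
});
// Only the ASCII letters are lowercased; the multi-byte sequence is untouched.
```

Scoping the change keeps the rest of the application on whatever locale the server was configured with.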

Fast Case Conversion
To make it possible to do case conversion (e.g. strtolower/upper) without depending on mbstring (because who knows if shared hosts have installed it?), applications like Mediawiki (as in Wikipedia) and Dokuwiki implement pure-PHP versions of these functions using arrays like this or this ($UTF8_LOWER_TO_UPPER variable towards end of the script). That works because only a limited selection of alphabets have the notion of case in the first place - the array is big but not sooo big that it’s a terrible performance overhead. What’s interesting to note about both those lookup arrays is they contain characters in the ASCII range. They also support many alphabets.

Mediawiki then (essentially) does a str_to_upper like this (at least in the 1.7.1 release - see languages/LanguageUtf8.php - this seems to have changed since under SVN);




// ... bunch of stuff removed
return preg_replace( "/$x([a-z]|[\\xc0-\\xff][\\x80-\\xbf]*)/e",
    "strtr( \"\$1\" , \$wikiUpperChars )",
    $str
);


…it’s locating each candidate character (an ASCII lowercase letter or a multi-byte UTF-8 sequence) and executing PHP’s strtr() function with the lookup array via a callback - the /e pattern modifier (time to phone a friend?) evaluates the replacement as PHP code - to convert the case. That keeps memory use minimal, traded against performance (probably - not benchmarked) - many callbacks / evals.
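The /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so on current PHP the same idea has to be written with preg_replace_callback(). What follows is only a sketch of the technique, not MediaWiki's code: the function name sketch_uc() and the tiny lookup table are made up for illustration, and the $x prefix from the original pattern is left out.

```php
<?php
// Tiny illustrative table; the real MediaWiki table is far larger.
$upperChars = ["\xC3\xA9" => "\xC3\x89"]; // é => É
foreach (range('a', 'z') as $letter) {
    $upperChars[$letter] = strtoupper($letter);
}

// Hypothetical stand-in for the MediaWiki snippet above.
function sketch_uc($str, array $table)
{
    // Match an ASCII lowercase letter, or a lead byte followed by its
    // continuation bytes, and look each match up in the table.
    return preg_replace_callback(
        '/[a-z]|[\xc0-\xff][\x80-\xbf]*/',
        function (array $m) use ($table) {
            return strtr($m[0], $table);
        },
        $str
    );
}

$result = sketch_uc("caf\xC3\xA9", $upperChars); // "café" as UTF-8 bytes
```

One callback still runs per matched character, so the performance trade-off described above applies here too.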

Dokuwiki (and phputf8) uses a similar approach but first splits the input string into an array of UTF-8 sequences and sees if they match in the lookup array. This is PHP UTF-8’s implementation, which is almost the same (utf8_to_unicode() converts a UTF-8 string to an array with one entry per character, and utf8_from_unicode() does the reverse):



function utf8_strtolower($string){
    global $UTF8_UPPER_TO_LOWER;

    $uni = utf8_to_unicode($string);

    if ( !$uni ) {
        return FALSE;
    }

    $cnt = count($uni);
    for ($i=0; $i < $cnt; $i++){
        if ( isset($UTF8_UPPER_TO_LOWER[$uni[$i]]) ) {
            $uni[$i] = $UTF8_UPPER_TO_LOWER[$uni[$i]];
        }
    }

    return utf8_from_unicode($uni);
}

That’s going to use more memory for a short period, given that it copies the input string into an array (actually that needs fixing!), and an array needs more space than a string to store the equivalent information, but it should be faster.
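For readers without phputf8 to hand, the split-and-lookup idea can be sketched with PCRE's /u modifier standing in for utf8_to_unicode(). This is a substitute technique, not the library's implementation: sketch_strtolower() and the small table are made up for illustration, and the table here is keyed by UTF-8 strings.

```php
<?php
// Illustrative table keyed by UTF-8 strings; a real table covers far more.
$upperToLower = [
    'A' => 'a', 'B' => 'b', 'C' => 'c',
    "\xC3\x89" => "\xC3\xA9", // É => é
];

// Hypothetical stand-in for utf8_strtolower() above.
function sketch_strtolower($string, array $table)
{
    // Split into one array element per character; with /u, PCRE rejects
    // invalid UTF-8 and preg_split() returns false.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }
    foreach ($chars as $i => $char) {
        if (isset($table[$char])) {
            $chars[$i] = $table[$char];
        }
    }
    return implode('', $chars);
}

$lower = sketch_strtolower("CAB\xC3\x89!", $upperToLower);
```

The intermediate $chars array is where the extra memory goes: every character of the input becomes its own array element.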

Anyway - enter Marek’s approach which can be summarized as;



function StrToLower ($s) {
    global $TabToLower;
    return strtr (strtolower ($s), $TabToLower);
}

… where $TabToLower is the lookup table (now minus the ASCII character lookups, handled by strtolower). Note the code Marek showed me uses classes - this is just a simplification. It relies on the POSIX locale being set (otherwise the UTF-8 encoding might get broken) and exploits a facet of UTF-8’s design, namely that any complete sequence in a valid UTF-8 string is unique (it can’t be mistaken for part of a longer sequence). You also need to read the strtr() documentation very carefully…

strtr() may be called with only two arguments. If called with two arguments it behaves in a new way: from then has to be an array that contains string -> string pairs that will be replaced in the source string. strtr() will always look for the longest possible match first and will *NOT* try to replace stuff that it has already worked on.
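To see the two-argument form of strtr() at work, here is a small sketch of the approach with a three-entry table. The table contents and the name sketch_str_to_lower() are assumptions for illustration; Marek's real table and class-based code are not shown in this post.

```php
<?php
// Keep strtolower() restricted to ASCII, as the approach requires.
setlocale(LC_CTYPE, 'C');

// Illustrative non-ASCII table only; ASCII is left to strtolower().
$tabToLower = [
    "\xC3\x89" => "\xC3\xA9", // É => é
    "\xC3\x9C" => "\xC3\xBC", // Ü => ü
    "\xC5\xA0" => "\xC5\xA1", // Š => š
];

// Hypothetical variant of StrToLower() taking the table as a parameter.
function sketch_str_to_lower($s, array $table)
{
    // strtolower() handles A-Z; strtr() then swaps whole UTF-8 sequences.
    return strtr(strtolower($s), $table);
}

// "ÉCOLE ŠKODA" as UTF-8 bytes.
$lower = sketch_str_to_lower("\xC3\x89COLE \xC5\xA0KODA", $tabToLower);
```

Because lead bytes and continuation bytes occupy separate ranges, a key in the table can only match at a character boundary in valid UTF-8, which is why one pass of strtr() over the raw bytes is safe.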

I’ve yet to benchmark this but Marek tells me he’s found it to be roughly x3 faster than the equivalent mbstring functions, which I can believe.

Marek also employs some smart tricks for handling the lookup arrays. Both the dokuwiki and mediawiki approaches have all possible case conversions defined - i.e. they apply to multiple human languages. While this may be appropriate for user submitted content, when you’re doing stuff like localizations of your UI, chances are you’ll only be using a single language - you don’t need the full lookup table, just those applicable to the language involved, assuming you know what those are. Also you might think about looking at the incoming $_SERVER['HTTP_ACCEPT_LANGUAGE'] from the browser.
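A rough sketch of what picking a smaller table might look like. Everything here is assumed for illustration: the per-language tables, the function name pick_case_table() and the very simplified header parsing, which takes languages in the order listed and ignores q-values.

```php
<?php
// Hypothetical per-language lowercase tables (non-ASCII letters only).
$caseTables = [
    'de' => ["\xC3\x84" => "\xC3\xA4", "\xC3\x96" => "\xC3\xB6", "\xC3\x9C" => "\xC3\xBC"], // Ä Ö Ü
    'cs' => ["\xC4\x8C" => "\xC4\x8D", "\xC5\xA0" => "\xC5\xA1", "\xC5\xBD" => "\xC5\xBE"], // Č Š Ž
];

// Return the table for the first listed language we have a table for.
function pick_case_table($acceptLanguage, array $tables, $fallback)
{
    foreach (explode(',', $acceptLanguage) as $entry) {
        // "cs-CZ;q=0.8" -> "cs-cz" -> "cs"
        $tag = strtolower(trim(explode(';', $entry)[0]));
        $primary = explode('-', $tag)[0];
        if (isset($tables[$primary])) {
            return $tables[$primary];
        }
    }
    return $tables[$fallback];
}

$table = pick_case_table('cs-CZ,cs;q=0.9,en;q=0.8', $caseTables, 'de');
```

In a real request the first argument would come from $_SERVER['HTTP_ACCEPT_LANGUAGE'], which may be missing altogether, hence the fallback.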

Anyway - when I get some time, will figure out how to use Marek’s ideas in PHP UTF-8.

Output Conversion
Another smart tip from Marek, which I haven’t seen discussed before, is how to deliver content to clients that can’t deal with UTF-8 e.g. old browsers, phones(?). His approach is simple and effective - once you’ve finished building the output page, capture it in an output buffer, check what the client sent as acceptable character sets ($_SERVER['HTTP_ACCEPT_CHARSET']) and convert (downgrade) the output with iconv if necessary.

You need to be careful examining the content of that header and processing it correctly. You also need to make sure you’ve redeclared the Content-Type charset, plus any HTML meta tags or the encoding in an XML declaration. But this is certainly the serious / accessible way to solve the problem in PHP.
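A minimal sketch of the idea, assuming the page is built as UTF-8 and that ISO-8859-1 is the only downgrade target you support. The function names are made up, the header parsing is deliberately simple, and this is not Marek's code.

```php
<?php
// Hypothetical helper: choose an output charset from an Accept-Charset
// header value, preferring UTF-8 whenever the client accepts it.
function choose_output_charset($acceptCharset, array $supported = ['iso-8859-1'])
{
    $accepted = [];
    foreach (explode(',', $acceptCharset) as $entry) {
        $parts = explode(';', $entry);
        $name = strtolower(trim($parts[0]));
        if ($name === '') {
            continue;
        }
        $q = 1.0;
        if (isset($parts[1]) && preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $parts[1], $m)) {
            $q = (float) $m[1];
        }
        if ($q > 0) {
            $accepted[$name] = $q;
        }
    }
    // No usable header, UTF-8 listed, or a wildcard: stay with UTF-8.
    if (!$accepted || isset($accepted['utf-8']) || isset($accepted['*'])) {
        return 'utf-8';
    }
    arsort($accepted); // highest q-value first
    foreach (array_keys($accepted) as $name) {
        if (in_array($name, $supported, true)) {
            return $name;
        }
    }
    return 'utf-8'; // nothing we know how to convert to
}

// Hypothetical output-buffer callback body: downgrade the finished page.
function convert_output($buffer, $charset)
{
    if ($charset === 'utf-8') {
        return $buffer;
    }
    // //TRANSLIT asks iconv to approximate characters the target lacks;
    // support for it varies between iconv implementations.
    $converted = @iconv('UTF-8', $charset . '//TRANSLIT', $buffer);
    // Sketch only: on failure the buffer goes out unconverted.
    return $converted === false ? $buffer : $converted;
}

// Usage at the top of a page (commented out so the sketch has no side effects):
// $charset = choose_output_charset($_SERVER['HTTP_ACCEPT_CHARSET'] ?? '');
// header('Content-Type: text/html; charset=' . $charset);
// ob_start(function ($buffer) use ($charset) {
//     return convert_output($buffer, $charset);
// });
```

Any charset named inside the page itself still has to be changed to match, as noted above; this sketch only covers the HTTP header and the byte conversion.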

Moral of the story…
…is it’s worth talking to people who actually need UTF-8, vs. those in countries complacently using ISO-8859-1 (which doesn’t natively support the Euro symbol BTW!).

Given that Mediawiki has “done” Unicode Normalization in PHP (here), the only remaining piece of the puzzle is Unicode Collation (e.g. for sorting) - here’s a nice place for inspiration. After that - who needs PHP 6 ;)

This entry was posted on Thursday, August 10th, 2006 at 7:54 pm, contains 1,228 words, and is filed under PHP. The views and opinions in this blog post are those of its author.
This post has 14 responses so far

malikyte Says:
August 11th, 2006 at 6:48 am
………. I can’t wait for PHP6 to ease the burden of trying to understand all that. Forgive me; I have been trying to read through all your slides and presentation material, the code examples and related links, but without being truly exposed to the necessity of character sets and encodings until the many recent articles posted here (and at Chris Shiflett’s blog). It’s almost too much to take in. So again, I praise the day when full i18n support is implemented into PHP and I can find a gutsy host that will upgrade quickly.


defenderz_ Says:
August 11th, 2006 at 7:25 pm
I wonder why they didn`t implement native utf8 support in php5. its so 90ies…


HarryF Says:
August 12th, 2006 at 5:05 am
“It’s almost too much to take in. So again, I praise the day when full i18n support is implemented into PHP and I can find a gutsy host that will upgrade quickly.”

I think that’s a very forgivable perspective but at the same time, it’s worth battling on. PHP6 is going to make the problem easier to manage but I don’t think it’s going to make the problem magically vanish. What I also worry about is whether it may be a mistake to make all strings Unicode with a flick of a php.ini setting - there are issues related to security such as phishing-type attacks (unicode characters which look almost like normal ASCII characters) - I have yet to clarify exactly what PHP6 is going to look like though, so that may be FUD.

The real issue is character encoding is a leaky abstraction - it’s very hard to hide it behind APIs.

If there are two key points to getting it in PHP I’d say it’s to consider PHP’s problem - http://www.phpwact.org/php/i18n/charsets#php_s_problem_with_character_encoding then look closely at the table here: http://en.wikipedia.org/wiki/UTF-8#Description - examine the 0’s and 1’s it’s describing. Eventually it will fall into place.


HarryF Says:
August 12th, 2006 at 5:07 am
“I wonder why they didn`t implement native utf8 support in php5. its so 90ies…”

It’s not a problem you can solve easily plus it’s a lot of work. The tipping point was IBM open sourcing ICU - http://en.wikipedia.org/wiki/International_Components_for_Unicode - that saves the work


MarekG Says:
August 12th, 2006 at 8:16 am
It is not clear to me why MediaWiki uses for converting case this code:

function uc ( $str, $first = false ) // in file LanguageUtf8.php, mediawiki-1.7.1.tar.gz

return preg_replace( "/$x([a-z]|[\\xc0-\\xff][\\x80-\\xbf]*)/e", "strtr( \"\$1\" , \$wikiUpperChars )", $str );

See their lookup table, they have also “a-z=>A-Z” arrays there:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Utf8Case.php

This has to be slow.

Why it should not be enough just:

if (!$first) return strtr ($str, $wikiUpperChars); // ?



monul Says:
October 13th, 2006 at 6:15 pm
hehe! hacking encodings - eternal php theme!




bietchetlien Says:
March 30th, 2007 at 4:14 am
How to explode unicode string? Thanks


erkekjetter Says:
May 27th, 2008 at 12:04 am
You can find an extended unicode upper/lower case mapping table at
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/nls/rbagsuppertolowermaptable.htm
Might be useful for someone, it certainly was for me.


Rin Says:
December 24th, 2008 at 7:42 am
A package of PHP functions to manipulate strings encoded in UTF-8. A powerful solution/contribution for UTF-8 support in your CMF/CMS, written in PHP.
http://forum.dklab.ru/viewtopic.php?p=91015

