[pmwiki-users] UTF-8 as core default encoding (was: Headers are not sending charset!)

Petko Yotov 5ko at free.fr
Mon Mar 12 14:46:32 CDT 2007

On Monday 12 March 2007 16:53, Patrick R. Michaud wrote:
> On Mon, Mar 12, 2007 at 05:37:59PM +0200, Athan wrote:
> > "Patrick R. Michaud" <pmichaud at pobox.com> wrote in message
> > news:20070312152005.GG29823 at host.pmichaud.com...
> >
> > > AFAICT, none of them support preg_match, so they require a workaround.
> >
> > Correct!
> > Actually preg_match supports utf8 using /u modifier though PCRE has to be
> > compiled with PCRE_UTF8 support.
> preg_match supports the /u modifier, but the /u modifier doesn't
> cause either /i or [[:upper:]]/[[:lower:]] in patterns to work.
> All that the /u modifier does is cause PCRE to recognize multibyte utf-8
> sequences as being single characters (and that doesn't seem to
> matter much for the patterns that PmWiki uses).

Actually, from PHP 4.4.0 on, patterns can use \p{Ll} and \p{Lu} to match 
lowercase and uppercase letters[1]. I just tested it and it works fine:

function AsSpacedU($text) {
  // Unicode property classes for uppercase and lowercase letters
  $upper = "\\p{Lu}";
  $lower = "\\p{Ll}";
  $text = preg_replace("/([$lower\\d])([$upper])/u", '$1 $2', $text);
  $text = preg_replace('/(?<![-\\d])(\\d+( |$))/',' $1',$text);
  return preg_replace("/([$upper])([$upper][$lower\\d])/u",
    '$1 $2', $text);
}
echo AsSpacedU("ПеткоВалериев"). "\n";

Works fine!
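
For anyone wondering what /u changes on its own, here is a minimal sketch. 
CountDotMatches is a made-up helper for this illustration, not something in 
PmWiki. It counts how many times "." matches in the two-letter string "Пе", 
written as explicit UTF-8 bytes so the encoding of the script file does not 
matter.

```php
<?php
// "Пе" as raw UTF-8 bytes: two Cyrillic letters, two bytes each.
$text = "\xD0\x9F\xD0\xB5";

// Hypothetical helper: count matches of "." with the given pattern modifiers.
function CountDotMatches($text, $modifiers) {
  return preg_match_all("/./$modifiers", $text, $m);
}

echo CountDotMatches($text, ''), "\n";   // 4: each byte counts as a character
echo CountDotMatches($text, 'u'), "\n";  // 2: each UTF-8 sequence is one character
```

Without /u the pattern sees four separate bytes; with /u it sees two 
characters, which is what Pm described above.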

So, in the next few hours I'll make a major rewrite of xlpage-utf-8.php in 
order to:

* move $CaseConversions in another script
** load it only when there is no mb_strtoupper, or phpversion < 4.4
* use the new features when possible (/u, etc.)
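
Roughly, the loading condition could look like the sketch below. This is only 
an illustration of the plan, not the final code: NeedsCaseTables and the file 
name xlpage-utf-8-case.php are names invented here, and the include is left 
commented out.

```php
<?php
// Hypothetical helper: true when the fallback $CaseConversions table is
// needed, i.e. mb_strtoupper is unavailable or PHP is older than 4.4.0.
function NeedsCaseTables($phpVersion, $hasMbStrtoupper) {
  return !$hasMbStrtoupper || version_compare($phpVersion, '4.4.0', '<');
}

if (NeedsCaseTables(PHP_VERSION, function_exists('mb_strtoupper'))) {
  // Assumed location and name of the split-out table script:
  // include_once("$FarmD/scripts/xlpage-utf-8-case.php");
}
```

Passing the version and the function check as arguments keeps the decision 
easy to test on its own, separately from the include.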

I expect this to be very fast, with almost none of the "performance 
penalty" that Pm talked about.

It may also be possible to limit $PageNameChars to letters and numbers only, 
but I know little about alphabets other than Latin and Cyrillic.
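
As a rough sketch of that idea, assuming a character class built from the 
Unicode properties \p{L} (letters) and \p{N} (numbers) plus the hyphen: 
$PageNameCharsU and CleanPageNameU are names made up for this example and are 
not PmWiki's own variable or function.

```php
<?php
// Hypothetical stand-in for $PageNameChars: hyphen, Unicode letters, numbers.
$PageNameCharsU = '-\\p{L}\\p{N}';

// Hypothetical helper: drop every character outside the allowed class.
function CleanPageNameU($name, $chars) {
  return preg_replace("/[^$chars]+/u", '', $name);
}

echo CleanPageNameU("Петко! Валериев?", $PageNameCharsU), "\n";
echo CleanPageNameU("Page-2 (draft)", $PageNameCharsU), "\n";
```

One caveat: \p{L} does not include combining marks (\p{M}), so scripts that 
rely on them could lose characters with a class this narrow. That is exactly 
the kind of case that would need checking for other alphabets.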


[1] See the "Unicode character properties" section in 
