[pmwiki-users] Defaulting PmWiki to utf8

sti at pooq.com sti at pooq.com
Wed Nov 14 18:29:12 CST 2007


Patrick R. Michaud wrote:

> The big problem is that any existing pages of an iso-8859-1
> site will have been saved using an iso-8859-1 encoding, using
> iso-8859-1 encoded filenames.  Thus, it's not just a simple
> matter of changing a configuration option -- we also have to
> convert the various page files as well.

Right now I'm working on a site with many French page names. I looked into
using utf-8 when I started a few months ago but ran into some problems, and
ended up changing back.

Most of my problems had to do with the fact that my hosting site didn't have
good utf-8 support for its shell-based tools, but I didn't like what happened
to my URLs either.

Since PmWiki uses plain-ASCII URLs (for good reasons) I ended up with large
numbers of unreadable page URLs due to the escaped utf-8 characters.
Admittedly, even with iso-8859-1, I have that to some degree, but I find
a page reference like [[Lang.Français]], which renders to a URL like

  http://www.example.com/Lang/Fran%e7ais

to be (somewhat) easier to read than the utf-8 equivalent of

  http://www.example.com/Lang/Fran%c3%a7ais

and it gets progressively worse the more non-ASCII characters there are in a
page name.
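
To make the difference concrete, here is a minimal sketch (in Python, not
PmWiki's own code) of how the same page name escapes under the two encodings.
The helper name escaped_name is made up for illustration, and note that
Python emits uppercase hex digits where the URLs above show lowercase.

```python
from urllib.parse import quote

def escaped_name(name: str, charset: str) -> str:
    """Percent-encode a page name after encoding it in the given charset."""
    # safe="" means every non-unreserved character gets escaped.
    return quote(name, safe="", encoding=charset)

# One escaped byte per accented letter in iso-8859-1, two in utf-8.
for charset in ("iso-8859-1", "utf-8"):
    print(charset, escaped_name("Français", charset))
```

Each accented Latin letter costs one %xx escape in iso-8859-1 and two in
utf-8, which is why the URLs get longer as the number of accents grows.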

Now, French computer users are used to seeing URLs with the accents dropped, so

  http://www.example.com/Lang/Francais

would be considered an acceptable URL, although not as acceptable as a page
name. I've been thinking that for proper utf-8 support, one might want to be
able to supply a Name->URL mapping function as part of the configuration. In
the case of French, it would just replace accented characters with their
non-accented counterparts.
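
As a rough illustration of what such a mapping function might look like for
French, here is a Python sketch. It is not a PmWiki hook; the name url_name
is hypothetical, and the approach (Unicode decomposition, then dropping the
combining marks) is just one way to strip accents.

```python
import unicodedata

def url_name(name: str) -> str:
    """Map a page name to an ASCII-friendlier form by dropping accents."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Keep the base letters, discard the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(url_name("Lang.Français"))
```

Characters that have no decomposition are passed through unchanged, so a
real mapping would probably need an extra table for those cases.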

For something like Chinese (there was recently a request on a list for help
in dealing with Chinese URLs), you would probably want to translate to pinyin,
as that uses only ASCII characters, although it would require a table of
roughly 25,000 entries.

Then, when looking for a page on disk, PmWiki would first look under the name
as given, and secondly under the mapped name. Since the mapping might be
irreversible (as is the case in my French example) one would want to store the
canonical name as an attribute in the page, for use in things like {$FullName}.
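
The two-step lookup could be sketched like this (Python, with a plain dict
standing in for the page files on disk). The names find_page, strip_accents
and the canonical_name attribute are all assumptions made for the example,
not anything PmWiki defines.

```python
import unicodedata

def strip_accents(name: str) -> str:
    """Stand-in for the configurable Name->URL mapping function."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_page(store: dict, name: str):
    """Look under the name as given first, then under the mapped name."""
    for key in (name, strip_accents(name)):
        if key in store:
            return store[key]
    return None

# The page is stored under its mapped name; the original name survives
# as an attribute because the mapping cannot be reversed.
store = {"Lang.Francais": {"canonical_name": "Lang.Français", "text": "Bonjour"}}

page = find_page(store, "Lang.Français")
print(page["canonical_name"])
```

Because the stored attribute carries the accented name, something like
{$FullName} could still show it even though the file is found under the
accent-free name.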

I imagine that one would want PmWiki to be able to gradually move from one
system to another, so that when one edits a page called Lang.Fran%e7ais from
the previous setup, it gets renamed to Lang.Francais when saved. I admit this
could get messy.

> So, what I'm seeing at the moment is that if we switch to using
> utf8 by default, admins of existing sites have to be notified 
> somehow that the default has changed and told how to configure
> the site to continue using iso-8859-1, or given a procedure to
> somehow convert the site's pages to utf8.  And once someone
> starts the utf8 conversion, it can get a bit messy to try to
> convert back.

I noticed a while back that the encoding of a page is stored internally. I
would simply use the current encoding setting to dictate how to display a
page, and what encoding to use when saving. Thus, as pages get edited, they
would gradually all be converted over to the new encoding. As pages are
displayed, they get automatically converted to the current encoding.
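
In rough terms, the idea is something like the following Python sketch. The
page layout (a dict with charset and body fields) and the function names are
assumptions for illustration; I have not checked how PmWiki actually stores
or reads the encoding.

```python
def load_text(page: dict) -> str:
    """Decode a stored page using the encoding recorded with it."""
    return page["body"].decode(page["charset"])

def save_page(text: str, site_charset: str) -> dict:
    """Store text in the site's current encoding and record that encoding."""
    return {"charset": site_charset, "body": text.encode(site_charset)}

# A page saved back when the site was running iso-8859-1.
old_page = {"charset": "iso-8859-1", "body": b"Fran\xe7ais"}

# Displaying it decodes with the stored charset; saving it after an edit
# re-encodes with whatever the site is configured to use now.
text = load_text(old_page)
new_page = save_page(text, "utf-8")
print(new_page["charset"], new_page["body"])
```

Pages that are never edited simply stay in the old encoding and are decoded
on the fly, so nothing has to be converted in bulk.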

> Any thoughts on the overall process, how much of an impact a move
> like this might have on existing sites, or how we might go about this?

Well, those are my thoughts. I think the overall move to utf-8 would be a good
thing, but I fear there will be some headaches along the way.


