[pmwiki-users] Re: central european characters

Patrick R. Michaud pmichaud at pobox.com
Tue Sep 27 09:24:53 CDT 2005


On Tue, Sep 27, 2005 at 06:32:10AM +0000, Anno wrote:
> I found a solution to my problem, and I think it is actually a bug in how the
> character substitution is done.
> 
> In pagelist.php and in pmwiki.php, in 3 places the special characters are
> replaced with the html entities, but if the character is a part of a html
> entity, the substitution shouldn't be done.
> ...

Possible short answer: Use utf-8 -- it works fine.  Try doing a search
for C+circumflex from the searchbox at http://www.pmwiki.org/wiki/UTF8.

Long answer:

Unfortunately, I don't think we can make the proposed change without 
possibly breaking a lot of other things in the process.  But let's
investigate...

The translation from the special characters into HTML entities
is actually being performed by the *browser*, not by PmWiki.
In this case, the special characters are "c" and "C" with circumflex,
which don't appear in the iso-8859-1 character set (and thus I can't
enter them in my mail message, sorry).  The browser recognizes that
these characters are outside of iso-8859-1, so it automatically 
converts those characters into HTML entities &#269; and &#268; when
the edit text or search phrase is submitted to PmWiki.
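
The browser-side substitution described above can be approximated in a few
lines of Python. This is only a sketch: browser_submit is a hypothetical name,
and it assumes that the browser's fallback behaves like Python's
"xmlcharrefreplace" error handler, which writes any character the target
charset cannot represent as a decimal numeric entity.

```python
def browser_submit(text: str, page_charset: str = "iso-8859-1") -> bytes:
    """Hypothetical model of a browser encoding form text for submission.

    Characters that exist in page_charset are encoded normally; anything
    else is replaced by a decimal numeric character reference.
    """
    return text.encode(page_charset, errors="xmlcharrefreplace")


# U+010D and U+010C are outside iso-8859-1, so they become entities.
# U+00E9 (e with acute) is inside iso-8859-1 and is sent as a single byte.
print(browser_submit("\u010d \u010c \u00e9"))
```

PmWiki never sees the original characters in this model, only the entity text.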

So, all PmWiki sees is "&#269;" and "&#268;" coming from the browser,
and it dutifully saves or uses those values as being what came from
the author's browser.  Later, when PmWiki is sending those values back 
to the edit form, it replaces the '&' with '&amp;' to produce 
"&amp;#269;" so that the edit form correctly displays "&#269;", which is 
what PmWiki had received from the author.
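
The round trip through the edit form can be sketched with Python's standard
html module. This is an illustration, not PmWiki's actual code: to_edit_form
is a hypothetical name, and html.unescape stands in for the browser decoding
the contents of the textarea.

```python
import html


def to_edit_form(stored: str) -> str:
    """Escape stored page text before placing it inside an edit form."""
    return html.escape(stored, quote=False)


stored = "&#269;"                 # what the browser submitted earlier
in_form = to_edit_form(stored)    # '&' has been replaced by '&amp;'
shown = html.unescape(in_form)    # what the browser displays in the textarea

print(in_form)
print(shown == stored)
```

Because every '&' is escaped, the text shown in the form is identical to the
text that was stored.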

If we say that '&' that are part of character entities shouldn't
be translated, then an author who saves "&raquo;" in an edit form
will get back "»" when the page is later edited, which isn't
exactly what the author originally entered (and may not be what is
desired).
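
To see why leaving entity ampersands alone breaks the round trip, here is a
small sketch of the proposed rule. The function name and the regular
expression are assumptions made for illustration; they are not taken from
pagelist.php or pmwiki.php.

```python
import html
import re

# Match '&' only when it does NOT start something that looks like an entity.
BARE_AMP = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")


def escape_keep_entities(stored: str) -> str:
    """Hypothetical variant: escape '&' unless it is part of an entity."""
    return BARE_AMP.sub("&amp;", stored)


typed = "&raquo;"                                    # author typed these 7 characters
shown = html.unescape(escape_keep_entities(typed))   # browser decodes the form

print(typed == shown)   # the author does not get the typed text back
```

A plain ampersand such as the one in "a & b" is still escaped under this rule,
but the literal entity text is silently turned into a single character.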

What's worse, although the browser will automatically replace
C+circumflex with &#268;, it doesn't encode other ampersands in the
page.  Thus, when PmWiki receives "&#268;" from a browser, it has
no way of knowing if the author originally typed a C+circumflex
or the sequence "&#268;".  
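
The ambiguity is easy to reproduce. In this sketch (again using a hypothetical
browser_submit helper, and assuming the browser's fallback matches Python's
"xmlcharrefreplace" handler), two different inputs produce identical bytes
under iso-8859-1 but stay distinct under utf-8.

```python
def browser_submit(text: str, page_charset: str = "iso-8859-1") -> bytes:
    """Hypothetical model of a browser encoding form text for submission."""
    return text.encode(page_charset, errors="xmlcharrefreplace")


typed_character = "\u010c"    # the author typed the single character
typed_sequence = "&#268;"     # the author typed the six-character sequence

# Under iso-8859-1 the server receives the same bytes in both cases.
same_latin1 = browser_submit(typed_character) == browser_submit(typed_sequence)

# Under utf-8 the character is representable, so the two inputs differ.
same_utf8 = (browser_submit(typed_character, "utf-8")
             == browser_submit(typed_sequence, "utf-8"))

print(same_latin1, same_utf8)
```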

However, now that I understand the issue, I may be able to fix the
search so that it will correctly find instances of "&#268;" in
pages.  But edits will continue to show the C+circumflex characters
as &#268; because that's what the browser is sending.

Another, possibly better solution will be to switch Slovene to use
the utf-8 character set (or a character set that supports the
extra characters) instead of iso-8859-1.  As mentioned above, the 
special characters outside of iso-8859-1 work fine for searching 
and editing when PmWiki is set for utf-8.  (The downside is that 
spaced wikiwords don't work in utf-8 yet.)

If you want me to migrate the existing Slovene pages into utf-8
mode, I can do that.

In the meantime, I'll see if I can get the search to work even
when things are set to iso-8859-1.

Pm



