[pmwiki-users] UTF-8 support and mbstrings

Wed Jun 7 09:20:37 CDT 2006

On Wed, Jun 07, 2006 at 11:17:32AM +0300, Athan wrote:
> I was able to modify PmWiki to enable case-insensitive search.
> 
> I created a utf8toupper-like function named utf8tolower in xlpage-utf-8.php.
> This function is essentially a "reversed" utf8toupper.
> In pagelist.php, I replaced strtolower($t) with utf8tolower. Now the
> page index is created and stored in UTF-8.
> In line 232 of pagelist.php ....
> - if (!preg_match($i, $text))
> + if (!preg_match(utf8tolower($i), utf8tolower($text)))

Yes, something like that could work; I've been trying to avoid the
cost of converting the page text to lowercase in the first place.
Pagelists are slow enough already without adding even more conversion
overhead.

It would certainly be better to factor the utf8tolower() calls
out of the loop, so that we aren't repeatedly converting things
to lowercase on each iteration through the loop.  :-)

Also, in order for term exclusion to work, we need to fix line 230.
So, my suggestion would be

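      if ($searchterms) {
        # lowercase the page name, targets, and text once per page
        $text = utf8tolower($pn."\n".@$page['targets']."\n".@$page['text']);
        # skip the page if it matches the exclusion pattern
        if ($exclp && preg_match($exclp, $text)) continue;
        # every inclusion pattern must match the lowercased text
        foreach($inclp as $i)
          if (!preg_match($i, $text))
            { if ($inclx) $xlist[] = $pn; continue 2; }
      }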

and also to make sure that $exclp and $inclp are converted to
lowercase when they're initialized.
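
As a rough illustration of lowercasing the patterns at initialization,
here is a minimal sketch. It is not the actual pagelist.php code: the
names utf8tolower_sketch() and build_patterns_sketch() are hypothetical,
and the conversion table is a tiny stand-in covering only ASCII and a
few Greek letters rather than the full table a real utf8tolower() would
need.

```php
<?php
// Hypothetical stand-in for the utf8tolower() discussed in this thread.
// Assumption: a table-driven conversion; only A-Z and four Greek letters
// are covered here, purely for illustration.
function utf8tolower_sketch($s) {
  static $map = null;
  if ($map === null) {
    $map = array_combine(range('A', 'Z'), range('a', 'z'));
    $map += array('Α' => 'α', 'Β' => 'β', 'Γ' => 'γ', 'Δ' => 'δ');
  }
  return strtr($s, $map);
}

// Build the inclusion and exclusion patterns once, before the page loop,
// so that no term is lowercased again on each iteration.
function build_patterns_sketch($incl, $excl) {
  $inclp = array();
  foreach ($incl as $t)
    $inclp[] = '/' . preg_quote(utf8tolower_sketch($t), '/') . '/';
  $exclp = '';
  if ($excl) {
    $q = array();
    foreach ($excl as $t)
      $q[] = preg_quote(utf8tolower_sketch($t), '/');
    $exclp = '/' . implode('|', $q) . '/';
  }
  return array($inclp, $exclp);
}
```

With the patterns prepared this way, the loop only has to lowercase each
page's text once and can run preg_match() on it unchanged.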

But all of this needs to be on a switch of some sort; we don't want to 
be calling utf8tolower() for sites that aren't running utf8, nor do
we want to give them the execution overhead of unnecessary calls to
convert things to lowercase.
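
One possible shape for such a switch is sketched below. Everything in
it is an assumption made for illustration: the flag
$EnableUtf8SearchSketch, the helper fold_sketch(), and the function
page_matches_sketch() are hypothetical names, and a real site would
derive the flag from its own configuration.

```php
<?php
// Hypothetical switch; not an existing PmWiki variable.
$EnableUtf8SearchSketch = true;

// Tiny stand-in for utf8tolower(); covers only a few Greek letters.
function fold_sketch($s) {
  return strtr($s, array('Α' => 'α', 'Θ' => 'θ', 'Η' => 'η', 'Ν' => 'ν'));
}

// Choose the conversion function once; null means "no conversion".
$fold = $EnableUtf8SearchSketch ? 'fold_sketch' : null;

// Returns true if $text passes the exclusion and inclusion patterns.
function page_matches_sketch($text, $inclp, $exclp, $fold) {
  // Sites that are not running UTF-8 skip the conversion entirely.
  if ($fold) $text = call_user_func($fold, $text);
  if ($exclp && preg_match($exclp, $text)) return false;
  foreach ($inclp as $i)
    if (!preg_match($i, $text)) return false;
  return true;
}
```

In this sketch a non-UTF-8 site passes null and keeps its usual
case-insensitive patterns (for example '/hello/i'), so it pays only for
one extra test of $fold per page.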

Pm

> > On Tue, Jun 06, 2006 at 02:55:43PM +0300, Athan wrote:
> >> Is there any hope of seeing such a version of PmWiki?
> >> The current version works fine with single-byte characters but lacks
> >> case-insensitive search when using non-Latin UTF-8 strings.
> >> So, why not an mbstring version? Most hosts support PHP with mbstring
> >> compiled in. Besides, it would be easy to disable it when the mbstring
> >> functions are not available.
> >
> > PmWiki uses preg_match for its (case-insensitive) text search -- this is
> > faster than calling the string or mbstring functions.  Unfortunately,
> > there isn't an mbstring version of preg_match available.
> >
> > (Yes, there's an mb_eregi function that does pattern matching, but
> > unfortunately it uses a somewhat different syntax from the pcre-based
> > pattern matching functions.)
> >
> > So, in the case of search it's not a simple matter of replacing
> > functions with mbstring equivalents -- it requires reworking the
> > entire algorithm to be able to use mb_eregi, or avoiding the pattern-match
> > searches altogether.
> >
> > However, I'm looking to modularize the pagelist functions anyway, so
> > perhaps text search can be placed into its own module.  Then it would
> > be much easier to have a mbstring version of text search.
> >
> > Votes are being recorded at http://www.pmwiki.org/wiki/PITS/00682.  :-)
> >
> > Pm 