[pmwiki-users] search does not find text with markup

Tue Dec 20 16:08:28 CST 2005

On Tue, Dec 20, 2005 at 09:18:28AM +0000, Karl Loncarek wrote:
> my problem is quite simple:
> 
> When I search for e.g. "example"
> then a wiki page that contains only: 
> e''x''ample
> 
> is not found. Any way to achieve this easily (that markup is ignored, when 
> searching)?
> 
> any ideas?

Here's where things stand in 2.1.beta14.  When a page is saved,
PmWiki runs the markup text through the MarkupToHTML function 
(excluding things such as (:include:) and (:pagelist:)) and then
saves the first 600 bytes as an "excerpt" attribute.  This leading
text is then readily available for things like RSS feeds and
searches, and can be used to provide some idea of a page's contents
in the absence of an explicit (:description:) directive.
At the moment the 600-byte limit on excerpts is primarily there
to prevent the internal $PCache from taking up too much memory,
and also to keep disk space requirements down.
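
To make the save-time step concrete, here is a rough sketch in Python
(not PmWiki's actual PHP code).  The names make_excerpt and
EXCERPT_LIMIT are made up for illustration, and UTF-8 is assumed as the
page encoding; the only thing taken from the description above is the
600-byte cutoff applied to the rendered output.

```python
EXCERPT_LIMIT = 600  # byte limit described above


def make_excerpt(rendered_html, limit=EXCERPT_LIMIT):
    """Return at most `limit` bytes of the rendered page as a string.

    Hypothetical helper: truncates on a byte boundary and drops any
    multi-byte character that the cut would leave incomplete.
    """
    raw = rendered_html.encode("utf-8")[:limit]
    # errors="ignore" discards a trailing partial UTF-8 sequence
    return raw.decode("utf-8", errors="ignore")
```

Cutting on bytes rather than characters keeps the memory and disk cost
per page bounded, which is the stated reason for the limit.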

However, we could modify this somewhat -- we could save the entire
rendered text, and we could strip the HTML tags from the excerpt.
This could nicely resolve the problem described above, since the 
excerpt would be searchable as well as the markup text.  It
would also allow searches to easily display the text surrounding
a found search term.  
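
As a rough illustration of that idea, the sketch below is in Python
rather than PmWiki's PHP, and none of the names in it come from PmWiki.
It assumes e''x''ample renders to e<em>x</em>ample, strips the tags, and
then searches the remaining text, returning a little surrounding
context.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the fragment's text with all tags removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def search_with_context(html, term, width=20):
    """Case-insensitive search of the tag-free text.

    Returns the match plus up to `width` characters on either side,
    or None if the term is not found.
    """
    text = strip_tags(html)
    pos = text.lower().find(term.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]
```

With this, a search for "example" matches the rendered form of
e''x''ample, because the <em> tags no longer split the word.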

The downsides of this approach are: 
1.  by removing the HTML from an excerpt we're left with only 
    the text -- no structural indications such as paragraphs or lists
    in the excerpt,
2.  storing the rendered text in the page file increases the
    page file size a bit (although probably not too significantly
    except for large pages),
3.  PmWiki's memory-based page cache can get too large if each 
    page's full rendered text is stored there as its excerpt attribute.

Still, these three downsides might be a good trade for the
extra functionality we might get as a result.  Any opinions?

Pm