[pmwiki-users] Keep() function documented

Joachim Durchholz jo at durchholz.org
Wed Jul 6 09:35:32 CDT 2005


Patrick R. Michaud wrote:

> On Wed, Jul 06, 2005 at 11:00:00AM +0200, Joachim Durchholz wrote:
> 
>>(It would be best if all the rules were integrated into a single one, so 
>>that there is no "order" in which markups are processed - but I don't 
>>think that's an option for PmWiki. 
> 
> ...because there has to be an "order".  It's vitally important that
> '''text''' be found and processed before ''text''; it's equally 
> important that url processing take place before wikiword links.  

Sure, there's a kind of priority at work. While priority is easiest to 
achieve through ordering, there are other approaches - in regexes, for 
example, competing valid parses are prioritised via greediness rules (or 
non-greediness rules, given the appropriate modifiers).
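
For instance (purely a sketch, not what PmWiki does - the callback name 
is made up), a single pattern can let greediness pick ''' over '' 
instead of relying on rule order:

   function PrioritiseQuotes($m) {
     # the greedy '{2,3} grabbed three quotes if it could, two otherwise
     $tag = (strlen($m[1]) == 3) ? 'strong' : 'em';
     return "<$tag>$m[2]</$tag>";
   }
   $line = preg_replace_callback("/('{2,3})(.*?)\\1/",
                                 'PrioritiseQuotes', $line);

Here the greedy quantifier prefers the three-quote parse, so '''bold''' 
never gets misread as '' markup.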

> Getting the order correct is why I went to the trouble of creating 
> a generic rule-based ordering system -- it's fundamental to the
> task of converting wiki markup to some other output form.

That table-driven approach is one of the things that make PmWiki so 
incredibly flexible. It doesn't avoid unwanted interactions between 
conflicting markups, but it makes the conflicts transparent - and that's 
enough to turn markup design from an arcane, peril-fraught Black Art 
into a common Art (it's still an Art, of course, but the entry barrier 
is far lower than what's usual in other wikis).
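
(For bystanders: the ordering constraints are declared right in the rule 
table. From memory - details may be off - the standard rules look 
roughly like this, with "<''" meaning "run before the '' rule":

   Markup("'''", "<''",    "/'''(.*?)'''/", "<strong>$1</strong>");
   Markup("''",  "inline", "/''(.*?)''/",   "<em>$1</em>");

so a conflict between two rules is visible as soon as you read their 
declarations.)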

>>>> ... BTW why does PmWiki split the text into lines? Efficiency
>>>> reasons, or other considerations?)
>>> 
>>> Two reasons:  First, the line-by-line model is the mental model
>>> that most authors tend to understand when processing text; it
>>> makes sense to keep that particular model.
>> 
>> ... It's just that this forces constructs that may span several
>> lines into a *very* early stage of processing (whether these
>> constructs are nestable or not).
> 
> False.  Notably, PmWiki's implementation of block markups (including 
> nested lists, tables, etc.) all happen at the line-by-line level,
> which is currently fairly late in the processing cycle.

Um... well, but it requires explicit coding. You can't recognise a 
(:table:)...(:tableend:) structure with a single PCRE; you need to 
hand-code it in PHP.

More importantly, you can't do nested tables (unless the docs are outdated).
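
Here's the sort of explicit coding I mean - a hypothetical, much 
simplified line-by-line pass (nothing like PmWiki's real table code):

   function FmtTables($lines) {
     $out = array();
     $intable = false;          # a single flag - no room for nesting
     foreach ($lines as $line) {
       if (preg_match('/^\\(:table/', $line)) {
         $out[] = '<table>';  $intable = true;
       } else if (preg_match('/^\\(:tableend:\\)/', $line)) {
         $out[] = '</table>'; $intable = false;
       } else if ($intable) {
         $out[] = "<tr><td>$line</td></tr>";
       } else {
         $out[] = $line;
       }
     }
     return $out;
   }

Supporting nested tables would mean replacing that boolean with a 
stack, which is exactly the kind of state a plain PCRE rule can't 
carry.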

>>> Secondly, it's a huge efficiency boost -- my experiments have
>>> shown me that the many pattern matches that get performed are
>>> *much* more efficient on many small strings than they are on one
>>> very large one.
>> 
>> I suspect it's the replacement step that is more efficient -
>> replacing two characters with fifteen in a twenty-character string
>> is bound to be more efficient than doing the same in a 20K text
>> (there are advanced string packages that don't exhibit this
>> behavior, but they have been largely unknown and unused).
> 
> I think it's the match itself that is sped up as well.  Many of
> PCRE's match optimizations work by scanning from the end of the
> string to be matched, thus if the string is long there's a lot of
> scanning to be performed.

Hmm... could be another reason.
It would be interesting to see which rules take longer on a long string 
- a quick harness like the sketch below could measure that. Maybe those 
regexes could be rewritten so that there's no longer a big difference.
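
Something like this (untested, with made-up sample data) would tell us, 
per rule:

   $line  = str_repeat('word ', 10) . "'''bold''' tail";
   $lines = array_fill(0, 1000, $line);
   $big   = implode("\n", $lines);
   $pat   = "/'''(.*?)'''/";

   $t0 = microtime(true);
   foreach ($lines as $l) preg_replace($pat, '<strong>$1</strong>', $l);
   $t1 = microtime(true);
   preg_replace($pat, '<strong>$1</strong>', $big);
   $t2 = microtime(true);
   printf("per-line: %.4fs   one block: %.4fs\n", $t1-$t0, $t2-$t1);

Swap in each rule's actual pattern and a realistic page body, and the 
slow ones should stand out.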

> Beyond that, every rule that contains a '*' or '+'
> quantifier in it means that there are a lot more combinations and
> interactions that PCRE has to try before it can definitively decide
> a match doesn't exist.

Sure, but all match attempts except those that explicitly request 
otherwise will terminate at the next end-of-line.
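
(That's just PCRE's default treatment of '.': it won't cross a newline 
unless the /s modifier explicitly asks for it. E.g.:

   preg_match("/'''(.*?)'''/",  "'''unclosed\nmore'''");  # no match
   preg_match("/'''(.*?)'''/s", "'''unclosed\nmore'''");  # matches

so even in one big string, most runaway matches die at the next line 
break.)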

Not that I question that scanning and replacing in many small strings 
is more efficient than scanning and replacing in a single large string.

I was thinking that integrating all these regexes into one huge single 
regex should also boost things, and tremendously so (since all those 
multiple scans would be replaced by a single, linear one). Then I found 
that combining regexes cannot be done automatically, at least not 
easily... and that the number of capturing parentheses is limited to 
99, which further restricts the options... so I decided to punt on the 
issue :-)
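
For the record, the mechanical merge I had in mind looks roughly like 
this (hypothetical code - and note how every merged rule burns 
capturing groups, which is where the limit of 99 starts to hurt):

   function CombinedRule($m) {
     # PHP omits trailing groups that didn't participate in the match,
     # so the '' branch is recognisable by $m[2] being set
     if (isset($m[2])) return "<em>$m[2]</em>";
     return "<strong>$m[1]</strong>";
   }
   $line = preg_replace_callback("/'''(.*?)'''|''(.*?)''/",
                                 'CombinedRule', $line);

Two rules merge fine; a hundred don't.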

Well, maybe we simply have to live with the fact that regex-parsed 
nested block structures need to be done before the 'split' rule.

Regards,
Jo


