[pmwiki-users] Core Spam Block Thoughts

Wed Apr 19 17:44:49 CDT 2006

On Apr 5, 2006, at 3:57 PM, Chris Lott wrote:

> It would be REALLY nice if the blocklist script were enhanced a bit to
> a) have an option for blocking only whole words, 2) be able to block
> using regular expressions, 3) have overrides so that a field could
> override a block in the sitewide config.

1(a) it's possible, but difficult, since a "whole word" can start or  
end in something other than a space character.  The reason this is  
difficult is outlined in #2 below.  Essentially it would require  
using regexes.  The partial workaround is to use whitespace, but that  
will only catch some instances of the word, not all of them.
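To illustrate the difference, here is a minimal Python sketch (not PmWiki code; the function names padded_match and whole_word_match are made up for this example). It compares the whitespace workaround with a regex word boundary:

```python
import re

def padded_match(term, text):
    # Whitespace workaround: matches only when the term has a space on both sides.
    return f" {term} " in text

def whole_word_match(term, text):
    # Regex alternative: \b matches at any word/non-word boundary,
    # including punctuation and the start or end of the text.
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

post = "Cheap pills, pills. (pills)"
print(padded_match("pills", post))      # misses: no occurrence has a space on both sides
print(whole_word_match("pills", post))  # catches the word next to punctuation
print(whole_word_match("pills", "he spills water"))  # does not fire inside "spills"
```

The padded version stays a cheap string search but misses words next to punctuation; the regex version catches them at the cost of a regex evaluation per term.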

2) regex -- while I could POSSIBLY see having an additional syntax  
like "regex:^word$" as an alternative to blocking, I don't recommend  
this be done for the whole blocklist.  The reason is one of server  
resources.  I have a blocklist that is enormous and EXCEPTIONALLY  
effective.  To use regex instead of string matching would grind my  
server to a crawl.  Regex is enormously more taxing than simple  
string matches.  If people want an option to turn scoring off, then  
we could stop matching the moment that there's a positive match.  I  
like scoring my matches because it's much easier to pick out the  
worst offenders, and easier to pick up the possible false positives  
(people honestly trying to post who were blocked).
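The trade-off between scoring and stopping at the first hit can be sketched as follows. This is an illustrative Python sketch, not the actual blocklist script; it assumes plain lowercase substring terms, and score_post and is_blocked are hypothetical names:

```python
def score_post(text, blocklist):
    # Scoring: every term is checked, and the number of matching terms is returned.
    lowered = text.lower()
    return sum(1 for term in blocklist if term in lowered)

def is_blocked(text, blocklist):
    # No scoring: any() stops at the first term that matches.
    lowered = text.lower()
    return any(term in lowered for term in blocklist)

blocklist = ["casino", "viagra", "poker"]  # assumed to be lowercase already
post = "Casino and poker night"
print(score_post(post, blocklist))  # number of distinct terms that matched
print(is_blocked(post, blocklist))  # True as soon as one term matches
```

With scoring, a post that trips many terms stands out from one that trips a single term, which is what makes likely false positives easier to spot; the early-exit version gives that up in exchange for less work on blocked posts.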

3) It's possible to add an unblocklist or to have markup in a  
blocklist page that unblocks a term.  The problem is that now you  
have to change the parse process --
	a) first you parse the list(s) and remove "block:" from the entries
	   -- now you have a list of what you're looking to block
	b) pull the IP addresses out of the list, they are compared differently
	c) check IP
	d) now you parse the unblock list and remove items from the block
	   entries that are in the unblock entries
	e) now you compare the list items one at a time with every word
	   posted on the page
	f) parse regular expressions through the post

You're asking to add steps d and f.  If the blocklist is long, in step d  
it has to do a needle-in-haystack search through every item in the  
blocklist.  Step e already takes a long time if the post is long and  
the list is long.  Step f has the potential to take even longer --  
because regex parsing is enormously more complex for the server  
processes to handle.  If the post is long it could grind the server  
to a halt -- YSMV (your server may vary).
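Steps a through f can be sketched roughly as below. This is a Python illustration under stated assumptions, not the PmWiki implementation: the "unblock:" prefix is hypothetical, "regex:" is only the syntax floated above, terms are matched as case-insensitive substrings, and IPs are compared as exact strings.

```python
import ipaddress
import re

def is_ip(entry):
    # True when the entry parses as an IPv4 or IPv6 address.
    try:
        ipaddress.ip_address(entry)
        return True
    except ValueError:
        return False

def check_post(text, poster_ip, list_lines):
    blocked, unblocked, patterns = [], set(), []
    # (a) parse the list and strip the prefixes
    for line in list_lines:
        line = line.strip()
        if line.startswith("regex:"):          # proposed syntax, see point 2
            patterns.append(re.compile(line[len("regex:"):]))
        elif line.startswith("unblock:"):      # hypothetical prefix
            unblocked.add(line[len("unblock:"):].strip())
        elif line.startswith("block:"):
            blocked.append(line[len("block:"):].strip())
    # (b) pull the IP addresses out; they are compared differently
    ips = {e for e in blocked if is_ip(e)}
    terms = [e for e in blocked if not is_ip(e)]
    # (c) check the poster's IP
    if poster_ip in ips:
        return True
    # (d) drop any term that also appears in the unblock entries
    terms = [t for t in terms if t not in unblocked]
    # (e) compare the remaining terms against the post
    lowered = text.lower()
    if any(t.lower() in lowered for t in terms):
        return True
    # (f) run each regular expression over the post
    return any(p.search(text) for p in patterns)

rules = ["block:casino", "block:lithium", "unblock:lithium",
         "block:10.0.0.5", "regex:^spam$"]
print(check_post("I was put on lithium", "1.2.3.4", rules))
print(check_post("Visit my casino", "1.2.3.4", rules))
```

Even in this small form, steps d and f each add a full pass over the list or the post on top of step e, which is the cost being described.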

This may not be a huge problem for people with servers on steroids,  
but I would like to avoid the complexity.  If a word is a problem on  
a farm, then I'd suggest moving the term to the fields that need to  
block it.  Most words aren't like that, but if I had a medical field  
in my farm, I would be in trouble.  As it is, I had to make a  
decision to remove common psychoactive medications from the  
Kinhost.org blocklist so that users could discuss being put on  
lithium, etc.

However, as a rule, I don't see an issue with adding the regex  
functionality, with prominent warnings that excessive (mis)use of it  
when a simple string would do is not recommended.

I know of one PmWiki installation that died -- the owner believes the  
blocklist was the reason.  Indeed, getting the blocklist page to come  
up was one of the major problems.

Another caution -- the history on your blocklist page can become  
expansive.  You should probably have the history purged frequently if  
you are maintaining a large blocklist.

Crisses
--
Six hours in a car with two anime freaks - hopefully I'll survive  
with my hair the same colour.
  -Malcolm&