[pmwiki-users] Defaulting PmWiki to utf8

Hans design5 at softflow.co.uk
Sat Dec 1 11:43:02 CST 2007


Friday, November 30, 2007, 11:11:47 PM, H. Fox wrote:

> I think it may be very important to send the charset for security reasons.

I quote from CERT
http://www.cert.org/tech_tips/malicious_code_mitigation.html

Explicitly Setting the Character Encoding

Many web pages leave the character encoding ("charset" parameter in
HTTP) undefined. In earlier versions of HTML and HTTP, the character
encoding was supposed to default to ISO-8859-1 if it wasn't defined.
In fact, many browsers had a different default, so it was not possible
to rely on the default being ISO-8859-1. HTML version 4 legitimizes
this - if the character encoding isn't specified, any character
encoding can be used.

If the web server doesn't specify which character encoding is in use,
it can't tell which characters are special. Web pages with unspecified
character encoding work most of the time because most character sets
assign the same characters to byte values below 128. But which of the
values above 128 are special? Some 16-bit character-encoding schemes
have additional multi-byte representations for special characters such
as "<". Some browsers recognize this alternative encoding and act on
it. This is "correct" behavior, but it makes attacks using malicious
scripts much harder to prevent. The server simply doesn't know which
byte sequences represent the special characters.

For example, UTF-7 provides alternative encoding for "<" and ">", and
several popular browsers recognize these as the start and end of a
tag. This is not a bug in those browsers. If the character encoding
really is UTF-7, then this is correct behavior. The problem is that it
is possible to get into a situation in which the browser and the
server disagree on the encoding. Web servers should set the character
set, then make sure that the data they insert is free from byte
sequences that are special in the specified encoding. For example:


<HTML>
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
<TITLE>HTML SAMPLE</TITLE>
</HEAD>
<BODY>
<P>This is a sample HTML page
</BODY>
</HTML>

The META tag in the HEAD section of this sample HTML forces the page
to use the ISO-8859-1 character set encoding.

=============== end quote =============
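
To make the UTF-7 point in the quote concrete, here is a small Python sketch (mine, not from the CERT page). The payload string and variable names are made up for illustration. It assumes a server that escapes "<" and ">" as if the page were ISO-8859-1, and a browser that has guessed UTF-7 because no charset was sent.

```python
import html

# Hypothetical user input: contains no literal "<" or ">" at all.
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Escaping for an ASCII-compatible charset finds nothing to escape.
escaped = html.escape(payload)

# The same bytes, read the way the server intended (ISO-8859-1)
# and the way a browser that guessed UTF-7 would read them.
raw = escaped.encode("ascii")
as_latin1 = raw.decode("iso-8859-1")
as_utf7 = raw.decode("utf-7")

print(as_latin1)  # still the harmless-looking "+ADw-..." text
print(as_utf7)    # now contains real "<" and ">" characters
```

The escaping step is not wrong for ISO-8859-1. It only fails when the browser is left to pick a different encoding, which is why the quote asks for the charset to be set explicitly.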

The PmWiki default skin does not set charset=ISO-8859-1, although the
documentation at PmWiki/Internationalizations says that ISO-8859-1 is
PmWiki's default.

When a skin template sets this with a meta tag above the title tag,
any subsequent charset setting, for instance via
   include scripts/xlpage-utf-8.php
will appear later in the HTML header and, I assume, override the
earlier setting. Please correct me if I am wrong. I am just trying to
find a good standard for skins.
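
As a rough aid for thinking about which declaration wins, here is a Python sketch of the priority order given in HTML 4.01 section 5.2.2, where the charset parameter of the HTTP Content-Type header comes before a META declaration. The function name and the regular expressions are my own simplification, not anything PmWiki does. Where two META tags conflict, the sketch simply takes the first one it finds, and I have not verified what any particular browser does in that case.

```python
import re

def declared_charset(http_content_type, html_head):
    """Return the declared charset, HTTP header first, then META, else None."""
    # 1. A charset parameter on the HTTP Content-Type header has priority.
    if http_content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', http_content_type, re.I)
        if m:
            return m.group(1)
    # 2. Otherwise use a META declaration in the head (first match only;
    #    this is a simplification, not a real HTML parser).
    m = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', html_head, re.I)
    if m:
        return m.group(1)
    # 3. Nothing declared: the browser is left to guess.
    return None

head = '<META http-equiv="Content-Type"\ncontent="text/html; charset=ISO-8859-1">'
print(declared_charset("text/html; charset=utf-8", head))
print(declared_charset("text/html", head))
```

If that order holds, a skin's meta tag only matters when no charset is sent in the HTTP header.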

I recommend reading the page I quoted from
http://www.cert.org/tech_tips/malicious_code_mitigation.html
not just for the charset advice, but for much more on validating
input characters.



  ~Hans



