[pmwiki-users] Issues with non basic latin in uploads file names
Petko Yotov
5ko at 5ko.fr
Wed Feb 22 22:41:31 PST 2023
On 23/02/2023 01:30, kirpi at kirpi.it wrote:
> I have a twofold problem with uploads file names.
>
> First issue.
> My install is UTF-8 enabled but, even if allowing the display of all
> languages and all alphabets in pages is fine, UTF-8 file names in
> uploads are a pain as they create several issues once they are
> uploaded to my shared hosting server (example: I cannot delete them
> anymore, not even via FileZilla, nor rename them).
> By following some hints[1] and some search[2] I tried and put together
> an easy way to replace accented characters as well as any problematic
> characters in uploads. Although apparently verbose, this "table" seems
> straightforward to understand, extend and adapt to one's need.
>
>
> $UploadNameChars = "-\\w. "; # default: allow dash, letters, digits,
>                              # underscore, and dots (no spaces)
This appears to allow a space, but later $MakeUploadNamePatterns
replaces the spaces with underscores.
There is one problem with the '\w' ("word") character type: its meaning
may change depending on the locale. On a server with the English (UK)
locale \w means [a-zA-Z0-9_], but with the English (NZ) locale it may
include 5 accented Māori characters, and with a French locale it may also
include 10 different accented characters (but not all possible accented
characters). Even this sometimes behaves inconsistently on my own
servers.
So I usually replace '\w' with 'a-zA-Z0-9_' to make it clear I only want
plain Latin characters.
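A minimal sketch of that replacement, assuming it goes in config.php. The
check underneath is a stand-alone illustration with a made-up file name; it
works on bytes because the pattern has no /u modifier, so each UTF-8 "é"
(two bytes) is stripped.

```php
<?php
# For config.php: explicit ASCII ranges instead of the locale-dependent \w.
$UploadNameChars = "-a-zA-Z0-9_. ";

# Stand-alone check of what the resulting character class strips.
# The sample file name is hypothetical.
$strip = "/[^$UploadNameChars]/";
$clean = preg_replace($strip, '', 'Résumé 2023.pdf');
echo $clean, "\n"; # prints: Rsum 2023.pdf
```

The space survives this step; it is only turned into an underscore by a
later entry in $MakeUploadNamePatterns.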
I suspect that when $UploadNameChars was implemented, it was meant to
include plain letters rather than locale letters. But it is as it is now,
and I don't want to change the PmWiki default because it might
inconvenience administrators with existing wikis.
> $MakeUploadNamePatterns = array(
> 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A',
> 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'AE', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
> 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N',
> 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
> 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'ss',
> 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'ae',
> 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i',
> 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o',
> 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y',
> 'þ'=>'b', 'ÿ'=>'y', 'Ğ'=>'G', 'İ'=>'I', 'Ş'=>'S', 'ğ'=>'g', 'ı'=>'i',
> 'ş'=>'s', 'ü'=>'u', 'ă'=>'a', 'Ă'=>'A', 'ș'=>'s', 'Ș'=>'S', 'ț'=>'t',
> 'Ț'=>'T',
The array keys above should be regular expression patterns like '/É/'.
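One way to keep the readable "table" and still end up with pattern keys is
to wrap each character in delimiters with a loop. This is only a sketch:
$fold and $patterns are hypothetical names, the table is a three-entry
excerpt, and the loop at the end is a stand-alone stand-in for how a
pattern => replacement list is applied in order.

```php
<?php
# Hypothetical plain lookup table (short excerpt of the one quoted above).
$fold = array('É' => 'E', 'é' => 'e', 'ß' => 'ss');

# Turn every key into a regular expression pattern such as '/É/'.
$patterns = array();
foreach ($fold as $char => $ascii) {
  # preg_quote() escapes regex metacharacters; '/' is the delimiter.
  $patterns['/' . preg_quote($char, '/') . '/'] = $ascii;
}

# Apply the patterns one after another to a made-up file name.
$name = 'Été_Straße.pdf';
foreach ($patterns as $pat => $rep) $name = preg_replace($pat, $rep, $name);
echo $name, "\n"; # prints: Ete_Strasse.pdf
```

In config.php these entries would have to come before the
"/[^$UploadNameChars]/" entry, otherwise the accented characters are
stripped before they can be folded.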
> "/[^$UploadNameChars]/" => '',   # strip all not-allowed characters
> '/(\\.[^.]*)$/' => 'cb_tolower', # convert extension to lowercase
> '/^[^[:alnum:]_]+/' => '',       # strip initial spaces, dashes, dots
> '/[^[:alnum:]_]+$/' => '',       # strip trailing spaces, dashes, dots
> '/ +/' => '_');                  # replace space(s) with underscore
>
>
> I did not yet try it in config.php because I am afraid to screw
> something, and prefer to ask before doing some harm.
> Would it make sense to use the above code, please?
>
> Still there might be issues, like with "ё" which could perhaps be
> converted into "e" in some cases and "io" in others; but this is where
The characters ë (Latin e-tréma) and ё (Cyrillic yo) may look similar
in most fonts, but are not at the same code points, so there is no need
to worry about this.
However, in German a vowel with umlaut like "ü" would be folded to "ue",
while in French the same letter (same code point) would be folded to
"u".
It becomes messy as a "vowel with umlaut" (one character) can also be
written as "vowel followed by combining diaeresis" (two characters).
Both are valid. In your patterns above this is not a problem as the
combining diacritics will be removed, but for German the umlaut
information will be lost. :-)
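The two spellings can be compared in a small stand-alone sketch (PHP 7 or
newer for the "\u{...}" escapes). The variable names and the sample file
name are made up, and the strip pattern assumes the plain
a-zA-Z0-9_ character list rather than \w.

```php
<?php
$precomposed = "\u{00FC}";  # "ü" as one code point
$decomposed  = "u\u{0308}"; # "u" followed by combining diaeresis

# The two strings render the same but are different byte sequences.
$same = ($precomposed === $decomposed); # false

# Strip everything outside the allowed ASCII characters.
$strip = '/[^-a-zA-Z0-9_. ]/';
$a = preg_replace($strip, '', "M{$decomposed}ller.txt");
$b = preg_replace($strip, '', "M{$precomposed}ller.txt");
echo $a, "\n"; # prints: Muller.txt (base letter kept, diacritic removed)
echo $b, "\n"; # prints: Mller.txt  (whole letter removed)
```

The second case is the one the 'ü' => 'u' entry in the table takes care of,
provided it runs before the strip pattern.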
> the "table" comes into play: it will be easy to spot specific letters
> and adapt them to one's need (be it mainly transliterating or just
> getting a usable file name somehow), while adding more characters if
> required[3].
> ----
>
> Now the second issue: we cannot map everything. Think Chinese as an
> example.
> Having file names in "random alphabets" is to me a huge problem
> because both some software and some user of my website will end up
> being stuck somewhere. It happened already too many times. Imagine I
> upload this file 王毅与普京会晤_中俄关系稳如泰山_百度搜索.pdf it will be a pain for most
> of the people in the world to handle it.
> I am perhaps too old, but sticking more or less to a basic a-zA-Z1-0
1-0 is invalid, as 0 is before 1 in the character set. You probably mean
to use 0-9.
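A quick stand-alone check of this, as a sketch with a made-up subject
string: PCRE refuses to compile a character class with a backwards range,
so preg_match() emits a warning and returns false.

```php
<?php
# "1-0" runs backwards (0x31 down to 0x30): compilation fails.
# The @ only hides the warning; the return value is still false.
$bad  = @preg_match('/^[a-zA-Z1-0]+$/', 'file2023');

# "0-9" is the valid range.
$good = preg_match('/^[a-zA-Z0-9]+$/', 'file2023');

var_dump($bad);  # bool(false)
var_dump($good); # int(1)
```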
> group of characters is a safer choice, it is more inclusive in some
> way: I am 99% sure that anybody and any system can handle that file if
> renamed in a basic English alphabet plus some numbers.
>
> I am not sure how to solve this, but I guess I would like to tell the
> system that, if there is no specific map set in the wiki for such
> characters (see issue one), then any random generated name would be
> better than the original Chinese (or whatever). At least I can rename
> the files afterwards.
See below for FileZilla and character sets.
A randomly generated file name can be achieved not with
$MakeUploadNamePatterns (since it will also affect existing links to
files, and possibly file listings), but with a custom
$UploadVerifyFunction which can rename the file while it is uploaded.
This is a little more advanced and cannot be a generic function: it has
to be adapted to your specification, your current wiki usage and
workflows. Let me know if you need assistance with it.
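Only the name-generation part can be sketched generically. The helper below
is hypothetical and not part of PmWiki; wiring it into a custom
$UploadVerifyFunction is deliberately left out because that depends on the
wiki. It assumes PHP 7 or newer for random_bytes().

```php
<?php
# Hypothetical helper (not part of PmWiki): build a random ASCII file name
# and keep only a lowercased ASCII extension from the original name.
function RandomUploadName($original, $bytes = 8) {
  $dot = strrpos($original, '.');
  $ext = ($dot === false) ? '' : strtolower(substr($original, $dot + 1));
  $ext = preg_replace('/[^a-z0-9]/', '', $ext); # drop anything unusual
  $base = bin2hex(random_bytes($bytes));        # 2 hex digits per byte
  return ($ext === '') ? $base : "$base.$ext";
}

# Example: 16 random hex digits followed by ".pdf".
echo RandomUploadName('王毅与普京会晤_中俄关系稳如泰山_百度搜索.pdf'), "\n";
```

A random name avoids collisions in practice but says nothing about the
content, so the file would still need to be renamed by hand afterwards.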
> I often happen to upload images by simply dragging them from web pages
> (or other sources) into my wiki, and I do not even know their file
> names.
Do you drag them from a web page and drop them in a DDMU dropzone? I
didn't know this was possible - in fact it still isn't in my browser.
Or do you drop them into FileZilla?
> Quite often I end up with exotic names that are stuck in my
> folders; impossible to rename or delete them, as I get "invalid
> attachment name" or "PmWiki can't process your request, no such
> attachment".
'invalid attachment name' appears if you try to upload a file whose name
matches an entry in $UploadBlacklist (.php, .pl, .cgi, even in the middle
of a file name, as such a file may still be executed by the server).
I cannot find this message: "PmWiki can't process your request, no such
attachment". The first part comes from the Abort() function, but I don't
see "no such attachment".
Is it "?requested file not found" ? This may come from HandleDownload()
when a file name cannot be found.
> Even FileZilla cannot rename or delete them.
FileZilla has an option to change the character set. In the site
manager, when you select a site on the left, on the right there are
tabs, the last one is "Charset".
If you cannot rename or delete files on the server because of invalid
filenames, try changing the character set from Automatic to an 8-bit
encoding like ISO-8859-1, then reconnect.
You can do this at least temporarily in order to delete or rename the
lost files.
> I would like to avoid such issues by making sure that files are
> properly renamed to basic latin before being uploaded.
> But in this case I would not know how to.
When you upload such a file, how do you link to it from the wiki?
Do you ever type [[Attach:自由定制的风格.pdf]] in your page?
Or do you simply have (:attachlist:) or (:thumblist:) / Mini:* that will
list all files?
Modifying $MakeUploadNamePatterns may break some links from your wiki
pages to existing attachments.
> not understand why $UploadNameChars was not left to default by Petko
> in that case.
Because of the '\w' issue in different locales.
Petko