[pmwiki-users] Issues with non basic latin in uploads file names

kirpi@kirpi.it kirpi.it at gmail.com
Wed Feb 22 16:30:33 PST 2023


I have a twofold problem with upload file names.

First issue.
My install is UTF-8 enabled, and displaying all languages and all
alphabets in pages works fine. UTF-8 file names in uploads, however,
are a pain: once they are on my shared hosting server I can no longer
rename or delete them, not even via FileZilla.
Following some hints[1] and some searching[2], I put together an easy
way to replace accented characters, as well as any other problematic
characters, in upload names. Although it looks verbose, this "table"
seems straightforward to understand, extend and adapt to one's needs.


$UploadNameChars = "-\\w. "; # default: allow dash, letters, digits,
                             # underscore, dots, and spaces
$MakeUploadNamePatterns = array(
    # every key is a regular expression, so each character needs /.../ delimiters
    '/Š/'=>'S', '/š/'=>'s', '/Ž/'=>'Z', '/ž/'=>'z', '/À/'=>'A', '/Á/'=>'A', '/Â/'=>'A',
    '/Ã/'=>'A', '/Ä/'=>'A', '/Å/'=>'A', '/Æ/'=>'AE', '/Ç/'=>'C', '/È/'=>'E', '/É/'=>'E',
    '/Ê/'=>'E', '/Ë/'=>'E', '/Ì/'=>'I', '/Í/'=>'I', '/Î/'=>'I', '/Ï/'=>'I', '/Ñ/'=>'N',
    '/Ò/'=>'O', '/Ó/'=>'O', '/Ô/'=>'O', '/Õ/'=>'O', '/Ö/'=>'O', '/Ø/'=>'O', '/Ù/'=>'U',
    '/Ú/'=>'U', '/Û/'=>'U', '/Ü/'=>'U', '/Ý/'=>'Y', '/Þ/'=>'B', '/ß/'=>'ss', '/à/'=>'a',
    '/á/'=>'a', '/â/'=>'a', '/ã/'=>'a', '/ä/'=>'a', '/å/'=>'a', '/æ/'=>'ae', '/ç/'=>'c',
    '/è/'=>'e', '/é/'=>'e', '/ê/'=>'e', '/ë/'=>'e', '/ì/'=>'i', '/í/'=>'i', '/î/'=>'i',
    '/ï/'=>'i', '/ð/'=>'o', '/ñ/'=>'n', '/ò/'=>'o', '/ó/'=>'o', '/ô/'=>'o', '/õ/'=>'o',
    '/ö/'=>'o', '/ø/'=>'o', '/ù/'=>'u', '/ú/'=>'u', '/û/'=>'u', '/ý/'=>'y', '/þ/'=>'b',
    '/ÿ/'=>'y', '/Ğ/'=>'G', '/İ/'=>'I', '/Ş/'=>'S', '/ğ/'=>'g', '/ı/'=>'i', '/ş/'=>'s',
    '/ü/'=>'u', '/ă/'=>'a', '/Ă/'=>'A', '/ș/'=>'s', '/Ș/'=>'S', '/ț/'=>'t', '/Ț/'=>'T',
    "/[^$UploadNameChars]/" => '',    # strip all not-allowed characters
    '/(\\.[^.]*)$/' => 'cb_tolower',  # convert extension to lowercase
    '/^[^[:alnum:]_]+/' => '',        # strip initial spaces, dashes, dots
    '/[^[:alnum:]_]+$/' => '',        # strip trailing spaces, dashes, dots
    '/ +/' => '_');                   # replace space(s) with underscore


I have not tried it in config.php yet because I am afraid of breaking
something, and I prefer to ask before doing any harm.
Would it make sense to use the above code, please?

There might still be issues, such as with "ё", which could perhaps be
converted into "e" in some cases and "io" in others; but this is where
the "table" comes into play: it will be easy to spot specific letters
and adapt them to one's needs (be it mainly transliterating or just
getting a usable file name somehow), while adding more characters if
required[3].
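
A table like the one above can be tried in a standalone script before
anything goes into config.php. The following is a minimal sketch, not
PmWiki's own code: demo_make_name() is a hypothetical helper that simply
applies each pattern in order, and the closure stands in for PmWiki's
'cb_tolower' callback. It assumes the file is saved as UTF-8 and writes
each key as a delimited regular expression such as '/é/', since
preg_replace() expects delimiters. Only three letters are mapped to keep
it short.

```php
<?php
// Hypothetical helper (not part of PmWiki): apply each
// pattern => replacement pair in order, as a plain regex replace
// or, when the replacement is a closure, as a callback replace.
function demo_make_name(array $patterns, string $name): string {
    foreach ($patterns as $pattern => $replacement) {
        $name = is_string($replacement)
            ? preg_replace($pattern, $replacement, $name)
            : preg_replace_callback($pattern, $replacement, $name);
    }
    return $name;
}

$chars = "-\\w. ";  // dash, letters, digits, underscore, dot, space
$patterns = array(
    '/é/' => 'e', '/Ü/' => 'U', '/ß/' => 'ss',   // sample of the table
    "/[^$chars]/" => '',                         // strip not-allowed characters
    '/(\\.[^.]*)$/' => function ($m) {           // stand-in for 'cb_tolower'
        return strtolower($m[1]);
    },
    '/^[^[:alnum:]_]+/' => '',                   // strip leading spaces, dashes, dots
    '/[^[:alnum:]_]+$/' => '',                   // strip trailing spaces, dashes, dots
    '/ +/' => '_',                               // space(s) to underscore
);

echo demo_make_name($patterns, "Résumé Über Straße.PDF"), "\n";
// prints "Resume_Uber_Strasse.pdf"
```

Order matters here: the letter mappings must come before the
strip pattern, otherwise the accented letters are removed before they
can be transliterated.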

----

Now the second issue: we cannot map everything. Think of Chinese, for example.
Having file names in "random alphabets" is to me a huge problem
because both some software and some users of my website will end up
being stuck somewhere. It has happened too many times already. Imagine I
upload this file: 王毅与普京会晤_中俄关系稳如泰山_百度搜索.pdf. It will be a pain
for most people in the world to handle.
I am perhaps too old, but sticking more or less to a basic a-zA-Z0-9
group of characters is a safer choice, and more inclusive in some
way: I am 99% sure that anybody and any system can handle that file if
it is renamed using the basic English alphabet plus some numbers.

I am not sure how to solve this, but I guess I would like to tell the
system that, if there is no specific map set in the wiki for such
characters (see issue one), then any randomly generated name would be
better than the original Chinese (or whatever). At least I can rename
the files afterwards.
I often happen to upload images by simply dragging them from web pages
(or other sources) into my wiki, and I do not even know their file
names. Quite often I end up with exotic names that are stuck in my
folders; impossible to rename or delete them, as I get "invalid
attachment name" or "PmWiki can't process your request, no such
attachment". Even FileZilla cannot rename or delete them.

I would like to avoid such issues by making sure that files are
properly renamed to basic Latin before being uploaded.
But in this case I do not know how.
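
The fallback described above could look roughly like the sketch below.
It is only an illustration under stated assumptions: the function name
demo_ascii_or_generated() and the "upload_" prefix are hypothetical, a
short md5 hash of the original name is used instead of a truly random
string (so the same original name always yields the same result), and
hooking this into PmWiki's upload handling is not shown.

```php
<?php
// Hypothetical helper (not part of PmWiki): keep the basic Latin part
// of a file name, or fall back to a generated name if nothing usable
// is left after stripping.
function demo_ascii_or_generated(string $original): string {
    // Split on the last dot by hand; this is byte-based and does not
    // depend on the locale.
    $dot  = strrpos($original, '.');
    $base = $dot === false ? $original : substr($original, 0, $dot);
    $ext  = $dot === false ? '' : strtolower(substr($original, $dot + 1));

    // Keep only basic Latin letters, digits, dash and underscore.
    $cleanBase = trim(preg_replace('/[^-A-Za-z0-9_]/', '', $base), '-_');
    $cleanExt  = preg_replace('/[^a-z0-9]/', '', $ext);

    // Nothing usable left: generate a short name from a hash.
    if ($cleanBase === '') {
        $cleanBase = 'upload_' . substr(md5($original), 0, 8);
    }
    return $cleanExt === '' ? $cleanBase : "$cleanBase.$cleanExt";
}

echo demo_ascii_or_generated("photo-1.JPG"), "\n";  // prints "photo-1.jpg"
echo demo_ascii_or_generated("王毅与普京会晤_中俄关系稳如泰山_百度搜索.pdf"), "\n";
// prints "upload_" followed by 8 hex digits and ".pdf"
```

A name that mixes alphabets keeps whatever basic Latin part it has, so
the generated name is used only when the whole base name would
otherwise disappear.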

Thanks!
Luigi

----
[1] https://www.pmwiki.org/wiki/PmWiki/UploadVariables-Talk
[2] https://stackoverflow.com/questions/3371697/replacing-accented-characters-php
[3] Also, the Cyrillic/Latin code from [1] could be included, but I do
not understand why $UploadNameChars was not left at its default by Petko
in that case.


