Originally Posted by SD
PHP Code
$title = preg_replace("#[^A-Za-z0-9]+#", "-", strtolower($title));
$title = preg_replace("#(-){2,}#", "$1", $title);
$title = trim($title, '-');
$title = substr($title, 0, 30);  

seems a shorter way to do it. maybe i missed something.. :shrug:

BTW, the reason I used "mb_convert_case" instead of your chosen "strtolower" is that PHP by default does not know about UTF-8. It treats a string as a sequence of single bytes, so "strtolower" only converts bytes holding the codes of uppercase letters A-Z to the codes of lowercase a-z (plus whatever the current single-byte locale considers uppercase). As non-ASCII letters in UTF-8 are written with two or more bytes, "strtolower" handles each of those bytes separately. At best the letter is simply left in uppercase; at worst, under a single-byte locale, a byte that happens to match one of that locale's uppercase letters is converted, the resulting sequence is broken, and it no longer represents the correct character. This can produce multiple unintended characters, especially for non-English languages.
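
To make the byte-level point concrete, here is a minimal sketch (my own illustration, not code from the linked fix). The letter is written as explicit UTF-8 bytes so the encoding of the source file does not matter, and it assumes the mbstring extension is loaded:

```php
<?php
// "Ä" (U+00C4) is stored in UTF-8 as the two bytes 0xC3 0x84.
$upper = "\xC3\x84";

echo strlen($upper), "\n";             // 2  (strlen counts bytes)
echo mb_strlen($upper, "UTF-8"), "\n"; // 1  (mb_strlen counts characters)
echo bin2hex($upper), "\n";            // c384

// The lowercase letter "ä" (U+00E4) is the bytes 0xC3 0xA4.
// mb_strtolower knows this mapping because it is told the encoding.
$lower = mb_strtolower($upper, "UTF-8");
echo bin2hex($lower), "\n";            // c3a4
```

Neither byte of the uppercase letter is an ASCII A-Z code, which is why a byte-wise function cannot lowercase it correctly.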

To change this, you need to either configure the mbstring extension ( http://www.php.net/manual/en/book.mbstring.php ) so that "strtolower" is replaced with "mb_strtolower" (the mbstring.func_overload setting), or call "mb_convert_case" directly, as it was originally written for you in my recommended fix: mb_convert_case($title, MB_CASE_LOWER, "UTF-8")
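
As a rough sketch of the direct call (the wrapper name "lowercase_title" is hypothetical and only here for illustration, it is not part of UBB.threads), assuming the mbstring extension is available and the title is already UTF-8:

```php
<?php
// Hypothetical helper, for illustration only: lowercase a UTF-8 title
// with mb_convert_case so multibyte letters are handled as characters.
function lowercase_title(string $title): string
{
    return mb_convert_case($title, MB_CASE_LOWER, "UTF-8");
}

// "ÄPFEL & ÖL", written as explicit UTF-8 bytes.
$title = "\xC3\x84PFEL & \xC3\x96L";

echo lowercase_title($title), "\n"; // äpfel & öl
```

Passing the encoding explicitly avoids depending on whatever the server's default internal encoding happens to be.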

Some further reading on this at:
http://www.daniweb.com/web-development/php/threads/342307/utf-8-encoding-issues-with-strtolower

Because this is directly related to URLs, I always recommend encoding them in UTF-8. From the Wikipedia page on percent-encoding ( http://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI ):

Quote
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Because other ways of encoding URLs were accepted in the past, browsers seem to attempt several methods when decoding a URI, but if you're the one doing the encoding you should use UTF-8. UTF-8 should also be used because it is the encoding that the newer IRI standard (RFC 3987), which extends URIs to non-ASCII characters, uses when converting an IRI to a URI.
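
Here is a minimal sketch of "convert to UTF-8 bytes, then percent-encode" in PHP. The helper name "encode_segment" and its parameters are my own assumptions for illustration, and it assumes mbstring is loaded:

```php
<?php
// Hypothetical helper, for illustration only: percent-encode one URL
// path segment after making sure its bytes are UTF-8.
function encode_segment(string $segment, string $sourceEncoding = "UTF-8"): string
{
    // Convert from the source encoding to UTF-8 first.
    $utf8 = mb_convert_encoding($segment, "UTF-8", $sourceEncoding);

    // rawurlencode leaves unreserved characters (letters, digits, - _ . ~)
    // alone and percent-encodes every other byte.
    return rawurlencode($utf8);
}

// "café" already in UTF-8 (é = 0xC3 0xA9).
echo encode_segment("caf\xC3\xA9"), "\n";           // caf%C3%A9

// "café" in ISO-8859-1 (é = 0xE9) ends up as the same URL text.
echo encode_segment("caf\xE9", "ISO-8859-1"), "\n"; // caf%C3%A9
```

Both inputs produce the same percent-encoded text, which is the point of settling on one encoding before building the URL.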

---

The original purpose of this post was to 1) submit a bug report, 2) publish examples of what this bug affects, and 3) submit a solution/fix for it.

My exact suggested fix (linked in the OP) resolves the following items:
-Keep the URLs spider-friendly
-Standardize across UBBT releases
-Stay compatible with 7.5.7 and prior, to avoid "duplicate content" flags
-Allow URLs copied and pasted from UBBT to be parsed correctly by UBBT and other internet software
-Resolve IE cookie issues
-Not break anything else


Current developer of UBB.threads PHP Forum Software
Current Release: UBBT 7.7.5 // Preview: UBBT 8.0.0
isaac @ id242.com // my forum @ CelicaHobby.com