Previous Thread
Next Thread
Print Thread
Hop To
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
Hi. I've just done a test upgrade of a board from 7.5.8 to 7.5.9, and have noticed that the default encoding has now changed from ISO Latin 1 (ISO-8869-1) to UTF8. This has resulted in the non-7-bit-ASCII characters being corrupted (e.g. "†" is now displayed as "�"). I can understand the change to UTF8 (I think it's a great idea going forwards), but is there any way to convert the database content encoding to UTF8 as part of the upgrade, or a script I can run afterwards?

Joined: Jun 2006
Posts: 16,292
Likes: 116
UBB.threads Developer
UBB.threads Developer
Joined: Jun 2006
Posts: 16,292
Likes: 116
There are several wiki entries about this, namely you can change UTF8 back to the original character set by editing the language files (just use the search in the language editor for "utf" and change the specification).


I am a Web Development Contractor, I do not work for UBBCentral. I have provided free User to User Support since the beginning of these support forums.
Do you need Forum Install or Upgrade Services?
Forums: A Gardeners Forum, Scouters World
UBB.threads: UBBWiki, UBB Styles, UBB.Sitemaps
Longtime Supporter & Resident Post-A-Holic
VNC Web Services: Code Modifications, Upgrades, Styling, Coding Services, Disaster Recovery, and more!
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
Thanks Gizmo. I'll take a look at those today.

Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
I had a few problems during conversion. I could see that the database text encoding had changed (at least, the sample pieces of text I was checking had changed appropriately), but UBB was still showing diamonds.

I eventually tracked that down to the fact that UBB was using the PHP/MySQL client default for character encoding, which was ISO Latin 1. I changed the "connect()" function in the "libs/mysql.inc.php" file so that if it does have a valid database connection, it sets the character set to utf8. That corrected the issue for me, so you may wish to add that change to the codebase to ensure that the character encoding is what is expected rather than relying on the default being correct.

For reference, the change was:
PHP Code

--- mysql.inc.php.bak   2015-01-27 07:05:08.000000000 +0000
+++ mysql.inc.php	   2015-10-12 08:06:30.000000000 +0000
@@ -25,6 +25,8 @@
 
		if( ! $this->dbh ) {
			$this->not_right( "Unable to connect to the database!" );
+		} else {
+			mysql_set_charset( "utf8", $this->dbh);
		}
 
		if( ! mysql_select_db( $config['DATABASE_NAME'], $this->dbh ) ) {
 

Last edited by Pak Chan; 10/12/2015 6:00 AM. Reason: Moved comment to later
Joined: Apr 2004
Posts: 1,945
Likes: 145
UBB.threads Developer
UBB.threads Developer
Joined: Apr 2004
Posts: 1,945
Likes: 145
Originally Posted by Pak Chan
I had a few problems during conversion. I could see that the database text encoding had changed (at least, the sample pieces of text I was checking had changed appropriately), but UBB was still showing diamonds.

Your UBB.threads was still showing black diamonds for older posts, since that is how they were originally stored before you updated your database's collation from latin1 (latin1_swedish_ci) to utf8 (utf8_general_ci)

Have a quick skim through this page - I know its long, but its worth just skimming through it to quickly pickup a concept you may have missed:
http://sqlmag.com/database-performance-tuning/seven-step-process-changing-database-collation

EDIT: Here is a more simple link that speaks "yours and mine language" much better smile
http://stackoverflow.com/questions/1294117/how-to-change-collation-of-database-table-column


The database character set and collation for 758 and 759 remained the same. There were zero database modifications between 758 and 759.

An initial install of UBB.threads uses the character set and its collation that is set within your database software.

So if your database is setup to use "utf8 (UTF-8 Unicode)" Character set, with a "utf8_general_ci" collation as the default, that is what will be used for the initial installation of UBB.threads.

---

The only updates regarding "UTF-8" in UBB.threads 7.5.9 are in the HTML header.

The <meta charset="UTF-8"> is for "display only" settings. It tells the user's browser which character encoding set to use for displaying the HTML page.

Common values:
UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet

In theory, any character encoding can be used, but no browser understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it.

More reading on this can be found at
https://code.google.com/p/doctype-mirror/wiki/MetaCharsetAttribute

UBB.threads was updated to use UTF-8 in version 7.5.9 since it is the most widely used upon today's internet.

In UBBT759, there were exactly two files changed regarding the character set (charset meta tag);
1-languages/generic.php - this change only effect HTML headers.
2-languages/myfeeds.php - this change only effects RSS feed encoding. This declaration should have never been in the RSS feed in the first place. So by updating it from "iso-8859-1" to "utf-8" proved to only point out an long standing error within the template file. This was also hard-coded within 758's templates/default/rss_xml.tpl for some reason, and that made its way in to 759's release code. As of 760, it has been fixed by removing it.

Both of these items overwrote your previous "meta charset" language setting during an automatic 759 language upgrade. Doing that was a programming error on my part. I did not foresee that it would cause an issue, such as page display. In the future, character set encoding within the language files will not be pushed in to the upgrader script.

As relating to UBB.threads 7.6.0, there are no additional updates to the database planned at this time, and I do not foresee any to happen between now and the time it will be released.

Last edited by id242; 10/12/2015 6:10 AM. Reason: to add a more friendly link

Current developer of UBB.threads PHP Forum Software
Current Release: UBBT 7.7.5 // Preview: UBBT 8.0.0
isaac @ id242.com // my forum @ CelicaHobby.com
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
Thanks for the explanation, Isaac.

The post above was for the test system (we have a test version of the live system that we do upgrade and change testing on). The conversion of the live system went ahead relatively fine and didn't require the change the test system had. I might have messed something up on the test system conversion, so I'm going to do that again (from scratch) to see if I did something wrong.

My database was originally built with ISO Latin 1 and collation sequence latin1_general_ci.

I don't have PhpMyAdmin installed, so for reference, my upgrade process was:-
  1. Close the board
  2. Back up existing site (as a ZIP archive) and database (via mysqldump) - this is to allow a rollback
  3. Upgrade UBB as per instructions
  4. Test UBB upgrade (login, control panel, reading/creating/editing posts, etc.)
  5. Back up the database again - this time to convert
    Code
    mysqldump -u <username> -p <password> <database> --add-drop-table > dump1.sql
    
  6. Manually convert the dump by replacing the "latin1" in the following strings with "utf8" in the dump1.sql file:-
    • CHARSET=latin1
    • COLLATE=latin1_general_ci
    • CHARACTER SET latin1
    • COLLATE latin1_general_ci
  7. Restore the database from the converted dump - this recreated the tables with the correct collation sequence, and character encoding
    Code
    mysql -u <username> -p <password> <database> --default-character-set=utf8 < dump1.sql
    
  8. Login to UBB control panel and clear the UBB cache to make sure that all content is rebuilt
  9. Retest UBB (control panel, reading/creating/editing posts, etc.) to make sure that everything is OK
  10. Open the board

The last two strings are for custom tables we've added in order to support some custom islands we have. (If anyone else is going to use this sequence, change your strings according to the character set and collation sequence your database has.)

Joined: Apr 2004
Posts: 1,945
Likes: 145
UBB.threads Developer
UBB.threads Developer
Joined: Apr 2004
Posts: 1,945
Likes: 145
Thanks for the quick reply!

As a side note, I am unaware of the Euro symbol being available within utf8. So forcing utf8 upon everyone at the level in your recommendation, is probably not a good idea.

Though...
I worked on a similar issue about 3 weeks ago, and I believe the way that was worked out was tossing in an "iconv" to the post/sig display fields, so the text is displayed on the fly using PHP 54 without touching the database. This would handle the Euro and other foreign languages. I only spent a few hours of research and havent revisited the idea yet. I'll share the common idea in this topic with you, and for anyone else who might have dropped by to review it

More reading @ https://en.wikipedia.org/wiki/Iconv

My testing page @ http://id242.com/woof/test-utf8.php

PHP SOURCE CODE for my testing page:
Code
<!doctype html>
<html lang="en">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
</head>
<body>
<?php
$text = "This is the Euro symbol &#039;€&#039;.";

echo 'ISO-8859-1 to UTF-8 using iconv (//TRANSLIT, //IGNORE, none qualifiers) on an HTML page with a charset=utf-8 meta tag.<br /><br />';


echo 'Original : ', $text, '<br /><br />';
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text), '<br />';
echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text), '<br />';
echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text), '<br />';
echo "<br />";
echo "& euro; or & #x20ac;  bluntly gets the job done for a str_replace -- &euro; or &#x20ac;<br />";
echo "<br /><br /><br />";


$text2 = "This is some German 'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz'.";

echo 'Original : ', $text2, '<br /><br />';
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text2), '<br />';
echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text2), '<br />';
echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text2), '<br />';
?>

Last edited by id242; 10/12/2015 7:13 AM.

Current developer of UBB.threads PHP Forum Software
Current Release: UBBT 7.7.5 // Preview: UBBT 8.0.0
isaac @ id242.com // my forum @ CelicaHobby.com
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
The euro symbol is in UTF8 (e2 82 ac), but it's not one we're likely to use (though I never know!)

I did try an iconv from window-1252 to utf8, as some of the test data was showing up as that encoding in the database. However, the iconv failed with an "illegal input sequence" error, which prompted me to check the database dump, whereupon I discovered that the "illegal" character sequence been dumped as UTF8! It seems that mysqldump had converted it on output. As a result, I didn't bother with an iconv conversion from ISO Latin 1.

I think that some of the users compose their text in MS Word and paste it into the browser to create their posts (the board is for roleplaying so lots of long story fragments) so the encoding of some of the text might be different to the board encoding. I haven't verified this theory yet.

I'm still checking the board and testing character sequences and encoding, but so far, I've been able to paste in Ogham, Greek, Cyrillic and Old German, so it works for new posts. I'm still going through a sample of pre-existing posts to see if any have encoding errors.

Joined: Apr 2004
Posts: 1,945
Likes: 145
UBB.threads Developer
UBB.threads Developer
Joined: Apr 2004
Posts: 1,945
Likes: 145
I've updated my code a bit on the test page i was working on (also updated within the post above). i need to grab some Zs. 4:30am here on the US west coast.

http://php.net/manual/en/function.iconv.php
As for iconv, I read the php example and noticed their sample was backwards from how we need it, their example of iconv was going from utf8 to iso-8859-1 -- we need to go from iso-8859-1 to utf8.

In our case, we just want to take the $post / $sig strings and convert them to utf8 without black diamonds. And we already know how all the older posts were stored in our database, as "latin1_swedish_ci" / ISO-8869-1.


Also have a quick skim through this obvious solution -- I've also mentioned this option within a URL sanitation thread a few years ago:
http://php.net/manual/en/function.mb-convert-encoding.php

At this point, we just need to take the stored user-generated-content from your database's previous collation and display it on the screen so it looks nice smile

Last edited by id242; 10/12/2015 7:39 AM.

Current developer of UBB.threads PHP Forum Software
Current Release: UBBT 7.7.5 // Preview: UBBT 8.0.0
isaac @ id242.com // my forum @ CelicaHobby.com
Joined: Apr 2004
Posts: 1,945
Likes: 145
UBB.threads Developer
UBB.threads Developer
Joined: Apr 2004
Posts: 1,945
Likes: 145
Code
function to_utf8($text) {
    $map = array(
        chr(0x8A) => chr(0xA9),
        chr(0x8C) => chr(0xA6),
        chr(0x8D) => chr(0xAB),
        chr(0x8E) => chr(0xAE),
        chr(0x8F) => chr(0xAC),
        chr(0x9C) => chr(0xB6),
        chr(0x9D) => chr(0xBB),
        chr(0xA1) => chr(0xB7),
        chr(0xA5) => chr(0xA1),
        chr(0xBC) => chr(0xA5),
        chr(0x9F) => chr(0xBC),
        chr(0xB9) => chr(0xB1),
        chr(0x9A) => chr(0xB9),
        chr(0xBE) => chr(0xB5),
        chr(0x9E) => chr(0xBE),
        chr(0x80) => '&euro;',
        chr(0x82) => '&sbquo;',
        chr(0x84) => '&bdquo;',
        chr(0x85) => '&hellip;',
        chr(0x86) => '&dagger;',
        chr(0x87) => '&Dagger;',
        chr(0x89) => '&permil;',
        chr(0x8B) => '&lsaquo;',
        chr(0x91) => '&lsquo;',
        chr(0x92) => '&rsquo;',
        chr(0x93) => '&ldquo;',
        chr(0x94) => '&rdquo;',
        chr(0x95) => '&bull;',
        chr(0x96) => '&ndash;',
        chr(0x97) => '&mdash;',
        chr(0x99) => '&trade;',
        chr(0x9B) => '&rsquo;',
        chr(0xA6) => '&brvbar;',
        chr(0xA9) => '&copy;',
        chr(0xAB) => '&laquo;',
        chr(0xAE) => '&reg;',
        chr(0xB1) => '&plusmn;',
        chr(0xB5) => '&micro;',
        chr(0xB6) => '&para;',
        chr(0xB7) => '&middot;',
        chr(0xBB) => '&raquo;',
    );
    return html_entity_decode(mb_convert_encoding(strtr($text, $map), 'UTF-8', 'ISO-8859-1'), ENT_QUOTES, 'UTF-8');
}

then calling that function
to_utf8($text);

Seems to have fixed the UTF8 main conversion things.
Rather blunt, but it does work.


Current developer of UBB.threads PHP Forum Software
Current Release: UBBT 7.7.5 // Preview: UBBT 8.0.0
isaac @ id242.com // my forum @ CelicaHobby.com
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
I thought you were going to get some sleep!

Don't worry; it'll wait a little while...I've got most of it sorted now. I did have to make the same change to the mysql.inc.php on the live system as the test one in the end. It was encoding the text in some, but not all, of the posts differently on output, which is weird. E.g. converting the apostrophe (e2 80 99) in the database to a control block (c2 92) on output. Setting the character encoding on the database connection seems to have fixed the randomness in output encoding.

One new post was affected by the upgrade, prior to the change to mysql.inc.php, but it was a user reporting an issue with the encoding and including a sample, so I'm not fixing that post (the database shows that the string seems to have been double-encoded on input, probably utf8-as-latin1 to utf8). The others seem to be OK.

The rest of the text in the database, as far as I can tell, is in utf8 after the export and re-import. I still don't know how it was converted by mysqldump, and that worries me somewhat as I don't know if it analysed the non-7-bit-ASCII characters to work out what the encoding was likely to be and converted it, or whether it just assumed it would be ISO Latin 1 and proceeded accordingly. So far, however, I've spotted no other issues.

If I do, and they're extensive, I'll add your function to the codebase and call it appropriately.

Last edited by Pak Chan; 10/12/2015 11:06 AM. Reason: Added response to function.
Joined: Dec 2008
Posts: 26
P
Newbie
Newbie
P Offline
Joined: Dec 2008
Posts: 26
There's another code change that needs to be made; the mailer currently sends out HTML email encoded in ISO-8859-1, which is the default in libs/phpmailer/class.phpmailer.php. Rather than change that base library, I've changed libs/mailer.inc.php to set the character set explicitly to whatever is defined for display:-
PHP Code
--- mailer.inc.php.bak  2015-01-27 07:05:08.000000000 +0000
+++ mailer.inc.php	  2015-10-13 09:04:27.000000000 +0000
@@ -22,6 +22,7 @@
 
		$this->lang = $config['LANGUAGE'];
		$this->admin = $dotdot;
+		$this->CharSet = $ubbt_lang['CHARSET'];
 
		if (defined('IS_ADMIN')) $dotdot = "../";
 

This seems to correct the problem, and though Gmail doesn't handle the resulting UTF-8 properly (it corrupts the occasional character in long tracts of wide-character text, probably every 1200-1400 bytes or so on a line), both Thunderbird and Outlook seem fine with it.


Link Copied to Clipboard
ShoutChat
Comment Guidelines: Do post respectful and insightful comments. Don't flame, hate, spam.
Recent Topics
spam issues
by ECNet - 03/19/2024 11:45 PM
Looking for a forum
by azr - 03/15/2024 11:26 PM
Editing Links in Post
by Outdoorking - 03/15/2024 9:31 AM
Question on barkrowler and the like
by Mors - 02/29/2024 6:51 PM
Member Permissions Help
by domspeak - 02/27/2024 6:31 PM
Who's Online Now
2 members (Havenofsobriety, rootman), 624 guests, and 106 robots.
Key: Admin, Global Mod, Mod
Random Gallery Image
Latest Gallery Images
Los Angeles
Los Angeles
by isaac, August 6
3D Creations
3D Creations
by JAISP, December 30
Artistic structures
Artistic structures
by isaac, August 29
Stones
Stones
by isaac, August 19
Powered by UBB.threads™ PHP Forum Software 8.0.0
(Preview build 20230217)