We recently had a problem in our UBB forum, after an upgrade from mysql 5.7 to mysql 8.0, with punctuation marks and emojis being rendered as Windows-1252 characters instead of utf8. e.g “ = left quote = “
I finally got the issue corrected by adding these two lines to my.cnf's mysqld section: character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
(The db was already set to utf8mb4).
That left me with a cleanup job. (about 8.000 instances). So, I wrote a query to clean them up. However, I needed to exclude two forums where we were discussing the issues (because I didn't want to clean those up.)
So, I wrote this query, which did a great job of cleaning up the mess. (This is one query. I had to write several more to clean up post subjects, member signatures, and topic subjects as well.
This is the cleanup query:
Replace all Windows-1252 characters in post bodies with utf8 characters
UPDATE sbf_POSTS a
JOIN sbf_TOPICS t ON a.TOPIC_ID = t.TOPIC_ID
JOIN sbf_FORUMS f ON t.FORUM_ID = f.FORUM_ID
SET a.POST_BODY = REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(REPLACE
(a.POST_BODY, '“' , '“'),
'â€','”'),
'‘','‘'),
'’', '’'),
'—', '–'),
'–', '—'),
'…', '᠁')
WHERE (
a.POST_BODY LIKE '%“%' OR
a.POST_BODY LIKE '%â€%' OR
a.POST_BODY LIKE '%‘%' OR
a.POST_BODY LIKE '%’%' OR
a.POST_BODY LIKE '%—%' OR
a.POST_BODY LIKE '%–%' OR
a.POST_BODY LIKE '%…%'
)
AND f.FORUM_ID != '24'
AND f.FORUM_ID != '57'
Unfortunately, that didn't clean up ALL of them. I found 181 instances that were missed using this simple query.
select POST_BODY from sbf_POSTS where POST_BODY LIKE '%â€%'
My question is, why does the simple query find instances that the complex query does not? I'm assuming it's because of the joins, but I don't understand why those would create this problem.