Translate latin1 encoding to UTF8

Shity but useful script to translate a the database encoding for a field in a safe way.

Modify the script.php file and replace the next params.

<DB_HOST>
<DB_USER>
<DB_PASS>
<DB_SCHEMA>
<FIELD_TO_TRANSLATE>
<TABLE_TO_TRANSLATE>

This script helps you in the step 3 of the next toolbox for sleuthing character encoding problems.

A toolbox for sleuthing character encoding problems:

1 - Use MySQL CHAR_LENGTH to find rows with multi-byte characters:

SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);

2 - Use MySQL HEX and PHP bin2hex

SELECT name, HEX(name) FROM clients;

Get the result back into PHP, and run a bin2hex on the string, compare it to MySQL's hex of that same string

3 - See it in both encodings

$db->query("SET NAMES latin1");
$db->query("SELECT name, HEX(name) FROM clients");
(compare the string and its hex result from MySQL with the bin2hex from PHP)
$db->query("SET NAMES utf8");
$db->query("SELECT name, HEX(name) FROM clients");

For all those strings that looked perfect in LATIN1 encoding, here's how I would fix them in the database:

$db->query("SET NAMES latin1");
$db->query("SELECT id, name FROM clients");
$hex = bin2hex($x['name']);
$db->query("SET NAMES utf8");
$db->query("UPDATE clients SET name=UNHEX($hex) WHERE id=$id")

That seemed to work, for most things. If you have some encoding in latin1 and some in other encodings, continue reading.

4 - Use a HEX/UNHEX replace for the unfixable charactgers

Imagine, after all that fixing, you found strings like this:

Let~!@s say ^|%What a nice house you~!@ve got here, don~!@t you think?^!%.

Who knows when or how this happened, but obviously ~!@ is meant to be an apostrophe, ^|% an open-quote, and ^!% a closing-quote.

I'd use MySQL SUBSTRING to find the 3 characters that needed replacing:

SELECT SUBSTRING(quote, 353, 3) FROM table WHERE id=1;

Once narrowing it down to the exact string, add a HEX() around it:

SELECT HEX(SUBSTRING(quote, 353, 3)) FROM table WHERE id=1;

... which would give you a result like C8035EF6BB92BF2

Then use that with MySQL REGEXP to find and replace all occurences in your database!

UPDATE table SET field = REPLACE(field, UNHEX('C8035EF6BB92BF2'), "'") WHERE field REGEXP UNHEX('C8035EF6BB92BF2');

I set up some PHP arrays of all my tables, and all their text fields, to run this same query on everything in my database.

Then do it again for curly-quotes and other weirnesses.

A few times, I had no idea what a character was supposed to be (like the Icelandic and Gaelic ones) - so I had to go visit the artist's website, and find their song titles or bio information spelled correctly there.

5 - Validate UTF8

Got the is_utf8 function from PHP docs to validate all the values in the database. Doing this found a bunch of invisible problems, which only through hours of MySQL SUBSTRING and HEX revealed that there were invisible characters with HEX values of 00-19 scattered around my text fields. I used the same solution as above to replace them:

UPDATE table SET field = REPLACE(field, UNHEX('05'), '') WHERE field REGEXP UNHEX('05');

I looped this inside an array of all hex values under 20.

6 - Converting HTML entities

Find HTML entities hidden in the database:

SELECT field FROM table WHERE field REGEXP '&#[0-9]*;'

Use the utf8_chr function from the comments of the PHP html_entity_decode page. Use PHP preg_match_all to find the entities inside the string, and replace them:

function myreplace($string) {
    preg_match_all('/&#(\d*)/', $string, $matches);
    foreach($matches[1] as $num) {
        $string = str_replace("&#$num;", utf8_chr($num), $string);
    }

    return $string;
}

Update the database with the returned result.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
composer.json		composer.json
composer.lock		composer.lock
script.php		script.php

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Translate latin1 encoding to UTF8

A toolbox for sleuthing character encoding problems:

About

Releases

Packages

Languages

manelpm10/db-latin-to-utf8

Folders and files

Latest commit

History

Repository files navigation

Translate latin1 encoding to UTF8

A toolbox for sleuthing character encoding problems:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages