title | isChild | anchor |
---|---|---|
Working with UTF-8 | true | php_and_utf8 |
This section was originally written by Alex Cabal over at PHP Best Practices and has been used as the basis for our own UTF-8 advice.
Right now PHP does not support Unicode at a low level. There are ways to ensure that UTF-8 strings are processed correctly, but it's not easy, and it requires digging into almost every level of the web app, from HTML to SQL to PHP. We'll aim for a brief, practical summary.
The basic string operations, like concatenating two strings and assigning strings to variables, don't need anything
special for UTF-8. However, most string functions, like `strpos()` and `strlen()`, do need special consideration. These
functions often have an `mb_*` counterpart: for example, `mb_strpos()` and `mb_strlen()`. These `mb_*` functions are made
available to you via the Multibyte String Extension, and are specifically designed to operate on Unicode strings.
You must use the `mb_*` functions whenever you operate on a Unicode string. For example, if you use `substr()` on a
UTF-8 string, there's a good chance the result will include some garbled half-characters. The correct function to use
would be the multibyte counterpart, `mb_substr()`.
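As a rough illustration of the difference, the sketch below compares the byte-oriented and multibyte functions on a short string. It assumes the file is saved as UTF-8 and the mbstring extension is loaded; the sample word and variable names are only for demonstration.

```php
<?php
// Sample word: 5 characters but 6 bytes, because "ï" is encoded as two bytes in UTF-8
$word = 'naïve';

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

// substr() cuts at a byte offset, so it can split "ï" in half
$byteCut = substr($word, 0, 3);
// mb_substr() cuts at a character offset and keeps "ï" intact
$charCut = mb_substr($word, 0, 3, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte-based cut still valid UTF-8?
var_dump(mb_check_encoding($charCut, 'UTF-8')); // is the character-based cut still valid UTF-8?
```

Here `mb_check_encoding()` is used only to make the damage visible: the byte-based cut ends in half of a character and is no longer valid UTF-8, while the character-based cut is.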
The hard part is remembering to use the `mb_*` functions at all times. If you forget even just once, your Unicode
string has a chance of being garbled during further processing.
Not all string functions have an `mb_*` counterpart. If there isn't one for what you want to do, then you might be out
of luck.
You should use the `mb_internal_encoding()` function at the top of every PHP script you write (or at the top of your
global include script), and the `mb_http_output()` function right after it if your script is outputting to a browser.
Explicitly defining the encoding of your strings in every script will save you a lot of headaches down the road.
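A minimal sketch of what such a preamble might look like, assuming the mbstring extension is loaded. The idea of a shared bootstrap file is only an illustration of the "global include script" mentioned above.

```php
<?php
// Hypothetical bootstrap file, included first by every script
mb_internal_encoding('UTF-8'); // default encoding used by the mb_* functions
mb_http_output('UTF-8');       // encoding mbstring uses for HTTP output

// With the internal encoding set, mb_* calls can omit the explicit encoding argument
$length = mb_strlen('Ünïcödé');
var_dump($length);
```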
Additionally, many PHP functions that operate on strings have an optional parameter letting you specify the character
encoding. You should always explicitly indicate UTF-8 when given the option. For example, `htmlentities()` has an
option for character encoding, and you should always specify UTF-8 if dealing with such strings. Note that as of PHP 5.4.0, UTF-8 is the default encoding for `htmlentities()` and `htmlspecialchars()`.
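For instance, a call with the encoding spelled out could look like the sketch below. It uses `htmlspecialchars()`, which takes the same encoding argument as `htmlentities()`; the sample input and the choice of flags are assumptions for illustration.

```php
<?php
// Sample input containing markup and non-ASCII characters
$comment = '<b>Crème brûlée</b> & "café"';

// Pass the encoding explicitly instead of relying on the default of the running PHP version
$escaped = htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $escaped, PHP_EOL;
// prints: &lt;b&gt;Crème brûlée&lt;/b&gt; &amp; &quot;café&quot;
```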
Finally, if you are building a distributed application and cannot be certain that the `mbstring` extension will be
enabled, then consider using the patchwork/utf8 Composer package. This will use `mbstring` if it is available, and
fall back to non-UTF-8 functions if not.
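The general idea of detecting `mbstring` and falling back can be sketched as follows. This is not the API of the package named above: `utf8_length()` is a hypothetical helper written for this example, and the fallback relies on PCRE's `/u` modifier rather than on any library.

```php
<?php
/**
 * Hypothetical helper, not part of any library: count the characters in a
 * UTF-8 string, using mbstring when present and PCRE's /u mode otherwise.
 */
function utf8_length(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }

    // Fallback: "." with the /s and /u modifiers matches one code point at a time.
    // preg_match_all() returns false when $text is not valid UTF-8.
    $count = preg_match_all('/./su', $text);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $count;
}

echo utf8_length('naïve café'), PHP_EOL;
```

A real application would need the same treatment for every string function it uses, which is why a maintained package is usually the more practical choice.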
If your PHP script accesses MySQL, there's a chance your strings could be stored as non-UTF-8 strings in the database even if you follow all of the precautions above.
To make sure your strings go from PHP to MySQL as UTF-8, make sure your database and tables are all set to the
`utf8mb4` character set and collation, and that you use the `utf8mb4` character set in the PDO connection string. See
example code below. This is critically important.
Note that you must use the `utf8mb4` character set for complete UTF-8 support, not the `utf8` character set! See
Further Reading for why.
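In short, MySQL's `utf8` character set stores at most three bytes per character, while UTF-8 needs four bytes for characters outside the Basic Multilingual Plane, such as most emoji. The sketch below shows the byte lengths involved. `needs_utf8mb4()` is a hypothetical helper for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper: report whether a UTF-8 string contains code points
 * above U+FFFF, which take four bytes in UTF-8.
 */
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$twoBytes   = strlen('é');  // 2
$threeBytes = strlen('€');  // 3
$fourBytes  = strlen('😀'); // 4: too wide for MySQL's three-byte utf8

var_dump(needs_utf8mb4('é €'));       // no four-byte characters here
var_dump(needs_utf8mb4('Hello 😀'));  // contains a four-byte character
```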
Use the `mb_http_output()` function to ensure that your PHP script outputs UTF-8 strings to your browser.
The browser will then need to be told by the HTTP response that this page should be considered as UTF-8. The historic
approach to doing that was to include the charset `<meta>` tag in your page's `<head>` tag. This approach is perfectly
valid, but setting the charset in the `Content-Type` header is actually much faster.
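{% highlight php %}
<?php
// Tell PHP that we're using UTF-8 strings until the end of the script
mb_internal_encoding('UTF-8');

// Tell PHP that we'll be outputting UTF-8 to the browser
mb_http_output('UTF-8');

// Our UTF-8 test string, transformed with a multibyte function
$string = mb_substr('Êl síla erin lû e-govaned vîn.', 0, 15);

// Connect to the database; note `charset=utf8mb4` in the DSN.
// Host, database name and credentials are placeholders.
$link = new PDO(
    'mysql:host=your-hostname;dbname=your-db;charset=utf8mb4',
    'your-username',
    'your-password',
    array(
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_PERSISTENT => false
    )
);

// Store our transformed string as UTF-8 in our database
// Your DB and tables are in the utf8mb4 character set and collation, right?
$handle = $link->prepare('insert into ElvishSentences (Id, Body) values (?, ?)');
$handle->bindValue(1, 1, PDO::PARAM_INT);
$handle->bindValue(2, $string);
$handle->execute();

// Retrieve the string we just stored to prove it was stored correctly
$handle = $link->prepare('select * from ElvishSentences where Id = ?');
$handle->bindValue(1, 1, PDO::PARAM_INT);
$handle->execute();

// Store the result into an object that we'll output later in our HTML
$result = $handle->fetchAll(\PDO::FETCH_OBJ);

header('Content-Type: text/html; charset=UTF-8');
?><!doctype html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>UTF-8 test page</title>
    </head>
    <body>
        <?php
        foreach ($result as $row) {
            print($row->Body); // This should correctly output our transformed UTF-8 string to the browser
        }
        ?>
    </body>
</html>
{% endhighlight %}

- PHP Manual: String Operations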
- PHP Manual: String Functions
- PHP Manual: Multibyte String Functions
- PHP UTF-8 Cheatsheet
- Handling UTF-8 with PHP
- Stack Overflow: What factors make PHP Unicode-incompatible?
- Stack Overflow: Best practices in PHP and MySQL with international strings
- How to support full Unicode in MySQL databases
- Bringing Unicode to PHP with Portable UTF-8
- Stack Overflow: DOMDocument loadHTML does not encode UTF-8 correctly