Data Encoding: A Guide to UTF-8 for PHP and MySQL
As a MySQL or PHP developer, once you step beyond the comfortable
confines of English-only character sets, you quickly find yourself entangled
in the wonderfully wacky world of UTF-8.
On a previous job, we began running into data encoding issues when displaying bios of artists from all over the world. It soon became apparent that there were problems with the stored data, as sometimes the data was correctly encoded and sometimes it was not.
This led programmers to implement a hodge-podge of patches, sometimes with JavaScript, sometimes with HTML charset meta tags, sometimes with PHP, and so on. Soon, we ended up with a list of 600,000 artist bios with double- or triple-encoded information, with data being stored in different ways depending on who programmed the feature or implemented the patch. A classic technical rat’s nest.
Indeed, navigating through UTF-8 related data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing
these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned (and with thanks, in part, to information discovered here and here along the way).
Specifically, we’ll cover the following in this post:
• Mods you’ll need to make to your php.ini file and PHP code.
• Mods you’ll need to make to your my.ini file and other MySQL-related issues to be aware of (including config mods needed if you’re using Sphinx).
• How to migrate data from a MySQL database previously encoded in latin1 to use UTF-8 encoding instead.
PHP & UTF-8 Encoding – modifications to your php.ini file:
The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set:
default_charset = "utf-8"
(Note: You can subsequently use phpinfo() to verify that this has been set properly.)
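If you would rather check the setting from code than scan the phpinfo() output, a minimal sketch along these lines should do. The helper name is_default_charset_utf8 is made up for this example; ini_get is the standard PHP function for reading configuration values.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the effective
// default_charset is UTF-8, ignoring case.
function is_default_charset_utf8()
{
    // ini_get() returns the current value as a string,
    // or false if the option does not exist.
    $charset = ini_get('default_charset');
    return is_string($charset) && strcasecmp($charset, 'UTF-8') === 0;
}

echo is_default_charset_utf8()
    ? "default_charset is UTF-8\n"
    : "default_charset is NOT UTF-8\n";
```

Running this once from the web server (not just the command line) is worthwhile, since the two can load different php.ini files.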
OK cool, so now PHP and UTF-8 should work just fine together. Right?
Well, not exactly. In fact, not even close.
While this change will ensure that PHP always outputs UTF-8 as the character encoding (in browser response Content-Type headers), you still need to make a
number of modifications to your PHP code to make sure that it properly processes
and generates UTF-8 characters.
PHP & UTF-8 Encoding – modifications to your code:
To be sure that your PHP code plays well in the UTF-8 data encoding sandbox, here are the things you need to do:
• Set UTF-8 as the character set for all headers output by your PHP code
In every PHP output header, specify UTF-8 as the encoding:
header('Content-Type: text/html; charset=utf-8');
• Specify UTF-8 as the encoding type for XML
<?xml version="1.0" encoding="UTF-8"?>
• Strip out unsupported characters from XML
Since not all UTF-8 characters are accepted in an XML document, you’ll need to strip any such characters out from any XML that you generate. A useful function for doing this (which I found here) is the following:
function utf8_for_xml($string) {
    // Replace each run of characters outside the allowed ranges with a single space.
    return preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
        ' ',
        $string
    );
}
Here’s how you can use this function in your code:
$safeString = utf8_for_xml($yourUnsafeString);
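To make the effect concrete, here is a small sketch with made-up sample strings. The function is repeated so the snippet runs on its own. It also shows one thing to watch for: because the pattern uses the /u modifier, preg_replace returns null when the input is not valid UTF-8 in the first place (for example, leftover latin1 bytes), so it is worth checking the result before using it.

```php
<?php
// Repeated from above so this snippet is self-contained.
function utf8_for_xml($string)
{
    return preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
        ' ',
        $string
    );
}

// "\x0B" (vertical tab) and "\x01" are valid UTF-8 bytes,
// but neither character is allowed in an XML document.
$bio = "Björk\x0B\x01was born in Reykjavík.";
$clean = utf8_for_xml($bio);
echo $clean, "\n"; // Björk was born in Reykjavík.

// "\xF6" is a latin1-encoded "ö", which is not valid UTF-8.
// With the /u modifier, preg_replace() returns null for such input.
$broken = "Bj\xF6rk";
var_dump(utf8_for_xml($broken)); // NULL
```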
• Specify UTF-8 as the character set for all HTML content
For HTML content, specify UTF-8 as the encoding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In HTML forms, specify UTF-8 as the encoding:
<form accept-charset="utf-8">
• Specify UTF-8 as the encoding in all calls to htmlspecialchars
e.g.:
htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")
Note: As of PHP 5.6.0, default_charset value is used as the default. From PHP 5.4.0, UTF-8 was the default, but prior to PHP 5.4.0, ISO-8859-1 was used as the default. It’s therefore a good idea to always explicitly specify UTF-8 to be safe, even though this argument is technically optional.
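Also note that, when your pages are served as UTF-8, htmlspecialchars is all you need: it escapes only the characters that are special in HTML. htmlentities additionally converts every character that has a named entity (for example, é becomes &eacute;), which is unnecessary with UTF-8 but renders the same in the browser.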
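As a quick illustration (the sample string is invented for this example), the two calls below produce output that renders identically in a UTF-8 page; they differ only in which characters are turned into entities.

```php
<?php
$name = 'Beyoncé <b>& Jay-Z</b>';

// Only the characters that are special in HTML (&, <, >) are escaped;
// the "é" passes through untouched as UTF-8.
$special = htmlspecialchars($name, ENT_NOQUOTES, 'UTF-8');
echo $special, "\n";
// Beyoncé &lt;b&gt;&amp; Jay-Z&lt;/b&gt;

// htmlentities() also converts "é" to its named entity.
$entities = htmlentities($name, ENT_NOQUOTES, 'UTF-8');
echo $entities, "\n";
// Beyonc&eacute; &lt;b&gt;&amp; Jay-Z&lt;/b&gt;
```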
• Set UTF-8 as the default character set for all MySQL connections
Specify UTF-8 as the default character set to use when exchanging data with the MySQL database using mysql_set_charset:
$link = mysql_connect('localhost', 'user', 'password');
mysql_set_charset('utf8',$link);
Note that, as of PHP 5.5.0, mysql_set_charset is deprecated (the entire mysql extension was removed in PHP 7.0.0), and mysqli::set_charset should be used instead:
$mysqli = new mysqli("localhost", "my_user", "my_password", "test");
/* check connection */
if (mysqli_connect_errno()) {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
}
/* change character set to utf8 */
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}
$mysqli->close();