I'm attempting to remove accents from characters in a PHP string as the first step in making the string usable in a URL.
I'm using the following code:
$input = "Fóø Bår";
setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
print($output);
The output I would expect would be something like this:
F'oo Bar
However, instead of the accented characters being transliterated they are replaced with question marks:
F?? B?r
Everything I can find online indicates that setting the locale will fix this problem, but I'm already doing that. I've checked the following details:
- The locale I am setting is supported by the server (included in the list produced by locale -a)
- The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (included in the list produced by iconv -l)
- The input string is UTF-8 encoded (verified using PHP's mb_check_encoding function, as suggested in the answer by mercator)
- The call to setlocale is successful (it returns 'en_US.utf8' rather than FALSE)
The cause of the problem:
The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.
Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.
– PHP manual's introduction to iconv
Details about the iconv implementation that PHP uses are included in the output of the phpinfo function.
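If digging through the phpinfo output is inconvenient, the iconv extension also exposes the same information as constants. A minimal sketch, assuming the iconv extension is loaded (the constants are not defined otherwise):

```php
<?php
// ICONV_IMPL names the implementation PHP was built against,
// typically "glibc" or "libiconv"; ICONV_VERSION is its version string.
echo 'iconv implementation: ', ICONV_IMPL, PHP_EOL;
echo 'iconv version: ', ICONV_VERSION, PHP_EOL;
```

If the first line reports glibc, that matches the situation described above.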
(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)
Source: Tips4all
I think the problem here is that your encodings consider ä and å different symbols to 'a'. In fact, the PHP documentation for strtr offers a sample for removing accents the ugly way :(
http://ie2.php.net/strtr
You could use urlencode. It doesn't quite do what you want (remove accents), but it will give you a URL-usable string:
$output = urlencode($input);
In Perl I could use a translate regex, but I cannot think of the PHP equivalent
$input =~ tr/áâàå/aaaa/;
etc...
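You could do this using preg_replace:
// The /u modifier makes PCRE treat the pattern and the input as UTF-8.
// Inside [...] every character is an alternative already, so no | separators are needed.
$patterns[0] = '/[áâàåä]/u';
$patterns[1] = '/[ðéêèë]/u';
$patterns[2] = '/[íîìï]/u';
$patterns[3] = '/[óôòøõö]/u';
$patterns[4] = '/[úûùü]/u';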
$patterns[5] = '/æ/';
$patterns[6] = '/ç/';
$patterns[7] = '/ß/';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';
$output = preg_replace($patterns, $replacements, $input);
(Please note this was typed from a foggy, beer-ridden Friday afternoon memory, so it may not be 100% correct.)
or you could make a hash table and do a replacement based off of that.
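For what it's worth, here is a rough sketch of the hash table idea using the array form of strtr. The map is only an illustrative handful of characters, not a complete list, and it assumes both this source file and the input are UTF-8:

```php
<?php
// Illustrative map only; extend it with whatever characters you expect.
$map = [
    'á' => 'a', 'å' => 'a', 'ó' => 'o', 'ø' => 'o',
    'æ' => 'ae', 'ß' => 'ss', 'ç' => 'c',
];

// With an array, strtr replaces whole substrings (longest key first),
// so multi-byte UTF-8 keys and multi-letter replacements both work.
$output = strtr('Fóø Bår', $map);
echo $output, PHP_EOL; // Foo Bar
```

The nice part of a map is that one-to-many replacements such as æ => ae need no special handling.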
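This is code I found and use often:
function stripAccents($string){
// The three-argument form of strtr works byte by byte and corrupts UTF-8,
// so split both lists into characters and use the array form instead.
$from = preg_split('//u', 'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ', -1, PREG_SPLIT_NO_EMPTY);
$to = str_split('aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
return strtr($string, array_combine($from, $to));
}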
I agree with georgebrock's comment.
If you find a way to get //TRANSLIT to work, you can build friendly URLs:
1. use iconv with //TRANSLIT (ñ => n~)
2. remove non-alphanumeric, non-whitespace chars inside words: $url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
3. replace remaining separations: $url = preg_replace( '/[^a-z0-9]+/', '-', $url );
4. remove double/leading/trailing '-', e.g. $url = preg_replace( '/(?:(^|\-)\-+|\-$)/', '$1', $url );
If you can't get it to work, replace step 1 with strtr/character-based replacement, like Xetius' solution.
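Put together, the clean-up steps after transliteration could look roughly like this. It's a sketch, not a drop-in: slug_from_ascii is a made-up name, the input is assumed to be ASCII already (from iconv or a character map), the strtolower call is my own addition because the [^a-z0-9] pattern would otherwise turn capitals into hyphens, and trim stands in for the last regex since runs of separators are already collapsed by then:

```php
<?php
// Hypothetical helper: expects text that has already been reduced to ASCII.
function slug_from_ascii(string $text): string
{
    // Assumption: lowercase first so the a-z class below keeps every letter.
    $url = strtolower($text);
    // Drop a single punctuation character sitting inside a word (o'o => oo).
    $url = preg_replace('/(\w)[^\w\s](\w)/', '$1$2', $url);
    // Collapse every run of other characters into one hyphen.
    $url = preg_replace('/[^a-z0-9]+/', '-', $url);
    // Remove leading and trailing hyphens.
    return trim($url, '-');
}

echo slug_from_ascii("Fo'o Bar!"), PHP_EOL; // foo-bar
```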
I can't reproduce your problem. I get the expected result.
How exactly are you using mb_detect_encoding() to verify your string is in fact UTF-8?
If I simply call mb_detect_encoding($input) on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.
iconv() gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).
I suggest you try calling mb_check_encoding($input, "utf-8") first to verify that your string really is UTF-8. I think it probably isn't.
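To show what that check does, here is a small sketch with the sample string written out as raw bytes in both encodings, so the result doesn't depend on how the file itself is saved. It assumes the mbstring extension is available:

```php
<?php
// "Fóø Bår" as UTF-8 bytes and as ISO-8859-1 bytes.
$utf8   = "F\xC3\xB3\xC3\xB8 B\xC3\xA5r";
$latin1 = "F\xF3\xF8 B\xE5r";

// mb_check_encoding validates the byte sequence instead of guessing.
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```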
When using iconv, the locale must be set:
function test_enc($text = 'ěščřžýáíé ĚŠČŘŽÝÁÍÉ fóø bår FÓØ BÅR æ')
{
echo '<tt>';
echo iconv('utf8', 'ascii//TRANSLIT', $text);
echo '</tt><br/>';
}
test_enc();
setlocale(LC_ALL, 'cs_CZ.utf8');
test_enc();
setlocale(LC_ALL, 'en_US.utf8');
test_enc();
This yields:
????????? ????????? f?? b?r F?? B?R ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
I don't have any locales other than cs_CZ and en_US installed, so I can't test others.
In C# I have seen a solution that converts the string to a Unicode normalized form: the accents are split off from their base letters and then filtered out by the nonspacing-mark Unicode category.
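The same idea can probably be done in PHP too. This is only a sketch and it assumes the intl extension is installed (that is where the Normalizer class comes from); strip_combining_marks is a name I made up:

```php
<?php
// Sketch only: requires the intl extension, which provides Normalizer.
function strip_combining_marks(string $text): string
{
    // NFD splits "é" into "e" followed by a combining acute accent (U+0301).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    // \p{Mn} matches nonspacing marks; /u switches PCRE to UTF-8 mode.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    echo strip_combining_marks('Fóø Bår'), PHP_EOL; // Foø Bar
}
```

Note that ø survives: it is a letter in its own right with no decomposition, so characters like ø, æ and ß would still need a replacement table.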
One of the tricks I stumbled upon on the web was using htmlentities and then stripping the encoded characters:
$stripped = preg_replace('`&[^;]+;`','',htmlentities($string));
It's not perfect, but it works well in some cases.
But you're writing about creating a URL string, so urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use this last function: http_build_query.
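A quick sketch of what those two functions produce for the sample string. The string is written as raw UTF-8 bytes so the output doesn't depend on the file encoding, and the parameter name q is just an example:

```php
<?php
// "Fóø Bår" as UTF-8 bytes.
$input = "F\xC3\xB3\xC3\xB8 B\xC3\xA5r";

// urlencode percent-encodes each non-ASCII byte and turns spaces into "+".
echo urlencode($input), PHP_EOL; // F%C3%B3%C3%B8+B%C3%A5r

// http_build_query encodes a whole array of parameters the same way.
echo http_build_query(['q' => $input]), PHP_EOL; // q=F%C3%B3%C3%B8+B%C3%A5r
```

The accents are preserved rather than removed, and urldecode gets the original string back.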
You can use this class for removing unwanted characters, but it still does not solve your problem.