mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5)

mb_detect_encoding — Detect character encoding

Описание

string mb_detect_encoding ( string $str [, mixed $encoding_list [, bool $strict ]] )

Detects character encoding in string str .

Список параметров

str

The string being detected.

encoding_list

encoding_list is list of character encoding. Encoding order may be specified by array or comma separated list string.

If encoding_list is omitted, detect_order is used.

strict

strict specifies whether to use the strict encoding detection or not. Default is FALSE.

Возвращаемые значения

The detected character encoding.

Примеры

Пример #1 mb_detect_encoding() example

<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str"auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str"JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list  */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo 
mb_detect_encoding($str$ary);
?>

Смотрите также

Коментарии

Автор:
Sometimes mb_detect_string is not what you need. When using pdflib for example you want to VERIFY the correctness of utf-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8.
To verify utf 8 use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
   
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
   
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
   
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;   
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
2005-01-12 17:55:40
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Last example for verifying UTF-8 has one little bug. If 10xxxxxx byte occurs alone i.e. not in multibyte char, then it is accepted although it is against UTF-8 rules. Make following replacement to repair it.

Replace
         } // goto next char
with
         } else {
           return false; // 10xxxxxx occuring alone
         } // goto next char
2005-01-14 02:27:05
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
   
   
// From http://w3.org/International/questions/qa-forms-utf-8.html
   
return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs'
$string);
   
// function is_utf8

?>
2005-02-17 09:57:14
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
2005-03-29 10:32:23
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
beware : even if you need to distinguish between UTF-8 and ISO-8859-1, and you the following detection order (as chrigu suggests)

mb_detect_encoding('accentu?e' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while 

mb_detect_encoding('accentu?' , 'UTF-8, ISO-8859-1')

returns UTF-8

bottom line : an ending '?' (and probably other accentuated chars) mislead mb_detect_encoding
2005-07-27 21:48:52
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Based upon that snippet below using preg_match() I needed something faster and less specific.  That function works and is brilliant but it scans the entire strings and checks that it conforms to UTF-8.  I wanted something purely to check if a string contains UTF-8 characters so that I could switch character encoding from iso-8859-1 to utf-8.

I modified the pattern to only look for non-ascii multibyte sequences in the UTF-8 range and also to stop once it finds at least one multibytes string.  This is quite a lot faster.

<?php

function detectUTF8($string)
{
        return 
preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
        )+%xs'
$string);
}

?>
2006-08-03 05:22:16
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
from PHPDIG

    function isUTF8($str) {
        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
            return true;
        } else {
            return false;
        }
    }
2006-08-15 03:26:19
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
I used Chris's function "detectUTF8" to detect the need from conversion from utf8 to 8859-1, which works fine. I did have a problem with the following iconv-conversion.

The problem is that the iconv-conversion to 8859-1 (with //TRANSLIT) replaces the euro-sign with EUR, although it is common practice  that \x80 is used as the euro-sign in the 8859-1 charset. 

I could not use 8859-15 since that mangled some other characters, so I added 2 str_replace's:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str); 
}

If html-output is needed the last line is not necessary (and even unwanted).
2007-09-04 17:00:15
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
<?php
/*
*QQ: 290359552
* conver to Utf8 if $str is not equals to 'UTF-8'
*/
function convToUtf8($str)
{
if( 
mb_detect_encoding($str,"UTF-8, ISO-8859-1, GBK")!="UTF-8" )
{

return 
iconv("gbk","utf-8",$str);

}
else
{
return 
$str;
}

}
?>
2008-07-21 01:14:56
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Function to detect UTF-8, when mb_detect_encoding is not available it may be useful.

<?php
function is_utf8($str) {
   
$c=0$b=0;
   
$bits=0;
   
$len=strlen($str);
    for(
$i=0$i<$len$i++){
       
$c=ord($str[$i]);
        if(
$c 128){
            if((
$c >= 254)) return false;
            elseif(
$c >= 252$bits=6;
            elseif(
$c >= 248$bits=5;
            elseif(
$c >= 240$bits=4;
            elseif(
$c >= 224$bits=3;
            elseif(
$c >= 192$bits=2;
            else return 
false;
            if((
$i+$bits) > $len) return false;
            while(
$bits 1){
               
$i++;
               
$b=ord($str[$i]);
                if(
$b 128 || $b 191) return false;
               
$bits--;
            }
        }
    }
    return 
true;
}
?>
2008-08-24 00:58:28
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Beware of bug to detect Russian encodings
http://bugs.php.net/bug.php?id=38138
2008-10-06 12:18:05
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
I seriously underestimated the importance of setlocale...
<?php
$strings 
= array(
   
"mais coisas a pensar sobre diário ou dois!",
   
"plus de choses à penser à journalier ou à deux !",
   
"¡más cosas a pensar en diario o dos!",
   
"più cose da pensare circa giornaliere o due!",
   
"flere ting å tenke på hver dag eller to!",
   
"Další věcí, přemýšlet o každý den nebo dva!",
   
"mehr über Spaß spät schönen",
   
"më vonë gjatë fun bukur",
   
"több mint szórakozás késő csodálatos kenyér"
);

$convert = array();
setlocale(LC_CTYPE'de_DE.UTF-8');
foreach( 
$strings as $string )
       
$convert[] = iconv('UTF-8''ASCII//TRANSLIT//IGNORE'$string);
?>

Produces the following: 

Array
(
    [0] => mais coisas a pensar sobre diario ou dois!
    [1] => plus de choses a penser a journalier ou a deux !
    [2] => ?mas cosas a pensar en diario o dos!
    [3] => piu cose da pensare circa giornaliere o due!
    [4] => flere ting aa tenke paa hver dag eller to!
    [5] => Dalsi veci, premyslet o kazdy den nebo dva!
    [6] => mehr ueber Spass spaet schoenen
    [7] => me vone gjate fun bukur
    [8] => toebb mint szorakozas keso csodalatos kenyer
)

whereas 

<?php
$convert 
= array();
setlocale(LC_CTYPE'nl_NL.UTF-8');
foreach( 
$strings as $string )
       
$convert[] = iconv('UTF-8''ASCII//TRANSLIT//IGNORE'$string);
?>

produces:
Array
(
    [0] => mais coisas a pensar sobre di?rio ou dois!
    [1] => plus de choses ? penser ? journalier ou ? deux !
    [2] => ?m?s cosas a pensar en diario o dos!
    [3] => pi? cose da pensare circa giornaliere o due!
    [4] => flere ting ? tenke p? hver dag eller to!
    [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
    [6] => mehr ?ber Spass sp?t sch?nen
    [7] => m? von? gjat? fun bukur
    [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)

This might be of interest when trying to convert utf-8 strings into ASCII suitable for URL's, and such. this was never obvious for me since I've used locales for us and nl.
2009-03-28 12:33:20
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
Another light way to detect character encoding:
<?php
function detect_encoding($string) { 
  static 
$list = array('utf-8''windows-1251');
 
  foreach (
$list as $item) {
   
$sample iconv($item$item$string);
    if (
md5($sample) == md5($string))
      return 
$item;
  }
  return 
null;
}
?>
2009-03-30 05:16:22
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
A simple way to detect UTF-8/16/32 of file by its BOM (not work with string or file without BOM)

<?php
// Unicode BOM is U+FEFF, but after encoded, it will look like this.
define ('UTF32_BIG_ENDIAN_BOM'   chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM'chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM'   chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM'chr(0xFF) . chr(0xFE));
define ('UTF8_BOM'               chr(0xEF) . chr(0xBB) . chr(0xBF));

function 
detect_utf_encoding($filename) {

   
$text file_get_contents($filename);
   
$first2 substr($text02);
   
$first3 substr($text03);
   
$first4 substr($text03);
   
    if (
$first3 == UTF8_BOM) return 'UTF-8';
    elseif (
$first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif (
$first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif (
$first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif (
$first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
?>
2009-05-22 06:58:04
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
If you try to use mb_detect_encoding to detect whether a string is valid UTF-8, use the strict mode, it is pretty worthless otherwise.

<?php
    $str 
'áéóú'// ISO-8859-1
   
mb_detect_encoding($str'UTF-8'); // 'UTF-8'
   
mb_detect_encoding($str'UTF-8'true); // false
?>
2011-02-18 05:43:45
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
a) if the FUNCTION mb_detect_encoding is not available: 

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_detect_encoding')) { 
function 
mb_detect_encoding($string$enc=null) { 
   
    static 
$list = array('utf-8''iso-8859-1''windows-1251');
   
    foreach (
$list as $item) {
       
$sample iconv($item$item$string);
        if (
md5($sample) == md5($string)) { 
            if (
$enc == $item) { return true; }    else { return $item; } 
        }
    }
    return 
null;
}
}

// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available: 

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_convert_encoding')) { 
function 
mb_convert_encoding($string$target_encoding$source_encoding) { 
   
$string iconv($source_encoding$target_encoding$string); 
    return 
$string
}
}

// -------------------------------------------
?>
2013-03-24 16:04:36
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
 
if (preg_match("//u"$string)) {
     
// $string is valid UTF-8
 
}
2013-06-11 13:41:41
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
// ----------------------------------------------------------- 

if(!function_exists('mb_detect_encoding')) {

function mb_detect_encoding($string, $enc=null, $ret=true) {
    $out=$enc; 
    static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) { $out = ($ret !== false) ? true : $item; } 
        } 
    return $out;
}

}

// -----------------------------------------------------------
2013-10-09 00:17:06
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
if the  function " mb_detect_encoding" does not exist  ... 

... try: 

<?php 
// ---------------------------------------------------- 
if ( !function_exists('mb_detect_encoding') ) { 

// ---------------------------------------------------------------- 
function mb_detect_encoding ($string$enc=null$ret=null) { 
       
        static 
$enclist = array( 
           
'UTF-8''ASCII'
           
'ISO-8859-1''ISO-8859-2''ISO-8859-3''ISO-8859-4''ISO-8859-5'
           
'ISO-8859-6''ISO-8859-7''ISO-8859-8''ISO-8859-9''ISO-8859-10'
           
'ISO-8859-13''ISO-8859-14''ISO-8859-15''ISO-8859-16'
           
'Windows-1251''Windows-1252''Windows-1254'
            );
       
       
$result false
       
        foreach (
$enclist as $item) { 
           
$sample iconv($item$item$string); 
            if (
md5($sample) == md5($string)) { 
                if (
$ret === NULL) { $result $item; } else { $result true; } 
                break; 
            }
        }
       
    return 
$result

// ---------------------------------------------------------------- 


// ---------------------------------------------------- 
?>

example / usage of: mb_detect_encoding() 

<?php 
// ------------------------------------------------------ 
function str_to_utf8 ($str) { 
   
    if (
mb_detect_encoding($str'UTF-8'true) === false) { 
   
$str utf8_encode($str); 
    }

    return 
$str;
}
// ------------------------------------------------------ 
?>

$txtstr = str_to_utf8($txtstr);
2013-12-25 11:29:57
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
<?php
$file 
file_get_contents("somefile.txt");
$encodings implode(','mb_list_encodings());
echo 
mb_detect_encoding($file$encodingstrue);
?>
seems to work
2016-11-06 14:54:22
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
For detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori
2017-03-30 16:11:28
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
About function mb_detect_encoding, the link http://php.net/manual/zh/function.mb-detect-encoding.php , like this:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but now the result is not false, can you give me reason, thanks!
2018-01-04 14:18:56
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
In my environment (PHP 7.1.12),
"mb_detect_encoding()" doesn't work
     where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case,
     simply put "mb_detect_order('...')"
     before "mb_detect_encoding()" in your script file.

Both 
     "ini_set('mbstring.language', '...');"
     and
     "ini_set('mbstring.detect_order', '...');"
DON'T work in script files for this purpose
whereas setting them in PHP.INI file may work.
2018-03-28 10:17:39
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Автор:
It was helpful for my exec(...) call. When it returned cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);
2022-04-04 16:48:37
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
The documentation is no longer correct for php8.1 and mb_detect_encoding no longer supports order of encodings. The example outputs given in the documentation are also no longer correct for php8.1. This is somewhat explained here https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my option 8.1 should have deprecated mb_detect_encoding and mb_detect_order and came up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.

<?php

function mb_detect_enconding_in_order(string $string, array $encodings): string|false
{
    foreach(
$encodings as $enc) {
        if (
mb_check_encoding($string$enc)) {
            return 
$enc;
        }
    }
    return 
false;
}

?>
2022-09-19 10:47:12
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html
Major undocumented breaking change since 8.1.7
https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding
2022-11-09 00:35:17
http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html

    Поддержать сайт на родительском проекте КГБ