mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5)

mb_detect_encoding — Detect character encoding

Description

string mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = false ]] )

Detects character encoding in string str.

Parameters

str

The string being detected.

encoding_list

encoding_list is a list of character encodings. The encoding order may be specified as an array or as a comma-separated string.

If encoding_list is omitted, detect_order is used.

strict

strict specifies whether to use strict encoding detection. The default is FALSE.

Return Values

The detected character encoding or FALSE if the encoding cannot be detected from the given string.
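
Because the function returns either a string or FALSE, a caller would normally compare the result with === before using it. A minimal sketch, assuming the mbstring extension is loaded; the helper name describe_encoding and the candidate list are illustrative and not part of the manual:

```php
<?php
// Hypothetical helper: report the detected encoding, or a fallback label
// when mb_detect_encoding() returns false.
function describe_encoding($str)
{
    // Strict mode with an explicit candidate list (array form).
    $encoding = mb_detect_encoding($str, array('ASCII', 'UTF-8'), true);

    // Compare with === so that false is never confused with a string.
    if ($encoding === false) {
        return 'unknown';
    }
    return $encoding;
}

// The single byte 0xFF is valid in neither ASCII nor UTF-8.
echo describe_encoding("\xFF"), "\n";
```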

Examples

Example #1 mb_detect_encoding() example


<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded according to mbstring.language */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list  */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>

Comments

Jan 12

Author: maarten


Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.

To verify UTF-8, use the following:



//

//    utf8 encoding validation developed based on Wikipedia entry at:

//    http://en.wikipedia.org/wiki/UTF-8

//

//    Implemented as a recursive descent parser based on a simple state machine

//    copyright 2005 Maarten Meijer

//

//    This cries out for a C-implementation to be included in PHP core

//

    function valid_1byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0x80) == 0x00;

    }

    

    function valid_2byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xE0) == 0xC0;

    }



    function valid_3byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xF0) == 0xE0;

    }



    function valid_4byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xF8) == 0xF0;

    }

    

    function valid_nextbyte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xC0) == 0x80;

    }

    

    function valid_utf8($string) {

        $len = strlen($string);

        $i = 0;    

        while( $i < $len ) {

            $char = ord(substr($string, $i++, 1));

            if(valid_1byte($char)) {    // continue

                continue;

            } else if(valid_2byte($char)) { // check 1 byte

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } else if(valid_3byte($char)) { // check 2 bytes

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } else if(valid_4byte($char)) { // check 3 bytes

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } // goto next char

        }

        return true; // done

    }



For a drawing of the state machine, see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

2005-01-12 17:55:40

http://php5.kiev.ua/manual/ru/function.mb-detect-encoding.html

Feb 17

Author: php-note-2005 at ryandesign dot com


Much simpler UTF-8-ness checker using a regular expression created by the W3C:



<?php



// Returns true if $string is valid UTF-8 and false otherwise.

function is_utf8($string) {

    

    // From http://w3.org/International/questions/qa-forms-utf-8.html

    return preg_match('%^(?:

          [\x09\x0A\x0D\x20-\x7E]            # ASCII

        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte

        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs

        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte

        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates

        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3

        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15

        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16

    )*$%xs', $string);

    

} // function is_utf8



?>

2005-02-17 09:57:14


Mar 29

Author: Chrigu


If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');



if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
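
The underlying reason is that a check against a single-byte encoding such as ISO-8859-1 rejects very little: bytes that form valid UTF-8 are also valid ISO-8859-1. A small sketch with mb_check_encoding(), assuming mbstring is loaded; the byte strings are illustrative:

```php
<?php
$utf8_e_acute   = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)
$latin1_e_acute = "\xE9";     // "é" encoded as ISO-8859-1 (one byte)

// The UTF-8 bytes pass both checks; read as ISO-8859-1 they are two characters.
var_dump(mb_check_encoding($utf8_e_acute, 'UTF-8'));
var_dump(mb_check_encoding($utf8_e_acute, 'ISO-8859-1'));

// The lone 0xE9 byte is valid ISO-8859-1 but not valid UTF-8.
var_dump(mb_check_encoding($latin1_e_acute, 'UTF-8'));
var_dump(mb_check_encoding($latin1_e_acute, 'ISO-8859-1'));
```

Listing the stricter encoding (UTF-8) first therefore gives it a chance to reject the input before the permissive one accepts it.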

2005-03-29 10:32:23


Jul 27

Author: telemach


Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),



mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')



returns ISO-8859-1, while



mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')



returns UTF-8.



Bottom line: an ending 'é' (and probably other accented characters) misleads mb_detect_encoding.

2005-07-27 21:48:52


Aug 03

Author: chris AT w3style.co DOT uk


Starting from the preg_match() snippet above, I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters so that I could switch the character encoding from ISO-8859-1 to UTF-8.



I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.



<?php



function detectUTF8($string)

{

        return preg_match('%(?:

        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte

        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs

        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte

        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates

        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3

        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15

        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16

        )+%xs', $string);

}



?>

2006-08-03 05:22:16


Sep 04

Author: rl at itfigures dot nl


I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to ISO-8859-1, which works fine. I did have a problem with the subsequent iconv conversion.



The problem is that the iconv conversion to ISO-8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in the 8859-1 charset.



I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:



if(detectUTF8($str)){

  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 

  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);

  $str=str_replace("&euro;","\x80",$str); 

}



If html-output is needed the last line is not necessary (and even unwanted).
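
Byte \x80 is the euro sign in Windows-1252; in ISO-8859-1 proper, 0x80 is a control code. A possible alternative, sketched here on the assumption that the local iconv implementation recognizes the charset name CP1252, is to convert to that charset directly. The function name utf8_to_cp1252 is illustrative:

```php
<?php
// Hypothetical helper: convert UTF-8 text to Windows-1252, where the euro
// sign has its own byte (0x80), so no placeholder substitution is needed.
function utf8_to_cp1252($str)
{
    // iconv() returns false (and raises a notice) when a character cannot
    // be represented in the target charset; the caller should check for that.
    return iconv('UTF-8', 'CP1252', $str);
}

// Print the converted euro sign as hex to make the resulting byte visible.
echo bin2hex(utf8_to_cp1252("\xE2\x82\xAC")), "\n";
```

Whether this fits depends on the consumer of the output actually treating the bytes as Windows-1252.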

2007-09-04 17:00:15


Aug 24

Author: hmdker at gmail dot com


A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.



<?php

function is_utf8($str) {

    $c=0; $b=0;

    $bits=0;

    $len=strlen($str);

    for($i=0; $i<$len; $i++){

        $c=ord($str[$i]);

        if($c >= 128){

            if(($c >= 254)) return false;

            elseif($c >= 252) $bits=6;

            elseif($c >= 248) $bits=5;

            elseif($c >= 240) $bits=4;

            elseif($c >= 224) $bits=3;

            elseif($c >= 192) $bits=2;

            else return false;

            if(($i+$bits) > $len) return false;

            while($bits > 1){

                $i++;

                $b=ord($str[$i]);

                if($b < 128 || $b > 191) return false;

                $bits--;

            }

        }

    }

    return true;

}

?>

2008-08-24 00:58:28


Oct 06

Author: dennis at nikolaenko dot ru


Beware of a bug in detecting Russian encodings:

http://bugs.php.net/bug.php?id=38138

2008-10-06 12:18:05


May 22

Author: nat3738 at gmail dot com


A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work for a string or file without a BOM):



<?php

// Unicode BOM is U+FEFF, but after encoded, it will look like this.

define ('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));

define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));

define ('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));

define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));

define ('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));



function detect_utf_encoding($filename) {



    $text = file_get_contents($filename);

    $first2 = substr($text, 0, 2);

    $first3 = substr($text, 0, 3);

    $first4 = substr($text, 0, 4);

    

    if ($first3 == UTF8_BOM) return 'UTF-8';

    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';

    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';

    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';

    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';

}

?>
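
Reading the whole file just to inspect its first bytes can be avoided. The sketch below works on a byte string instead, so the caller can pass only the first four bytes of a file. The function name detect_bom_encoding and the null return for "no BOM" are choices made for this illustration:

```php
<?php
// Hypothetical helper: map a leading BOM to an encoding name.
// The four-byte BOMs are tested first because the UTF-32LE BOM
// (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
function detect_bom_encoding($bytes)
{
    $boms = array(
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    );
    foreach ($boms as $name => $bom) {
        // strncmp() is binary safe, so NUL bytes in the BOM are fine.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null; // no BOM found
}

// Usage with a file: read at most four bytes instead of the whole content.
// $handle = fopen($filename, 'rb');
// $encoding = detect_bom_encoding((string) fread($handle, 4));
// fclose($handle);
```

Note that UTF-16LE text whose first character is U+0000 would be reported as UTF-32LE; that ambiguity is inherent in BOM sniffing.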

2009-05-22 06:58:04


Feb 18

Author: Gerg Tisza


If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.



<?php

    $str = 'áéóú'; // ISO-8859-1

    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'

    mb_detect_encoding($str, 'UTF-8', true); // false

?>

2011-02-18 05:43:45


Mar 24

Author: bmrkbyet at web dot de


a) if the FUNCTION mb_detect_encoding is not available: 



### mb_detect_encoding ... iconv ###



<?php

// -------------------------------------------



if(!function_exists('mb_detect_encoding')) { 

function mb_detect_encoding($string, $enc=null) { 

    

    static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

    

    foreach ($list as $item) {

        $sample = iconv($item, $item, $string);

        if (md5($sample) == md5($string)) { 

            if ($enc == $item) { return true; }    else { return $item; } 

        }

    }

    return null;

}

}



// -------------------------------------------

?>



b) if the FUNCTION mb_convert_encoding is not available: 



### mb_convert_encoding ... iconv ###



<?php

// -------------------------------------------



if(!function_exists('mb_convert_encoding')) { 

function mb_convert_encoding($string, $target_encoding, $source_encoding) { 

    $string = iconv($source_encoding, $target_encoding, $string); 

    return $string; 

}

}



// -------------------------------------------

?>

2013-03-24 16:04:36


Jun 11

Author: eyecatchup at gmail dot com


Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:



<?php

  if (preg_match("//u", $string)) {

      // $string is valid UTF-8

  }
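
This works because, with the u modifier, PCRE validates the subject as UTF-8 before matching, and preg_match() returns false instead of 1 when that validation fails. A minimal wrapper, under the hypothetical name is_valid_utf8:

```php
<?php
// Hypothetical wrapper around the empty-pattern trick.
function is_valid_utf8($string)
{
    // The empty pattern matches any valid subject, so the result is 1 for
    // valid UTF-8. For invalid UTF-8, preg_match() returns false and
    // preg_last_error() reports PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("\xC3\xA9")); // "é" encoded as UTF-8
var_dump(is_valid_utf8("\xE9"));     // "é" encoded as ISO-8859-1
```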

2013-06-11 13:41:41


Dec 25

Author: emoebel at web dot de


If the function "mb_detect_encoding" does not exist ...



... try: 



<?php 

// ---------------------------------------------------- 

if ( !function_exists('mb_detect_encoding') ) { 



// ---------------------------------------------------------------- 

function mb_detect_encoding ($string, $enc=null, $ret=null) { 

       

        static $enclist = array( 

            'UTF-8', 'ASCII', 

            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 

            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 

            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 

            'Windows-1251', 'Windows-1252', 'Windows-1254', 

            );

        

        $result = false; 

        

        foreach ($enclist as $item) { 

            $sample = iconv($item, $item, $string); 

            if (md5($sample) == md5($string)) { 

                if ($ret === NULL) { $result = $item; } else { $result = true; } 

                break; 

            }

        }

        

    return $result; 

} 

// ---------------------------------------------------------------- 



} 

// ---------------------------------------------------- 

?>



example / usage of: mb_detect_encoding() 



<?php 

// ------------------------------------------------------ 

function str_to_utf8 ($str) { 

    

    if (mb_detect_encoding($str, 'UTF-8', true) === false) { 

    $str = utf8_encode($str); 

    }



    return $str;

}

// ------------------------------------------------------ 

?>



$txtstr = str_to_utf8($txtstr);

2013-12-25 11:29:57


Mar 30

Author: garbage at iglou dot eu


To detect UTF-8, you can use:



if (preg_match('!!u', $str)) { echo 'utf-8'; }



- Norihiori

2017-03-30 16:11:28


Jan 04

Author: lotushzy at gmail dot com


About the function mb_detect_encoding: the page http://php.net/manual/zh/function.mb-detect-encoding.php shows this example:

mb_detect_encoding('áéóú', 'UTF-8', true); // false

But now the result is not false. Can you explain why? Thanks!
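
One likely explanation, offered as an assumption since the comment does not say how the script was saved: a string literal holds whatever bytes the source file contains, so in a file saved as UTF-8 the literal 'áéóú' is already valid UTF-8 and the strict check succeeds. Writing the bytes with escapes removes that dependency:

```php
<?php
$latin1 = "\xE1\xE9\xF3\xFA";                 // "áéóú" as ISO-8859-1 bytes
$utf8   = "\xC3\xA1\xC3\xA9\xC3\xB3\xC3\xBA"; // "áéóú" as UTF-8 bytes

// Strict detection against UTF-8 only: the first call rejects the
// ISO-8859-1 bytes, the second accepts the UTF-8 bytes.
var_dump(mb_detect_encoding($latin1, 'UTF-8', true));
var_dump(mb_detect_encoding($utf8, 'UTF-8', true));
```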

2018-01-04 14:18:56


Mar 28

Author: recentUser at example dot com


In my environment (PHP 7.1.12),

"mb_detect_encoding()" doesn't work

     where "mb_detect_order()" is not set appropriately.



To enable "mb_detect_encoding()" to work in such a case,

     simply put "mb_detect_order('...')"

     before "mb_detect_encoding()" in your script file.



Both 

     "ini_set('mbstring.language', '...');"

     and

     "ini_set('mbstring.detect_order', '...');"

DON'T work in script files for this purpose

whereas setting them in PHP.INI file may work.
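
A minimal sketch of the mb_detect_order() approach, assuming mbstring is loaded; the encoding list is only an example and should be replaced with the encodings actually expected:

```php
<?php
// Set the detection order at runtime. The argument may be an array or a
// comma-separated string; the call returns true on success.
$ok = mb_detect_order(array('UTF-8', 'ISO-8859-1'));

// Called without arguments, mb_detect_order() returns the current list.
print_r(mb_detect_order());

// With no encoding list given, mb_detect_encoding() uses that order.
$sample = "\xC3\xA9t\xC3\xA9"; // "été" as UTF-8 bytes
var_dump(mb_detect_encoding($sample));
```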

2018-03-28 10:17:39


Apr 04

Author: d_maksimov


This was helpful for my exec(...) call, when it returned cp866 or cp1251:



try {

    $line = iconv('CP866', 'CP1251', $line);

} catch(Exception $e) {

}

return iconv('CP1251', 'UTF-8', $line);

2022-04-04 16:48:37


Sep 19

Author: mta59066 at gmail dot com


The documentation is no longer correct for PHP 8.1, and mb_detect_encoding no longer honors the order of encodings. The example outputs given in the documentation are also no longer correct for PHP 8.1. This is somewhat explained here: https://github.com/php/php-src/issues/8279



I understand the previous ambiguity in these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and come up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.



Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.



<?php



function mb_detect_enconding_in_order(string $string, array $encodings): string|false

{

    foreach($encodings as $enc) {

        if (mb_check_encoding($string, $enc)) {

            return $enc;

        }

    }

    return false;

}



?>
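
A usage sketch for such an in-order check. To stay self-contained, the block defines its own copy under the illustrative name detect_in_order, without the union return type so that it is not limited to PHP 8:

```php
<?php
// Illustrative copy of the in-order check: return the first candidate
// encoding for which the whole string is valid, or false if none matches.
function detect_in_order($string, array $encodings)
{
    foreach ($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

// 0xE9 alone is not valid UTF-8, so the second candidate is returned.
var_dump(detect_in_order("\xE9", array('UTF-8', 'ISO-8859-1')));

// Valid UTF-8 bytes match the first candidate.
var_dump(detect_in_order("\xC3\xA9", array('UTF-8', 'ISO-8859-1')));
```

As with the original ordering behavior, permissive single-byte encodings belong at the end of the list, since they accept almost any input.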

2022-09-19 10:47:12


Nov 09

Author: geompse at gmail dot com


Major undocumented breaking change since 8.1.7:

https://3v4l.org/BLjZ3



Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding.

2022-11-09 00:35:17
