DOMDocument::loadHTML

(PHP 5, PHP 7)

DOMDocument::loadHTML - Load HTML from a string

Description

public bool DOMDocument::loadHTML ( string $source [, int $options = 0 ] )

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object. The static invocation may be used when no DOMDocument properties need to be set prior to loading.

Parameters

source: The HTML string.
options: Since PHP 5.4.0 and Libxml 2.6.0, the options parameter may also be used to pass additional Libxml parameters.
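
As a rough illustration of the options parameter, the sketch below passes two libxml constants, LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, so that a fragment is loaded without the implied html/body wrapper and without a default doctype. The fragment and variable names are made up for this example, and both constants assume a reasonably recent libxml build.

```php
<?php
// Hypothetical fragment used only for this illustration.
$fragment = '<p>Hello <b>world</b></p>';

$doc = new DOMDocument();
// NOIMPLIED: do not add implied <html>/<body> elements.
// NODEFDTD: do not add a default doctype when none is present.
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$out = $doc->saveHTML();
echo $out;
```

Several options are combined with the bitwise OR operator, as shown; passing 0 (the default) keeps the standard behaviour.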

Return Values

Returns TRUE on success or FALSE on failure. If called statically, returns a DOMDocument object or FALSE on failure.

Errors

If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml, so it cannot be handled with libxml's error handling functions.

This method may be called statically, but doing so will raise an E_STRICT error.

Although malformed HTML usually loads successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.

Examples

Example #1 Creating a document


<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>

Changelog

Version	Description
5.4.0	The `options` parameter was added.

See Also

DOMDocument::loadHTMLFile() - Load HTML from a file
DOMDocument::saveHTML() - Dumps the internal document into a string using HTML formatting
DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting

Comments

Apr 26

Author: bigtree at DONTSPAM dot 29a dot nl


Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:



<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

</head>



If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.
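
A minimal sketch of the point above, assuming a UTF-8 source file and a made-up sample page: with the Content-Type meta tag placed first in the head, the text read back through the DOM should match the input.

```php
<?php
// Hypothetical UTF-8 page with the charset declared at the start of <head>.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Tiếng Việt</title></head>'
      . '<body><p>Cạnh tranh</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// DOM properties return UTF-8 strings, so the text should round-trip unchanged.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```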

2005-04-26 05:15:39

http://php5.kiev.ua/manual/ru/domdocument.loadhtml.html

Feb 15

Author: romain dot lalaut at laposte dot net


Note that the elements of such a document will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">.
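
One way to check this, sketched with a made-up sample document: read namespaceURI on a parsed element and run an unprefixed XPath query against it.

```php
<?php
$doc = new DOMDocument();
// Sample markup carrying the XHTML namespace declaration.
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Test</p></body></html>');

// Elements created by the HTML parser are expected to have no namespace.
$p = $doc->getElementsByTagName('p')->item(0);
var_dump($p->namespaceURI);

// As a consequence, XPath can match the element without any prefix.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//p')->length;
echo $count, "\n";
```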

2007-02-15 10:31:30

Apr 26

Author: hanhvansu at yahoo dot com


When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, where you expect "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest using mb_convert_encoding() before loading a UTF-8 page:

<?php
    $pageDom = new DomDocument();
    // Convert non-ASCII characters to HTML entities so the parser cannot misread them
    $searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
    // @ suppresses the warnings raised for imperfect markup
    @$pageDom->loadHTML($searchPage);
?>
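
On current PHP versions the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated (as of PHP 8.2), so a variation of the same idea is sketched below with mb_encode_numericentity(). The conversion map and the sample markup are assumptions chosen for this illustration. The approach is unchanged: every non-ASCII character becomes a numeric entity before loadHTML() sees it.

```php
<?php
// Sample UTF-8 input; in practice this would be the fetched page.
$htmlUTF8Page = '<html><body><p>Cạnh tranh</p></body></html>';

// Map: encode every code point from U+0080 to U+10FFFF, no offset, full mask.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$searchPage = mb_encode_numericentity($htmlUTF8Page, $map, 'UTF-8');

$pageDom = new DOMDocument();
$pageDom->loadHTML($searchPage);

// The entities are decoded by the parser, so the DOM returns the original text.
$text = $pageDom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```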

2007-04-26 23:50:54

Oct 04

Author: xuanbn at yahoo dot com


If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), you may get garbage text for some files while others load fine, even when the HTML already has a meta charset declaration like

  <meta http-equiv="content-type" content="text/html; charset=utf-8">

I have discovered that, for loadHTML() to process a UTF-8 file correctly, the meta tag should come first, before any UTF-8 string appears. For example, this HTML file

<html>
 <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title> Vietnamese - Tiếng Việt</title>
  </head>
<body></body>
</html>

will be OK with loadHTML() because the <meta> tag appears before the <title> tag.

But the file below will not be recognized by loadHTML(), because the <title> tag containing a UTF-8 string appears before the <meta> tag.

<html>
 <head>
    <title> Vietnamese - Tiếng Việt</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
<body></body>
</html>

2007-10-04 04:38:12

Oct 20

Author: jamesedwardcooke+php at gmail dot com


Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overridden.

I assumed it was possible to set it and then load HTML with the doctype I defined (in order to decide the doctype at runtime), and ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
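
The behaviour can be observed with a sketch along these lines. The XHTML Strict identifiers and the sample markup are arbitrary choices for the illustration, not part of the original report.

```php
<?php
// Build a document with an explicit doctype through DOMImplementation.
$impl = new DOMImplementation();
$dtd  = $impl->createDocumentType(
    'html',
    '-//W3C//DTD XHTML 1.0 Strict//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'
);
$doc = $impl->createDocument(null, '', $dtd);
$before = $doc->doctype->publicId;

// Loading HTML replaces the document content, including its doctype.
$doc->loadHTML('<!DOCTYPE html><html><body><p>Test</p></body></html>');
$after = $doc->doctype !== null ? $doc->doctype->publicId : null;

var_dump($before, $after);
```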

2008-10-20 02:37:08

Feb 11

Author: Errol


It should be noted that when any text is provided within the body tag outside of a containing element, the DOMDocument will encapsulate that text into a paragraph tag (<p>).

For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>

while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>

2009-02-11 10:05:13

Jun 14

Author: piopier


Here is a function I wrote to capitalize on the previous remarks about charset problems (UTF-8...) when using loadHTML and then DOM functions.
It adds the charset meta tag just after <head> to improve automatic encoding detection and converts any special character to an HTML entity, so that PHP DOM functions/attributes return correct values.

<?php
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");

function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    // Nothing to parse if the fetch failed or returned an empty body
    if (empty($content)) {
        return FALSE;
    }
    // Guess the encoding unless the caller supplied one
    if (empty($encod)) {
        $encod = mb_detect_encoding($content);
    }
    // Locate <head> (lower or upper case) and insert the charset meta tag right after it
    $headpos = mb_strpos($content, '<head>');
    if (FALSE === $headpos) {
        $headpos = mb_strpos($content, '<HEAD>');
    }
    if (FALSE !== $headpos) {
        $headpos += 6;
        $content = mb_substr($content, 0, $headpos) . '<meta http-equiv="Content-Type" content="text/html; charset=' . $encod . '">' . mb_substr($content, $headpos);
    }
    // Turn every non-ASCII character into an HTML entity
    $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);

    $dom = new DomDocument;
    $res = $dom->loadHTML($content);
    if (!$res) return FALSE;
    return $dom;
}
?>

NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seemed more efficient with huge HTML pages.

2009-06-14 11:29:19

Dec 21

Author: mdmitry at gmail dot com


You can also load HTML as UTF-8 using this simple hack:

<?php
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper
?>

2009-12-21 11:02:31

Jan 04

Author: Shane Harter


DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

This isn't well documented here. The solution is to implement a separate apparatus for dealing with just these errors.

Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler, and you can then get at them (if you desire) using the other libxml error functions.

You can find more info in the libxml reference (ref.libxml).
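
The steps described above might look like the following sketch. The sample markup is made up, and whether the parser reports anything for it can depend on the libxml version, so the loop may print nothing.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML('<html><body><p>Unclosed paragraph<div>Text</body></html>');

// Inspect whatever the parser reported (possibly nothing).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the change local, which matters when other code in the same request relies on the default warning behaviour.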

2010-01-04 10:42:18

Apr 10

Author: Alex


Beware of the "gotcha" (works as designed but not as expected): if you use loadHTML, you cannot validate the document. Validation is only for XML. Details here: http://bugs.php.net/bug.php?id=43771&edit=1

2010-04-10 11:45:01

Dec 16

Author: cake at brothercake dot com


Be aware that this function doesn't actually understand HTML -- it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.



For example, with input like this where the first element isn't closed: 



    <span>hello <div>world</div>



loadHTML will change it to this, which is well-formed but invalid:



    <span>hello <div>world</div></span>
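
A small sketch to reproduce this, using the same input as above. It only checks that the div ends up nested inside the span after parsing; the variable names are illustrative.

```php
<?php
libxml_use_internal_errors(true); // keep any parser warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML('<span>hello <div>world</div>');

// Count the <div> elements that ended up inside the repaired <span>.
$span = $doc->getElementsByTagName('span')->item(0);
$nestedDivs = $span->getElementsByTagName('div')->length;
echo $nestedDivs, "\n";

libxml_clear_errors();
```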

2012-12-16 19:57:46

Oct 05

Author: finkenb2 at mail dot lib dot msu dot edu


Warning:  This does not function well with HTML5 elements such as SVG.  Most of the advice on the Web is to turn off errors in order to have it work with HTML5.

2015-10-05 19:03:24

Feb 13

Author: fr at felix-riesterer dot de


Remember: If you use an HTML5 doctype and a meta element like so



<meta charset="utf-8">



your HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities. However the HTML4-like version will work (as has been pointed out 10 years ago by "bigtree at 29a"):



<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

2016-02-13 03:43:57

Dec 24

Author: kerim-yagmurcu at gmx dot de


For those of you who want to get elements by class from an external URL, I have 2 useful functions. In this example we get the '<h3 class="r">' elements (search result headers) back from a Google search:

1. Check the URL (whether it exists and is reachable)
<?php
# URL check: true when the first response header reports a 2xx status
function url_check($url) {
    $headers = @get_headers($url);
    return is_array($headers) ? preg_match('/^HTTP\\/\\d+\\.\\d+\\s+2\\d\\d\\s+.*$/', $headers[0]) === 1 : false;
}
?>

2. Clean the element you want to get (remove all tags, tabs, new-lines etc.)
<?php
# Function to clean a string
function clean($text) {
    // strip tags, collapse whitespace, replace ';' with '-', then decode entities
    return html_entity_decode(trim(str_replace(';', '-', preg_replace('/\s+/S', " ", strip_tags($text)))));
}
?>

After doing that, we can output the search result headers with the following method:
<?php
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q='.$searchstring;
if (url_check($url)) {
    $doc = new DomDocument;
    libxml_use_internal_errors(true); // keep warnings about imperfect markup out of the output
    $doc->loadHTML(file_get_contents($url));
    // DOMDocument has no getElementByClass(), so select the class with XPath instead
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' r ')]") as $node) {
        echo clean($node->textContent) . '<br>';
    }
} else {
    echo 'URL not reachable!'; // shown when the URL cannot be fetched
}
?>

2016-12-24 17:13:10

Apr 03

Author: BychkovVV at mail dot ru


If you are loading HTML content from any website in "utf-8" encoding and the meta tag with the content-type is not the first child of HEAD, the encoding will not be acknowledged by the parser. So you can apply this fix:

<?php
function domLoadHTML($html)
{
    // First pass: parse as-is, only to find the declared charset
    $testDOM = new DOMDocument('1.0', 'UTF-8');
    $testDOM->loadHTML($html);
    $charset = NULL;
    // Walk html > head > meta looking for http-equiv="Content-Type"
    $searchInElemnt = function ($item) use (&$searchInElemnt, &$charset) {
        if ($item->childNodes) {
            foreach ($item->childNodes as $childItem) {
                switch ($childItem->nodeName) {
                    case 'html':
                    case 'head':
                        $searchInElemnt($childItem);
                        break;
                    case 'meta':
                        $attributes = array();
                        foreach ($childItem->attributes as $attr) {
                            $attributes[mb_strtoupper($attr->localName)] = $attr->nodeValue;
                        }
                        if (array_key_exists('HTTP-EQUIV', $attributes) && (mb_strtoupper($attributes['HTTP-EQUIV']) == 'CONTENT-TYPE') && array_key_exists('CONTENT', $attributes) && preg_match('~[\s]*;[\s]*charset[\s]*=[\s]*([^\s]+)~', $attributes['CONTENT'], $matches)) {
                            $charset = preg_replace('~[\s\']~', '', $matches[1]);
                        }
                }
            }
        }
    };
    $searchInElemnt($testDOM);
    if (isset($charset)) {
        // Second pass: force the detected charset with the XML-declaration hack
        $dom = new DOMDocument('1.0', $charset);
        $dom->loadHTML('<?xml encoding="'.$charset.'">'.$html);
        foreach ($dom->childNodes as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item);
            }
        }
        $dom->encoding = $charset;
    } else {
        $dom = $testDOM;
    }
    return $dom;
}
?>

2020-04-03 11:30:59

Sep 15

Author: divinity76+spam at gmail dot com


If you want to get rid of all the "DOMText elements containing ONLY whitespace", maybe try

<?php
function loadHTML_noemptywhitespace(string $html, int $extra_flags = 0, int $exclude_flags = 0): DOMDocument
{
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS | LIBXML_NONET;
    $flags = ($flags | $extra_flags) & ~ $exclude_flags;

    $domd = new DOMDocument();
    $domd->preserveWhiteSpace = false;
    @$domd->loadHTML('<?xml encoding="UTF-8">' . $html, $flags);
    $removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
        if ($node->hasChildNodes()) {
            // Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
            // that's why I don't use foreach() - don't change it (unless you know what you're doing, of course)
            for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
                $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
            }
        }
        // Compare against '' rather than using empty(), which would also drop a text node containing just "0"
        if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && trim($node->textContent) === '') {
            $node->parentNode->removeChild($node);
        }
    };
    $removeAnnoyingWhitespaceTextNodes($domd);
    return $domd;
}
?>

2020-09-15 00:25:11

Nov 20

Author: deepakrajpal dot com at gmail dot com


If we load HTML5 tags such as <section> or <svg>, we get the following error:

DOMDocument::loadHTML(): Tag section invalid in Entity

We can disable the standard libxml errors (and enable user error handling) by calling libxml_use_internal_errors(true); before loadHTML();

This is quite useful in PHPUnit custom assertions, as in the following example (if using PHPUnit test cases):

// Create a DOMDocument
$dom = new DOMDocument();

// fix html5/svg errors
libxml_use_internal_errors(true);

// Load html
$dom->loadHTML("<section></section>");
$htmlNodes = $dom->getElementsByTagName('section');

// Pass only when the <section> element was actually parsed
if ($htmlNodes->length == 0) {
    $this->assertFalse(TRUE);
} else {
    $this->assertTrue(TRUE);
}

2020-11-20 08:16:33

Dec 10

Author: obayed dot opu at gmail dot com


To load HTML5 markup without warnings, you have to suppress libxml error reports by adding `LIBXML_NOERROR` as an option to the loadHTML method.

Example:

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><section>I'M UNSUPPORTED</section></body></html>", LIBXML_NOERROR);
echo $doc->saveHTML();
?>

2021-12-10 11:58:01

May 19

Author: Anonymous


loadHTML() and loadHTMLFile() may generate warnings whenever the HTML includes tags introduced in HTML5, such as nav, section or footer (observed in PHP 8.1.6).

Try running the code below.

<?php
$file_name = 'PHP Runtime Configuration - Manual.html'; // Download this file from "https://www.php.net/manual/en/session.configuration.php" in advance.

$doc = new DOMDocument();
$doc->loadHTMLFile($file_name); // if set "LIBXML_NOERROR" as 2nd arg, no error
echo $doc->saveHTML();

// Warning: DOMDocument::loadHTMLFile(): Tag nav invalid in PHP Runtime Configuration - Manual.html, line: 63 in D:\xampp\htdocs\test\xml(dom)\loadHTML\index.php on line 6
?>

2022-05-19 19:05:36
