Meta-characters
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are written using meta-characters, which do not stand for themselves but are interpreted in a special way.
There are two different sets of meta-characters: those recognized inside square brackets and those recognized outside square brackets. Outside square brackets, the meta-characters are as follows:
- \ : general escape character with several uses
- ^ : asserts the start of the subject (or of a line, in multiline mode)
- $ : asserts the end of the subject or the position before a terminating newline (or the end of a line, in multiline mode)
- . : matches any character except newline (by default)
- [ : starts a character class definition
- ] : ends a character class definition
- | : starts an alternative branch
- ( : starts a subpattern
- ) : ends a subpattern
- ? : extends the meaning of (; also a quantifier meaning 0 or 1 occurrence; also turns greedy quantifiers into lazy ones (see repetition)
- * : quantifier meaning 0 or more occurrences
- + : quantifier meaning 1 or more occurrences
- { : starts a min/max quantifier
- } : ends a min/max quantifier
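As a rough illustration of the list above, the following sketch runs a handful of made-up patterns through preg_match(). The subjects and the $cases and $results names are assumptions chosen for this example, and the line-feed case assumes the default newline convention.

```php
<?php
// Each row: pattern, subject, expected preg_match() result (1 = match, 0 = no match).
// The subjects are made up for illustration.
$cases = [
    ['/^abc$/',     'abc',   1], // ^ and $ anchor the match to the start and end
    ['/^abc$/',     'xabc',  0], // ^ fails because the subject starts with "x"
    ['/a.c/',       'a-c',   1], // . matches any single character except newline
    ['/a.c/',       "a\nc",  0], // ... so a line feed between "a" and "c" does not match
    ['/gr(a|e)y/',  'grey',  1], // | separates the alternatives inside the ( ) subpattern
    ['/colou?r/',   'color', 1], // ? makes the preceding "u" optional
    ['/ab*c/',      'ac',    1], // * accepts zero occurrences of "b"
    ['/ab+c/',      'ac',    0], // + requires at least one "b"
    ['/^\d{2,4}$/', '12345', 0], // {2,4} accepts two to four digits, not five
    ['/1\+1/',      '1+1',   1], // \ removes the special meaning of +
];

$results = [];
foreach ($cases as [$pattern, $subject, $expected]) {
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    printf("%-13s %-8s => %d (expected %d)\n",
        $pattern, json_encode($subject), $result, $expected);
}
```

Single-quoted PHP strings are used so that backslashes reach the regex engine unchanged.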
Inside square brackets (that is, within a character class), the meta-characters are:
- \ : general escape character
- ^ : negates the class, but only if it is the first character
- - : indicates a character range
- ] : terminates the character class
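The behaviour of meta-characters inside a character class can be sketched as follows. This is a minimal example with invented subjects; the $checks array and its keys are names assumed for illustration only.

```php
<?php
$checks = [
    // ^ as the first character negates the class; - denotes the range 0 to 9
    'negated range'   => preg_match('/^[^0-9]+$/', 'abc'),
    'digit rejected'  => preg_match('/^[^0-9]+$/', 'ab1'),
    // ^ anywhere else in the class is an ordinary character
    'literal caret'   => preg_match('/^[a^]+$/', 'a^a'),
    // \ escapes ] and - so that they are taken literally
    'escaped bracket' => preg_match('/^[\]\-]+$/', ']-]'),
    // . is not a meta-character inside a class, so it only matches a literal dot
    'literal dot'     => preg_match('/^[.]$/', 'x'),
];

foreach ($checks as $label => $result) {
    printf("%-16s => %d\n", $label, $result);
}
```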
Comments
The meta-character $ also matches immediately before a single trailing newline character (\n).
(Take a moment to let this information sink in.)
If you get a match, you might want to (r)trim() your input afterwards; otherwise it might still fail a length requirement, or other strange things might happen when you store the input as-is.
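A minimal sketch of this behaviour, assuming a made-up input string. It also shows two ways to demand the true end of the subject: the D modifier and the \z assertion. All variable names are assumptions for this example.

```php
<?php
$input = "abc\n"; // hypothetical user input with one trailing line feed

$plain       = preg_match('/^abc$/',  $input);      // $ also matches before a final \n
$dollarOnly  = preg_match('/^abc$/D', $input);      // D: $ matches only at the very end
$strictEnd   = preg_match('/^abc\z/', $input);      // \z: very end of the subject
$twoNewlines = preg_match('/^abc$/',  "abc\n\n");   // only one trailing \n is tolerated
$trimmed     = preg_match('/^abc$/D', rtrim($input, "\n"));

printf("plain=%d dollarOnly=%d strictEnd=%d twoNewlines=%d trimmed=%d\n",
    $plain, $dollarOnly, $strictEnd, $twoNewlines, $trimmed);
```

Validating with D or \z avoids having to trim after the match, because an input with a trailing newline is rejected outright.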
A surprising aspect of "any character" in multi-line subjects.
Remark:
By default, '.' (any character) does NOT match one single character, the newline (\n),
while \n is matched by other constructs (e.g. \s).
Funnily enough, the carriage return (\r) is matched by '.'.
You have to write "(.|\\n)" instead of a single dot, which makes complex match results harder to work with,
or simply use the "s" modifier to make the dot accept the newline.
$subject = "<tag>Hello\nWorld</tag>";
preg_match('/<tag>[A-Za-z\\s]*<\\/tag>/', $subject); // 1 (match)
preg_match('/<tag>[^<]*<\\/tag>/', $subject);        // 1 (match)
preg_match('/<tag>(.|\\n)*<\\/tag>/', $subject);     // 1 (match)
preg_match('/<tag>.*<\\/tag>/s', $subject);          // 1 (match)
preg_match('/<tag>.*<\\/tag>/', $subject);           // ATTENTION! 0 (no match)
A hint for those of you who are trying to fight off (or at least work around) the problem of matching a pattern correctly at the end ($) of any line in multiline mode (/m).
<?php
// Different operating systems use different line-break characters:
// - Windows uses CR+LF (\r\n);
// - Linux uses LF (\n);
// - classic Mac OS (before OS X) uses CR (\r).
// That is why the single dollar assertion ($) sometimes fails in multiline (/m) mode - possible bug in PHP 5.3.8(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _
$pat1='/\w$/mi';
$pat2='/\w\r?$/mi';
$pat3='/\w\R?$/mi';
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
."\n3 !!! $pat3 ($p): ".print_r($m3[0], true);
// Note the difference between the two very helpful escape sequences in $pat2 (\r) and in $pat3 (\R) - for some applications at least.
/* The code above results in the following output:
ABC ABC
123 123
def def
nop nop
890 890
QRS QRS
~-_ ~-_
1 !!! /\w$/mi (3): Array
(
[0] => C
[1] => 0
[2] => _
)
2 !!! /\w\r?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
3 !!! /\w\R?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
*/
?>
Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.
Significantly updated version (with $pat4 and the best $pat5!):
A hint for those of you who are trying to fight off (or at least work around) the problem of matching a pattern correctly at the end ($) of any line in multiline mode (/m).
<?php
// Different operating systems use different line-break characters:
// - Windows uses CR+LF (\r\n);
// - Linux uses LF (\n);
// - classic Mac OS (before OS X) uses CR (\r).
// That is why the single dollar assertion ($) sometimes fails in multiline (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _
$pat1='/\w$/mi';
$pat2='/\w\r?$/mi';
$pat3='/\w\R?$/mi';
$pat4='/\w\v?$/mi';
$pat5='/(*ANYCRLF)\w$/mi';
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat4, $str, $m4);
$s=preg_match_all($pat5, $str, $m5);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
."\n5 !!! $pat5 ($s): ".print_r($m5[0], true);
// Note the difference between the three very helpful escape sequences in $pat2 (\r), $pat3 (\R) and $pat4 (\v), and the altered newline option in $pat5 ((*ANYCRLF)) - for some applications at least.
/* The code above results in the following output:
ABC ABC
123 123
def def
nop nop
890 890
QRS QRS
~-_ ~-_
1 !!! /\w$/mi (3): Array
(
[0] => C
[1] => 0
[2] => _
)
2 !!! /\w\r?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
3 !!! /\w\R?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
4 !!! /\w\v?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
5 !!! /(*ANYCRLF)\w$/mi (7): Array
(
[0] => C
[1] => 3
[2] => f
[3] => p
[4] => 0
[5] => S
[6] => _
)
*/
?>
Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.
An important addendum (with new $pat3_2 utilising \R properly, its results and comments):
Note that escape sequences like \r, \R and \v have nuances of meaning and application that can be difficult to grasp at first glance. None of them is perfect in all situations, but they are quite useful nevertheless. Some official PCRE newline options come in handy too. Unfortunately, none of (*ANYCRLF), (*ANY) and (*CRLF) is documented here on php.net at the moment (although they seem to have been available for over 10 years and 5 months now), but they are described pretty well on Wikipedia ("Newline/linebreak options" at https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) and on the official PCRE library site ("Newline convention" at http://www.pcre.org/original/doc/html/pcresyntax.html#SEC17). When used improperly, \R appears somewhat disappointing (with the default compile-time configuration), according to php.net as well as the official description ("Newline sequences" at https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq).
A hint for those of you who are trying to fight off (or at least work around) the problem of matching a pattern correctly at the end (or at the beginning) of any line, even without multiline mode (/m) or the meta-character assertions ($ or ^).
<?php
// Different operating systems use different line-break characters:
// - Windows uses CR+LF (\r\n);
// - Linux uses LF (\n);
// - classic Mac OS (before OS X) uses CR (\r).
// That is why the single dollar assertion ($) sometimes fails in multiline (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?) of the default compile-time PCRE configuration for the meta-character assertions (^ and $).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _
$pat3='/\w\R?$/mi'; // Somewhat disappointing according to php.net and pcre.org when used improperly
$pat3_2='/\w(?=\R)/i'; // Much better: a lookahead assertion (detects the line break without consuming it) and no multiline (/m) mode; note that with an alternative for the end of the string ((?=\R|$)) it would grab all 7 elements as expected, but '/(*ANYCRLF)\w$/mi' is more straightforward to use anyway
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat3_2, $str, $m4);
echo $str."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
."\n3_2 !!! $pat3_2 ($r): ".print_r($m4[0], true);
// Note the difference between the two uses of the very helpful escape sequence \R in $pat3 and $pat3_2 - for some applications at least.
/* The code above results in the following output:
ABC ABC
123 123
def def
nop nop
890 890
QRS QRS
~-_ ~-_
3 !!! /\w\R?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
3_2 !!! /\w(?=\R)/i (6): Array
(
[0] => C
[1] => 3
[2] => f
[3] => p
[4] => 0
[5] => S
)
*/
?>
Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.
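The (?=\R|$) variant mentioned in the comment on $pat3_2 can be checked with a short sketch like the one below. It reuses the test string from the examples above; the $count and $m names are assumptions for this illustration.

```php
<?php
// Same subject as in the examples above: LF, CR+LF and CR line breaks mixed together.
$str = "ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";

// Lookahead: a word character followed by any line break (\R) or by the end of the string.
// No /m modifier is needed because the lookahead, not $, detects the line ends.
$count = preg_match_all('/\w(?=\R|$)/', $str, $m);

echo $count, ': ', implode(' ', $m[0]), "\n";
```

Because the lookahead does not consume the line break, each match contains only the word character itself.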
". match any character except newline (by default)"
Here "newline" seems to include both \n (LF) and \r (CR) in PHP 7.4.6. PHP 7.3.18 seems to be more tolerant and only include \n (LF).
Example:
<?php
$s = "HTTP/1.1 200 OK\r\n";
if (!preg_match('/^HTTP\/(\d\.\d)\s*(\d+).*\n/', $s, $m)) {
    echo "Not matched correctly!\n";
} else {
    echo "OK\n";
}
?>
The ".*" is supposed to match 0-n characters including \r (CR). It does so in PHP 7.3.18 but not in PHP 7.4.6.
Result (PHP 7.3.18):
OK
Result (PHP 7.4.6):
Not matched correctly!
A pattern that works in both versions of PHP looks like this:
'/^HTTP\/(\d\.\d)\s*(\d+).*\r?\n/'
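Another way to avoid depending on what "." treats as a newline is to leave the dot out of the pattern. The sketch below is an assumed variant, not the pattern from the comment above: it replaces .* with [^\r\n]* and uses \R to consume the line break, so the result should not depend on the newline convention in effect.

```php
<?php
// Assumed variant of the pattern: [^\r\n]* replaces .* and \R consumes the line break.
$pattern = '/^HTTP\/(\d\.\d)\s*(\d+)[^\r\n]*\R/';

$crlf = preg_match($pattern, "HTTP/1.1 200 OK\r\n", $m);  // CR+LF line ending
$lf   = preg_match($pattern, "HTTP/1.1 200 OK\n", $m2);   // bare LF line ending

echo "CRLF: $crlf, version {$m[1]}, status {$m[2]}\n";
echo "LF:   $lf, version {$m2[1]}, status {$m2[2]}\n";
```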