PHP 5.6 and PHP 7: Unicode character properties

Unicode character properties

Since version 5.1.0, three additional escape sequences that match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}: a character with the xx property
\P{xx}: a character without the xx property
\X: an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can also be specified by placing a circumflex ("^") between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets are optional; the following two examples are equivalent:

\p{L}
\pL
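
The equivalences described above can be checked with a short script. This is a minimal sketch, assuming PCRE was built with Unicode property support; the sample string and variable names are illustrative only.

```php
<?php
// Sample input (illustrative): Latin letters, Cyrillic letters and digits.
$text = 'Abc Где 123';

// \p{L} and \pL are the same pattern: any letter.
preg_match_all('/\p{L}/u', $text, $braced);
preg_match_all('/\pL/u', $text, $bare);
$sameLetters = ($braced[0] === $bare[0]);

// \p{^Lu} and \P{Lu} both mean "anything except an upper case letter".
preg_match_all('/\p{^Lu}/u', $text, $caret);
preg_match_all('/\P{Lu}/u', $text, $negated);
$sameNegation = ($caret[0] === $negated[0]);

// Number of letters found in the sample string.
$letterCount = count($braced[0]);

var_dump($sameLetters, $sameNegation, $letterCount);
```

Both comparisons should come out true, and the letter count covers the Latin and Cyrillic letters alike.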

**Supported property codes**
Property	Matches	Notes
C	Other
Cc	Control
Cf	Format
Cn	Unassigned
Co	Private use
Cs	Surrogate
L	Letter	Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll	Lower case letter
Lm	Modifier letter
Lo	Other letter
Lt	Title case letter
Lu	Upper case letter
M	Mark
Mc	Spacing mark
Me	Enclosing mark
Mn	Non-spacing mark
N	Number
Nd	Decimal number
Nl	Letter number
No	Other number
P	Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
Z	Separator
Zl	Line separator
Zp	Paragraph separator
Zs	Space separator
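
As a rough illustration of how the single-letter codes in the table group characters, the sketch below counts the characters of a string per top-level category. The function name count_general_categories and the shape of the returned array are assumptions made for this example, not part of PHP.

```php
<?php
/**
 * Count the characters of a UTF-8 string per top-level general category.
 * Hypothetical helper for illustration; not a built-in function.
 */
function count_general_categories($text)
{
    $counts = array();
    foreach (array('L', 'M', 'N', 'P', 'S', 'Z', 'C') as $code) {
        // preg_match_all() returns the number of matches, or false on error.
        $n = preg_match_all('/\p{' . $code . '}/u', $text);
        $counts[$code] = ($n === false) ? 0 : $n;
    }
    return $counts;
}

print_r(count_general_categories("Hi, мир! 2 + 2 = 4\n"));
```

Because every character has exactly one general category, the counts add up to the number of characters in the string.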

Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*).

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the whole sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.
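
A small sketch of the difference between matching single code points and matching with \X, assuming Unicode property support is available. The input is "e" followed by a combining acute accent and then "a"; the variable names are illustrative.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0xCC 0x81), then "a".
$text = "e\xCC\x81a";

// A dot in UTF-8 mode matches one code point at a time.
$codePoints = preg_match_all('/./u', $text);

// \X keeps a base character together with its combining marks.
$sequences = preg_match_all('/\X/u', $text, $m);

// The spelled-out pattern from the text, applied to the same input.
$spelledOut = preg_match_all('/(?>\PM\pM*)/u', $text);

var_dump($codePoints, $sequences, $spelledOut);
```

On this input the dot sees three code points, while \X and the spelled-out pattern each see two sequences, the first being the letter together with its accent.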

Matching characters by Unicode property is not fast, because PCRE has to search a data structure that covers more than fifteen thousand characters. That is why the traditional escape sequences in PCRE, such as \d and \w, do not use Unicode properties.
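
The practical consequence can be sketched as follows. The example compares \d without any modifier against \p{Nd} in UTF-8 mode, using Arabic-Indic digits as illustrative input. Some PHP builds may also make \d Unicode-aware once the "u" modifier is set, so the sketch deliberately uses \d without it.

```php
<?php
// Arabic-Indic digits for "123" (U+0661 U+0662 U+0663) as UTF-8 bytes.
$arabicIndic = "\xD9\xA1\xD9\xA2\xD9\xA3";

// Without the "u" modifier, \d only accepts the ASCII digits 0-9.
$asciiOnly = preg_match('/^\d+$/', $arabicIndic);

// \p{Nd} in UTF-8 mode consults the Unicode tables and accepts them.
$byProperty = preg_match('/^\p{Nd}+$/u', $arabicIndic);

// ASCII digits are matched by the traditional escape as usual.
$plainDigits = preg_match('/^\d+$/', '123');

var_dump($asciiOnly, $byProperty, $plainDigits);
```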

Comments

Mar 01

Author: suit at rebell dot at


these properties are usually only available if PCRE is compiled with "--enable-unicode-properties"



if you want to match any word but want to provide a fallback, you can do something like this:



<?php
// preg_match_all() returns false when the pattern cannot be compiled
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
  // fallback goes here
  // for example just '/\w+/u' for a less accurate match
}
?>
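
A slightly fuller sketch of the same fallback idea, wrapped in a function. The name extract_words is hypothetical, and the sketch assumes that a missing Unicode property build makes preg_match_all() return false for the \p{L} pattern.

```php
<?php
/**
 * Extract words, preferring \p{L} and falling back to \w.
 * Hypothetical helper for illustration; not part of PHP.
 */
function extract_words($str)
{
    // @ silences the warning raised if \p is not supported by this build.
    if (@preg_match_all('/\p{L}+/u', $str, $arr) !== false) {
        return $arr[0];
    }
    // Less accurate fallback: \w also accepts digits and underscores.
    if (preg_match_all('/\w+/u', $str, $arr) !== false) {
        return $arr[0];
    }
    return array();
}

print_r(extract_words('Hello, мир'));
```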

2010-03-01 07:13:08

http://php5.kiev.ua/manual/ru/regexp.reference.unicode.html

May 08

Author: mercury at caucasus dot net


An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html

2010-05-08 13:32:29


Jan 22

Author: o_shes01 at uni-muenster dot de


For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. 

For example, there are three codepoints for the "LJ" digraph in Unicode: 

  (*) uppercase "LJ": U+01C7 

  (*) titlecase "Lj": U+01C8 

  (*) lowercase "lj": U+01C9
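
The three code points listed above can be checked against \p{Lu}, \p{Lt} and \p{Ll} directly. This is a minimal sketch with the characters written as UTF-8 byte escapes; the array layout is illustrative.

```php
<?php
// The three code points of the "LJ" digraph, written as UTF-8 bytes.
$forms = array(
    'Lu' => "\xC7\x87", // U+01C7, upper case "LJ"
    'Lt' => "\xC7\x88", // U+01C8, title case "Lj"
    'Ll' => "\xC7\x89", // U+01C9, lower case "lj"
);

$results = array();
foreach ($forms as $category => $char) {
    // Test each character against the category it is filed under.
    $results[$category] = preg_match('/^\p{' . $category . '}$/u', $char);
}

// The upper case form is not a title case letter.
$upperIsTitle = preg_match('/^\p{Lt}$/u', $forms['Lu']);

var_dump($results, $upperIsTitle);
```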

2011-01-22 08:23:58


Dec 24

Author: xuantoaiph at gmail dot com


My country, Vietnam, has its own alphabet:

http://en.wikipedia.org/wiki/Vietnamese_alphabet

I hope PHP will support Vietnamese better.

2013-12-24 10:03:31


Jan 20

Author: huhwatnouDONTspamPLEASE at hotmail dot com


To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see reference.pcre.pattern.modifiers).



I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so.
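
A short sketch of the effect described in this comment. In UTF-8 the character "ß" is stored as the bytes 0xC3 0x9F; the explanation in the code comments for the byte-wise result is an assumption based on the control range U+007F-U+009F, not something the manual page states.

```php
<?php
// "ß" (U+00DF) is stored as the two bytes 0xC3 0x9F in UTF-8.
$sharpS = "\xC3\x9F";

// With "u" the string is one lower case letter, not a control character.
$isControlUtf8 = preg_match('/\p{Cc}/u', $sharpS);
$isLowerUtf8   = preg_match('/^\p{Ll}$/u', $sharpS);

// Without "u" each byte is examined on its own; 0x9F lies in the
// U+007F-U+009F control range, which would explain the report above.
$isControlBytes = preg_match('/\p{Cc}/', $sharpS);

var_dump($isControlUtf8, $isLowerUtf8, $isControlBytes);
```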

2016-01-20 11:00:53


Sep 25

Author: php at lnx-bsp dot net


Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:



<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>



Will match any combination of letters and numbers.
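
A slightly larger sketch of the same idea, including a negated class. The sample string and variable names are illustrative, and the "u" modifier is used so that the input is read as UTF-8.

```php
<?php
// Illustrative input with Latin, Cyrillic, digits and punctuation.
$data = 'Order №42: два кофе, 3 teas!';

// Letters and numbers combined in one character class.
preg_match_all('/[\p{L}\p{N}]+/u', $data, $tokens);

// A negated class works too: split on anything that is neither.
$pieces = preg_split('/[^\p{L}\p{N}]+/u', $data, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens[0]);
```

Both approaches should yield the same list of tokens; the numero sign and the punctuation act as separators.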

2017-09-25 08:53:22


Aug 03

Author: Steve


Examples are always useful! See https://unicodeplus.com/category for more.



C    Other     

Cc   Control      (Unicode code points in the ranges U+0000-U+001F and U+007F-U+009F)

Cf   Format       (Soft hyphen (U+00AD), zero width space (U+200B), etc.)

Cn   Unassigned   (Any code point that is not in the Unicode table)

Co   Private use     

Cs   Surrogate    (Characters in the range U+D800 to U+DFFF, which are invalid in UTF-8)



L    Letter

Ll   Lower case letter (a-z, µßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ and more)

Lm   Modifier letter   (Letter-like characters that are usually combined with others, but here they stand alone:

                        ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁˆˇˈˉˊˋˌˍˎˏːˑˠˡˢˣˤˬˮʹͺՙ and more)

Lo   Other letter      (ªºƻǀǁǂǃʔ and many more ideographs and letters from unicase alphabets)

Lt   Title case letter (ǅǈǋǲᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾼῌῼ)

Lu   Upper case letter (A-Z, ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ and more)

L&   Ordinary letter   (Any character that has the Lu, Ll, or Lt property)



M    Mark

Mc   Spacing mark      (None in Latin scripts)

Me   Enclosing mark    (Combining enclosing square (U+20DE) like in a⃞ , combining enclosing circle backslash (U+20E0) like in a⃠)

Mn   Non-spacing mark  (Combining diacritical marks U+0300-U+036f, like the accents on this letter a: áâãāa̅ăȧäảåa̋ǎa̍a̎ȁa̐ȃ)



N    Number      

Nd   Decimal number (0123456789, ٠١٢٣٤٥٦٧٨٩ and digits in many other scripts.)

Nl   Letter number  (ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ and some more)

No   Other number   (⁰¹²³⁴⁵⁶⁷⁸⁹ ₀₁₂₃₄₅₆₇₈₉ ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒ ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳, etc.)



P    Punctuation      

Pc   Connector punctuation (_ underscore (U+005F), ‿ undertie (U+203F), ⁀ character tie (U+2040), etc.)

Pd   Dash punctuation      (- hyphen-minus (U+002D), ‐ hyphen (U+2010), ‑ non-breaking hyphen (U+2011), ‒ figure dash (U+2012),

                            – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), etc.)

Pe   Close punctuation     (right parenthesis, bracket, or brace: `)` (U+0029), `]` (U+005D), `}` (U+007D), etc.) 

Pf   Final punctuation     (right quotation marks: » (U+00BB), ’ (U+2019), ” (U+201D), etc.)

Pi   Initial punctuation   (left  quotation marks: « (U+00AB), ‘ (U+2018), “ (U+201C), etc.)

Po   Other punctuation     (!"#%&'*,./:;?@\¡§¶·¿)

Ps   Open punctuation      (left parenthesis, bracket, or brace: `(` (U+0028), `[` (U+005B), `{` (U+007B), etc.) 



S    Symbol      

Sc   Currency symbol     ($¢£¤¥, ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿ (U+20A0-U+20BF), etc.)

Sk   Modifier symbol     (Symbol-like characters that are usually combined with others, but here they stand alone:

                          ^`¨¯´¸ and more)

Sm   Mathematical symbol (+<=>|~¬±×÷϶ and many more)

So   Other symbol        (¦ broken bar (U+00A6), © copyright sign (U+00A9), ® registered sign (U+00AE), ° degree sign (U+00B0);

                          arrows, signs, emojis and many many more)



Z    Separator      

Zl   Line separator      (line separator (U+2028))

Zp   Paragraph separator (paragraph separator (U+2029))

Zs   Space separator     (space, no-break space, en quad, em quad, en space, em space, figure space, thin space, hair space, etc.)