418 lines
15 KiB
Plaintext
418 lines
15 KiB
Plaintext
=head1 NAME
|
|
|
|
perlreref - Perl Regular Expressions Reference
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
This is a quick reference to Perl's regular expressions.
|
|
For full information see L<perlre> and L<perlop>, as well
|
|
as the L</"SEE ALSO"> section in this document.
|
|
|
|
=head2 OPERATORS
|
|
|
|
C<=~> determines to which variable the regex is applied.
|
|
In its absence, $_ is used.
|
|
|
|
$var =~ /foo/;
|
|
|
|
C<!~> determines to which variable the regex is applied,
|
|
and negates the result of the match; it returns
|
|
false if the match succeeds, and true if it fails.
|
|
|
|
$var !~ /foo/;
|
|
|
|
C<m/pattern/msixpogcdualn> searches a string for a pattern match,
|
|
applying the given options.
|
|
|
|
m Multiline mode - ^ and $ match internal lines
|
|
s match as a Single line - . matches \n
|
|
i case-Insensitive
|
|
x eXtended legibility - free whitespace and comments
|
|
p Preserve a copy of the matched string -
|
|
${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
|
|
o compile pattern Once
|
|
g Global - all occurrences
|
|
c don't reset pos on failed matches when using /g
|
|
a restrict \d, \s, \w and [:posix:] to match ASCII only
|
|
aa (two a's) also /i matches exclude ASCII/non-ASCII
|
|
l match according to current locale
|
|
u match according to Unicode rules
|
|
d match according to native rules unless something indicates
|
|
Unicode
|
|
n Non-capture mode. Don't let () fill in $1, $2, etc...
|
|
|
|
If 'pattern' is an empty string, the last I<successfully> matched
|
|
regex is used. Delimiters other than '/' may be used for both this
|
|
operator and the following ones. The leading C<m> can be omitted
|
|
if the delimiter is '/'.
|
|
|
|
C<qr/pattern/msixpodualn> lets you store a regex in a variable,
|
|
or pass one around. Modifiers as for C<m//>, and are stored
|
|
within the regex.
|
|
|
|
C<s/pattern/replacement/msixpogcedual> substitutes matches of
|
|
'pattern' with 'replacement'. Modifiers as for C<m//>,
|
|
with two additions:
|
|
|
|
e Evaluate 'replacement' as an expression
|
|
r Return substitution and leave the original string untouched.
|
|
|
|
'e' may be specified multiple times. 'replacement' is interpreted
|
|
as a double quoted string unless a single-quote (C<'>) is the delimiter.
|
|
|
|
C<m?pattern?> is like C<m/pattern/> but matches only once. No alternate
|
|
delimiters can be used. Must be reset with reset().
|
|
|
|
=head2 SYNTAX
|
|
|
|
\ Escapes the character immediately following it
|
|
. Matches any single character except a newline (unless /s is
|
|
used)
|
|
^ Matches at the beginning of the string (or line, if /m is used)
|
|
$ Matches at the end of the string (or line, if /m is used)
|
|
* Matches the preceding element 0 or more times
|
|
+ Matches the preceding element 1 or more times
|
|
? Matches the preceding element 0 or 1 times
|
|
{...} Specifies a range of occurrences for the element preceding it
|
|
[...] Matches any one of the characters contained within the brackets
|
|
(...) Groups subexpressions for capturing to $1, $2...
|
|
(?:...) Groups subexpressions without capturing (cluster)
|
|
| Matches either the subexpression preceding or following it
|
|
\g1 or \g{1}, \g2 ... Matches the text from the Nth group
|
|
\1, \2, \3 ... Matches the text from the Nth group
|
|
\g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
|
|
\g{name} Named backreference
|
|
\k<name> Named backreference
|
|
\k'name' Named backreference
|
|
(?P=name) Named backreference (python syntax)
|
|
|
|
=head2 ESCAPE SEQUENCES
|
|
|
|
These work as in normal strings.
|
|
|
|
\a Alarm (beep)
|
|
\e Escape
|
|
\f Formfeed
|
|
\n Newline
|
|
\r Carriage return
|
|
\t Tab
|
|
\037 Char whose ordinal is the 3 octal digits, max \777
|
|
\o{2307} Char whose ordinal is the octal number, unrestricted
|
|
\x7f Char whose ordinal is the 2 hex digits, max \xFF
|
|
\x{263a} Char whose ordinal is the hex number, unrestricted
|
|
\cx Control-x
|
|
\N{name} A named Unicode character or character sequence
|
|
\N{U+263D} A Unicode character by hex ordinal
|
|
|
|
\l Lowercase next character
|
|
\u Titlecase next character
|
|
\L Lowercase until \E
|
|
\U Uppercase until \E
|
|
\F Foldcase until \E
|
|
\Q Disable pattern metacharacters until \E
|
|
\E End modification
|
|
|
|
For Titlecase, see L</Titlecase>.
|
|
|
|
This one works differently from normal strings:
|
|
|
|
\b An assertion, not backspace, except in a character class
|
|
|
|
=head2 CHARACTER CLASSES
|
|
|
|
[amy] Match 'a', 'm' or 'y'
|
|
[f-j] Dash specifies "range"
|
|
[f-j-] Dash escaped or at start or end means 'dash'
|
|
[^f-j] Caret indicates "match any character _except_ these"
|
|
|
|
The following sequences (except C<\N>) work within or without a character class.
|
|
The first six are locale aware, all are Unicode aware. See L<perllocale>
|
|
and L<perlunicode> for details.
|
|
|
|
\d A digit
|
|
\D A nondigit
|
|
\w A word character
|
|
\W A non-word character
|
|
\s A whitespace character
|
|
\S A non-whitespace character
|
|
\h A horizontal whitespace
|
|
\H A non horizontal whitespace
|
|
\N A non newline (when not followed by '{NAME}';;
|
|
not valid in a character class; equivalent to [^\n]; it's
|
|
like '.' without /s modifier)
|
|
\v A vertical whitespace
|
|
\V A non vertical whitespace
|
|
\R A generic newline (?>\v|\x0D\x0A)
|
|
|
|
\pP Match P-named (Unicode) property
|
|
\p{...} Match Unicode property with name longer than 1 character
|
|
\PP Match non-P
|
|
\P{...} Match lack of Unicode property with name longer than 1 char
|
|
\X Match Unicode extended grapheme cluster
|
|
|
|
POSIX character classes and their Unicode and Perl equivalents:
|
|
|
|
ASCII- Full-
|
|
POSIX range range backslash
|
|
[[:...:]] \p{...} \p{...} sequence Description
|
|
|
|
-----------------------------------------------------------------------
|
|
alnum PosixAlnum XPosixAlnum 'alpha' plus 'digit'
|
|
alpha PosixAlpha XPosixAlpha Alphabetic characters
|
|
ascii ASCII Any ASCII character
|
|
blank PosixBlank XPosixBlank \h Horizontal whitespace;
|
|
full-range also
|
|
written as
|
|
\p{HorizSpace} (GNU
|
|
extension)
|
|
cntrl PosixCntrl XPosixCntrl Control characters
|
|
digit PosixDigit XPosixDigit \d Decimal digits
|
|
graph PosixGraph XPosixGraph 'alnum' plus 'punct'
|
|
lower PosixLower XPosixLower Lowercase characters
|
|
print PosixPrint XPosixPrint 'graph' plus 'space',
|
|
but not any Controls
|
|
punct PosixPunct XPosixPunct Punctuation and Symbols
|
|
in ASCII-range; just
|
|
punct outside it
|
|
space PosixSpace XPosixSpace \s Whitespace
|
|
upper PosixUpper XPosixUpper Uppercase characters
|
|
word PosixWord XPosixWord \w 'alnum' + Unicode marks
|
|
+ connectors, like
|
|
'_' (Perl extension)
|
|
xdigit ASCII_Hex_Digit XPosixDigit Hexadecimal digit,
|
|
ASCII-range is
|
|
[0-9A-Fa-f]
|
|
|
|
Also, various synonyms like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all listed
|
|
in L<perluniprops/Properties accessible through \p{} and \P{}>
|
|
|
|
Within a character class:
|
|
|
|
POSIX traditional Unicode
|
|
[:digit:] \d \p{Digit}
|
|
[:^digit:] \D \P{Digit}
|
|
|
|
=head2 ANCHORS
|
|
|
|
All are zero-width assertions.
|
|
|
|
^ Match string start (or line, if /m is used)
|
|
$ Match string end (or line, if /m is used) or before newline
|
|
\b{} Match boundary of type specified within the braces
|
|
\B{} Match wherever \b{} doesn't match
|
|
\b Match word boundary (between \w and \W)
|
|
\B Match except at word boundary (between \w and \w or \W and \W)
|
|
\A Match string start (regardless of /m)
|
|
\Z Match string end (before optional newline)
|
|
\z Match absolute string end
|
|
\G Match where previous m//g left off
|
|
\K Keep the stuff left of the \K, don't include it in $&
|
|
|
|
=head2 QUANTIFIERS
|
|
|
|
Quantifiers are greedy by default and match the B<longest> leftmost.
|
|
|
|
Maximal Minimal Possessive Allowed range
|
|
------- ------- ---------- -------------
|
|
{n,m} {n,m}? {n,m}+ Must occur at least n times
|
|
but no more than m times
|
|
{n,} {n,}? {n,}+ Must occur at least n times
|
|
{n} {n}? {n}+ Must occur exactly n times
|
|
* *? *+ 0 or more times (same as {0,})
|
|
+ +? ++ 1 or more times (same as {1,})
|
|
? ?? ?+ 0 or 1 time (same as {0,1})
|
|
|
|
The possessive forms (new in Perl 5.10) prevent backtracking: what gets
|
|
matched by a pattern with a possessive quantifier will not be backtracked
|
|
into, even if that causes the whole match to fail.
|
|
|
|
There is no quantifier C<{,n}>. That's interpreted as a literal string.
|
|
|
|
=head2 EXTENDED CONSTRUCTS
|
|
|
|
(?#text) A comment
|
|
(?:...) Groups subexpressions without capturing (cluster)
|
|
(?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
|
|
(?=...) Zero-width positive lookahead assertion
|
|
(?*pla:...) Same; avail experimentally starting in 5.28
|
|
(?!...) Zero-width negative lookahead assertion
|
|
(?*nla:...) Same; avail experimentally starting in 5.28
|
|
(?<=...) Zero-width positive lookbehind assertion
|
|
(?*plb:...) Same; avail experimentally starting in 5.28
|
|
(?<!...) Zero-width negative lookbehind assertion
|
|
(?*nlb:...) Same; avail experimentally starting in 5.28
|
|
(?>...) Grab what we can, prohibit backtracking
|
|
(?*atomic:...) Same; avail experimentally starting in 5.28
|
|
(?|...) Branch reset
|
|
(?<name>...) Named capture
|
|
(?'name'...) Named capture
|
|
(?P<name>...) Named capture (python syntax)
|
|
(?[...]) Extended bracketed character class
|
|
(?{ code }) Embedded code, return value becomes $^R
|
|
(??{ code }) Dynamic regex, return value used as regex
|
|
(?N) Recurse into subpattern number N
|
|
(?-N), (?+N) Recurse into Nth previous/next subpattern
|
|
(?R), (?0) Recurse at the beginning of the whole pattern
|
|
(?&name) Recurse into a named subpattern
|
|
(?P>name) Recurse into a named subpattern (python syntax)
|
|
(?(cond)yes|no)
|
|
(?(cond)yes) Conditional expression, where "(cond)" can be:
|
|
(?=pat) lookahead
|
|
(?!pat) negative lookahead
|
|
(?<=pat) lookbehind
|
|
(?<!pat) negative lookbehind
|
|
(N) subpattern N has matched something
|
|
(<name>) named subpattern has matched something
|
|
('name') named subpattern has matched something
|
|
(?{code}) code condition
|
|
(R) true if recursing
|
|
(RN) true if recursing into Nth subpattern
|
|
(R&name) true if recursing into named subpattern
|
|
(DEFINE) always false, no no-pattern allowed
|
|
|
|
=head2 VARIABLES
|
|
|
|
$_ Default variable for operators to use
|
|
|
|
$` Everything prior to matched string
|
|
$& Entire matched string
|
|
$' Everything after to matched string
|
|
|
|
${^PREMATCH} Everything prior to matched string
|
|
${^MATCH} Entire matched string
|
|
${^POSTMATCH} Everything after to matched string
|
|
|
|
Note to those still using Perl 5.18 or earlier:
|
|
The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
|
|
within your program. Consult L<perlvar> for C<@->
|
|
to see equivalent expressions that won't cause slow down.
|
|
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
|
|
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
|
|
and C<${^POSTMATCH}>, but for them to be defined, you have to
|
|
specify the C</p> (preserve) modifier on your regular expression.
|
|
In Perl 5.20, the use of C<$`>, C<$&> and C<$'> makes no speed difference.
|
|
|
|
$1, $2 ... hold the Xth captured expr
|
|
$+ Last parenthesized pattern match
|
|
$^N Holds the most recently closed capture
|
|
$^R Holds the result of the last (?{...}) expr
|
|
@- Offsets of starts of groups. $-[0] holds start of whole match
|
|
@+ Offsets of ends of groups. $+[0] holds end of whole match
|
|
%+ Named capture groups
|
|
%- Named capture groups, as array refs
|
|
|
|
Captured groups are numbered according to their I<opening> paren.
|
|
|
|
=head2 FUNCTIONS
|
|
|
|
lc Lowercase a string
|
|
lcfirst Lowercase first char of a string
|
|
uc Uppercase a string
|
|
ucfirst Titlecase first char of a string
|
|
fc Foldcase a string
|
|
|
|
pos Return or set current match position
|
|
quotemeta Quote metacharacters
|
|
reset Reset m?pattern? status
|
|
study Analyze string for optimizing matching
|
|
|
|
split Use a regex to split a string into parts
|
|
|
|
The first five of these are like the escape sequences C<\L>, C<\l>,
|
|
C<\U>, C<\u>, and C<\F>. For Titlecase, see L</Titlecase>; For
|
|
Foldcase, see L</Foldcase>.
|
|
|
|
=head2 TERMINOLOGY
|
|
|
|
=head3 Titlecase
|
|
|
|
Unicode concept which most often is equal to uppercase, but for
|
|
certain characters like the German "sharp s" there is a difference.
|
|
|
|
=head3 Foldcase
|
|
|
|
Unicode form that is useful when comparing strings regardless of case,
|
|
as certain characters have complex one-to-many case mappings. Primarily a
|
|
variant of lowercase.
|
|
|
|
=head1 AUTHOR
|
|
|
|
Iain Truskett. Updated by the Perl 5 Porters.
|
|
|
|
This document may be distributed under the same terms as Perl itself.
|
|
|
|
=head1 SEE ALSO
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
L<perlretut> for a tutorial on regular expressions.
|
|
|
|
=item *
|
|
|
|
L<perlrequick> for a rapid tutorial.
|
|
|
|
=item *
|
|
|
|
L<perlre> for more details.
|
|
|
|
=item *
|
|
|
|
L<perlvar> for details on the variables.
|
|
|
|
=item *
|
|
|
|
L<perlop> for details on the operators.
|
|
|
|
=item *
|
|
|
|
L<perlfunc> for details on the functions.
|
|
|
|
=item *
|
|
|
|
L<perlfaq6> for FAQs on regular expressions.
|
|
|
|
=item *
|
|
|
|
L<perlrebackslash> for a reference on backslash sequences.
|
|
|
|
=item *
|
|
|
|
L<perlrecharclass> for a reference on character classes.
|
|
|
|
=item *
|
|
|
|
The L<re> module to alter behaviour and aid
|
|
debugging.
|
|
|
|
=item *
|
|
|
|
L<perldebug/"Debugging Regular Expressions">
|
|
|
|
=item *
|
|
|
|
L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
|
|
for details on regexes and internationalisation.
|
|
|
|
=item *
|
|
|
|
I<Mastering Regular Expressions> by Jeffrey Friedl
|
|
(L<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
|
|
reference on the topic.
|
|
|
|
=back
|
|
|
|
=head1 THANKS
|
|
|
|
David P.C. Wollmann,
|
|
Richard Soderberg,
|
|
Sean M. Burke,
|
|
Tom Christiansen,
|
|
Jim Cromie,
|
|
and
|
|
Jeffrey Goff
|
|
for useful advice.
|
|
|
|
=cut
|