argra****@users*****
argra****@users*****
2013年 2月 28日 (木) 20:02:15 JST
Index: docs/perl/5.12.1/perlrecharclass.pod diff -u /dev/null docs/perl/5.12.1/perlrecharclass.pod:1.1 --- /dev/null Thu Feb 28 20:02:15 2013 +++ docs/perl/5.12.1/perlrecharclass.pod Thu Feb 28 20:02:15 2013 @@ -0,0 +1,1588 @@ + +=encoding euc-jp + +=head1 NAME +X<character class> + +=begin original + +perlrecharclass - Perl Regular Expression Character Classes + +=end original + +perlrecharclass - Perl 正規表現文字クラス + +=head1 DESCRIPTION + +=begin original + +The top level documentation about Perl regular expressions +is found in L<perlre>. + +=end original + +Perl 正規表現に関する最上位文書は L<perlre> です。 + +=begin original + +This manual page discusses the syntax and use of character +classes in Perl Regular Expressions. + +=end original + +このマニュアルページは Perl 正規表現の文字クラスの文法と使用法について +議論します。 + +=begin original + +A character class is a way of denoting a set of characters, +in such a way that one character of the set is matched. +It's important to remember that matching a character class +consumes exactly one character in the source string. (The source +string is the string the regular expression is matched against.) + +=end original + +文字クラスは、集合の中の一文字がマッチングするというような方法で、 +文字の集合を指定するための方法です。 +文字集合はソース文字列の中から正確に一文字だけを消費するということを +覚えておくことは重要です。 +(ソース文字列とは正規表現がマッチングしようとしている文字列です。) + +=begin original + +There are three types of character classes in Perl regular +expressions: the dot, backslashed sequences, and the form enclosed in square +brackets. Keep in mind, though, that often the term "character class" is used +to mean just the bracketed form. This is true in other Perl documentation. + +=end original + +Perl 正規表現には 3 種類の文字クラスがあります: ドット、 +逆スラッシュシーケンス、大かっこで囲まれた形式です。 +Keep in mind, though, that often the term "character class" is used +to mean just the bracketed form. This is true in other Perl documentation. +(TBT) + +=head2 The dot + +(ドット) + +=begin original + +The dot (or period), C<.> is probably the most used, and certainly +the most well-known character class. By default, a dot matches any +character, except for the newline. The default can be changed to +add matching the newline with the I<single line> modifier: either +for the entire regular expression using the C</s> modifier, or +locally using C<(?s)>. + +=end original + +ドット (またはピリオド) C<.> はおそらくもっともよく使われ、そして確実に +もっともよく知られている文字クラスです。 +デフォルトでは、ドットは改行を除く任意の文字にマッチングします。 +デフォルトは I<単一行> 修飾子を使うことで改行にもマッチングするように +変更されます: 正規表現全体に対して C</s> 修飾子を使うか、ローカルには +C<(?s)> を使います。 + +=begin original + +Here are some examples: + +=end original + +以下は例です: + +=begin original + + "a" =~ /./ # Match + "." =~ /./ # Match + "" =~ /./ # No match (dot has to match a character) + "\n" =~ /./ # No match (dot does not match a newline) + "\n" =~ /./s # Match (global 'single line' modifier) + "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) + "ab" =~ /^.$/ # No match (dot matches one character) + +=end original + + "a" =~ /./ # マッチングする + "." =~ /./ # マッチングする + "" =~ /./ # マッチングしない (ドットは文字にマッチングする必要がある) + "\n" =~ /./ # マッチングしない (ドットは改行にはマッチングしない) + "\n" =~ /./s # マッチングする (グローバル「単一行」修飾子) + "\n" =~ /(?s:.)/ # マッチングする (ローカル「単一行」修飾子) + "ab" =~ /^.$/ # マッチングしない (ドットは一文字にマッチングする) + +=head2 Backslashed sequences +X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> +X<\N> X<\v> X<\V> X<\h> X<\H> +X<word> X<whitespace> + +(逆スラッシュシーケンス) + +=begin original + +Perl regular expressions contain many backslashed sequences that +constitute a character class. That is, they will match a single +character, if that character belongs to a specific set of characters +(defined by the sequence). A backslashed sequence is a sequence of +characters starting with a backslash. Not all backslashed sequences +are character classes; for a full list, see L<perlrebackslash>. + +=end original + +Perl 正規表現には、文字クラスを構成する多くの逆スラッシュシーケンスを +持ちます。 +これは(シーケンスで定義される)ある特定の文字集合に属する一つの文字に +マッチングします。 +逆スラッシュシーケンスは、逆スラッシュで始まる並びです。 +全ての逆スラッシュシーケンスが文字クラスというわけではありません; 完全な +リストは、L<perlrebackslash> を参照してください。 + +=begin original + +Here's a list of the backslashed sequences that are character classes. They +are discussed in more detail below. + +=end original + +以下は文字クラスの逆スラッシュシーケンスの一覧です。 +以下でさらに詳細に議論します。 + +=begin original + + \d Match a digit character. + \D Match a non-digit character. + \w Match a "word" character. + \W Match a non-"word" character. + \s Match a whitespace character. + \S Match a non-whitespace character. + \h Match a horizontal whitespace character. + \H Match a character that isn't horizontal whitespace. + \N Match a character that isn't newline. Experimental. + \v Match a vertical whitespace character. + \V Match a character that isn't vertical whitespace. + \pP, \p{Prop} Match a character matching a Unicode property. + \PP, \P{Prop} Match a character that doesn't match a Unicode property. + +=end original + + \d 数字にマッチング。 + \D 非数字にマッチング。 + \w 「単語」文字にマッチング。 + \W 非「単語」文字にマッチング。 + \s 空白文字にマッチング。 + \S 非空白文字にマッチング。 + \h 水平空白文字にマッチング。 + \H 水平空白でない文字にマッチング。 + \N 改行以外の文字にマッチング。実験的。 + \v 垂直空白文字にマッチング。 + \V 垂直空白でない文字にマッチング。 + \pP, \p{Prop} Unicode 特性にマッチする文字にマッチング。 + \PP, \P{Prop} Unicode 特性にマッチしない文字にマッチング。 + +=head3 Digits + +(数字) + +=begin original + +C<\d> matches a single character that is considered to be a I<digit>. What is +considered a digit depends on the internal encoding of the source string and +the locale that is in effect. If the source string is in UTF-8 format, C<\d> +not only matches the digits '0' - '9', but also Arabic, Devanagari and digits +from other languages. Otherwise, if there is a locale in effect, it will match +whatever characters the locale considers digits. Without a locale, C<\d> +matches the digits '0' to '9'. See L</Locale, EBCDIC, Unicode and UTF-8>. + +=end original + +C<\d> は I<数字> と考えられる単一の文字にマッチングします。 +何が数字と考えられるかはソース文字列の内部エンコーディングとロケールが +有効かどうかに依存します。 +ソース文字列が UTF-8 形式なら、C<\d> は数字 '0' - '9' だけでなく、Arabic, +Devanagari およびその他の言語の数字もマッチングします。 +さもなければ、ロケールが有効なら、ロケールが数字と考える文字に +マッチングします。 +ロケールがなければ、C<\d> は '0' から '9' の数字にマッチングします。 +L</Locale, EBCDIC, Unicode and UTF-8> を参照してください。 + +=begin original + +Any character that isn't matched by C<\d> will be matched by C<\D>. + +=end original + +C<\d> にマッチングしない任意の文字は C<\D> にマッチングします。 + +=head3 Word characters + +(単語文字) + +=begin original + +A C<\w> matches a single alphanumeric character (an alphabetic character, or a +decimal digit) or an underscore (C<_>), not a whole word. Use C<\w+> to match +a string of Perl-identifier characters (which isn't the same as matching an +English word). What is considered a word character depends on the internal +encoding of the string and the locale or EBCDIC code page that is in effect. If +it's in UTF-8 format, C<\w> matches those characters that are considered word +characters in the Unicode database. That is, it not only matches ASCII letters, +but also Thai letters, Greek letters, etc. If the source string isn't in UTF-8 +format, C<\w> matches those characters that are considered word characters by +the current locale or EBCDIC code page. Without a locale or EBCDIC code page, +C<\w> matches the ASCII letters, digits and the underscore. +See L</Locale, EBCDIC, Unicode and UTF-8>. + +=end original + +C<\w> は単語全体ではなく、単一の英数字(つまり英字または数字)または +下線(C<_>)にマッチングします。 +Use C<\w+> to match +a string of Perl-identifier characters (which isn't the same as matching an +English word). +何が単語文字と考えられるかは文字列の内部エンコーディングと有効なロケールや +EBCDIC コードページに依存します。 +UTF-8 形式の場合、C<\w> は Unicode データベースで単語文字と考えられるものに +マッチングします。 +これは、ASCII の文字だけではなく、タイの文字、ギリシャの文字、などにも +マッチングするということです。 +ソース文字列が UTF-8 形式でない場合、C<\w> は現在のロケールや +EBCDIC コードページで単語文字と考えられるものにマッチングします。 +ロケールや EBCDIC コードページなしの場合、C<\w> は ASCII 文字、数字、下線に +マッチングします。 +L</Locale, EBCDIC, Unicode and UTF-8> を参照してください。 +(TBT) + +=begin original + +Any character that isn't matched by C<\w> will be matched by C<\W>. + +=end original + +C<\w> にマッチングしない任意の文字は C<\W> にマッチングします。 + +=head3 Whitespace + +(空白) + +=begin original + +C<\s> matches any single character that is considered whitespace. In the ASCII +range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form +feed (C<\f>), the carriage return (C<\r>), and the space. (The vertical tab, +C<\cK> is not matched by C<\s>.) The exact set of characters matched by C<\s> +depends on whether the source string is in UTF-8 format and the locale or +EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what +is considered whitespace in the Unicode database; the complete list is in the +table below. Otherwise, if there is a locale or EBCDIC code page in effect, +C<\s> matches whatever is considered whitespace by the current locale or EBCDIC +code page. Without a locale or EBCDIC code page, C<\s> matches the five +characters mentioned in the beginning of this paragraph. Perhaps the most +notable possible surprise is that C<\s> matches a non-breaking space only if +the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC +code page that is in effect has that character. +See L</Locale, EBCDIC, Unicode and UTF-8>. + +=end original + +C<\s> は空白と考えられる単一の文字にマッチングします。 +ASCII の範囲では、C<\s> は水平タブ(C<\t>)、改行(C<\n>)、ページ送り(C<\f>)、 +復帰(C<\r>)、スペースにマッチングします。 +(垂直タブ C<\cK> は C<\s> にマッチングしません。) +C<\s> がマッチングする文字の正確な集合はソース文字列が UTF-8 形式かどうかや +有効なロケールおよび EBCDIC コードページに依存します。 +UTF-8 形式なら、C<\s> は Unicode データベースで空白と考えられるものに +マッチングします; 完全な一覧は後述の表にあります。 +さもなければ、ロケールや EBICDIC コードページが有効なら、C<\s> は現在の +ロケールや EBCDIC コードページで空白と考えられるものにマッチングします。 +ロケールや EBCDIC コードページなしでは、C<\s> はこの段落の始めに言及した +五つの文字にマッチングします。 +おそらくもっとも顕著な驚きは、non-breaking space は UTF-8 エンコードされた +文字列にあるか、ロケールまたは EBCDIC コードページが有効な場合にのみ、 +C<\s> にマッチングするということです。 +L</Locale, EBCDIC, Unicode and UTF-8> を参照してください。 + +=begin original + +Any character that isn't matched by C<\s> will be matched by C<\S>. + +=end original + +C<\s> にマッチングしない任意の文字は C<\S> にマッチングします。 + +=begin original + +C<\h> will match any character that is considered horizontal whitespace; +this includes the space and the tab characters and 17 other characters that are +listed in the table below. C<\H> will match any character +that is not considered horizontal whitespace. + +=end original + +C<\h> は水平空白と考えられる任意の文字にマッチングします; これはスペースと +タブ文字および +17 other characters that are +listed in the table below. +C<\H> は水平空白と考えられない文字にマッチングします。 +(TBT) + +=begin original + +C<\N> is new in 5.12, and is experimental. It, like the dot, will match any +character that is not a newline. The difference is that C<\N> will not be +influenced by the single line C</s> regular expression modifier. Note that +there is a second meaning of C<\N> when of the form C<\N{...}>. This form is +for named characters. See L<charnames> for those. If C<\N> is followed by an +opening brace and something that is not a quantifier, perl will assume that a +character name is coming, and not this meaning of C<\N>. For example, C<\N{3}> +means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines, +but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to +look for characters named C<4F> or C<F4>, respectively (and won't find them, +thus raising an error, unless they have been defined using custom names). + +=end original + +C<\N> is new in 5.12, and is experimental. It, like the dot, will match any +character that is not a newline. The difference is that C<\N> will not be +influenced by the single line C</s> regular expression modifier. Note that +there is a second meaning of C<\N> when of the form C<\N{...}>. This form is +for named characters. See L<charnames> for those. If C<\N> is followed by an +opening brace and something that is not a quantifier, perl will assume that a +character name is coming, and not this meaning of C<\N>. For example, C<\N{3}> +means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines, +but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to +look for characters named C<4F> or C<F4>, respectively (and won't find them, +thus raising an error, unless they have been defined using custom names). +(TBT) + +=begin original + +C<\v> will match any character that is considered vertical whitespace; +this includes the carriage return and line feed characters (newline) plus 5 +other characters listed in the table below. +C<\V> will match any character that is not considered vertical whitespace. + +=end original + +C<\v> は垂直空白と考えられる任意の文字にマッチングします; これは復帰と +行送り(改行)文字 plus 5 +other characters listed in the table below です。 +C<\V> は垂直空白と考えられない任意の文字にマッチングします。 +(TBT) + +=begin original + +C<\R> matches anything that can be considered a newline under Unicode +rules. It's not a character class, as it can match a multi-character +sequence. Therefore, it cannot be used inside a bracketed character +class; use C<\v> instead (vertical whitespace). +Details are discussed in L<perlrebackslash>. + +=end original + +C<\R> は Unicode の規則で改行と考えられるものにマッチングします。 +複数文字の並びにマッチングすることもあるので、これは +文字クラスではありません。 +従って、大かっこ文字クラスの中では使えません; use C<\v> instead (vertical whitespace). +詳細は L<perlrebackslash> で議論しています。 +(TBT) + +=begin original + +Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match +the same characters, regardless whether the source string is in UTF-8 +format or not. The set of characters they match is also not influenced +by locale nor EBCDIC code page. + +=end original + +C<\s>, C<\d>, C<\w> と違って、C<\h> および C<\v> はソース文字列が +UTF-8 形式かどうかに関わらず同じ文字にマッチングします。 +マッチングする文字の集合はロケールや EBCDIC コードページの影響も受けません。 + +=begin original + +One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The +vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered +vertical whitespace. Furthermore, if the source string is not in UTF-8 format, +and any locale or EBCDIC code page that is in effect doesn't include them, the +next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not +matched by C<\s>, but are by C<\v> and C<\h> respectively. If the source +string is in UTF-8 format, both the next line and the no-break space are +matched by C<\s>. + +=end original + +C<\s> が C<[\h\v]> と等価と考える人がいるかもしれません。 +これは正しくありません。 +垂直タブ (C<"\x0b">) は C<\s> にマッチングしませんが、垂直空白と +考えられます。 +さらに、ソース文字列が UTF-8 形式でなければ、 +and any locale or EBCDIC code page that is in effect doesn't include them, +next line (C<"\x85">) と +no-break space (C<"\xA0">) 文字は C<\s> にマッチングしませんが、 +それぞれ C<\v> および C<\h> にはマッチングします。 +ソース文字列が UTF-8 形式なら、next line と no-break space は C<\s> に +マッチングします。 +(TBT) + +=begin original + +The following table is a complete listing of characters matched by +C<\s>, C<\h> and C<\v> as of Unicode 5.2. + +=end original + +以下の表は Unicode 5.2 現在で C<\s>, C<\h>, C<\v> にマッチングする文字の +完全な一覧です。 + +=begin original + +The first column gives the code point of the character (in hex format), +the second column gives the (Unicode) name. The third column indicates +by which class(es) the character is matched (assuming no locale or EBCDIC code +page is in effect that changes the C<\s> matching). + +=end original + +最初の列は文字の符号位置(16 進形式)、2 番目の列は (Unicode の)名前です。 +3 番目の列はどのクラスにマッチングするかを示しています +(assuming no locale or EBCDIC code +page is in effect that changes the C<\s> matching)。 +(TBT) + + 0x00009 CHARACTER TABULATION h s + 0x0000a LINE FEED (LF) vs + 0x0000b LINE TABULATION v + 0x0000c FORM FEED (FF) vs + 0x0000d CARRIAGE RETURN (CR) vs + 0x00020 SPACE h s + 0x00085 NEXT LINE (NEL) vs [1] + 0x000a0 NO-BREAK SPACE h s [1] + 0x01680 OGHAM SPACE MARK h s + 0x0180e MONGOLIAN VOWEL SEPARATOR h s + 0x02000 EN QUAD h s + 0x02001 EM QUAD h s + 0x02002 EN SPACE h s + 0x02003 EM SPACE h s + 0x02004 THREE-PER-EM SPACE h s + 0x02005 FOUR-PER-EM SPACE h s + 0x02006 SIX-PER-EM SPACE h s + 0x02007 FIGURE SPACE h s + 0x02008 PUNCTUATION SPACE h s + 0x02009 THIN SPACE h s + 0x0200a HAIR SPACE h s + 0x02028 LINE SEPARATOR vs + 0x02029 PARAGRAPH SEPARATOR vs + 0x0202f NARROW NO-BREAK SPACE h s + 0x0205f MEDIUM MATHEMATICAL SPACE h s + 0x03000 IDEOGRAPHIC SPACE h s + +=over 4 + +=item [1] + +=begin original + +NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in +UTF-8 format, or the locale or EBCDIC code page that is in effect includes them. + +=end original + +NEXT LINE と NO-BREAK SPACE はソース文字列が UTF-8 形式か +or the locale or EBCDIC code page that is in effect includes them +C<\s> にマッチングします。 +(TBT) + +=back + +=begin original + +It is worth noting that C<\d>, C<\w>, etc, match single characters, not +complete numbers or words. To match a number (that consists of integers), +use C<\d+>; to match a word, use C<\w+>. + +=end original + +C<\d>, C<\w> などは単語や数値全体ではなく単一の文字にマッチングすると +いうことは注意する価値があります。 +(整数で構成される)数値にマッチングするには、C<\d+> を使ってください; 単語に +マッチングするには、C<\w+> を使ってください。 + +=head3 Unicode Properties + +(Unicode 特性) + +=begin original + +C<\pP> and C<\p{Prop}> are character classes to match characters that fit given +Unicode properties. One letter property names can be used in the C<\pP> form, +with the property name following the C<\p>, otherwise, braces are required. +When using braces, there is a single form, which is just the property name +enclosed in the braces, and a compound form which looks like C<\p{name=value}>, +which means to match if the property "name" for the character has the particular +"value". +For instance, a match for a number can be written as C</\pN/> or as +C</\p{Number}/>, or as C</\p{Number=True}/>. +Lowercase letters are matched by the property I<Lowercase_Letter> which +has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or +C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> +(the underscores are optional). +C</\pLl/> is valid, but means something different. +It matches a two character string: a letter (Unicode property C<\pL>), +followed by a lowercase C<l>. + +=end original + +C<\pP> と C<\p{Prop}> は指定された Unicode 特性に一致する文字に +マッチングする文字クラスです。 +一文字特性は C<\pP> 形式で、C<\p> に引き続いて特性名です; さもなければ +中かっこが必要です。 +When using braces, there is a single form, which is just the property name +enclosed in the braces, and a compound form which looks like C<\p{name=value}>, +which means to match if the property "name" for the character has the particular +"value". +例えば、数字にマッチングするものは C</\pN/> または C</\p{Number}/> または +C</\p{Number=True}/> と書けます。 +小文字は I<LowercaseLetter> 特性にマッチングします; これには +I<Ll> と言う短縮形式があります。 +中かっこが必要なので、C</\p{Ll}/> または C</\p{Lowercase_Letter}/> または +C</\p{General_Category=Lowercase_Letter}/> と書きます(下線はオプションです)。 +C</\pLl/> も妥当ですが、違う意味になります。 +これは 2 文字にマッチングします: 英字 (Unicode 特性 C<\pL>)に引き続いて +小文字の C<l> です。 +(TBT) + +=begin original + +For more details, see L<perlunicode/Unicode Character Properties>; for a +complete list of possible properties, see +L<perluniprops/Properties accessible through \p{} and \P{}>. +It is also possible to define your own properties. This is discussed in +L<perlunicode/User-Defined Character Properties>. + +=end original + +For more details, see L<perlunicode/Unicode Character Properties>; for a +complete list of possible properties, see +L<perluniprops/Properties accessible through \p{} and \P{}>. +独自の特性を定義することも可能です。 +これは L<perlunicode/User-Defined Character Properties> で議論されています。 +(TBT) + +=head4 Examples + +(例) + +=begin original + + "a" =~ /\w/ # Match, "a" is a 'word' character. + "7" =~ /\w/ # Match, "7" is a 'word' character as well. + "a" =~ /\d/ # No match, "a" isn't a digit. + "7" =~ /\d/ # Match, "7" is a digit. + " " =~ /\s/ # Match, a space is whitespace. + "a" =~ /\D/ # Match, "a" is a non-digit. + "7" =~ /\D/ # No match, "7" is not a non-digit. + " " =~ /\S/ # No match, a space is not non-whitespace. + +=end original + + "a" =~ /\w/ # マッチング; "a" は「単語」文字。 + "7" =~ /\w/ # マッチング; "7" も「単語」文字。 + "a" =~ /\d/ # マッチングしない; "a" は数字ではない。 + "7" =~ /\d/ # マッチング; "7" は数字。 + " " =~ /\s/ # マッチング; スペースは空白。 + "a" =~ /\D/ # マッチング; "a" は非数字。 + "7" =~ /\D/ # マッチングしない; "7" は非数字ではない。 + " " =~ /\S/ # マッチングしない; スペースは非空白ではない。 + +=begin original + + " " =~ /\h/ # Match, space is horizontal whitespace. + " " =~ /\v/ # No match, space is not vertical whitespace. + "\r" =~ /\v/ # Match, a return is vertical whitespace. + +=end original + + " " =~ /\h/ # マッチング; スペースは水平空白。 + " " =~ /\v/ # マッチングしない; スペースは垂直空白ではない。 + "\r" =~ /\v/ # マッチング; 復帰は垂直空白。 + +=begin original + + "a" =~ /\pL/ # Match, "a" is a letter. + "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. + +=end original + + "a" =~ /\pL/ # マッチング; "a" は英字。 + "a" =~ /\p{Lu}/ # マッチングしない; /\p{Lu}/ は大文字にマッチングする。 + +=begin original + + "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character + # 'THAI CHARACTER SO SO', and that's in + # Thai Unicode class. + "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. + +=end original + + "\x{0e0b}" =~ /\p{Thai}/ # マッチング; \x{0e0b} は文字 + # 'THAI CHARACTER SO SO' で、これは + # Thai Unicode クラスにある。 + "a" =~ /\P{Lao}/ # マッチング; "a" はラオス文字ではない。 + +=head2 Bracketed Character Classes + +(かっこ付き文字クラス) + +=begin original + +The third form of character class you can use in Perl regular expressions +is the bracketed form. In its simplest form, it lists the characters +that may be matched, surrounded by square brackets, like this: C<[aeiou]>. +This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other +character classes, exactly one character will be matched. To match +a longer string consisting of characters mentioned in the character +class, follow the character class with a quantifier. For instance, +C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. + +=end original + +Perl 正規表現で使える文字クラスの第 3 の形式は大かっこ形式です。 +もっとも単純な形式では、以下のように大かっこの中にマッチングする文字を +リストします: C<[aeiou]>. +これは C<a>, C<e>, C<i>, C<o>, C<u> のどれかにマッチングします。 +他の文字クラスと同様、正確に一つの文字にマッチングします。 +文字クラスで言及した文字で構成されるより長い文字列にマッチングするには、 +文字クラスに量指定子を付けます。 +例えば、C<[aeiou]+> は一つまたはそれ以上の小文字 ASCII 母音に +マッチングします。 + +=begin original + +Repeating a character in a character class has no +effect; it's considered to be in the set only once. + +=end original + +文字クラスの中で文字を繰り返しても効果はありません; 一度だけ現れたものと +考えられます。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. + "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. + "ae" =~ /^[aeiou]$/ # No match, a character class only matches + # a single character. + "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. + +=end original + + "e" =~ /[aeiou]/ # マッチング; "e" はクラスにある。 + "p" =~ /[aeiou]/ # マッチングしない; "p" はクラスにない。 + "ae" =~ /^[aeiou]$/ # マッチングしない; 一つの文字クラスは + # 一文字だけにマッチングする。 + "ae" =~ /^[aeiou]+$/ # マッチング; 量指定子により。 + +=head3 Special Characters Inside a Bracketed Character Class + +(かっこ付き文字クラスの中の特殊文字) + +=begin original + +Most characters that are meta characters in regular expressions (that +is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose +their special meaning and can be used inside a character class without +the need to escape them. For instance, C<[()]> matches either an opening +parenthesis, or a closing parenthesis, and the parens inside the character +class don't group or capture. + +=end original + +正規表現内でメタ文字(つまり、C<.>, C<*>, C<(> のように特別な意味を持つ +文字)となるほとんどの文字は文字クラス内ではエスケープしなくても特別な意味を +失うので、エスケープする必要はありません。 +例えば、C<[()]> は開きかっこまたは閉じかっこにマッチングし、文字クラスの中の +かっこはグループや捕捉にはなりません。 + +=begin original + +Characters that may carry a special meaning inside a character class are: +C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be +escaped with a backslash, although this is sometimes not needed, in which +case the backslash may be omitted. + +=end original + +文字クラスの中でも特別な意味を持つ文字は: +C<\>, C<^>, C<->, C<[>, C<]> で、以下で議論します。 +これらは逆スラッシュでエスケープできますが、不要な場合もあり、そのような +場合では逆スラッシュは省略できます。 + +=begin original + +The sequence C<\b> is special inside a bracketed character class. While +outside the character class C<\b> is an assertion indicating a point +that does not have either two word characters or two non-word characters +on either side, inside a bracketed character class, C<\b> matches a +backspace character. + +=end original + +シーケンス C<\b> は大かっこ文字クラスの内側では特別です。 +文字クラスの外側では C<\b> 二つの単語文字か二つの非単語文字のどちらかではない +位置を示す表明ですが、大かっこ文字クラスの内側では C<\b> は後退文字に +マッチングします。 + +=begin original + +The sequences +C<\a>, +C<\c>, +C<\e>, +C<\f>, +C<\n>, +C<\N{I<NAME>}>, +C<\N{U+I<wide hex char>}>, +C<\r>, +C<\t>, +and +C<\x> +are also special and have the same meanings as they do outside a bracketed character +class. + +=end original + +The sequences +C<\a>, +C<\c>, +C<\e>, +C<\f>, +C<\n>, +C<\N{I<NAME>}>, +C<\N{U+I<wide hex char>}>, +C<\r>, +C<\t>, +and +C<\x> +are also special and have the same meanings as they do outside a bracketed character +class. +(TBT) + +=begin original + +Also, a backslash followed by two or three octal digits is considered an octal +number. + +=end original + +Also, a backslash followed by two or three octal digits is considered an octal +number. +(TBT) + +=begin original + +A C<[> is not special inside a character class, unless it's the start +of a POSIX character class (see below). It normally does not need escaping. + +=end original + +C<[> は、POSIX 文字クラス(後述)の開始でない限りは文字クラスの中では +特別ではありません。 +これは普通エスケープは不要です。 + +=begin original + +A C<]> is normally either the end of a POSIX character class (see below), or it +signals the end of the bracketed character class. If you want to include a +C<]> in the set of characters, you must generally escape it. +However, if the C<]> is the I<first> (or the second if the first +character is a caret) character of a bracketed character class, it +does not denote the end of the class (as you cannot have an empty class) +and is considered part of the set of characters that can be matched without +escaping. + +=end original + +A C<]> は普通は POSIX 文字クラス(後述)の終わりか、大かっこ文字クラスの終了を +示すかどちらかです。 +文字集合に C<]> を含める必要がある場合、一般的には +エスケープしなければなりません。 +しかし、C<]> が大かっこ文字クラスの I<最初> (または最初の文字がキャレットなら +2 番目) の文字の場合、(空クラスを作ることはできないので)これはクラスの +終了を意味せず、エスケープなしでマッチングできる文字の集合の一部と +考えられます。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "+" =~ /[+?*]/ # Match, "+" in a character class is not special. + "\cH" =~ /[\b]/ # Match, \b inside in a character class + # is equivalent to a backspace. + "]" =~ /[][]/ # Match, as the character class contains. + # both [ and ]. + "[]" =~ /[[]]/ # Match, the pattern contains a character class + # containing just ], and the character class is + # followed by a ]. + +=end original + + "+" =~ /[+?*]/ # マッチング; 文字クラス内の "+" は特別ではない。 + "\cH" =~ /[\b]/ # マッチング; 文字クラスの内側の \b は後退と + # 等価。 + "]" =~ /[][]/ # マッチング; 文字クラスに [ と ] の両方を + # 含んでいる。 + "[]" =~ /[[]]/ # マッチング; パターンは ] だけを含んでいる + # 文字クラスと、それに引き続く + # ] からなる。 + +=head3 Character Ranges + +(文字範囲) + +=begin original + +It is not uncommon to want to match a range of characters. Luckily, instead +of listing all the characters in the range, one may use the hyphen (C<->). +If inside a bracketed character class you have two characters separated +by a hyphen, it's treated as if all the characters between the two are in +the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> +matches any lowercase letter from the first half of the ASCII alphabet. + +=end original + +文字のある範囲にマッチングしたいというのは珍しくありません。 +幸運なことに、その範囲の文字を全て一覧に書く代わりに、ハイフン (C<->) を +使えます。 +大かっこ文字クラスの内側で二つの文字がハイフンで区切られていると、 +二つの文字の間の全ての文字がクラスに書かれているかのように扱われます。 +例えば、C<[0-9]> は任意の ASCII 数字にマッチングし、C<[a-m]> は +ASCII アルファベットの前半分の小文字にマッチングします。 + +=begin original + +Note that the two characters on either side of the hyphen are not +necessary both letters or both digits. Any character is possible, +although not advisable. C<['-?]> contains a range of characters, but +most people will not know which characters that will be. Furthermore, +such ranges may lead to portability problems if the code has to run on +a platform that uses a different character set, such as EBCDIC. + +=end original + +ハイフンのそれぞれの側の二つの文字は両方とも英字であったり両方とも +数字であったりする必要はありませんが、勧められないことに注意してください。 +C<['-?]> は文字の範囲を含みますが、ほとんどの人はどの文字が含まれるか +分かりません。 +さらに、このような範囲は、コードが EBCDIC のような異なった文字集合を使う +プラットフォームで実行されると移植性の問題を引き起こします。 + +=begin original + +If a hyphen in a character class cannot syntactically be part of a range, for +instance because it is the first or the last character of the character class, +or if it immediately follows a range, the hyphen isn't special, and will be +considered a character that may be matched literally. You have to escape the +hyphen with a backslash if you want to have a hyphen in your set of characters +to be matched, and its position in the class is such that it could be +considered part of a range. + +=end original + +例えば文字クラスの最初または最後であったり、範囲の直後のために、文字クラスの +中のハイフンが文法的に範囲の一部となれない場合、ハイフンは特別ではなく、 +リテラルにマッチングするべき文字として扱われます。 +マッチングする文字の集合にハイフンを入れたいけれどもその位置が範囲の +一部として考えられる場合はハイフンを逆スラッシュでエスケープする +必要があります。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + [a-z] # Matches a character that is a lower case ASCII letter. + [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or + # the letter 'z'. + [-z] # Matches either a hyphen ('-') or the letter 'z'. + [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the + # hyphen ('-'), or the letter 'm'. + ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? + # (But not on an EBCDIC platform). + +=end original + + [a-z] # 小文字 ASCII 英字にマッチング。 + [a-fz] # 'a' から 'f' の英字およびと 'z' の英字に + # マッチング。 + [-z] # ハイフン ('-') または英字 'z' にマッチング。 + [a-f-m] # 'a' から 'f' の英字、ハイフン ('-')、英字 'm' に + # マッチング。 + ['-?] # 文字 '()*+,-./0123456789:;<=>? のどれかにマッチング + # (しかし EBCDIC プラットフォームでは異なります)。 + +=head3 Negation + +(否定) + +=begin original + +It is also possible to instead list the characters you do not want to +match. You can do so by using a caret (C<^>) as the first character in the +character class. For instance, C<[^a-z]> matches a character that is not a +lowercase ASCII letter. + +=end original + +代わりにマッチングしたくない文字の一覧を指定することも可能です。 +文字クラスの先頭の文字としてキャレット (C<^>) を使うことで実現します。 +例えば、C<[^a-z]> 小文字の ASCII 英字以外の文字にマッチングします。 + +=begin original + +This syntax make the caret a special character inside a bracketed character +class, but only if it is the first character of the class. So if you want +to have the caret as one of the characters you want to match, you either +have to escape the caret, or not list it first. + +=end original + +この文法はキャレットを大かっこ文字クラスの内側で特別な文字にしますが、 +クラスの最初の文字の場合のみです。 +それでマッチングしたい文字の一つでキャレットを使いたい場合、キャレットを +エスケープするか、最初以外の位置に書く必要があります。 + +=begin original + +Examples: + +=end original + +例: + +=begin original + + "e" =~ /[^aeiou]/ # No match, the 'e' is listed. + "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. + "^" =~ /[^^]/ # No match, matches anything that isn't a caret. + "^" =~ /[x^]/ # Match, caret is not special here. + +=end original + + "e" =~ /[^aeiou]/ # マッチングしない; 'e' がある。 + "x" =~ /[^aeiou]/ # マッチング; 'x' は小文字の母音ではない。 + "^" =~ /[^^]/ # マッチングしない; キャレット以外全てにマッチング。 + "^" =~ /[x^]/ # マッチング; キャレットはここでは特別ではない。 + +=head3 Backslash Sequences + +(逆スラッシュシーケンス) + +=begin original + +You can put any backslash sequence character class (with the exception of +C<\N>) inside a bracketed character class, and it will act just +as if you put all the characters matched by the backslash sequence inside the +character class. For instance, C<[a-f\d]> will match any digit, or any of the +lowercase letters between 'a' and 'f' inclusive. + +=end original + +大かっこ文字クラスの中に(C<\N> を例外として)逆スラッシュシーケンス +文字クラスを置くことができ、逆スラッシュシーケンスにマッチングする全ての +文字を文字クラスの中に置いたかのように動作します。 +例えば、C<[a-f\d]> は任意の数字、あるいは 'a' から 'f' までの小文字に +マッチングします。 + +=begin original + +C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or +C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a +bracketed character class loses its special meaning: it matches nearly +anything, which generally isn't what you want to happen. + +=end original + +C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or +C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a +bracketed character class loses its special meaning: it matches nearly +anything, which generally isn't what you want to happen. +(TBT) + +=begin original + +Examples: + +=end original + +例: + +=begin original + + /[\p{Thai}\d]/ # Matches a character that is either a Thai + # character, or a digit. + /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic + # character, nor a parenthesis. + +=end original + + /[\p{Thai}\d]/ # タイ文字または数字の文字に + # マッチングする。 + /[^\p{Arabic}()]/ # アラビア文字でもかっこでもない文字に + # マッチングする。 + +=begin original + +Backslash sequence character classes cannot form one of the endpoints +of a range. + +=end original + +逆スラッシュシーケンス文字クラスは範囲の端点の一つにはできません。 + +=head3 Posix Character Classes +X<character class> X<\p> X<\p{}> +X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> +X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> + +(Posix 文字クラス) + +=begin original + +Posix character classes have the form C<[:class:]>, where I<class> is +name, and the C<[:> and C<:]> delimiters. Posix character classes only appear +I<inside> bracketed character classes, and are a convenient and descriptive +way of listing a group of characters, though they currently suffer from +portability issues (see below and L<Locale, EBCDIC, Unicode and UTF-8>). Be +careful about the syntax, + +=end original + +Posix 文字クラスは C<[:class:]> の形式で、I<class> は名前、C<[:> と C<:]> は +デリミタです。 +Posix 文字クラスは大かっこ文字クラスの I<内側> にのみ現れ、文字のグループを +一覧するのに便利で記述的な方法ですが、 +though they currently suffer from +portability issues (see below and L<Locale, EBCDIC, Unicode and UTF-8>). +文法について注意してください、 +(TBT) + + # Correct: + $string =~ /[[:alpha:]]/ + + # Incorrect (will warn): + $string =~ /[:alpha:]/ + +=begin original + +The latter pattern would be a character class consisting of a colon, +and the letters C<a>, C<l>, C<p> and C<h>. +These character classes can be part of a larger bracketed character class. For +example, + +=end original + +後者のパターンは、コロンおよび C<a>, C<l>, C<p>, C<h> の文字からなる +文字クラスです。 +These character classes can be part of a larger bracketed character class. For +example, +(TBT) + + [01[:alpha:]%] + +=begin original + +is valid and matches '0', '1', any alphabetic character, and the percent sign. + +=end original + +is valid and matches '0', '1', any alphabetic character, and the percent sign. +(TBT) + +=begin original + +Perl recognizes the following POSIX character classes: + +=end original + +Perl 以下の POSIX 文字クラスを認識します: + +=begin original + + alpha Any alphabetical character ("[A-Za-z]"). + alnum Any alphanumerical character. ("[A-Za-z0-9]") + ascii Any character in the ASCII character set. + blank A GNU extension, equal to a space or a horizontal tab ("\t"). + cntrl Any control character. See Note [2] below. + digit Any decimal digit ("[0-9]"), equivalent to "\d". + graph Any printable character, excluding a space. See Note [3] below. + lower Any lowercase character ("[a-z]"). + print Any printable character, including a space. See Note [4] below. + punct Any graphical character excluding "word" characters. Note [5]. + space Any whitespace character. "\s" plus the vertical tab ("\cK"). + upper Any uppercase character ("[A-Z]"). + word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". + xdigit Any hexadecimal digit ("[0-9a-fA-F]"). + +=end original + + alpha 任意の英字 ("[A-Za-z]")。 + alnum 任意の英数字。("[A-Za-z0-9]") + ascii 任意の ASCII 文字集合の文字。 + blank GNU 拡張; スペースまたは水平タブ ("\t") と同じ。 + cntrl 任意の制御文字。後述の [2] 参照。 + digit 任意の 10 進数字 ("[0-9]"); "\d" と等価。 + graph 任意の表示文字; スペースを除く。後述の [3] 参照。 + lower 任意の小文字 ("[a-z]")。 + print 任意の表示文字; スペースを含む。後述の [4] 参照。 + punct 任意の「単語」文字を除く表示文字。[5] 参照。 + space 任意の空白文字。"\s" に加えて水平タブ ("\cK")。 + upper 任意の大文字 ("[A-Z]")。 + word Perl 拡張 ("[A-Za-z0-9_]"); "\w" と等価。 + xdigit 任意の 16 進文字 ("[0-9a-fA-F]")。 + +=begin original + +Most POSIX character classes have two Unicode-style C<\p> property +counterparts. (They are not official Unicode properties, but Perl extensions +derived from official Unicode properties.) The table below shows the relation +between POSIX character classes and these counterparts. + +=end original + +Most POSIX character classes have two Unicode-style C<\p> property +counterparts. (They are not official Unicode properties, but Perl extensions +derived from official Unicode properties.) The table below shows the relation +between POSIX character classes and these counterparts. +(TBT) + +=begin original + +One counterpart, in the column labelled "ASCII-range Unicode" in +the table will only match characters in the ASCII range. (On EBCDIC platforms, +they match those characters which have ASCII equivalents.) + +=end original + +One counterpart, in the column labelled "ASCII-range Unicode" in +the table will only match characters in the ASCII range. (On EBCDIC platforms, +they match those characters which have ASCII equivalents.) +(TBT) + +=begin original + +The other counterpart, in the column labelled "Full-range Unicode", matches any +appropriate characters in the full Unicode character set. For example, +C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any +character in the entire Unicode character set that is considered to be +alphabetic. + +=end original + +The other counterpart, in the column labelled "Full-range Unicode", matches any +appropriate characters in the full Unicode character set. For example, +C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any +character in the entire Unicode character set that is considered to be +alphabetic. +(TBT) + +=begin original + +(Each of the counterparts has various synonyms as well. +L<perluniprops/Properties accessible through \p{} and \P{}> lists all the +synonyms, plus all the characters matched by each of the ASCII-range +properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, +and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) + +=end original + +(Each of the counterparts has various synonyms as well. +L<perluniprops/Properties accessible through \p{} and \P{}> lists all the +synonyms, plus all the characters matched by each of the ASCII-range +properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, +and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) +(TBT) + +=begin original + +Both the C<\p> forms are unaffected by any locale that is in effect, or whether +the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. +In contrast, the POSIX character classes are affected. If the source string is +in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see +Note [5]) behave like their "Full-range" Unicode counterparts. If the source +string is not in UTF-8 format, and no locale is in effect, and the platform is +not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. +Otherwise, they behave based on the rules of the locale or EBCDIC code page. +It is proposed to change this behavior in a future release of Perl so that the +the UTF8ness of the source string will be irrelevant to the behavior of the +POSIX character classes. This means they will always behave in strict +accordance with the official POSIX standard. That is, if either locale or +EBCDIC code page is present, they will behave in accordance with those; if +absent, the classes will match only their ASCII-range counterparts. If you +disagree with this proposal, send email to C<perl5****@perl*****>. + +=end original + +Both the C<\p> forms are unaffected by any locale that is in effect, or whether +the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. +In contrast, the POSIX character classes are affected. If the source string is +in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see +Note [5]) behave like their "Full-range" Unicode counterparts. If the source +string is not in UTF-8 format, and no locale is in effect, and the platform is +not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. +Otherwise, they behave based on the rules of the locale or EBCDIC code page. +It is proposed to change this behavior in a future release of Perl so that the +the UTF8ness of the source string will be irrelevant to the behavior of the +POSIX character classes. This means they will always behave in strict +accordance with the official POSIX standard. That is, if either locale or +EBCDIC code page is present, they will behave in accordance with those; if +absent, the classes will match only their ASCII-range counterparts. If you +disagree with this proposal, send email to C<perl5****@perl*****>. +(TBT) + + [[:...:]] ASCII-range Full-range backslash Note + Unicode Unicode sequence + ----------------------------------------------------- + alpha \p{PosixAlpha} \p{Alpha} + alnum \p{PosixAlnum} \p{Alnum} + ascii \p{ASCII} + blank \p{PosixBlank} \p{Blank} = [1] + \p{HorizSpace} \h [1] + cntrl \p{PosixCntrl} \p{Cntrl} [2] + digit \p{PosixDigit} \p{Digit} \d + graph \p{PosixGraph} \p{Graph} [3] + lower \p{PosixLower} \p{Lower} + print \p{PosixPrint} \p{Print} [4] + punct \p{PosixPunct} \p{Punct} [5] + \p{PerlSpace} \p{SpacePerl} \s [6] + space \p{PosixSpace} \p{Space} [6] + upper \p{PosixUpper} \p{Upper} + word \p{PerlWord} \p{Word} \w + xdigit \p{ASCII_Hex_Digit} \p{XDigit} + +=over 4 + +=item [1] + +=begin original + +C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. + +=end original + +C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. +(TBT) + +=item [2] + +=begin original + +Control characters don't produce output as such, but instead usually control +the terminal somehow: for example newline and backspace are control characters. +In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, +plus 127 (C<DEL>) are control characters. + +=end original + +制御文字はそれ自体は出力されず、普通は何か端末を制御します: 例えば +改行と後退は制御文字です。 +In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, +plus 127 (C<DEL>) are control characters. +(TBT) + +=begin original + +On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> +to be the EBCDIC equivalents of the ASCII controls, plus the controls +that in Unicode have ordinals from 128 through 139. + +=end original + +On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> +to be the EBCDIC equivalents of the ASCII controls, plus the controls +that in Unicode have ordinals from 128 through 139. +(TBT) + +=item [3] + +=begin original + +Any character that is I<graphical>, that is, visible. This class consists +of all the alphanumerical characters and all punctuation characters. + +=end original + +I<graphical>、つまり見える文字。 +このクラスは全ての英数字と全ての句読点文字。 + +=item [4] + +=begin original + +All printable characters, which is the set of all the graphical characters +plus whitespace characters that are not also controls. + +=end original + +全ての表示可能な文字; 全ての graphical 文字に加えて制御文字でない空白文字。 + +=item [5] + +=begin original + +C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the +non-controls, non-alphanumeric, non-space characters: +C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, +it could alter the behavior of C<[[:punct:]]>). + +=end original + +C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the +non-controls, non-alphanumeric, non-space characters: +C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, +it could alter the behavior of C<[[:punct:]]>). +(TBT) + +=begin original + +When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above +set, plus what C<\p{Punct}> matches. This is different than strictly matching +according to C<\p{Punct}>, because the above set includes characters that aren't +considered punctuation by Unicode, but rather "symbols". Another way to say it +is that for a UTF-8 string, C<[[:punct:]]> matches all the characters that +Unicode considers to be punctuation, plus all the ASCII-range characters that +Unicode considers to be symbols. + +=end original + +When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above +set, plus what C<\p{Punct}> matches. This is different than strictly matching +according to C<\p{Punct}>, because the above set includes characters that aren't +considered punctuation by Unicode, but rather "symbols". Another way to say it +is that for a UTF-8 string, C<[[:punct:]]> matches all the characters that +Unicode considers to be punctuation, plus all the ASCII-range characters that +Unicode considers to be symbols. +(TBT) + +=item [6] + +=begin original + +C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally +matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms. + +=end original + +C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally +matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms. +(TBT) + +=back + +=head4 Negation +X<character class, negation> + +(否定) + +=begin original + +A Perl extension to the POSIX character class is the ability to +negate it. This is done by prefixing the class name with a caret (C<^>). +Some examples: + +=end original + +POSIX 文字クラスに対する Perl の拡張は否定の機能です。 +これはクラス名の前にキャレット (C<^>) を置くことで実現します。 +いくつかの例です: + + POSIX ASCII-range Full-range backslash + Unicode Unicode sequence + ----------------------------------------------------- + [[:^digit:]] \P{PosixDigit} \P{Digit} \D + [[:^space:]] \P{PosixSpace} \P{Space} + \P{PerlSpace} \P{SpacePerl} \S + [[:^word:]] \P{PerlWord} \P{Word} \W + +=head4 [= =] and [. .] + +([= =] と [. .]) + +=begin original + +Perl will recognize the POSIX character classes C<[=class=]>, and +C<[.class.]>, but does not (yet?) support them. Use of +such a construct will lead to an error. + +=end original + +Perl は POSIX 文字クラス C<[=class=]> と C<[.class.]> を認識しますが、 +これらには(まだ?)対応していません。 +このような構文の使用はエラーを引き起こします。 + +=head4 Examples + +(例) + +=begin original + + /[[:digit:]]/ # Matches a character that is a digit. + /[01[:lower:]]/ # Matches a character that is either a + # lowercase letter, or '0' or '1'. + /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything + # except the letters 'a' to 'f'. This is + # because the main character class is composed + # of two POSIX character classes that are ORed + # together, one that matches any digit, and + # the other that matches anything that isn't a + # hex digit. The result matches all + # characters except the letters 'a' to 'f' and + # 'A' to 'F'. + +=end original + + /[[:digit:]]/ # 数字の文字にマッチングする。 + /[01[:lower:]]/ # 小文字、'0'、'1' のいずれかの文字に + # マッチングする。 + /[[:digit:][:^xdigit:]]/ # 'a' から 'f' 以外の任意の文字に + # マッチング。これはメインの文字クラスでは二つの + # POSIX 文字クラスが OR され、一つは任意の数字に + # マッチングし、もう一つは 16 進文字でない全ての + # 文字にマッチングします。従って + # 'a' から 'f' および 'A' から 'F' を + # 除く全ての文字に + # マッチングすることになります。 + +=head2 Locale, EBCDIC, Unicode and UTF-8 + +(ロケール、EBCDIC、Unicode、UTF-8) + +=begin original + +Some of the character classes have a somewhat different behaviour depending +on the internal encoding of the source string, and the locale that is +in effect, and if the program is running on an EBCDIC platform. + +=end original + +ソース文字列の内部エンコーディングと有効なロケールに、そしてプログラムが +EBCDIC プラットフォームで実行されるかどうかに依存して少し異なった +振る舞いをする文字クラスもあります。 + +=begin original + +C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations, +including C<\W>, C<\D>, C<\S>) suffer from this behaviour. (Since the backslash +sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are +affected.) + +=end original + +C<\w>, C<\d>, C<\s> および POSIX 文字クラス (および C<\W>, C<\D>, C<\S> を +含むこれらの否定) はこの振る舞いの影響を受けます。 +(Since the backslash +sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are +affected.) +(TBT) + +=begin original + +The rule is that if the source string is in UTF-8 format, the character +classes match according to the Unicode properties. If the source string +isn't, then the character classes match according to whatever locale or EBCDIC +code page is in effect. If there is no locale nor EBCDIC, they match the ASCII +defaults (52 letters, 10 digits and underscore for C<\w>; 0 to 9 for C<\d>; +etc.). + +=end original + +ソース文字列が UTF-8 形式なら、文字クラスは Unicode 特性に従って +マッチングするという規則です。 +ソース文字列が UTF-8 形式ではなければ、文字クラスはロケールや EBCDIC +コードページが有効かどうかに従ってマッチングします。 +ロケールや EBCDIC が有効でなければ、ASCII のデフォルト (C<\w> では 52 の英字、 +10 の数字と下線; C<\d> では 0 から 9; など) にマッチングします。 + +=begin original + +This usually means that if you are matching against characters whose C<ord()> +values are between 128 and 255 inclusive, your character class may match +or not depending on the current locale or EBCDIC code page, and whether the +source string is in UTF-8 format. The string will be in UTF-8 format if it +contains characters whose C<ord()> value exceeds 255. But a string may be in +UTF-8 format without it having such characters. See L<perluniprops/The +"Unicode Bug">. + +=end original + +これは普通、C<ord()> の値が 128 から 255 の範囲の文字にマッチングするなら、 +その文字クラスは現在のロケールや EBCDIC コードページ、およびソース文字列が +UTF-8 形式かどうかに依存してマッチングしたりしなかったりします。 +C<ord()> 値が 255 を超える文字が含まれているなら文字列は UTF-8 形式です。 +しかしそのような文字がなくても UTF-8 形式かもしれません。 +L<perluniprops/The "Unicode Bug"> を参照してください。 + +=begin original + +For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> +or the POSIX character classes, and use the Unicode properties instead. + +=end original + +移植性の理由により、C<\w>, C<\d>, C<\s> や POSIX 文字クラスは使わず、 +Unicode 特性を使う方が良いです。 + +=head4 Examples + +(例) + +=begin original + + $str = "\xDF"; # $str is not in UTF-8 format. + $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. + $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. + $str =~ /^\w/; # Match! $str is now in UTF-8 format. + chop $str; + $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. + +=end original + + $str = "\xDF"; # $str は UTF-8 形式ではない。 + $str =~ /^\w/; # マッチングしない; $str は UTF-8 形式ではない。 + $str .= "\x{0e0b}"; # ここで $str は UTF-8 形式。 + $str =~ /^\w/; # マッチング! $str は UTF-8 形式。 + chop $str; + $str =~ /^\w/; # まだマッチング! $str は UTF-8 形式のまま。 + +=cut + +=begin meta + +Translate: SHIRAKATA Kentaro <argra****@ub32*****> (5.10.1) +Status: in progress + +=end meta +