Wednesday, March 25, 2009

Pattern Modifiers regular expression UTF-8

Pattern Modifiers
The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers. Spaces and newlines are ignored in modifiers, other characters cause error.



i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower case letters.
m (PCRE_MULTILINE)
By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
x (PCRE_EXTENDED)
If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
e (PREG_REPLACE_EVAL)
If this modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single quotes, double quotes, backslashes and NULL chars will be escaped by backslashes in substituted backreferences.
Only preg_replace() uses this modifier; it is ignored by other PCRE functions.

A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the string which is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
X (PCRE_EXTRA)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.
J (PCRE_INFO_JCHANGED)
The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.





Differences From Perl PCRE Patterns
--------------------------------------------------------------------------------
Last updated: Fri, 20 Mar 2009

add a note User Contributed Notes
Possible modifiers in regex patterns
ebarnard at marathonmultimedia dot com
06-Feb-2007 10:35
When adding comments with the /x modifier, don't use the pattern delimiter in the comments. It may not be ignored in the comments area. Example:

<?php
$target = 'some text';
if(preg_match('/
e # Comments here
/x',$target)) {
print "Target 1 hit.\n";
}
if(preg_match('/
e # /Comments here with slash
/x',$target)) {
print "Target 1 hit.\n";
}
?>

prints "Target 1 hit." but then generates a PHP warning message for the second preg_match():

Warning: preg_match() [function.preg-match]: Unknown modifier 'C' in /ebarnard/x-modifier.php on line 11
varrah NO_GARBAGE_OR_SPAM AT mail DOT ru
03-Nov-2005 12:12
Spent a few days, trying to understand how to create a pattern for Unicode chars, using the hex codes. Finally made it, after reading several manuals, that weren't giving any practical PHP-valid examples. So here's one of them:

For example we would like to search for Japanese-standard circled numbers 1-9 (Unicode codes are 0x2460-0x2468) in order to make it through the hex-codes the following call should be used:
preg_match('/[\x{2460}-\x{2468}]/u', $str);

Here $str is a haystack string
\x{hex} - is an UTF-8 hex char-code
and /u is used for identifying the class as a class of Unicode chars.

Hope, it'll be useful.
hfuecks at nospam dot org
15-Jul-2005 02:14
Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5"

2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )

4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't;

<?php
$examples = array(
'Valid ASCII' => "a",
'Valid 2 Octet Sequence' => "\xc3\xb1",
'Invalid 2 Octet Sequence' => "\xc3\x28",
'Invalid Sequence Identifier' => "\xa0\xa1",
'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",

'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x28\x8c\x28",
'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
echo "$name\n";
preg_match("/".$str."/u",'Testing');
}

echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {

preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
echo "$name: ";

if ( count($ar) == 0 ) {
echo "Matched nothing!\n";
} else {
echo "Matched {$ar[0]}\n";
}

}

echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
preg_match_all('/./u', $str, $ar);
echo "$name: ";

$num_utf8_chars = count($ar[0]);
if ( $num_utf8_chars == 0 ) {
echo "Matched nothing!\n";
} else {
echo "Matched $num_utf8_chars character\n";
}

}
?>
csaba at alum dot mit dot edu
09-Apr-2005 12:40
Extracting lines of text:

You might want to grab a line of text within a multiline piece of text. For example, suppose you want to replace the first and last lines within the <body> portion of a web $page with your own $lineFirst and $lineLast. Here's one possible way:

<?php
$lineFirst = "This is a new first line<br>\r\n";
$lineLast  = "This is a new last line<br>\r\n";
$page = <<<EOD
<html><head>
<title>This is a test page</title>
</head><body>
This is the first line<br>
Hi Fred<br>
Hi Bill<br>
This is the last line<br>
</body>
</html>
EOD;
$re = "/<body>.*^(.+)(^.*?^)(.+)(^<\\/body>.*?)/smU";
if (preg_match($re, $page, $aMatch, PREG_OFFSET_CAPTURE))
$newPage = substr($text, 0, $aMatch[1][1]) .
$lineFirst . $aMatch[2][0] .
$lineLast . $aMatch[4][0];
print $newPage;
?>

The two (.+) are supposed to match the first and last lines within the <body> tag. The /s option (dot all) is needed so the .* can also match newlines. The /m option (multiline) is needed so that the ^ can match newlines. The /U option (ungreedy) is needed so that the .* and .+ will only gobble up the minimum number of characters necessary to get to the character following the * or +. The exception to this, however, is that the .*? temporarily overrides the /U setting on .* turning it from non greedy to greedy. In the middle, this ensures that all the lines except the first and last (within the <body> tag) are put into $aMatch[2]. At the end, it ensures that all the remaining characters in the string are gobbled up, which could also have been achieved by .*)\\z/ instead of .*?)/

Csaba Gabor from Vienna


drupal_validate_utf8
Drupal 4.7 Drupal 5 Drupal 6 Drupal 7 includes/bootstrap.inc, line 766

Versions 4.7 – 7 drupal_validate_utf8($text)
Checks whether a string is valid UTF-8.

All functions designed to filter input should use drupal_validate_utf8 to ensure they operate on valid UTF-8 strings to prevent bypass of the filter.

When text containing an invalid UTF-8 lead byte (0xC0 - 0xFF) is presented as UTF-8 to Internet Explorer 6, the program may misinterpret subsequent bytes. When these subsequent bytes are HTML control characters such as quotes or angle brackets, parts of the text that were deemed safe by filters end up in locations that are potentially unsafe; An onerror attribute that is outside of a tag, and thus deemed safe by a filter, can be interpreted by the browser as if it were inside the tag.

This function exploits preg_match behaviour (since PHP 4.3.5) when used with the u modifier, as a fast way to find invalid UTF-8. When the matched string contains an invalid byte sequence, it will fail silently.

preg_match may not fail on 4 and 5 octet sequences, even though they are not supported by the specification.

The specific preg_match behaviour is present since PHP 4.3.5.

Parameters
$text The text to check.

Return value
TRUE if the text is valid UTF-8, FALSE if not.

2 functions call drupal_validate_utf8()
2 functions call drupal_validate_utf8()
check_plain in includes/bootstrap.inc
Encode special characters in a plain-text string for display as HTML.
filter_xss in modules/filter/filter.module
Filters XSS. Based on kses by Ulf Harnhammar, see http://sourceforge.net/projects/kses
Code
<?php
function drupal_validate_utf8($text) {
if (strlen($text) == 0) {
return TRUE;
}
return (preg_match('/^./us', $text) == 1);
}
?>

1 comment:

CAPitalZ said...

Thanks a lot.

After 2 days of research using Google, I came to your site.

I did try the example you specified, but still the regex doesn't seems to match
here is what I used:
'/[\x{0B80}-\x{0BFF}]/xsu'
I have version PHP 5.2.8
I'm trying to see the passed string contain "Tamil" script characters or not.