Discussion:
REG_ICASE regex matching and negated bracket expr
Szabolcs Nagy
2014-08-19 22:59:21 UTC
In chapter 9, the case-insensitive matching of negated (^) bracket
expressions seems to be inconsistent with historical practice.

9.2 says that when matching a character to the pattern case
insensitively, the case counterpart should be tried as well.

9.3.5 says that a non-matching list bracket expression matches
any character except for the ones in the list.

So [^aBcC] should match 'a', 'A', 'b', 'B', 'd' or 'D', and not
match 'c' or 'C', because when matching e.g. 'a', both 'a' and 'A'
should be tried, and 'A' is not in the non-matching list, so it
matches.

I suspect that on historical implementations this pattern does
not match 'a', 'A', 'b' or 'B'.
(Because the non-matching list is checked case insensitively
and if the character or its case counterpart is on the list
then it does not match.)

Is this interpretation correct and are there implementations
that handle this case accordingly?

Example code to try (prints the matched characters):

#include <regex.h>
#include <stdio.h>

int main(void)
{
	regex_t re;

	/* basic regex, case insensitive; regcomp returns nonzero on failure */
	if (regcomp(&re, "[^aBcC]", REG_ICASE) != 0)
		return 1;
	/* regexec returns 0 on a match */
	if (!regexec(&re, "a", 0, 0, 0)) puts("a");
	if (!regexec(&re, "A", 0, 0, 0)) puts("A");
	if (!regexec(&re, "b", 0, 0, 0)) puts("b");
	if (!regexec(&re, "B", 0, 0, 0)) puts("B");
	if (!regexec(&re, "c", 0, 0, 0)) puts("c");
	if (!regexec(&re, "C", 0, 0, 0)) puts("C");
	if (!regexec(&re, "d", 0, 0, 0)) puts("d");
	if (!regexec(&re, "D", 0, 0, 0)) puts("D");
	regfree(&re);
	return 0;
}
Szabolcs Nagy
2014-08-27 14:33:08 UTC
Post by Szabolcs Nagy
In chapter 9 the case insensitive matching of negated (^) bracket
expressions seems to be inconsistent with historical practice.
9.2 says that when matching a character to the pattern case
insensitively, the case counterpart should be tried as well.
9.3.5 says that a non-matching list bracket expression matches
any character except for the ones in the list.
So [^aBcC] should match 'a', 'A', 'b', 'B', 'd' or 'D', and not
match 'c' or 'C' characters, because when eg. matching 'a', both
'a' and 'A' should be tried and 'A' is not in the non-matching
list so it matches.
I suspect that on historical implementations this pattern does
not match 'a', 'A', 'b' or 'B'.
(Because the non-matching list is checked case insensitively
and if the character or its case counterpart is on the list
then it does not match.)
ok, it seems that when the regcomp api first appeared in 1992
it was already inconsistent with the current posix spec

http://minnie.tuhs.org/cgi-bin/utree.pl?file=4.4BSD/usr/src/lib/libc/regex/regcomp.c

(see the handling of REG_ICASE and invert in the p_bracket function.
older regex implementations seem to lack the icase flag)

current bsd systems still add case counterparts to the
bracket set first and then invert the set

glibc and tre do the same

(even perl and pcre do the same)

i think this is a bug in the posix standard so i'll file an issue