Szabolcs Nagy
2014-08-19 22:59:21 UTC
In chapter 9, the case-insensitive matching of negated (^) bracket
expressions seems to be inconsistent with historical practice.
9.2 says that when matching a character to the pattern case
insensitively, the case counterpart should be tried as well.
9.3.5 says that a non-matching list bracket expression matches
any character except for the ones in the list.
So [^aBcC] should match 'a', 'A', 'b', 'B', 'd' or 'D', and not
match 'c' or 'C', because when matching e.g. 'a', both 'a' and
'A' should be tried, and 'A' is not in the non-matching list, so
it matches.
I suspect that in historical implementations this pattern does
not match 'a', 'A', 'b' or 'B', because the non-matching list is
checked case-insensitively: if the character or its case
counterpart is in the list, then it does not match.
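
To make the difference concrete, the two readings can be modelled without regex.h at all. This is only a sketch of my understanding, not code from any implementation: the function names (literal_reading_matches, historical_matches, print_models) are made up for illustration, the list is treated as plain characters (no ranges, classes or collating elements), and case mapping is assumed to be what toupper/tolower give in the current locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case counterpart of c, or c itself if it has none (assumption:
 * toupper/tolower in the current locale define the counterpart). */
static int counterpart(int c)
{
    if (islower((unsigned char)c)) return toupper((unsigned char)c);
    if (isupper((unsigned char)c)) return tolower((unsigned char)c);
    return c;
}

/* Is c listed literally in the bracket list? */
static int in_list(const char *list, int c)
{
    return c != '\0' && strchr(list, c) != NULL;
}

/* Literal reading of 9.2 + 9.3.5: try c and its counterpart against
 * the non-matching list; a match of either one is enough. */
int literal_reading_matches(const char *list, int c)
{
    return !in_list(list, c) || !in_list(list, counterpart(c));
}

/* Suspected historical behaviour: c is rejected if it or its
 * counterpart appears in the list. */
int historical_matches(const char *list, int c)
{
    return !in_list(list, c) && !in_list(list, counterpart(c));
}

/* Print which of the subject characters each model accepts. */
void print_models(const char *list, const char *subjects)
{
    const char *p;

    assert(list != NULL && subjects != NULL);
    printf("literal:    ");
    for (p = subjects; *p; p++)
        if (literal_reading_matches(list, *p)) putchar(*p);
    printf("\nhistorical: ");
    for (p = subjects; *p; p++)
        if (historical_matches(list, *p)) putchar(*p);
    putchar('\n');
}
```

Calling print_models("aBcC", "aAbBcCdD") from a main() should list aAbBdD for the literal reading and dD for the suspected historical one, so the two only disagree on characters whose case counterpart is listed without the character's other case.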
Is this interpretation correct and are there implementations
that handle this case accordingly?
Example code to try (prints the matched characters):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;

    /* REG_ICASE requests case-insensitive matching */
    if (regcomp(&re, "[^aBcC]", REG_ICASE) != 0)
        return 1;
    /* regexec returns 0 when the string matches */
    if (!regexec(&re, "a", 0, 0, 0)) puts("a");
    if (!regexec(&re, "A", 0, 0, 0)) puts("A");
    if (!regexec(&re, "b", 0, 0, 0)) puts("b");
    if (!regexec(&re, "B", 0, 0, 0)) puts("B");
    if (!regexec(&re, "c", 0, 0, 0)) puts("c");
    if (!regexec(&re, "C", 0, 0, 0)) puts("C");
    if (!regexec(&re, "d", 0, 0, 0)) puts("d");
    if (!regexec(&re, "D", 0, 0, 0)) puts("D");
    regfree(&re);
    return 0;
}