bug#17196: UTF-8 printf string formating problem

Discussion:

Eric Blake

2014-04-07 21:57:03 UTC

[adding the Austin Group]

Yes printf follows the C standard which only considers bytes.
...
I don't think we'd be able to change the current operation of printf
due to backwards compat reasons? Though we might be able to somehow leverage
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.
http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html

ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
★★★

At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).
Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
LC_ALL=C printf ...
I see that ksh behaves as I would expect and counts cells,
$ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★★
$ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
Ａ★
$ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
Ａ
$ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
Ａ★★
I see that dash gives invalid directive for any of %ls %Ls %S.
Pity there is no consensus here.
printf '%3s' 'blah' # count cells
printf '%3Ls' 'blah' # count chars
LANG=C '%3Ls' 'blah' # count bytes
LANG=C '%3s' 'blah' # count bytes

Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.
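
The risk of ending in the middle of a multibyte character can be shown with a minimal Python sketch. The helper names `truncate_bytes` and `truncate_chars` are hypothetical, UTF-8 is assumed, and Python string slicing stands in for character counting; this is an illustration, not how any shell implements printf.

```python
def truncate_bytes(text: str, limit: int, encoding: str = "utf-8") -> bytes:
    """Keep at most `limit` bytes, the way a byte-counting %.Ns does."""
    return text.encode(encoding)[:limit]


def truncate_chars(text: str, limit: int) -> str:
    """Keep at most `limit` characters (code points)."""
    return text[:limit]


stars = "\u2605" * 5              # five BLACK STARs, three UTF-8 bytes each
cut = truncate_bytes(stars, 4)    # one whole star plus a dangling lead byte

# ascii() keeps the output printable regardless of the terminal encoding.
print(ascii(cut.decode("utf-8", errors="replace")))  # stray byte becomes U+FFFD
print(ascii(truncate_chars(stars, 3)))               # three whole stars
```

Cutting after four bytes leaves a lone lead byte that no longer decodes, while cutting after three characters always ends on a character boundary.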

Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedence, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.
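
What counting by cells could mean can be sketched in Python. The width rule below is an assumption made for illustration (zero cells for non-spacing and enclosing marks, two for East Asian Wide and Fullwidth characters, one otherwise); it is not wcwidth(3) and not what ksh does internally, and `cell_width` and `truncate_cells` are hypothetical names.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate terminal cells for one code point (assumed rule)."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # combining marks share the cell of their base character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def truncate_cells(text: str, limit: int) -> str:
    """Keep the longest prefix whose total width is at most `limit` cells."""
    used = 0
    kept = []
    for ch in text:
        width = cell_width(ch)
        if used + width > limit:
            break
        used += width
        kept.append(ch)
    return "".join(kept)


for sample in ("a\u0301\u2605\u2605\u2605",
               "\uff21\u2605\u2605\u2605",
               "\uff21\uff21\u2605\u2605\u2605"):
    print(ascii(truncate_cells(sample, 3)))
```

Under this rule, the three ksh inputs quoted in this thread yield the same prefixes that ksh printed for `%.3Ls`.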

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Pádraig Brady

2014-04-08 00:11:13 UTC

Post by Eric Blake
[adding the Austin Group]

Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.
http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html

ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
★★★

At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).
Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
LC_ALL=C printf ...
I see that ksh behaves as I would expect and counts cells,
$ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★★
$ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
Ａ★
$ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
Ａ
$ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
Ａ★★
I see that dash gives invalid directive for any of %ls %Ls %S.
Pity there is no consensus here.
printf '%3s' 'blah' # count cells
printf '%3Ls' 'blah' # count chars
LANG=C '%3Ls' 'blah' # count bytes
LANG=C '%3s' 'blah' # count bytes

Note ksh seems to count cells with %Ls

Post by Eric Blake
Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedence, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be much backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).

But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count cells
LANG=C '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
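
For a field width such as `%3s`, the choice of unit decides how much padding is emitted. A minimal Python sketch, assuming the same approximate width rule as a terminal would apply (the helpers `display_width` and `pad_left` are hypothetical, not an existing printf interface):

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate width in terminal cells (assumed rule, not wcwidth(3))."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # combining marks add no cells
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_left(text: str, field: int, measure) -> str:
    """Right-justify `text` in `field` units, as measured by `measure`."""
    return " " * max(0, field - measure(text)) + text


wide = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A: 3 bytes, 1 character, 2 cells

by_bytes = pad_left(wide, 3, lambda s: len(s.encode("utf-8")))  # no padding
by_chars = pad_left(wide, 3, len)                               # two spaces
by_cells = pad_left(wide, 3, display_width)                     # one space
```

Only the cell-based result occupies exactly three columns on screen, which is why cell counts matter for aligned output.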

thanks,
Pádraig.

Eric Blake

2014-04-08 01:28:10 UTC

Post by Pádraig Brady
If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count cells
LANG=C '%3Ls' 'blah' # count bytes

If we add %3Ls to the shell, we should also add it to libc's printf(3),
which means coordinating with the C committee.

Post by Pádraig Brady
This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.

If a future version of the standard mandates behavior for %Ls, I suspect
dash would be made compliant fairly quickly - the dash maintainers
strive hard to comply with POSIX.

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Ranjit Singh

2014-04-16 05:55:17 UTC

Post by Pádraig Brady

Post by Eric Blake
Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.

Yeah, an mbcs string printed via fprintf vs wprintf is counted in bytes,
viz %n.

Post by Pádraig Brady

What about %Gs for graphical? I agree there needs to be a method,
so we might as well posit the full-set, since there is prior art
in C libraries, and \U and \u have been around since C99.

Post by Pádraig Brady
At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be much backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).

Agreed.

Well, it's fugly: you forgot the printf in the last line, and I
didn't even notice til I got to here. It's new, based on what
Eric wrote, so we might as well get it right.

printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count chrs
printf '%3Gs' 'blah' # count cells

Or %Cs hmm. Neither of those is right, since the lower-case
forms conflict with printf(3) conversion-specifiers.
Similarly for %L, though with the %l length-modifier, ofc.

Let the bikeshedding begin. ;)

Regards,
Ranjit

--
"One can be a gentleman, without being a push-over."

Eric Blake

2014-04-16 10:03:05 UTC

Post by Ranjit Singh
Well, it's fugly: you forgot the printf in the last line, and I
didn't even notice til I got to here. It's new, based on what
Eric wrote, so we might as well get it right.
printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count chrs
printf '%3Gs' 'blah' # count cells
Or %Cs hmm. Neither of those is right, since the lower-case
forms conflict with printf(3) conversion-specifiers.
Similarly for %L, though with the %l length-modifier, ofc.

G and C are format specifiers, so they cannot be used as a length specifier.

L is already a length modifier, but is currently undefined when paired
with %s; so using %Ls (and the counterpart wide char %LS) makes sense.
The idea of the proposal is adding one or two new length specifier
designations that state that the length is determined by characters
and/or properties of the entire sequence of characters, rather than mere
bytes.

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Ranjit Singh

2014-04-27 15:33:45 UTC

Post by Eric Blake

G and C are format specifiers, so they cannot be used as a length specifier.

Doh, missed that; ofc. I'm not used to %C and %S at all, though
should have known equivalence of %G with %E/%F.

Post by Eric Blake
L is already a length modifier, but is currently undefined when
paired with %s; so using %Ls (and the counterpart wide char %LS)
makes sense.

The trouble I have with %L is that in the context of C you have
L"...". %Ls for graphical length is simply asking for confusion imo,
even if %ls is known, though this is a WG14 discussion really.

Still, we seem agreed on the general idea and need for such a
specifier. I concur that %Ls is currently undefined within C, and
though I'd still not use it for this meaning within that context,
I'm not overly fussed.

Post by Eric Blake
The idea of the proposal is adding one or two new length specifier
designations that state that the length is determined by characters
and/or properties of the entire sequence of characters, rather than mere
bytes.

Yes, I got that, thanks. Though it's not so much about the entire
sequence if you're specifying an explicit length, as about unit of
measurement. I agree that depends on the sequence of graphemes/chrs
encountered until we get to that count of units (and WG14 will
likely need to add something about shift sequences when printing
as mbcs), but it's about the unit nonetheless.

For further prior art, mksh uses ${%x} to mean "graphical length"
(as in number of cells occupied on screen) vs usual ${#x} for
length in chrs. ${%x} for a single chr can be 0-2, or -1 for a
control or error.
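
A per-character width with the value range described here (0 to 2, or -1 for a control character) can be sketched in Python. This is only an approximation built on the Unicode character database; `graphical_width` and `graphical_length` are hypothetical names and do not reproduce mksh's actual tables.

```python
import unicodedata


def graphical_width(ch: str) -> int:
    """Cells for one character: -1 for controls, otherwise 0, 1 or 2."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control characters have no defined width
    if category in ("Mn", "Me"):
        return 0   # combining marks occupy no cell of their own
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def graphical_length(text: str) -> int:
    """Total cells for a string, or -1 if any character has no width."""
    total = 0
    for ch in text:
        width = graphical_width(ch)
        if width < 0:
            return -1
        total += width
    return total
```

For example, `graphical_length("a\u0301\uff21")` is 3 (one cell for the accented letter, two for the fullwidth A), while `len()` of the same string is also 3 for a different reason: it counts three code points.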

I hope we'll be raising %Ln (or equiv) for consideration as well.
Again, that's about unit of measurement, and I'm not especially
fussed which letter is used, so long as we get the functionality.

Regards,
Ranjit.

--
"One can be a gentleman, without being a push-over."

Steffen Nurpmeso

2014-04-09 12:49:37 UTC

Eric Blake <eblake-H+wXaHxf7aLQT0dZR+***@public.gmane.org> wrote:
|>> Dan Douglas wrote:
|>>> ksh93 already has this feature using the "L" modifier:
|>>>
|>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
|>>> ★★★
|>>
|>> At least there is prior art for it.
|>
|> So we can count bytes, chars or cells (graphemes).
|>
|> Thinking a bit more about it, I think shell level printf
|> should be dealing in text of the current encoding and counting cells.
|> In the edge case where you want to deal in bytes one can do:
|> LC_ALL=C printf ...
|>
|> I see that ksh behaves as I would expect and counts cells,
|> though requires the explicit %L enabler:
|> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★★
|> $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
|> Ａ★
|> $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
|> Ａ
|>
|> zsh seems to just count characters:
|> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
|> Ａ★★
|>
|> I see that dash gives invalid directive for any of %ls %Ls %S.
|>
|> Pity there is no consensus here.
|> Personally I would go for:
|> printf '%3s' 'blah' # count cells
|> printf '%3Ls' 'blah' # count chars
|> LANG=C '%3Ls' 'blah' # count bytes
|> LANG=C '%3s' 'blah' # count bytes
|
|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|and currently states that %Ls is undefined. But I would LOVE to have a
|standardized spelling for counting characters instead of bytes. The
|extension %Ls looks like a good candidate for standardization, precisely
|because counting characters when printing a multibyte string is more
|useful than counting bytes (you do NOT want to end in the middle of a
|multibyte character), and because ksh offers it as existing practice.
|
|Your idea for counting "cells" (by which I'm assuming you mean one or
|more characters that all display within the same cell of the terminal,
|as if the end user saw only one grapheme), on the other hand, does not
|seem to have any precedence, and I would strongly object to having %s
|count by cells because %s already has a standardized (if unfortunate)
|meaning of counting by bytes. Maybe yet another extension is warranted
|(perhaps %LLs?) as a new notion for counting by cells instead of
|characters, but it's harder to justify that without existing practice.

I see you are trying to invent the word character for code points
and reserve the term "graphem" for user-perceived characters.
This goes in line with the GNU library which has the existing
practice to let wcwidth(3) return the value 1 for accents and
other combining code points as well as so-called (Unicode)
noncharacters. And who would call wcwidth(3) on something that is
not to be drawn onto the screen directly afterwards. And, of
course, which terminal will perform the composition of code points
written via STD I/O to characters on its own.
I think for quite a while it is up to the input methods to combine
into something precomposed in order to let POSIX programs finally
work with it.

--steffen

Rich Felker

2014-04-10 07:56:10 UTC

Post by Steffen Nurpmeso
|>>>
|>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
|>>> ★★★
|>>
|>> At least there is prior art for it.
|>
|> So we can count bytes, chars or cells (graphemes).
|>
|> Thinking a bit more about it, I think shell level printf
|> should be dealing in text of the current encoding and counting cells.
|> LC_ALL=C printf ...
|>
|> I see that ksh behaves as I would expect and counts cells,
|> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★★
|> $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
|> Ａ★
|> $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
|> Ａ
|>
|> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
|> á★
|> $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
|> Ａ★★
|>
|> I see that dash gives invalid directive for any of %ls %Ls %S.
|>
|> Pity there is no consensus here.
|> printf '%3s' 'blah' # count cells
|> printf '%3Ls' 'blah' # count chars
|> LANG=C '%3Ls' 'blah' # count bytes
|> LANG=C '%3s' 'blah' # count bytes
|
|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|and currently states that %Ls is undefined. But I would LOVE to have a
|standardized spelling for counting characters instead of bytes. The
|extension %Ls looks like a good candidate for standardization, precisely
|because counting characters when printing a multibyte string is more
|useful than counting bytes (you do NOT want to end in the middle of a
|multibyte character), and because ksh offers it as existing practice.
|
|Your idea for counting "cells" (by which I'm assuming you mean one or
|more characters that all display within the same cell of the terminal,
|as if the end user saw only one grapheme), on the other hand, does not
|seem to have any precedence, and I would strongly object to having %s
|count by cells because %s already has a standardized (if unfortunate)
|meaning of counting by bytes. Maybe yet another extension is warranted
|(perhaps %LLs?) as a new notion for counting by cells instead of
|characters, but it's harder to justify that without existing practice.
I see you are trying to invent the word character for code points
and reserve the term "graphem" for user-perceived characters.
This goes in line with the GNU library which has the existing
practice to let wcwidth(3) return the value 1 for accents and
other combining code points as well as so-called (Unicode)
noncharacters. And who would call wcwidth(3) on something that is
not to be drawn onto the screen directly afterwards. And, of
course, which terminal will perform the composition of code points
written via STD I/O to characters on its own.
I think for quite a while it is up to the input methods to combine
into something precomposed in order to let POSIX programs finally
work with it.

Many languages do not have precomposed forms for all the character
sequences they need, and for some, it would not even be practical to
have precomposed forms, and would force the use of complex input
methods instead of simple keyboard maps.
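
That not every sequence has a precomposed form can be checked with Unicode normalization. The Python sketch below uses the standard `unicodedata.normalize`; the helper `nfc_length` and the sample strings are chosen here for illustration only.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))


composable = "u\u0308"      # u + COMBINING DIAERESIS: composes to U+00FC
latin_only = "q\u0301"      # q + COMBINING ACUTE ACCENT: no precomposed form
devanagari = "\u0915\u093f" # DEVANAGARI KA + VOWEL SIGN I: stays two code points

for sample in (composable, latin_only, devanagari):
    print(ascii(sample), nfc_length(sample))
```

Only the first sample shrinks to a single code point; the others remain multi-code-point sequences after composition, so counting code points cannot stand in for counting user-perceived characters.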

Rich

Steffen Nurpmeso

2014-04-10 16:16:24 UTC

Rich Felker <dalias-/***@public.gmane.org> wrote:
|On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
|> Eric Blake <eblake-H+wXaHxf7aLQT0dZR+***@public.gmane.org> wrote:
|>|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|>|and currently states that %Ls is undefined. But I would LOVE to have a
|>|standardized spelling for counting characters instead of bytes. The
|>|extension %Ls looks like a good candidate for standardization, precisely
|>|because counting characters when printing a multibyte string is more
|>|useful than counting bytes (you do NOT want to end in the middle of a
|>|multibyte character), and because ksh offers it as existing practice.
|>|
|>|Your idea for counting "cells" (by which I'm assuming you mean one or
|>|more characters that all display within the same cell of the terminal,
|>|as if the end user saw only one grapheme), on the other hand, does not
|>|seem to have any precedence, and I would strongly object to having %s
[.]
|> I see you are trying to invent the word character for code points
|> and reserve the term "graphem" for user-perceived characters.
|> This goes in line with the GNU library which has the existing
|> practice to let wcwidth(3) return the value 1 for accents and
|> other combining code points as well as so-called (Unicode)
|> noncharacters. And who would call wcwidth(3) on something that is
|> not to be drawn onto the screen directly afterwards. And, of
|> course, which terminal will perform the composition of code points
|> written via STD I/O to characters on its own.
|> I think for quite a while it is up to the input methods to combine
|> into something precomposed in order to let POSIX programs finally
|> work with it.
|
|Many languages do not have precomposed forms for all the character
|sequences they need, and for some, it would not even be practical to
|have precomposed forms, and would force the use of complex input
|methods instead of simple keyboard maps.

And of course with UTF-8 decomposed forms of characters from an
immense number of languages can occur in at least theory, in,
e.g., a text file.
The german U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) could very
well be «ü» but also U+0075 U+0308 «u ̈», dependent on where it
came from. And note that my vim(1) composed U+00FC when i tried
to input the latter string automatically, i had to separate, enter
each, and join them together to get at «u» plus, actually non-,
combining diaeresis. (In fact actually «combining with a space».)
Of course a wcwidth(3) of 1 for U+0308 is much better than 0 when
it really produces something visual.
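
For «ü» specifically, the precomposed code point is U+00FC and the decomposed sequence is U+0075 U+0308. A short Python sketch of how the two spellings differ in bytes and code points yet compare equal after normalization (the helper `same_text` is a hypothetical name; NFC is one possible choice of normal form):

```python
import unicodedata

precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(precomposed.encode("utf-8")), len(precomposed))  # bytes, code points
print(len(decomposed.encode("utf-8")), len(decomposed))
print(precomposed == decomposed, same_text(precomposed, decomposed))
```

A byte count and a code point count both differ between the two spellings, even though a reader sees the same single letter in one cell.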

Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.
I don't think that «user-perceived characters don't have any
precedence». A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.
If POSIX beams itself onto UTF-8 it should really consider to
offer a way to be able to act on what the user really deals with.
And that is, in the Unicode world -- and isn't that what the bug
report is about --, not necessarily a mbrlen(3)-division of bytes.

--steffen

Chet Ramey

2014-04-10 18:10:26 UTC

Post by Steffen Nurpmeso
Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.

But we have always been at war with EastAsia!

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU ***@case.edu http://cnswww.cns.cwru.edu/~chet/

Steffen Nurpmeso

2014-04-11 10:16:15 UTC

Hello,

Chet Ramey <***@case.edu> wrote:
|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
|
|> Even better would nonetheless be the great picture with
|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
|> a tuple of {[EastAsianWidth indication,] is-combining,
|> width-if-non-combining} and best even some composition function.
|
|But we have always been at war with EastAsia!

I see you really would love to get a hand from POSIX too:

?0[***@sherwood bash-4.3]$ grep -r UNICODE_COMB .
./lib/readline/display.c: if (t > 0 && UNICODE_COMBINING_CHAR (wc) && WCWIDTH (wc) == 0)
./lib/readline/rlmbutil.h:#define UNICODE_COMBINING_CHAR(x) ((x) >= 768 && (x) <= 879)
./lib/readline/rlmbutil.h:# define WCWIDTH(wc) ((_rl_utf8locale && UNICODE_COMBINING_CHAR(wc)) ? 0 : wcwidth(wc))
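
The range in the quoted macro, 768 to 879, is U+0300 to U+036F, the Combining Diacritical Marks block. A small Python sketch of the same test shows what such a fixed range covers and what it misses; `in_readline_range` is a hypothetical name that merely mirrors the macro as quoted above.

```python
import unicodedata


def in_readline_range(code_point: int) -> bool:
    """Mirror of the quoted UNICODE_COMBINING_CHAR range check."""
    return 768 <= code_point <= 879


def is_combining(code_point: int) -> bool:
    """True if the Unicode database gives a non-zero combining class."""
    return unicodedata.combining(chr(code_point)) != 0


for cp in (0x0301, 0x20D7, 0x0041):
    print(f"U+{cp:04X}", in_readline_range(cp), is_combining(cp))
```

U+0301 passes both tests, whereas U+20D7 (COMBINING RIGHT ARROW ABOVE) is a combining character that lies outside the hard-coded range.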

And sorry for not making this clear for those who never dealt with
the problem (which is probably not uncommon for filesystem or
other kernel hackers): `EastAsianWidth' refers to a property of
Unicode and ISO 10646:

# EastAsianWidth-6.3.0.txt
# Date: 2013-02-05, 20:09:00 GMT [KW, LI]
#
# East Asian Width Properties
#
# This file is an informative contributory data file in the
# Unicode Character Database.
#
# Copyright (c) 1991-2013 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
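
The property can be queried without parsing that file, for instance through Python's standard `unicodedata.east_asian_width`. The sample characters below are chosen here for illustration, and the result reflects whatever Unicode version the Python build ships.

```python
import unicodedata

samples = {
    "A": "LATIN CAPITAL LETTER A",
    "\uff21": "FULLWIDTH LATIN CAPITAL LETTER A",
    "\uff71": "HALFWIDTH KATAKANA LETTER A",
    "\u4e2d": "CJK UNIFIED IDEOGRAPH-4E2D",
    "\u2605": "BLACK STAR",
}


def width_class(ch: str) -> str:
    """East Asian Width class: Na, N, A, W, F or H."""
    return unicodedata.east_asian_width(ch)


for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {width_class(ch)}")
```

The star used in the examples of this thread is class A (ambiguous), which is why a terminal may draw it in one or two cells depending on its configuration.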

--steffen

...
To be honest i must admit i first was pissed, so let me append the
original first part of this message, please:

and so the landslide had brought it down.
But i would quote Paul Vixie, who stated in a message today

gentlemen and ladies, we have met the enemy, and they are our
egos.

vixie

From my point of view it's a matter of culture and philosophy
(including religion) how to deal with that very problem.
And i can assure you that Jehovah's Witnesses, who have visited me
regularly for some years now, like to drink a bit of my Buddhistic
tea.

Paul Vixie is correct.
I am stupid.
With greetings from someone who will undergo his 42nd birthday soon

Chet Ramey

2014-04-11 12:25:11 UTC

Post by Steffen Nurpmeso
Hello,
|
|> Even better would nonetheless be the great picture with
|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
|> a tuple of {[EastAsianWidth indication,] is-combining,
|> width-if-non-combining} and best even some composition function.
|
|But we have always been at war with EastAsia!

I'm sorry, I realize that was rather obscure. It's from "1984", by George
Orwell. It's a central theme to the book. The quote was an attempt to
inject levity into the discussion.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU ***@case.edu http://cnswww.cns.cwru.edu/~chet/

Steffen Nurpmeso

2014-04-11 13:40:41 UTC

Chet Ramey <chet.ramey-***@public.gmane.org> wrote:
|On 4/11/14, 6:16 AM, Steffen Nurpmeso wrote:
|> Hello,
|>
|> Chet Ramey <chet.ramey-***@public.gmane.org> wrote:
|>|On 4/10/14, 12:16 PM, Steffen Nurpmeso wrote:
|>|
|>|> Even better would nonetheless be the great picture with
|>|> a termios(4) IUTF8 flag, some extended xywidth(3) that returns
|>|> a tuple of {[EastAsianWidth indication,] is-combining,
|>|> width-if-non-combining} and best even some composition function.
|>|
|>|But we have always been at war with EastAsia!
|>
|> I see you really would love to get a hand from POSIX too:
|
|I'm sorry, I realize that was rather obscure. It's from "1984", by George
|Orwell. It's a central theme to the book. The quote was an attempt to

oh, ah, yes. So.. i got it right without getting it right.

Interestingly, a retrospective work on Walter Benjamin started
yesterday (<http://www.eingedenken.de/enter.html> --
"remembrance"): an artist (Christoph Korn) walked Benjamin's last trip
from Banyuls-sur-Mer (France) to Portbou (Spain; where Benjamin
committed suicide due to the impossibility to reach the U.S.),
following a fixed time frame (monotonic tick, so to say) after
which he spoke theses of Benjamin (like, e.g., "There is no
document of civilization which is not at the same time a document
of barbarism."), followed by pausing and taking a (steady cam)
video of the recent leg. Association with Paul Klee's "Angelus
Novus" is desired (from both parties).

|inject levity into the discussion.

That was easy.

--steffen

Shware Systems

2014-04-16 14:42:09 UTC

Afaik, the current expectation of the standards, C99 and POSIX,
is that if you want to count characters instead of bytes you use
mbstowcs first and wprintf("%*s", len, str) to accomplish that.
For use with a byte-oriented device or file it is the responsibility
of wprintf() to use wctomb() internally to reconvert that to the
appropriate printf("%s") usage, which would nominally be UTF-8
of minimum length representation if multiple code points can
represent the logical graphic. If an output device requires the
longer code point sequence it is on the device driver to do any
additional conversions transparently. From an application writer's
perspective an addition like this may make their lives a little
easier, but at the expense of it being much more likely the
interface call will fail with ENOMEM. As someone that is
putting a full Unicode proposal together, I do not see this as
a "best practice" example; simply a nice first attempt that some
implementations have gotten to work and should be allowed,
but not required.
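
The convert, count, convert-back sequence described above can be sketched in Python, where `bytes.decode` plays the role of mbstowcs() and `str.encode` the role of the wctomb() step. This is an analogy under the assumption of UTF-8 input, not the C interfaces themselves; `print_chars` is a hypothetical name.

```python
def print_chars(raw: bytes, limit: int, encoding: str = "utf-8") -> bytes:
    """Return the first `limit` characters of `raw`, re-encoded as bytes."""
    text = raw.decode(encoding)            # fails on invalid multibyte input
    return text[:limit].encode(encoding)   # back to a byte stream for output


stars = ("\u2605" * 5).encode("utf-8")     # 15 bytes, 5 characters

print(len(print_chars(stars, 3)))          # 9 bytes: three whole characters

try:
    print_chars(b"\xe2\x98", 1)            # truncated sequence
except UnicodeDecodeError as exc:
    print("invalid input:", exc.reason)
```

The round trip needs a temporary decoded copy of the string, which is the extra allocation (and failure mode) a character-counting specifier would hide inside the formatting call.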

What is missing from both standards is something that can
handle Unicode's Multi-Wide code unit characters that show
up as a single graphic, or are supposed to, such as the Asian
multi-cells or European combining diacritic forms, or even
surrogate pairs in UTF-16. Generically, these would handle
additional Unicode attributes and control code sequences,
not overload the simpler printf() and wprintf() specification.

A separate set of mwstowcs() and wcstomws() interfaces,
using an array of wchar_t[6] arrays as source and target
respectively is needed for that sort of counting. I think 6 is
the max length used by the Asian composites. These
might be used with a separate specifier char to wprintf(),
not printf(), to take a pointer to such a two-dimensional
array as parameter, but it's technically superfluous. I think
having a similar letter for printf(), entailing an internal
mbstowcs() followed by wcstomws() and back is way over
complicating it as the source of those extra errors.
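
The two-dimensional shape described above (one row of code points per displayed graphic) can be illustrated in Python. The grouping rule here is a deliberate simplification, attaching non-spacing and enclosing marks to the preceding base character; it is neither the full Unicode grapheme cluster algorithm nor the proposed mwstowcs() interface, and `clusters` is a hypothetical name.

```python
import unicodedata


def clusters(text: str) -> list[list[str]]:
    """Group each base character with the combining marks that follow it."""
    groups: list[list[str]] = []
    for ch in text:
        if groups and unicodedata.category(ch) in ("Mn", "Me"):
            groups[-1].append(ch)   # mark joins the previous base character
        else:
            groups.append([ch])     # anything else starts a new cluster
    return groups


sample = "a\u0301\u2605q\u0301\u0308"
rows = clusters(sample)
print(len(sample), len(rows))       # code points versus clusters
```

Counting rows rather than code points gives the number of displayed graphics, and each row has a variable length, which is what a fixed-size inner array would have to bound.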

The thornier problem, theoretically, is adding robust support
for UTF-16 to any general handling of Unicode in locales,
including any specifier letters possibly needed for that. The C11
standard presumes only UCS-2 and UCS-4 forms of Unicode
will be used, so that is glossed over, and the u8"" form is for
constants to be passed to mbtowc() also before any semantic
processing. It is more a POSIX problem as additional UTF-8 or
UTF-16 handling in general is left as implementation-defined.
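
Why UTF-16 complicates counting can be seen from surrogate pairs: a single code point outside the Basic Multilingual Plane takes two 16-bit code units. A minimal Python sketch (the helper `utf16_units` is a hypothetical name):

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "\u2605"          # BLACK STAR, inside the Basic Multilingual Plane
astral = "\U0001F600"   # GRINNING FACE, outside it

print(utf16_units(bmp), utf16_units(astral))
print(astral.encode("utf-16-be").hex())   # high surrogate, then low surrogate
```

A count of code units therefore differs from a count of characters as soon as a surrogate pair appears, in the same way a byte count differs from a character count in UTF-8.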

-----Original Message-----
From: Steffen Nurpmeso <sdaoden-zJpx2rpV7r/QT0dZR+***@public.gmane.org>
To: Rich Felker <dalias-/***@public.gmane.org>
Cc: 17196 <17196-ubl+/3LiMTaZdePnXv/***@public.gmane.org>; Austin Group
<austin-group-l-7882/***@public.gmane.org>; Bob Proulx <bob-5cAygf9QrE/QT0dZR+***@public.gmane.org>; Eric Blake
<eblake-H+wXaHxf7aLQT0dZR+***@public.gmane.org>; Jan Novak <jn-RP+***@public.gmane.org>; Pádraig Brady
<***@draigBrady.com>
Sent: Thu, Apr 10, 2014 11:20 am
Subject: Re: bug#17196: UTF-8 printf string formating problem

Rich Felker <dalias-/***@public.gmane.org> wrote:
|On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
|> Eric Blake <eblake-H+wXaHxf7aLQT0dZR+***@public.gmane.org> wrote:
|>|Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
|>|and currently states that %Ls is undefined. But I would LOVE to have a
|>|standardized spelling for counting characters instead of bytes. The
|>|extension %Ls looks like a good candidate for standardization, precisely
|>|because counting characters when printing a multibyte string is more
|>|useful than counting bytes (you do NOT want to end in the middle of a
|>|multibyte character), and because ksh offers it as existing practice.
|>|
|>|Your idea for counting "cells" (by which I'm assuming you mean one or
|>|more characters that all display within the same cell of the terminal,
|>|as if the end user saw only one grapheme), on the other hand, does not
|>|seem to have any precedence, and I would strongly object to having %s
[.]
|> I see you are trying to invent the word character for code points
|> and reserve the term "graphem" for user-perceived characters.
|> This goes in line with the GNU library which has the existing
|> practice to let wcwidth(3) return the value 1 for accents and
|> other combining code points as well as so-called (Unicode)
|> noncharacters. And who would call wcwidth(3) on something that is
|> not to be drawn onto the screen directly afterwards. And, of
|> course, which terminal will perform the composition of code points
|> written via STD I/O to characters on its own.
|> I think for quite a while it is up to the input methods to combine
|> into something precomposed in order to let POSIX programs finally
|> work with it.
|
|Many languages do not have precomposed forms for all the character
|sequences they need, and for some, it would not even be practical to
|have precomposed forms, and would force the use of complex input
|methods instead of simple keyboard maps.

And of course with UTF-8 decomposed forms of characters from an
immense number of languages can occur in at least theory, in,
e.g., a text file.
The german U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) could very
well be «ü» but also U+0075 U+0308 «u ̈», dependent on where it
came from. And note that my vim(1) composed U+00FC when i tried
to input the latter string automatically, i had to separate, enter
each, and join them together to get at «u» plus, actually non-,
combining diaeresis. (In fact actually «combining with a space».)
Of course a wcwidth(3) of 1 for U+0308 is much better than 0 when
it really produces something visual.

Even better would nonetheless be the great picture with
a termios(4) IUTF8 flag, some extended xywidth(3) that returns
a tuple of {[EastAsianWidth indication,] is-combining,
width-if-non-combining} and best even some composition function.
I don't think that «user-perceived characters don't have any
precedence». A whole lot of development in the past decade on the
winner side (that is, the other :) was exactly that -- making
software barrier-free.
If POSIX beams itself onto UTF-8 it should really consider to
offer a way to be able to act on what the user really deals with.
And that is, in the Unicode world -- and isn't that what the bug
report is about --, not necessarily a mbrlen(3)-division of bytes.

--steffen

Steffen Nurpmeso

2014-04-17 11:54:38 UTC

Shware Systems <shwaresyst-***@public.gmane.org> wrote:
|The thornier problem, theoretically, is adding robust support
|for UTF-16 to any general handling of Unicode in locales,
|including any specifier letters possibly needed for that. The C11

I hope it's a theoretical problem. Drastic improvement of UTF-8
performance was announced as a major improvement of the last ICU
release. Of course no one can blame major companies for having
written engines that internally work with UTF-16, since that
is the native encoding of the most widespread and thus
economically interesting operating system, and when comparing
performance spreadsheets, back-and-forth character set conversion
doesn't look very good. Businessmen would surely find more
drastic words. (I usually call them "businesskids" instead
because of the regression in their sense of responsibility. But that
is completely off-topic.)

Maybe your drafted approach is the road to go, but permit me to hope
for something else. I'm hoping for a sane, complete, efficient
and easily usable interface, and in practice all those items cannot
be expected from ISO C (except maybe sometimes by presupposing
additional compiler plus support to re-reduce complexity; e.g.,
atomic operations). Now this: One should not underestimate the
freedom you effectively gain in a panel as large as ISO: don't
you state, obviously confused, but seriously looking

If n is zero, the application shall ensure that ws1 and ws2 are
valid pointers, and the function shall copy zero wide characters.

with the attendance of psychiatrists.

--steffen