Eric Blake
2014-04-07 21:57:03 UTC
[adding the Austin Group]
Yes printf follows the C standard which only considers bytes.
...

I don't think we'd be able to change the current operation of printf
due to backwards compat reasons? Though we might be able to somehow leverage
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD

Dan Douglas pointed out in the corresponding discussion in bug-bash
that ksh uses the L modifier.
http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
★★★
At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).
Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
LC_ALL=C printf ...
I see that ksh behaves as I would expect and counts cells,
$ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★★
$ ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605'"
★
$ ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605'"

$ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605'"
★★
I see that dash gives invalid directive for any of %ls %Ls %S.
Pity there is no consensus here.
printf '%3s' 'blah' # count cells
printf '%3Ls' 'blah' # count chars
LANG=C printf '%3Ls' 'blah' # count bytes
LANG=C printf '%3s' 'blah' # count bytes

Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.
Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.
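
To make the three counting units concrete, here is a rough sketch in portable shell. The helper name utf8_head and its chars|cells argument are hypothetical, invented for this illustration; no shell provides them. It inspects raw bytes, so it does not depend on the current locale. It assumes valid UTF-8 input and a head that accepts -c (not required by POSIX). Its notion of a "cell" is deliberately crude: only U+0300..U+036F (combining diacritical marks) are zero-width, and every other character is assumed to fill exactly one cell.

```shell
#!/bin/sh
# Sketch only: utf8_head is a hypothetical helper, not a feature of any shell.
# Usage: utf8_head chars|cells N STRING
# Prints the longest prefix of STRING that holds at most N units and always
# cuts on a UTF-8 character boundary.
utf8_head() {
    uh_mode=$1 uh_n=$2 uh_s=$3
    # Dump the string as decimal byte values and let awk decide how many
    # bytes to keep. LC_ALL=C forces od and awk to work on bytes.
    uh_keep=$(printf '%s' "$uh_s" | LC_ALL=C od -An -v -tu1 |
        LC_ALL=C awk -v mode="$uh_mode" -v n="$uh_n" '
        { for (i = 1; i <= NF; i++) b[++total] = $i + 0 }
        END {
            units = 0; keep = 0; pos = 1
            while (pos <= total) {
                # Extend over continuation bytes (128..191) to find the
                # last byte of the character that starts at pos.
                end = pos
                while (end < total && b[end + 1] >= 128 && b[end + 1] <= 191)
                    end++
                width = 1
                # Assumption: U+0300..U+036F (lead byte 204, or lead byte
                # 205 with a second byte up to 175) occupy no cell.
                if (mode == "cells" && end == pos + 1 &&
                    (b[pos] == 204 || (b[pos] == 205 && b[end] <= 175)))
                    width = 0
                if (units + width > n) break
                units += width
                keep = end
                pos = end + 1
            }
            print keep + 0
        }')
    printf '%s' "$uh_s" | head -c "$uh_keep"
}

star=$(printf '\342\230\205')            # U+2605, three bytes in UTF-8
s=$(printf 'a\314\201')$star$star$star   # a, U+0301, then three stars
utf8_head chars 3 "$s"; echo   # a, U+0301 and one star: 6 bytes
utf8_head cells 3 "$s"; echo   # accented a and two stars: 9 bytes
printf '%s' "$s" | head -c 3; echo   # bytes: a plus U+0301 only
```

Counting three bytes keeps only the a and its combining accent. Counting three characters gives the accented a and one star, which is the zsh result quoted above. Counting three cells gives the accented a and two stars, which is the ksh result quoted above.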
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org