Discussion:
fseek(3) question/typo fix
Steffen "Daode" Nurpmeso
2012-09-14 21:56:22 UTC
I guess that on page 978, line 31922, "open file description"
truly means an "open file descriptor". In full:

If the most recent operation, other than ftell(), on a given
stream is fflush(), the file offset in the underlying open file
descriptOR shall be adjusted to reflect the location specified
by fseek()

and the entire paragraph is marked as CX (supported on all
conforming systems) and furthermore this very behaviour was
already described for IEEE Std 1003.1, 1996 Edition.

While searching for a "bug" in my copy of nail(1) (Heirloom
mailx(1)) i've actually ended up at

fflush(fi);
rewind(fi);
lseek(fileno(fi), 0, SEEK_SET);

which caused nail(1) to fail on Mac OS X Snow Leopard, FreeBSD 9,
NetBSD 6 RC1 and OpenBSD 5.1. (Where "fail" means that the
content of dead.letter, to which *fi* will be dumped next,
contained the content of *fi* twice..) This behaviour disappeared
on all systems when the lseek(2) was removed.
I for one really don't know why the STDIO stuff simply assumes
its offset is correct.
Ciao,
Philip Guenther
2012-09-15 04:26:35 UTC
On Fri, Sep 14, 2012 at 2:56 PM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
...
Post by Steffen "Daode" Nurpmeso
While searching for a "bug" in my copy of nail(1) (Heirloom
mailx(1)) i've actually ended up at
fflush(fi);
rewind(fi);
lseek(fileno(fi), 0, SEEK_SET);
which caused nail(1) to fail on Mac OS X Snow Leopard, FreeBSD 9,
NetBSD 6 RC1 and OpenBSD 5.1. (Where "fail" means that the
content of dead.letter, to which *fi* will be dumped next,
contained the content of *fi* twice..) This behaviour disappeared
on all systems when the lseek(2) was removed.
I for one really don't know why the STDIO stuff simply assumes
it's offset is correct.
The stdio implementations assume that because the standard allows them
to: it requires the application to follow various rules when using
multiple 'handles' (FILE handles or file descriptors) for the same
open file description. These rules are described in detail in the
"General Information" section of XSH. In POSIX 2008, they are in
section 2.5.1, "Interaction of File Descriptors and Standard I/O
Streams":

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_01


You don't describe the buffering or how it was opened, so you'll have
to go through the description yourself. The key thing to note is that
you've created a second handle using fileno() and done an lseek() on
it, so for 'fi' to become the active handle again without undefined
behavior, the first call on it after the lseek() must be an fseek().

(Why these rules? They let stdio optimize away lseek()s and read()s
when you do all access through stdio. rewind() can leave the kernel's
view of the file offset alone and retain stdio's buffers of the file
contents...which is exactly why you're seeing duplicated data in this
case.)
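
As a rough illustration of the two views of the file position, here is a minimal C sketch. The helper name show_offsets and the scratch file are made up for this example. How far the kernel offset runs ahead of the stream position depends on the stdio implementation and the buffering mode, so the gap is illustrative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read one byte through stdio, then report the
 * stream's logical position and the kernel's file offset. */
int show_offsets(long *logical, off_t *kernel)
{
    assert(logical != NULL && kernel != NULL);

    FILE *fp = tmpfile();            /* scratch file, removed on close */
    if (fp == NULL)
        return -1;

    if (fputs("0123456789\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    rewind(fp);                      /* flushes the write, back to start */

    if (fgetc(fp) == EOF) {          /* stdio may read ahead a whole buffer */
        fclose(fp);
        return -1;
    }

    *logical = ftell(fp);            /* where the stream thinks it is */
    *kernel = lseek(fileno(fp), 0, SEEK_CUR);  /* query only, no move */

    fclose(fp);
    return 0;
}
```

With a buffered stream the kernel offset is typically larger than the stream position, because stdio has already pulled more data into its buffer than the program has consumed.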


Philip Guenther
Steffen "Daode" Nurpmeso
2012-09-15 13:28:21 UTC
Philip Guenther <guenther-***@public.gmane.org> wrote:

|On Fri, Sep 14, 2012 at 2:56 PM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
|...
|> While searching for a "bug" in my copy of nail(1) (Heirloom
|> mailx(1)) i've actually ended up at
|>
|> fflush(fi);
|> rewind(fi);
|> lseek(fileno(fi), 0, SEEK_SET);
|>
|> which caused nail(1) to fail on Mac OS X Snow Leopard, FreeBSD 9,
|> NetBSD 6 RC1 and OpenBSD 5.1. (Where "fail" means that the
|> content of dead.letter, to which *fi* will be dumped next,
|> contained the content of *fi* twice..) This behaviour disappeared
|> on all systems when the lseek(2) was removed.
|> I for one really don't know why the STDIO stuff simply assumes
|> it's offset is correct.
|
|The stdio implementations assume that because the standard says that
|they're allowed to do so because the standard requires the application
|to follow various rules when using multiple 'handles' (FILE handles or
|file descriptors) for the same stream. These rules are described in
|detail in the "General Information" section of XSH. Looking at POSIX
|2008, it's covered in section 2.5.1, "Interaction of File Descriptors
|and Standard I/O Streams":
|
|http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_01

hmm at least i see the term "Open File Description" now, and its
description in 3.254. So -- it's not a typo, just two letters not
making enough difference to enlighten me.

|You don't describe the buffering or how it was opened, so you'll have
|to go through the description yourself. The key thing to note is that
|you've created a second handle using fileno() and done an lseek() on
|it, so for 'fi' to become the active handle again without undefined
|behavior, the first call on it after the lseek() must be an fseek().

I also find it disturbing that this doesn't work!

|(Why these rules? They let stdio optimize away lseek()s and read()s
|when you do all access through stdio. rewind() can leave the kernel's
|view of the file offset alone and retain stdios buffers of the file
|contents...which is exactly why you're seeing duplicated data in this
|case.)

I think it's an anachronism that today, where there seem to be
dynamic memory streams (even) in POSIX, there is a relationship
in between FILE* and underlying file descriptors that *users*
have to take care of. POSIX defines everything that is necessary
to detach them. (And until then I/O libraries maybe shouldn't
refill buffers during a fseek(). Or, even better: don't assume
anything about contents until next real I/O op.)

The standard says

If the most recent operation, other than ftell(), on a given
stream is fflush(), the file offset in the underlying open file
description shall be adjusted to reflect the location specified by
fseek().

and no one seems to implement it after sixteen years (or
deliberately fails because of automatic refilling that
repositions).
Shouldn't this paragraph be removed, then?

|Philip Guenther

Ciao,

--steffen
Terry Lambert
2012-09-15 13:39:25 UTC
Post by Steffen "Daode" Nurpmeso
|...
|> While searching for a "bug" in my copy of nail(1) (Heirloom
|> mailx(1)) i've actually ended up at
|>
|> fflush(fi);
|> rewind(fi);
|> lseek(fileno(fi), 0, SEEK_SET);
|>
|> which caused nail(1) to fail on Mac OS X Snow Leopard, FreeBSD 9,
|> NetBSD 6 RC1 and OpenBSD 5.1. (Where "fail" means that the
|> content of dead.letter, to which *fi* will be dumped next,
|> contained the content of *fi* twice..) This behaviour disappeared
|> on all systems when the lseek(2) was removed.
|> I for one really don't know why the STDIO stuff simply assumes
|> it's offset is correct.
|
|The stdio implementations assume that because the standard says that
|they're allowed to do so because the standard requires the application
|to follow various rules when using multiple 'handles' (FILE handles or
|file descriptors) for the same stream. These rules are described in
|detail in the "General Information" section of XSH. Looking at POSIX
|2008, it's covered in section 2.5.1, "Interaction of File Descriptors
|and Standard I/O Streams":
|
|http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_01
hmm at least i see the term "Open File Description" now, and it's
description in 3.254. So -- it's not a typo, just two letters not
making enough difference to enlighten me.
|You don't describe the buffering or how it was opened, so you'll have
|to go through the description yourself. The key thing to note is that
|you've created a second handle using fileno() and done an lseek() on
|it, so for 'fi' to become the active handle again without undefined
|behavior, the first call on it after the lseek() must be an fseek().
I also find it disturbing that this doesn't work!
|(Why these rules? They let stdio optimize away lseek()s and read()s
|when you do all access through stdio. rewind() can leave the kernel's
|view of the file offset alone and retain stdios buffers of the file
|contents...which is exactly why you're seeing duplicated data in this
|case.)
I think it's an anachronism that today, where there seem to be
dynamic memory streams (even) in POSIX, there is a relationship
in between FILE* and underlaying file descriptors that *users*
have to take care of. POSIX defines anything that is necessary
to detach them. (And until then I/O libraries maybe shouldn't
refill buffers during a fseek(). Or, even better: don't assume
anything about contents until next real I/O op.)
The standard says
If the most recent operation, other than ftell(), on a given
stream is fflush(), the file offset in the underlying open file
description shall be adjusted to reflect the location specified by
fseek().
and noone seems to implement it after sixteen years (or
deliberately fails because of automatic refilling that
repositions).
Shouldn't this paragraph be removed, then?
That's fseek(), not lseek(). There is no guarantee that the lseek()
location (descriptor structure in the kernel per process open file
table) will be synchronized with the fseek() location (user space FILE
* structure), since doing so could require a user/kernel protection
domain crossing via system call.

I'd argue that even the fseek()-after-lseek() doesn't constitute a
guarantee, since that relies on an SVR4 stdio implementation detail,
and there are memory mapped implementations of stdio, on the
assumption of something like a 2G/2G or 1G/3G split between user and
kernel address spaces so the user process is always mapped into kernel
space at system call time. On 64 bit systems, this is almost a
certainty, where the processes all live above 4G virtual to avoid TLB
shootdown overhead. It's basically an SVR4-ism left over from the
USL/Novell/Caldera/SCO days.

Pretty much this is as dodgy as assuming memory mapped I/O and file
I/O can be mixed without msync() barriers (think old nntp
implementations) because "of course everyone has a unified VM and
buffer cache implementation!".

-- Terry
Post by Steffen "Daode" Nurpmeso
|Philip Guenther
Ciao,
--steffen
Steffen "Daode" Nurpmeso
2012-09-15 13:58:01 UTC
Terry Lambert <tlambert-hpIqsD4AKlfQT0dZR+***@public.gmane.org> wrote:

|On Sat, Sep 15, 2012 at 6:28 AM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
|>
|> Philip Guenther <guenther-***@public.gmane.org> wrote:
|>
|> |On Fri, Sep 14, 2012 at 2:56 PM, Steffen Daode <sdaoden-***@public.gmane.org>
|> wrote:
[.]
|> |> fflush(fi);
|> |> rewind(fi);
|> |> lseek(fileno(fi), 0, SEEK_SET);
[.]
|> The standard says
|>
|> If the most recent operation, other than ftell(), on a given
|> stream is fflush(), the file offset in the underlying open file
|> description shall be adjusted to reflect the location specified by
|> fseek().
|>
|> and noone seems to implement it after sixteen years (or
|> deliberately fails because of automatic refilling that
|> repositions).
|> Shouldn't this paragraph be removed, then?
[.]
|That's fseek(), not lseek(). There is no guarantee that the lseek()
|location (descriptor structure in the kernel per process open file
|table) will be synchronized with the fseek() location (user space FILE
|* structure), since doing so could require a user/kernel protection
|domain crossing via system call.
|
|I'd argue that even the fseek()-after-lseek() doesn't constitute a
|guarantee, since that relies on an SVR4 stdio implementation detail,
|and there are memory mapped implementations of stdio, on the
|assumption of something like a 2G/2G or 1G/3G split between user and
|kernel address spaces so the user process is always mapped into kernel
|space at system call time. On 64 bit systems, this is almost a
|certainty, where the processes all live above 4G virtual to avoid TLB
|shootdown overhead. It's basically an SVR4-ism left over from the
|USL/Novell/Caldera/SCO days.
|
|Pretty much this is as dodgy as assuming memory mapped I/O and file
|I/O can be mixed without msync() barriers (think old nntp
|implementations) because "of course everyone has a unified VM and
|buffer cache implementation!".

Wow.

|-- Terry

--steffen
Philip Guenther
2012-09-15 22:20:11 UTC
...
Post by Steffen "Daode" Nurpmeso
|You don't describe the buffering or how it was opened, so you'll have
|to go through the description yourself. The key thing to note is that
|you've created a second handle using fileno() and done an lseek() on
|it, so for 'fi' to become the active handle again without undefined
|behavior, the first call on it after the lseek() must be an fseek().
I also find it disturbing that this doesn't work!
And yet a bunch of *really* smart people came to agreement that these
rules were the best balance and many programs have been written based
on them...
Post by Steffen "Daode" Nurpmeso
|(Why these rules? They let stdio optimize away lseek()s and read()s
|when you do all access through stdio. rewind() can leave the kernel's
|view of the file offset alone and retain stdios buffers of the file
|contents...which is exactly why you're seeing duplicated data in this
|case.)
I think it's an anachronism that today, where there seem to be
dynamic memory streams (even) in POSIX, there is a relationship
in between FILE* and underlaying file descriptors that *users*
have to take care of. POSIX defines anything that is necessary
to detach them.
I suspect you need to read that section of the standard more closely:
that section covers more than just stdio vs fds: it also covers stdio
vs fork(). Your statement that POSIX should detach stdio and fds
provides no explanation about how fork(), or fileno(), or fdopen()
should behave.
Post by Steffen "Daode" Nurpmeso
(And until then I/O libraries maybe shouldn't
refill buffers during a fseek(). Or, even better: don't assume
anything about contents until next real I/O op.)
Are you sure it's refilling buffers during the fseek()? It may have
*already* had it buffered and you gave it no reason to discard those
buffers...
Post by Steffen "Daode" Nurpmeso
The standard says
If the most recent operation, other than ftell(), on a given
stream is fflush(), the file offset in the underlying open file
description shall be adjusted to reflect the location specified by
fseek().
and noone seems to implement it after sixteen years (or
deliberately fails because of automatic refilling that
repositions).
Shouldn't this paragraph be removed, then?
Hmm, given the rules in section 2.5.1 of the standard, is that
requirement actually *observable* by a compliant program? To observe
it, a program has to access the file description via a handle other
than the stdio stream, but the requirements in 2.5.1 appear to require
some other operation between the fseek() and the operation on the
other handle, no?

That said, in a quick check, both Solaris 10 and Linux/glibc appear to
implement this. In a couple of quick tests, BSD-derived stdio
implementations appear to fail this requirement for streams that are
opened read-only on normal files.
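
A probe along these lines could look like the following C sketch. The function name and the idea of passing in a small test file are assumptions made for illustration, not the test actually used. The value it returns depends on the stdio implementation, and whether a conforming program may inspect the offset this way is exactly the question raised above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe: read a little, fflush(), fseek() to `target`,
 * then ask the kernel where the open file description really is. */
off_t offset_after_flush_and_seek(const char *path, long target)
{
    assert(path != NULL && target >= 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return (off_t)-1;

    (void)fgetc(fp);                 /* let stdio fill its read buffer */

    off_t pos = (off_t)-1;
    if (fflush(fp) == 0 && fseek(fp, target, SEEK_SET) == 0)
        pos = lseek(fileno(fp), 0, SEEK_CUR);  /* query only, no move */

    fclose(fp);
    return pos;
}
```

If the quoted requirement is implemented, the returned value should equal target. A different value would suggest that the library kept or refilled its buffer instead of moving the kernel offset to the requested position.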


Philip Guenther
Steffen "Daode" Nurpmeso
2012-09-17 13:32:50 UTC
Philip Guenther <guenther-***@public.gmane.org> wrote:

|On Sat, Sep 15, 2012 at 6:28 AM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
|> Philip Guenther <guenther-***@public.gmane.org> wrote:
[.]
|> I also find it disturbing that this doesn't work!
|
|And yet a bunch of *really* smart people came to agreement that these
|rules were the best balance and many programs have been written based
|on them...
[.]
|Are you sure it's refilling buffers during the fseek()? It may have
|*already* had it buffered and you gave it no reason to discard those
|buffers...
[.]

In short: the correct thing to do is

fflush()
rewind()
lseek()
fseek()

and then dead.letter would be written only once, too. I now
remember the big fat "don't mix stream and descriptor I/O"
(fuzzy), i guess i've read it in GLibC info (thus ~2001).
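
The handoff described above can be sketched in C as follows. The helper name reread_after_handoff is invented for this example. The sketch leaves out the rewind(), on the assumption that the final fseek() repositions the stream anyway and that fflush() should be the last stream call before the descriptor is touched. It relies on POSIX semantics for fflush() on a seekable input stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: hand the file to the descriptor and back,
 * then count how many bytes the stream delivers from the start. */
long reread_after_handoff(FILE *fp)
{
    assert(fp != NULL);
    int fd = fileno(fp);
    if (fd < 0)
        return -1;

    /* Stream -> descriptor: flush so stdio and the kernel agree. */
    if (fflush(fp) != 0)
        return -1;

    /* The descriptor is the active handle now. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;

    /* Descriptor -> stream: the first stream call repositions. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;

    long n = 0;
    while (fgetc(fp) != EOF)
        n++;
    return n;
}
```

Called on a stream that has already been partly read, this should report the file's length exactly once rather than a count inflated by stale buffered data.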

A possible improvement however would be to define exactly how
line-buffered streams should behave. It seems that some libraries
flush all line-buffered (and unbuffered) output streams once a
line-buffered reader is refilled, which i think is horrible.

|Philip Guenther

--steffen (maybe)
Philip Guenther
2012-09-17 15:26:50 UTC
On Mon, Sep 17, 2012 at 6:32 AM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
...
Post by Steffen "Daode" Nurpmeso
A possible improvement however would define exactly of how
line-buffered streams should behave. It seems that some libraries
flush all line-buffered (and unbuffered) output streams once a
line-buffered reader is refilled, which i think is horrible.
That is specified in the section immediately before the "fds vs
streams" section: SUSv7 section 2.5p2.
It's also in the C standard, described as the "intended behavior".


Philip Guenther
Steffen "Daode" Nurpmeso
2012-09-17 19:32:29 UTC
Philip Guenther <guenther-***@public.gmane.org> wrote:

|On Mon, Sep 17, 2012 at 6:32 AM, Steffen Daode <sdaoden-***@public.gmane.org> wrote:
|...
|> A possible improvement however would define exactly of how
|> line-buffered streams should behave. It seems that some libraries
|> flush all line-buffered (and unbuffered) output streams once a
|> line-buffered reader is refilled, which i think is horrible.
|
|That

Not to be misunderstood. I think today setting up buffers
yourself and then using writev(2)/readv(2) or even aio or the like
is what one would actually do to get performance and control.
I don't think that stdio streams are your friend there.
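
For readers unfamiliar with that approach, a minimal C sketch of a gather write with writev(2) follows. The function name and the two-part split into a header and a body are assumptions chosen for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send two caller-owned buffers with a single
 * system call, without copying them into a stdio buffer first. */
ssize_t write_two_parts(int fd, const char *head, const char *body)
{
    assert(head != NULL && body != NULL);

    struct iovec iov[2];
    iov[0].iov_base = (void *)head;  /* writev() only reads from these */
    iov[0].iov_len = strlen(head);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    /* Returns the number of bytes written, which may be short. */
    return writev(fd, iov, 2);
}
```

The caller decides when data leaves the process, which is the kind of control referred to above. A real program would also have to loop on short writes.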

|That would be specified in the section immediately before the "fds vs
|streams" section. SUSv7 section 2.5p2
|It's also in the C standard, described as the "intended behavior".

Well it's at least implementation-defined. C11 (Draft as of
2011-04-12) says in 7.21.3 Files, item 3:

Furthermore, characters are intended to be transmitted as a
block to the host environment when a buffer is filled, when
input is requested on an unbuffered stream, or when input is
requested on a line buffered stream that requires the
transmission of characters from the host environment. Support
for these characteristics is implementation-defined, and may be
affected via the setbuf and setvbuf functions.

I must admit that i don't know where this requirement comes from,
especially given that at other places users are fully responsible
to ensure proper state themselves.
Maybe the term is unfortunate.
Because one paragraph above i read

When a stream is fully buffered, characters are intended to be
transmitted to or from the host environment as a block when
a buffer is filled.

But no one flushes all currently open FILE* objects when a fully
buffered stream gets into an underflow situation.
And in general i also think fflush(3) is necessary when reading
follows an output operation.
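
On an update stream the C standard does require an intervening fflush() or file positioning call between output and input. A minimal C sketch of that pattern follows; the helper name write_then_read is made up, and the sketch does both steps for clarity even though either one alone would satisfy the rule.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write a line to an update stream, then read
 * it back through the same stream. */
int write_then_read(FILE *fp, const char *line, char *out, size_t outlen)
{
    assert(fp != NULL && line != NULL && out != NULL && outlen > 0);

    long start = ftell(fp);          /* remember where the line begins */
    if (start < 0)
        return -1;
    if (fputs(line, fp) == EOF)
        return -1;

    /* Output must not be followed directly by input: flush first... */
    if (fflush(fp) != 0)
        return -1;
    /* ...and go back to where the line began. */
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;

    return fgets(out, (int)outlen, fp) != NULL ? 0 : -1;
}
```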

But please, i'm a poor man, i cannot afford to hire a horde of
lawyers to work out phrases that are indestructible.
I'm not even a native english speaker, which may be part of the
problem.

|Philip Guenther

--steffen
Steffen "Daode" Nurpmeso
2012-09-19 13:20:24 UTC
To end this thread with something possibly useful.

[.]
|Well it's at least implementation-defined. C11 (Draft as of
|2011-04-12) says in 7.21.3 Files, item 3:
|
| Furthermore, characters are intended to be transmitted as a
| block to the host environment when a buffer is filled, when
| input is requested on an unbuffered stream, or when input is
| requested on a line buffered stream that requires the
| transmission of characters from the host environment. Support
| for these characteristics is implementation-defined, and may be
| affected via the setbuf and setvbuf functions.
[.]
| When a stream is fully buffered, characters are intended to be
| transmitted to or from the host environment as a block when
| a buffer is filled.

It would help if individual FILE* objects could be excluded from
the autoflush functionality, i.e., via a generic approach that
sets "flags".
It would help if flushing policies could be set on FILE* objects
independently of set*buf(). (Just in case the types that operate on
memory are extended to be really on par in the future.)
It would help if users could register their own r/w or mutex lock
on a per-FILE* basis (possibly shared between multiple such
objects). Though that is of course a problem for a standard that
simply has to fit into all SMP and jump situations.
Thanks,

--steffen
