Discussion:
Aligning POSIX with C11/C++11's memory model
Matthew Dempsky
2014-07-17 21:14:20 UTC
In just the past week, I've been pulled into two separate discussions
that hinged on how POSIX behaves under C11/C++11's memory model, both
of which ended somewhat unsatisfactorily by having to make
"reasonable" extrapolations from the current definitions.

I seem to recall reading somewhere that the plan is for a future
version of POSIX to align with C11, and it also seems like that will
require defining interactions between various POSIX functionality and
C11's atomic primitives. If so, I think there would be benefits to
starting on that process now so that implementations and applications
can start preparing for that future POSIX version.

What do others think? And if this is worth doing, what would be the
proper way to proceed? E.g., just discuss on this mailing list, or
maybe try to set up a subgroup to focus on this work area, or something
else altogether?


Some concrete examples of issues that have been brought up:

1. If one thread (successfully) calls mmap() and then passes the
return value pointer to another thread via relaxed atomic store/load
operations, what guarantees does the second thread have (if any) about
accessing the newly mapped memory? I reasoned that mmap() should be
thought of as a non-atomic memory write operation to the affected
pages, so the allocating thread needs to use a store-release operation
(or stronger), and the accessing thread needs a load-consume operation
(or stronger), for the second thread to safely access the mapped
memory. However, I can also imagine mmap() might guarantee that when
it has returned, the newly mapped pages (and any implied memory
initialization) must be globally visible to the process.

[This came up because LLVM's ThreadSanitizer uses the first
interpretation, whereas TCMalloc in some cases may mmap() some memory
and then communicate it to other threads via relaxed atomics assuming
the second interpretation.]

2. If a multi-threaded application calls fork(), what affordances are
allowed when reasoning about the state of the child
process's address space? I'd reason that any private objects (e.g.,
objects in memory mmap()'d with MAP_PRIVATE) that are being
concurrently non-atomically modified when fork() is called will be
left in an unspecified state in the child process; but still those
objects will now refer to new memory locations, so there should be no
"conflict" (per C11 definition) by simply storing to them in the child
process.

[This came up because LibreSSL portable uses a pthread_atfork() hook
to mark its random number generator state as requiring a re-seed in
child processes, and arguably this could be seen as a data race
because the object being written to in the child handler might be
concurrently written to in the parent process when fork() was
invoked.]

3. If an application has concurrent calls to fork() and
pthread_atfork(), are memory operations performed prior to the
pthread_atfork() call guaranteed to be visible within the context of
the atfork callbacks? Looking more closely now, it actually seems
that currently POSIX doesn't make *any* guarantees about concurrent
fork() and pthread_atfork() being safe, which seems surprising.
Perhaps worth fixing, but then pthread_atfork() is scheduled for
possible deprecation anyway.

[Again, came up because of LibreSSL portable.]
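
To make the pattern in example 2 concrete, here is a minimal sketch of a pthread_atfork() child handler that flags a random number generator for re-seeding. It is not LibreSSL's actual code: rng_state, rng_atfork_child(), rng_install_fork_hook() and child_sees_reseed_flag() are hypothetical names used only for illustration, and the feature-test macro is an assumption about the build environment.

```c
#define _POSIX_C_SOURCE 200809L /* assumption: request POSIX.1-2008 declarations */
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for PRNG state; not LibreSSL's real structure. */
static struct {
    bool needs_reseed;
    unsigned char pool[32];
} rng_state;

/* Child handler: runs in the child process after fork(). The store goes
 * to the child's private copy of rng_state, which is a different memory
 * location from the parent's object. */
static void rng_atfork_child(void)
{
    rng_state.needs_reseed = true;
}

/* Registers the child handler; returns 0 on success, like pthread_atfork(). */
int rng_install_fork_hook(void)
{
    return pthread_atfork(NULL, NULL, rng_atfork_child);
}

/* Forks once and reports whether the child observed the flag:
 * 1 if it was set, 0 if it was not, -1 on error. */
int child_sees_reseed_flag(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(rng_state.needs_reseed ? 0 : 1);

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The open question is what happens if another parent thread is in the middle of a non-atomic write to rng_state when fork() is called: the child's copy may then hold an unspecified value, but the handler's store targets the child's own object.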
Rich Felker
2014-07-18 00:01:42 UTC
Post by Matthew Dempsky
In just the past week, I've been pulled into two separate discussions
that hinged on how POSIX behaves under C11/C++11's memory model, both
of which ended somewhat unsatisfactorily by having to make
"reasonable" extrapolations from the current definitions.
I seem to recall reading somewhere that the plan is for a future
version of POSIX to align with C11, and it also seems like that will
require defining interactions between various POSIX functionality and
C11's atomic primitives. If so, I think there would be benefits to
starting on that process now so that implementations and applications
can start preparing for that future POSIX version.
What do others think? And if this is worth doing, what would be the
proper way to proceed? E.g., just discuss on this mailing list, or
maybe try to set up a subgroup to focus on this work area, or something
else altogether?
1. If one thread (successfully) calls mmap() and then passes the
return value pointer to another thread via relaxed atomic store/load
operations, what guarantees does the second thread have (if any) about
accessing the newly mapped memory? I reasoned that mmap() should be
thought of as a non-atomic memory write operation to the affected
pages, so the allocating thread needs to use a store-release operation
(or stronger), and the accessing thread needs a load-consume operation
(or stronger), for the second thread to safely access the mapped
memory. However, I can also imagine mmap() might guarantee that when
it has returned, the newly mapped pages (and any implied memory
initialization) must be globally visible to the process.
[This came up because LLVM's ThreadSanitizer uses the first
interpretation, whereas TCMalloc in some cases may mmap() some memory
and then communicate it to other threads via relaxed atomics assuming
the second interpretation.]
I think the latter interpretation is preferable. mmap is a
sufficiently "heavy" operation that it seems silly (and gratuitously
painful for application developers) not to require it to synchronize
the visibility of the new pages.
Post by Matthew Dempsky
2. If a multi-threaded application calls fork(), what affordances are
allowed when reasoning about the state of the child
process's address space? I'd reason that any private objects (e.g.,
objects in memory mmap()'d with MAP_PRIVATE) that are being
concurrently non-atomically modified when fork() is called will be
left in an unspecified state in the child process; but still those
objects will now refer to new memory locations, so there should be no
"conflict" (per C11 definition) by simply storing to them in the child
process.
I agree.
Post by Matthew Dempsky
[This came up because LibreSSL portable uses a pthread_atfork() hook
to mark its random number generator state as requiring a re-seed in
child processes, and arguably this could be seen as a data race
because the object being written to in the child handler might be
concurrently written to in the parent process when fork() was
invoked.]
This usage is buggy; it breaks async-signal-safety of fork. Really
pthread_atfork should not be used at all. So I don't think effort
should be spent reasoning about what should happen when this function
is used. It just needs to be marked obsolescent and scheduled for
removal.
Post by Matthew Dempsky
3. If an application has concurrent calls to fork() and
pthread_atfork(), are memory operations performed prior to the
pthread_atfork() call guaranteed to be visible within the context of
the atfork callbacks? Looking more closely now, it actually seems
that currently POSIX doesn't make *any* guarantees about concurrent
fork() and pthread_atfork() being safe, which seems surprising.
Perhaps worth fixing, but then pthread_atfork() is scheduled for
possible deprecation anyway.
[Again, came up because of LibreSSL portable.]
Likewise, I think the answer is just to proceed with deprecation.

Rich
Matthew Dempsky
2014-07-18 00:55:52 UTC
Responding to Rich's specific points here, though let me emphasize
that my primary goal is to better define POSIX's overall interactions
with C11/C++11's memory model, not just the initial examples
presented. :)

I've also been privately pinging people (e.g., coworkers and other
C/C++ standards people that I've been told are interested in memory
model stuff), and there seems to be at least a decent amount of
interest in this specific topic area. So perhaps a Study Group is in
order?
Post by Rich Felker
Post by Matthew Dempsky
1. If one thread (successfully) calls mmap() and then passes the
return value pointer to another thread via relaxed atomic store/load
operations, what guarantees does the second thread have (if any) about
accessing the newly mapped memory? I reasoned that mmap() should be
thought of as a non-atomic memory write operation to the affected
pages, so the allocating thread needs to use a store-release operation
(or stronger), and the accessing thread needs a load-consume operation
(or stronger), for the second thread to safely access the mapped
memory. However, I can also imagine mmap() might guarantee that when
it has returned, the newly mapped pages (and any implied memory
initialization) must be globally visible to the process.
[This came up because LLVM's ThreadSanitizer uses the first
interpretation, whereas TCMalloc in some cases may mmap() some memory
and then communicate it to other threads via relaxed atomics assuming
the second interpretation.]
I think the latter interpretation is preferable. mmap is a
sufficiently "heavy" operation that it seems silly (and gratuitously
painful for application developers) not to require it to synchronize
the visibility of the new pages.
I discussed this briefly with Jeffrey Yasskin (C++ Library Evolution
subgroup chair), and the issue we ran into is how to define the notion
of "globally visible". Even if mmap() were defined to imply a release
fence, the second thread would still need acquire (or consume)
semantics to synchronize properly, which is what TCMalloc was hoping
to avoid. So we'd at least need to come up with some new formal
wording to explain this idea.
Post by Rich Felker
This usage is buggy; it breaks async-signal-safety of fork. Really
pthread_atfork should not be used at all. So I don't think effort
should be spent reasoning about what should happen when this function
is used. It just needs to be marked obsolescent and scheduled for
removal.
I'm not a fan of pthread_atfork() either, but it seems like a
necessary evil in the case I mentioned. I'm happy to discuss with you
further (probably off-list since it's not really on topic for
austin-group-l), if you're so inclined.
Rich Felker
2014-07-18 01:03:07 UTC
Post by Matthew Dempsky
Responding to Rich's specific points here, though let me emphasize
that my primary goal is to better define POSIX's overall interactions
with C11/C++11's memory model, not just the initial examples
presented. :)
I've also been privately pinging people (e.g., coworkers and other
C/C++ standards people that I've been told are interested in memory
model stuff), and there seems to be at least a decent amount of
interest in this specific topic area. So perhaps a Study Group is in
order?
OK, sorry for focusing on just the examples. I'm quite interested in
the topic in general and would like to be included in future
discussions.
Post by Matthew Dempsky
Post by Rich Felker
Post by Matthew Dempsky
1. If one thread (successfully) calls mmap() and then passes the
return value pointer to another thread via relaxed atomic store/load
operations, what guarantees does the second thread have (if any) about
accessing the newly mapped memory? I reasoned that mmap() should be
thought of as a non-atomic memory write operation to the affected
pages, so the allocating thread needs to use a store-release operation
(or stronger), and the accessing thread needs a load-consume operation
(or stronger), for the second thread to safely access the mapped
memory. However, I can also imagine mmap() might guarantee that when
it has returned, the newly mapped pages (and any implied memory
initialization) must be globally visible to the process.
[This came up because LLVM's ThreadSanitizer uses the first
interpretation, whereas TCMalloc in some cases may mmap() some memory
and then communicate it to other threads via relaxed atomics assuming
the second interpretation.]
I think the latter interpretation is preferable. mmap is a
sufficiently "heavy" operation that it seems silly (and gratuitously
painful for application developers) not to require it to synchronize
the visibility of the new pages.
I discussed this briefly with Jeffrey Yasskin (C++ Library Evolution
subgroup chair), and the issue we ran into is how to define the notion
of "globally visible". Even if mmap() was defined to imply a release
fence, the second thread would still need acquire (or consume)
semantics to synchronize properly, which is what TCMalloc was hoping
to avoid. So we'd at least need to come up with some new formal
wording to explain this idea.
Except with MAP_FIXED mapping over an existing mapping, I don't see
how another thread could come to have a valid pointer to the
newly-mapped region without already having performed some sort of
synchronization to get it from the thread that called mmap. Am I
missing something here?
Post by Matthew Dempsky
Post by Rich Felker
This usage is buggy; it breaks async-signal-safety of fork. Really
pthread_atfork should not be used at all. So I don't think effort
should be spent reasoning about what should happen when this function
is used. It just needs to be marked obsolescent and scheduled for
removal.
I'm not a fan of pthread_atfork() either, but it seems like a
necessary evil in the case I mentioned. I'm happy to discuss with you
further (probably off-list since it's not really on topic for
austin-group-l), if you're so inclined.
Yes, let's take it off-list if you're interested in discussing
alternatives. Some of our user/developer community for musl libc came
across the pthread_atfork thing and jumped on it as being a broken
"fix" for the prng issue. Getting a viable SSL implementation that's
not full of undefined behavior and non-library-safe code has been a
major goal of ours, and we're interested in contributing to solving
the problems.

Rich
Matthew Dempsky
2014-07-18 01:21:36 UTC
Post by Rich Felker
OK, sorry for focusing on just the examples.
No apology needed. :)
Post by Rich Felker
I'm quite interested in
the topic in general and would like to be included in future
discussions.
Great!
Post by Rich Felker
Except with MAP_FIXED mapping over an existing mapping, I don't see
how another thread could come to have a valid pointer to the
newly-mapped region without already having performed some sort of
synchronization to get it from the thread that called mmap. Am I
missing something here?
The motivating example was using C11/C++11 "relaxed" atomics, which
are suitable for synchronizing access to a single scalar value (e.g.,
a simple event counter), but don't generally imply synchronization of
other memory locations. However, if you want to communicate a pointer
to a memory region to another thread, you generally need to use
"store-release" and "load-acquire" operations to ensure the related
memory locations (and not just the atomic object itself) are properly
synchronized too.

E.g., in common implementations, calling calloc() and then
communicating the return value to another thread via relaxed atomics
would generally be unsafe: if the receiving thread tries to read the
allocated memory, it might read garbage data instead of the zero
initialized memory (assuming calloc() internally decides to reuse an
existing dirty page of memory that it needs to memset() rather than
getting freshly initialized anonymous memory from the kernel). By
using store-release/load-acquire, this race is avoided.
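
As a rough sketch of the safe variant described above, the following C11 code hands a calloc()'d buffer to another thread with a store-release/load-acquire pair. The names shared_buf, producer(), consumer() and publish_with_release_acquire() are invented for this illustration, and error handling is reduced to abort().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

enum { BUF_LEN = 4096 };

/* Slot through which the producer hands the buffer to the consumer. */
static _Atomic(unsigned char *) shared_buf;

static void *producer(void *arg)
{
    (void)arg;
    unsigned char *p = calloc(BUF_LEN, 1);
    if (p == NULL)
        abort();
    /* Release: everything sequenced before this store, including any
     * zeroing done inside calloc(), happens-before an acquire load
     * that reads this pointer value. */
    atomic_store_explicit(&shared_buf, p, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    unsigned char *p;
    /* Acquire: pairs with the release store in producer(). */
    while ((p = atomic_load_explicit(&shared_buf,
                                     memory_order_acquire)) == NULL)
        ;
    size_t nonzero = 0;
    for (size_t i = 0; i < BUF_LEN; i++)
        nonzero += (p[i] != 0);
    *(size_t *)arg = nonzero;
    return NULL;
}

/* Hypothetical driver: returns how many non-zero bytes the consumer saw. */
size_t publish_with_release_acquire(void)
{
    pthread_t prod, cons;
    size_t nonzero = (size_t)-1;

    atomic_store(&shared_buf, (unsigned char *)NULL);
    if (pthread_create(&cons, NULL, consumer, &nonzero) != 0)
        abort();
    if (pthread_create(&prod, NULL, producer, NULL) != 0)
        abort();
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    free(atomic_load(&shared_buf));
    return nonzero;
}
```

Replacing both memory orders with memory_order_relaxed gives the unsafe version: the pointer itself still arrives intact, but nothing orders the consumer's reads of the buffer after the producer's initialization.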
Post by Rich Felker
Yes, let's take it off-list if you're interested in discussing
alternatives. Some of our user/developer community for musl libc came
across the pthread_atfork thing and jumped on it as being a broken
"fix" for the prng issue.
Yep, I've already been emailing privately with Szabolcs Nagy about
this in response to his blog post. I'll add you to that thread.
Post by Rich Felker
Getting a viable SSL implementation that's
not full of undefined behavior and non-library-safe code has been a
major goal of ours, and we're interested in contributing to solving
the problems.
Also great to hear! :)
Matthew Dempsky
2014-07-18 02:57:53 UTC
I don't think that the synchronization TCMalloc wants to rely on exists today, and I don't think it would be a good idea to require it either.
To be clear, the synchronization that TCMalloc wants is that if one
thread calls mmap() with MAP_ANON to allocate a page of new zero
initialized memory, and then communicates a pointer to that page to
another thread via relaxed atomics (i.e., no memory barriers), then
the second thread can read/write the newly allocated memory without
any data races/undefined behavior. I.e., reads from the page on
another thread should return zero-initialized memory values, and never
garbage that might be present from previous uses of the underlying
physical memory page.

When I said "globally visible", I was referring to the data logically
stored in the newly mapped pages; not that the pages would necessarily
be faulted into memory, present in a CPU's TLB, or anything else.

I'm told that this requirement is satisfied by Linux; I'm fairly confident
it's also satisfied on OpenBSD, and I would be surprised if any other
OS that cares about security did differently. I've come up with some
silly ways an implementation might subvert this expectation, but I
can't imagine any realistic ways they would.
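
For reference, the pattern in question looks roughly like the sketch below: one thread maps anonymous memory and publishes the pointer with relaxed atomics only. This is an illustration rather than TCMalloc's code; shared_region, mapper(), reader() and publish_mapping_relaxed() are made-up names, and the _DEFAULT_SOURCE macro is an assumption about a glibc-style build environment. Whether the reads in reader() are formally race-free is exactly the unresolved question, so treat this as a statement of the problem and not as a recommended idiom.

```c
#define _DEFAULT_SOURCE /* assumption: glibc; exposes MAP_ANON under strict -std=c11 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

enum { REGION_LEN = 4096 };

/* Slot used to pass the mapping's address between threads. */
static _Atomic(unsigned char *) shared_region;

static void *mapper(void *arg)
{
    (void)arg;
    unsigned char *p = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        abort();
    /* Relaxed: orders nothing except accesses to shared_region itself. */
    atomic_store_explicit(&shared_region, p, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    unsigned char *p;
    while ((p = atomic_load_explicit(&shared_region,
                                     memory_order_relaxed)) == NULL)
        ;
    /* No acquire here: these reads rely on the zero-filled pages being
     * visible to every thread once mmap() has returned, which is the
     * guarantee under discussion. */
    size_t nonzero = 0;
    for (size_t i = 0; i < REGION_LEN; i++)
        nonzero += (p[i] != 0);
    *(size_t *)arg = nonzero;
    return NULL;
}

/* Hypothetical driver: returns how many non-zero bytes the reader saw. */
size_t publish_mapping_relaxed(void)
{
    pthread_t map_thread, read_thread;
    size_t nonzero = (size_t)-1;

    atomic_store(&shared_region, (unsigned char *)NULL);
    if (pthread_create(&read_thread, NULL, reader, &nonzero) != 0)
        abort();
    if (pthread_create(&map_thread, NULL, mapper, NULL) != 0)
        abort();
    pthread_join(map_thread, NULL);
    pthread_join(read_thread, NULL);
    munmap(atomic_load(&shared_region), REGION_LEN);
    atomic_store(&shared_region, (unsigned char *)NULL);
    return nonzero;
}
```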
Tom Honermann
2014-07-22 15:15:43 UTC
Post by Rich Felker
Post by Matthew Dempsky
[This came up because LibreSSL portable uses a pthread_atfork() hook
to mark its random number generator state as requiring a re-seed in
child processes, and arguably this could be seen as a data race
because the object being written to in the child handler might be
concurrently written to in the parent process when fork() was
invoked.]
This usage is buggy; it breaks async-signal-safety of fork. Really
pthread_atfork should not be used at all. So I don't think effort
should be spent reasoning about what should happen when this function
is used. It just needs to be marked obsolescent and scheduled for
removal.
My understanding is that fork() has already been declared not
async-signal-safe following resolution of
http://austingroupbugs.net/view.php?id=62. Is that not the case? The
issue 7 text for system interfaces/2.4.3 still lists fork(), but I
presume that will change in the next issue (and perhaps _Fork() will be
listed then? http://austingroupbugs.net/view.php?id=18).

I believe glibc's fork() implementation is not async-signal-safe when
the parent process is multi-threaded even if the application does not
use pthread_atfork(). The implementation registers its own atfork()
handlers (at least when the parent process is multi-threaded).

Tom.
Rich Felker
2014-07-22 15:52:50 UTC
Post by Tom Honermann
Post by Rich Felker
Post by Matthew Dempsky
[This came up because LibreSSL portable uses a pthread_atfork() hook
to mark its random number generator state as requiring a re-seed in
child processes, and arguably this could be seen as a data race
because the object being written to in the child handler might be
concurrently written to in the parent process when fork() was
invoked.]
This usage is buggy; it breaks async-signal-safety of fork. Really
pthread_atfork should not be used at all. So I don't think effort
should be spent reasoning about what should happen when this function
is used. It just needs to be marked obsolescent and scheduled for
removal.
My understanding is that fork() has already been declared not
async-signal-safe following resolution of
http://austingroupbugs.net/view.php?id=62. Is that not the case?
The issue 7 text for system interfaces/2.4.3 still lists fork(), but
I presume that will change in the next issue (and perhaps _Fork()
will be listed then? http://austingroupbugs.net/view.php?id=18).
I believe these changes are for issue 8, but I may be mistaken. So yes,
the AS-safety thing may be a non-issue. However, it doesn't change my
position that pthread_atfork should be deprecated and that using it is
generally unsafe.
Post by Tom Honermann
I believe glibc's fork() implementation is not async-signal-safe
when the parent process is multi-threaded even if the application
does not use pthread_atfork(). The implementation registers its own
atfork() handlers (at least when the parent process is
multi-threaded).
As far as I know this is correct.

Rich

Schwarz, Konrad
2014-07-18 09:30:08 UTC
-----Original Message-----
Sent: Thursday, 17 July 2014 23:14
Subject: Aligning POSIX with C11/C++11's memory model
1. If one thread (successfully) calls mmap() and then passes the return
value pointer to another thread via relaxed atomic store/load
operations, what guarantees does the second thread have (if any) about
accessing the newly mapped memory? I reasoned that mmap() should be
thought of as a non-atomic memory write operation to the affected
pages, so the allocating thread needs to use a store-release operation
(or stronger), and the accessing thread needs a load-consume operation
(or stronger), for the second thread to safely access the mapped
memory. However, I can also imagine mmap() might guarantee that when
it has returned, the newly mapped pages (and any implied memory
initialization) must be globally visible to the process.
Looking at this at the operational level, mmap() implies a change
to the process's address mapping, which in turn requires a system call.
This system call will be context synchronizing, i.e., all threads
can access the new memory when mmap() returns.

However, I think the process of passing the pointer from one thread
to the other still requires release/acquire semantics. I'm not familiar
with the C/C++ language, but a bare load-locked/store-conditional
sequence of RISC processors offers little or no guarantees about when
this update actually becomes visible to other processors (i.e. threads)
in relation to other memory accesses.

As such, the situations in which an atomic operation synthesized from
ll/sc operations would suffice for inter-thread operations are extremely
limited.

Since the potential for error is great, and the performance impact of enforcing
memory barriers here is minimal (i.e., a mmap() has just been performed,
dwarfing the time needed by any additional barriers), I think it would
be counter-productive for POSIX to specify any exemptions from the basic
rule that inter-thread communication may only be done with synchronized
operations that use barriers.

Regards,
Konrad Schwarz

P.S.: I would join the proposed group.
Martin Sebor
2014-07-22 15:22:28 UTC
Post by Matthew Dempsky
In just the past week, I've been pulled into two separate discussions
that hinged on how POSIX behaves under C11/C++11's memory model, both
of which ended somewhat unsatisfactorily by having to make
"reasonable" extrapolations from the current definitions.
I seem to recall reading somewhere that the plan is for a future
version of POSIX to align with C11, and it also seems like that will
require defining interactions between various POSIX functionality and
C11's atomic primitives. If so, I think there would be benefits to
starting on that process now so that implementations and applications
can start preparing for that future POSIX version.
What do others think? And if this is worth doing, what would be the
proper way to proceed? E.g., just discuss on this mailing list, or
maybe try to set up a subgroup to focus on this work area, or something
else altogether?
I agree aligning POSIX with C11 is a worthwhile effort. I'm interested
in participating in it.

Martin