
You're talking about atomic operations and synchronization. This is not about that.

This is about memory ordering: how the CPU is allowed to reorder normal, unsynchronized loads and stores. x86 has a strong memory order, but ARM CPUs can reorder loads and stores significantly more. Thus on ARM you very often need memory barriers where x86 needs none.

If memory ordering is not handled according to the specification, programs that execute simultaneously on more than one CPU core and rely on the CPU's memory model will not work correctly.

You don't want to make every ordinary load and store atomic, because that would incur a rather high performance cost.



Yes it is (about atomics/sync). ARM, like every other major core, is "sequentially consistent" with respect to the core executing the code. The memory order only matters for external visibility (aka other threads or devices). This means that the load/store order from a single thread doesn't matter to the other threads except at synchronization points (aka locks/etc). Those sync primitives have implied barriers.

If someone has a program which depends on load/store order in userspace, then they likely have bugs on x86 as well, since threads can be migrated between cores and compilers are fully allowed to reorder loads/stores as long as visible side effects are maintained. I could go into the finer points of how compilers have to create internal barriers (this has nothing to do with DMB/SFENCE/etc.) across external function calls as well (which plays into why you should be using library locking calls rather than creating your own), but that is another whole subject.

The latter case is something I don't think most people understand, particularly as GCC and friends get more aggressive about determining side effects and tossing code. Also, volatile doesn't do what most people think it does: trying to create sync primitives simply by forcing loads/stores does nothing when the compiler is still free to reorder the operations.

An emulator is going to maintain this contract as well. That is why things like qemu work just fine running x86 binaries on random ARMs today without having to modify the hardware memory model.


> If someone has a program which is depending on load/store order in userspace then they likely have bugs on x86 as well

Incorrect. On x86, if you write to memory n times, other cores are guaranteed to see the writes in the same order. The second write is never going to be visible to other cores before the first write. It's correct to rely on the x86 memory model in x86 software.

On ARM, those stores can become visible to other cores in any order.

> since threads can be migrated between cores

Irrelevant. This is about code executing concurrently on multiple cores. The operating system and threads are irrelevant. This is about hardware behavior, the CPU core's load/store system and instruction reordering, not software.

> ... and the compilers are fully allowed to reorder load/stores...

This has nothing to do with compilers. This has everything to do with how CPU cores reorder reads and writes.

> Particularly as GCC and friends get more aggressive about determining side effects and tossing code

If GCC has bugs, please report them. Undefined behavior can give that impression, but again, this topic has nothing to do with compilers.

> An emulator is also going to maintain this contract as well. That is why things like qemu work just fine to run x86 binaries on random ARMs today without having to modify the hardware memory model.

This is not true. See: http://wiki.qemu.org/Features/tcg-multithread#Memory_consist...

> Remaining Case: strong on weak, ex. emulating x86 memory model on ARM systems

I recommend you read this: https://en.wikipedia.org/wiki/Memory_ordering.


> It's correct to rely on x86 memory model in x86 software.

If you have complete control over the whole stack, sure. But we are talking about Windows user-space applications. You continue to ignore my original point: the edge cases you are describing may cause problems, but they are just that, edge cases which can be solved with slow-path code (put a DSB after every store if you like), and which likely depend on behaviors higher in the stack which aren't guaranteed and are therefore "broken". If your code is in assembly and never makes library calls, then you might consider it "correctly written"; otherwise you're probably fooling yourself for the couple percent you gain over simply calling EnterCriticalSection().


> If you have complete control over the whole stack, sure. But we are talking about Windows user-space applications.

Well, a hardware feature is a hardware feature. Load/store ordering is a hardware feature.

The operating system, user space or kernel space, compiler, etc. are not relevant when discussing CPU core hardware operation.

You don't need complete control of the stack. You just need code executing instructions concurrently on multiple cores.


> On ARM, those stores can become visible to other cores in any order.

And I will repeat this again: for "correctly" written code this doesn't matter, because the data areas being stored to should be protected by _LOCKS_, which will enforce visibility. If you think you're being clever and writing "lock free" code by depending on the memory model, you're likely fooling yourself.


Lock-free code does indeed need to rely on the memory model. Relying on the underlying machine model is not fooling oneself.

Please don't conflate visibility with load/store order. It's a different matter.

See how C++ memory model operations map to instructions on different processors, especially how many of the cases are plain loads or stores on x86.

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (link from another comment in this discussion)



