转载

[译]Memory Reordering Caught in the Act

原文： http://preshing.com/20120515/memory-reordering-caught-in-the-act/

编写lock-free 的 C/C++ 程序时，如何保证memory ordering 的正确性上要非常小心，否则，奇怪的事就来了。

Intel 在《 x86/64 Architecture Specification 》 Volume 3, §8.2.3 一节中列出了一些可能发生的 “ 奇怪的事 ” 。

来看一个小例子： X ， Y 两个变量均被初始化为 0 ，编写如下汇编代码，由两个处理器 (processor) 执行：

为了清楚的阐述 CPU ordering，此处使用了汇编指令。每个处理器将一个变量赋值为 1(store) ，并读取另外一个变量 (load) ，此处的 r1 、 r2 代表寄存器。

正常情况下，不管处理器的执行顺序如何， r1 ， r2 所有可能的结果为：

r1==0, r2 ==1

r1==1, r2==0

r1==1, r2==1

不可能的结果为 :

r1==0, r2==0 。

但按 Intel 说明书中的说法，这种不可能是有可能的，这就是本文所述的 “ 奇怪的事 ” 。至少，这是违反直觉的。

Memory Reordering

Intel x86/x64 处理器和大多数处理器家族一样，在不影响单个线程执行结果的前提下，允许对内存交互指令进行重排 (reorder) 。特别指出：处理器允许将 store 动作延迟到任何 load 动作之后 , 只要 load 、store 的操作的不是同一块内存。因此，您编写的汇编代码在执行时可能变成了这样：

[译]Memory Reordering Caught in the Act

来，试试！

“好好好，你说这事可能发生，但我从来没见过，叫我如何相信？”

那 ... 叫我们来试试如何：源码在这。

这份代码包括 win32 、 POSIX 两个版本，由两个派生线程重复执行上述transaction代码，并由主线程核对执行结果。

第一个工作线程源码如下： x,y,r1,r2 为全局变量，两个 POSIX semaphores 用于和主线程的并发处理：

  1 sem_t beginSema1;  2 sem_t endSema;  3 int X, Y;int r1, r2;  4 void *thread1Func(void *param)  5 {  6     MersenneTwister random(1);                // Initialize random number generator  7     for (;;)                                  // Loop indefinitely  8     {  9         sem_wait(&beginSema1);                // Wait for signal from main thread 10         while (random.integer() % 8 != 0) {}  // Add a short, random delay 11  12         // ----- THE TRANSACTION! ----- 13         X = 1; 14         asm volatile("" ::: "memory");        // Prevent compiler reordering 15         r1 = Y; 16  17         sem_post(&endSema);                   // Notify transaction complete 18     } 19     return NULL;  // Never returns 20 };

补充一句，每个transaction 执行前，插入随机延时逻辑以保证线程切换。示例有两个工作线程，这里试图让它们的执行尽可能的重叠 ( 译注：或许并无必要 ) 。本例采用的随机延时实现： MersennsTwister 和 measuring lock contention 、

validating that the recursive Benaphore worked 中的一样。别被 asm volatile这行代码唬到，这只是告诉 GCC 编译器在生成机器码时，不要重排 store 和 load ，以防 GCC 在编译优化时又想出了什么“歪点子”。

来看编译后的汇编代码：

 $ gcc -O2 -c -S -masm=intel ordering.cpp $ cat ordering.s     ...     mov    DWORD PTR _X, 1     mov    eax, DWORD PTR _Y     mov    DWORD PTR _r1, eax     ...

Store 和 load 的顺序和预期的一致，先执行 X=1 ，随后执行 r1=Y 。

下面是主线程代码，职责如下：初始化后，进入无限循环，重置 x ， y 为 0 ，通过信号量触发两个线程运行。

Pay particular attention to the way all writes to shared memory occur before sem_post, and all reads from shared memory occur after sem_wait. The same rules are followed in the worker threads when communicating with the main thread. Semaphores give us acquire and release semantics on every platform. That means we are guaranteed that the initial values of X = 0 and Y = 0 will propagate completely to the worker threads, and that the resulting values of r1 and r2 will propagate fully back here. In other words, the semaphores prevent memory reordering issues in the framework, allowing us to focus entirely on the experiment itself!（这段怎么读都像废话，不翻译了）

  1 int main()  2 {  3     // Initialize the semaphores  4     sem_init(&beginSema1, 0, 0);  5     sem_init(&beginSema2, 0, 0);  6     sem_init(&endSema, 0, 0);  7   8     // Spawn the threads  9     pthread_t thread1, thread2; 10     pthread_create(&thread1, NULL, thread1Func, NULL); 11     pthread_create(&thread2, NULL, thread2Func, NULL); 12  13     // Repeat the experiment ad infinitum 14     int detected = 0; 15     for (int iterations = 1; ; iterations++) 16     { 17         // Reset X and Y 18         X = 0; 19         Y = 0; 20         // Signal both threads 21         sem_post(&beginSema1); 22         sem_post(&beginSema2); 23         // Wait for both threads 24         sem_wait(&endSema); 25         sem_wait(&endSema); 26         // Check if there was a simultaneous reorder 27         if (r1 == 0 && r2 == 0) 28         { 29             detected++; 30             printf("%d reorders detected after %d iterations/n", detected, iterations); 31         } 32     } 33     return 0;  // Never returns 34 }

检验真理的时刻到了，这是我在Intel Xeon W3520 、 Cygwin 环境下运行的结果：

[译]Memory Reordering Caught in the Act

这下你总算信了吧！运行过程中，内存重排序大约每6600 次检测到一次。当我在 Core 2 Duo E6300、Ubuntu 环境下测试时，出现的概率甚至更低。你已经开始意识到，微妙的“时机” bugs 可以在不被发现的情况下蔓延到 lock-free 的代码中。现在，你可能在想：“我不需要这该死的reording”。OK ，至少有两种方法。

一种是将两个线程绑定到同一个 CPU core 上， pthread 并未提供相应的结构，但 linux 上可以这样做：

     cpu_set_t cpus;     CPU_ZERO(&cpus);     CPU_SET(0, &cpus);     pthread_setaffinity_np(thread1, sizeof(cpu_set_t), &cpus);     pthread_setaffinity_np(thread2, sizeof(cpu_set_t), &cpus);

自此之后，重排序消失了。因为单个处理器上是保序的，哪怕线程是抢占的、将在任意时间被重新调度（That’s because a single processor never sees its own operations out of order, even when threads are pre-empted and rescheduled at arbitrary times.）。当然，将两个线程绑定到一个核上，致使其它CPU Core未被有效利用(由此看来，这并不是个好办法)。

我在Playstation 3 上编译、运行，并未检测到内存重排。This suggests (but doesn’t confirm) that the two hardware threads inside the PPU may effectively act as a single processor, with very fine-grained hardware scheduling.

采用 StoreLoad Barrier避免memory reordering

另一种避免memory reordering的方法是：在两条指令间引入CPU Barrier 。本例中，我们要阻止 Store 和随后的 Load 指令发生重排，引入的 CPU Barrier 通常称为 StoreLoad Barrier 。

在 X86/X64 处理器上，没有专门的 StoreLoad barrier 指令，但有一些指令可完成另丰富的功能。 Mfence 指令为full memory barrier 指令，它可以避免任何情况的内存重排。 GCC 中的实现方式如下：

 1     for (;;)                                  // Loop indefinitely 2     { 3         sem_wait(&beginSema1);                // Wait for signal from main thread 4         while (random.integer() % 8 != 0) {}  // Add a short, random delay 5  6         // ----- THE TRANSACTION! ----- 7         X = 1; 8         asm volatile("mfence" ::: "memory");  // Prevent memory reordering 9         r1 = Y;

查看编译生成的汇编代码来验证效果：

     ...     mov    DWORD PTR _X, 1     mfence     mov    eax, DWORD PTR _Y     mov    DWORD PTR _r1, eax     ...

修改后，内存重排消失了，两个线程可运行在两个不同的 CPU cores 上。

Similar Instructions and Different Platforms

其实， mfence 不是x86/x64 下唯一的 full memory barrier. 在这些处理器上，任何 locked 指令，如 xchg 均属于 full memory barrier ，此时无需使用其他的 SSE 指令或 write-combined memory。实际上，如果你使用 MemoryBarrier 指令时， Microsoft C++ 编译器会生成 xchg 指令 ( 至少 VS2008 如此 )

Mfence 指令适用于 x86/x64 ，如果想编写可移植的代码，可以采用预处理宏技术。 Linux 内核提供了一组宏： smp_mb 、 smp_rmb 、 smp_wmb ，并提供了一组实现 alternate implementations on different architectures . 如在 PowerPC 上， smp_mb 被实现为 sync.

不同的CPU 家族有其自己的 memory ordering 指令集，编译器根据自身喜好提供此类功能，而跨平台项目则为此封装自己的抽象层 ... 而这些对简化 lock-free 编程毫无益处。这也是为何 C++11 引入 C++11 atomic library 标准的原因，标准化、更为方便的编写 lock-free 的可移植代码。