Research on the Impact of Intel PAUSE Instruction on Applications and Recommendations
Rui Guo (Meituan), Hongtao Zhu (Meituan), Yajing Liu - 30/Jul/2020
Introduces the principle, purpose, and effect of the PAUSE instruction in Intel processors
This article studies how PAUSE latency in different Intel processor microarchitectures affects application performance, and offers optimization recommendations. Articles of this type are intended to provide information about products and services that we believe are useful and valuable to developers.
Introduction
This article introduces the principle, purpose, and effect of the PAUSE instruction in Intel processors. It traces how PAUSE latency changed across three processor microarchitectures (Broadwell, Skylake, and Cascade Lake) and how that latency affects the performance of specific applications at the system kernel level. It also uses the MySQL database as an example to show how application performance differs with different PAUSE cycle counts, and describes the optimization process.
Target Audience
Software developers, platform architects, data scientists, and scholars who seek to maximize the performance advantages of Intel processors.
Note
This article only illustrates that the PAUSE instruction may affect the performance of specific applications; it does not imply that all applications can be optimized in the same way.
Background
Improving the performance of an application system requires not only updated hardware but also optimized software and instruction sets. During application optimization, we found that a change in PAUSE latency can bring about an unpredictable performance penalty. We therefore took a deeper look at the PAUSE instruction to explore how software can use it more flexibly through continued optimization, with the aim of removing this obstacle to application performance improvement.
Instruction Set Architecture for Processor
In an x86 CPU, instructions tell the processor what it needs to do. The instruction set architecture, which encompasses instruction formats, addressing modes, and data types, is critical to the performance and functionality of the processor. Before analyzing the PAUSE instruction, it is therefore useful to review the recent generations of Intel® Xeon® processors and their microarchitectures, in particular their instruction sets.
Broadwell (BDX): Broadwell is the 14 nm process shrink of the Haswell microarchitecture in the Tick-Tock model. It incorporates several enhancements and supports the AVX2 instruction set. With the help of ADOX, ADCX, and MULX, Broadwell improves the performance of large-integer (multi-precision) arithmetic, and it introduces a number of new instructions such as RDSEED and PREFETCHW.
Skylake (SKX): Compared to the Broadwell microarchitecture, Skylake features higher IPC and better power efficiency, as well as an enhanced on-die interconnect and L3 cache. Skylake also modified the PAUSE instruction, adopted instruction set extensions such as Memory Protection Extensions (MPX), and introduced the AVX-512 instruction set.
Cascade Lake (CLX): The successor to Skylake, the higher-performance Cascade Lake adds support for AVX512_VNNI, which is designed to accelerate deep learning and AI workloads by boosting INT8 computing performance. It also modifies the PAUSE instruction again.
Research on PAUSE Instruction
The PAUSE instruction was first introduced with the Intel Pentium 4 processor to improve the performance of spin-wait loops. It is typically used by software threads that execute on two logical processors in the same processor core and wait for a lock to be released. Such short wait loops tend to last between tens and a few hundred cycles. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows.
Without PAUSE, an Intel® processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop, and the processor uses this hint to avoid the memory order violation in most situations. PAUSE can therefore improve the performance of processors that support Intel Hyper-Threading Technology when they execute spin-wait loops: it avoids the memory order violation and the resulting pipeline flush, and it reduces power consumption by stalling the pipeline.
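The spin-then-yield pattern described above can be sketched in C as follows. This is a minimal illustration, not code from the article: the names `cpu_relax`, `SPIN_LIMIT`, and `wait_for_flag` are hypothetical, the spin limit of 200 is an arbitrary placeholder, and POSIX `sched_yield` stands in for a real blocking OS primitive such as the `WaitForSingleObject` call mentioned in the text.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* assumption: no PAUSE equivalent used here */
#endif

/* Hypothetical tuning knob: PAUSE iterations to spin before yielding. */
enum { SPIN_LIMIT = 200 };

/* Wait until *flag becomes true. Returns how many times the thread yielded. */
static unsigned wait_for_flag(atomic_bool *flag)
{
    unsigned yields = 0;

    assert(flag != NULL);
    for (;;) {
        /* Short phase: spin with PAUSE, re-checking the flag each iteration. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (atomic_load_explicit(flag, memory_order_acquire))
                return yields;
            cpu_relax();
        }
        /* Long phase: give the CPU back to the scheduler, then retry. */
        sched_yield();
        yields++;
    }
}
```

A caller would set the flag from another thread with a release store. Because the spin phase is counted in PAUSE iterations, its real duration depends on the PAUSE latency of the processor it runs on.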
The PAUSE instruction is intended to:
- Temporarily provide the sibling logical processor (which is ready to make forward progress on exiting the spin loop) with competitively shared hardware resources. In the Skylake microarchitecture, the competitively shared microarchitectural resources that the sibling logical processor can use include more front-end slots in the ICache, LSD, and IDQ, and more execution slots in the RS.
- Save power consumed by the processor core, compared with executing an equivalent spin-loop instruction sequence, in configurations where one logical processor is inactive, both logical processors in the same core execute the PAUSE instruction, or Hyper-Threading (HT) is disabled.
The latency of the PAUSE instruction may vary depending on the processor microarchitecture (Table 1):
Table 1: Xeon CPU cycles of PAUSE instruction
| CPU | Cycles |
| --- | --- |
| Intel Xeon Broadwell | 10 |
| Intel Xeon Skylake | 140 |
| Intel Xeon Cascade Lake | 44 |
The latency of the PAUSE instruction is about 10 cycles on the Broadwell microarchitecture, whereas on the Skylake microarchitecture it was extended to as many as 140 cycles. The increased latency is expected to allow the logical processor that is ready to make forward progress to use the competitively shared microarchitectural resources more effectively.
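One rough way to observe this difference on a given machine is to time a run of back-to-back PAUSE instructions with the time-stamp counter. The sketch below is an illustration under stated assumptions, not the method used in the article: `pause_tsc_ticks` is a hypothetical helper, it relies on the GCC/Clang header `x86intrin.h`, and it reports TSC ticks (which advance at a fixed reference frequency and include loop overhead), so the result only approximates core cycles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() and _mm_pause() on GCC and Clang */

/* Average TSC ticks per PAUSE over `iterations` back-to-back executions. */
static double pause_tsc_ticks(uint64_t iterations)
{
    assert(iterations > 0);

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iterations; i++)
        _mm_pause();
    uint64_t end = __rdtsc();

    return (double)(end - start) / (double)iterations;
}
#else
/* Assumption: on non-x86 targets there is nothing to measure. */
static double pause_tsc_ticks(uint64_t iterations)
{
    (void)iterations;
    return 0.0;
}
#endif
```

Calling `pause_tsc_ticks(1000000)` several times and taking the minimum reduces noise from interrupts and frequency changes. Pinning the thread to one core makes the numbers more repeatable.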
The increased latency of the Skylake PAUSE instruction has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have a negligible impact on less threaded applications, provided that forward progress is not blocked on executing a fixed number of looped PAUSE instructions. There is also a small power benefit in 2-core and 4-core systems. Less threaded applications that are sensitive to PAUSE latency will, however, suffer some performance loss.
To lower the impact of the PAUSE delay, the 2nd generation Intel® Xeon® Scalable processor, based on Cascade Lake, reduces the latency and so mitigates the performance penalty suffered by latency-sensitive applications.
Possible Impact of PAUSE Instruction on Linux Kernel
For programs designed and optimized for the Broadwell microarchitecture, the increased PAUSE latency of the Skylake-based Intel® Xeon® Scalable processor can lead to a performance penalty. This is because the spin-wait loops of these programs are implemented with a fixed number of looped PAUSE instructions. The increase in cycles per PAUSE extends the duration of each spin-wait loop, which can hamper the overall throughput of the system.
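The arithmetic behind this effect is simple: a loop of N PAUSE instructions costs roughly N times the per-instruction latency. The sketch below works this out using the approximate cycle counts from Table 1. The names `pause_cycles` and `spin_cycles` are illustrative only, and loop overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate PAUSE latencies in cycles, taken from Table 1. */
enum pause_cycles {
    PAUSE_CYCLES_BROADWELL    = 10,
    PAUSE_CYCLES_SKYLAKE      = 140,
    PAUSE_CYCLES_CASCADE_LAKE = 44
};

/* Cycles spent in a loop of `count` PAUSE instructions, ignoring loop overhead. */
static uint64_t spin_cycles(uint64_t count, enum pause_cycles latency)
{
    /* Guard against a non-positive latency and against 64-bit overflow. */
    assert(latency > 0 && count <= UINT64_MAX / (uint64_t)latency);
    return count * (uint64_t)latency;
}
```

For a fixed count of 1000, this gives 10,000 cycles on Broadwell, 140,000 on Skylake, and 44,000 on Cascade Lake. The same source code therefore waits 14 times longer on Skylake than on Broadwell.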
Spinning with a fixed count of PAUSE instructions as a time-delay technique to block program execution can produce unpredictable latency, and this matters most for spinlocks.
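One possible alternative, shown here only as a sketch and not as a technique taken from the article, is to bound the spin by elapsed time instead of by a PAUSE count, so that the wait does not scale with the PAUSE latency of the processor. The helper names `now_ns` and `spin_until` are hypothetical, and the example assumes a POSIX system that provides `clock_gettime` with `CLOCK_MONOTONIC`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* assumption: no PAUSE equivalent used here */
#endif

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin until *flag is set or budget_ns has elapsed.
 * Returns true if the flag was observed set, false on timeout. */
static bool spin_until(atomic_bool *flag, uint64_t budget_ns)
{
    assert(flag != NULL);

    uint64_t deadline = now_ns() + budget_ns;
    do {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return true;
        cpu_relax();
    } while (now_ns() < deadline);

    /* Final check so a flag set right at the deadline is not missed. */
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

Reading the clock on every iteration adds its own overhead. A real implementation might check the deadline only every few PAUSE instructions, or use the time-stamp counter directly.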
A spinlock (Figure 1) is a mechanism to control access to shared resources. As described earlier, the PAUSE instruction hints to the processor that the code sequence is a spin-wait loop, which avoids the severe penalty of a possible memory order violation when the loop exits. Because the spin-wait loops of these programs use a fixed number of looped PAUSE instructions, the increased PAUSE latency prolongs each spin-wait loop. Another key finding of the research on spinlocks is that, on a multi-core system, an ordinary spinlock lets only one CPU thread acquire the lock variable at a time while the other threads keep spinning on it. The cache coherence protocol must synchronize and invalidate the state and data held by all of those CPU threads to ensure data correctness, which may cause performance loss.
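To make the coherence cost concrete, the sketch below shows a test-and-test-and-set spinlock, a common textbook variant that waits on plain loads and attempts the atomic write only when the lock looks free. It is an illustration only, not the spinlock in Figure 1 and not the Linux kernel's implementation. The type `ttas_lock_t` and its functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* assumption: no PAUSE equivalent used here */
#endif

typedef struct {
    atomic_bool locked;
} ttas_lock_t;

static void ttas_init(ttas_lock_t *l)
{
    atomic_init(&l->locked, false);
}

/* One acquisition attempt. The exchange returns the previous value,
 * so a previous value of false means this caller took the lock. */
static bool ttas_trylock(ttas_lock_t *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void ttas_lock(ttas_lock_t *l)
{
    for (;;) {
        if (ttas_trylock(l))
            return;
        /* Wait with read-only loads: they do not take exclusive ownership
         * of the cache line, so waiters do not invalidate each other. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

static void ttas_unlock(ttas_lock_t *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Each waiter still executes PAUSE in its inner loop, so the time between the lock being released and a waiter noticing it grows with the PAUSE latency of the processor.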