emailx45 · 30 Jul 2020

Research on the Impact of Intel PAUSE Instruction on Applications and Recommendations
Rui Guo (Meituan), Hongtao Zhu (Meituan), Yajing Liu - 30/Jul/2020

[SHOWTOGROUPS=4,20,22]

Introduces the principle, purpose, and effect of the PAUSE instruction in Intel processors
This article studies how PAUSE latency in different Intel processor microarchitectures influences application performance, and offers optimization recommendations. Articles of this type are intended to provide information about products and services that we believe are useful and valuable to developers.
Introduction
This article introduces the principle, purpose, and effect of the PAUSE instruction in Intel processors. It traces how PAUSE latency changed across three processor microarchitectures (Broadwell, Skylake, and Cascade Lake) and how that latency affects the performance of specific applications at the system kernel level. It then takes the MySQL database as an example to show how application performance differs with different PAUSE cycle counts, and describes the optimization process.
Target Audience
Software developers, platform architects, data scientists, and scholars who seek to maximize the performance advantages of Intel processors.
Note
This article only illustrates that the PAUSE instruction may affect the performance of specific applications. It does not imply that all applications can be optimized in the same way.
Background
Improving the performance of an application system requires not only updated hardware but also optimized software and instruction sets. During application optimization, we found that a change in PAUSE latency can bring about an unpredictable performance penalty. We therefore took a deeper look at the PAUSE instruction to explore how software can adapt to it through continued optimization, in an effort to remove this obstacle to application performance improvement.
Instruction Set Architecture for Processor
In an x86 CPU, instructions such as PAUSE tell the processor what it needs to do. The instruction set, encompassing instruction formats, addressing modes, and data types, is critical to the performance and functionality of the processor. Therefore, before analyzing the PAUSE instruction, it helps to review the evolution of recent generations of Intel® Xeon® processors and their microarchitectures, specifically their instruction sets.
Broadwell (BDX): Broadwell is the 14 nm process shrink of the Haswell microarchitecture in the Tick-Tock model. It incorporates several enhancements and supports the AVX2 instruction set. With the help of ADOX, ADCX, and MULX, Broadwell improves the performance of high-precision integer operations, and it introduces a number of new instructions such as RDSEED and PREFETCHW.
Skylake (SKX): Compared to the Broadwell microarchitecture, Skylake features higher IPC and better power efficiency, as well as an enhanced ring bus and L3 cache. Skylake also modified the PAUSE instruction, adopted instruction set extensions such as Memory Protection Extensions (MPX), and introduced the AVX-512 instruction set.
Cascade Lake (CLX): The successor to Skylake, the higher-performance Cascade Lake added support for AVX512_VNNI, which is designed to accelerate deep learning and AI workloads by boosting INT8 computing performance. It also modified the PAUSE instruction again.
Research on PAUSE Instruction
The PAUSE instruction was first introduced with the Intel Pentium 4 processor to improve the performance of “spin-wait loops”. It is typically used by software threads executing on two logical processors located in the same processor core, where one thread is waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows.
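The choice between a short spin and yielding to the operating system can be sketched as follows. This is a minimal illustration in Python, not the article's code: `cpu_relax` is a hypothetical placeholder for PAUSE (Python has no direct equivalent), `max_spins` is an assumed tuning knob, and `threading.Event.wait` stands in for an OS synchronization call such as WaitForSingleObject.

```python
import threading


def cpu_relax():
    """Hypothetical stand-in for the PAUSE instruction (a no-op here)."""
    pass


def wait_spin_then_block(event, max_spins=200):
    """Spin briefly on the event, then fall back to a blocking OS wait.

    Returns "spun" if the event was observed during the short spin phase,
    or "blocked" if the caller had to yield to the operating system.
    """
    for _ in range(max_spins):
        if event.is_set():
            return "spun"
        cpu_relax()  # on x86 this is where PAUSE would be issued
    event.wait()     # long wait expected: let the OS schedule other work
    return "blocked"
```

The spin bound is expressed as an iteration count here only for brevity; as the rest of the article shows, a count of PAUSE iterations does not correspond to a fixed amount of time across microarchitectures.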
An Intel® processor suffers a severe performance penalty when exiting a spin-wait loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop, and the processor uses this hint to avoid the memory order violation in most situations. The PAUSE instruction can therefore improve the performance of processors that support Intel Hyper-Threading Technology when they execute spin-wait loops. With the PAUSE instruction, processors are able to avoid the memory order violation and the resulting pipeline flush, and to reduce power consumption by stalling the pipeline.
The PAUSE instruction is intended to:

Temporarily provide the sibling logical processor (ready to make forward progress exiting the spin loop) with competitively shared hardware resources. The competitively-shared microarchitectural resources that the sibling logical processor can utilize in the Skylake microarchitecture include more front end slots in the ICache, LSD and IDQ, and more execution slots in the RS.
Save power consumed by the processor core, compared to executing an equivalent spin loop instruction sequence, in configurations where one logical processor is inactive, both logical processors in the same core execute the PAUSE instruction, or HT is disabled.

Intel processors implement the PAUSE instruction as a finite, pre-defined delay. The instruction does not change the architectural state of the processor. That is, it essentially performs a delaying no-op operation.
The latency of the PAUSE instruction varies with the processor microarchitecture (Table 1):

CPU	Cycles
Intel Xeon Broadwell	10
Intel Xeon Skylake	140
Intel Xeon Cascade Lake	44

Table 1: PAUSE instruction latency (in cycles) on Xeon processors
The latency of PAUSE instruction in Broadwell microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles. The increased latency is expected to allow more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress.
The increased latency of Skylake PAUSE instruction has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked on executing a fixed number of looped PAUSE instructions. There’s also a small power benefit in 2-core and 4-core systems. Less threaded applications that are sensitive to PAUSE latency will suffer some performance loss.
To lower the impact of the PAUSE instruction delay, the 2nd Generation Intel® Xeon® Scalable processor, based on Cascade Lake, reduces the latency and so mitigates the performance penalty suffered by latency-sensitive applications.
Possible Impact of PAUSE Instruction on Linux Kernel
For programs designed and optimized for the Broadwell microarchitecture, the increased PAUSE latency on Skylake-based Intel® Xeon® Scalable processors can lead to a performance penalty. This is because the spin-wait loops of these programs are implemented with a fixed number of looped PAUSE instructions. The increase in PAUSE instruction cycles extends the duration of each spin-wait loop, which can hamper the overall throughput of the system.
Spinning with a fixed count of PAUSE instructions as a time-delay technique to block program execution can produce unpredictable latency, especially in spinlocks.
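A rough calculation shows why a fixed PAUSE count behaves so differently across microarchitectures. The sketch below uses only the cycle counts from Table 1; the function names, the example count of 100, and the 2.0 GHz clock are illustrative assumptions, and loop overhead other than PAUSE itself is ignored.

```python
# Approximate PAUSE latency in cycles, taken from Table 1.
PAUSE_CYCLES = {"Broadwell": 10, "Skylake": 140, "Cascade Lake": 44}


def spin_cycles(pause_count, microarch):
    """Cycles spent in a loop that issues `pause_count` PAUSE instructions."""
    return pause_count * PAUSE_CYCLES[microarch]


def spin_nanoseconds(pause_count, microarch, ghz):
    """Same delay expressed in nanoseconds at an assumed core clock in GHz."""
    return spin_cycles(pause_count, microarch) / ghz


def pauses_for_budget(cycle_budget, microarch):
    """PAUSE count that stays within a cycle budget (at least one PAUSE)."""
    return max(1, cycle_budget // PAUSE_CYCLES[microarch])


if __name__ == "__main__":
    for name in PAUSE_CYCLES:
        print(name, spin_cycles(100, name), "cycles,",
              spin_nanoseconds(100, name, 2.0), "ns at an assumed 2.0 GHz")
```

A loop tuned as "100 PAUSEs" on Broadwell waits about 1,000 cycles, while the same loop on Skylake waits about 14,000. Deriving the count from a cycle budget, as `pauses_for_budget` does, keeps the delay roughly constant instead.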
A spinlock (Figure 1) is a mechanism to control access to shared resources. As described above, the processor may suffer a severe performance penalty when exiting a spin-wait loop, and the PAUSE instruction hints to the processor that the code sequence is such a loop. Because the spin-wait loops of these programs use a fixed number of looped PAUSE instructions, the increased PAUSE latency prolongs each loop. Another key finding of our research on spinlocks is that, with an ordinary spinlock on a multi-core system, only one CPU thread can acquire the lock at a time while all the others spin on the same shared variable. Each change to that variable makes the cache coherence protocol synchronize and invalidate the corresponding state and data on all spinning CPU threads to ensure data correctness, which may cause performance loss.

[/SHOWTOGROUPS]

emailx45 · 30 Jul 2020

[SHOWTOGROUPS=4,20,22]

Figure 1: Spinlock
The MCS lock (Figure 2) can reduce spinlock overhead and achieve better performance. It is a high-performance, fair spinlock based on a singly linked list, in which each waiting CPU thread spins only on a local variable and its direct predecessor is responsible for signaling the end of the spin. This greatly reduces unnecessary cache synchronization between processors, as well as bus and memory overhead.

An MCS lock is more complicated than a regular spinlock, but it removes much of the cache-line bouncing in the contended case. It is also entirely fair, passing the lock to each CPU in the order in which the CPUs arrived.

Figure 2: MCS lock
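The queueing idea behind the MCS lock can be sketched in a few lines. This is an illustrative Python model, not the Linux kernel implementation: the class and method names are made up for the example, a small `threading.Lock` stands in for the atomic exchange and compare-and-swap instructions a real implementation would use on the tail pointer, and `time.sleep(0)` stands in for PAUSE.

```python
import threading
import time


class _Node:
    """Per-waiter queue node; each waiter spins only on its own `locked` flag."""
    __slots__ = ("locked", "next")

    def __init__(self):
        self.locked = True
        self.next = None


class MCSLock:
    def __init__(self):
        self._tail = None
        # Stand-in for hardware atomics (xchg / cmpxchg) on the tail pointer.
        self._atomic = threading.Lock()

    def _swap_tail(self, new):
        with self._atomic:
            old, self._tail = self._tail, new
            return old

    def _cas_tail(self, expected, new):
        with self._atomic:
            if self._tail is expected:
                self._tail = new
                return True
            return False

    def acquire(self):
        node = _Node()
        pred = self._swap_tail(node)   # join the end of the queue
        if pred is not None:
            pred.next = node           # link behind the predecessor
            while node.locked:         # spin on the local flag only
                time.sleep(0)          # stand-in for PAUSE
        return node

    def release(self, node):
        if node.next is None:
            # No visible successor: try to mark the queue as empty.
            if self._cas_tail(node, None):
                return
            # A successor swapped the tail but has not linked itself yet.
            while node.next is None:
                time.sleep(0)
        node.next.locked = False       # hand the lock to the successor
```

A caller keeps the node returned by `acquire()` and passes it to `release()`. Because each waiter polls only its own node, releasing the lock touches one successor's flag rather than a variable that every waiter is reading.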
In addition to spinlocks, the PAUSE instruction also has an impact on memory allocation. During memory allocation, threads can spend a great deal of time waiting on locks, which greatly affects performance.

JeMalloc is a memory allocator that offers higher performance in multi-threaded situations and lower fragmentation than other allocators. To prevent threads from competing for locks, JeMalloc uses thread-local variables and performs memory allocation in a thread-local memory manager without competing with other threads. In addition, each thread in JeMalloc is mapped by its thread ID to an array element, which reduces the risk of multiple threads competing for the same element. As the rate of lock contention falls, the influence of increased PAUSE latency on memory management is brought under control correspondingly. Therefore, replacing the original memory allocator in an application with JeMalloc can help to alleviate the reduced throughput caused by increased PAUSE latency.
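The thread-ID-to-array-element mapping described above can be illustrated with a toy model. This is not JeMalloc's implementation or API: `ArenaPool`, its methods, and the modulo mapping are assumptions made for the sketch, which only shows how spreading threads over several independently locked elements reduces contention on any single lock.

```python
import threading


class ArenaPool:
    """Toy model: each thread maps to one independently locked arena."""

    def __init__(self, n_arenas=4):
        self.arenas = [{"lock": threading.Lock(), "allocs": 0}
                       for _ in range(n_arenas)]

    def arena_index(self, thread_id=None):
        # Assumed mapping for illustration: thread ID modulo arena count.
        tid = threading.get_ident() if thread_id is None else thread_id
        return tid % len(self.arenas)

    def allocate(self, thread_id=None):
        """Record one allocation in the caller's arena; return its index."""
        index = self.arena_index(thread_id)
        arena = self.arenas[index]
        with arena["lock"]:  # contended only by threads sharing this arena
            arena["allocs"] += 1
        return index
```

With a single global lock, every allocating thread would wait in the same spin loop. With per-arena locks, two threads contend only when they map to the same element.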

In summary, one way to reduce the negative impact of PAUSE latency on performance is to optimize the software. Examples include decreasing the number of PAUSE iterations in application spin loops to control the total delay, adopting MCS locks to reduce bus and memory overhead, and replacing the libc memory allocator with JeMalloc to reduce the use of locks. All of these methods are potential solutions to the problem.

Another, more straightforward solution is to replace the original processors with processors based on the 2nd Generation Intel® Xeon® Scalable Processor (Cascade Lake) architecture, which addresses the root cause of the problem by directly reducing the PAUSE latency.

The Impact of PAUSE Instruction on MySQL and the Coping Mechanism
As the most important online storage service for Meituan Dianping, MySQL processes trillions of queries each day. As the business continues to grow, MySQL needs to keep improving in areas such as performance and availability. In theory, moving to a new generation of processors should improve system performance. In practice, however, we found that under high-load conditions, upgrading processors does not necessarily increase throughput and may even decrease it (Figure 3).

Figure 3: Throughput on different Xeon Processors
By using VTune to capture profiling data, we found that when the write load is heavy, MySQL ut_delay and kernel spin_lock account for a significant proportion of the system load (Figure 4).

Figure 4: Hotspot analysis with VTune
We worked with Intel technical experts to analyze the cause and found that both MySQL ut_delay and kernel spin_lock call the PAUSE instruction. As a result, they are inevitably affected by the latency of the PAUSE instruction.

To avoid the cache invalidation caused by locks on multi-core processors, MySQL InnoDB adopts a random delay mechanism in its spin-wait loop. The duration of the delay is determined by the number of PAUSE instructions in the loop and by their latency. This means that if the number of PAUSE instructions stays unchanged, an increase in PAUSE instruction cycles extends the duration of the spin-wait loop, which hampers the load capacity and throughput of the system.

According to the “Configuring Spin Lock Polling” section of the MySQL Reference Manual, MySQL 5.7 uses “innodb_spin_wait_delay * 50” to specify the number of calls to the PAUSE instruction, which is better suited to the Broadwell architecture. On the Skylake architecture, if the “innodb_spin_wait_delay” parameter is lowered to reduce the number of calls to the PAUSE instruction, the influence of cache invalidation may be amplified. To solve this problem, MySQL 8.0.16 introduced the innodb_spin_wait_pause_multiplier variable. Specifying the number of calls to the PAUSE instruction as “innodb_spin_wait_delay * innodb_spin_wait_pause_multiplier” can both reduce the number of calls and alleviate the impact of cache invalidation.
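The effect of the multiplier can be worked through with the numbers from Table 1. The sketch below is arithmetic only and is not MySQL code: the function names are illustrative, the delay value of 6 is just an example input, the random choice of the delay is left out (the delay is passed in directly), and the scaling rule that keeps the cycle count close to the Broadwell value is an assumption rather than a recommendation from the manual.

```python
# Approximate PAUSE latency in cycles, taken from Table 1.
PAUSE_CYCLES = {"Broadwell": 10, "Skylake": 140, "Cascade Lake": 44}


def pause_calls(spin_wait_delay, pause_multiplier=50):
    """PAUSE calls for one delay: delay * multiplier (50 matches MySQL 5.7)."""
    return spin_wait_delay * pause_multiplier


def delay_cycles(spin_wait_delay, pause_multiplier, microarch):
    """Approximate cycles spent in PAUSE for one delay, ignoring loop overhead."""
    return pause_calls(spin_wait_delay, pause_multiplier) * PAUSE_CYCLES[microarch]


def scaled_multiplier(base_multiplier, base_arch, target_arch):
    """Multiplier that keeps the delay close to the one tuned for base_arch."""
    scaled = base_multiplier * PAUSE_CYCLES[base_arch] / PAUSE_CYCLES[target_arch]
    return max(1, round(scaled))


if __name__ == "__main__":
    m = scaled_multiplier(50, "Broadwell", "Skylake")
    print("Broadwell, multiplier 50:", delay_cycles(6, 50, "Broadwell"), "cycles")
    print("Skylake, multiplier 50:", delay_cycles(6, 50, "Skylake"), "cycles")
    print("Skylake, multiplier", m, ":", delay_cycles(6, m, "Skylake"), "cycles")
```

Under these assumptions, a delay of 6 costs about 3,000 cycles on Broadwell and about 42,000 on Skylake with the fixed factor of 50. Lowering only the multiplier brings the Skylake figure back to the same order of magnitude while leaving the range of innodb_spin_wait_delay values untouched.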

To deal with the impact of the PAUSE instruction on MySQL, we backported the innodb_spin_wait_pause_multiplier patch to our production version, replaced the kernel spin_lock with an MCS lock, and upgraded the processor microarchitecture to Cascade Lake. The validation results show that the influence of the changes in the PAUSE instruction is eliminated and that MySQL performance is up 15% (Figure 5).

Figure 5: MySQL Throughput
Summary
In summary, the PAUSE instruction has a notable influence on the performance of some applications. The increased latency of the PAUSE instruction may affect software such as the kernel and MySQL, which in turn reduces business performance. By working with Intel technical experts, we adopted optimization measures such as dynamically adjusting the number of calls to the PAUSE instruction and upgrading the processor microarchitecture to Cascade Lake, and thereby eliminated the impact of PAUSE latency on MySQL and improved overall performance.

[/SHOWTOGROUPS]
