OpenZFS: Reducing ARC Lock Contention


tl;dr; Cached random read performance of 8K blocks was
improved by 225% by reducing internal lock contention within the OpenZFS
ARC on illumos.

1. Introduction

Locks are a pain. Even worse is a single lock serializing all of the page cache accesses for a filesystem. While that’s not quite the situation the OpenZFS ARC was in earlier this year, it was pretty close.

For those unfamiliar, the OpenZFS file system is built atop its very
own page cache, based on the paper by Nimrod Megiddo and Dharmendra S.
Modha, “ARC: A Self-Tuning, Low Overhead Replacement Cache”. While the
paper provides a very good basis from which actual code can be written
to implement its concepts, there are a number of real-world problems it
fails to address. For this article, we’ll be looking at one of these
problems: concurrency.

3. Performance: Before

As you might have guessed, the big drawback of this approach is performance. While it is safe, it causes terrible lock contention between threads on systems with many CPUs. To highlight this performance problem, I used FIO to drive a cached random-read workload with 32 threads. Each thread read 8K blocks at random 8K-aligned offsets within its own file [Appendix 1]. Aggregate bandwidth was measured while varying the number of CPUs assigned to the virtual machine; the results are below:

+-----------+--------------+                                               
| # of CPUS | Bandwidth    |                                                    
+-----------+--------------+                                                    
|         1 | 840893 KB/s  |                                                    
|         2 | 1559.5 MB/s  |                                                    
|         4 | 3022.1 MB/s  |                                                    
|         8 | 5104.9 MB/s  |                                                    
|        16 | 3854.3 MB/s  |                                                    
|        32 | 3036.2 MB/s  |                                                    
+-----------+--------------+

 [Appendix 2] provides more detailed performance metrics.

As you can see, performance varies dramatically as CPUs are added to the system. Worse, beyond a certain point adding more CPUs actually degrades performance rather than improving it. That is clearly not a good situation, especially as systems with many CPUs become more and more common.

Looking at a snippet of lockstat profiling data, it’s clear which
locks are to blame (the following was taken from a run on a system with
32 CPUs):

Count indv cuml rcnt     nsec Lock                   Caller                  
-------------------------------------------------------------------------------
4191777  47%  47% 0.00    67709 ARC_mfu+0x58           remove_reference+0x63   
4103160  46%  92% 0.00    75366 ARC_mfu+0x58           add_reference+0x8d      
   1877   0%  92% 0.00   436330 buf_hash_table+0x750   arc_buf_add_ref+0x6a

The two locks causing the bottleneck are used to protect the ARC’s
internal lists. These lists contain all ARC buffers residing in the
cache that aren’t actively used by consumers of the ARC (i.e. buffers
with a reference count of zero). Every time a buffer is added to or
removed from one of these lists, the list’s lock must be held. This
means a list’s lock must be acquired for all of the following
operations: a read from disk, an ARC cache hit, when evicting a buffer,
when dirtying a buffer, when all references to a buffer are released by
consumers, etc.
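
To make that serialization concrete, here is a minimal userspace sketch
of the pattern described above. The names (arc_buf_t, arc_state_t,
add_reference, remove_reference) mirror the ARC concepts, but this is an
illustrative simplification, not the actual illumos code:

/*
 * Simplified sketch of the "one list, one lock" scheme. A single mutex
 * guards the state's time-ordered list, so every reference-count
 * transition -- including the cache-hit path -- funnels through it.
 */
#include <pthread.h>
#include <stddef.h>

typedef struct arc_buf {
    struct arc_buf *prev, *next;   /* linkage on the state's list      */
    int             refcount;      /* consumers currently holding buf  */
} arc_buf_t;

typedef struct arc_state {
    pthread_mutex_t list_lock;     /* the single, contended lock       */
    arc_buf_t      *head, *tail;   /* head = most recent, tail = least */
} arc_state_t;

/*
 * Called on every cache hit: the buffer gains a consumer, so it is
 * unlinked from the eviction list -- under the one global list lock.
 */
void
add_reference(arc_state_t *state, arc_buf_t *buf)
{
    pthread_mutex_lock(&state->list_lock);
    if (buf->refcount++ == 0) {
        if (buf->prev != NULL)
            buf->prev->next = buf->next;
        else
            state->head = buf->next;
        if (buf->next != NULL)
            buf->next->prev = buf->prev;
        else
            state->tail = buf->prev;
    }
    pthread_mutex_unlock(&state->list_lock);
}

/*
 * Called when a consumer drops its hold: once the refcount hits zero
 * the buffer becomes evictable and is linked back in at the recently
 * used end -- taking the very same lock again.
 */
void
remove_reference(arc_state_t *state, arc_buf_t *buf)
{
    pthread_mutex_lock(&state->list_lock);
    if (--buf->refcount == 0) {
        buf->prev = NULL;
        buf->next = state->head;
        if (state->head != NULL)
            state->head->prev = buf;
        else
            state->tail = buf;
        state->head = buf;
    }
    pthread_mutex_unlock(&state->list_lock);
}

With 32 threads hammering the cache-hit path, every call above contends
on the same list_lock, which is exactly what the lockstat output shows.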


4. Remove the Lists?

Before I dive into the solution that was taken, I should make it
clear why the lists are needed in the first place. These lists allow for
iterating over the buffers from the least recently accessed buffer to
the most recently accessed buffer, or vice versa. The ability to do this
iteration is critical to implementing the ARC as described by the
paper. It allows for the least recently accessed buffer to be selected
in constant-time during the eviction process [Appendix 3].
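
For contrast with what comes next, here is a hypothetical eviction
helper built on the sketch above: with a single strictly time-ordered
list, the candidate is always the tail, so selecting it is constant-time.

/*
 * Reusing arc_state_t / arc_buf_t from the sketch above. With one
 * strictly time-ordered list, the least recently accessed evictable
 * buffer is always the tail, so finding and unlinking it is O(1)
 * (still under the same single lock, of course).
 */
arc_buf_t *
arc_evict_one(arc_state_t *state)
{
    arc_buf_t *victim;

    pthread_mutex_lock(&state->list_lock);
    victim = state->tail;
    if (victim != NULL) {
        state->tail = victim->prev;
        if (victim->prev != NULL)
            victim->prev->next = NULL;
        else
            state->head = NULL;
    }
    pthread_mutex_unlock(&state->list_lock);

    return (victim);
}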

5. Solution: Use More Locks!

To alleviate the issue, we really needed a way to perform this
iteration without causing so much locking overhead to the performance
critical code paths (e.g., during an ARC cache hit). The solution we
decided on is to split each of the ARC’s internal lists into a variable
number of sublists, each sublist having its own lock and maintaining a
time ordering within itself (i.e. the buffers in each sublist are
ordered based on access time, but there’s no ordering of buffers between
the different sublists). Thus, a new data structure was created to
encapsulate these semantics, a “multilist” [Appendix 4].
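
As a rough illustration of the multilist idea, here is a sketch that
reuses arc_buf_t from the earlier example. The fixed sublist count and
the address-based index function are assumptions made for this example;
the real implementation chooses both differently.

#include <pthread.h>
#include <stdint.h>

#define NUM_SUBLISTS 64   /* illustrative fixed count for this sketch */

typedef struct sublist {
    pthread_mutex_t lock;          /* per-sublist lock                 */
    arc_buf_t      *head, *tail;   /* time-ordered within this sublist */
} sublist_t;

typedef struct multilist {
    sublist_t sublists[NUM_SUBLISTS];
} multilist_t;

/*
 * Pick a sublist for a buffer. Hashing on the buffer's address is an
 * assumption made for this sketch; it keeps a given buffer on one
 * sublist and spreads buffers roughly evenly across sublists.
 */
static unsigned
multilist_index(const arc_buf_t *buf)
{
    return ((uintptr_t)buf / sizeof (arc_buf_t)) % NUM_SUBLISTS;
}

/*
 * Insert at the recently used end of the chosen sublist. Only that
 * sublist's lock is taken, so unrelated insertions no longer serialize.
 */
void
multilist_insert(multilist_t *ml, arc_buf_t *buf)
{
    sublist_t *sl = &ml->sublists[multilist_index(buf)];

    pthread_mutex_lock(&sl->lock);
    buf->prev = NULL;
    buf->next = sl->head;
    if (sl->head != NULL)
        sl->head->prev = buf;
    else
        sl->tail = buf;
    sl->head = buf;
    pthread_mutex_unlock(&sl->lock);
}

The key property is that two threads working on buffers that hash to
different sublists never touch the same lock.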


On the surface this seems like a pretty trivial change, but there’s a
key aspect that makes it more complex to implement than one might
expect. Because the sublists don’t maintain a strict ordering between
each other, selecting the least (or most) recently accessed buffer is no
longer a constant-time operation. In general, it becomes a linear-time
operation, as each sublist must be iterated and the elements of the
sublists compared with each other.

A linear-time lookup of the least recently accessed buffer is
insufficient, so a clever domain-specific workaround was used to allow
for this lookup to maintain its constant-time complexity. Rather than
selecting the absolute least recently accessed buffer during eviction, a
buffer that is “old enough” is selected instead; this selection
happening in constant-time.


We do this by evenly inserting buffers into each sublist, and then
during eviction, a randomly selected sublist is chosen. This sublist’s
least recently accessed buffer is selected and evicted. This process
will select a buffer that’s “old enough” based on the fact that buffers
are evenly inserted into each of the sublists, so each sublist’s least
recently accessed buffer will have an access time proportional to all
other sublists’ least recently accessed buffer [Appendix 5].
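
Continuing the same sketch, eviction then locks only one sublist: pick a
sublist at random and take its tail. The use of rand() and the details
of the selection are illustrative only; the real eviction code is
considerably more involved.

#include <stdlib.h>   /* rand(), illustrative only */

/*
 * Evict one buffer: choose a sublist at random and take its tail.
 * Because buffers are spread evenly across sublists, that tail is
 * "old enough" even though it may not be the globally oldest buffer.
 */
arc_buf_t *
multilist_evict_one(multilist_t *ml)
{
    sublist_t *sl = &ml->sublists[rand() % NUM_SUBLISTS];
    arc_buf_t *victim;

    pthread_mutex_lock(&sl->lock);
    victim = sl->tail;                /* oldest buffer in this sublist */
    if (victim != NULL) {
        sl->tail = victim->prev;
        if (victim->prev != NULL)
            victim->prev->next = NULL;
        else
            sl->head = NULL;
    }
    pthread_mutex_unlock(&sl->lock);

    return (victim);                  /* NULL: this sublist was empty  */
}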

6. Performance: After

As with any performance related change, empirical results are
necessary to give the change any sort of credibility. So, the workload
that was previously used to exemplify the problem [Appendix 1],
was used again to ensure the expected performance gains were achieved.
The following table contains aggregate bandwidth numbers with the fix in
place (“Bandwidth After” column), as well as the aggregate bandwidth
numbers without the fix (“Bandwidth Before” column, the same results
previously shown):

+-----------+------------------+-------------------+                       
| # of CPUS | Bandwidth After  | Bandwidth Before  |                            
+-----------+------------------+-------------------+                            
|         1 |     833163 KB/s  |      840893 KB/s  |                            
|         2 |     1520.2 MB/s  |      1559.5 MB/s  |                            
|         4 |     2985.3 MB/s  |      3022.1 MB/s  |                            
|         8 |     5913.2 MB/s  |      5104.9 MB/s  |                            
|        16 |     8938.9 MB/s  |      3854.3 MB/s  |                            
|        32 |     9896.6 MB/s  |      3036.2 MB/s  |                            
+-----------+------------------+-------------------+


As expected, performance scales much better as more CPUs are added to
the system, and the overall performance saw a 225% improvement when
using 32 CPUs (the largest CPU count I had available to test). Also, the
performance difference when using a small number of CPUs was
negligible.

Lockstat was used again, and contention is clearly much less than
before (as before, this snippet was taken from a run on a system with 32
CPUs):

Count indv cuml rcnt     nsec Lock                   Caller                  
-------------------------------------------------------------------------------
142396   3%   3% 0.00     2785 0xffffff09168112b0     rrw_exit+0x23           
136499   3%   6% 0.00     3012 0xffffff09168112b0     rrw_enter_read+0x27     
 82009   2%   8% 0.00     3359 0xffffff09168110d0     rrw_enter_read+0x27

Success!

7. Commit Details

For those interested, this fix was committed to illumos on January 12th, 2015 via this commit:

commit 244781f10dcd82684fd8163c016540667842f203                            
Author:     Prakash Surya                            
AuthorDate: Mon Jan 12 19:52:19 2015 -0800                                      
Commit:     Christopher Siden                                
CommitDate: Mon Jan 12 19:52:19 2015 -0800                                      
                                                                                
    5497 lock contention on arcs_mtx                                            
    Reviewed by: George Wilson                       
    Reviewed by: Matthew Ahrens                            
    Reviewed by: Richard Elling               
    Approved by: Dan McDonald

It’s also in the process of being ported and merged into FreeBSD and
OpenZFS on Linux (at least, it was at the time of this writing).


8. Appendix

  1. The FIO script used to run the performance benchmarks is below. Keep
    in mind, this script was run a number of times prior to gathering
    the performance data, as I was specifically targeting a fully cached
    workload and needed to ensure the files were cached in the ARC.

    [global]                                                                   
    group_reporting=1                                                               
    fallocate=none                                                                  
    ioengine=psync                                                                  
    numjobs=32                                                                      
    iodepth=1                                                                       
    bs=8k                                                                           
    filesize=512m                                                                   
    size=512m                                                                       
    randrepeat=0                                                                    
    use_os_rand=1                                                                   
                                                                                    
    [32threads]                                                                     
    rw=randread                                                                     
    time_based                                                                      
    runtime=1m                                                                      
    directory=/tank/fish
  2. In addition to the run used to gather the aggregate
    bandwidth numbers, the FIO script was run a couple more times to gather
    more useful debugging and performance metrics. Specifically, it was run
    with lockstat profiling enabled, which was used to pinpoint the lock(s)
    causing the poor scaling, and again with flame graph profiling enabled
    to pinpoint the specific functions of interest (and to provide
    corroborating evidence backing up the lockstat data). Unfortunately, it
    looks like I can’t upload SVG or TAR files; so if anybody is interested
    in the full lockstat output or the flamegraphs, drop me an email and we
    can work something out (probably just email the TAR archive directly as
    it’s only 227K compressed).

  3. The lists can also be iterated starting from the most recently
    accessed buffer to the least recently accessed buffer. This iteration
    direction is used when populating the L2ARC device prior to the cache
    being “warmed up”.

  4. This concept is not original to me, as FreeBSD has
    included a similar fix to the ARC in their platform for some time.
    FreeBSD’s usage of this concept actually gave me confidence that it
    would work out in practice, prior to actually having a working solution.

  5. I realize “old enough” is vague, but essentially
    what we’re doing here is using a heuristic to determine what buffer
    would be the best candidate to be evicted from the cache. So, we’re
    modifying the previous heuristic from always selecting the absolutely
    oldest buffer, to being a randomly selected buffer out of a group of
    “old” buffers. The group of old buffers being composed of the tails of
    all the sublists.


