The CRUSH Algorithm in Ceph (CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data)

By | July 12, 2018


Emerging large-scale distributed storage systems are faced with the task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must evenly distribute data and workload to efficiently utilize available resources and maximize system performance, while facilitating system growth and managing hardware failures. We have developed CRUSH, a scalable pseudo-random data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains.


Emerging large-scale distributed storage systems face the daunting task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must distribute data and workload evenly in order to use available resources efficiently and maximize system performance, while also accommodating system growth and managing hardware failures. We have developed CRUSH, a pseudo-random data distribution function designed for distributed object-based storage systems, which maps data objects to storage devices without relying on any central directory. Because large systems are inherently dynamic, CRUSH is designed to make adding and removing storage easy while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data according to user-defined policies that enforce the separation of replicas across failure domains.

1 Introduction

Object-based storage is an emerging architecture that promises improved manageability, scalability, and performance [Azagury et al. 2003]. Unlike conventional block-based hard drives, object-based storage devices (OSDs) manage disk block allocation internally, exposing an interface that allows others to read and write to variably-sized, named objects. In such a system, each file’s data is typically striped across a relatively small number of named objects distributed throughout the storage cluster. Objects are replicated across multiple devices (or employ some other data redundancy scheme) in order to protect against data loss in the presence of failures. Object-based storage systems simplify data layout by replacing large block lists with small object lists and distributing the low-level block allocation problem. Although this vastly improves scalability by reducing file allocation metadata and complexity, the fundamental task of distributing data among thousands of storage devices—typically with varying capacities and performance characteristics—remains.

1 Introduction


Most systems simply write new data to underutilized devices. The fundamental problem with this approach is that data is rarely, if ever, moved once it is written. Even a perfect distribution will become imbalanced when the storage system is expanded, because new disks either sit empty or contain only new data. Either old or new disks may be busy, depending on the system workload, but only the rarest of conditions will utilize both equally to take full advantage of available resources.

A robust solution is to distribute all data in a system randomly among available storage devices. This leads to a probabilistically balanced distribution and uniformly mixes old and new data together. When new storage is added, a random sample of existing data is migrated onto new storage devices to restore balance. This approach has the critical advantage that, on average, all devices will be similarly loaded, allowing the system to perform well under any potential workload [Santos et al. 2000]. Furthermore, in a large storage system, a single large file will be randomly distributed across a large set of available devices, providing a high level of parallelism and aggregate bandwidth. However, simple hash-based distribution fails to cope with changes in the number of devices, incurring a massive reshuffling of data. Further, existing randomized distribution schemes that decluster replication by spreading each disk’s replicas across many other devices suffer from a high probability of data loss from coincident device failures.
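A quick way to see the reshuffling problem is to place objects with a plain hash modulo the device count and then grow the cluster by one device. The sketch below (Python, with made-up object ids and device counts) counts how many placements change:

```python
import hashlib

def placement(obj_id: int, num_devices: int) -> int:
    """Naive placement: hash the object id and take it modulo the device count."""
    digest = hashlib.sha256(str(obj_id).encode()).hexdigest()
    return int(digest, 16) % num_devices

# Count how many of 10,000 objects move when the cluster grows from 10 to 11 devices.
objects = range(10_000)
moved = sum(1 for x in objects if placement(x, 10) != placement(x, 11))
print(f"{moved / 10_000:.0%} of objects would be reshuffled")  # roughly 90%
```

Almost everything moves, even though only about 1/11 of the data needs to land on the new device to restore balance; this is the gap that CRUSH and consistent-hashing style schemes aim to close.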



We have developed CRUSH (Controlled Replication Under Scalable Hashing), a pseudo-random data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster. CRUSH is implemented as a pseudo-random, deterministic function that maps an input value, typically an object or object group identifier, to a list of devices on which to store object replicas. This differs from conventional approaches in that data placement does not rely on any sort of per-file or per-object directory—CRUSH needs only a compact, hierarchical description of the devices comprising the storage cluster and knowledge of the replica placement policy. This approach has two key advantages: first, it is completely distributed such that any party in a large system can independently calculate the location of any object; and second, what little metadata is required is mostly static, changing only when devices are added or removed.

CRUSH is designed to optimally distribute data to utilize available resources, efficiently reorganize data when storage devices are added or removed, and enforce flexible constraints on object replica placement that maximize data safety in the presence of coincident or correlated hardware failures. A wide variety of data safety mechanisms are supported, including n-way replication (mirroring), RAID parity schemes or other forms of erasure coding, and hybrid approaches (e.g., RAID-10). These features make CRUSH ideally suited for managing object distribution in extremely large (multi-petabyte) storage systems where scalability, performance, and reliability are critically important.

We have developed CRUSH (Controlled Replication Under Scalable Hashing), a pseudo-random data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster. CRUSH is implemented as a pseudo-random, deterministic function that maps an input value (typically an object or object group identifier) to a list of devices on which to store the object's replicas. It differs from conventional approaches in that data placement does not rely on any per-file or per-object directory: CRUSH needs only a compact, hierarchical description of the devices comprising the storage cluster, plus knowledge of the replica placement policy. This approach has two key advantages: first, it is completely distributed, so any party in a large system can independently calculate the location of any object; second, what little metadata is required is mostly static, changing only when devices are added or removed.
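As a rough illustration of directory-free placement, the sketch below ranks devices by a hash of the (object id, device name) pair. This is rendezvous hashing rather than CRUSH proper, and the device names and replica count are invented; the point is only that any party holding the same cluster description computes the same locations independently, with no lookup table:

```python
import hashlib

def crush_like_map(obj_id: str, cluster_map: list[str], num_replicas: int) -> list[str]:
    """Toy stand-in for CRUSH: deterministically rank devices by a hash of
    (object id, device name) and keep the top num_replicas. Not the real
    algorithm -- it just shows placement without a per-object directory."""
    ranked = sorted(
        cluster_map,
        key=lambda dev: hashlib.sha256(f"{obj_id}:{dev}".encode()).digest(),
    )
    return ranked[:num_replicas]

devices = [f"osd.{i}" for i in range(8)]  # hypothetical cluster description
# Two parties holding the same map compute the same answer independently.
client_a = crush_like_map("object-42", devices, 3)
client_b = crush_like_map("object-42", devices, 3)
assert client_a == client_b
print(client_a)
```

Because the only shared state is the small device list, the "metadata" is exactly as static as the cluster itself, mirroring the second advantage described above.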


2 Related Work

Object-based storage has recently garnered significant interest as a mechanism for improving the scalability of storage systems. A number of research and production file systems have adopted an object-based approach, including the seminal NASD file system [Gobioff et al. 1997], the Panasas file system [Nagle et al. 2004], Lustre [Braam 2004], and others [Rodeh and Teperman 2003; Ghemawat et al. 2003]. Other block-based distributed file systems like GPFS [Schmuck and Haskin 2002] and Federated Array of Bricks (FAB) [Saito et al. 2004] face a similar data distribution challenge. In these systems a semi-random or heuristic-based approach is used to allocate new data to storage devices with available capacity, but data is rarely relocated to maintain a balanced distribution over time. More importantly, all of these systems locate data via some sort of metadata directory, while CRUSH relies instead on a compact cluster description and deterministic mapping function. This distinction is most significant when writing data, as systems utilizing CRUSH can calculate any new data’s storage target without consulting a central allocator. The Sorrento [Tang et al. 2004] storage system’s use of consistent hashing [Karger et al. 1997] most closely resembles CRUSH, but lacks support for controlled weighting of devices, a well-balanced distribution of data, and failure domains for improving data safety.

2 Related Work

Object-based storage has recently attracted significant interest as a way to improve storage system scalability. A number of research and production file systems, such as NASD, Panasas, and Lustre, have adopted an object-based approach, and other block-based distributed file systems like GPFS and FAB face a similar data distribution challenge. These systems allocate new data to available storage devices semi-randomly or heuristically, but data is rarely relocated to maintain a balanced distribution over time. More importantly, all of them locate data through some form of metadata directory, whereas CRUSH relies on a compact cluster description and a deterministic mapping function; this distinction matters most when writing data, since a system using CRUSH can compute any new data's storage target without consulting a central allocator. Sorrento's use of consistent hashing most closely resembles CRUSH, but Sorrento lacks controlled device weighting, a well-balanced data distribution, and failure domains for improving data safety.

Although the data migration problem has been studied extensively in the context of systems with explicit allocation maps [Anderson et al. 2001; Anderson et al. 2002], such approaches have heavy metadata requirements that functional approaches like CRUSH avoid. Choy, et al. [1996] describe algorithms for distributing data over disks which move an optimal number of objects as disks are added, but do not support weighting, replication, or disk removal. Brinkmann, et al. [2000] use hash functions to distribute data to a heterogeneous but static cluster. SCADDAR [Goel et al. 2002] addresses the addition and removal of storage, but only supports a constrained subset of replication strategies. None of these approaches include CRUSH’s flexibility or failure domains for improved reliability.

CRUSH most closely resembles the RUSH [Honicky and Miller 2004] family of algorithms upon which it is based. RUSH remains the only existing set of algorithms in the literature that utilizes a mapping function in place of explicit metadata and supports the efficient addition and removal of weighted devices. Despite these basic properties, a number of issues make RUSH an insufficient solution in practice. CRUSH fully generalizes the useful elements of RUSHP and RUSHT while resolving previously unaddressed reliability and replication issues, and offering improved performance and flexibility.


CRUSH most closely resembles the RUSH family of algorithms, on which it is based. RUSH remains the only set of algorithms in the literature that uses a mapping function in place of explicit metadata while efficiently supporting the addition and removal of weighted devices. Despite these basic properties, a number of issues make RUSH insufficient in practice. CRUSH fully generalizes the useful elements of RUSHP and RUSHT while resolving previously unaddressed reliability and replication issues, and it offers improved performance and flexibility.

3 The CRUSH algorithm

The CRUSH algorithm distributes data objects among storage devices according to a per-device weight value, approximating a uniform probability distribution. The distribution is controlled by a hierarchical cluster map representing the available storage resources and composed of the logical elements from which it is built. For example, one might describe a large installation in terms of rows of server cabinets, cabinets filled with disk shelves, and shelves filled with storage devices. The data distribution policy is defined in terms of placement rules that specify how many replica targets are chosen from the cluster and what restrictions are imposed on replica placement. For example, one might specify that three mirrored replicas are to be placed on devices in different physical cabinets so that they do not share the same electrical circuit.

Given a single integer input value x, CRUSH will output an ordered list R⃗ of n distinct storage targets. CRUSH utilizes a strong multi-input integer hash function whose inputs include x, making the mapping completely deterministic and independently calculable using only the cluster map, placement rules, and x. The distribution is pseudo-random in that there is no apparent correlation between the resulting output from similar inputs or in the items stored on any storage device. We say that CRUSH generates a declustered distribution of replicas in that the set of devices sharing replicas for one item also appears to be independent of all other items.

3 The CRUSH algorithm


Given a single integer input value x, CRUSH outputs an ordered list R⃗ of n distinct storage targets. CRUSH uses a strong multi-input integer hash function whose inputs include x, making the mapping completely deterministic and independently calculable using only the cluster map, the placement rules, and x. The distribution is pseudo-random: there is no apparent correlation between the outputs for similar inputs, or among the items stored on any given device. We say that CRUSH generates a declustered distribution of replicas, in that the set of devices sharing replicas of one item appears independent of that for every other item.
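One way to picture the mapping is as successive draws of a multi-input hash, skipping duplicates until n distinct targets are found. This is only a sketch of the idea of deterministic, ordered replica ranks (flat device numbering, invented inputs), not CRUSH's actual bucket-walking algorithm:

```python
import hashlib

def draw(x: int, r: int, num_devices: int) -> int:
    """Multi-input integer hash: mixes the object id x with the draw number r."""
    digest = hashlib.sha256(f"{x}:{r}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_devices

def select(x: int, n: int, num_devices: int) -> list[int]:
    """Return an ordered list of n distinct targets for input x by taking
    successive hash draws and skipping repeats."""
    targets: list[int] = []
    r = 0
    while len(targets) < n:
        candidate = draw(x, r, num_devices)
        if candidate not in targets:
            targets.append(candidate)
        r += 1
    return targets

print(select(x=1234, n=3, num_devices=10))  # same x always yields the same ordered list
assert select(1234, 3, 10) == select(1234, 3, 10)
```

Because the only inputs are x, the replica rank, and the device count, any party can recompute the list, and nearby values of x produce unrelated target sets.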

3.1 Hierarchical Cluster Map

The cluster map is composed of devices and buckets, both of which have numerical identifiers and weight values associated with them. Buckets can contain any number of devices or other buckets, allowing them to form interior nodes in a storage hierarchy in which devices are always at the leaves. Storage devices are assigned weights by the administrator to control the relative amount of data they are responsible for storing. Although a large system will likely contain devices with a variety of capacity and performance characteristics, randomized data distributions statistically correlate device utilization with workload, such that device load is on average proportional to the amount of data stored. This results in a one-dimensional placement metric, weight, which should be derived from the device’s capabilities. Bucket weights are defined as the sum of the weights of the items they contain.
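The weight rule above can be sketched directly: a device carries an administrator-assigned weight, and a bucket's weight is the recursive sum of its items' weights. The names and numbers below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    weight: float  # assigned by the administrator, e.g. proportional to capacity

@dataclass
class Bucket:
    name: str
    items: list  # Devices or nested Buckets

def weight_of(item) -> float:
    """A bucket's weight is the sum of the weights of the items it contains."""
    if isinstance(item, Device):
        return item.weight
    return sum(weight_of(child) for child in item.items)

shelf = Bucket("shelf-1", [Device("osd.0", 1.0), Device("osd.1", 1.0)])
cabinet = Bucket("cab-a", [shelf, Bucket("shelf-2", [Device("osd.2", 2.0)])])
print(weight_of(cabinet))  # 4.0
```

Devices only ever appear at the leaves; every interior node is a bucket whose weight falls out of this recursion.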

Buckets can be composed arbitrarily to construct a hierarchy representing available storage. For example, one might create a cluster map with “shelf” buckets at the lowest level to represent sets of identical devices as they are installed, and then combine shelves into “cabinet” buckets to group together shelves that are installed in the same rack. Cabinets might be further grouped into “row” or “room” buckets for a large system. Data is placed in the hierarchy by recursively selecting nested bucket items via a pseudo-random hash-like function. In contrast to conventional hashing techniques, in which any change in the number of target bins (devices) results in a massive reshuffling of bin contents, CRUSH is based on four different bucket types, each with a different selection algorithm to address data movement resulting from the addition or removal of devices and overall computational complexity.
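The recursive descent might be sketched as follows, with a hypothetical two-level hierarchy and a single unweighted hash-max selection rule standing in for CRUSH's four bucket types (each of which has its own selection algorithm in the real system):

```python
import hashlib

def choose(x: int, r: int, items: list[str]) -> str:
    """Pseudo-randomly pick one item from a bucket, deterministically in (x, r),
    by hashing every item and taking the maximum. This resembles an unweighted
    'straw' draw; it is not any of CRUSH's actual bucket algorithms."""
    return max(items, key=lambda item: hashlib.sha256(f"{x}:{r}:{item}".encode()).digest())

# Hypothetical hierarchy: cabinets contain shelves, shelves contain devices.
hierarchy = {
    "cab-a": {"shelf-1": ["osd.0", "osd.1"], "shelf-2": ["osd.2", "osd.3"]},
    "cab-b": {"shelf-3": ["osd.4", "osd.5"]},
}

def place(x: int, r: int) -> str:
    """Descend the hierarchy, choosing a nested item at each level."""
    cabinet = choose(x, r, sorted(hierarchy))
    shelf = choose(x, r, sorted(hierarchy[cabinet]))
    return choose(x, r, hierarchy[cabinet][shelf])

print(place(x=42, r=0))  # the same (x, r) always descends to the same device
assert place(42, 0) == place(42, 0)
```

Each level is a fresh deterministic draw, so the whole path from root to leaf is recomputable from the cluster map and the input alone.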



3.2 Replica Placement

CRUSH is designed to distribute data uniformly among weighted devices to maintain a statistically balanced utilization of storage and device bandwidth resources. The placement of replicas on storage devices in the hierarchy can also have a critical effect on data safety. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices in different shelves, racks, power supplies, controllers, and/or physical locations.

3.2 Replica Placement

CRUSH is designed to distribute data uniformly among weighted devices, maintaining a statistically balanced utilization of storage capacity and device bandwidth. The placement of replicas within the storage hierarchy also has a critical effect on data safety. By reflecting the underlying physical organization of the installation, CRUSH can model, and thereby address, potential sources of correlated device failures, such as physical proximity, a shared power source, or a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to guard against concurrent failures, it may be desirable to ensure that data replicas are placed on devices in different shelves, racks, power supplies, controllers, and/or physical locations.
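A minimal sketch of failure-domain separation, assuming a hypothetical device-to-cabinet layout: rank devices by a deterministic hash, then greedily keep only candidates from cabinets not yet used. Real CRUSH expresses this constraint through placement rules evaluated against the cluster map, not a greedy filter:

```python
import hashlib

# Hypothetical layout: each device lives in one cabinet (the failure domain here).
cabinet_of = {
    "osd.0": "cab-a", "osd.1": "cab-a",
    "osd.2": "cab-b", "osd.3": "cab-b",
    "osd.4": "cab-c", "osd.5": "cab-c",
}

def separated_replicas(obj_id: str, n: int) -> list[str]:
    """Pick n replica targets such that no two share a cabinet."""
    ranked = sorted(
        cabinet_of,
        key=lambda dev: hashlib.sha256(f"{obj_id}:{dev}".encode()).digest(),
    )
    chosen: list[str] = []
    used_domains: set[str] = set()
    for dev in ranked:
        if cabinet_of[dev] not in used_domains:
            chosen.append(dev)
            used_domains.add(cabinet_of[dev])
        if len(chosen) == n:
            break
    return chosen

replicas = separated_replicas("object-7", 3)
assert len({cabinet_of[d] for d in replicas}) == 3  # all in different cabinets
print(replicas)
```

Losing any single cabinet, e.g. to a shared circuit failure, now costs at most one replica of the object, which is the safety property the paragraph above describes.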


