The CRUSH Algorithm Behind Ceph (CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data)

William

Abstract

Emerging large-scale distributed storage systems are faced with the task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must evenly distribute data and workload to efficiently utilize available resources and maximize system performance, while facilitating system growth and managing hardware failures. We have developed CRUSH, a scalable pseudo-random data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains.


1 Introduction

Object-based storage is an emerging architecture that promises improved manageability, scalability, and performance [Azagury et al. 2003]. Unlike conventional block-based hard drives, object-based storage devices (OSDs) manage disk block allocation internally, exposing an interface that allows others to read and write to variably-sized, named objects. In such a system, each file’s data is typically striped across a relatively small number of named objects distributed throughout the storage cluster. Objects are replicated across multiple devices (or employ some other data redundancy scheme) in order to protect against data loss in the presence of failures. Object-based storage systems simplify data layout by replacing large block lists with small object lists and distributing the low-level block allocation problem. Although this vastly improves scalability by reducing file allocation metadata and complexity, the fundamental task of distributing data among thousands of storage devices—typically with varying capacities and performance characteristics—remains.


Most systems simply write new data to underutilized devices. The fundamental problem with this approach is that data is rarely, if ever, moved once it is written. Even a perfect distribution will become imbalanced when the storage system is expanded, because new disks either sit empty or contain only new data. Either old or new disks may be busy, depending on the system workload, but only the rarest of conditions will utilize both equally to take full advantage of available resources.

A robust solution is to distribute all data in a system randomly among available storage devices. This leads to a probabilistically balanced distribution and uniformly mixes old and new data together. When new storage is added, a random sample of existing data is migrated onto new storage devices to restore balance. This approach has the critical advantage that, on average, all devices will be similarly loaded, allowing the system to perform well under any potential workload [Santos et al. 2000]. Furthermore, in a large storage system, a single large file will be randomly distributed across a large set of available devices, providing a high level of parallelism and aggregate bandwidth. However, simple hash-based distribution fails to cope with changes in the number of devices, incurring a massive reshuffling of data. Further, existing randomized distribution schemes that decluster replication by spreading each disk’s replicas across many other devices suffer from a high probability of data loss from coincident device failures.

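The fragility of naive hash-based placement is easy to demonstrate. The sketch below is illustrative only (not from the paper): it places objects with a hash modulo the device count and counts how many objects change location when a single device is added.

```python
import hashlib

def place(obj_id: int, num_devices: int) -> int:
    """Naive placement: hash the object id, then take it modulo the device count."""
    digest = hashlib.sha256(str(obj_id).encode()).hexdigest()
    return int(digest, 16) % num_devices

# Count how many of 10,000 objects move when the cluster grows from 10 to 11 devices.
total = 10_000
moved = sum(1 for x in range(total) if place(x, 10) != place(x, 11))
print(f"{moved / total:.0%} of objects would be reshuffled")
```

With modulo placement, roughly 10/11 of all objects relocate even though capacity grew by only 10%: exactly the "massive reshuffling" the text warns about, and the motivation for the special bucket types described in Section 3.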

We have developed CRUSH (Controlled Replication Under Scalable Hashing), a pseudo-random data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster. CRUSH is implemented as a pseudo-random, deterministic function that maps an input value, typically an object or object group identifier, to a list of devices on which to store object replicas. This differs from conventional approaches in that data placement does not rely on any sort of per-file or per-object directory—CRUSH needs only a compact, hierarchical description of the devices comprising the storage cluster and knowledge of the replica placement policy. This approach has two key advantages: first, it is completely distributed such that any party in a large system can independently calculate the location of any object; and second, what little metadata is required is mostly static, changing only when devices are added or removed.

CRUSH is designed to optimally distribute data to utilize available resources, efficiently reorganize data when storage devices are added or removed, and enforce flexible constraints on object replica placement that maximize data safety in the presence of coincident or correlated hardware failures. A wide variety of data safety mechanisms are supported, including n-way replication (mirroring), RAID parity schemes or other forms of erasure coding, and hybrid approaches (e.g., RAID-10). These features make CRUSH ideally suited for managing object distribution in extremely large (multi-petabyte) storage systems where scalability, performance, and reliability are critically important.

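The "deterministic function, no central directory" property can be sketched with rendezvous (highest-random-weight) hashing: a much simpler, unweighted stand-in for CRUSH, not the actual algorithm. The point it illustrates is that any party holding only the device list can independently compute the same ordered replica list for a given object.

```python
import hashlib

def replica_list(obj_id: str, devices: list[str], n: int = 3) -> list[str]:
    """Deterministically map an object id to an ordered list of n distinct devices.
    Every client computes the same answer from the same inputs -- no directory."""
    def score(dev: str) -> int:
        return int(hashlib.sha256(f"{obj_id}/{dev}".encode()).hexdigest(), 16)
    return sorted(devices, key=score, reverse=True)[:n]

devices = [f"osd.{i}" for i in range(8)]
targets = replica_list("objgroup-42", devices)
print(targets)  # same input always yields the same ordered target list
```

Unlike this sketch, CRUSH also honors device weights, a hierarchical cluster map, and placement rules; but the "mostly static metadata" advantage is the same: only the device list (cluster map) needs to be shared, and it changes only when devices are added or removed.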

2 Related Work

Object-based storage has recently garnered significant interest as a mechanism for improving the scalability of storage systems. A number of research and production file systems have adopted an object-based approach, including the seminal NASD file system [Gobioff et al. 1997], the Panasas file system [Nagle et al. 2004], Lustre [Braam 2004], and others [Rodeh and Teperman 2003; Ghemawat et al. 2003]. Other block-based distributed file systems like GPFS [Schmuck and Haskin 2002] and Federated Array of Bricks (FAB) [Saito et al. 2004] face a similar data distribution challenge. In these systems a semi-random or heuristic-based approach is used to allocate new data to storage devices with available capacity, but data is rarely relocated to maintain a balanced distribution over time. More importantly, all of these systems locate data via some sort of metadata directory, while CRUSH relies instead on a compact cluster description and deterministic mapping function. This distinction is most significant when writing data, as systems utilizing CRUSH can calculate any new data’s storage target without consulting a central allocator. The Sorrento [Tang et al. 2004] storage system’s use of consistent hashing [Karger et al. 1997] most closely resembles CRUSH, but lacks support for controlled weighting of devices, a well-balanced distribution of data, and failure domains for improving data safety.


Although the data migration problem has been studied extensively in the context of systems with explicit allocation maps [Anderson et al. 2001; Anderson et al. 2002], such approaches have heavy metadata requirements that functional approaches like CRUSH avoid. Choy, et al. [1996] describe algorithms for distributing data over disks which move an optimal number of objects as disks are added, but do not support weighting, replication, or disk removal. Brinkmann, et al. [2000] use hash functions to distribute data to a heterogeneous but static cluster. SCADDAR [Goel et al. 2002] addresses the addition and removal of storage, but only supports a constrained subset of replication strategies. None of these approaches include CRUSH’s flexibility or failure domains for improved reliability.

CRUSH most closely resembles the RUSH [Honicky and Miller 2004] family of algorithms upon which it is based. RUSH remains the only existing set of algorithms in the literature that utilizes a mapping function in place of explicit metadata and supports the efficient addition and removal of weighted devices. Despite these basic properties, a number of issues make RUSH an insufficient solution in practice. CRUSH fully generalizes the useful elements of RUSHP and RUSHT while resolving previously unaddressed reliability and replication issues, and offering improved performance and flexibility.


3 The CRUSH algorithm

The CRUSH algorithm distributes data objects among storage devices according to a per-device weight value, approximating a uniform probability distribution. The distribution is controlled by a hierarchical cluster map representing the available storage resources and composed of the logical elements from which it is built. For example, one might describe a large installation in terms of rows of server cabinets, cabinets filled with disk shelves, and shelves filled with storage devices. The data distribution policy is defined in terms of placement rules that specify how many replica targets are chosen from the cluster and what restrictions are imposed on replica placement. For example, one might specify that three mirrored replicas are to be placed on devices in different physical cabinets so that they do not share the same electrical circuit.

Given a single integer input value x, CRUSH will output an ordered list R of n distinct storage targets. CRUSH utilizes a strong multi-input integer hash function whose inputs include x, making the mapping completely deterministic and independently calculable using only the cluster map, placement rules, and x. The distribution is pseudo-random in that there is no apparent correlation between the resulting output from similar inputs or in the items stored on any storage device. We say that CRUSH generates a declustered distribution of replicas in that the set of devices sharing replicas for one item also appears to be independent of all other items.

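The "three mirrored replicas in different cabinets" policy can be written as a small rule program. The step names below (take/choose/emit) follow the style of CRUSH placement rules, though this exact encoding is an illustrative sketch, not the paper's syntax:

```python
# Illustrative encoding of a placement rule: descend from the root of the
# cluster map, pick 3 distinct "cabinet" buckets, then 1 device inside each.
# Because no two replicas share a cabinet, they never share its circuit.
rule_three_cabinets = [
    ("take", "root"),
    ("choose", 3, "cabinet"),
    ("choose", 1, "device"),
    ("emit",),
]
print(rule_three_cabinets)
```

The rule is data, not code: the same deterministic mapping function interprets it, so every client that holds the cluster map and the rule computes identical placements.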
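Any strong hash can stand in for the multi-input integer hash in a sketch (SHA-256 below; the actual implementation uses a Jenkins-style integer hash). Two properties matter: identical inputs always agree, and similar inputs produce outputs with no apparent correlation.

```python
import hashlib

def multi_hash(*inputs: int) -> int:
    """Strong multi-input integer hash: deterministic, but small input changes
    produce unrelated outputs (pseudo-random)."""
    data = ":".join(str(i) for i in inputs).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

# Identical inputs always agree; adjacent inputs look unrelated.
print(multi_hash(42, 7), multi_hash(43, 7))
```

Because extra inputs (such as a replica rank or bucket id) can be mixed in, one function yields independent pseudo-random streams per bucket and per replica, which is what makes the whole mapping calculable from x, the cluster map, and the rules alone.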

3.1 Hierarchical Cluster Map

The cluster map is composed of devices and buckets, both of which have numerical identifiers and weight values associated with them. Buckets can contain any number of devices or other buckets, allowing them to form interior nodes in a storage hierarchy in which devices are always at the leaves. Storage devices are assigned weights by the administrator to control the relative amount of data they are responsible for storing. Although a large system will likely contain devices with a variety of capacity and performance characteristics, randomized data distributions statistically correlate device


utilization with workload, such that device load is on average proportional to the amount of data stored. This results in a one-dimensional placement metric, weight, which should be derived from the device’s capabilities. Bucket weights are defined as the sum of the weights of the items they contain.

Buckets can be composed arbitrarily to construct a hierarchy representing available storage. For example, one might create a cluster map with “shelf” buckets at the lowest level to represent sets of identical devices as they are installed, and then combine shelves into “cabinet” buckets to group together shelves that are installed in the same rack. Cabinets might be further grouped into “row” or “room” buckets for a large system. Data is placed in the hierarchy by recursively selecting nested bucket items via a pseudo-random hash-like function. In contrast to conventional hashing techniques, in which any change in the number of target bins (devices) results in a massive reshuffling of bin contents, CRUSH is based on four different bucket types, each with a different selection algorithm to address data movement resulting from the addition or removal of devices and overall computational complexity.

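The weight rule is simple to express: a device carries an administrator-assigned weight, and a bucket's weight is the sum of its children's. A minimal sketch over a nested-dict cluster map (field names are illustrative, not the paper's encoding):

```python
def weight(node: dict) -> float:
    """A device's weight is assigned; a bucket's weight is the sum of its items'."""
    if node["type"] == "device":
        return node["weight"]
    return sum(weight(child) for child in node["children"])

shelf = {"type": "shelf", "children": [
    {"type": "device", "name": "osd.0", "weight": 1.0},
    {"type": "device", "name": "osd.1", "weight": 2.0},   # a larger disk
]}
cabinet = {"type": "cabinet", "children": [shelf]}
print(weight(cabinet))  # 3.0
```

The recursion is what makes weight a one-dimensional placement metric at every level of the hierarchy: choosing among cabinets by cabinet weight is equivalent to choosing in proportion to the total device capacity beneath each one.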
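One way a bucket can pick an item with probability proportional to weight, in the spirit of the paper's straw buckets though deliberately simplified, is weighted rendezvous hashing: each item draws a hash-seeded "straw" scaled by its weight, and the longest straw wins. Adding an item only remaps the inputs whose new straw beats their old winner, rather than reshuffling everything.

```python
import hashlib
import math

def straw_select(x: int, items: dict[str, float]) -> str:
    """Pick one item; selection probability is proportional to item weight."""
    def straw(name: str, w: float) -> float:
        h = int.from_bytes(hashlib.sha256(f"{x}:{name}".encode()).digest()[:8], "big")
        u = (h + 1) / (2**64 + 1)          # uniform draw in (0, 1)
        return w / -math.log(u)            # weighted exponential "straw length"
    return max(items, key=lambda name: straw(name, items[name]))

items = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}  # osd.2 is twice as large
picks = [straw_select(x, items) for x in range(4000)]
print(picks.count("osd.2") / len(picks))  # close to 0.5, i.e. 2/(1+1+2)
```

CRUSH's actual uniform, list, tree, and straw buckets each trade off computation cost against data movement differently; this sketch only shows the weighted, hash-driven selection idea they share.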

3.2 Replica Placement

CRUSH is designed to distribute data uniformly among weighted devices to maintain a statistically balanced utilization of storage and device bandwidth resources. The placement of replicas on storage devices in the hierarchy can also have a critical effect on data safety. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices in different shelves, racks, power supplies, controllers, and/or physical locations.

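A placement's safety against correlated failures can be checked directly: no two replicas may share a failure domain. A small sketch with hypothetical device-to-cabinet assignments (names are illustrative):

```python
def separated(placement: list[str], domain_of: dict[str, str]) -> bool:
    """True if no two replicas land in the same failure domain (e.g. cabinet)."""
    domains = [domain_of[dev] for dev in placement]
    return len(set(domains)) == len(domains)

domain_of = {"osd.0": "cab1", "osd.1": "cab1", "osd.2": "cab2", "osd.3": "cab3"}
print(separated(["osd.0", "osd.2", "osd.3"], domain_of))  # True: three cabinets
print(separated(["osd.0", "osd.1", "osd.2"], domain_of))  # False: two share cab1
```

In CRUSH itself this property is not checked after the fact but enforced by construction: a rule that chooses distinct cabinet buckets before descending to devices can never emit the second placement above.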


via:oschina

