HDFS High Availability
Eli Collins, Todd Lipcon, Aaron T Myers
Motivation
Large users often mandate that their IT systems be highly available, or use Hadoop-based platforms as part of a service with SLAs that require high availability. While high availability needs to be addressed across the stack, it makes sense for the work to start with HDFS because most components in a Hadoop-based system depend on HDFS, and their own availability may therefore be limited by HDFS availability.
Use Cases
The point of high availability is to increase the proportion of time the platform is functioning for users. We can split the use cases according to the times when the system is not functioning:
1. Planned downtime, e.g. due to software upgrades and configuration changes. Upgrades and configuration changes are likely more common than the failures that currently cause downtime, and are therefore a bigger source of downtime. Planned downtime is more or less acceptable to different users; for example, some users may have regular maintenance windows while others need to keep a service up 24x7. If an administrator needs to take the system offline in order to perform maintenance, what steps need to be performed, and how long do they take?
2. Unplanned downtime, e.g. due to unexpected hardware failures. If the system stops functioning, what steps need to be performed to bring it back online, and how long do they take? If users have a process in place to deal with planned downtime (e.g. a regular service window), then unplanned downtime is likely their primary concern.
3. Poor quality of service (QoS). Even when the cluster is functioning, poor QoS may result in a lack of availability. A cluster that does not scale may effectively be unavailable, e.g. if one job can use a disproportionate amount of resources, block other jobs, etc.
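To make the availability goal concrete, the "proportion of time the platform is functioning" can be expressed in terms of mean time between failures (MTBF) and mean time to repair (MTTR). The sketch below uses illustrative figures that are assumptions for the example, not measurements from this document:

```python
# Availability as the proportion of time the system is functioning:
#   availability = MTBF / (MTBF + MTTR)
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_hours_per_year(avail: float) -> float:
    # Expected unavailable hours over a (non-leap) year.
    return (1.0 - avail) * 365 * 24

# Hypothetical example: a Namenode that fails once a month (~730 h MTBF)
# and takes 30 minutes to restart and replay the edits log.
a = availability(730, 0.5)
print(f"{a:.5f}")                          # 0.99932
print(f"{downtime_hours_per_year(a):.1f}")  # 6.0
```

The example illustrates why reducing MTTR (fast fail-over) matters as much as reducing failure frequency: halving the recovery time halves the expected yearly downtime.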
We make the following assumptions:
1. Because more users can tolerate planned downtime (e.g. they will have regular maintenance windows), unplanned downtime is the higher priority. Scalability and resource management are out of the scope of this document.
2. Intermediate HDFS releases may rely on an HA NFS filer, since this investment can be amortized over multiple clusters and is complementary to existing HDFS systems (e.g. users often already buy HA filers to store the image and edits log). There is value in supporting both options, as some users may already be comfortable operating filers and want to avoid the operational complexity of a new storage option.
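As an illustration of the filer-based option, the HA implementation that later shipped in Hadoop (HDFS-1623) lets the active and standby Namenodes share an edits directory on an NFS mount via `dfs.namenode.shared.edits.dir`. A minimal `hdfs-site.xml` fragment might look like the following; the nameservice ID, Namenode IDs, and mount path are hypothetical:

```xml
<!-- Hypothetical HA pair "mycluster" with Namenodes nn1/nn2; the shared
     edits directory lives on an HA NFS filer mounted at /mnt/filer. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer/hdfs/ha-edits</value>
</property>
```

The active Namenode writes edits to the shared directory and the standby tails them, which is what makes a hot standby possible without new storage infrastructure beyond the filer.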
3. Because most components in the platform store data in HDFS, they depend on it for their own availability. HDFS is therefore the natural place to start when addressing platform availability. This writeup focuses on improving HDFS availability with the intent of increasing overall platform availability; note that MapReduce and HBase may need to be modified to benefit from improvements in HDFS availability (for example by continuing to function during Namenode fail-over). Components dependent on these, e.g. Pig and Hive, will benefit transitively.
Requirements / Assumptions
Both manual and automatic fail-over should be supported. Manual hot fail-over and automatic
hot fail-over are the most important use cases. Warm standby should be supported but is less
important than hot fail-over.
An active-passive configuration with two dedicated servers is sufficient for the near term. Future
releases should not require dedicated hosts be specified up-front (assuming any host is capable
of running the Namenode).
It is acceptable to require an HA NFS filer. Future releases/updates should not, i.e. no additional hardware aside from the servers and switches should be required for high availability.
An admin should be able to fail-back after fail-over.
The standby should not be required to share a switch with the master, i.e. you can run the standby cross-rack.
Failure types should be handled according to current recommended hardware configurations (e.g. it's OK to require that the primary and standby use ECC memory, redundant power, etc.).
It is important to handle soft failures; components are frequently flaky rather than fail-stop.
Adding a dependency on Linux HA projects (e.g. Heartbeat) is acceptable, if necessary.
Operators (not using Enterprise) will perform and monitor fail-over tasks via the command-line
tools and Web UIs.
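For reference, the command-line workflow for manual fail-over in the HA implementation that eventually shipped in Hadoop looks roughly like this; these commands come from later Hadoop releases rather than this design document, and `nn1`/`nn2` are hypothetical Namenode IDs:

```shell
# Check which Namenode is currently active (hypothetical IDs nn1/nn2).
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2; the tool fences the old active
# Namenode before transitioning the standby to active.
hdfs haadmin -failover nn1 nn2
```

Keeping the operator workflow to a couple of commands is exactly the kind of simplicity the goals below call for.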
Goals
The following goals apply to HA generally:
HA configuration and fail-over management steps need to be simple, to prevent unavailability and data loss due to configuration/operational mistakes.
HA should use consistent mechanisms and techniques across components in a Hadoop-based platform.