File Deletes and Undeletes
Decrease Replication Factor
References
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has
many similarities with existing distributed file systems. However, the differences from other distributed file systems are
significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few
POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the
Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is
http://hadoop.apache.org/core/.
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server
machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each
component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore,
detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that
typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by
users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many
hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas
have been traded to increase data throughput rates.
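To make the contrast concrete, the following is a minimal sketch of the sequential, streaming read pattern HDFS favors,
written against Hadoop's Java FileSystem API. The path /data/events.log and the 128 KB buffer are illustrative
assumptions for the example, not values prescribed by HDFS.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/events.log"); // hypothetical input file

        // Read the file front to back in large sequential chunks: the
        // high-throughput batch pattern HDFS is tuned for, as opposed to
        // the many small random reads a POSIX workload might issue.
        try (FSDataInputStream in = fs.open(input)) {
            byte[] buffer = new byte[128 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // an application would process buffer[0..n) here
            }
            System.out.println("Read " + total + " bytes sequentially");
        }
    }
}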
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is
tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single
cluster. It should support tens of millions of files in a single instance.
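As one illustration of that tuning, the block size can be raised per file when writing very large files. The sketch below
uses the same Java FileSystem API; the 256 MB block size, the replication factor of 3, and the output path are
assumptions chosen for the example, not recommended settings.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeFileWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("/data/large-output.bin"); // hypothetical destination

        long blockSize = 256L * 1024 * 1024; // 256 MB blocks suit multi-GB files
        short replication = 3;
        int bufferSize = 4096;

        // This create(...) overload lets a single file override the
        // cluster-wide default block size and replication factor.
        try (FSDataOutputStream out =
                fs.create(output, true, bufferSize, replication, blockSize)) {
            byte[] chunk = new byte[bufferSize];
            for (int i = 0; i < 1024; i++) { // write ~4 MB as a stand-in payload
                out.write(chunk);
            }
        }
    }
}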