IBM Systems and Technology Group
By Scott Fadden, IBM Corporation
August 2012
An Introduction to GPFS Version 3.5
Technologies that enable the management of big data.
Contents
Introduction
What is GPFS?
The file system
    Application interfaces
    Performance and scalability
    Administration
Data availability
    Data replication
GPFS Native RAID (GNR)
Information lifecycle management (ILM) toolset
Cluster configurations
    Shared disk
    Network-based block I/O
    Mixed clusters
    Sharing data between clusters
What’s new in GPFS Version 3.5
    Active File Management
    High Performance Extended Attributes
    Independent Filesets
    Fileset Level Snapshots
    Fileset Level Quotas
    File Cloning
    IPv6 Support
    GPFS Native RAID
Summary
Introduction
Big data, cloud storage: it doesn’t matter what you call it, there is certainly
an increasing demand to store larger and larger amounts of unstructured data.
The IBM General Parallel File System (GPFS™) has always been considered a
pioneer of big data storage and continues today to introduce industry leading
storage technologies¹. Since 1998 GPFS has led the industry with many
technologies that make the storage of large quantities of file data possible.
The latest version continues in that tradition: GPFS 3.5 represents a
significant milestone in the evolution of big data management, introducing
revolutionary new features that clearly demonstrate IBM’s commitment to
providing industry leading storage solutions.
This paper does not just throw out a bunch of buzzwords; it explains the
features available today in GPFS that you can use to manage your file data.
These include core GPFS concepts such as striped data storage; cluster
configuration options, including direct storage access and network-based block
I/O; storage automation technologies such as the information lifecycle
management (ILM) tools; and new features including file cloning, more flexible
snapshots and an innovative global namespace feature called Active File
Management (AFM).
This paper is based on the latest release of GPFS, though much of the
information applies to prior releases. If you are already familiar with GPFS,
take a look at the “What’s new” section for a quick update on the new features
introduced in GPFS 3.5.
¹ 2011 Annual HPCwire Readers’ Choice Awards:
http://www.hpcwire.com/specialfeatures/2011_Annual_HPCwire_Readers_Choice_Awards.html
What is GPFS?
GPFS is more than clustered file system software; it is a full featured set of file
management tools. This includes advanced storage virtualization, integrated
high availability, automated tiered storage management and the performance
to effectively manage very large quantities of file data.
GPFS allows a group of computers concurrent access to a common set of file
data over a common SAN infrastructure, a network or a mix of connection types.
The computers can run any mix of AIX, Linux or Windows Server operating
systems. GPFS provides storage management, information lifecycle management
tools and centralized administration, and it allows shared access to file
systems from remote GPFS clusters, providing a global namespace.
A GPFS cluster can be a single node, two nodes providing a high availability
platform supporting a database application, for example, or thousands of nodes
used for applications like the modeling of weather patterns. The largest
existing configurations exceed 5,000 nodes. GPFS has been available since 1998
and has been field proven for more than 14 years on some of the world’s most
powerful supercomputers², providing reliability and efficient use of
infrastructure bandwidth.
GPFS was designed from the beginning to support high performance parallel
workloads and has since been proven very effective for a variety of applications.
Today it is installed in clusters supporting big data analytics, gene sequencing,
digital media and scalable file serving. These applications are used across many
industries including financial, retail, digital media, biotechnology, science and
government. GPFS continues to push technology limits by being deployed in
very demanding, very large environments. You may not need multiple petabytes
of data today, but you will, and when you get there you can rest assured that
GPFS has already been tested in such environments. This leadership is what
makes GPFS a solid solution for applications of any size.
Supported operating systems for GPFS Version 3.5 include AIX, Red Hat, SUSE
and Debian Linux distributions and Windows Server 2008.
The file system
A GPFS file system is built from a collection of arrays that contain the file system
data and metadata. A file system can be built from a single disk or contain
thousands of disks storing petabytes of data. Each file system is accessible
from all nodes within the cluster. There is no practical limit on the size of
a file system; the architectural limit is 2⁹⁹ bytes. As an example, current
GPFS customers are using single file systems up to 5.4 PB in size, and others
have file systems containing billions of files.

² See the top 100 list from November 2011. Source: Top 500 Supercomputer
Sites: http://www.top500.org/
Application interfaces
Applications access files through standard POSIX file system interfaces. Since all
nodes see all of the file data, applications can scale out easily. Any node in the
cluster can concurrently read or update a common set of files. GPFS maintains
the coherency and consistency of the file system using sophisticated byte range
locking, token (distributed lock) management and journaling. This means that
applications using standard POSIX locking semantics do not need to be modified
to run successfully on a GPFS file system.
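For example, the following is a minimal sketch of ordinary POSIX byte-range
locking that runs unmodified on a GPFS file system. The mount point /gpfs/fs1
and the one-region-per-node layout are hypothetical, chosen only to illustrate
concurrent, non-conflicting updates from multiple nodes.

    /* Minimal sketch: standard POSIX byte-range locking, unmodified on
       GPFS. Each instance (run on any node in the cluster) locks and
       updates its own 4 KiB region of a shared file. The path
       /gpfs/fs1/shared.dat is a hypothetical GPFS mount point. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int region = (argc > 1) ? atoi(argv[1]) : 0;  /* region index */
        int fd = open("/gpfs/fs1/shared.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type   = F_WRLCK;               /* exclusive write lock */
        fl.l_whence = SEEK_SET;
        fl.l_start  = (off_t)region * 4096;  /* this node's byte range */
        fl.l_len    = 4096;

        if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until granted */
            perror("fcntl");
            return 1;
        }

        char buf[4096] = "updated";          /* update the locked range */
        pwrite(fd, buf, sizeof(buf), fl.l_start);

        fl.l_type = F_UNLCK;                 /* release the lock */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }

Because GPFS enforces these semantics cluster-wide through its token
management, instances locking disjoint ranges proceed in parallel while
conflicting requests are serialized.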
In addition to the standard interfaces, GPFS provides a unique set of extended
interfaces that can be used to build advanced application functionality. Using
these extended interfaces, an application can determine the storage pool
placement of a file, create a file clone and manage quotas. These extended
interfaces supplement, rather than replace, the standard POSIX interface.
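As an illustration, the sketch below asks GPFS which storage pool a file
resides in through the gpfs_fcntl() extended interface. The structure and
constant names (gpfsFcntlHeader_t, gpfsGetStoragePool_t,
GPFS_FCNTL_GET_STORAGEPOOL) follow the GPFS programming reference, but treat
this as a sketch and verify the declarations against the gpfs_fcntl.h header
shipped with your installation.

    /* Sketch: query a file's storage pool using the GPFS extended
       interface gpfs_fcntl(). Requires the GPFS headers and library
       (compile with -lgpfs); verify names against your gpfs_fcntl.h. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <gpfs_fcntl.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* gpfs_fcntl() takes a header followed by one or more requests. */
        struct {
            gpfsFcntlHeader_t    hdr;
            gpfsGetStoragePool_t pool;
        } arg;
        memset(&arg, 0, sizeof(arg));
        arg.hdr.totalLength  = sizeof(arg);
        arg.hdr.fcntlVersion = GPFS_FCNTL_CURRENT_VERSION;
        arg.pool.structLen   = sizeof(arg.pool);
        arg.pool.structType  = GPFS_FCNTL_GET_STORAGEPOOL;

        if (gpfs_fcntl(fd, &arg) != 0) { perror("gpfs_fcntl"); return 1; }

        printf("%s is in storage pool: %s\n", argv[1], arg.pool.buffer);
        close(fd);
        return 0;
    }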
Performance and scalability
GPFS provides unparalleled performance for unstructured data. GPFS achieves
high performance I/O by:
- Striping data across multiple disks attached to multiple nodes.
- Performing high performance metadata (inode) scans.
- Supporting a wide range of file system block sizes to match I/O
  requirements.
- Using advanced algorithms to improve read-ahead and write-behind I/O
  operations.
- Using block level locking, based on a very sophisticated and scalable token
  management system, to provide data consistency while allowing multiple
  application nodes concurrent access to the files.
When creating a GPFS file system you provide a list of raw devices, and they
are assigned to GPFS as Network Shared Disks (NSDs). Once an NSD is defined,
all of the nodes in the GPFS cluster can access the disk, either through a
local disk connection or through the GPFS NSD network protocol, which ships
data over a TCP/IP or InfiniBand connection.
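To benefit from this striping, applications typically issue large sequential
I/O requests so that each call maps onto whole file system blocks spread
across the NSDs. The fragment below is a sketch of such a writer; the 4 MiB
request size is an assumption chosen to match a hypothetical file system block
size, not a GPFS requirement, and the target path is a hypothetical mount.

    /* Sketch: sequential writer sized to a hypothetical 4 MiB file
       system block size, so each request covers whole GPFS blocks and
       the striping and write-behind logic can stream data across all
       of the NSDs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define REQUEST_SIZE (4 * 1024 * 1024)  /* assumed FS block size */
    #define NUM_REQUESTS 256                /* 1 GiB total */

    int main(void)
    {
        int fd = open("/gpfs/fs1/stream.dat",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(REQUEST_SIZE);
        if (buf == NULL) { perror("malloc"); return 1; }
        memset(buf, 'x', REQUEST_SIZE);

        for (int i = 0; i < NUM_REQUESTS; i++) {
            if (write(fd, buf, REQUEST_SIZE) != REQUEST_SIZE) {
                perror("write");
                return 1;
            }
        }

        free(buf);
        close(fd);
        return 0;
    }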