ARGONNE NATIONAL LABORATORY
9700 South Cass Avenue
Argonne, IL 60439
ANL/MCS-TM-234
Users Guide for ROMIO: A High-Performance,
Portable MPI-IO Implementation
by
Rajeev Thakur, Robert Ross, Ewing Lusk, William Gropp, Robert Latham
Mathematics and Computer Science Division
Technical Memorandum No. 234
Revised May 2004, November 2007, April 2010
This work was supported by the Mathematical, Information, and Computational Sciences Division subpro-
gram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract
W-31-109-Eng-38; and by the Scalable I/O Initiative, a multiagency project funded by the Defense Ad-
vanced Research Projects Agency (Contract DABT63-94-C-0049), the Department of Energy, the National
Aeronautics and Space Administration, and the National Science Foundation.
Contents
Abstract
1 Introduction
2 Major Changes in This Version
3 General Information
3.1 ROMIO Optimizations
3.2 Hints
3.2.1 Hints for PFS
3.2.2 Hints for XFS
3.2.3 Hints for PVFS (v1)
3.2.4 Hints for PVFS (v2)
3.2.5 Hints for Lustre
3.2.6 Hints for PANFS (Panasas)
3.2.7 Systemwide Hints
3.3 Using ROMIO on NFS
3.3.1 ROMIO, NFS, and Synchronization
3.4 Using testfs
3.5 ROMIO and MPI_FILE_SYNC
3.6 ROMIO and MPI_FILE_SET_SIZE
4 Installation Instructions
4.1 Configuring for Linux and Large Files
5 Testing ROMIO
6 Compiling and Running MPI-IO Programs
7 Limitations of This Version of ROMIO
8 Usage Tips
9 Reporting Bugs
10 ROMIO Internals
11 Learning MPI-IO
12 Major Changes in Previous Releases
12.1 Major Changes in Version 1.2.3
12.2 Major Changes in Version 1.0.3
12.3 Major Changes in Version 1.0.2
12.4 Major Changes in Version 1.0.1
References
Users Guide for ROMIO: A High-Performance,
Portable MPI-IO Implementation
by
Rajeev Thakur, Robert Ross, Ewing Lusk, and William Gropp
Abstract
ROMIO is a high-performance, portable implementation of MPI-IO (the I/O chapter in the
MPI Standard). This document describes how to install and use ROMIO version 1.2.4 on
various machines.
1 Introduction
ROMIO¹ is a high-performance, portable implementation of MPI-IO (the I/O chapter in MPI [4]).
This document describes how to install and use ROMIO version 1.2.4 on various machines.
2 Major Changes in This Version
• Added a section describing ROMIO MPI_File_sync and MPI_File_close behavior to the User's
Guide
• Removed a bug in the PVFS ADIO implementation affecting resize operations
• Added support for PVFS listio operations (see Section 3.2)
• Added the following working hints: romio_pvfs_listio_read, romio_pvfs_listio_write
3 General Information
This version of ROMIO includes everything defined in the MPI I/O chapter except support for file
interoperability and user-defined error handlers for files (§ 4.13.3). The subarray and distributed
array datatype constructor functions from Chapter 4 (§ 4.14.4 and § 4.14.5) have been implemented.
They are useful for accessing arrays stored in files. The functions MPI_File_f2c and MPI_File_c2f
(§ 4.12.4) are also implemented. C, Fortran, and profiling interfaces are provided for all functions
that have been implemented.
This version of ROMIO runs on at least the following machines: IBM SP; Intel Paragon; HP
Exemplar; SGI Origin2000; Cray T3E; NEC SX-4; other symmetric multiprocessors from HP,
SGI, DEC, Sun, and IBM; and networks of workstations (Sun, SGI, HP, IBM, DEC, Linux, and
FreeBSD). Supported file systems are IBM PIOFS, Intel PFS, HP/Convex HFS, SGI XFS, NEC
SFS, PVFS, NFS, NTFS, and any Unix file system (UFS).
This version of ROMIO is included in MPICH 1.2.4; an earlier version is included in at least
the following MPI implementations: LAM, HP MPI, SGI MPI, and NEC MPI.
¹ http://www.mcs.anl.gov/romio
Note that proper I/O error codes and classes are returned, and the status variable is filled,
only when ROMIO is used with MPICH revision 1.2.1 or later.
You can open files on multiple file systems in the same program. The only restriction is that
the directory where the file is to be opened must be accessible from the process opening the file.
For example, a process running on one workstation may not be able to access a directory on the
local disk of another workstation, and therefore ROMIO will not be able to open a file in such a
directory. NFS-mounted files can be accessed.
An MPI-IO file created by ROMIO is no different from any other file created by the underlying
file system. Therefore, you may use any of the commands provided by the file system to access the
file, for example, ls, mv, cp, rm, ftp.
Please read the limitations of this version of ROMIO that are listed in Section 7 of this document
(e.g., restriction to homogeneous environments).
3.1 ROMIO Optimizations
ROMIO implements two I/O optimization techniques that in general result in improved performance
for applications. The first of these is data sieving [2]. Data sieving is a technique for
efficiently accessing noncontiguous regions of data in files when noncontiguous accesses are not
provided as a file system primitive. The naive approach to accessing noncontiguous regions is to
use a separate I/O call for each contiguous region in the file. This results in a large number of
I/O operations, each of which is often for a very small amount of data. Because of latency, the
cost of performing an I/O operation across the network, as in parallel I/O systems, is particularly
high. Thus, this naive approach typically performs very poorly because of the overhead of the
many operations. In the data sieving technique, a number of noncontiguous regions are accessed
by reading a single block of data containing all of the regions, including the unwanted data between
them (called "holes"). The regions of interest are then extracted from this large block by the client.
This technique has the advantage of requiring a single I/O call, but additional data is read from
the disk and passed across the network.
There are four hints that can be used to control the application of data sieving in ROMIO:
ind_rd_buffer_size, ind_wr_buffer_size, romio_ds_read, and romio_ds_write. These are dis-
cussed in Section 3.2.
The second optimization is two-phase I/O [1]. Two-phase I/O, also called collective buffering,
is an optimization that applies only to collective I/O operations. In two-phase I/O, the collection
of independent I/O operations that make up the collective operation is analyzed to determine
what data regions must be transferred (read or written). These regions are then divided among
a set of aggregator processes that actually interact with the file system. In the case of a read,
these aggregators first read their regions from disk and redistribute the data to the final locations;
in the case of a write, data is first collected from the processes before being written to disk
by the aggregators.
There are five hints that can be used to control the application of two-phase I/O: cb_config_list,
cb_nodes, cb_buffer_size, romio_cb_read, and romio_cb_write. These are discussed in Sec-
tion 3.2.
2
3.2 Hints
If ROMIO doesn't understand a hint, or if the value is invalid, the hint will be ignored. The values
of hints being used by ROMIO for a file can be obtained at any time via MPI_File_get_info.
The following hints control the data sieving optimization and are applicable to all file system
types:
• ind_rd_buffer_size – Controls the size (in bytes) of the intermediate buffer used by ROMIO
when performing data sieving during read operations. Default is 4194304 (4 Mbytes).
• ind_wr_buffer_size – Controls the size (in bytes) of the intermediate buffer used by ROMIO
when performing data sieving during write operations. Default is 524288 (512 Kbytes).
• romio_ds_read – Determines when ROMIO will choose to perform data sieving. Valid values
are enable, disable, or automatic. Default value is automatic. In automatic mode ROMIO
may choose to enable or disable data sieving based on heuristics.
• romio_ds_write – Same as romio_ds_read, but for write operations.
The following hints control the two-phase (collective buffering) optimization and are applicable
to all file system types:
• cb_buffer_size – Controls the size (in bytes) of the intermediate buffer used in two-phase
collective I/O. If the amount of data that an aggregator will transfer is larger than this value,
then multiple operations are used. The default is 4194304 (4 Mbytes).
• cb_nodes – Controls the maximum number of aggregators to be used. By default this is set
to the number of unique hosts in the communicator used when opening the file.
• romio_cb_read – Controls when collective buffering is applied to collective read operations.
Valid values are enable, disable, and automatic. Default is automatic. When enabled,
all collective reads will use collective buffering. When disabled, all collective reads will be
serviced with individual operations by each process. When set to automatic, ROMIO will
use heuristics to determine when to enable the optimization.
• romio_cb_write – Controls when collective buffering is applied to collective write operations.
Valid values are enable, disable, and automatic. Default is automatic. See the description
of romio_cb_read for an explanation of the values.
• romio_no_indep_rw – This hint controls when "deferred open" is used. When set to true,
ROMIO will make an effort to avoid performing any file operation on non-aggregator nodes.
The application is expected to use only collective operations. This is discussed in further
detail below.
• cb_config_list – Provides explicit control over aggregators. This is discussed in further
detail below.
For some system configurations, more control is needed to specify which hardware resources
(processors or nodes in an SMP) are preferred for collective I/O, either for performance reasons
3