I/O Performance Experiments

Two popular benchmarking tools for I/O performance experiments are IOR and IOZone.  For single client (even multi-core) experiments, IOZone is by far the easiest and most versatile.  For measuring aggregate I/O throughput across multiple clients, IOR scales much better than IOZone.


IOR leverages the scalability of MPI to easily and accurately calculate the aggregate bandwidth of an (almost) unlimited number of client machines.  In addition, IOR can utilize the POSIX, MPI-IO, and HDF5 I/O interfaces.  The main downside of IOR is that you need to have a working version of MPI installed on your machines (and know how to use it).  Another downside is that it is quite limited in its capabilities, focusing on reading and writing a file from beginning to end in a sequential or strided manner.


IOZone is ideal for performing single client I/O throughput experiments.  It can be used in “cluster mode” with multiple clients, but it is slow (using rsh/ssh for communication) and sometimes a little buggy.  IOZone can execute several different types of I/O workloads as well as execute I/O requests listed in a file.


More benchmarks can be found at:

·        http://www.cs.dartmouth.edu/pario/examples.html

·        http://www.flash.uchicago.edu/website/home/

·        http://www.llnl.gov/asc/computing_resources/purple/rfp/benchmarks/limited/code_list.html

·        http://www.mpiblast.org/index.html

·        http://www.ncbi.nlm.nih.gov/BLAST/

·        http://sourceforge.net/projects/filebench/

·        http://lbs.sourceforge.net/

·        http://perc.nersc.gov/applications.htm

·        http://www.nas.nasa.gov/Resources/Software/npb.html

·        http://www.llnl.gov/icc/lc/siop/downloads/download.html

·        http://www.nersc.gov/projects/esp.php


WARNING: Every tool calculates I/O throughput in a slightly different manner.  The results from one benchmarking tool cannot be compared with the results from another.  For example, some benchmarks include the time required to open and close a file, while others include the close but not the open.  Be very careful and ensure you understand what the benchmark is measuring.  In addition, if IOR or IOZone is used for multiple client experiments, the same tool should also be used for single client experiments (and vice-versa).  All experimental results should include a minimum of 10 executions, with the outliers closely scrutinized and evaluated.
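
One way to apply the 10-execution rule is to collect the throughput reported by each execution into a file, one number per line, and summarize it.  The sketch below is only an illustration: the function name summarize_runs, the one-value-per-line file layout, and the cutoff of 2 standard deviations for flagging outliers are all assumptions, not something IOR or IOZone provides.

```shell
# summarize_runs FILE
# FILE holds one throughput value per line (for example MB/s), one per run.
summarize_runs() {
    awk '
        NF { v[n++] = $1; sum += $1 }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            mean = sum / n
            for (i = 0; i < n; i++) ss += (v[i] - mean) ^ 2
            # Sample standard deviation (n - 1 in the denominator).
            sd = (n > 1) ? sqrt(ss / (n - 1)) : 0
            printf "runs=%d mean=%.2f stddev=%.2f\n", n, mean, sd
            if (n < 10) printf "note: only %d runs, at least 10 recommended\n", n
            # Assumed cutoff: flag runs more than 2 standard deviations from the mean.
            for (i = 0; i < n; i++)
                if (sd > 0 && (v[i] - mean > 2 * sd || mean - v[i] > 2 * sd))
                    printf "outlier: run %d value %s\n", i + 1, v[i]
        }
    ' "$1"
}
```

A flagged run is a prompt to look at what happened on the machines during that execution, not a reason to discard the value automatically.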


Machine Characterization

To see the disk, network, and CPU utilization on a machine, you can use the sar, vmstat, and iostat tools.
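
As a rough sketch of how these tools can be combined with a benchmark run, the function below starts whichever of vmstat, iostat, and sar are installed, runs a command, and then stops the samplers.  The function name monitor_during, the log file names, and the 5-second sampling interval are assumptions chosen for illustration.

```shell
# monitor_during LOGDIR CMD [ARGS...]
# Runs CMD while vmstat, iostat, and sar sample the machine in the background.
monitor_during() {
    logdir=$1
    shift
    mkdir -p "$logdir" || return 1
    pids=""
    # Tools that are not installed are skipped.  The 5-second interval is an
    # arbitrary choice for this sketch.
    if command -v vmstat >/dev/null 2>&1; then
        vmstat 5 > "$logdir/vmstat.log" 2>&1 &
        pids="$pids $!"
    fi
    if command -v iostat >/dev/null 2>&1; then
        iostat -x 5 > "$logdir/iostat.log" 2>&1 &
        pids="$pids $!"
    fi
    if command -v sar >/dev/null 2>&1; then
        sar -n DEV 5 > "$logdir/sar-net.log" 2>&1 &
        pids="$pids $!"
    fi
    rc=0
    "$@" || rc=$?
    # Stop the samplers once the benchmark command has finished.
    for p in $pids; do
        kill "$p" 2>/dev/null || true
        wait "$p" 2>/dev/null || true
    done
    return "$rc"
}

# Example (not executed here):
#   monitor_during logs/run1 iozone -ec -t 1 -r 4M -s 100M -+n -i 0 -i 1
```

Keeping one log directory per execution makes it easier to line up an outlier result with what the disk, network, and CPU were doing at the time.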

IOZone (www.iozone.org)

IOZone has many options, including the ability to automatically generate an Excel worksheet file and to umount and mount a file system between experiments (to flush the cache).  See the website for the latest version and documentation.  Note: the documentation is generally not up to date with the latest features.  To see all possible features, download and build IOZone and run ‘iozone -h’.


In general there are three modes:

  1. Regular
  2. Throughput
  3. Cluster


Regular mode uses a single thread to perform the experiments.  It is the default mode.  Throughput mode allows the user to specify how many threads (on a single machine) should execute the experiment.  For some reason, regular mode sometimes returns different results than throughput mode with a single thread.  For this reason, it is best not to compare regular mode results with throughput mode results.  Cluster mode uses a special configuration file to enable multiple client experiments.
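
A minimal sketch of a cluster mode setup follows.  The host names client1 and client2, the /mnt/nfs working directory, the iozone path, and the file name iozone-clients.cfg are placeholders, not values from this document.  It assumes the client list format used by the -+m option (one client per line: host name, working directory on that client, path to the iozone executable on that client) and that IOZone honors the RSH environment variable to select ssh; check both against ‘iozone -h’ for your version.

```shell
# Hypothetical client list for cluster mode; host names and paths are placeholders.
# Each line: <client host> <working directory on client> <path to iozone on client>
cat > iozone-clients.cfg <<'EOF'
client1 /mnt/nfs /usr/local/bin/iozone
client2 /mnt/nfs /usr/local/bin/iozone
EOF

# Use ssh rather than rsh for remote execution.
export RSH=ssh

# The command is printed rather than executed so the sketch is safe to paste.
# Remove the echo to launch one process per line of the client list.
echo iozone -+m iozone-clients.cfg -t 2 -r 4M -s 100M -i 0 -i 1
```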


Another major point of note is that IOZone runs tests very quickly, producing a lot of output.  Inevitably, some experiments will return irregular results, possibly because the file system is trying to deal with so many different types of requests.  If an experiment does not return an expected result, it is best to run just that individual experiment to see the true performance without all of the interference.



·         iozone -ec -t 1 -r 4M -s 100M -+n -i 0 -i 1
        Run in throughput mode with a single thread, read and write a 100MB file in 4MB chunks in the current directory.  Include fsync and close in the timing and do not perform retests.
·         iozone -Raec -i 2 -y 1k -g 64M -f /mnt/nfs/myfile -U /mnt/nfs
       Run in automatic mode and perform a random read and write experiment on /mnt/nfs/myfile, generating an Excel report.  Include fsync and close in the timings, umount and mount /mnt/nfs between every experiment, and use a minimum chunk size of 1k and a maximum file size of 64MB.


IOR (http://www.llnl.gov/icc/lc/siop/downloads/download.html).

With IOR, clients can first write data and then read the data written by another client, which avoids having to clear a client's cache.  This works well for two or more clients.  For a single client, the clearcache program at the bottom of this document can remove a file from the client cache without having to umount/mount.

IOR requires MPI (for example mpich2, http://www-unix.mcs.anl.gov/mpi/mpich2/).  All nodes must run ntpd, as IOR timing depends on synchronized clocks.  It is a good idea to sync the time on all nodes before every test just to be sure.  The output of each experiment displays a "Clock Deviation" value.  This is the maximum deviation between the clocks on all participating clients.  If this value is greater than a few tenths of a second, the clocks on every node should be synchronized and the experiment should be re-executed.

IMPORTANT: Another caveat is that if you have a set of MPI clients, MPI determines which clients actually execute the IOR command.  It is not guaranteed to always be the same ones.  For example, if you have five nodes, iota1-5, the first time you execute a single client experiment it might run on iota3 and the second iteration might run on iota4.  This is very dangerous for running experiments, as the clients that execute the read or write requests are not the same from one iteration to another.  To avoid this problem, always use the -machinefile option, which allows the user to specify the order of the client machines used in the experiment.

Install and configuration instructions for mpich2 are below, followed by the IOR configuration, installation, and syntax.

mkdir /usr/local/bin/mpich2
./configure --enable-romio --enable-timer-type=gettimeofday -prefix=/usr/local/bin/mpich2
make install
# copy install dir to all machines
# follow remaining instructions in README to ensure mpdboot, mpd, mpdtrace, and mpiexec work correctly

Starting mpi:
mpdboot -n <num of hosts> --file=mpd-clients.hosts
# mpd-clients.hosts has all client names (1 per line) except if you are starting mpi on node1, then omit node1 from mpd-clients.hosts
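
To illustrate the rule about omitting the local node, here is a small sketch that builds mpd-clients.hosts from a full host list.  The function name write_mpd_hosts is hypothetical, the iota1-5 host names are reused from the example above, and the -n value assumes that mpdboot counts the local mpd in addition to the hosts in the file.

```shell
# write_mpd_hosts LOCAL HOST...
# Prints every HOST except LOCAL, one per line.
write_mpd_hosts() {
    local_host=$1
    shift
    for h in "$@"; do
        if [ "$h" != "$local_host" ]; then
            echo "$h"
        fi
    done
}

# Starting mpi on iota1, so iota1 is left out of the file.
write_mpd_hosts iota1 iota1 iota2 iota3 iota4 iota5 > mpd-clients.hosts

# Assumption: the -n count includes the local mpd, so it is one more than
# the number of lines in the file.
n=$(( $(wc -l < mpd-clients.hosts) + 1 ))
echo "mpdboot -n $n --file=mpd-clients.hosts"
```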

Executing a command:
mpiexec -machinefile mpd-machinefile -n <number of hosts to run command on> <cmd>
where mpd-machinefile lists the hosts that execute the command, in order from top to bottom of the file.  Without a machinefile, mpich2 will randomly pick the hosts to execute the command.
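
As a small illustration of the top-to-bottom ordering, the sketch below writes a machinefile for the iota1-5 example nodes and lists the hosts an N-client run would use.  The helper name hosts_for_run is hypothetical, and the sketch assumes one process per host.

```shell
# One host per line, in the order the clients should be used.
printf '%s\n' iota1 iota2 iota3 iota4 iota5 > mpd-machinefile

# hosts_for_run N FILE
# Prints the first N hosts of FILE, assuming one process per host.
hosts_for_run() {
    head -n "$1" "$2"
}

hosts_for_run 2 mpd-machinefile
```

Recording this list alongside the results of each experiment makes it easy to confirm that every iteration used the same clients.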


Failed shutdown:
If the mpd processes do not exit properly on every node, you need to manually kill the python mpd processes on each client (xargs is useful for this).
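
A minimal sketch of the xargs approach, to be run on each client.  The function name mpd_pids is hypothetical, and the match (a command whose executable name contains "python" and whose command line contains "mpd") is an assumption about how the mpd processes appear in ps output, so review the list before killing anything.

```shell
# mpd_pids
# Reads "PID COMMAND ARGS..." lines on stdin and prints the PIDs of
# processes whose executable is python and whose command line mentions mpd.
mpd_pids() {
    awk '$2 ~ /python/ && /mpd/ { print $1 }'
}

# On each client, review the matching processes first:
#   ps -eo pid,args | mpd_pids
# then kill them:
#   ps -eo pid,args | mpd_pids | xargs kill
```

Matching on the second field keeps the filter from selecting the awk process itself.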

IOR install
To build, follow README instructions.
To run (once mpi is running)
mpiexec -machinefile ../hosts/mpd-machinefile -n <num of hosts total> IOR -a POSIX -N <num of clients in test> -b 500m -d 5 -t 128k -o /mnt/pnfs/file1 -e -g -w -r -s 1 -i <num of repetitions> -vv -F -C
where (IOR -h):
-d N  interTestDelay -- delay between reps in seconds
-e    fsync -- perform fsync upon POSIX write close
-g    intraTestBarriers -- use barriers between open, write/read, and close
-s N  segmentCount -- number of segments
-v    verbose -- output information (repeating flag increases level)
-F    filePerProc -- file-per-process (single or multiple files)
-C    reorderTasks -- changes task ordering to n+1 ordering for readback
-B    useO_DIRECT -- uses O_DIRECT for POSIX, bypassing I/O buffers
-k    keepFile -- don't remove the test file(s) on program exit
-x    singleXferAttempt -- do not retry transfer if incomplete
-w    writeFile -- write file
-r    readFile -- read existing file
-t N  transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)
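
To sanity check a parameter choice before a long run, the arithmetic can be sketched as below.  The function name ior_totals is hypothetical.  It assumes each client writes blockSize (-b) times segmentCount (-s) bytes and that the k and m suffixes are binary multiples.

```shell
# ior_totals BLOCK_MIB SEGMENTS CLIENTS XFER_KIB
# Prints the data written per client, the aggregate across all clients,
# and the number of transfers each client issues.
ior_totals() {
    per_client_mib=$(( $1 * $2 ))
    total_mib=$(( per_client_mib * $3 ))
    xfers=$(( per_client_mib * 1024 / $4 ))
    echo "per_client_mib=$per_client_mib total_mib=$total_mib transfers_per_client=$xfers"
}

# -b 500m -s 1 with 8 clients and -t 128k
ior_totals 500 1 8 128
# prints: per_client_mib=500 total_mib=4000 transfers_per_client=4000
```

If the per-client amount is much smaller than the client's memory, reads may be served from cache unless the -C reordering or the clearcache program is used.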

To update the time before each test, I create a script “updatetime” with the following:
mpiexec -n <num of clients> ntpdate -u <ntp server>

Note: You can run just IOR (no mpi) for a single client.  Just leave out mpiexec ....

For 2 to 8 clients, I run:
for i in `seq 2 1 8`; do ./updatetime; mpiexec -machinefile mpd-machinefile -n 8 IOR -a POSIX -N "$i" -b 500m -d 5 -t 128k -o /mnt/pnfs/testior1 -e -g -w -r -s 1 -i 5 -vv -F -C | tee -a outputfilename; done

For a single client I run:
       IOR -a POSIX -N 1 -b 500m -d 5 -t 128k -o /mnt/pnfs/testior1 -e -g -w -s 1 -i 1 -vv -F -C -k | tee -a outputfilename
       /usr/local/bin/clearcache /mnt/pnfs/testior1.00000000
       IOR -a POSIX -N 1 -b 500m -d 5 -t 128k -o /mnt/pnfs/testior1 -e -g -r -s 1 -i 1 -vv -F -C -k | tee -a outputfilename

Clearcache C program

This code takes in a file and, if the data in the file exists in the page cache, removes it from the page cache.  This is useful to perform multiple read experiments on a single file without unmounting and remounting the filesystem.


Syntax: clearcache <filename>       


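#define _XOPEN_SOURCE 600   /* expose posix_fadvise() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv)
{
        int fd, result;
        if (argc != 2) {
                fprintf(stderr, "Syntax: %s <filename>\n", argv[0]);
                return 1;
        }
        printf("Opening: %s\n", argv[1]);
        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        printf("FD: %d\n", fd);
        /* Offset 0, length 0 covers the whole file; dirty pages are not dropped, so sync first. */
        result = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        printf("Result: %d\n", result);
        close(fd);
        return result ? 1 : 0;
}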