Design and implementation
In summary, all input histogram logs are normalized to a fixed set of aligned time intervals (this assumes good
time synchronization across hosts), so that the histograms can be added directly to obtain a cluster-wide
histogram. Latency percentiles are then computed from that histogram by summation.
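As a rough illustration of the last two steps, here is a minimal Python sketch. It assumes the per-host histograms have already been normalized to the same time interval and have the same number of buckets. The function names sum_histograms and percentile_bucket are hypothetical, not the tool's actual API, and the sketch returns a bucket index rather than a latency value, since the mapping from bucket index to latency is not covered here.

```python
from typing import List, Sequence


def sum_histograms(histograms: Sequence[Sequence[int]]) -> List[int]:
    """Add histograms bucket by bucket; all must have the same bucket count."""
    if not histograms:
        return []
    n_buckets = len(histograms[0])
    if any(len(h) != n_buckets for h in histograms):
        raise ValueError("histograms must have the same number of buckets")
    return [sum(column) for column in zip(*histograms)]


def percentile_bucket(histogram: Sequence[int], pct: float) -> int:
    """Return the index of the first bucket at which the running sum of
    sample counts reaches pct percent of the total sample count."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram contains no samples")
    threshold = total * pct / 100.0
    running = 0
    for index, count in enumerate(histogram):
        running += count
        if running >= threshold:
            return index
    return len(histogram) - 1


# Two hypothetical per-host histograms for the same aligned interval.
cluster = sum_histograms([[1, 2, 3], [4, 0, 1]])
median_bucket = percentile_bucket(cluster, 50)
```

Because the histograms are simply added, the cost of merging grows with the number of hosts times the number of buckets, not with the number of I/O samples.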
Histogram log parsing
A fio histogram log consists of 3 columns of metadata followed by a fixed number of columns of histogram
buckets. The 3 metadata columns are:
● time_ms - timestamp in milliseconds from the start of the test when this histogram’s time interval began
● direction - 0 if the histogram is for reads, 1 if it is for writes
● bs - “block size”, really the I/O transfer size for the test
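The record layout above can be sketched in Python as follows. This is only an illustration, assuming comma-separated integer fields as in the log excerpt shown in this section. HistRecord and parse_record are hypothetical names, and the bucket counts in the usage line are made up for the example.

```python
from typing import List, NamedTuple


class HistRecord(NamedTuple):
    time_ms: int        # start of this histogram's time interval
    direction: int      # 0 = read, 1 = write
    bs: int             # I/O transfer size (ignored by the merge)
    buckets: List[int]  # sample count per histogram bucket


def parse_record(line: str) -> HistRecord:
    """Split one log line into 3 metadata columns plus bucket counts."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 4:
        raise ValueError("expected 3 metadata columns and at least 1 bucket")
    time_ms, direction, bs = (int(field) for field in fields[:3])
    return HistRecord(time_ms, direction, bs, [int(f) for f in fields[3:]])


# Hypothetical record with only three buckets, for illustration.
record = parse_record("10203, 1, 4096, 0, 2, 5")
```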
direction: the read/write flag is currently useful only for finding the end time of a histogram’s time interval,
which is the start time of the next histogram record for the same direction. The tool currently merges read and
write histograms together. However, we need to keep this field because the tool might someday support
separate read and write latency percentiles (and yes, these can be really different).
time_ms: separate histogram records are emitted for reads and for writes on the same time interval, and they
can be emitted in either order, (read, write) or (write, read). Consequently, to find the subsequent histogram
record for the same I/O direction (read or write) in order to get the end_time for the current record, we may
have to read as many as 3 records ahead. For example, here’s an excerpt from a real histogram log:
10203, 1, 4096, …
10203, 0, 4096, …
10601, 0, 4096, …
10601, 1, 4096, …
So the time interval for both the write (1) and read (0) records is [10203, 10601]. The intervals are identical in
this case, but they don’t have to be.
When we encounter the last histogram record (for an I/O direction) in the log, we can no longer get the end
time from the next histogram record, since there isn’t one, so we instead estimate the end time as the test end
time in milliseconds.
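The end-time rule can be sketched as below. Note that this sketch uses a single backward pass over records already held in memory instead of reading ahead up to 3 records. That is a substitution made for brevity, and it should give the same end times. The name interval_end_times and the test end time of 11000 ms are assumptions for the example.

```python
from typing import Dict, List, Sequence, Tuple


def interval_end_times(
    records: Sequence[Tuple[int, int]], test_end_ms: int
) -> List[int]:
    """Given (time_ms, direction) pairs in log order, return the end time
    of each record's interval: the start time of the next record with the
    same direction, or test_end_ms if there is no such record."""
    ends = [test_end_ms] * len(records)
    next_start: Dict[int, int] = {}  # direction -> start of nearest later record
    for i in range(len(records) - 1, -1, -1):
        time_ms, direction = records[i]
        if direction in next_start:
            ends[i] = next_start[direction]
        next_start[direction] = time_ms
    return ends


# The four records from the log excerpt, with a hypothetical test end time.
excerpt = [(10203, 1), (10203, 0), (10601, 0), (10601, 1)]
ends = interval_end_times(excerpt, test_end_ms=11000)
```

For the excerpt, the first write and read records both end at 10601, and the last record for each direction falls back to the test end time.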
bs: this field is not used at all. There are 2 cases: either we are using a fixed I/O transfer size or a
variable one. If the I/O transfer size is fixed, then it is in the command or job file used to run fio and is not
needed in the result. If the I/O transfer size is variable, which fio supports, then the bs field is meaningless,
since a variety of I/O transfer sizes would have been used in a single histogram interval. So either way we can
just ignore it.