Tuesday, 25 February 2014

Scaling the block I/O layer, SYSTOR 2013 paper



The paper for this blog post is one published at SYSTOR 2013: "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems".

This is an interesting paper that discusses the scalability issues of the block IO layer in the Linux kernel.

The problem: scalability limits of the Linux block layer:

The block IO layer sits between the filesystem and the disk driver. It provides generic services such as IO scheduling and statistics.


According to the paper: "Today, with Linux, a single CPU core can sustain an IO submission rate of around 800 thousand IOPS. Regardless of how many cores are used to submit IOs, the operating system block layer can not scale up to over one million IOPS."

The bottlenecks for scalability:

The bottleneck mainly arises from the request queue data structure in the block layer. Before an IO is submitted to the disk driver, it is held in a request queue in the block layer. The request queue acts like a "staging area" where requests are re-ordered, combined, and coalesced. There is a single request queue associated with every device, and a single lock protects access to this queue. This single lock is a source of contention that prevents scaling to higher IOPS.
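To make the bottleneck concrete, here is a minimal user-space sketch (my illustration, not the kernel's actual code; the queue internals, merging and sorting are elided behind a counter). Every submitting thread must take the same per-device lock, so the cores serialize on it:

    #include <pthread.h>
    #include <stdio.h>

    #define NR_SUBMITTERS  4        /* pretend each thread is a CPU core */
    #define IOS_PER_THREAD 100000

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static long queued;             /* stands in for the single request queue */

    static void *submitter(void *arg)
    {
        (void)arg;
        for (int i = 0; i < IOS_PER_THREAD; i++) {
            pthread_mutex_lock(&queue_lock);   /* the contended per-device lock */
            queued++;                          /* enqueue/merge/sort would happen here */
            pthread_mutex_unlock(&queue_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NR_SUBMITTERS];
        for (int i = 0; i < NR_SUBMITTERS; i++)
            pthread_create(&t[i], NULL, submitter, NULL);
        for (int i = 0; i < NR_SUBMITTERS; i++)
            pthread_join(t[i], NULL);
        printf("%ld IOs submitted through one lock\n", queued);
        return 0;
    }

No matter how many submitter threads you add, the critical section admits one at a time, which is exactly why a single request queue lock caps the whole machine's IOPS.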

Besides the request queue lock contention, two other problems hinder performance on multiprocessor systems:

1. Frequent cache line invalidations

When more than one core is accessing the request queue, they all compete for the request queue lock. Each core would have cached the request queue lock, and that cached copy is frequently invalidated by some other core, so "the cache line associated with the request queue lock is continuously bounced between those cores".

Also, usually one core is responsible for handling the hardware interrupts, and it in turn raises software interrupts on the other cores. With high IOPS and hence a high interrupt rate, this results in very frequent context switches and pollution of each core's L1 and L2 caches.
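A rough user-space analogue of the cache line bouncing (again my sketch, not from the paper): when every thread updates one shared location, the corresponding cache line ping-pongs between cores, just like a shared lock word does; giving each thread its own padded slot keeps each line core-local.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_THREADS 4
    #define ITERS      10000000L

    /* Pad each slot to a full cache line so slots never share a line. */
    struct padded { long n; char pad[64 - sizeof(long)]; };

    static atomic_long shared_counter;       /* this line bounces between cores */
    static struct padded local[NR_THREADS];  /* each of these stays on one core */

    static void *bounce(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_counter, 1); /* invalidates peers' copies */
        return NULL;
    }

    static void *core_local(void *arg)
    {
        struct padded *p = arg;
        for (long i = 0; i < ITERS; i++)
            p->n++;                               /* no coherence traffic */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NR_THREADS];
        for (int i = 0; i < NR_THREADS; i++)      /* slow, bouncing pattern */
            pthread_create(&t[i], NULL, bounce, NULL);
        for (int i = 0; i < NR_THREADS; i++)
            pthread_join(t[i], NULL);
        for (int i = 0; i < NR_THREADS; i++)      /* fast, core-local pattern */
            pthread_create(&t[i], NULL, core_local, &local[i]);
        for (int i = 0; i < NR_THREADS; i++)
            pthread_join(t[i], NULL);
        printf("shared=%ld, local[0]=%ld\n",
               atomic_load(&shared_counter), local[0].n);
        return 0;
    }

Timing the two loops on a multi-core box shows the gap; the per-core queues introduced below apply the same "give each core its own line" idea to the request queue itself.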

2. Remote Memory Access:

In NUMA multiprocessor systems, every processor is associated with some local memory, which it can access faster. Hardware vendors are moving towards NUMA systems because it is difficult to scale uniform-access shared-memory multiprocessors.

When an IO completion is handled on a different core than the one that issued the IO request, it results in remote memory accesses for the data structures associated with that request. When a processor has to access the local memory of another processor, performance drops significantly.
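A hedged sketch of the remedy using libnuma (link with -lnuma; the libnuma calls are real, but the io_request struct and the scenario are my invention): allocate the per-request bookkeeping on the NUMA node of the submitting CPU, so that if the completion is steered back to that CPU, every access stays local.

    #define _GNU_SOURCE
    #include <numa.h>          /* libnuma */
    #include <sched.h>
    #include <stdio.h>

    /* Hypothetical per-IO bookkeeping kept for each request. */
    struct io_request { long sector; int len; };

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        int cpu  = sched_getcpu();         /* core issuing the IO     */
        int node = numa_node_of_cpu(cpu);  /* memory node local to it */

        /* Allocate the request on the submitter's own node. If the
           completion is later handled on the same core, no remote
           memory access is needed. */
        struct io_request *rq = numa_alloc_onnode(sizeof(*rq), node);
        if (!rq)
            return 1;
        rq->sector = 0;
        rq->len    = 4096;

        printf("request for cpu %d placed on node %d\n", cpu, node);
        numa_free(rq, sizeof(*rq));
        return 0;
    }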

Central Idea of the paper:

To remove the request queue lock contention, the solution is to use multiple queues per device and spread the lock contention across all the queues. But that alone is not sufficient.
A request queue in the block layer serves two purposes:

  1. Quality of service guarantees (no process should starve because of another IO-intensive process).
  2. A device can only handle a certain number of simultaneous in-flight requests, so the block layer throttles the IO submission rate in order not to overflow the device.

These two activities are separated out and handled in two levels of queues:


First level queues (software staging queues): there is one per core (or per NUMA socket) on the machine, and their size is not fixed; a staging queue can keep growing.

Second level queues (hardware dispatch queues): their number depends on the number of hardware contexts (or DMA queues) supported by the device, and their size is fixed according to the device's capability.
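A minimal sketch of that two-level structure (my simplification; the names sw_queue, hw_queue and HW_QUEUE_DEPTH are made up, and per-queue locking is omitted): an unbounded per-core staging list feeds a fixed-size dispatch ring whose depth enforces the device's in-flight limit.

    #include <stdio.h>
    #include <stdlib.h>

    #define HW_QUEUE_DEPTH 32   /* fixed: in-flight requests the device allows */

    struct request {
        long sector;
        struct request *next;   /* staging queues are simple linked lists */
    };

    /* First level: one unbounded staging queue per core (or NUMA socket). */
    struct sw_queue { struct request *head, *tail; };

    /* Second level: fixed-size queue mapped onto one hardware context. */
    struct hw_queue {
        struct request *ring[HW_QUEUE_DEPTH];
        int count;
    };

    /* Submission appends to the submitting core's own staging queue;
       fairness/QoS decisions belong at this level. */
    void sw_submit(struct sw_queue *q, struct request *rq)
    {
        rq->next = NULL;
        if (q->tail) q->tail->next = rq; else q->head = rq;
        q->tail = rq;
    }

    /* Draining moves requests into the dispatch queue but stops at the
       device's capacity: this is where submission-rate throttling lives. */
    void hw_drain(struct sw_queue *sq, struct hw_queue *hq)
    {
        while (sq->head && hq->count < HW_QUEUE_DEPTH) {
            struct request *rq = sq->head;
            sq->head = rq->next;
            if (!sq->head)
                sq->tail = NULL;
            hq->ring[hq->count++] = rq;   /* hand off to the device driver */
        }
    }

    int main(void)
    {
        struct sw_queue sq = {0};
        struct hw_queue hq = {0};

        /* Stage more requests than the device can take at once. */
        for (int i = 0; i < 40; i++) {
            struct request *rq = malloc(sizeof(*rq));
            rq->sector = i * 8;
            sw_submit(&sq, rq);
        }
        hw_drain(&sq, &hq);
        printf("dispatched %d, still staged: %s\n",
               hq.count, sq.head ? "yes" : "no");
        return 0;
    }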




By binding one first level queue to one core, the corresponding queue lock no longer suffers cache line invalidations. The implementation also arranges for IO submission and completion to be handled on the same core, so the hardware interrupt is serviced by the CPU that issued the IO, which removes the remote memory accesses.
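For completeness, a tiny sketch of how a submitter might pick its core-local queue (sched_getcpu() is a real glibc call; the queue array and NR_QUEUES are assumptions carried over from the sketch above):

    #define _GNU_SOURCE
    #include <sched.h>

    #define NR_QUEUES 64                 /* assumed: one staging queue per core */

    struct request  { long sector; struct request *next; };
    struct sw_queue { struct request *head, *tail; };

    static struct sw_queue queues[NR_QUEUES];

    /* Submissions index the array by the caller's CPU, so each core only
       ever touches (and locks) its own queue. */
    struct sw_queue *queue_for_this_cpu(void)
    {
        return &queues[sched_getcpu() % NR_QUEUES];
    }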



CONCLUDING REMARKS:


The results show that the two-level design can scale beyond 10 million IOPS. The latency of a single request is also lower than in the single queue design.

There are some interesting tidbits in the paper about future SSD devices. It seems that upcoming SSDs will offer a flexible number of submission and completion queues, and the block device driver will expose APIs indicating the available queues. From the paper it was also clear that future devices will support a larger number of in-flight requests, because "in a 1 million IOPS device, the 32 inflight request limit will cause 31 thousand context switches/sec to process IO in batches of 32" (1,000,000 / 32 ≈ 31,250 batches, and hence context switches, per second).

One point of contention with the paper is its assumption that SSDs have random read and write latencies as fast as sequential access, which is not completely true. The paper also leaves a big gap about what kind of scheduling policy should be used for SSDs and, more importantly, where that policy should be implemented: in the block layer or inside the device.

Overall, an important paper. The multiqueue block layer has been merged into Linux 3.13 [2].


References: