The paper for this blog post is "Linux block IO: Introducing multi-queue SSD access on multi-core systems", published in SYSTOR 2013. It is an interesting paper that discusses the scalability problems of the block IO layer in the Linux kernel.
The problem: scalability of the Linux block layer
The block IO layer is the layer that sits between the filesystem and the disk driver. It provides generic services such as IO scheduling and statistics.
According to the paper, "Today, with Linux, a single CPU core can sustain an IO submission rate of around 800 thousand IOPS. Regardless of how many cores are used to submit IOs, the operating system block layer can not scale up to over one million IOPS."
The bottlenecks for scalability:
The bottleneck mainly arises from the request queue data structure in the block layer. Before an IO is submitted to the disk driver, it is held in the block layer's request queue, which acts like a "staging area" where requests are re-ordered, combined, and coalesced. There is a single request queue associated with every device, and a single lock protects access to this queue. That single lock is a source of contention that prevents scaling to higher IOPS.
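To make the contention concrete, here is a minimal userspace sketch of the single-queue design as I understand it from the paper; it is not kernel code, and all names and sizes are made up for illustration. Every submitting core has to take the same lock before it can stage a request:

#include <pthread.h>

/* Hypothetical model of the single-queue design:
 * one queue per device, one lock guarding it. */
struct request { long sector; int rw; };

#define QUEUE_DEPTH 128

struct request_queue {
    pthread_mutex_t lock;               /* the single, contended lock */
    struct request  ring[QUEUE_DEPTH];
    int             head, tail;
};

/* Every core calls this path; they all serialize on q->lock,
 * and the cache line holding the lock bounces between cores. */
int submit_io(struct request_queue *q, struct request r)
{
    pthread_mutex_lock(&q->lock);
    int next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head) {
        pthread_mutex_unlock(&q->lock);
        return -1;                      /* queue full */
    }
    q->ring[q->tail] = r;               /* stage, merge, re-order here */
    q->tail = next;
    pthread_mutex_unlock(&q->lock);
    return 0;
}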
Besides the request queue lock contention, there are two other problems that hinder performance on multiprocessor systems:
1. Frequent cache line invalidations
When more than one core accesses the request queue, they all compete for the request queue lock. Each core caches the lock, and that cached copy is frequently invalidated by another core, so "the cache line associated with the request queue lock is continuously bounced between those cores".
Also, usually one core is responsible for handling the hardware interrupts from the device, and it in turn software-interrupts the other cores. With high IOPS and a high interrupt rate, this results in very frequent context switches and pollution of each core's L1 and L2 caches.
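The fix the paper builds towards is per-core state. As a standalone illustration (hypothetical userspace C, not from the paper), compare a single shared counter with per-core counters padded to their own cache lines; only the shared one bounces between cores:

#include <stdatomic.h>

#define CACHE_LINE 64
#define NCORES     8

/* Shared counter: every increment pulls the same cache line
 * to the incrementing core, invalidating it everywhere else. */
atomic_long shared_count;

/* Per-core counters, each padded to its own cache line, so a
 * core only ever touches a line that it owns (the same idea the
 * multi-queue design applies to the request queue itself). */
struct percore_count {
    atomic_long value;
    char pad[CACHE_LINE - sizeof(atomic_long)];
} counters[NCORES];

void count_io(int core)
{
    atomic_fetch_add(&counters[core].value, 1);  /* no bouncing   */
    /* atomic_fetch_add(&shared_count, 1);          bounces badly */
}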
2. Remote memory accesses
In NUMA multiprocessor systems, every processor has some local memory that it can access faster than remote memory. Hardware vendors are moving towards NUMA because it is difficult to scale uniform-access shared-memory multiprocessors.
When an IO completion is handled on a different core than the one that issued the request, the data structures associated with that IO are accessed remotely. A processor that has to access the local memory of another processor suffers a significant performance drop.
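The usual mitigation, which the multi-queue design enables, is to keep a request's bookkeeping on the NUMA node of the submitting core and to complete the IO there as well. A hedged userspace sketch of node-local allocation using libnuma (the io_request structure and helper are hypothetical):

#define _GNU_SOURCE
#include <numa.h>        /* link with -lnuma */
#include <sched.h>
#include <stddef.h>

struct io_request { long sector; char *buf; int node; };

/* Allocate the per-IO bookkeeping on the NUMA node of the CPU
 * that is submitting the IO, so completion handling on the same
 * node never has to reach across the interconnect. */
struct io_request *alloc_request(size_t bufsize)
{
    if (numa_available() < 0)
        return NULL;                        /* no NUMA support */

    int node = numa_node_of_cpu(sched_getcpu());
    struct io_request *r = numa_alloc_onnode(sizeof(*r), node);
    if (!r)
        return NULL;
    r->buf  = numa_alloc_onnode(bufsize, node);
    r->node = node;
    return r;
}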
Central idea of the paper:
To remove the request queue lock contention, the solution is to use multiple queues per device and spread the lock contention across all of them. But that alone is not sufficient.
The request queue of the block layer serves two purposes:
- Quality of service guarantees: no process should starve because of another IO-intensive process.
- Flow control: a device can only handle a certain number of simultaneous in-flight requests, so the block layer throttles IO submission in order not to overflow the device.
These two activities are separated out and handled in two levels of queues:
First level queues (software staging queues):
- There is one per core, or per NUMA socket, on the machine.
- Their size is not fixed and can keep growing.

Second level queues (hardware dispatch queues):
- Their number depends on the number of hardware contexts (or DMA queues) supported by the device.
- Their size is fixed according to the device's capability.
By binding each first level queue to a single core, the lock on that queue no longer suffers cache line invalidations. The implementation also ensures that IO submission and completion are handled on the same core, so the hardware interrupt is handled by the same CPU that issued the IO, which removes the remote memory accesses. A rough sketch of the two-level structure follows below.
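This is only a conceptual model of the design described in the paper, written as userspace C; the real implementation lives in the kernel, and the names here are made up:

#include <pthread.h>
#include <stdlib.h>

struct request { long sector; int rw; };

/* First level: one per core (or NUMA socket). Unbounded, only
 * ever touched by its own core, so its lock never bounces. */
struct sw_queue {
    pthread_mutex_t lock;        /* effectively uncontended       */
    struct request *reqs;        /* growable staging area         */
    int             len, cap;
};

/* Second level: one per hardware context / DMA queue of the
 * device. Fixed size, matching what the device can keep in flight. */
struct hw_queue {
    struct request *slots;       /* fixed-size dispatch ring      */
    int             depth;       /* device-imposed limit          */
    int             head, tail;
};

struct mq_device {
    struct sw_queue *sw;         /* one per core                  */
    int              nr_cores;
    struct hw_queue *hw;         /* one per hardware context      */
    int              nr_hw;
};

/* Submission stays on the submitting core: stage the request in
 * that core's software queue (error handling omitted in this sketch). */
void mq_submit_io(struct mq_device *d, int core, struct request r)
{
    struct sw_queue *q = &d->sw[core];
    pthread_mutex_lock(&q->lock);
    if (q->len == q->cap) {                       /* staging area grows */
        q->cap  = q->cap ? q->cap * 2 : 64;
        q->reqs = realloc(q->reqs, q->cap * sizeof(*q->reqs));
    }
    q->reqs[q->len++] = r;
    pthread_mutex_unlock(&q->lock);
    /* ... later, a per-core dispatch step drains q into
     * d->hw[core % d->nr_hw], up to hw->depth in-flight requests. */
}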
CONCLUDING REMARKS:
The results show that the two-level design can scale beyond 10 million IOPS. The latency of a single request is also lower than with the single-queue design.
There are some interesting tidbits in the paper about future SSD devices. It seems that upcoming SSDs will offer a flexible number of submission and completion queues, and the block device driver will gain additional functionality, exposing APIs that indicate the available queues. It is also clear from the paper that future devices will support a larger number of in-flight requests, because "in a 1 million IOPS device, the 32 inflight request limit will cause 31 thousand context switches/sec to process IO in batches of 32." (1,000,000 IOPS divided into batches of 32 is about 31,250 batches, and hence context switches, per second.)
One contentious point in the paper is the assumption that SSDs have random read and write latencies as fast as sequential access, which is not completely true. The paper also leaves a big gap around what kind of scheduling policy should be used for SSDs and, importantly, where that policy should be implemented: in the block layer or inside the device.
Overall, an important paper. The multiqueue block layer is being merged in Linux 3.13 [2].
References: