The paper for this blog post is "Linux block IO: Introducing multi-queue SSD access on multi-core systems", published in SYSTOR 2013. It is an interesting paper that discusses the scalability problems of the block IO layer in the Linux kernel.
The problem: scalability of the Linux block layer
The block IO layer is the layer that sits between the filesystem and the disk driver. It provides generic services such as IO scheduling and statistics.
According to the paper, "Today, with Linux, a single CPU core can sustain an IO submission rate of around 800 thousand IOPS. Regardless of how many cores are used to submit IOs, the operating system block layer can not scale up to over one million IOPS."
The bottlenecks for scalability:
The bottleneck mainly arises from the request queue data structure in the block layer. Before an IO is submitted to the disk driver, it is held in the block layer's request queue, which acts like a "staging area" where requests are re-ordered, combined, and coalesced. There is a single request queue associated with every device, and a single lock protects access to this queue. That single lock is a source of contention that prevents scaling to higher IOPS.
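To make the contention concrete, here is a minimal userspace sketch of the single-queue design as I understand it from the paper; it is not kernel code, and all names and sizes are made up for illustration. Every submitting core has to take the same lock before it can stage a request:

#include <pthread.h>

/* Hypothetical model of the single-queue design:
 * one queue per device, one lock guarding it. */
struct request { long sector; int rw; };

#define QUEUE_DEPTH 128

struct request_queue {
    pthread_mutex_t lock;               /* the single, contended lock */
    struct request  ring[QUEUE_DEPTH];
    int             head, tail;
};

/* Every core calls this path; they all serialize on q->lock,
 * and the cache line holding the lock bounces between cores. */
int submit_io(struct request_queue *q, struct request r)
{
    pthread_mutex_lock(&q->lock);
    int next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head) {
        pthread_mutex_unlock(&q->lock);
        return -1;                      /* queue full */
    }
    q->ring[q->tail] = r;               /* stage, merge, re-order here */
    q->tail = next;
    pthread_mutex_unlock(&q->lock);
    return 0;
}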
Besides the request queue lock contention, there are two other problems that hinder performance on multiprocessor systems:
1. Frequent cache line invalidations
When more than one core accesses the request queue, they all compete for the request queue lock. Each core caches the lock, and that cached copy is frequently invalidated by another core, so "the cache line associated with the request queue lock is continuously bounced between those cores".
Also, usually one core is responsible for handling the hardware interrupts from the device, and it in turn software-interrupts the other cores. With high IOPS and a high interrupt rate, this results in very frequent context switches and pollution of each core's L1 and L2 caches.
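The fix the paper builds towards is per-core state. As a standalone illustration (hypothetical userspace C, not from the paper), compare a single shared counter with per-core counters padded to their own cache lines; only the shared one bounces between cores:

#include <stdatomic.h>

#define CACHE_LINE 64
#define NCORES     8

/* Shared counter: every increment pulls the same cache line
 * to the incrementing core, invalidating it everywhere else. */
atomic_long shared_count;

/* Per-core counters, each padded to its own cache line, so a
 * core only ever touches a line that it owns (the same idea the
 * multi-queue design applies to the request queue itself). */
struct percore_count {
    atomic_long value;
    char pad[CACHE_LINE - sizeof(atomic_long)];
} counters[NCORES];

void count_io(int core)
{
    atomic_fetch_add(&counters[core].value, 1);  /* no bouncing   */
    /* atomic_fetch_add(&shared_count, 1);          bounces badly */
}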
2. Remote memory accesses
In NUMA multiprocessor systems, every processor has some local memory that it can access faster than remote memory. Hardware vendors are moving towards NUMA because it is difficult to scale uniform-access shared-memory multiprocessors.
When an IO completion is handled on a different core than the one that issued the request, the data structures associated with that IO are accessed remotely. A processor that has to access the local memory of another processor suffers a significant performance drop.
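The usual mitigation, which the multi-queue design enables, is to keep a request's bookkeeping on the NUMA node of the submitting core and to complete the IO there as well. A hedged userspace sketch of node-local allocation using libnuma (the io_request structure and helper are hypothetical):

#define _GNU_SOURCE
#include <numa.h>        /* link with -lnuma */
#include <sched.h>
#include <stddef.h>

struct io_request { long sector; char *buf; int node; };

/* Allocate the per-IO bookkeeping on the NUMA node of the CPU
 * that is submitting the IO, so completion handling on the same
 * node never has to reach across the interconnect. */
struct io_request *alloc_request(size_t bufsize)
{
    if (numa_available() < 0)
        return NULL;                        /* no NUMA support */

    int node = numa_node_of_cpu(sched_getcpu());
    struct io_request *r = numa_alloc_onnode(sizeof(*r), node);
    if (!r)
        return NULL;
    r->buf  = numa_alloc_onnode(bufsize, node);
    r->node = node;
    return r;
}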
Central idea of the paper:
To remove the request queue lock contention, the solution is to use multiple queues per device and spread the lock contention across all of them. But that alone is not sufficient.
The request queue of the block layer serves two purposes:
- Quality of service guarantees: no process should starve because of another IO-intensive process.
- Flow control: a device can only handle a certain number of simultaneous in-flight requests, so the block layer throttles IO submission in order not to overflow the device.
These two activities are separated out and handled in two levels of queues:
First level queues (software staging queues):
- There is one per core, or per NUMA socket, on the machine.
- Their size is not fixed and can keep growing.

Second level queues (hardware dispatch queues):
- Their number depends on the number of hardware contexts (or DMA queues) supported by the device.
- Their size is fixed according to the device's capability.
By binding each first level queue to a single core, the lock on that queue no longer suffers cache line invalidations. The implementation also ensures that IO submission and completion are handled on the same core, so the hardware interrupt is handled by the same CPU that issued the IO, which removes the remote memory accesses. A rough sketch of the two-level structure follows below.
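This is only a conceptual model of the design described in the paper, written as userspace C; the real implementation lives in the kernel, and the names here are made up:

#include <pthread.h>
#include <stdlib.h>

struct request { long sector; int rw; };

/* First level: one per core (or NUMA socket). Unbounded, only
 * ever touched by its own core, so its lock never bounces. */
struct sw_queue {
    pthread_mutex_t lock;        /* effectively uncontended       */
    struct request *reqs;        /* growable staging area         */
    int             len, cap;
};

/* Second level: one per hardware context / DMA queue of the
 * device. Fixed size, matching what the device can keep in flight. */
struct hw_queue {
    struct request *slots;       /* fixed-size dispatch ring      */
    int             depth;       /* device-imposed limit          */
    int             head, tail;
};

struct mq_device {
    struct sw_queue *sw;         /* one per core                  */
    int              nr_cores;
    struct hw_queue *hw;         /* one per hardware context      */
    int              nr_hw;
};

/* Submission stays on the submitting core: stage the request in
 * that core's software queue (error handling omitted in this sketch). */
void mq_submit_io(struct mq_device *d, int core, struct request r)
{
    struct sw_queue *q = &d->sw[core];
    pthread_mutex_lock(&q->lock);
    if (q->len == q->cap) {                       /* staging area grows */
        q->cap  = q->cap ? q->cap * 2 : 64;
        q->reqs = realloc(q->reqs, q->cap * sizeof(*q->reqs));
    }
    q->reqs[q->len++] = r;
    pthread_mutex_unlock(&q->lock);
    /* ... later, a per-core dispatch step drains q into
     * d->hw[core % d->nr_hw], up to hw->depth in-flight requests. */
}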
CONCLUDING REMARKS:
The results show that the two-level design can scale beyond 10 million IOPS. The latency of a single request is also lower than with the single-queue design.
There are some interesting tidbits in the paper about future SSD devices. It seems that upcoming SSDs will offer a flexible number of submission and completion queues, and the block device driver will gain additional functionality, exposing APIs that indicate the available queues. It is also clear from the paper that future devices will support a larger number of in-flight requests, because "in a 1 million IOPS device, the 32 inflight request limit will cause 31 thousand context switches/sec to process IO in batches of 32." (1,000,000 IOPS divided into batches of 32 is about 31,250 batches, and hence context switches, per second.)
One contentious point in the paper is the assumption that SSDs have random read and write latencies as fast as sequential access, which is not completely true. The paper also leaves a big gap around what kind of scheduling policy should be used for SSDs and, importantly, where that policy should be implemented: in the block layer or inside the device.
Overall, an important paper. The multiqueue block layer is being merged in Linux 3.13 [2].
References: