SC09 – Interesting Tech – Filesystems/Storage

All the usual suspects were visible in Portland this year, including Panasas, Data Direct, Isilon, Lustre and IBM/GPFS. But we’ve all seen those before. Two storage-related technologies caught my eye at SC09 because I’d never seen them before.

I caught a technical session from a Korean company called Pspace. They developed a parallel filesystem called Infinistor for a couple of big telco/ISPs in Korea. It’s a pretty straightforward parallel filesystem with metadata and object data handled by separate servers. Object servers are always run at least N+1, so you can lose a whole object server without losing access to your data.

The neat things about Infinistor are that it keeps track of how often data is accessed and it understands that some storage is faster than other storage. So you could have some smaller servers based on SSDs, and Infinistor will replicate frequently accessed content to the fast disks. It can even handle multiple speeds of storage within one object server.
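
To make the idea concrete, here’s a rough sketch of what that sort of access-frequency tracking boils down to. This is my own illustration in Python, not anything from Pspace — just "count reads per object, promote the hottest ones to the fast tier":

    # Rough illustration of access-frequency-based tiering (my sketch, not Pspace code).
    from collections import Counter

    class TieringTracker:
        """Counts object reads and suggests which objects belong on the SSD tier."""

        def __init__(self, ssd_capacity_objects):
            self.access_counts = Counter()
            self.ssd_capacity = ssd_capacity_objects

        def record_read(self, object_id):
            self.access_counts[object_id] += 1

        def hot_set(self):
            # The N most frequently read objects are candidates for the fast tier.
            return [obj for obj, _ in self.access_counts.most_common(self.ssd_capacity)]

    # Feed it the read log, then replicate hot_set() onto the SSD-backed servers.
    tracker = TieringTracker(ssd_capacity_objects=2)
    for obj in ["video1", "video2", "video1", "photo9", "video1", "video2"]:
        tracker.record_read(obj)
    print(tracker.hot_set())   # ['video1', 'video2']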

As you might expect from a project born in ISP-land, it has a lot of support for replication across multiple sites, since it’s always good to serve clients their data from a node close to them on the network. Infinistor can replicate synchronously or asynchronously, with asynchronous replication prioritised for frequently accessed content.
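
Very roughly, the two replication modes differ like this. Again this is my own toy sketch with made-up names, not the Infinistor internals — the point is just where the acknowledgement happens and how popular content jumps the queue:

    # Toy contrast of synchronous vs. asynchronous cross-site replication (illustration only).
    import queue

    class Site:
        """Stand-in for a storage site; just keeps objects in a dict."""
        def __init__(self, name):
            self.name = name
            self.objects = {}
        def store(self, obj_id, data):
            self.objects[obj_id] = data

    replication_queue = queue.PriorityQueue()   # lower number = replicate sooner

    def write_sync(obj_id, data, local, remotes):
        """Synchronous: don't acknowledge until every remote site has the object."""
        local.store(obj_id, data)
        for site in remotes:
            site.store(obj_id, data)             # in real life this blocks on a remote ack
        return "ack"

    def write_async(obj_id, data, local, access_count):
        """Asynchronous: ack straight away, queue replication, hottest content first."""
        local.store(obj_id, data)
        replication_queue.put((-access_count, obj_id))   # popular objects jump the queue
        return "ack"

    seoul, busan = Site("seoul"), Site("busan")
    write_sync("obj-1", b"data", seoul, [busan])
    write_async("obj-2", b"data", seoul, access_count=42)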

File access is via a POSIX filesystem (it will do NFS or CIFS) or a REST API.
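
From a client’s point of view the two access styles look roughly like this. The mount point and REST endpoint below are hypothetical — I don’t know Pspace’s actual URL scheme — this is just the general shape of "copy a file into a mount" versus "PUT an object over HTTP":

    # Two ways to push the same file: assuming a POSIX mount at /mnt/infinistor and
    # a hypothetical REST endpoint -- neither the path nor the URL is from Pspace docs.
    import shutil
    import urllib.request

    # POSIX-style: the filesystem is mounted (e.g. over NFS/CIFS), so it's just a copy.
    shutil.copy("report.pdf", "/mnt/infinistor/reports/report.pdf")

    # REST-style: an HTTP PUT of the object body (endpoint name is made up).
    with open("report.pdf", "rb") as f:
        req = urllib.request.Request(
            "http://infinistor.example.com/objects/reports/report.pdf",
            data=f.read(),
            method="PUT",
        )
        urllib.request.urlopen(req)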

As ever with big conferences, not everything you learn comes from the sessions or the exhibition hall. I got chatting to an engineer from the Pittsburgh Supercomputing Center about the parallel filesystem they wrote called ZEST. The best thing about this filesystem is that you can’t read from it.

So I should back up for a second here and describe the problem ZEST is trying to solve, since most of you are probably thinking “what use is a filesystem you can’t read from?”. Here in HPC land we have all these big machines with thousands of very fast cores and big, fast interconnects. All this costs money. Unfortunately, the more nodes you are running across, the more likely you are to hit a problem (e.g. dividing the current day in the Mayan Long Count by the least significant digit of your HCA’s firmware revision causes it to turn into a pumpkin, or one of the million other failure modes that are wearyingly familiar to HPC ops people around the world). When this happens you don’t want to lose all the time you’ve spent up until the fault happened. And lo, unto the world did come Checkpointing.

Which is basically to say that a lot of big codes will periodically dump their running state to disk so that, in the event of a problem, they can pick up from the last checkpoint. Now obviously this can be terabytes of data, and it takes a while to write it to disk. While you are doing that, all those shiny, shiny CPUs are sitting idle. This makes the Intel salesman happy, but makes your funding agencies cry.
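
To put some rough numbers on that idle time (the figures here are mine and purely illustrative, not from any SC09 talk):

    # Back-of-the-envelope checkpoint cost, with made-up but plausible numbers.
    checkpoint_size_tb    = 10     # total state dumped by the job, in TB
    write_bandwidth_gbs   = 20     # aggregate filesystem write bandwidth, in GB/s
    checkpoint_interval_h = 1      # how often the job checkpoints, in hours

    write_time_s  = checkpoint_size_tb * 1000 / write_bandwidth_gbs    # 500 s per checkpoint
    lost_fraction = write_time_s / (checkpoint_interval_h * 3600)      # ~14% of wall clock

    print(f"Each checkpoint stalls the job for {write_time_s:.0f} s "
          f"({lost_fraction:.0%} of every hour spent not computing).")

The only two knobs you really have are checkpointing less often (and risking losing more work) or writing faster, which is the knob ZEST goes after.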

So the approach in ZEST is to strip out all the complexity involved in making a filesystem you can read from, in order to let clients write as fast as possible. There are a number of interesting design decisions here. ZEST storage servers don’t use RAID; instead they assign a write queue to each individual disk. All the checksumming and parity calculations are done on the client (because these are over-endowed HPC nodes we are talking about here). By stripping away all this complexity, ZEST aims to give each write client the full bandwidth of the disk it’s writing to. Because most codes will be checkpointing from multiple nodes at once, this adds up to significant aggregate bandwidth.
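
Roughly, the division of labour looks like this. This is a conceptual sketch of the idea as I understood it from the conversation, not code from ZEST itself:

    # Sketch of the ZEST idea as I understood it: clients do checksums and parity,
    # servers just drain a simple write queue per physical disk. Not ZEST source code.
    import hashlib
    from collections import deque

    NUM_DISKS = 4
    disk_queues = [deque() for _ in range(NUM_DISKS)]   # one queue per spindle, no RAID layer

    def xor_parity(chunks):
        """Client-side parity: XOR of all data chunks, computed on the HPC node."""
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def client_write(data, chunk_size=4):
        """Split, checksum and parity-protect on the client, then hand chunks to disk queues."""
        chunks = [data[i:i + chunk_size].ljust(chunk_size, b"\0")
                  for i in range(0, len(data), chunk_size)]
        chunks.append(xor_parity(chunks))                      # parity travels with the data
        for n, chunk in enumerate(chunks):
            record = (hashlib.sha1(chunk).hexdigest(), chunk)  # checksum computed client-side
            disk_queues[n % NUM_DISKS].append(record)          # server just appends sequentially

    client_write(b"checkpoint state from one rank")
    print([len(q) for q in disk_queues])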

As an offline process, the files that have been dumped to disk are re-aggregated and copied onto a Lustre filesystem, from where they can be read. So I kind of lied when I said you couldn’t read from it. More technical detail can be found in the ZEST paper.